
# D/P Council Run — LLM Inference Cost Reduction
**Topic**: Best approach to reduce LLM inference costs by 50% without quality loss?
**Mode**: D/P (Deterministic/Probabilistic)
**Flow**: Parallel, 1 round
**Tier**: Light (all subagents on default model — Sonnet 4.6)
**Date**: 2026-03-05 19:22 UTC
**Subagent calls**: 5 (2 freethinkers → 2 arbiters → 1 meta-arbiter)

---
## Phase 1: Ideation (Parallel)
### D-Freethinker — 4 Ideas
**1. Complexity-Gated Model Router**
Route 60-75% of production traffic to cheap models via a complexity classifier trained on production logs. Use logprob confidence as escalation signal. Published routing studies (RouteLLM, FrugalGPT) show equivalent task accuracy on simple requests. Cost reduction: 50-70%. Rollback is trivial.
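A minimal sketch of the gate, assuming a `classifier` callable, a cheap model that returns its answer with per-token logprobs, and an illustrative confidence floor; none of these names or the `-0.35` threshold come from the run itself:

```python
def route_request(prompt, classifier, cheap_llm, frontier_llm,
                  confidence_floor=-0.35):
    """Send complex prompts straight to the frontier model; otherwise try
    the cheap model and escalate when its mean token logprob is low."""
    if classifier(prompt) == "complex":
        return frontier_llm(prompt)
    answer, token_logprobs = cheap_llm(prompt)
    mean_logprob = sum(token_logprobs) / len(token_logprobs)
    if mean_logprob < confidence_floor:  # low confidence -> escalate
        return frontier_llm(prompt)
    return answer
```

The escalation path is what makes rollback trivial: setting the floor to `0` routes everything to the frontier model again.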
**2. Input Token Budget Enforcement with Semantic Deduplication**
Three independently deployable sub-techniques: (a) Compress bloated prompts via LLMLingua (30-50% compression), (b) RAG context trimming to top-K most relevant chunks, (c) Semantic cache layer (cosine sim >0.95) for repeated queries (20-40% cache hit rate on high-repetition workloads). Combined: 30-55% cost reduction. Quality neutral to positive (less noise = better precision).
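Sub-technique (b) can be sketched as a plain top-K filter over pre-embedded chunks; the cosine helper and the `(text, embedding)` pair shape are assumptions for illustration, not part of the run:

```python
def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb)

def trim_context(query_vec, chunks, k=3):
    """Keep only the top-k chunks most similar to the query embedding.
    chunks is a list of (text, embedding) pairs."""
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, c[1]), reverse=True)
    return [text for text, _ in ranked[:k]]
```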
**3. Constrained Output Enforcement to Eliminate Retry Overhead**
Use json_schema response_format and constrained decoding to eliminate format retries (typical 8-15% → <1%). Cap max_tokens tightly once output schema is known. Strip CoT from tasks where it doesn't measurably improve accuracy. Output token reduction: 20-40%. Combined: 15-35%.
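A sketch of what the request parameters look like, following the OpenAI-style structured-output shape; the ticket schema and the 64-token cap are illustrative assumptions:

```python
# Illustrative JSON Schema for a bounded classification output.
TICKET_SCHEMA = {
    "type": "object",
    "properties": {
        "category": {"type": "string", "enum": ["bug", "billing", "other"]},
        "priority": {"type": "integer", "minimum": 1, "maximum": 5},
    },
    "required": ["category", "priority"],
    "additionalProperties": False,
}

request_params = {
    "response_format": {
        "type": "json_schema",
        "json_schema": {"name": "ticket", "strict": True, "schema": TICKET_SCHEMA},
    },
    # Once the output schema is known, its serialized size is bounded,
    # so max_tokens can be capped tightly instead of left at the default.
    "max_tokens": 64,
}
```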
**4. Async Batch Offload to Self-Hosted Open-Weight Models**
Run background/async workloads (nightly summarization, classification, eval runs) on self-hosted Llama 3.1 70B or Qwen 2.5 72B via vLLM on spot GPUs. 80-90% cheaper per token. If 40-60% of workload is batch-eligible, blended reduction: 35-55%. Requires operational investment. Breakeven at ~$500/month API spend.
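The blended-reduction range follows directly from the two stated inputs; a worked version of the arithmetic (the 50%/85% midpoint is an interpolation, not a figure from the run):

```python
def blended_reduction(batch_share, self_host_discount):
    """Overall cost reduction when batch-eligible traffic moves to
    self-hosted models and the remainder stays on the API at full price."""
    return batch_share * self_host_discount

low = blended_reduction(0.40, 0.80)   # lower bound of both stated ranges
high = blended_reduction(0.60, 0.90)  # upper bound of both stated ranges
mid = blended_reduction(0.50, 0.85)   # interpolated midpoint
```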
### P-Freethinker — 4 Ideas
**1. KV-Cache-Aware Prompt Architecture**
The real waste isn't in generation; it's in context. Teams pay to reprocess the same static content (system prompts, docs, schemas) on every request, often 40-70% of total tokens. Restructure prompts so static content leads (cacheable prefix). Anthropic cached tokens cost 10% of normal. With 60% static content and >60% cache hit rate, input costs drop ~54% immediately. Ships in days, zero quality risk.
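The ~54% figure can be reproduced from the stated numbers; note it corresponds to the full-hit ceiling, with savings scaling down linearly in the actual hit rate (the function and variable names are ours):

```python
def input_cost_multiplier(static_frac, hit_rate, cached_price_ratio=0.10):
    """Fraction of original input cost after prefix caching: cached static
    tokens are billed at cached_price_ratio of normal; dynamic tokens and
    cache misses are billed at full price."""
    cached = static_frac * hit_rate
    return (1 - cached) + cached * cached_price_ratio

# 60% static content, perfect hit rate: multiplier 0.46 -> ~54% savings.
best_case = 1 - input_cost_multiplier(0.60, 1.0)
# At a 60% hit rate the same structure saves ~32%.
realistic = 1 - input_cost_multiplier(0.60, 0.60)
```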
**2. Task-Aware Speculative Routing via Consequence Classification**
Don't route on complexity — route on *consequence*. Build a consequence classifier (from production logs — corrections, escalations, retry rates) that predicts P(this request needs frontier model). Route ~70% low-consequence traffic to cheap models. Track quality per consequence segment, not aggregate. Blended cost drop: 60-75%. The routing signal isn't complexity; it's downstream impact.
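A sketch of how a consequence label might be derived from the log signals the idea names (corrections, escalations, retries); the field names and the retry cutoff are assumptions:

```python
def consequence_label(log_record):
    """Binary training label for the consequence classifier, derived from
    downstream-impact signals in a production log record (a dict)."""
    signals = (
        log_record.get("user_corrected", False),
        log_record.get("escalated_to_human", False),
        log_record.get("retry_count", 0) >= 2,
    )
    return int(any(signals))
```

Labels like these are what let quality be tracked per consequence segment rather than in aggregate.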
**3. Semantic Request Deduplication Cache**
Massive semantic redundancy exists in production that exact-match caching misses ("summarize for a 5-year-old" vs "ELI5 this"). Deploy embedding-based similarity cache (cosine >0.94) using fast local embeddings. For high-volume workloads with query overlap, expect 20-50% request deflection. Ships in a weekend with existing infrastructure. Threshold tuning is the intellectual work.
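A minimal sketch of the cache, assuming the caller supplies an `embed` function (e.g. a fast local embedding model); the linear scan stands in for a real vector index:

```python
class SemanticCache:
    """Embedding-similarity cache: return a stored response when a new
    query's embedding is within the cosine threshold of a cached one."""

    def __init__(self, embed, threshold=0.94):
        self.embed = embed          # callable: text -> vector
        self.threshold = threshold  # the tuning knob the idea highlights
        self.entries = []           # list of (embedding, response)

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = sum(x * x for x in a) ** 0.5
        nb = sum(y * y for y in b) ** 0.5
        return dot / (na * nb) if na and nb else 0.0

    def lookup(self, query):
        vec = self.embed(query)
        for cached_vec, response in self.entries:
            if self._cosine(vec, cached_vec) >= self.threshold:
                return response
        return None

    def store(self, query, response):
        self.entries.append((self.embed(query), response))
```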
**4. Inverse Prompt Engineering — Reduce Generation Entropy**
Audit generation variance per template. Replace open-ended instructions with structured output schemas (4-6x fewer output tokens). Externalize reasoning to cheap models and inject results as givens for frontier. Replace classification LLM calls entirely with few-shot embedding classifiers trained on LLM labels. Output token reduction: 40-70%. The cost is mostly generated by architectural defaults, not irreducible task complexity.
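The last sub-technique (embedding classifiers trained on LLM labels) can be sketched as a nearest-centroid classifier; the class and its data shapes are illustrative, and embeddings are supplied by the caller:

```python
def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

class CentroidClassifier:
    """Few-shot classifier over embeddings: fit one centroid per label
    (labels produced once by an LLM), then classify new requests by
    nearest centroid instead of a per-request LLM call."""

    def fit(self, embeddings, labels):
        by_label = {}
        for vec, lab in zip(embeddings, labels):
            by_label.setdefault(lab, []).append(vec)
        self.centroids = {lab: centroid(vs) for lab, vs in by_label.items()}
        return self

    def predict(self, vec):
        def dist(c):
            return sum((a - b) ** 2 for a, b in zip(vec, c))
        return min(self.centroids, key=lambda lab: dist(self.centroids[lab]))
```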

---
## Phase 2: Assessment (Parallel)
### D-Arbiter Evaluation
| Idea | Novelty | Feasibility | Impact | Testability | Decision |
|------|---------|-------------|--------|-------------|----------|
| D1 - Model Router | 45 | 72 | 85 | 88 | **SHORTLIST** |
| D2 - Token Budget + Cache | 38 | 78 | 62 | 82 | **SHORTLIST** |
| D3 - Constrained Output | 30 | 90 | 48 | 95 | **SHORTLIST** |
| D4 - Self-Hosted Batch | 52 | 48 | 90 | 60 | **HOLD** |
Key assumptions: Classifier can achieve <5% misclassification; cached queries are sufficiently homogeneous; format retries are currently measurable.
Top risks: Silent quality regression from misclassification; compression artifacts on domain-specific content; compounding interaction effects when stacking techniques.
Asks to P group: Non-obvious signals for better classifiers? Lateral plays on semantic caching? Hybrid path for self-hosting economics?
Convergence: TRUE (right quadrant identified).
Overall novelty: 38. Repetition rate: 72.
### P-Arbiter Evaluation
| Idea | Novelty | Feasibility | Impact | Testability | Decision |
|------|---------|-------------|--------|-------------|----------|
| P1 - KV-Cache Architecture | 42 | 92 | 78 | 95 | **SHORTLIST** |
| P2 - Consequence Routing | 81 | 62 | 88 | 71 | **SHORTLIST** |
| P3 - Semantic Dedup Cache | 55 | 85 | 65 | 90 | **SHORTLIST** |
| P4 - Inverse Prompt Eng | 38 | 80 | 72 | 85 | **HOLD** |
Key assumptions: Cache hit rate >60% achievable; "consequence" can be operationally defined; query distribution has sufficient semantic redundancy.
Top risks: Misrouting high-consequence requests causes silent quality failures; prompt restructuring may need app-layer refactors; cache staleness for time-sensitive content.
Asks to D group: Concrete labeling scheme for "consequence"? Embedding latency feasibility at scale? Decompose P4 into atomic sub-proposals.
Convergence: FALSE — consequence classification needs methodology, gap on batching/async.
Overall novelty: 61. Repetition rate: 44.

---
## Phase 3: Meta-Arbiter Merge
### Selected Ideas
**Primary Picks:**
1. **P1 — KV-Cache-Aware Prompt Architecture**: Strongest single idea. Feasibility 92, testability 95, zero quality risk, ships in days. Impact is mechanistic, not probabilistic.
2. **D1 — Complexity-Gated Model Router**: Highest impact ceiling (50-70%), well-understood pattern. Silent regression risk manageable with shadow scoring.
3. **P2 — Consequence-Based Routing** (conditional): Novel reframing from complexity to consequence. Impact matches D1. Primary conditional on resolving operationalization gap.
**Secondary Picks:**
4. **D2 — Input Token Budget + Semantic Cache**: Modular, independently deployable, 30-55% combined impact. Strong supporting role.
5. **D3 — Constrained Output**: Pure hygiene, zero downside, 15-35%. Ship alongside P1.
### Productive Merges
1. **Unified Routing Signal (D1 + P2)**: Complexity (input-side) × Consequence (outcome-side) = orthogonal features that combine into a strictly better classifier. Production labels for consequence come from D1's escalation logs, solving P2's operationalization problem. Most important merge.
2. **Stacked Cache Layers (D2 + P1)**: Semantic dedup cache (request-side) + KV-cache prefix (provider-side) work at different layers. Combined deflection could reach 60-70%.
3. **Prompt Hygiene Sprint (D3 + P1)**: Both zero-risk, zero-infrastructure changes that compound (P1 cuts input, D3 cuts output).
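Merge 1's combined gate can be sketched as an OR over the two orthogonal scores; the cutoffs are illustrative, not tuned values from the run:

```python
def unified_route(complexity_score, consequence_score,
                  complexity_cut=0.6, consequence_cut=0.3):
    """Send a request to the frontier model if EITHER the input-side
    complexity score or the outcome-side consequence score clears its
    cutoff; otherwise use the cheap model."""
    if complexity_score >= complexity_cut or consequence_score >= consequence_cut:
        return "frontier"
    return "cheap"
```

In a real deployment the two scores would more likely feed one classifier as joint features, but the OR gate shows why the signals are complementary: either one alone can veto the cheap path.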
### Rejections
- P3 standalone (absorbed into Merge 2)
- D4 current phase (operational surface too large before harvesting simpler wins)
- P4 bundled (four ideas in a trenchcoat — atomic pieces absorbed into existing workstreams)
### Recommended Sequencing
- Week 1: Audit prompts, enable prefix caching (P1), enforce output constraints (D3)
- Week 2: Instrument consequence signals in production logs
- Week 3-4: Deploy semantic cache pilot on highest-volume endpoint
- Month 2: Build V1 complexity router, shadow test
- Month 3: Test unified complexity × consequence router (Merge 1)
### Confidence
Medium-high. P1 and D3 are high confidence. Routing (D1/P2/Merge 1) is medium — depends on eval harness and consequence labeling. The 50% target is achievable through P1 + D1 + D3 alone.