# D/P Council Run — LLM Inference Cost Reduction

**Topic**: Best approach to reduce LLM inference costs by 50% without quality loss?
**Mode**: D/P (Deterministic/Probabilistic)
**Flow**: Parallel, 1 round
**Tier**: Light (all subagents on default model — Sonnet 4.6)
**Date**: 2026-03-05 19:22 UTC
**Subagent calls**: 5 (2 freethinkers → 2 arbiters → 1 meta-arbiter)

---

## Phase 1: Ideation (Parallel)

### D-Freethinker — 4 Ideas

**1. Complexity-Gated Model Router**
Route 60-75% of production traffic to cheap models via a complexity classifier trained on production logs. Use logprob confidence as the escalation signal. Published routing studies (RouteLLM, FrugalGPT) show equivalent task accuracy on simple requests. Cost reduction: 50-70%. Rollback is trivial.

**2. Input Token Budget Enforcement with Semantic Deduplication**
Three independently deployable sub-techniques: (a) compress bloated prompts via LLMLingua (30-50% compression); (b) trim RAG context to the top-K most relevant chunks; (c) add a semantic cache layer (cosine similarity >0.95) for repeated queries (20-40% cache hit rate on high-repetition workloads). Combined: 30-55% cost reduction. Quality is neutral to positive (less noise = better precision).

**3. Constrained Output Enforcement to Eliminate Retry Overhead**
Use json_schema response_format and constrained decoding to eliminate format retries (typically 8-15% → <1%). Cap max_tokens tightly once the output schema is known. Strip CoT from tasks where it doesn't measurably improve accuracy. Output token reduction: 20-40%. Combined: 15-35%.

**4. Async Batch Offload to Self-Hosted Open-Weight Models**
Run background/async workloads (nightly summarization, classification, eval runs) on self-hosted Llama 3.1 70B or Qwen 2.5 72B via vLLM on spot GPUs. 80-90% cheaper per token. If 40-60% of the workload is batch-eligible, blended reduction: 35-55%. Requires operational investment. Breakeven at ~$500/month API spend.

### P-Freethinker — 4 Ideas

**1.
KV-Cache-Aware Prompt Architecture**
The real waste isn't in generation — it's in context. Teams pay to reprocess the same static content (system prompts, docs, schemas) on every request, often 40-70% of total tokens. Restructure prompts so the static content leads (a cacheable prefix). Anthropic's cached tokens cost 10% of the normal input rate, so with 60% static content the ceiling is a ~54% drop in input cost (0.6 × 0.9); realized savings scale with the cache hit rate. Ships in days, zero quality risk.

**2. Task-Aware Speculative Routing via Consequence Classification**
Don't route on complexity — route on *consequence*. Build a consequence classifier (from production logs: corrections, escalations, retry rates) that predicts P(this request needs the frontier model). Route ~70% of low-consequence traffic to cheap models. Track quality per consequence segment, not in aggregate. Blended cost drop: 60-75%. The routing signal isn't complexity; it's downstream impact.

**3. Semantic Request Deduplication Cache**
Production traffic carries massive semantic redundancy that exact-match caching misses ("summarize for a 5-year-old" vs "ELI5 this"). Deploy an embedding-based similarity cache (cosine >0.94) using fast local embeddings. For high-volume workloads with query overlap, expect 20-50% request deflection. Ships in a weekend with existing infrastructure. Threshold tuning is the intellectual work.

**4. Inverse Prompt Engineering — Reduce Generation Entropy**
Audit generation variance per template. Replace open-ended instructions with structured output schemas (4-6x fewer output tokens). Externalize reasoning to cheap models and inject the results as givens for the frontier model. Replace classification LLM calls entirely with few-shot embedding classifiers trained on LLM labels. Output token reduction: 40-70%. The cost is mostly generated by architectural defaults, not irreducible task complexity.
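Both freethinkers converge on embedding-similarity caching (D2's sub-technique (c) and P3). A minimal sketch of the idea, using a hypothetical `SemanticCache` class and a toy bag-of-words embedding; a real deployment would use a fast local sentence-embedding model and an ANN index instead of a linear scan:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

class SemanticCache:
    """Embedding-similarity cache with a linear scan over stored entries.

    The threshold (cosine >0.94 in P3's framing) is the piece that must be
    tuned on real traffic: too low deflects distinct queries, too high
    misses paraphrases.
    """

    def __init__(self, embed, threshold=0.94):
        self.embed = embed          # callable: str -> vector
        self.threshold = threshold
        self.entries = []           # list of (embedding, cached_response)

    def get(self, query):
        """Return the most similar cached response above threshold, else None."""
        q = self.embed(query)
        best, best_sim = None, self.threshold
        for emb, resp in self.entries:
            sim = cosine(q, emb)
            if sim >= best_sim:
                best, best_sim = resp, sim
        return best

    def put(self, query, response):
        self.entries.append((self.embed(query), response))

# Toy bag-of-words embedding, for illustration only.
VOCAB = ["explain", "quantum", "computing", "simple", "terms"]

def toy_embed(text):
    words = text.lower().split()
    return [float(words.count(w)) for w in VOCAB]

cache = SemanticCache(toy_embed, threshold=0.9)
cache.put("explain quantum computing in simple terms", "<cached answer>")

# A reordered paraphrase deflects to the cache; an unrelated query misses.
hit = cache.get("in simple terms explain quantum computing")
miss = cache.get("draft a release announcement")
```

The deflection rate is then just cache hits divided by total requests, which is the 20-50% figure both proposals estimate.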
---

## Phase 2: Assessment (Parallel)

### D-Arbiter Evaluation

| Idea | Novelty | Feasibility | Impact | Testability | Decision |
|------|---------|-------------|--------|-------------|----------|
| D1 - Model Router | 45 | 72 | 85 | 88 | **SHORTLIST** |
| D2 - Token Budget + Cache | 38 | 78 | 62 | 82 | **SHORTLIST** |
| D3 - Constrained Output | 30 | 90 | 48 | 95 | **SHORTLIST** |
| D4 - Self-Hosted Batch | 52 | 48 | 90 | 60 | **HOLD** |

Key assumptions: classifier can achieve <5% misclassification; cached queries are sufficiently homogeneous; format retries are currently measurable.

Top risks: silent quality regression from misclassification; compression artifacts on domain-specific content; compounding interaction effects when stacking techniques.

Asks to P group: non-obvious signals for better classifiers? Lateral plays on semantic caching? A hybrid path for self-hosting economics?

Convergence: TRUE — right quadrant identified. Overall novelty: 38. Repetition rate: 72.

### P-Arbiter Evaluation

| Idea | Novelty | Feasibility | Impact | Testability | Decision |
|------|---------|-------------|--------|-------------|----------|
| P1 - KV-Cache Architecture | 42 | 92 | 78 | 95 | **SHORTLIST** |
| P2 - Consequence Routing | 81 | 62 | 88 | 71 | **SHORTLIST** |
| P3 - Semantic Dedup Cache | 55 | 85 | 65 | 90 | **SHORTLIST** |
| P4 - Inverse Prompt Eng | 38 | 80 | 72 | 85 | **HOLD** |

Key assumptions: cache hit rate >60% is achievable; "consequence" can be operationally defined; query distribution has sufficient semantic redundancy.

Top risks: misrouting high-consequence requests causes silent quality failures; prompt restructuring may need app-layer refactors; cache staleness for time-sensitive content.

Asks to D group: a concrete labeling scheme for "consequence"? Embedding latency feasibility at scale? Decompose P4 into atomic sub-proposals.

Convergence: FALSE — consequence classification needs methodology; gap on batching/async. Overall novelty: 61. Repetition rate: 44.
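The escalation mechanism behind D1 (answer with the cheap model first, escalate to the frontier model when mean token logprob falls below a tuned threshold) can be sketched as follows. The `route` helper, the stub model calls, and the -0.3 threshold are illustrative assumptions, not a production design, and the upfront complexity classifier is omitted:

```python
from dataclasses import dataclass

@dataclass
class RouteResult:
    answer: str
    model: str      # "cheap" or "frontier"
    escalated: bool

def route(prompt, cheap_call, frontier_call, min_mean_logprob=-0.3):
    """Cheap-first routing with logprob-based escalation.

    cheap_call(prompt) -> (answer, mean_token_logprob). Providers that
    expose per-token logprobs make this confidence signal nearly free.
    The threshold is a placeholder; in practice it is tuned against a
    labeled eval set to hit a target misclassification rate.
    """
    answer, mean_logprob = cheap_call(prompt)
    if mean_logprob >= min_mean_logprob:
        return RouteResult(answer, "cheap", escalated=False)
    # Low confidence from the cheap model: pay for the frontier model.
    return RouteResult(frontier_call(prompt), "frontier", escalated=True)

# Stub model calls standing in for real API clients.
confident = route("What is 2 + 2?", lambda p: ("4", -0.05), lambda p: "4")
unsure = route("Prove the claim.", lambda p: ("maybe?", -1.40),
               lambda p: "<frontier answer>")
```

Rollback is trivial because the escalation path already calls the frontier model; setting the threshold to 0 routes everything there.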
---

## Phase 3: Meta-Arbiter Merge

### Selected Ideas

**Primary Picks:**

1. **P1 — KV-Cache-Aware Prompt Architecture**: Strongest single idea. Feasibility 92, testability 95, zero quality risk, ships in days. Impact is mechanistic, not probabilistic.
2. **D1 — Complexity-Gated Model Router**: Highest impact ceiling (50-70%), a well-understood pattern. Silent regression risk is manageable with shadow scoring.
3. **P2 — Consequence-Based Routing** (conditional): Novel reframing from complexity to consequence. Impact matches D1. Promotion to a primary pick is conditional on resolving the operationalization gap.

**Secondary Picks:**

4. **D2 — Input Token Budget + Semantic Cache**: Modular, independently deployable, 30-55% combined impact. Strong supporting role.
5. **D3 — Constrained Output**: Pure hygiene, zero downside, 15-35%. Ship alongside P1.

### Productive Merges

1. **Unified Routing Signal (D1 + P2)**: Complexity (input-side) and consequence (outcome-side) are orthogonal features that combine into a strictly better classifier. Production labels for consequence come from D1's escalation logs, solving P2's operationalization problem. The most important merge.
2. **Stacked Cache Layers (D2 + P1)**: The semantic dedup cache (request-side) and the KV-cache prefix (provider-side) work at different layers. Combined deflection could reach 60-70%.
3. **Prompt Hygiene Sprint (D3 + P1)**: Both are zero-risk, zero-infrastructure changes that compound (P1 cuts input, D3 cuts output).
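Merge 1's unified signal can be sketched as a two-feature logistic score. Everything here (weights, bias, cutoff, function names) is a placeholder of our own invention; in practice the parameters would be fit on consequence labels mined from escalation logs, as the merge proposes:

```python
import math

def p_needs_frontier(complexity, consequence, w_cx=2.0, w_cq=3.0, bias=-2.5):
    """Blend input-side complexity and outcome-side consequence scores
    (both assumed normalized to [0, 1]) into one escalation probability.

    A higher consequence weight encodes the merge's premise: misrouting
    a high-consequence request costs more than over-serving a complex one.
    """
    z = w_cx * complexity + w_cq * consequence + bias
    return 1.0 / (1.0 + math.exp(-z))

def send_to_frontier(complexity, consequence, cutoff=0.5):
    """Routing decision from the blended signal."""
    return p_needs_frontier(complexity, consequence) >= cutoff

# Easy, low-stakes request stays on the cheap model; hard, high-stakes escalates.
cheap_ok = send_to_frontier(0.1, 0.1)
escalate = send_to_frontier(0.9, 0.9)
```

Tracking quality per consequence segment (P2's evaluation rule) then reduces to bucketing requests by the `consequence` input before aggregating metrics.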
### Rejections

- P3 standalone (absorbed into Merge 2)
- D4 in the current phase (operational surface too large before harvesting simpler wins)
- P4 as bundled (four ideas in a trenchcoat — atomic pieces absorbed into existing workstreams)

### Recommended Sequencing

- Week 1: Audit prompts, enable prefix caching (P1), enforce output constraints (D3)
- Week 2: Instrument consequence signals in production logs
- Weeks 3-4: Deploy semantic cache pilot on the highest-volume endpoint
- Month 2: Build V1 complexity router, shadow test
- Month 3: Test unified complexity × consequence router (Merge 1)

### Confidence

Medium-high. P1 and D3 are high confidence. Routing (D1/P2/Merge 1) is medium — it depends on the eval harness and consequence labeling. The 50% target is achievable through P1 + D1 + D3 alone.
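A back-of-envelope check on that last claim: if the three techniques' savings were independent, the remaining cost would multiply out as below. Independence is a simplification (the techniques overlap, e.g. a routed request also hits the prefix cache), so treat the result as a rough ceiling; the input numbers are illustrative midpoints derived from the ranges above, not measurements:

```python
def remaining_cost(reductions):
    """Fraction of spend left after stacking multiplicative reductions.

    Each entry is one technique's overall cost reduction in [0, 1];
    stacking assumes the reductions apply independently to whatever
    cost remains after the previous technique.
    """
    frac = 1.0
    for r in reductions:
        frac *= 1.0 - r
    return frac

# Illustrative inputs:
#   P1: ~54% off input tokens, input assumed ~60% of spend -> ~0.32 overall
#   D1: conservative low end of its 50-70% range           -> 0.50
#   D3: conservative low end of its 15-35% range           -> 0.15
frac_left = remaining_cost([0.32, 0.50, 0.15])
total_savings = 1.0 - frac_left
```

Even with each technique at its conservative end, the stacked estimate clears the 50% target, which is why the report rates P1 + D1 + D3 as sufficient on their own.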