- D/P run: 5 subagents, ~77k tokens, produced scored shortlists + merges
- Personality run: 4 subagents, ~62k tokens, produced narrative + verdict
- Comparison: D/P better for concrete ideas/scoring, personality better for adversarial tension/narrative
- Key finding: D/P lacks built-in skeptic, personality lacks structured scoring
- Proposed improvement: hybrid mode combining both strengths
D/P Council Run — LLM Inference Cost Reduction
Topic: Best approach to reduce LLM inference costs by 50% without quality loss?
Mode: D/P (Deterministic/Probabilistic)
Flow: Parallel, 1 round
Tier: Light (all subagents on default model — Sonnet 4.6)
Date: 2026-03-05 19:22 UTC
Subagent calls: 5 (2 freethinkers → 2 arbiters → 1 meta-arbiter)
Phase 1: Ideation (Parallel)
D-Freethinker — 4 Ideas
1. Complexity-Gated Model Router Route 60-75% of production traffic to cheap models via a complexity classifier trained on production logs. Use logprob confidence as escalation signal. Published routing studies (RouteLLM, FrugalGPT) show equivalent task accuracy on simple requests. Cost reduction: 50-70%. Rollback is trivial.
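A minimal sketch of the gating logic, assuming a hypothetical `classify_complexity` score and placeholder model names; the threshold and logprob confidence floor are illustrative, not tuned values:

```python
# Complexity-gated router with logprob-based escalation (sketch).
CHEAP_MODEL = "cheap-model"        # placeholder names, not real endpoints
FRONTIER_MODEL = "frontier-model"

def classify_complexity(prompt: str) -> float:
    """Stub: return P(request needs the frontier model). In production this
    would be a lightweight classifier trained on production logs."""
    return min(len(prompt) / 2000, 1.0)  # toy proxy: longer prompts score higher

def route(prompt: str, threshold: float = 0.4) -> str:
    """Send low-complexity traffic to the cheap model."""
    return FRONTIER_MODEL if classify_complexity(prompt) > threshold else CHEAP_MODEL

def escalate_if_unsure(model: str, mean_logprob: float, floor: float = -1.5) -> str:
    """Escalate when the cheap model's average token logprob falls below a
    confidence floor (the escalation signal described above)."""
    if model == CHEAP_MODEL and mean_logprob < floor:
        return FRONTIER_MODEL
    return model
```

The rollback path is what makes this trivial to de-risk: raising `threshold` to 1.0 routes everything back to the frontier model.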
2. Input Token Budget Enforcement with Semantic Deduplication Three independently deployable sub-techniques: (a) Compress bloated prompts via LLMLingua (30-50% compression), (b) RAG context trimming to top-K most relevant chunks, (c) Semantic cache layer (cosine sim >0.95) for repeated queries (20-40% cache hit rate on high-repetition workloads). Combined: 30-55% cost reduction. Quality neutral to positive (less noise = better precision).
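Sub-technique (c) can be sketched with a toy hash embedding standing in for a real sentence encoder; the 0.95 cosine threshold comes from the idea above, everything else is illustrative:

```python
# Semantic cache sketch: toy bag-of-words hash embedding + cosine lookup.
import math
from collections import Counter

def embed(text: str, dim: int = 64) -> list[float]:
    """Toy embedding for illustration; swap in a real embedding model."""
    vec = [0.0] * dim
    for tok, n in Counter(text.lower().split()).items():
        vec[hash(tok) % dim] += n
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

class SemanticCache:
    def __init__(self, threshold: float = 0.95):
        self.threshold = threshold
        self.entries: list[tuple[list[float], str]] = []  # linear scan; use ANN at scale

    def get(self, query: str):
        q = embed(query)
        for vec, answer in self.entries:
            if cosine(q, vec) >= self.threshold:
                return answer
        return None

    def put(self, query: str, answer: str):
        self.entries.append((embed(query), answer))
```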
3. Constrained Output Enforcement to Eliminate Retry Overhead Use json_schema response_format and constrained decoding to eliminate format retries (typical 8-15% → <1%). Cap max_tokens tightly once output schema is known. Strip CoT from tasks where it doesn't measurably improve accuracy. Output token reduction: 20-40%. Combined: 15-35%.
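A sketch of what the request body might look like with an OpenAI-style `json_schema` response format; the schema, model name, and `max_tokens` cap are all illustrative:

```python
# Illustrative structured-output request body (OpenAI Chat Completions shape).
payload = {
    "model": "gpt-4o-mini",           # any model supporting structured outputs
    "messages": [{"role": "user", "content": "Classify the support ticket."}],
    "max_tokens": 64,                  # tight cap: the schema bounds output size
    "response_format": {
        "type": "json_schema",
        "json_schema": {
            "name": "ticket_label",
            "strict": True,            # constrained decoding, no format retries
            "schema": {
                "type": "object",
                "properties": {
                    "category": {"type": "string",
                                 "enum": ["billing", "bug", "other"]},
                    "priority": {"type": "integer", "minimum": 1, "maximum": 3},
                },
                "required": ["category", "priority"],
                "additionalProperties": False,
            },
        },
    },
}
```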
4. Async Batch Offload to Self-Hosted Open-Weight Models Run background/async workloads (nightly summarization, classification, eval runs) on self-hosted Llama 3.1 70B or Qwen 2.5 72B via vLLM on spot GPUs. 80-90% cheaper per token. If 40-60% of workload is batch-eligible, blended reduction: 35-55%. Requires operational investment. Breakeven at ~$500/month API spend.
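The blended arithmetic is straightforward; a sketch assuming the 80-90% per-token discount quoted above:

```python
def blended_savings(batch_frac: float, self_host_discount: float = 0.85) -> float:
    """Fraction of total spend saved by moving batch-eligible work to
    self-hosted open-weight models. `self_host_discount` is the per-token
    saving (0.80-0.90 per the estimate above)."""
    return batch_frac * self_host_discount
```

With 40-60% of the workload batch-eligible this yields roughly 0.34-0.54, matching the 35-55% blended range stated above.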
P-Freethinker — 4 Ideas
1. KV-Cache-Aware Prompt Architecture The real waste isn't in generation — it's in context. Teams pay to reprocess the same static content (system prompts, docs, schemas) on every request, often 40-70% of total tokens. Restructure prompts so static content leads (cacheable prefix). Anthropic cached tokens cost 10% of normal. With 60% static content and >60% cache hit rate, input costs drop ~54% immediately. Ships in days, zero quality risk.
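The ~54% figure follows from a simple blended-price model. A sketch, assuming Anthropic-style cached reads at 10% of base price and ignoring the one-time cache-write premium; note the quoted drop corresponds to a near-complete hit rate on the static prefix:

```python
def input_cost_ratio(static_frac: float, hit_rate: float,
                     cached_price: float = 0.10) -> float:
    """Blended input cost relative to no caching. `hit_rate` is the fraction
    of static-prefix tokens actually served from cache."""
    cached = static_frac * hit_rate * cached_price   # static tokens read from cache
    uncached = 1.0 - static_frac * hit_rate          # everything else at full price
    return cached + uncached

# 60% static content, full cache hits: cost ratio 0.46, i.e. a ~54% drop.
# At a 60% hit rate the same inputs give a ~32% drop.
```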
2. Task-Aware Speculative Routing via Consequence Classification Don't route on complexity — route on consequence. Build a consequence classifier (from production logs — corrections, escalations, retry rates) that predicts P(this request needs frontier model). Route ~70% low-consequence traffic to cheap models. Track quality per consequence segment, not aggregate. Blended cost drop: 60-75%. The routing signal isn't complexity; it's downstream impact.
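How consequence labels might be mined from the production signals listed above, sketched with hypothetical log field names (adapt to your own logging schema):

```python
def consequence_label(event: dict) -> int:
    """1 = high consequence: the request later triggered a correction, a human
    escalation, or a quick user retry; 0 otherwise. Field names are
    hypothetical stand-ins for real log fields."""
    return int(event.get("user_corrected", False)
               or event.get("escalated_to_human", False)
               or event.get("retry_within_5m", False))

def route_by_consequence(p_high: float, threshold: float = 0.3) -> str:
    """`p_high` comes from a classifier trained on consequence_label outputs;
    the threshold is illustrative."""
    return "frontier-model" if p_high >= threshold else "cheap-model"
```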
3. Semantic Request Deduplication Cache Massive semantic redundancy exists in production that exact-match caching misses ("summarize for a 5-year-old" vs "ELI5 this"). Deploy embedding-based similarity cache (cosine >0.94) using fast local embeddings. For high-volume workloads with query overlap, expect 20-50% request deflection. Ships in a weekend with existing infrastructure. Threshold tuning is the intellectual work.
4. Inverse Prompt Engineering — Reduce Generation Entropy Audit generation variance per template. Replace open-ended instructions with structured output schemas (4-6x fewer output tokens). Externalize reasoning to cheap models and inject results as givens for frontier. Replace classification LLM calls entirely with few-shot embedding classifiers trained on LLM labels. Output token reduction: 40-70%. The cost is mostly generated by architectural defaults, not irreducible task complexity.
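The last sub-idea (few-shot embedding classifiers trained on LLM labels) can be sketched as a nearest-centroid classifier; the vectors here are toy stand-ins for real embeddings:

```python
def centroid(vectors: list[list[float]]) -> list[float]:
    """Mean vector of a set of embeddings."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

class CentroidClassifier:
    """Replaces a per-request LLM classification call: embed once, compare
    against per-label centroids built from LLM-labeled examples."""
    def __init__(self, labeled: dict[str, list[list[float]]]):
        self.centroids = {lbl: centroid(vecs) for lbl, vecs in labeled.items()}

    def predict(self, vec: list[float]) -> str:
        def sq_dist(a, b):
            return sum((x - y) ** 2 for x, y in zip(a, b))
        return min(self.centroids, key=lambda lbl: sq_dist(vec, self.centroids[lbl]))
```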
Phase 2: Assessment (Parallel)
D-Arbiter Evaluation
| Idea | Novelty | Feasibility | Impact | Testability | Decision |
|---|---|---|---|---|---|
| D1 - Model Router | 45 | 72 | 85 | 88 | SHORTLIST |
| D2 - Token Budget + Cache | 38 | 78 | 62 | 82 | SHORTLIST |
| D3 - Constrained Output | 30 | 90 | 48 | 95 | SHORTLIST |
| D4 - Self-Hosted Batch | 52 | 48 | 90 | 60 | HOLD |
Key assumptions: Classifier can achieve <5% misclassification; cached queries are sufficiently homogeneous; format retries are currently measurable.
Top risks: Silent quality regression from misclassification; compression artifacts on domain-specific content; compounding interaction effects when stacking techniques.
Asks to P group: Non-obvious signals for better classifiers? Lateral plays on semantic caching? Hybrid path for self-hosting economics?
Convergence: TRUE — right quadrant identified. Overall novelty: 38. Repetition rate: 72.
P-Arbiter Evaluation
| Idea | Novelty | Feasibility | Impact | Testability | Decision |
|---|---|---|---|---|---|
| P1 - KV-Cache Architecture | 42 | 92 | 78 | 95 | SHORTLIST |
| P2 - Consequence Routing | 81 | 62 | 88 | 71 | SHORTLIST |
| P3 - Semantic Dedup Cache | 55 | 85 | 65 | 90 | SHORTLIST |
| P4 - Inverse Prompt Eng | 38 | 80 | 72 | 85 | HOLD |
Key assumptions: Cache hit rate >60% achievable; "consequence" can be operationally defined; query distribution has sufficient semantic redundancy.
Top risks: Misrouting high-consequence requests causes silent quality failures; prompt restructuring may need app-layer refactors; cache staleness for time-sensitive content.
Asks to D group: Concrete labeling scheme for "consequence"? Embedding latency feasibility at scale? Decompose P4 into atomic sub-proposals.
Convergence: FALSE — consequence classification needs methodology, gap on batching/async. Overall novelty: 61. Repetition rate: 44.
Phase 3: Meta-Arbiter Merge
Selected Ideas
Primary Picks:
- P1 — KV-Cache-Aware Prompt Architecture: Strongest single idea. Feasibility 92, testability 95, zero quality risk, ships in days. Impact is mechanistic, not probabilistic.
- D1 — Complexity-Gated Model Router: Highest impact ceiling (50-70%), well-understood pattern. Silent regression risk manageable with shadow scoring.
- P2 — Consequence-Based Routing (conditional): Novel reframing from complexity to consequence. Impact matches D1. Primary conditional on resolving operationalization gap.
Secondary Picks:
- D2 — Input Token Budget + Semantic Cache: Modular, independently deployable, 30-55% combined impact. Strong supporting role.
- D3 — Constrained Output: Pure hygiene, zero downside, 15-35%. Ship alongside P1.
Productive Merges
- Unified Routing Signal (D1 + P2): Complexity (input-side) × Consequence (outcome-side) = orthogonal features that combine into a strictly better classifier. Production labels for consequence come from D1's escalation logs, solving P2's operationalization problem. Most important merge.
- Stacked Cache Layers (D2 + P1): Semantic dedup cache (request-side) + KV-cache prefix (provider-side) work at different layers. Combined deflection could reach 60-70%.
- Prompt Hygiene Sprint (D3 + P1): Both zero-risk, zero-infrastructure changes that compound (P1 cuts input, D3 cuts output).
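The unified routing signal from the first merge reduces, in its simplest form, to a weighted combination of the two scores; the weights and threshold below are illustrative, not tuned values:

```python
def unified_route(complexity: float, p_consequence: float,
                  w_cx: float = 0.4, w_cons: float = 0.6,
                  threshold: float = 0.5) -> str:
    """Combine the input-side complexity score with the outcome-side
    consequence probability into one routing decision (Merge 1 sketch)."""
    score = w_cx * complexity + w_cons * p_consequence
    return "frontier-model" if score >= threshold else "cheap-model"
```

In practice the two scores would feed a single trained classifier rather than a fixed linear blend; the sketch just shows the orthogonal-features claim concretely.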
Rejections
- P3 standalone (absorbed into Merge 2)
- D4 current phase (operational surface too large before harvesting simpler wins)
- P4 bundled (four ideas in a trenchcoat — atomic pieces absorbed into existing workstreams)
Recommended Sequencing
- Week 1: Audit prompts, enable prefix caching (P1), enforce output constraints (D3)
- Week 2: Instrument consequence signals in production logs
- Week 3-4: Deploy semantic cache pilot on highest-volume endpoint
- Month 2: Build V1 complexity router, shadow test
- Month 3: Test unified complexity × consequence router (Merge 1)
Confidence
Medium-high. P1 and D3 are high confidence. Routing (D1/P2/Merge 1) is medium — depends on eval harness and consequence labeling. The 50% target is achievable through P1 + D1 + D3 alone.