
# D/P Council Run — LLM Inference Cost Reduction
**Topic**: Best approach to reduce LLM inference costs by 50% without quality loss?
**Mode**: D/P (Deterministic/Probabilistic)
**Flow**: Parallel, 1 round
**Tier**: Light (all subagents on default model — Sonnet 4.6)
**Date**: 2026-03-05 19:22 UTC
**Subagent calls**: 5 (2 freethinkers → 2 arbiters → 1 meta-arbiter)

---
## Phase 1: Ideation (Parallel)
### D-Freethinker — 4 Ideas
**1. Complexity-Gated Model Router**
Route 60-75% of production traffic to cheap models via a complexity classifier trained on production logs. Use logprob confidence as escalation signal. Published routing studies (RouteLLM, FrugalGPT) show equivalent task accuracy on simple requests. Cost reduction: 50-70%. Rollback is trivial.
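A minimal sketch of the gate, assuming a `classifier` callable, a cheap model that returns its answer with per-token logprobs, and an illustrative confidence floor; none of these names or the `-0.35` threshold come from the run itself:

```python
def route_request(prompt, classifier, cheap_llm, frontier_llm,
                  confidence_floor=-0.35):
    """Send complex prompts straight to the frontier model; otherwise try
    the cheap model and escalate when its mean token logprob is low."""
    if classifier(prompt) == "complex":
        return frontier_llm(prompt)
    answer, token_logprobs = cheap_llm(prompt)
    mean_logprob = sum(token_logprobs) / len(token_logprobs)
    if mean_logprob < confidence_floor:  # low confidence -> escalate
        return frontier_llm(prompt)
    return answer
```

The escalation path is what makes rollback trivial: setting the floor to `0` routes everything to the frontier model again.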
**2. Input Token Budget Enforcement with Semantic Deduplication**
Three independently deployable sub-techniques: (a) Compress bloated prompts via LLMLingua (30-50% compression), (b) RAG context trimming to top-K most relevant chunks, (c) Semantic cache layer (cosine sim >0.95) for repeated queries (20-40% cache hit rate on high-repetition workloads). Combined: 30-55% cost reduction. Quality neutral to positive (less noise = better precision).
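Sub-technique (b) can be sketched as a plain top-K filter over pre-embedded chunks; the cosine helper and the `(text, embedding)` pair shape are assumptions for illustration, not part of the run:

```python
def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb)

def trim_context(query_vec, chunks, k=3):
    """Keep only the top-k chunks most similar to the query embedding.
    chunks is a list of (text, embedding) pairs."""
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, c[1]), reverse=True)
    return [text for text, _ in ranked[:k]]
```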
**3. Constrained Output Enforcement to Eliminate Retry Overhead**
Use json_schema response_format and constrained decoding to eliminate format retries (typical 8-15% → <1%). Cap max_tokens tightly once output schema is known. Strip CoT from tasks where it doesn't measurably improve accuracy. Output token reduction: 20-40%. Combined: 15-35%.
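A sketch of what the request parameters look like, following the OpenAI-style structured-output shape; the ticket schema and the 64-token cap are illustrative assumptions:

```python
# Illustrative JSON Schema for a bounded classification output.
TICKET_SCHEMA = {
    "type": "object",
    "properties": {
        "category": {"type": "string", "enum": ["bug", "billing", "other"]},
        "priority": {"type": "integer", "minimum": 1, "maximum": 5},
    },
    "required": ["category", "priority"],
    "additionalProperties": False,
}

request_params = {
    "response_format": {
        "type": "json_schema",
        "json_schema": {"name": "ticket", "strict": True, "schema": TICKET_SCHEMA},
    },
    # Once the output schema is known, its serialized size is bounded,
    # so max_tokens can be capped tightly instead of left at the default.
    "max_tokens": 64,
}
```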
**4. Async Batch Offload to Self-Hosted Open-Weight Models**
Run background/async workloads (nightly summarization, classification, eval runs) on self-hosted Llama 3.1 70B or Qwen 2.5 72B via vLLM on spot GPUs. 80-90% cheaper per token. If 40-60% of workload is batch-eligible, blended reduction: 35-55%. Requires operational investment. Breakeven at ~$500/month API spend.
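The blended-reduction range follows directly from the two stated inputs; a worked version of the arithmetic (the 50%/85% midpoint is an interpolation, not a figure from the run):

```python
def blended_reduction(batch_share, self_host_discount):
    """Overall cost reduction when batch-eligible traffic moves to
    self-hosted models and the remainder stays on the API at full price."""
    return batch_share * self_host_discount

low = blended_reduction(0.40, 0.80)   # lower bound of both stated ranges
high = blended_reduction(0.60, 0.90)  # upper bound of both stated ranges
mid = blended_reduction(0.50, 0.85)   # interpolated midpoint
```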
### P-Freethinker — 4 Ideas
**1. KV-Cache-Aware Prompt Architecture**
The real waste isn't in generation; it's in context. Teams pay to reprocess the same static content (system prompts, docs, schemas) on every request, often 40-70% of total tokens. Restructure prompts so static content leads (cacheable prefix). Anthropic cached tokens cost 10% of normal. With 60% static content and >60% cache hit rate, input costs drop ~54% immediately. Ships in days, zero quality risk.
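The ~54% figure can be reproduced from the stated numbers; note it corresponds to the full-hit ceiling, with savings scaling down linearly in the actual hit rate (the function and variable names are ours):

```python
def input_cost_multiplier(static_frac, hit_rate, cached_price_ratio=0.10):
    """Fraction of original input cost after prefix caching: cached static
    tokens are billed at cached_price_ratio of normal; dynamic tokens and
    cache misses are billed at full price."""
    cached = static_frac * hit_rate
    return (1 - cached) + cached * cached_price_ratio

# 60% static content, perfect hit rate: multiplier 0.46 -> ~54% savings.
best_case = 1 - input_cost_multiplier(0.60, 1.0)
# At a 60% hit rate the same structure saves ~32%.
realistic = 1 - input_cost_multiplier(0.60, 0.60)
```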
**2. Task-Aware Speculative Routing via Consequence Classification**
Don't route on complexity — route on *consequence*. Build a consequence classifier (from production logs — corrections, escalations, retry rates) that predicts P(this request needs frontier model). Route ~70% low-consequence traffic to cheap models. Track quality per consequence segment, not aggregate. Blended cost drop: 60-75%. The routing signal isn't complexity; it's downstream impact.
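A sketch of how a consequence label might be derived from the log signals the idea names (corrections, escalations, retries); the field names and the retry cutoff are assumptions:

```python
def consequence_label(log_record):
    """Binary training label for the consequence classifier, derived from
    downstream-impact signals in a production log record (a dict)."""
    signals = (
        log_record.get("user_corrected", False),
        log_record.get("escalated_to_human", False),
        log_record.get("retry_count", 0) >= 2,
    )
    return int(any(signals))
```

Labels like these are what let quality be tracked per consequence segment rather than in aggregate.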
**3. Semantic Request Deduplication Cache**
Massive semantic redundancy exists in production that exact-match caching misses ("summarize for a 5-year-old" vs "ELI5 this"). Deploy embedding-based similarity cache (cosine >0.94) using fast local embeddings. For high-volume workloads with query overlap, expect 20-50% request deflection. Ships in a weekend with existing infrastructure. Threshold tuning is the intellectual work.
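A minimal sketch of the cache, assuming the caller supplies an `embed` function (e.g. a fast local embedding model); the linear scan stands in for a real vector index:

```python
class SemanticCache:
    """Embedding-similarity cache: return a stored response when a new
    query's embedding is within the cosine threshold of a cached one."""

    def __init__(self, embed, threshold=0.94):
        self.embed = embed          # callable: text -> vector
        self.threshold = threshold  # the tuning knob the idea highlights
        self.entries = []           # list of (embedding, response)

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = sum(x * x for x in a) ** 0.5
        nb = sum(y * y for y in b) ** 0.5
        return dot / (na * nb) if na and nb else 0.0

    def lookup(self, query):
        vec = self.embed(query)
        for cached_vec, response in self.entries:
            if self._cosine(vec, cached_vec) >= self.threshold:
                return response
        return None

    def store(self, query, response):
        self.entries.append((self.embed(query), response))
```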
**4. Inverse Prompt Engineering — Reduce Generation Entropy**
Audit generation variance per template. Replace open-ended instructions with structured output schemas (4-6x fewer output tokens). Externalize reasoning to cheap models and inject results as givens for frontier. Replace classification LLM calls entirely with few-shot embedding classifiers trained on LLM labels. Output token reduction: 40-70%. The cost is mostly generated by architectural defaults, not irreducible task complexity.
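The last sub-technique (embedding classifiers trained on LLM labels) can be sketched as a nearest-centroid classifier; the class and its data shapes are illustrative, and embeddings are supplied by the caller:

```python
def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

class CentroidClassifier:
    """Few-shot classifier over embeddings: fit one centroid per label
    (labels produced once by an LLM), then classify new requests by
    nearest centroid instead of a per-request LLM call."""

    def fit(self, embeddings, labels):
        by_label = {}
        for vec, lab in zip(embeddings, labels):
            by_label.setdefault(lab, []).append(vec)
        self.centroids = {lab: centroid(vs) for lab, vs in by_label.items()}
        return self

    def predict(self, vec):
        def dist(c):
            return sum((a - b) ** 2 for a, b in zip(vec, c))
        return min(self.centroids, key=lambda lab: dist(self.centroids[lab]))
```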

---
## Phase 2: Assessment (Parallel)
### D-Arbiter Evaluation
| Idea | Novelty | Feasibility | Impact | Testability | Decision |
|------|---------|-------------|--------|-------------|----------|
| D1 - Model Router | 45 | 72 | 85 | 88 | **SHORTLIST** |
| D2 - Token Budget + Cache | 38 | 78 | 62 | 82 | **SHORTLIST** |
| D3 - Constrained Output | 30 | 90 | 48 | 95 | **SHORTLIST** |
| D4 - Self-Hosted Batch | 52 | 48 | 90 | 60 | **HOLD** |
Key assumptions: Classifier can achieve <5% misclassification; cached queries are sufficiently homogeneous; format retries are currently measurable.
Top risks: Silent quality regression from misclassification; compression artifacts on domain-specific content; compounding interaction effects when stacking techniques.
Asks to P group: Non-obvious signals for better classifiers? Lateral plays on semantic caching? Hybrid path for self-hosting economics?
Convergence: TRUE (right quadrant identified).
Overall novelty: 38. Repetition rate: 72.
### P-Arbiter Evaluation
| Idea | Novelty | Feasibility | Impact | Testability | Decision |
|------|---------|-------------|--------|-------------|----------|
| P1 - KV-Cache Architecture | 42 | 92 | 78 | 95 | **SHORTLIST** |
| P2 - Consequence Routing | 81 | 62 | 88 | 71 | **SHORTLIST** |
| P3 - Semantic Dedup Cache | 55 | 85 | 65 | 90 | **SHORTLIST** |
| P4 - Inverse Prompt Eng | 38 | 80 | 72 | 85 | **HOLD** |
Key assumptions: Cache hit rate >60% achievable; "consequence" can be operationally defined; query distribution has sufficient semantic redundancy.
Top risks: Misrouting high-consequence requests causes silent quality failures; prompt restructuring may need app-layer refactors; cache staleness for time-sensitive content.
Asks to D group: Concrete labeling scheme for "consequence"? Embedding latency feasibility at scale? Decompose P4 into atomic sub-proposals.
Convergence: FALSE — consequence classification needs methodology, gap on batching/async.
Overall novelty: 61. Repetition rate: 44.

---
## Phase 3: Meta-Arbiter Merge
### Selected Ideas
**Primary Picks:**
1. **P1 — KV-Cache-Aware Prompt Architecture**: Strongest single idea. Feasibility 92, testability 95, zero quality risk, ships in days. Impact is mechanistic, not probabilistic.
2. **D1 — Complexity-Gated Model Router**: Highest impact ceiling (50-70%), well-understood pattern. Silent regression risk manageable with shadow scoring.
3. **P2 — Consequence-Based Routing** (conditional): Novel reframing from complexity to consequence. Impact matches D1. Primary conditional on resolving operationalization gap.
**Secondary Picks:**
4. **D2 — Input Token Budget + Semantic Cache**: Modular, independently deployable, 30-55% combined impact. Strong supporting role.
5. **D3 — Constrained Output**: Pure hygiene, zero downside, 15-35%. Ship alongside P1.
### Productive Merges
1. **Unified Routing Signal (D1 + P2)**: Complexity (input-side) × Consequence (outcome-side) = orthogonal features that combine into a strictly better classifier. Production labels for consequence come from D1's escalation logs, solving P2's operationalization problem. Most important merge.
2. **Stacked Cache Layers (D2 + P1)**: Semantic dedup cache (request-side) + KV-cache prefix (provider-side) work at different layers. Combined deflection could reach 60-70%.
3. **Prompt Hygiene Sprint (D3 + P1)**: Both zero-risk, zero-infrastructure changes that compound (P1 cuts input, D3 cuts output).
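Merge 1's combined gate can be sketched as an OR over the two orthogonal scores; the cutoffs are illustrative, not tuned values from the run:

```python
def unified_route(complexity_score, consequence_score,
                  complexity_cut=0.6, consequence_cut=0.3):
    """Send a request to the frontier model if EITHER the input-side
    complexity score or the outcome-side consequence score clears its
    cutoff; otherwise use the cheap model."""
    if complexity_score >= complexity_cut or consequence_score >= consequence_cut:
        return "frontier"
    return "cheap"
```

In a real deployment the two scores would more likely feed one classifier as joint features, but the OR gate shows why the signals are complementary: either one alone can veto the cheap path.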
### Rejections
- P3 standalone (absorbed into Merge 2)
- D4 current phase (operational surface too large before harvesting simpler wins)
- P4 bundled (four ideas in a trenchcoat — atomic pieces absorbed into existing workstreams)
### Recommended Sequencing
- Week 1: Audit prompts, enable prefix caching (P1), enforce output constraints (D3)
- Week 2: Instrument consequence signals in production logs
- Week 3-4: Deploy semantic cache pilot on highest-volume endpoint
- Month 2: Build V1 complexity router, shadow test
- Month 3: Test unified complexity × consequence router (Merge 1)
### Confidence
Medium-high. P1 and D3 are high confidence. Routing (D1/P2/Merge 1) is medium — depends on eval harness and consequence labeling. The 50% target is achievable through P1 + D1 + D3 alone.