- D/P run: 5 subagents, ~77k tokens, produced scored shortlists + merges
- Personality run: 4 subagents, ~62k tokens, produced narrative + verdict
- Comparison: D/P better for concrete ideas/scoring, personality better for adversarial tension/narrative
- Key finding: D/P lacks built-in skeptic, personality lacks structured scoring
- Proposed improvement: hybrid mode combining both strengths
D/P Council Run — LLM Inference Cost Reduction
Topic: Best approach to reduce LLM inference costs by 50% without quality loss?
Mode: D/P (Deterministic/Probabilistic)
Flow: Parallel, 1 round
Tier: Light (all subagents on default model — Sonnet 4.6)
Date: 2026-03-05 19:22 UTC
Subagent calls: 5 (2 freethinkers → 2 arbiters → 1 meta-arbiter)
Phase 1: Ideation (Parallel)
D-Freethinker — 4 Ideas
1. Complexity-Gated Model Router Route 60-75% of production traffic to cheap models via a complexity classifier trained on production logs. Use logprob confidence as escalation signal. Published routing studies (RouteLLM, FrugalGPT) show equivalent task accuracy on simple requests. Cost reduction: 50-70%. Rollback is trivial.
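A minimal sketch of the gating logic, assuming a hypothetical `classify_complexity` score and placeholder model names; the threshold and logprob confidence floor are illustrative, not tuned values:

```python
# Complexity-gated router with logprob-based escalation (sketch).
CHEAP_MODEL = "cheap-model"        # placeholder names, not real endpoints
FRONTIER_MODEL = "frontier-model"

def classify_complexity(prompt: str) -> float:
    """Stub: return P(request needs the frontier model). In production this
    would be a lightweight classifier trained on production logs."""
    return min(len(prompt) / 2000, 1.0)  # toy proxy: longer prompts score higher

def route(prompt: str, threshold: float = 0.4) -> str:
    """Send low-complexity traffic to the cheap model."""
    return FRONTIER_MODEL if classify_complexity(prompt) > threshold else CHEAP_MODEL

def escalate_if_unsure(model: str, mean_logprob: float, floor: float = -1.5) -> str:
    """Escalate when the cheap model's average token logprob falls below a
    confidence floor (the escalation signal described above)."""
    if model == CHEAP_MODEL and mean_logprob < floor:
        return FRONTIER_MODEL
    return model
```

The rollback path is what makes this trivial to de-risk: raising `threshold` to 1.0 routes everything back to the frontier model.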
2. Input Token Budget Enforcement with Semantic Deduplication Three independently deployable sub-techniques: (a) Compress bloated prompts via LLMLingua (30-50% compression), (b) RAG context trimming to top-K most relevant chunks, (c) Semantic cache layer (cosine sim >0.95) for repeated queries (20-40% cache hit rate on high-repetition workloads). Combined: 30-55% cost reduction. Quality neutral to positive (less noise = better precision).
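Sub-technique (c) can be sketched with a toy hash embedding standing in for a real sentence encoder; the 0.95 cosine threshold comes from the idea above, everything else is illustrative:

```python
# Semantic cache sketch: toy bag-of-words hash embedding + cosine lookup.
import math
from collections import Counter

def embed(text: str, dim: int = 64) -> list[float]:
    """Toy embedding for illustration; swap in a real embedding model."""
    vec = [0.0] * dim
    for tok, n in Counter(text.lower().split()).items():
        vec[hash(tok) % dim] += n
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

class SemanticCache:
    def __init__(self, threshold: float = 0.95):
        self.threshold = threshold
        self.entries: list[tuple[list[float], str]] = []  # linear scan; use ANN at scale

    def get(self, query: str):
        q = embed(query)
        for vec, answer in self.entries:
            if cosine(q, vec) >= self.threshold:
                return answer
        return None

    def put(self, query: str, answer: str):
        self.entries.append((embed(query), answer))
```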
3. Constrained Output Enforcement to Eliminate Retry Overhead Use json_schema response_format and constrained decoding to eliminate format retries (typical 8-15% → <1%). Cap max_tokens tightly once output schema is known. Strip CoT from tasks where it doesn't measurably improve accuracy. Output token reduction: 20-40%. Combined: 15-35%.
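A sketch of what the request body might look like with an OpenAI-style `json_schema` response format; the schema, model name, and `max_tokens` cap are all illustrative:

```python
# Illustrative structured-output request body (OpenAI Chat Completions shape).
payload = {
    "model": "gpt-4o-mini",           # any model supporting structured outputs
    "messages": [{"role": "user", "content": "Classify the support ticket."}],
    "max_tokens": 64,                  # tight cap: the schema bounds output size
    "response_format": {
        "type": "json_schema",
        "json_schema": {
            "name": "ticket_label",
            "strict": True,            # constrained decoding, no format retries
            "schema": {
                "type": "object",
                "properties": {
                    "category": {"type": "string",
                                 "enum": ["billing", "bug", "other"]},
                    "priority": {"type": "integer", "minimum": 1, "maximum": 3},
                },
                "required": ["category", "priority"],
                "additionalProperties": False,
            },
        },
    },
}
```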
4. Async Batch Offload to Self-Hosted Open-Weight Models Run background/async workloads (nightly summarization, classification, eval runs) on self-hosted Llama 3.1 70B or Qwen 2.5 72B via vLLM on spot GPUs. 80-90% cheaper per token. If 40-60% of workload is batch-eligible, blended reduction: 35-55%. Requires operational investment. Breakeven at ~$500/month API spend.
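The blended arithmetic is straightforward; a sketch assuming the 80-90% per-token discount quoted above:

```python
def blended_savings(batch_frac: float, self_host_discount: float = 0.85) -> float:
    """Fraction of total spend saved by moving batch-eligible work to
    self-hosted open-weight models. `self_host_discount` is the per-token
    saving (0.80-0.90 per the estimate above)."""
    return batch_frac * self_host_discount
```

With 40-60% of the workload batch-eligible this yields roughly 0.34-0.54, matching the 35-55% blended range stated above.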
P-Freethinker — 4 Ideas
1. KV-Cache-Aware Prompt Architecture The real waste isn't in generation — it's in context. Teams pay to reprocess the same static content (system prompts, docs, schemas) on every request, often 40-70% of total tokens. Restructure prompts so static content leads (cacheable prefix). Anthropic cached tokens cost 10% of normal. With 60% static content and >60% cache hit rate, input costs drop ~54% immediately. Ships in days, zero quality risk.
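The ~54% figure follows from a simple blended-price model. A sketch, assuming Anthropic-style cached reads at 10% of base price and ignoring the one-time cache-write premium; note the quoted drop corresponds to a near-complete hit rate on the static prefix:

```python
def input_cost_ratio(static_frac: float, hit_rate: float,
                     cached_price: float = 0.10) -> float:
    """Blended input cost relative to no caching. `hit_rate` is the fraction
    of static-prefix tokens actually served from cache."""
    cached = static_frac * hit_rate * cached_price   # static tokens read from cache
    uncached = 1.0 - static_frac * hit_rate          # everything else at full price
    return cached + uncached

# 60% static content, full cache hits: cost ratio 0.46, i.e. a ~54% drop.
# At a 60% hit rate the same inputs give a ~32% drop.
```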
2. Task-Aware Speculative Routing via Consequence Classification Don't route on complexity — route on consequence. Build a consequence classifier (from production logs — corrections, escalations, retry rates) that predicts P(this request needs frontier model). Route ~70% low-consequence traffic to cheap models. Track quality per consequence segment, not aggregate. Blended cost drop: 60-75%. The routing signal isn't complexity; it's downstream impact.
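How consequence labels might be mined from the production signals listed above, sketched with hypothetical log field names (adapt to your own logging schema):

```python
def consequence_label(event: dict) -> int:
    """1 = high consequence: the request later triggered a correction, a human
    escalation, or a quick user retry; 0 otherwise. Field names are
    hypothetical stand-ins for real log fields."""
    return int(event.get("user_corrected", False)
               or event.get("escalated_to_human", False)
               or event.get("retry_within_5m", False))

def route_by_consequence(p_high: float, threshold: float = 0.3) -> str:
    """`p_high` comes from a classifier trained on consequence_label outputs;
    the threshold is illustrative."""
    return "frontier-model" if p_high >= threshold else "cheap-model"
```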
3. Semantic Request Deduplication Cache Massive semantic redundancy exists in production that exact-match caching misses ("summarize for a 5-year-old" vs "ELI5 this"). Deploy embedding-based similarity cache (cosine >0.94) using fast local embeddings. For high-volume workloads with query overlap, expect 20-50% request deflection. Ships in a weekend with existing infrastructure. Threshold tuning is the intellectual work.
4. Inverse Prompt Engineering — Reduce Generation Entropy Audit generation variance per template. Replace open-ended instructions with structured output schemas (4-6x fewer output tokens). Externalize reasoning to cheap models and inject results as givens for frontier. Replace classification LLM calls entirely with few-shot embedding classifiers trained on LLM labels. Output token reduction: 40-70%. The cost is mostly generated by architectural defaults, not irreducible task complexity.
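The last sub-idea (few-shot embedding classifiers trained on LLM labels) can be sketched as a nearest-centroid classifier; the vectors here are toy stand-ins for real embeddings:

```python
def centroid(vectors: list[list[float]]) -> list[float]:
    """Mean vector of a set of embeddings."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

class CentroidClassifier:
    """Replaces a per-request LLM classification call: embed once, compare
    against per-label centroids built from LLM-labeled examples."""
    def __init__(self, labeled: dict[str, list[list[float]]]):
        self.centroids = {lbl: centroid(vecs) for lbl, vecs in labeled.items()}

    def predict(self, vec: list[float]) -> str:
        def sq_dist(a, b):
            return sum((x - y) ** 2 for x, y in zip(a, b))
        return min(self.centroids, key=lambda lbl: sq_dist(vec, self.centroids[lbl]))
```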
Phase 2: Assessment (Parallel)
D-Arbiter Evaluation
| Idea | Novelty | Feasibility | Impact | Testability | Decision |
|---|---|---|---|---|---|
| D1 - Model Router | 45 | 72 | 85 | 88 | SHORTLIST |
| D2 - Token Budget + Cache | 38 | 78 | 62 | 82 | SHORTLIST |
| D3 - Constrained Output | 30 | 90 | 48 | 95 | SHORTLIST |
| D4 - Self-Hosted Batch | 52 | 48 | 90 | 60 | HOLD |
Key assumptions: Classifier can achieve <5% misclassification; cached queries are sufficiently homogeneous; format retries are currently measurable.
Top risks: Silent quality regression from misclassification; compression artifacts on domain-specific content; compounding interaction effects when stacking techniques.
Asks to P group: Non-obvious signals for better classifiers? Lateral plays on semantic caching? Hybrid path for self-hosting economics?
Convergence: TRUE — right quadrant identified. Overall novelty: 38. Repetition rate: 72.
P-Arbiter Evaluation
| Idea | Novelty | Feasibility | Impact | Testability | Decision |
|---|---|---|---|---|---|
| P1 - KV-Cache Architecture | 42 | 92 | 78 | 95 | SHORTLIST |
| P2 - Consequence Routing | 81 | 62 | 88 | 71 | SHORTLIST |
| P3 - Semantic Dedup Cache | 55 | 85 | 65 | 90 | SHORTLIST |
| P4 - Inverse Prompt Eng | 38 | 80 | 72 | 85 | HOLD |
Key assumptions: Cache hit rate >60% achievable; "consequence" can be operationally defined; query distribution has sufficient semantic redundancy.
Top risks: Misrouting high-consequence requests causes silent quality failures; prompt restructuring may need app-layer refactors; cache staleness for time-sensitive content.
Asks to D group: Concrete labeling scheme for "consequence"? Embedding latency feasibility at scale? Decompose P4 into atomic sub-proposals.
Convergence: FALSE — consequence classification needs methodology, gap on batching/async. Overall novelty: 61. Repetition rate: 44.
Phase 3: Meta-Arbiter Merge
Selected Ideas
Primary Picks:
- P1 — KV-Cache-Aware Prompt Architecture: Strongest single idea. Feasibility 92, testability 95, zero quality risk, ships in days. Impact is mechanistic, not probabilistic.
- D1 — Complexity-Gated Model Router: Highest impact ceiling (50-70%), well-understood pattern. Silent regression risk manageable with shadow scoring.
- P2 — Consequence-Based Routing (conditional): Novel reframing from complexity to consequence. Impact matches D1. Primary conditional on resolving operationalization gap.
Secondary Picks:
- D2 — Input Token Budget + Semantic Cache: Modular, independently deployable, 30-55% combined impact. Strong supporting role.
- D3 — Constrained Output: Pure hygiene, zero downside, 15-35%. Ship alongside P1.
Productive Merges
- Unified Routing Signal (D1 + P2): Complexity (input-side) × Consequence (outcome-side) = orthogonal features that combine into a strictly better classifier. Production labels for consequence come from D1's escalation logs, solving P2's operationalization problem. Most important merge.
- Stacked Cache Layers (D2 + P1): Semantic dedup cache (request-side) + KV-cache prefix (provider-side) work at different layers. Combined deflection could reach 60-70%.
- Prompt Hygiene Sprint (D3 + P1): Both zero-risk, zero-infrastructure changes that compound (P1 cuts input, D3 cuts output).
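The unified routing signal from the first merge reduces, in its simplest form, to a weighted combination of the two scores; the weights and threshold below are illustrative, not tuned values:

```python
def unified_route(complexity: float, p_consequence: float,
                  w_cx: float = 0.4, w_cons: float = 0.6,
                  threshold: float = 0.5) -> str:
    """Combine the input-side complexity score with the outcome-side
    consequence probability into one routing decision (Merge 1 sketch)."""
    score = w_cx * complexity + w_cons * p_consequence
    return "frontier-model" if score >= threshold else "cheap-model"
```

In practice the two scores would feed a single trained classifier rather than a fixed linear blend; the sketch just shows the orthogonal-features claim concretely.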
Rejections
- P3 standalone (absorbed into Merge 2)
- D4 current phase (operational surface too large before harvesting simpler wins)
- P4 bundled (four ideas in a trenchcoat — atomic pieces absorbed into existing workstreams)
Recommended Sequencing
- Week 1: Audit prompts, enable prefix caching (P1), enforce output constraints (D3)
- Week 2: Instrument consequence signals in production logs
- Week 3-4: Deploy semantic cache pilot on highest-volume endpoint
- Month 2: Build V1 complexity router, shadow test
- Month 3: Test unified complexity × consequence router (Merge 1)
Confidence
Medium-high. P1 and D3 are high confidence. Routing (D1/P2/Merge 1) is medium — depends on eval harness and consequence labeling. The 50% target is achievable through P1 + D1 + D3 alone.