docs(council): save D/P and personality run results + mode comparison

- D/P run: 5 subagents, ~77k tokens, produced scored shortlists + merges
- Personality run: 4 subagents, ~62k tokens, produced narrative + verdict
- Comparison: D/P better for concrete ideas/scoring, personality better for adversarial tension/narrative
- Key finding: D/P lacks built-in skeptic, personality lacks structured scoring
- Proposed improvement: hybrid mode combining both strengths
This commit is contained in: zap
2026-03-05 19:44:34 +00:00
parent e08e3d65e9
commit c9fa2e1d95
3 changed files with 282 additions and 0 deletions


@@ -0,0 +1,81 @@
# Personality Council Run — LLM Inference Cost Reduction
**Topic**: Best approach to reduce LLM inference costs by 50% without quality loss?
**Mode**: Personality (Pragmatist / Visionary / Skeptic)
**Flow**: Parallel, 1 round
**Tier**: Light (all subagents on default model — Sonnet 4.6)
**Date**: 2026-03-05 19:32 UTC
**Subagent calls**: 4 (3 advisors parallel → 1 referee)

---
## Advisor Perspectives
### Pragmatist
Two concrete steps hit 50% in under two months, plus a third for later:
1. **Prompt caching (this week)**: Anthropic/OpenAI prefix caching. If system prompts are >1K tokens and reused frequently, cache hit rates of 70-90% are common. 20-35% cost reduction with zero quality impact.
2. **Query router (Q2)**: Routing layer sends simple/classification tasks to smaller models (GPT-4o-mini, Haiku, Llama-3-8B). Can handle 60-80% of traffic at 5-10x lower cost. 40-60% reduction with measurable quality parity on routed tasks.
3. **Self-hosting (later)**: Quantization + self-hosted inference for >$10K/month spend. 2-4 months to operationalize.
Math: routing (40%), then caching (25% of the remaining spend) ≈ 55% total. Achievable in under 2 months.
Avoids: aggressive context truncation, speculative decoding alone, over-relying on fine-tuning.
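The Pragmatist's compounding math is easy to verify: the second lever applies only to the spend the first one leaves behind. A minimal sketch (the 40% and 25% figures are the document's illustrative estimates, not measurements):

```python
def combined_reduction(*reductions: float) -> float:
    """Total fractional cost reduction when each lever applies to
    the spend remaining after the previous one, not the original."""
    remaining = 1.0
    for r in reductions:
        remaining *= 1.0 - r
    return 1.0 - remaining

# Routing cuts 40%; caching then cuts 25% of the remaining 60%.
total = combined_reduction(0.40, 0.25)
print(f"combined reduction: {total:.0%}")  # 55%
```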
### Visionary
"50% without quality loss" is the *wrong framing* — too conservative.
Highest-leverage bets:
1. **Speculative decoding at scale**: Smaller draft model proposes tokens for larger verifier. 2-3x throughput gains with zero quality degradation.
2. **Mixture-of-Experts routing**: Learned, task-aware, dynamic routing. Don't need all 70B parameters for every task.
3. **Hardware-native quantization**: INT4/INT8 with GPTQ/AWQ achieve near-parity on most benchmarks. Teams waiting for "perfect" quantization are leaving 40-60% savings on the table.
4. **Long game**: Neuromorphic/in-memory compute (IBM NorthPole) — 2-3 years to meaningful production deployment.
Recommendation: Don't optimize for 50% — architect for 10x by combining speculative decoding, dynamic MoE routing, and aggressive INT4 quantization now.
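The accept/verify loop behind the Visionary's speculative-decoding bet can be sketched with stand-in functions. `draft_propose` and `verifier_next` below are hypothetical stubs, not any real model API, and a real implementation verifies all k proposals in a single batched forward pass; this sketch calls the verifier per token for clarity:

```python
def speculative_step(prefix, draft_propose, verifier_next, k=4):
    """One speculative-decoding step: keep the longest prefix of the
    draft's k proposed tokens that the verifier agrees with, plus one
    verifier token (so every step makes progress)."""
    proposals = draft_propose(prefix, k)
    accepted = []
    for tok in proposals:
        # Keep a draft token only if the verifier would have produced
        # the same token itself -- this is why output quality exactly
        # matches the verifier's, regardless of draft quality.
        if verifier_next(prefix + accepted) == tok:
            accepted.append(tok)
        else:
            break
    # The verifier supplies the token after the accepted run,
    # correcting the draft or extending the sequence.
    accepted.append(verifier_next(prefix + accepted))
    return accepted
```

Throughput gains come from how often the cheap draft agrees with the verifier: when agreement is high, most tokens are generated at draft-model cost while the verifier only confirms them.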
### Skeptic
"Without quality loss" is almost always a post-hoc rationalization, not a measured outcome.
Key failure modes:
1. **Tail-case invisibility**: Quantized/distilled models degrade on rare reasoning chains, novel instruction formats, edge-case domains — exactly where benchmarks don't measure and where it hurts most.
2. **Silent quality leaks from routing**: Routing classifiers are wrong some percentage of the time, and those errors are systematically invisible (you don't know what the bigger model would have said).
3. **Uneven compression**: Cost reduction hurts low-frequency, high-stakes requests more than high-frequency, low-stakes ones. If 5% of queries are what matters most, you can't see the damage in aggregate metrics.
Prompt caching and batching are the safest levers, but savings vary wildly by workload (could be 10%, not 80%).
Recommendation: Measure quality on actual production distribution before and after any change, with specific focus on tail-case performance. Treat "without quality loss" as a hypothesis to falsify.
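The Skeptic's "measure on the actual production distribution" demand amounts to scoring quality per traffic slice rather than in aggregate. A toy sketch (slice names and the 0-1 quality scores are illustrative placeholders) showing how a healthy aggregate can hide a tail regression:

```python
from collections import defaultdict

def slice_scores(records):
    """records: iterable of (slice_name, quality_score in [0, 1]).
    Returns mean quality per slice instead of one aggregate number,
    so low-volume slices cannot be averaged away."""
    totals = defaultdict(lambda: [0.0, 0])
    for name, score in records:
        totals[name][0] += score
        totals[name][1] += 1
    return {name: s / n for name, (s, n) in totals.items()}

# 95% head traffic scoring well, 5% tail traffic degraded.
records = [("head", 0.95)] * 95 + [("tail", 0.60)] * 5
per_slice = slice_scores(records)
aggregate = sum(s for _, s in records) / len(records)
# aggregate ≈ 0.93 looks fine; per_slice["tail"] == 0.60 does not.
```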

---
## Referee Verdict
### Points of Agreement
- Routing by task complexity is legitimate — all three accept it, Skeptic just demands measurement rigor
- Prompt/KV caching is low-risk — closest thing to a free lunch
- Quantization has real tradeoffs — even Visionary says "near-parity," not parity
- Infrastructure complexity has a cost — none say "do everything at once"
### Key Tensions
1. **Optimize for now vs. architect for later**: Pragmatist wants 50% in two months; Visionary thinks that goal anchors you to a suboptimal architecture. Both have merit — near-term wins build trust and budget for larger bets.
2. **"Quality parity" as fact vs. hypothesis**: Sharpest disagreement. Pragmatist asserts it as measured outcome; Skeptic says it's usually measured on the wrong distribution (head cases, not tail). Skeptic is technically correct — most published routing quality results are on aggregated benchmarks.
3. **Benchmark claims vs. production reality**: Visionary cites research numbers; Skeptic notes controlled benchmarks don't reflect real traffic. Doesn't make Visionary wrong about direction, but implementation risk is higher than headlines suggest.
### Recommendation (sequenced)
1. **This week — Enable prompt caching.** Zero risk, 20-35% reduction. Just turn it on.
2. **Before building a router — instrument your production distribution.** Log a sample with model outputs, establish baseline including tail-case performance. 1-2 weeks. De-risks everything that follows.
3. **Q2 — Build query router with quality gates.** A/B test routing decisions, track tail performance separately from aggregate, define explicit rollback criteria. 40-50% additional reduction if quality holds.
4. **Longer term — adopt Visionary's framing selectively.** Speculative decoding and MoE worth piloting once measurement infrastructure exists. Revisit in 6-12 months.
The Pragmatist's sequence is the right spine — add the Skeptic's measurement layer before each step, not after.
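Step 3's "explicit rollback criteria" can be as simple as a per-slice gate against the big-model baseline. A sketch under assumed inputs: the 0.02 tolerance and the slice scores below are hypothetical, not figures from this run:

```python
def quality_gate(baseline, routed, max_drop=0.02):
    """baseline / routed: dicts mapping slice name -> mean quality.
    Passes only if every slice -- tail included -- stays within
    max_drop of baseline. Returns (passed, failing_slices)."""
    failing = [s for s in baseline
               if routed.get(s, 0.0) < baseline[s] - max_drop]
    return (not failing, failing)

# Head holds, tail regresses past tolerance: gate fails on "tail",
# which is the signal to roll that slice back to the large model.
ok, bad = quality_gate({"head": 0.95, "tail": 0.92},
                       {"head": 0.95, "tail": 0.84})
```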
### Confidence
Medium. This would shift if tail-case data revealed systematic quality degradation in routing; at that point, the 50% target may require accepting some quality tradeoff.