docs(council): save D/P and personality run results + mode comparison
- D/P run: 5 subagents, ~77k tokens, produced scored shortlists + merges
- Personality run: 4 subagents, ~62k tokens, produced narrative + verdict
- Comparison: D/P better for concrete ideas/scoring, personality better for adversarial tension/narrative
- Key finding: D/P lacks built-in skeptic, personality lacks structured scoring
- Proposed improvement: hybrid mode combining both strengths

`memory/council-runs/2026-03-05-dp-inference-costs.md` (new file, 110 lines)

# D/P Council Run — LLM Inference Cost Reduction

**Topic**: Best approach to reduce LLM inference costs by 50% without quality loss?
**Mode**: D/P (Deterministic/Probabilistic)
**Flow**: Parallel, 1 round
**Tier**: Light (all subagents on default model — Sonnet 4.6)
**Date**: 2026-03-05 19:22 UTC
**Subagent calls**: 5 (2 freethinkers → 2 arbiters → 1 meta-arbiter)

---
## Phase 1: Ideation (Parallel)

### D-Freethinker — 4 Ideas
**1. Complexity-Gated Model Router**

Route 60-75% of production traffic to cheap models via a complexity classifier trained on production logs. Use logprob confidence as escalation signal. Published routing studies (RouteLLM, FrugalGPT) show equivalent task accuracy on simple requests. Cost reduction: 50-70%. Rollback is trivial.
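The route-then-escalate loop could be sketched as follows. Model names, the complexity heuristic, and both thresholds are illustrative stand-ins, not details from the run; a production classifier would be trained on logged traffic.

```python
import math

CHEAP_MODEL = "cheap-model"        # illustrative name for a small hosted model
FRONTIER_MODEL = "frontier-model"  # illustrative name for the expensive model

def complexity_score(request: str) -> float:
    """Stand-in for a classifier trained on production logs.
    Here: a trivial length/keyword heuristic, for illustration only."""
    hard_markers = ("prove", "analyze", "multi-step", "legal")
    score = min(len(request) / 2000, 1.0)
    score += 0.3 * sum(m in request.lower() for m in hard_markers)
    return min(score, 1.0)

def route(request: str, complexity_threshold: float = 0.5) -> str:
    """Send easy requests to the cheap model, the rest to the frontier model."""
    return CHEAP_MODEL if complexity_score(request) < complexity_threshold else FRONTIER_MODEL

def mean_logprob_confidence(token_logprobs: list[float]) -> float:
    """Escalation signal: geometric-mean per-token probability of the cheap answer."""
    if not token_logprobs:
        return 0.0
    return math.exp(sum(token_logprobs) / len(token_logprobs))

def needs_escalation(token_logprobs: list[float], min_confidence: float = 0.7) -> bool:
    """If the cheap model's answer is low-confidence, re-run on the frontier model."""
    return mean_logprob_confidence(token_logprobs) < min_confidence
```

The escalation check is what makes rollback trivial: misrouted requests still reach the frontier model, at the cost of one extra cheap call.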
**2. Input Token Budget Enforcement with Semantic Deduplication**

Three independently deployable sub-techniques: (a) Compress bloated prompts via LLMLingua (30-50% compression), (b) RAG context trimming to top-K most relevant chunks, (c) Semantic cache layer (cosine sim >0.95) for repeated queries (20-40% cache hit rate on high-repetition workloads). Combined: 30-55% cost reduction. Quality neutral to positive (less noise = better precision).
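Sub-technique (c) could be sketched like this. The bag-of-words `embed` is a stand-in for a real embedding model; only the cache logic and the 0.95 cosine threshold come from the idea above.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Stand-in embedding: bag-of-words counts. Production would use a real
    embedding model; the cache logic below is the point."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class SemanticCache:
    """Return a cached response when a new query is close enough to a seen one."""
    def __init__(self, threshold: float = 0.95):
        self.threshold = threshold
        self.entries: list[tuple[Counter, str]] = []

    def get(self, query: str):
        q = embed(query)
        best = max(self.entries, key=lambda e: cosine(q, e[0]), default=None)
        if best and cosine(q, best[0]) >= self.threshold:
            return best[1]
        return None

    def put(self, query: str, response: str) -> None:
        self.entries.append((embed(query), response))
```

A linear scan is fine for a sketch; at scale this would sit behind an approximate nearest-neighbor index.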
**3. Constrained Output Enforcement to Eliminate Retry Overhead**

Use json_schema response_format and constrained decoding to eliminate format retries (typical 8-15% → <1%). Cap max_tokens tightly once output schema is known. Strip CoT from tasks where it doesn't measurably improve accuracy. Output token reduction: 20-40%. Combined: 15-35%.
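A request payload in this style might look like the sketch below. Field names follow the OpenAI-style structured-outputs API, but verify them against your provider's docs; the schema and model name are illustrative.

```python
# Sentiment classification with a strict output schema, so malformed-JSON
# retries disappear and max_tokens can be capped tightly.
sentiment_schema = {
    "type": "object",
    "properties": {
        "label": {"type": "string", "enum": ["positive", "negative", "neutral"]},
        "confidence": {"type": "number"},
    },
    "required": ["label", "confidence"],
    "additionalProperties": False,
}

request = {
    "model": "gpt-4o-mini",  # illustrative model name
    "messages": [{"role": "user", "content": "Classify: 'great product!'"}],
    "response_format": {
        "type": "json_schema",
        "json_schema": {"name": "sentiment", "schema": sentiment_schema, "strict": True},
    },
    # Schema-constrained output is bounded, so the token cap can be tight:
    "max_tokens": 50,
}
```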
**4. Async Batch Offload to Self-Hosted Open-Weight Models**

Run background/async workloads (nightly summarization, classification, eval runs) on self-hosted Llama 3.1 70B or Qwen 2.5 72B via vLLM on spot GPUs. 80-90% cheaper per token. If 40-60% of workload is batch-eligible, blended reduction: 35-55%. Requires operational investment. Breakeven at ~$500/month API spend.
### P-Freethinker — 4 Ideas
**1. KV-Cache-Aware Prompt Architecture**

The real waste isn't in generation — it's in context. Teams pay to reprocess the same static content (system prompts, docs, schemas) on every request, often 40-70% of total tokens. Restructure prompts so static content leads (cacheable prefix). Anthropic cached tokens cost 10% of normal. With 60% static content and >60% cache hit rate, input costs drop ~54% immediately. Ships in days, zero quality risk.
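The restructuring and the savings arithmetic could be sketched as follows. The `cache_control` marker mimics Anthropic's prompt-caching API shape (treat the exact field as an assumption to check against provider docs), and the savings formula assumes cached tokens bill at 10% of normal.

```python
def build_messages(static_system: str, static_docs: str, user_query: str) -> dict:
    """Order the prompt so all static content forms a cacheable prefix."""
    return {
        "system": [
            {"type": "text", "text": static_system},
            {"type": "text", "text": static_docs,
             "cache_control": {"type": "ephemeral"}},  # everything up to here is cached
        ],
        "messages": [{"role": "user", "content": user_query}],  # dynamic tail
    }

def input_cost_savings(static_share: float, hit_rate: float,
                       cached_price_ratio: float = 0.10) -> float:
    """Fraction of input cost saved when cached tokens bill at 10% of normal."""
    return static_share * hit_rate * (1 - cached_price_ratio)
```

For example, `input_cost_savings(0.6, 1.0)` gives 0.54, matching the ~54% figure above at a near-complete hit rate; at a 60% hit rate the same formula gives ~32%.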
**2. Task-Aware Speculative Routing via Consequence Classification**

Don't route on complexity — route on *consequence*. Build a consequence classifier (from production logs — corrections, escalations, retry rates) that predicts P(this request needs frontier model). Route ~70% low-consequence traffic to cheap models. Track quality per consequence segment, not aggregate. Blended cost drop: 60-75%. The routing signal isn't complexity; it's downstream impact.
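A minimal sketch of such a classifier, with a hand-rolled logistic score. Feature names, weights, and the routing threshold are all illustrative; in production the weights would be fitted on labeled logs.

```python
import math

def consequence_features(log_row: dict) -> list[float]:
    """Features derived from production-log signals (names are illustrative)."""
    return [
        float(log_row.get("was_corrected", 0)),      # user edited/rejected the answer
        float(log_row.get("was_escalated", 0)),      # handed off to a human or bigger model
        float(log_row.get("retry_count", 0)) / 3.0,  # user retried the request
        float(log_row.get("revenue_touching", 0)),   # request affects money/contracts
    ]

# Illustrative weights; in production these come from fitting on labeled logs.
WEIGHTS = [1.5, 2.0, 1.0, 2.5]
BIAS = -2.0

def p_needs_frontier(log_row: dict) -> float:
    """P(this request needs the frontier model), via a logistic score."""
    z = BIAS + sum(w * x for w, x in zip(WEIGHTS, consequence_features(log_row)))
    return 1 / (1 + math.exp(-z))

def route(log_row: dict, threshold: float = 0.3) -> str:
    """Low-consequence traffic goes to cheap models."""
    return "frontier" if p_needs_frontier(log_row) >= threshold else "cheap"
```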
**3. Semantic Request Deduplication Cache**

Massive semantic redundancy exists in production that exact-match caching misses ("summarize for a 5-year-old" vs "ELI5 this"). Deploy embedding-based similarity cache (cosine >0.94) using fast local embeddings. For high-volume workloads with query overlap, expect 20-50% request deflection. Ships in a weekend with existing infrastructure. Threshold tuning is the intellectual work.
**4. Inverse Prompt Engineering — Reduce Generation Entropy**

Audit generation variance per template. Replace open-ended instructions with structured output schemas (4-6x fewer output tokens). Externalize reasoning to cheap models and inject results as givens for frontier. Replace classification LLM calls entirely with few-shot embedding classifiers trained on LLM labels. Output token reduction: 40-70%. The cost is mostly generated by architectural defaults, not irreducible task complexity.
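The embedding-classifier replacement could be sketched as a nearest-centroid classifier over LLM-labeled examples. The bag-of-words `embed` is again a stand-in for a fast embedding model; labels and examples are illustrative.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Stand-in for a fast local embedding model (bag-of-words for illustration)."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class CentroidClassifier:
    """Few-shot classifier: label = nearest centroid of LLM-labeled examples.
    Replaces a per-request LLM call with a cheap embedding lookup."""
    def __init__(self):
        self.centroids: dict[str, Counter] = {}

    def fit(self, examples: list[tuple[str, str]]) -> None:
        """examples: (text, label) pairs where labels came from an LLM."""
        for text, label in examples:
            self.centroids.setdefault(label, Counter()).update(embed(text))

    def predict(self, text: str) -> str:
        q = embed(text)
        return max(self.centroids, key=lambda lbl: cosine(q, self.centroids[lbl]))
```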
---

## Phase 2: Assessment (Parallel)
### D-Arbiter Evaluation

| Idea | Novelty | Feasibility | Impact | Testability | Decision |
|------|---------|-------------|--------|-------------|----------|
| D1 - Model Router | 45 | 72 | 85 | 88 | **SHORTLIST** |
| D2 - Token Budget + Cache | 38 | 78 | 62 | 82 | **SHORTLIST** |
| D3 - Constrained Output | 30 | 90 | 48 | 95 | **SHORTLIST** |
| D4 - Self-Hosted Batch | 52 | 48 | 90 | 60 | **HOLD** |
Key assumptions: Classifier can achieve <5% misclassification; cached queries are sufficiently homogeneous; format retries are currently measurable.

Top risks: Silent quality regression from misclassification; compression artifacts on domain-specific content; compounding interaction effects when stacking techniques.

Asks to P group: Non-obvious signals for better classifiers? Lateral plays on semantic caching? Hybrid path for self-hosting economics?

Convergence: TRUE — right quadrant identified.

Overall novelty: 38. Repetition rate: 72.
### P-Arbiter Evaluation

| Idea | Novelty | Feasibility | Impact | Testability | Decision |
|------|---------|-------------|--------|-------------|----------|
| P1 - KV-Cache Architecture | 42 | 92 | 78 | 95 | **SHORTLIST** |
| P2 - Consequence Routing | 81 | 62 | 88 | 71 | **SHORTLIST** |
| P3 - Semantic Dedup Cache | 55 | 85 | 65 | 90 | **SHORTLIST** |
| P4 - Inverse Prompt Eng | 38 | 80 | 72 | 85 | **HOLD** |
Key assumptions: Cache hit rate >60% achievable; "consequence" can be operationally defined; query distribution has sufficient semantic redundancy.

Top risks: Misrouting high-consequence requests causes silent quality failures; prompt restructuring may need app-layer refactors; cache staleness for time-sensitive content.

Asks to D group: Concrete labeling scheme for "consequence"? Embedding latency feasibility at scale? Decompose P4 into atomic sub-proposals.

Convergence: FALSE — consequence classification needs methodology, gap on batching/async.

Overall novelty: 61. Repetition rate: 44.
---

## Phase 3: Meta-Arbiter Merge

### Selected Ideas
**Primary Picks:**

1. **P1 — KV-Cache-Aware Prompt Architecture**: Strongest single idea. Feasibility 92, testability 95, zero quality risk, ships in days. Impact is mechanistic, not probabilistic.
2. **D1 — Complexity-Gated Model Router**: Highest impact ceiling (50-70%), well-understood pattern. Silent regression risk manageable with shadow scoring.
3. **P2 — Consequence-Based Routing** (conditional): Novel reframing from complexity to consequence. Impact matches D1. Promotion to primary is conditional on resolving the operationalization gap.
**Secondary Picks:**

4. **D2 — Input Token Budget + Semantic Cache**: Modular, independently deployable, 30-55% combined impact. Strong supporting role.
5. **D3 — Constrained Output**: Pure hygiene, zero downside, 15-35%. Ship alongside P1.
### Productive Merges

1. **Unified Routing Signal (D1 + P2)**: Complexity (input-side) × Consequence (outcome-side) = orthogonal features that combine into a strictly better classifier. Production labels for consequence come from D1's escalation logs, solving P2's operationalization problem. Most important merge.
2. **Stacked Cache Layers (D2 + P1)**: Semantic dedup cache (request-side) + KV-cache prefix (provider-side) work at different layers. Combined deflection could reach 60-70%.
3. **Prompt Hygiene Sprint (D3 + P1)**: Both zero-risk, zero-infrastructure changes that compound (P1 cuts input, D3 cuts output).
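Merge 1's two signals compose naturally. A minimal sketch, assuming the simplest combination (route to the frontier model when either signal fires); thresholds and the OR-combination are illustrative, and a fitted joint classifier would replace this:

```python
def unified_route(complexity: float, consequence: float,
                  c_thresh: float = 0.5, q_thresh: float = 0.3) -> str:
    """Route to the frontier model if a request is hard (input-side signal, D1)
    OR high-stakes (outcome-side signal, P2). Thresholds are illustrative.

    complexity: classifier score from input features, in [0, 1]
    consequence: P(needs frontier) learned from production-log labels, in [0, 1]
    """
    if complexity >= c_thresh or consequence >= q_thresh:
        return "frontier"
    return "cheap"
```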
### Rejections

- P3 standalone (absorbed into Merge 2)
- D4 in the current phase (operational surface too large before harvesting simpler wins)
- P4 as a bundle (four ideas in a trenchcoat — atomic pieces absorbed into existing workstreams)
### Recommended Sequencing

- Week 1: Audit prompts, enable prefix caching (P1), enforce output constraints (D3)
- Week 2: Instrument consequence signals in production logs
- Week 3-4: Deploy semantic cache pilot on highest-volume endpoint
- Month 2: Build V1 complexity router, shadow test
- Month 3: Test unified complexity × consequence router (Merge 1)
### Confidence

Medium-high. P1 and D3 are high confidence. Routing (D1/P2/Merge 1) is medium — depends on eval harness and consequence labeling. The 50% target is achievable through P1 + D1 + D3 alone.

`memory/council-runs/2026-03-05-mode-comparison.md` (new file, 91 lines)

# Council Mode Comparison — Same Topic

**Topic**: Best approach to reduce LLM inference costs by 50% without quality loss?
**Date**: 2026-03-05
**Both runs**: Tier light (Sonnet 4.6 for all subagents), parallel flow, single round

---
## Structural Comparison

| Dimension | Personality Mode | D/P Mode |
|-----------|-----------------|----------|
| Subagent calls | 4 (3 advisors + 1 referee) | 5 (2 freethinkers + 2 arbiters + 1 meta-arbiter) |
| Total runtime | ~75s | ~3.5 min |
| Approximate tokens | ~62k | ~77k |
| Output structure | Opinions → Synthesis | Ideas → Scored shortlists → Cross-group merge |
| Diversity source | Personality lenses (how they think) | Cognitive style (what they optimize for) |
| Final output | Sequenced recommendation with tensions | Selected ideas with merges, rejections, experiments |

---
## What Each Mode Produced

### Personality Mode Strengths
- **The Skeptic's "tail-case invisibility" insight** was the sharpest single contribution across both runs. The concept: quality degradation from routing/quantization hits rare, high-stakes queries hardest — exactly where benchmarks don't measure and where damage is most consequential. This reframing changed the referee's entire recommendation sequence (instrument before routing).
- **Cleaner narrative arc**: Three perspectives → tensions → sequenced verdict. Easier to read and act on.
- **The Visionary pushed scope**: "50% is too conservative, architect for 10x" is a useful provocation even if the referee didn't fully adopt it. It ensured long-term options weren't ignored.
- **Faster and cheaper**: 4 subagent calls vs 5, simpler orchestration.
### D/P Mode Strengths

- **More concrete ideas**: 8 distinct proposals (4 per group) vs 3 position papers. D/P produced actionable workstreams, not just perspectives.
- **Scoring and filtering**: Arbiters scored every idea on novelty/feasibility/impact/testability and made explicit shortlist/hold/reject decisions. This structured evaluation doesn't exist in personality mode.
- **Cross-group merges were genuinely valuable**: The meta-arbiter identified 3 productive merges that neither group proposed alone:
  - Unified complexity × consequence routing (D1 + P2)
  - Stacked cache layers at different architectural levels (D2 + P1)
  - Combined prompt hygiene sprint (D3 + P1)
- **Asks between groups surfaced gaps**: D-Arbiter asked P for non-obvious classifier signals; P-Arbiter asked D for concrete consequence labeling schemes. This cross-pollination wouldn't happen in personality mode without multi-round debate.
- **Convergence signals**: Arbiters explicitly rated whether their group had found its best ideas (D: yes, P: no), which could inform whether to run another round.
- **The "consequence vs complexity" distinction** (P2) was a more novel framing than anything in the personality run. Routing on downstream impact rather than input features is a genuinely different approach.
### D/P Mode Weaknesses

- **No adversarial tension**: Neither group questioned "is 50% even the right goal?" or "will this actually work in production?" The D/P structure generates *complementary* ideas, not *opposing* ones. There's no built-in skeptic.
- **Repetition across groups**: Both groups independently proposed model routing and semantic caching. The meta-arbiter had to merge rather than synthesize genuinely different territory.
- **More expensive and slower**: ~25% more tokens, ~3x longer wall time.
- **Harder to read**: The output is a spreadsheet, not a story. Good for structured decision-making, harder for a human to quickly grok.
### Personality Mode Weaknesses

- **Thin on specifics**: The Pragmatist said "build a query router" but didn't propose concrete approaches. The D-Freethinker produced 4 specific router designs.
- **No scoring or prioritization**: The referee synthesized qualitatively but didn't score or rank. You get a narrative, not a decision matrix.
- **No cross-pollination mechanism**: Advisors don't build on each other in a single round. Multi-round debate (at higher cost) would be needed to get the interaction that D/P gets structurally.

---
## Key Insight Differences

Ideas that appeared in D/P but NOT in personality mode:

- **Consequence-based routing** (vs complexity-based) — a genuinely novel reframing
- **Prompt compression via LLMLingua** — specific tooling recommendation
- **Constrained decoding / json_schema enforcement** as a cost lever
- **Embedding classifier replacement** for classification tasks
- **Concrete sequencing timeline** (week 1 → month 3) from meta-arbiter
Ideas that appeared in personality mode but NOT (or only weakly) in D/P:

- **"Tail-case invisibility"** — the Skeptic's insight about quality degradation being invisible in aggregate metrics
- **Speculative decoding** — the Visionary's bet on draft-model verification
- **Neuromorphic hardware** — longer-term framing
- **"Treat 'without quality loss' as a hypothesis to falsify"** — epistemological reframing of the entire question

---
## When to Use Which

| Use Case | Recommended Mode |
|----------|-----------------|
| "Should we do X?" (opinion/judgment) | Personality |
| "How should we solve X?" (approaches) | D/P |
| Quick brainstorm, fast turnaround | Personality |
| Technical design with scoring/ranking | D/P |
| Need adversarial challenge / devil's advocate | Personality |
| Need complementary ideas from different optimization lenses | D/P |
| User wants a narrative they can read | Personality |
| User wants a decision matrix they can act on | D/P |

---
## Possible Improvements

1. **Hybrid mode**: Run D/P for ideation, then pass results to a Skeptic advisor for adversarial review before the meta-arbiter merges. Gets both structured ideas AND adversarial tension.
2. **Add a Skeptic role to D/P**: A third "adversarial evaluator" alongside the two arbiters who specifically looks for failure modes, hidden assumptions, and tail risks.
3. **Multi-round D/P with bridge packets**: The arbiter "asks" are a natural bridge — running a second round where each group addresses the other's asks would likely improve both shortlists.
4. **Unified output format**: Both modes should produce a comparable final document. Currently personality mode gives a narrative and D/P gives a structured report — hard to compare directly.

(new file, 81 lines)

# Personality Council Run — LLM Inference Cost Reduction

**Topic**: Best approach to reduce LLM inference costs by 50% without quality loss?
**Mode**: Personality (Pragmatist / Visionary / Skeptic)
**Flow**: Parallel, 1 round
**Tier**: Light (all subagents on default model — Sonnet 4.6)
**Date**: 2026-03-05 19:32 UTC
**Subagent calls**: 4 (3 advisors parallel → 1 referee)

---
## Advisor Perspectives

### Pragmatist
Two concrete steps hit 50% in under two months, with a third option for later:

1. **Prompt caching (this week)**: Anthropic/OpenAI prefix caching. If system prompts are >1K tokens and reused frequently, cache hit rates of 70-90% are common. 20-35% cost reduction with zero quality impact.
2. **Query router (Q2)**: A routing layer sends simple/classification tasks to smaller models (GPT-4o-mini, Haiku, Llama-3-8B). Can handle 60-80% of traffic at 5-10x lower cost. 40-60% reduction with measurable quality parity on routed tasks.
3. **Self-hosting (later)**: Quantization + self-hosted inference for >$10K/month spend. 2-4 months to operationalize.

Math: Routing (40%) + Caching (25% on remainder) ≈ 55% total. Achievable in under 2 months.
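The "≈ 55%" math works out as multiplicative stacking. A worked version, assuming an illustrative decomposition of the 40% routing savings (60% of traffic at roughly 1/3 the cost):

```python
# Blended cost after stacking routing, then caching on what remains.
routed_share = 0.60    # share of traffic sent to cheap models (assumed split)
cheap_ratio = 1 / 3    # cheap model's cost relative to frontier (assumed)

after_routing = 1 - routed_share * (1 - cheap_ratio)  # 0.60 of original cost (40% saved)
caching_savings = 0.25                                # caching removes 25% of the remainder
final_cost = after_routing * (1 - caching_savings)    # 0.45 of original cost
total_reduction = 1 - final_cost                      # 0.55, i.e. "≈ 55% total"
```

The key point is that the two savings multiply rather than add: 40% + 25% stacked is 55%, not 65%.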

Avoids: aggressive context truncation, speculative decoding alone, over-relying on fine-tuning.
### Visionary
"50% without quality loss" is the *wrong framing* — too conservative.
|
||||
|
||||
Highest-leverage bets:
|
||||
1. **Speculative decoding at scale**: A smaller draft model proposes tokens for a larger verifier. 2-3x throughput gains with zero quality degradation.
2. **Mixture-of-Experts routing**: Learned, task-aware, dynamic routing. You don't need all 70B parameters for every task.
3. **Hardware-native quantization**: INT4/INT8 with GPTQ/AWQ achieves near-parity on most benchmarks. Teams waiting for "perfect" quantization are leaving 40-60% savings on the table.
4. **Long game**: Neuromorphic/in-memory compute (IBM NorthPole) — 2-3 years to meaningful production deployment.
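The speculative-decoding claim in bet 1 can be sanity-checked with the standard expected-throughput analysis (the formula follows Leviathan et al., 2023; the acceptance rate and draft length below are illustrative):

```python
def expected_tokens_per_target_pass(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per verifier (target-model) forward pass.
    The draft model proposes gamma tokens; each is accepted with
    probability alpha (simplifying assumption: i.i.d. acceptance)."""
    if alpha >= 1.0:
        return float(gamma + 1)
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)
```

With alpha = 0.8 and gamma = 4, each expensive forward pass yields about 3.36 tokens instead of 1, which is where the "2-3x throughput" range lands once the draft model's own cost is subtracted.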

Recommendation: Don't optimize for 50% — architect for 10x by combining speculative decoding, dynamic MoE routing, and aggressive INT4 quantization now.
### Skeptic
"Without quality loss" is almost always a post-hoc rationalization, not a measured outcome.
|
||||
|
||||
Key failure modes:
|
||||
1. **Tail-case invisibility**: Quantized/distilled models degrade on rare reasoning chains, novel instruction formats, and edge-case domains — exactly where benchmarks don't measure and where it hurts most.
2. **Silent quality leaks from routing**: Routing classifiers are wrong some percentage of the time, and those errors are systematically invisible (you don't know what the bigger model would have said).
3. **Uneven compression**: Cost reduction hurts low-frequency, high-stakes requests more than high-frequency, low-stakes ones. If 5% of queries are what matters most, you can't see the damage in aggregate metrics.
Prompt/batch caching is the safest lever, but savings vary wildly by workload (could be 10%, not 80%).

Recommendation: Measure quality on the actual production distribution before and after any change, with specific focus on tail-case performance. Treat "without quality loss" as a hypothesis to falsify.
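The measurement the Skeptic is asking for amounts to stratified before/after comparison. A minimal sketch; the field names (`segment`, `baseline_score`, `candidate_score`) and the regression tolerance are illustrative:

```python
from collections import defaultdict

def stratified_quality(samples: list[dict]) -> dict:
    """Mean quality delta (candidate minus baseline) per traffic segment.
    The aggregate can look flat while tail segments regress badly."""
    by_segment = defaultdict(list)
    for s in samples:
        by_segment[s["segment"]].append(s["candidate_score"] - s["baseline_score"])
    return {seg: sum(deltas) / len(deltas) for seg, deltas in by_segment.items()}

def regressions(samples: list[dict], tolerance: float = -0.02) -> list[str]:
    """Segments whose mean quality delta falls below the tolerance."""
    return [seg for seg, delta in stratified_quality(samples).items()
            if delta < tolerance]
```

Reporting per-segment deltas, rather than one aggregate number, is what makes tail-case damage visible before a rollout decision.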
---

## Referee Verdict

### Points of Agreement
- Routing by task complexity is legitimate — all three accept it; the Skeptic just demands measurement rigor
- Prompt/KV caching is low-risk — the closest thing to a free lunch
- Quantization has real tradeoffs — even the Visionary says "near-parity," not parity
- Infrastructure complexity has a cost — none say "do everything at once"
### Key Tensions

1. **Optimize for now vs. architect for later**: The Pragmatist wants 50% in two months; the Visionary thinks that goal anchors you to a suboptimal architecture. Both have merit — near-term wins build trust and budget for larger bets.
2. **"Quality parity" as fact vs. hypothesis**: The sharpest disagreement. The Pragmatist asserts it as a measured outcome; the Skeptic says it's usually measured on the wrong distribution (head cases, not tail). The Skeptic is technically correct — most published routing quality results are on aggregated benchmarks.
3. **Benchmark claims vs. production reality**: The Visionary cites research numbers; the Skeptic notes controlled benchmarks don't reflect real traffic. That doesn't make the Visionary wrong about direction, but implementation risk is higher than headlines suggest.
### Recommendation (sequenced)

1. **This week — enable prompt caching.** Zero risk, 20-35% reduction. Just turn it on.
2. **Before building a router — instrument your production distribution.** Log a sample with model outputs and establish a baseline that includes tail-case performance. 1-2 weeks. De-risks everything that follows.
3. **Q2 — build the query router with quality gates.** A/B test routing decisions, track tail performance separately from aggregate, define explicit rollback criteria. 40-50% additional reduction if quality holds.
4. **Longer term — adopt the Visionary's framing selectively.** Speculative decoding and MoE are worth piloting once measurement infrastructure exists. Revisit in 6-12 months.

The Pragmatist's sequence is the right spine — add the Skeptic's measurement layer before each step, not after.
### Confidence

Medium. Would shift if tail-case data revealed systematic quality degradation in routing — at that point, the 50% target may require accepting some quality tradeoff.