docs(council): save D/P and personality run results + mode comparison

- D/P run: 5 subagents, ~77k tokens, produced scored shortlists + merges
- Personality run: 4 subagents, ~62k tokens, produced narrative + verdict
- Comparison: D/P better for concrete ideas/scoring, personality better for adversarial tension/narrative
- Key finding: D/P lacks built-in skeptic, personality lacks structured scoring
- Proposed improvement: hybrid mode combining both strengths
2026-03-05 19:44:34 +00:00
parent e08e3d65e9
commit c9fa2e1d95
3 changed files with 282 additions and 0 deletions
@@ -0,0 +1,91 @@
+# Council Mode Comparison — Same Topic
+
+**Topic**: Best approach to reduce LLM inference costs by 50% without quality loss?
+**Date**: 2026-03-05
+**Both runs**: Tier light (Sonnet 4.6 for all subagents), parallel flow, single round
+
+---
+
+## Structural Comparison
+
+| Dimension | Personality Mode | D/P Mode |
+|-----------|-----------------|----------|
+| Subagent calls | 4 (3 advisors + 1 referee) | 5 (2 freethinkers + 2 arbiters + 1 meta-arbiter) |
+| Total runtime | ~75s | ~3.5min |
+| Approximate tokens | ~62k | ~77k |
+| Output structure | Opinions → Synthesis | Ideas → Scored shortlists → Cross-group merge |
+| Diversity source | Personality lenses (how they think) | Cognitive style (what they optimize for) |
+| Final output | Sequenced recommendation with tensions | Selected ideas with merges, rejections, experiments |
+
+---
+
+## What Each Mode Produced
+
+### Personality Mode Strengths
+- **The Skeptic's "tail-case invisibility" insight** was the sharpest single contribution across both runs. The concept: quality degradation from routing/quantization hits rare, high-stakes queries hardest — exactly where benchmarks don't measure and where damage is most consequential. This reframing changed the referee's entire recommendation sequence (instrument before routing).
+- **Cleaner narrative arc**: Three perspectives → tensions → sequenced verdict. Easier to read and act on.
+- **The Visionary pushed scope**: "50% is too conservative, architect for 10x" is a useful provocation even if the referee didn't fully adopt it. It ensured long-term options weren't ignored.
+- **Faster and cheaper**: 4 subagent calls vs 5, simpler orchestration.
+
+### D/P Mode Strengths
+- **More concrete ideas**: 8 proposals (4 per group, with some overlap between groups) vs 3 position papers. D/P produced actionable workstreams, not just perspectives.
+- **Scoring and filtering**: Arbiters scored every idea on novelty/feasibility/impact/testability and made explicit shortlist/hold/reject decisions. This structured evaluation doesn't exist in personality mode.
+- **Cross-group merges were genuinely valuable**: The meta-arbiter identified 3 productive merges that neither group proposed alone:
+  - Unified complexity × consequence routing (D1 + P2)
+  - Stacked cache layers at different architectural levels (D2 + P1)
+  - Combined prompt hygiene sprint (D3 + P1)
+- **Asks between groups surfaced gaps**: D-Arbiter asked P for non-obvious classifier signals; P-Arbiter asked D for concrete consequence labeling schemes. This cross-pollination wouldn't happen in personality mode without multi-round debate.
+- **Convergence signals**: Arbiters explicitly rated whether their group had found its best ideas (D: yes, P: no), which could inform whether to run another round.
+- **The "consequence vs complexity" distinction** (P2) was a more novel framing than anything in the personality run. Routing on downstream impact rather than input features is a genuinely different approach.
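A minimal sketch of the scoring-and-filtering step described above, assuming each arbiter rates an idea from 1 to 5 on the four named criteria and averages them. The scale, the averaging, the cutoffs, and the idea IDs are all illustrative assumptions; the run results do not specify how the arbiters actually combined their scores.

```python
from dataclasses import dataclass
from statistics import mean

# Hypothetical cutoffs; the source does not state the arbiters' thresholds.
SHORTLIST_MIN = 4.0
HOLD_MIN = 3.0


@dataclass(frozen=True)
class IdeaScore:
    """One arbiter's 1-5 ratings for a single idea (assumed scale)."""
    idea_id: str
    novelty: int
    feasibility: int
    impact: int
    testability: int

    def overall(self) -> float:
        # Unweighted mean of the four criteria (an assumption).
        return mean([self.novelty, self.feasibility, self.impact, self.testability])


def decide(score: IdeaScore) -> str:
    """Map an overall score to shortlist / hold / reject."""
    overall = score.overall()
    if overall >= SHORTLIST_MIN:
        return "shortlist"
    if overall >= HOLD_MIN:
        return "hold"
    return "reject"


# Made-up ratings for illustration only; not taken from the actual run.
scores = [
    IdeaScore("idea-a", novelty=3, feasibility=5, impact=5, testability=5),
    IdeaScore("idea-b", novelty=3, feasibility=3, impact=3, testability=4),
    IdeaScore("idea-c", novelty=2, feasibility=2, impact=3, testability=2),
]
decisions = {s.idea_id: decide(s) for s in scores}
print(decisions)
```

An explicit decision per idea is what makes the D/P output comparable across groups, which a purely qualitative synthesis does not offer.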
+
+### D/P Mode Weaknesses
+- **No adversarial tension**: Neither group questioned "is 50% even the right goal?" or "will this actually work in production?" The D/P structure generates *complementary* ideas, not *opposing* ones. There's no built-in skeptic.
+- **Repetition across groups**: Both groups independently proposed model routing and semantic caching, so the meta-arbiter spent part of its effort merging overlapping ideas rather than synthesizing genuinely different territory.
+- **More expensive and slower**: ~25% more tokens (~77k vs ~62k), ~3x longer wall time (~3.5min vs ~75s).
+- **Harder to read**: The output is a spreadsheet, not a story. Good for structured decision-making, harder for a human to quickly grok.
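The cost and speed gap can be checked against the rounded figures in the Structural Comparison table. This is only arithmetic on those approximate values, so the results are approximate too.

```python
# Rounded figures from the Structural Comparison table.
personality = {"tokens": 62_000, "runtime_s": 75}
dp = {"tokens": 77_000, "runtime_s": 3.5 * 60}

# Relative token overhead of D/P over personality mode.
token_overhead = dp["tokens"] / personality["tokens"] - 1

# How many times longer the D/P run took in wall time.
runtime_ratio = dp["runtime_s"] / personality["runtime_s"]

print(f"token overhead: {token_overhead:.0%}")
print(f"runtime ratio: {runtime_ratio:.1f}x")
```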
+
+### Personality Mode Weaknesses
+- **Thin on specifics**: The Pragmatist said "build a query router" but didn't propose concrete approaches. The D-Freethinker produced 4 specific router designs.
+- **No scoring or prioritization**: The referee synthesized qualitatively but didn't score or rank. You get a narrative, not a decision matrix.
+- **No cross-pollination mechanism**: Advisors don't build on each other in a single round. It would take multi-round debate (at higher cost) to get the interaction that D/P gets structurally.
+
+---
+
+## Key Insight Differences
+
+Ideas that appeared in D/P but NOT in personality mode:
+- **Consequence-based routing** (vs complexity-based) — a genuinely novel reframing
+- **Prompt compression via LLMLingua** — specific tooling recommendation
+- **Constrained decoding / json_schema enforcement** as a cost lever
+- **Embedding classifier replacement** for classification tasks
+- **Concrete sequencing timeline** (week 1 → month 3) from meta-arbiter
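To make the consequence-vs-complexity distinction concrete, here is a minimal sketch of a router that considers both signals. The `Query` fields, the 0 to 1 scales, the cutoffs, and the `"small"`/`"large"` model labels are hypothetical; the run results do not describe a specific implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Query:
    text: str
    complexity: float   # 0..1, estimated from input features (assumed scale)
    consequence: float  # 0..1, estimated cost of a wrong answer (assumed scale)


def route_by_complexity(query: Query, cutoff: float = 0.5) -> str:
    """Baseline: route on input complexity only."""
    return "large" if query.complexity >= cutoff else "small"


def route_unified(query: Query, complexity_cutoff: float = 0.5,
                  consequence_cutoff: float = 0.5) -> str:
    """Send a query to the small model only if it is both simple and low-stakes."""
    if query.consequence >= consequence_cutoff:
        return "large"
    if query.complexity >= complexity_cutoff:
        return "large"
    return "small"


# A short, simple-looking query whose answer carries high downstream impact.
dosage = Query("max daily dose of drug X?", complexity=0.1, consequence=0.9)
smalltalk = Query("suggest a team lunch spot", complexity=0.1, consequence=0.1)

print(route_by_complexity(dosage))  # complexity alone picks the small model
print(route_unified(dosage))        # consequence overrides and picks the large model
print(route_unified(smalltalk))
```

The first query shows the gap: a complexity-only router sends a simple but high-stakes request to the cheaper model, while the unified router does not.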
+
+Ideas that appeared in personality mode but NOT (or weakly) in D/P:
+- **"Tail-case invisibility"** — the Skeptic's insight about quality degradation being invisible in aggregate metrics
+- **Speculative decoding** — the Visionary's bet on draft-model verification
+- **Neuromorphic hardware** — longer-term framing
+- **"Treat 'without quality loss' as hypothesis to falsify"** — epistemological reframing of the entire question
+
+---
+
+## When to Use Which
+
+| Use Case | Recommended Mode |
+|----------|-----------------|
+| "Should we do X?" (opinion/judgment) | Personality |
+| "How should we solve X?" (approaches) | D/P |
+| Quick brainstorm, fast turnaround | Personality |
+| Technical design with scoring/ranking | D/P |
+| Need adversarial challenge / devil's advocate | Personality |
+| Need complementary ideas from different optimization lenses | D/P |
+| User wants a narrative they can read | Personality |
+| User wants a decision matrix they can act on | D/P |
+
+---
+
+## Possible Improvements
+
+1. **Hybrid mode**: Run D/P for ideation, then pass results to a Skeptic advisor for adversarial review before the meta-arbiter merges. Gets both structured ideas AND adversarial tension.
+2. **Add a Skeptic role to D/P**: A third "adversarial evaluator" alongside the two arbiters who specifically looks for failure modes, hidden assumptions, and tail risks.
+3. **Multi-round D/P with bridge packets**: The arbiter "asks" are a natural bridge — running a second round where each group addresses the other's asks would likely improve both shortlists.
+4. **Unified output format**: Both modes should produce a comparable final document. Currently personality mode gives a narrative and D/P gives a structured report — hard to compare directly.
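The hybrid mode in item 1 could be orchestrated roughly as follows. This is a sketch under stated assumptions: `Agent` is a hypothetical callable standing in for a subagent call, the stub agents only echo their input, and feeding arbiter shortlists (rather than raw ideas) to the Skeptic is one possible reading of the proposal.

```python
from typing import Callable, Dict

# Hypothetical stand-in for a subagent call: takes a prompt, returns text.
Agent = Callable[[str], str]


def run_hybrid(topic: str,
               freethinkers: Dict[str, Agent],
               arbiters: Dict[str, Agent],
               skeptic: Agent,
               meta_arbiter: Agent) -> str:
    """D/P ideation, then adversarial review, then the cross-group merge."""
    shortlists = []
    for group, freethinker in freethinkers.items():
        ideas = freethinker(topic)                 # ideation per group
        shortlists.append(arbiters[group](ideas))  # scored shortlist per group
    combined = "\n".join(shortlists)
    critique = skeptic(combined)                   # adversarial review before merging
    return meta_arbiter(combined + "\n" + critique)


def stub(name: str) -> Agent:
    """Echo agent that wraps its input, making the call order visible."""
    return lambda text: f"{name}({text})"


result = run_hybrid(
    "reduce inference cost",
    freethinkers={"D": stub("D-free"), "P": stub("P-free")},
    arbiters={"D": stub("D-arb"), "P": stub("P-arb")},
    skeptic=stub("skeptic"),
    meta_arbiter=stub("meta"),
)
print(result)
```

With real subagents, this adds one Skeptic call to the five existing D/P calls, so the extra adversarial pass is not free.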