# Council Mode Comparison: Same Topic

**Topic**: Best approach to reduce LLM inference costs by 50% without quality loss?

**Date**: 2026-03-05

**Both runs**: Tier light (Sonnet 4.6 for all subagents), parallel flow, single round

---

## Structural Comparison

| Dimension | Personality Mode | D/P Mode |
|-----------|-----------------|----------|
| Subagent calls | 4 (3 advisors + 1 referee) | 5 (2 freethinkers + 2 arbiters + 1 meta-arbiter) |
| Total runtime | ~75s | ~3.5min |
| Approximate tokens | ~62k | ~77k |
| Output structure | Opinions → Synthesis | Ideas → Scored shortlists → Cross-group merge |
| Diversity source | Personality lenses (how they think) | Cognitive style (what they optimize for) |
| Final output | Sequenced recommendation with tensions | Selected ideas with merges, rejections, experiments |

---

## What Each Mode Produced

### Personality Mode Strengths

- **The Skeptic's "tail-case invisibility" insight** was the sharpest single contribution across both runs. The concept: quality degradation from routing/quantization hits rare, high-stakes queries hardest, exactly where benchmarks don't measure and where damage is most consequential. This reframing changed the referee's entire recommendation sequence (instrument before routing).
- **Cleaner narrative arc**: Three perspectives → tensions → sequenced verdict. Easier to read and act on.
- **The Visionary pushed scope**: "50% is too conservative, architect for 10x" is a useful provocation even if the referee didn't fully adopt it. It ensured long-term options weren't ignored.
- **Faster and cheaper**: 4 subagent calls vs 5, simpler orchestration.

### D/P Mode Strengths

- **More concrete ideas**: 8 distinct proposals (4 per group) vs 3 position papers. D/P produced actionable workstreams, not just perspectives.
- **Scoring and filtering**: Arbiters scored every idea on novelty/feasibility/impact/testability and made explicit shortlist/hold/reject decisions.
  This structured evaluation doesn't exist in personality mode.
- **Cross-group merges were genuinely valuable**: The meta-arbiter identified 3 productive merges that neither group proposed alone:
  - Unified complexity × consequence routing (D1 + P2)
  - Stacked cache layers at different architectural levels (D2 + P1)
  - Combined prompt hygiene sprint (D3 + P1)
- **Asks between groups surfaced gaps**: D-Arbiter asked P for non-obvious classifier signals; P-Arbiter asked D for concrete consequence labeling schemes. This cross-pollination wouldn't happen in personality mode without multi-round debate.
- **Convergence signals**: Arbiters explicitly rated whether their group had found its best ideas (D: yes, P: no), which could inform whether to run another round.
- **The "consequence vs complexity" distinction** (P2) was a more novel framing than anything in the personality run. Routing on downstream impact rather than input features is a genuinely different approach.

### D/P Mode Weaknesses

- **No adversarial tension**: Neither group questioned "is 50% even the right goal?" or "will this actually work in production?" The D/P structure generates *complementary* ideas, not *opposing* ones. There's no built-in skeptic.
- **Repetition across groups**: Both groups independently proposed model routing and semantic caching. The meta-arbiter had to merge rather than synthesize genuinely different territory.
- **More expensive and slower**: ~25% more tokens, ~3x longer wall time.
- **Harder to read**: The output is a spreadsheet, not a story. Good for structured decision-making, harder for a human to quickly grok.

### Personality Mode Weaknesses

- **Thin on specifics**: The Pragmatist said "build a query router" but didn't propose concrete approaches. The D-Freethinker produced 4 specific router designs.
- **No scoring or prioritization**: The referee synthesized qualitatively but didn't score or rank. You get a narrative, not a decision matrix.
- **No cross-pollination mechanism**: Advisors don't build on each other in single-round. Would need multi-round debate (at higher cost) to get the interaction that D/P gets structurally.

---

## Key Insight Differences

Ideas that appeared in D/P but NOT in personality mode:

- **Consequence-based routing** (vs complexity-based): a genuinely novel reframing
- **Prompt compression via LLMLingua**: specific tooling recommendation
- **Constrained decoding / json_schema enforcement** as a cost lever
- **Embedding classifier replacement** for classification tasks
- **Concrete sequencing timeline** (week 1 → month 3) from meta-arbiter

Ideas that appeared in personality mode but NOT (or weakly) in D/P:

- **"Tail-case invisibility"**: the Skeptic's insight about quality degradation being invisible in aggregate metrics
- **Speculative decoding**: the Visionary's bet on draft-model verification
- **Neuromorphic hardware**: longer-term framing
- **"Treat 'without quality loss' as hypothesis to falsify"**: epistemological reframing of the entire question

---

## When to Use Which

| Use Case | Recommended Mode |
|----------|-----------------|
| "Should we do X?" (opinion/judgment) | Personality |
| "How should we solve X?" (approaches) | D/P |
| Quick brainstorm, fast turnaround | Personality |
| Technical design with scoring/ranking | D/P |
| Need adversarial challenge / devil's advocate | Personality |
| Need complementary ideas from different optimization lenses | D/P |
| User wants a narrative they can read | Personality |
| User wants a decision matrix they can act on | D/P |

---

## Possible Improvements

1. **Hybrid mode**: Run D/P for ideation, then pass results to a Skeptic advisor for adversarial review before the meta-arbiter merges. Gets both structured ideas AND adversarial tension.
2. **Add a Skeptic role to D/P**: A third "adversarial evaluator" alongside the two arbiters who specifically looks for failure modes, hidden assumptions, and tail risks.
3. **Multi-round D/P with bridge packets**: The arbiter "asks" are a natural bridge. Running a second round where each group addresses the other's asks would likely improve both shortlists.
4. **Unified output format**: Both modes should produce a comparable final document. Currently personality mode gives a narrative and D/P gives a structured report, which makes them hard to compare directly.
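
The hybrid idea in improvement 1 (arbiter scoring followed by a Skeptic pass before the merge) could be sketched roughly as below. This is a minimal illustration, not the council's actual implementation: the `Idea` class, the 1-5 scale, the thresholds of 15 and 11, the "any objection demotes to hold" rule, and the three placeholder ideas with their scores are all assumptions made up for the example. Only the four scoring criteria and the shortlist/hold/reject outcomes come from the D/P run described above.

```python
from dataclasses import dataclass, field
from enum import Enum


class Decision(Enum):
    SHORTLIST = "shortlist"
    HOLD = "hold"
    REJECT = "reject"


@dataclass
class Idea:
    """One freethinker proposal, scored by an arbiter (assumed 1-5 scale)."""
    name: str
    novelty: int
    feasibility: int
    impact: int
    testability: int
    # Failure modes or tail risks raised by the hypothetical Skeptic reviewer.
    objections: list[str] = field(default_factory=list)

    @property
    def total(self) -> int:
        return self.novelty + self.feasibility + self.impact + self.testability


def arbiter_decision(idea: Idea, shortlist_at: int = 15, hold_at: int = 11) -> Decision:
    """Map a total score to a decision; both thresholds are assumptions."""
    if idea.total >= shortlist_at:
        return Decision.SHORTLIST
    if idea.total >= hold_at:
        return Decision.HOLD
    return Decision.REJECT


def skeptic_review(idea: Idea, decision: Decision) -> Decision:
    """Adversarial pass: a shortlisted idea with open objections drops to HOLD."""
    if decision is Decision.SHORTLIST and idea.objections:
        return Decision.HOLD
    return decision


def hybrid_pipeline(ideas: list[Idea]) -> dict[str, Decision]:
    """Score first, challenge second; the result would feed the meta-arbiter."""
    return {i.name: skeptic_review(i, arbiter_decision(i)) for i in ideas}


# Placeholder ideas with invented scores, not the arbiters' real numbers.
ideas = [
    Idea("idea-A", novelty=5, feasibility=3, impact=4, testability=4),
    Idea("idea-B", novelty=2, feasibility=5, impact=4, testability=5,
         objections=["quality loss on rare, high-stakes queries"]),
    Idea("idea-C", novelty=5, feasibility=1, impact=2, testability=1),
]
results = hybrid_pipeline(ideas)
for name, decision in results.items():
    print(f"{name}: {decision.value}")
```

Placing the Skeptic after scoring rather than before keeps the arbiters' numbers untouched, so a demotion is always traceable to a named objection rather than to a lower score.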