- D/P run: 5 subagents, ~77k tokens, produced scored shortlists + merges
- Personality run: 4 subagents, ~62k tokens, produced narrative + verdict
- Comparison: D/P better for concrete ideas/scoring, personality better for adversarial tension/narrative
- Key finding: D/P lacks built-in skeptic, personality lacks structured scoring
- Proposed improvement: hybrid mode combining both strengths
Council Mode Comparison — Same Topic
Topic: Best approach to reduce LLM inference costs by 50% without quality loss?

Date: 2026-03-05

Both runs: Tier light (Sonnet 4.6 for all subagents), parallel flow, single round
Structural Comparison
| Dimension | Personality Mode | D/P Mode |
|---|---|---|
| Subagent calls | 4 (3 advisors + 1 referee) | 5 (2 freethinkers + 2 arbiters + 1 meta-arbiter) |
| Total runtime | ~75s | ~3.5min |
| Approximate tokens | ~62k | ~77k |
| Output structure | Opinions → Synthesis | Ideas → Scored shortlists → Cross-group merge |
| Diversity source | Personality lenses (how they think) | Cognitive style (what they optimize for) |
| Final output | Sequenced recommendation with tensions | Selected ideas with merges, rejections, experiments |
What Each Mode Produced
Personality Mode Strengths
- The Skeptic's "tail-case invisibility" insight was the sharpest single contribution across both runs. The concept: quality degradation from routing/quantization hits rare, high-stakes queries hardest — exactly where benchmarks don't measure and where damage is most consequential. This reframing changed the referee's entire recommendation sequence (instrument before routing).
- Cleaner narrative arc: Three perspectives → tensions → sequenced verdict. Easier to read and act on.
- The Visionary pushed scope: "50% is too conservative, architect for 10x" is a useful provocation even if the referee didn't fully adopt it. It ensured long-term options weren't ignored.
- Faster and cheaper: 4 subagent calls vs 5, simpler orchestration.
D/P Mode Strengths
- More concrete ideas: 8 proposals (4 per group, with some overlap between groups) vs 3 position papers. D/P produced actionable workstreams, not just perspectives.
- Scoring and filtering: Arbiters scored every idea on novelty/feasibility/impact/testability and made explicit shortlist/hold/reject decisions. This structured evaluation doesn't exist in personality mode.
- Cross-group merges were genuinely valuable: The meta-arbiter identified 3 productive merges that neither group proposed alone:
  - Unified complexity × consequence routing (D1 + P2)
  - Stacked cache layers at different architectural levels (D2 + P1)
  - Combined prompt hygiene sprint (D3 + P1)
- Asks between groups surfaced gaps: D-Arbiter asked P for non-obvious classifier signals; P-Arbiter asked D for concrete consequence labeling schemes. This cross-pollination wouldn't happen in personality mode without multi-round debate.
- Convergence signals: Arbiters explicitly rated whether their group had found its best ideas (D: yes, P: no), which could inform whether to run another round.
- The "consequence vs complexity" distinction (P2) was a more novel framing than anything in the personality run. Routing on downstream impact rather than input features is a genuinely different approach.
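The scoring-and-filtering step described above could be represented roughly as follows. This is a minimal sketch, not the arbiters' actual rubric: the 1-5 scale, the unweighted mean, the two thresholds, and the idea labels and scores are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Hypothetical cutoffs: the source does not say how scores map to decisions.
SHORTLIST_MIN = 3.5
HOLD_MIN = 2.5


@dataclass(frozen=True)
class ScoredIdea:
    """One idea scored on the four criteria the arbiters used (assumed 1-5 scale)."""
    idea_id: str
    novelty: int
    feasibility: int
    impact: int
    testability: int

    def mean_score(self) -> float:
        # Unweighted mean; a real rubric might weight criteria differently.
        return (self.novelty + self.feasibility + self.impact + self.testability) / 4


def decide(idea: ScoredIdea) -> str:
    """Map a scored idea to shortlist / hold / reject."""
    score = idea.mean_score()
    if score >= SHORTLIST_MIN:
        return "shortlist"
    if score >= HOLD_MIN:
        return "hold"
    return "reject"


# Illustrative scores only; these are not the scores from the actual run.
ideas = [
    ScoredIdea("A", novelty=3, feasibility=5, impact=4, testability=4),
    ScoredIdea("B", novelty=5, feasibility=2, impact=4, testability=2),
    ScoredIdea("C", novelty=2, feasibility=2, impact=2, testability=2),
]
decisions = {idea.idea_id: decide(idea) for idea in ideas}
print(decisions)
```

Keeping the per-criterion scores alongside the decision is what makes the D/P output comparable across groups, which the personality referee's qualitative synthesis does not offer.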
D/P Mode Weaknesses
- No adversarial tension: Neither group questioned "is 50% even the right goal?" or "will this actually work in production?" The D/P structure generates complementary ideas, not opposing ones. There's no built-in skeptic.
- Repetition across groups: Both groups independently proposed model routing and semantic caching. The meta-arbiter had to merge rather than synthesize genuinely different territory.
- More expensive and slower: ~25% more tokens (~77k vs ~62k) and roughly 3x the wall time (~3.5min vs ~75s).
- Harder to read: The output is a spreadsheet, not a story. Good for structured decision-making, harder for a human to quickly grok.
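The cost gap quoted above follows directly from the figures in the Structural Comparison table. A quick check, using only those approximate numbers:

```python
# Approximate figures from the Structural Comparison table.
personality = {"tokens": 62_000, "seconds": 75}
dp = {"tokens": 77_000, "seconds": 3.5 * 60}

token_overhead = dp["tokens"] / personality["tokens"] - 1   # extra tokens as a fraction
runtime_ratio = dp["seconds"] / personality["seconds"]      # D/P wall time relative to personality

print(f"D/P token overhead: {token_overhead:.0%}")
print(f"D/P runtime ratio: {runtime_ratio:.1f}x")
```

Since both inputs are rounded estimates, the results are only good to about one significant figure.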
Personality Mode Weaknesses
- Thin on specifics: The Pragmatist said "build a query router" but didn't propose concrete approaches. The D-Freethinker produced 4 specific router designs.
- No scoring or prioritization: The referee synthesized qualitatively but didn't score or rank. You get a narrative, not a decision matrix.
- No cross-pollination mechanism: Advisors don't build on each other in a single round. Personality mode would need multi-round debate (at higher cost) to get the interaction that D/P gets structurally.
Key Insight Differences
Ideas that appeared in D/P but NOT in personality mode:
- Consequence-based routing (vs complexity-based) — a genuinely novel reframing
- Prompt compression via LLMLingua — specific tooling recommendation
- Constrained decoding / json_schema enforcement as a cost lever
- Embedding classifier replacement for classification tasks
- Concrete sequencing timeline (week 1 → month 3) from meta-arbiter
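To make the consequence-based routing idea concrete, here is a minimal sketch of how it might combine with complexity-based routing (the D1 + P2 merge). Everything specific here is an assumption: the two scores are taken to be precomputed values in [0, 1], the cutoffs are arbitrary, and the tier names are placeholders rather than real model identifiers.

```python
from enum import Enum


class Tier(Enum):
    SMALL = "small-model"   # placeholder name for a cheaper model
    LARGE = "large-model"   # placeholder name for a more capable model


def route(complexity: float, consequence: float,
          complexity_cutoff: float = 0.6,
          consequence_cutoff: float = 0.5) -> Tier:
    """Pick a model tier from input complexity and downstream consequence.

    Both scores are assumed to be in [0, 1]. A query goes to the cheap tier
    only when it is both simple AND low-stakes; either signal alone is
    enough to escalate.
    """
    if not (0.0 <= complexity <= 1.0 and 0.0 <= consequence <= 1.0):
        raise ValueError("scores must be in [0, 1]")
    if consequence >= consequence_cutoff or complexity >= complexity_cutoff:
        return Tier.LARGE
    return Tier.SMALL


print(route(complexity=0.2, consequence=0.1))  # simple and low-stakes
print(route(complexity=0.2, consequence=0.9))  # simple but high-stakes
```

The second call is the case that pure complexity-based routing would get wrong: the input looks easy, so it would be sent to the cheap tier despite the high downstream impact.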
Ideas that appeared in personality mode but NOT (or weakly) in D/P:
- "Tail-case invisibility" — the Skeptic's insight about quality degradation being invisible in aggregate metrics
- Speculative decoding — the Visionary's bet on draft-model verification
- Neuromorphic hardware — longer-term framing
- "Treat 'without quality loss' as hypothesis to falsify" — epistemological reframing of the entire question
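The tail-case invisibility point can be illustrated with a little arithmetic. The traffic shares and quality scores below are invented for illustration and are not measurements from either run; the sketch only shows how a large drop on rare queries can nearly vanish in a traffic-weighted average.

```python
def aggregate_quality(segments):
    """Traffic-weighted mean quality over (traffic_share, quality) pairs."""
    total_share = sum(share for share, _ in segments)
    return sum(share * quality for share, quality in segments) / total_share


# Hypothetical numbers: 98% routine queries, 2% rare high-stakes queries.
before = [(0.98, 0.95), (0.02, 0.90)]
after = [(0.98, 0.95), (0.02, 0.50)]   # only the tail degrades after routing

aggregate_drop = aggregate_quality(before) - aggregate_quality(after)
tail_drop = before[1][1] - after[1][1]

print(f"aggregate quality drop: {aggregate_drop:.3f}")
print(f"tail quality drop:      {tail_drop:.3f}")
```

Under these made-up numbers the tail loses 0.4 of quality while the aggregate moves by less than 0.01, which is why the Skeptic argued for instrumenting tail cases before turning on routing.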
When to Use Which
| Use Case | Recommended Mode |
|---|---|
| "Should we do X?" (opinion/judgment) | Personality |
| "How should we solve X?" (approaches) | D/P |
| Quick brainstorm, fast turnaround | Personality |
| Technical design with scoring/ranking | D/P |
| Need adversarial challenge / devil's advocate | Personality |
| Need complementary ideas from different optimization lenses | D/P |
| User wants a narrative they can read | Personality |
| User wants a decision matrix they can act on | D/P |
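If mode selection were automated, the table above could be encoded as a simple lookup. This is a sketch only: the signal names are hypothetical labels for the table rows, and counting matches with a neutral result on ties is an assumed policy, not something either run tested.

```python
# Hypothetical signal names, one per row of the table above.
PERSONALITY_SIGNALS = {
    "judgment_call", "fast_turnaround", "adversarial_challenge", "narrative_output",
}
DP_SIGNALS = {
    "solution_approaches", "scoring_or_ranking", "complementary_lenses", "decision_matrix",
}


def recommend_mode(needs: set[str]) -> str:
    """Return 'personality', 'd/p', or 'either' based on which rows match."""
    unknown = needs - PERSONALITY_SIGNALS - DP_SIGNALS
    if unknown:
        raise ValueError(f"unknown signals: {sorted(unknown)}")
    personality_votes = len(needs & PERSONALITY_SIGNALS)
    dp_votes = len(needs & DP_SIGNALS)
    if personality_votes > dp_votes:
        return "personality"
    if dp_votes > personality_votes:
        return "d/p"
    return "either"  # tie: the table gives no preference


print(recommend_mode({"adversarial_challenge", "fast_turnaround"}))
print(recommend_mode({"scoring_or_ranking", "decision_matrix"}))
```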
Possible Improvements
- Hybrid mode: Run D/P for ideation, then pass results to a Skeptic advisor for adversarial review before the meta-arbiter merges. Gets both structured ideas AND adversarial tension.
- Add a Skeptic role to D/P: A third "adversarial evaluator" alongside the two arbiters who specifically looks for failure modes, hidden assumptions, and tail risks.
- Multi-round D/P with bridge packets: The arbiter "asks" are a natural bridge — running a second round where each group addresses the other's asks would likely improve both shortlists.
- Unified output format: Both modes should produce a comparable final document. Currently personality mode gives a narrative and D/P gives a structured report — hard to compare directly.
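The hybrid mode proposed above might be orchestrated along these lines. This is a sketch under stated assumptions: each subagent is modeled as a plain Python callable, the `Idea` shape and all function names are hypothetical, and the stubs at the bottom stand in for real subagent calls.

```python
from typing import Callable

Idea = dict[str, str]  # assumed shape: {"id": ..., "text": ...}
Freethinker = Callable[[str], list[Idea]]
Arbiter = Callable[[list[Idea]], list[Idea]]
Skeptic = Callable[[Idea], list[str]]
MetaArbiter = Callable[[list[Idea], dict[str, list[str]]], list[Idea]]


def run_hybrid(topic: str,
               groups: list[tuple[Freethinker, Arbiter]],
               skeptic: Skeptic,
               meta_arbiter: MetaArbiter) -> list[Idea]:
    """D/P ideation and shortlisting, then skeptic review, then the merge."""
    shortlisted: list[Idea] = []
    for freethinker, arbiter in groups:
        # Each group ideates, then its own arbiter filters the ideas.
        shortlisted.extend(arbiter(freethinker(topic)))
    # The skeptic reviews every shortlisted idea BEFORE the meta-arbiter merges.
    objections = {idea["id"]: skeptic(idea) for idea in shortlisted}
    return meta_arbiter(shortlisted, objections)


# Stub subagents for illustration; real ones would call a model.
def stub_freethinker(prefix: str) -> Freethinker:
    return lambda topic: [
        {"id": f"{prefix}{n}", "text": f"{prefix} idea {n} for {topic}"} for n in (1, 2)
    ]


keep_all: Arbiter = lambda ideas: ideas
flag_second: Skeptic = lambda idea: ["untested assumption"] if idea["id"].endswith("2") else []
drop_flagged: MetaArbiter = lambda ideas, objections: [
    idea for idea in ideas if not objections[idea["id"]]
]

result = run_hybrid(
    "cost reduction",
    [(stub_freethinker("d"), keep_all), (stub_freethinker("p"), keep_all)],
    flag_second,
    drop_flagged,
)
selected_ids = [idea["id"] for idea in result]
print(selected_ids)
```

Passing the objections to the meta-arbiter as a separate mapping, rather than having the skeptic filter ideas directly, keeps the adversarial review advisory: the meta-arbiter can still merge or keep a flagged idea if it judges the objection addressable.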