swarm-zap/memory/council-runs/2026-03-05-mode-comparison.md
zap c9fa2e1d95 docs(council): save D/P and personality run results + mode comparison
- D/P run: 5 subagents, ~77k tokens, produced scored shortlists + merges
- Personality run: 4 subagents, ~62k tokens, produced narrative + verdict
- Comparison: D/P better for concrete ideas/scoring, personality better for adversarial tension/narrative
- Key finding: D/P lacks built-in skeptic, personality lacks structured scoring
- Proposed improvement: hybrid mode combining both strengths
2026-03-05 19:44:34 +00:00


Council Mode Comparison — Same Topic

Topic: Best approach to reduce LLM inference costs by 50% without quality loss?

Date: 2026-03-05

Both runs: Tier light (Sonnet 4.6 for all subagents), parallel flow, single round


Structural Comparison

| Dimension | Personality Mode | D/P Mode |
|---|---|---|
| Subagent calls | 4 (3 advisors + 1 referee) | 5 (2 freethinkers + 2 arbiters + 1 meta-arbiter) |
| Total runtime | ~75s | ~3.5min |
| Approximate tokens | ~62k | ~77k |
| Output structure | Opinions → Synthesis | Ideas → Scored shortlists → Cross-group merge |
| Diversity source | Personality lenses (how they think) | Cognitive style (what they optimize for) |
| Final output | Sequenced recommendation with tensions | Selected ideas with merges, rejections, experiments |

What Each Mode Produced

Personality Mode Strengths

  • The Skeptic's "tail-case invisibility" insight was the sharpest single contribution across both runs. The concept: quality degradation from routing/quantization hits rare, high-stakes queries hardest — exactly where benchmarks don't measure and where damage is most consequential. This reframing changed the referee's entire recommendation sequence (instrument before routing).
  • Cleaner narrative arc: Three perspectives → tensions → sequenced verdict. Easier to read and act on.
  • The Visionary pushed scope: "50% is too conservative, architect for 10x" is a useful provocation even if the referee didn't fully adopt it. It ensured long-term options weren't ignored.
  • Faster and cheaper: 4 subagent calls vs 5, simpler orchestration.

D/P Mode Strengths

  • More concrete ideas: 8 proposals (4 per group, with some overlap between groups) vs 3 position papers. D/P produced actionable workstreams, not just perspectives.
  • Scoring and filtering: Arbiters scored every idea on novelty/feasibility/impact/testability and made explicit shortlist/hold/reject decisions. This structured evaluation doesn't exist in personality mode.
  • Cross-group merges were genuinely valuable: The meta-arbiter identified 3 productive merges that neither group proposed alone:
    • Unified complexity × consequence routing (D1 + P2)
    • Stacked cache layers at different architectural levels (D2 + P1)
    • Combined prompt hygiene sprint (D3 + P1)
  • Asks between groups surfaced gaps: D-Arbiter asked P for non-obvious classifier signals; P-Arbiter asked D for concrete consequence labeling schemes. This cross-pollination wouldn't happen in personality mode without multi-round debate.
  • Convergence signals: Arbiters explicitly rated whether their group had found its best ideas (D: yes, P: no), which could inform whether to run another round.
  • The "consequence vs complexity" distinction (P2) was a more novel framing than anything in the personality run. Routing on downstream impact rather than input features is a genuinely different approach.
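The arbiters' scoring step could be represented roughly as follows. This is a minimal sketch, assuming a 1-5 scale per criterion, an unweighted sum, and fixed cut-offs for shortlist/hold/reject; none of these specifics come from the run, and the idea names and scores are placeholders rather than actual results.

```python
from dataclasses import dataclass
from typing import Dict

# The four criteria named in the D/P run.
CRITERIA = ("novelty", "feasibility", "impact", "testability")


@dataclass
class ScoredIdea:
    idea_id: str
    scores: Dict[str, int]  # criterion -> score on an assumed 1-5 scale

    def total(self) -> int:
        # Unweighted sum; a real rubric might weight criteria differently.
        return sum(self.scores[c] for c in CRITERIA)


def decide(idea: ScoredIdea, shortlist_at: int = 15, hold_at: int = 11) -> str:
    """Map a total score to a decision. Both thresholds are hypothetical."""
    total = idea.total()
    if total >= shortlist_at:
        return "shortlist"
    if total >= hold_at:
        return "hold"
    return "reject"


# Placeholder ideas, not taken from the run.
ideas = [
    ScoredIdea("example-a", {"novelty": 4, "feasibility": 4, "impact": 5, "testability": 4}),
    ScoredIdea("example-b", {"novelty": 3, "feasibility": 3, "impact": 3, "testability": 3}),
    ScoredIdea("example-c", {"novelty": 2, "feasibility": 2, "impact": 3, "testability": 2}),
]

decisions = {idea.idea_id: decide(idea) for idea in ideas}
print(decisions)
```

Keeping the per-criterion scores alongside the decision would make it possible to compare shortlists across rounds or across groups.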

D/P Mode Weaknesses

  • No adversarial tension: Neither group questioned "is 50% even the right goal?" or "will this actually work in production?" The D/P structure generates complementary ideas, not opposing ones. There's no built-in skeptic.
  • Repetition across groups: Both groups independently proposed model routing and semantic caching. The meta-arbiter had to merge rather than synthesize genuinely different territory.
  • More expensive and slower: ~25% more tokens, ~3x longer wall time.
  • Harder to read: The output is a spreadsheet, not a story. Good for structured decision-making, harder for a human to quickly grok.
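The token and wall-time overhead of D/P relative to personality mode follows directly from the Structural Comparison figures. A quick check, using only those approximate numbers:

```python
# Approximate figures from the Structural Comparison section.
personality = {"tokens": 62_000, "runtime_s": 75}
dp = {"tokens": 77_000, "runtime_s": 3.5 * 60}

# Fractional extra tokens used by D/P.
token_overhead = dp["tokens"] / personality["tokens"] - 1
# How many times longer D/P took in wall-clock time.
runtime_ratio = dp["runtime_s"] / personality["runtime_s"]

print(f"token overhead: {token_overhead:.0%}")
print(f"runtime ratio: {runtime_ratio:.1f}x")
```

Since the inputs are themselves rounded, the results are only good to about one significant figure.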

Personality Mode Weaknesses

  • Thin on specifics: The Pragmatist said "build a query router" but didn't propose concrete approaches. The D-Freethinker produced 4 specific router designs.
  • No scoring or prioritization: The referee synthesized qualitatively but didn't score or rank. You get a narrative, not a decision matrix.
  • No cross-pollination mechanism: Advisors don't build on each other in a single round. It would take multi-round debate (at higher cost) to get the interaction that D/P gets structurally.

Key Insight Differences

Ideas that appeared in D/P but NOT in personality mode:

  • Consequence-based routing (vs complexity-based) — a genuinely novel reframing
  • Prompt compression via LLMLingua — specific tooling recommendation
  • Constrained decoding / json_schema enforcement as a cost lever
  • Embedding classifier replacement for classification tasks
  • Concrete sequencing timeline (week 1 → month 3) from meta-arbiter
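To make the consequence-based routing idea concrete, here is a minimal sketch of a router that looks at both complexity and consequence. The two-level labels, the tier names, and the rule itself are assumptions for illustration; the run did not specify a labeling scheme.

```python
from enum import Enum


class Level(Enum):
    LOW = 0
    HIGH = 1


def pick_tier(complexity: Level, consequence: Level) -> str:
    """Send a query to the cheap tier only when it is both simple and low-stakes.

    Tier names are hypothetical placeholders, not real model identifiers.
    """
    if complexity is Level.LOW and consequence is Level.LOW:
        return "small-model"
    return "large-model"


# A simple-looking query with high downstream impact still gets the large model,
# which is where this differs from routing on complexity alone.
print(pick_tier(Level.LOW, Level.LOW))
print(pick_tier(Level.LOW, Level.HIGH))
print(pick_tier(Level.HIGH, Level.LOW))
```

A complexity-only router would treat the first two calls identically; adding the consequence axis is what separates them.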

Ideas that appeared in personality mode but NOT (or weakly) in D/P:

  • "Tail-case invisibility" — the Skeptic's insight about quality degradation being invisible in aggregate metrics
  • Speculative decoding — the Visionary's bet on draft-model verification
  • Neuromorphic hardware — longer-term framing
  • "Treat 'without quality loss' as hypothesis to falsify" — epistemological reframing of the entire question
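The "tail-case invisibility" point can be illustrated with a small worked example. The counts below are synthetic, chosen only to show the effect, and are not measurements from either run: a large regression on a rare slice barely moves the aggregate number.

```python
# Synthetic counts of correct answers per slice, before and after a cost optimization.
slices = {
    "common": {"n": 980, "baseline": 931, "optimized": 938},
    "rare_high_stakes": {"n": 20, "baseline": 18, "optimized": 10},
}


def aggregate(arm: str) -> float:
    """Overall accuracy across all slices for the given arm."""
    correct = sum(s[arm] for s in slices.values())
    total = sum(s["n"] for s in slices.values())
    return correct / total


def slice_accuracy(name: str, arm: str) -> float:
    """Accuracy within a single slice for the given arm."""
    return slices[name][arm] / slices[name]["n"]


print(f"aggregate: {aggregate('baseline'):.1%} -> {aggregate('optimized'):.1%}")
for name in slices:
    before = slice_accuracy(name, "baseline")
    after = slice_accuracy(name, "optimized")
    print(f"{name}: {before:.1%} -> {after:.1%}")
```

With these numbers the aggregate shifts by about a tenth of a point while the rare slice drops from 90% to 50%, which is why per-slice instrumentation has to come before routing decisions.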

When to Use Which

| Use Case | Recommended Mode |
|---|---|
| "Should we do X?" (opinion/judgment) | Personality |
| "How should we solve X?" (approaches) | D/P |
| Quick brainstorm, fast turnaround | Personality |
| Technical design with scoring/ranking | D/P |
| Need adversarial challenge / devil's advocate | Personality |
| Need complementary ideas from different optimization lenses | D/P |
| User wants a narrative they can read | Personality |
| User wants a decision matrix they can act on | D/P |

Possible Improvements

  1. Hybrid mode: Run D/P for ideation, then pass results to a Skeptic advisor for adversarial review before the meta-arbiter merges. This gets both structured ideas and adversarial tension.
  2. Add a Skeptic role to D/P: A third "adversarial evaluator" alongside the two arbiters who specifically looks for failure modes, hidden assumptions, and tail risks.
  3. Multi-round D/P with bridge packets: The arbiter "asks" are a natural bridge — running a second round where each group addresses the other's asks would likely improve both shortlists.
  4. Unified output format: Both modes should produce a comparable final document. Currently personality mode gives a narrative and D/P gives a structured report — hard to compare directly.
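The hybrid mode in item 1 could be wired up along these lines. This is only a sketch of the control flow: the function names, the idea dictionary shape, and the stub stages are all hypothetical, and no real orchestration API is implied.

```python
from typing import Callable, Dict, List

# Hypothetical idea shape: {"id": ..., "text": ..., "objections": [...]}
Idea = Dict[str, object]


def run_hybrid(
    topic: str,
    ideate: Callable[[str], List[Idea]],
    skeptic_review: Callable[[Idea], List[str]],
    meta_merge: Callable[[List[Idea]], str],
) -> str:
    """D/P ideation, then a Skeptic pass, then the meta-arbiter merge."""
    ideas = ideate(topic)  # stands in for freethinkers + arbiters
    for idea in ideas:
        # Attach the Skeptic's objections so the merge step can see them.
        idea["objections"] = skeptic_review(idea)
    return meta_merge(ideas)


# Stub stages so the sketch runs without any model calls.
def fake_ideate(topic: str) -> List[Idea]:
    return [{"id": "idea-1", "text": f"placeholder idea for: {topic}"}]


def fake_skeptic(idea: Idea) -> List[str]:
    return [f"What fails in the tail for {idea['id']}?"]


def fake_merge(ideas: List[Idea]) -> str:
    return "; ".join(f"{i['id']} ({len(i['objections'])} objection(s))" for i in ideas)


report = run_hybrid("example topic", fake_ideate, fake_skeptic, fake_merge)
print(report)
```

Passing the stages in as callables keeps the sequencing separate from how each subagent is actually invoked, so the Skeptic step can be dropped or moved without touching the others.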