swarm-zap/memory/council-runs/2026-03-05-mode-comparison.md

# Council Mode Comparison — Same Topic

**Topic**: Best approach to reduce LLM inference costs by 50% without quality loss?
**Date**: 2026-03-05
**Both runs**: Tier light (Sonnet 4.6 for all subagents), parallel flow, single round

---

## Structural Comparison

| Dimension | Personality Mode | D/P Mode |
|-----------|-----------------|----------|
| Subagent calls | 4 (3 advisors + 1 referee) | 5 (2 freethinkers + 2 arbiters + 1 meta-arbiter) |
| Total runtime | ~75s | ~3.5min |
| Approximate tokens | ~62k | ~77k |
| Output structure | Opinions → Synthesis | Ideas → Scored shortlists → Cross-group merge |
| Diversity source | Personality lenses (how they think) | Cognitive style (what they optimize for) |
| Final output | Sequenced recommendation with tensions | Selected ideas with merges, rejections, experiments |
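
As a quick check on the overhead, the ratios can be worked out directly from the approximate figures in the table. This is only arithmetic on those rounded numbers; the dictionary keys are illustrative names, not fields from any council tooling.

```python
# Figures copied from the table above; both columns are approximate.
personality = {"calls": 4, "runtime_s": 75, "tokens": 62_000}
dp = {"calls": 5, "runtime_s": 3.5 * 60, "tokens": 77_000}


def ratio(metric: str) -> float:
    """Return the D/P figure divided by the personality figure."""
    return dp[metric] / personality[metric]


token_overhead_pct = (ratio("tokens") - 1) * 100
runtime_multiple = ratio("runtime_s")
tokens_per_call = {
    "personality": personality["tokens"] / personality["calls"],
    "dp": dp["tokens"] / dp["calls"],
}

print(f"tokens: +{token_overhead_pct:.0f}%")   # tokens: +24%
print(f"runtime: {runtime_multiple:.1f}x")     # runtime: 2.8x
print(tokens_per_call)
```

On these figures the tokens per subagent call are nearly identical in both modes (about 15.5k vs 15.4k), so the extra token cost of D/P comes almost entirely from the fifth call. The runtime gap is much larger than the token gap, which suggests it comes from orchestration rather than from generation volume.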

---

## What Each Mode Produced

### Personality Mode Strengths
- **The Skeptic's "tail-case invisibility" insight** was the sharpest single contribution across both runs. The concept: quality degradation from routing/quantization hits rare, high-stakes queries hardest — exactly where benchmarks don't measure and where damage is most consequential. This reframing changed the referee's entire recommendation sequence (instrument before routing).
- **Cleaner narrative arc**: Three perspectives → tensions → sequenced verdict. Easier to read and act on.
- **The Visionary pushed scope**: "50% is too conservative, architect for 10x" is a useful provocation even if the referee didn't fully adopt it. It ensured long-term options weren't ignored.
- **Faster and cheaper**: 4 subagent calls vs 5, simpler orchestration.

### D/P Mode Strengths
- **More concrete ideas**: 8 proposals (4 per group, with some overlap between groups) vs 3 position papers. D/P produced actionable workstreams, not just perspectives.
- **Scoring and filtering**: Arbiters scored every idea on novelty/feasibility/impact/testability and made explicit shortlist/hold/reject decisions. This structured evaluation doesn't exist in personality mode.
- **Cross-group merges were genuinely valuable**: The meta-arbiter identified 3 productive merges that neither group proposed alone:
  - Unified complexity × consequence routing (D1 + P2)
  - Stacked cache layers at different architectural levels (D2 + P1)
  - Combined prompt hygiene sprint (D3 + P1)
- **Asks between groups surfaced gaps**: D-Arbiter asked P for non-obvious classifier signals; P-Arbiter asked D for concrete consequence labeling schemes. This cross-pollination wouldn't happen in personality mode without multi-round debate.
- **Convergence signals**: Arbiters explicitly rated whether their group had found its best ideas (D: yes, P: no), which could inform whether to run another round.
- **The "consequence vs complexity" distinction** (P2) was a more novel framing than anything in the personality run. Routing on downstream impact rather than input features is a genuinely different approach.
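
The arbiter scoring step described above could look roughly like the following. This is a minimal sketch, not the arbiters' actual procedure: the four criteria come from the run, but the 1-5 scale, the unweighted mean, the two thresholds, and the example ideas are all assumptions made for illustration.

```python
from dataclasses import dataclass

# The four criteria named above; the 1-5 scale is an assumption.
CRITERIA = ("novelty", "feasibility", "impact", "testability")


@dataclass
class Idea:
    label: str
    scores: dict[str, int]  # criterion -> score on the assumed 1-5 scale

    def mean_score(self) -> float:
        return sum(self.scores[c] for c in CRITERIA) / len(CRITERIA)


def decide(idea: Idea, shortlist_at: float = 3.5, reject_below: float = 2.5) -> str:
    """Map a mean score to shortlist / hold / reject (thresholds are hypothetical)."""
    mean = idea.mean_score()
    if mean >= shortlist_at:
        return "shortlist"
    if mean < reject_below:
        return "reject"
    return "hold"


# Placeholder ideas with invented scores, not taken from the actual run.
examples = [
    Idea("example-a", dict(zip(CRITERIA, (4, 4, 4, 3)))),
    Idea("example-b", dict(zip(CRITERIA, (3, 3, 3, 3)))),
    Idea("example-c", dict(zip(CRITERIA, (2, 2, 3, 2)))),
]
for idea in examples:
    print(idea.label, idea.mean_score(), decide(idea))
```

An explicit rule like this is what makes the D/P output comparable across groups: two arbiters applying the same thresholds produce shortlists the meta-arbiter can merge without re-scoring.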

### D/P Mode Weaknesses
- **No adversarial tension**: Neither group questioned "is 50% even the right goal?" or "will this actually work in production?" The D/P structure generates *complementary* ideas, not *opposing* ones. There's no built-in skeptic.
- **Repetition across groups**: Both groups independently proposed model routing and semantic caching, so the meta-arbiter spent much of its effort merging overlapping proposals rather than synthesizing genuinely different territory.
- **More expensive and slower**: ~25% more tokens (~77k vs ~62k) and ~3x the wall time (~3.5min vs ~75s).
- **Harder to read**: The output is a spreadsheet, not a story. Good for structured decision-making, harder for a human to quickly grok.

### Personality Mode Weaknesses
- **Thin on specifics**: The Pragmatist said "build a query router" but didn't propose concrete approaches. The D-Freethinker produced 4 specific router designs.
- **No scoring or prioritization**: The referee synthesized qualitatively but didn't score or rank. You get a narrative, not a decision matrix.
- **No cross-pollination mechanism**: Advisors don't build on each other in a single round. Personality mode would need multi-round debate (at higher cost) to get the interaction that D/P gets structurally.

---

## Key Insight Differences

Ideas that appeared in D/P but NOT in personality mode:
- **Consequence-based routing** (vs complexity-based) — a genuinely novel reframing
- **Prompt compression via LLMLingua** — specific tooling recommendation
- **Constrained decoding / json_schema enforcement** as a cost lever
- **Embedding classifier replacement** for classification tasks
- **Concrete sequencing timeline** (week 1 → month 3) from meta-arbiter
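
To make the consequence-vs-complexity distinction concrete, here is a minimal routing sketch. It assumes both signals are already available as scores in [0, 1]; the tier names, the 0.5 threshold, and the rule that high consequence always wins are hypothetical choices, not the design proposed in the run.

```python
def choose_tier(complexity: float, consequence: float, threshold: float = 0.5) -> str:
    """Pick a model tier from two scores in [0, 1].

    complexity  -- how hard the input looks (an input feature)
    consequence -- how costly a wrong answer would be downstream
    Tier names and the threshold are illustrative assumptions.
    """
    if not (0.0 <= complexity <= 1.0 and 0.0 <= consequence <= 1.0):
        raise ValueError("scores must be in [0, 1]")
    if consequence >= threshold:
        # High-stakes queries are never downgraded, however simple they look.
        return "large"
    if complexity >= threshold:
        # Hard but low-stakes: a mid tier is an acceptable trade.
        return "medium"
    return "small"


print(choose_tier(complexity=0.1, consequence=0.9))  # large
print(choose_tier(complexity=0.9, consequence=0.1))  # medium
print(choose_tier(complexity=0.1, consequence=0.1))  # small
```

A complexity-only router would send the first example to the cheapest model. The second axis exists to catch exactly that case.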

Ideas that appeared in personality mode but NOT (or only weakly) in D/P:
- **"Tail-case invisibility"** — the Skeptic's insight about quality degradation being invisible in aggregate metrics
- **Speculative decoding** — the Visionary's bet on draft-model verification
- **Neuromorphic hardware** — longer-term framing
- **"Treat 'without quality loss' as hypothesis to falsify"** — epistemological reframing of the entire question
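
The "tail-case invisibility" point can be illustrated with a small worked example. The traffic shares and quality scores below are invented purely to show the effect; they are not measurements from either run.

```python
# Hypothetical traffic mix and quality scores, chosen only to illustrate the effect.
segments = {
    "common": {"share": 0.98, "before": 0.95, "after": 0.95},
    "rare_high_stakes": {"share": 0.02, "before": 0.95, "after": 0.70},
}


def aggregate(key: str) -> float:
    """Traffic-weighted mean quality across all segments."""
    return sum(s["share"] * s[key] for s in segments.values())


aggregate_drop = aggregate("before") - aggregate("after")
tail = segments["rare_high_stakes"]
tail_drop = tail["before"] - tail["after"]

print(f"aggregate quality drop: {aggregate_drop:.3f}")
print(f"rare-segment quality drop: {tail_drop:.2f}")
```

With these assumed numbers the aggregate metric moves by about half a point while the rare segment loses 25 points. A dashboard that tracks only the weighted mean would report the change as essentially free, which is why the referee's sequence put instrumentation before routing.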

---

## When to Use Which

| Use Case | Recommended Mode |
|----------|-----------------|
| "Should we do X?" (opinion/judgment) | Personality |
| "How should we solve X?" (approaches) | D/P |
| Quick brainstorm, fast turnaround | Personality |
| Technical design with scoring/ranking | D/P |
| Need adversarial challenge / devil's advocate | Personality |
| Need complementary ideas from different optimization lenses | D/P |
| User wants a narrative they can read | Personality |
| User wants a decision matrix they can act on | D/P |
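
The table can also be read as a simple lookup. The sketch below encodes each row under a short key and takes a majority vote when a request matches several rows; the key names and the tie handling are assumptions added for illustration.

```python
from collections import Counter

PERSONALITY, DP = "Personality", "D/P"

# One entry per table row; the short keys are hypothetical labels.
RECOMMENDED = {
    "judgment": PERSONALITY,
    "approaches": DP,
    "fast_turnaround": PERSONALITY,
    "scoring": DP,
    "adversarial": PERSONALITY,
    "complementary_ideas": DP,
    "narrative": PERSONALITY,
    "decision_matrix": DP,
}


def recommend(needs: list[str]) -> str:
    """Return the mode most rows point to, or "tie" when the rows split evenly."""
    votes = Counter(RECOMMENDED[n] for n in needs)
    if votes[PERSONALITY] == votes[DP]:
        return "tie"
    return votes.most_common(1)[0][0]


print(recommend(["judgment", "adversarial"]))              # Personality
print(recommend(["approaches", "scoring", "narrative"]))   # D/P
print(recommend(["scoring", "narrative"]))                 # tie
```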

---

## Possible Improvements

1. **Hybrid mode**: Run D/P for ideation, then pass results to a Skeptic advisor for adversarial review before the meta-arbiter merges. Gets both structured ideas AND adversarial tension.
2. **Add a Skeptic role to D/P**: A third "adversarial evaluator" alongside the two arbiters who specifically looks for failure modes, hidden assumptions, and tail risks.
3. **Multi-round D/P with bridge packets**: The arbiter "asks" are a natural bridge — running a second round where each group addresses the other's asks would likely improve both shortlists.
4. **Unified output format**: Both modes should produce a comparable final document. Currently personality mode gives a narrative and D/P gives a structured report — hard to compare directly.
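
The hybrid mode in item 1 could be wired roughly as follows. This is a sketch under stated assumptions: `Agent` is a hypothetical prompt-in, text-out callable standing in for a subagent call, the groups run sequentially here for simplicity (the actual runs used a parallel flow), and the prompt wording is a placeholder.

```python
from typing import Callable

# Hypothetical interface: a subagent takes a prompt and returns text.
Agent = Callable[[str], str]


def run_hybrid(
    topic: str,
    freethinkers: list[Agent],
    arbiters: list[Agent],
    skeptic: Agent,
    meta_arbiter: Agent,
) -> str:
    """D/P ideation, then a Skeptic review, then the meta-arbiter merge."""
    shortlists = []
    for think, arbitrate in zip(freethinkers, arbiters):
        ideas = think(topic)                 # group ideation
        shortlists.append(arbitrate(ideas))  # group scoring and shortlist
    combined = "\n\n".join(shortlists)
    critique = skeptic(combined)             # adversarial pass over both shortlists
    # The meta-arbiter sees the shortlists and the critique together.
    return meta_arbiter(f"{combined}\n\nSkeptic review:\n{critique}")
```

Placing the Skeptic before the merge means the meta-arbiter can reject or amend ideas in light of the critique, at the cost of one extra subagent call (6 instead of 5).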