swarm-zap/memory/council-runs/2026-03-05-mode-comparison.md
zap c9fa2e1d95 docs(council): save D/P and personality run results + mode comparison
- D/P run: 5 subagents, ~77k tokens, produced scored shortlists + merges
- Personality run: 4 subagents, ~62k tokens, produced narrative + verdict
- Comparison: D/P better for concrete ideas/scoring, personality better for adversarial tension/narrative
- Key finding: D/P lacks built-in skeptic, personality lacks structured scoring
- Proposed improvement: hybrid mode combining both strengths
2026-03-05 19:44:34 +00:00


# Council Mode Comparison — Same Topic
**Topic**: Best approach to reduce LLM inference costs by 50% without quality loss?

**Date**: 2026-03-05

**Both runs**: Tier light (Sonnet 4.6 for all subagents), parallel flow, single round

---
## Structural Comparison
| Dimension | Personality Mode | D/P Mode |
|-----------|-----------------|----------|
| Subagent calls | 4 (3 advisors + 1 referee) | 5 (2 freethinkers + 2 arbiters + 1 meta-arbiter) |
| Total runtime | ~75s | ~3.5min |
| Approximate tokens | ~62k | ~77k |
| Output structure | Opinions → Synthesis | Ideas → Scored shortlists → Cross-group merge |
| Diversity source | Personality lenses (how they think) | Cognitive style (what they optimize for) |
| Final output | Sequenced recommendation with tensions | Selected ideas with merges, rejections, experiments |
---
## What Each Mode Produced
### Personality Mode Strengths
- **The Skeptic's "tail-case invisibility" insight** was the sharpest single contribution across both runs. The concept: quality degradation from routing/quantization hits rare, high-stakes queries hardest, and those are exactly the queries that benchmarks don't measure and where damage is most consequential. This reframing changed the referee's entire recommendation sequence (instrument before routing).
- **Cleaner narrative arc**: Three perspectives → tensions → sequenced verdict. Easier to read and act on.
- **The Visionary pushed scope**: "50% is too conservative, architect for 10x" is a useful provocation even if the referee didn't fully adopt it. It ensured long-term options weren't ignored.
- **Faster and cheaper**: 4 subagent calls vs 5 (~75s vs ~3.5min, ~62k vs ~77k tokens), with simpler orchestration.
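
The tail-case point can be made concrete with a little arithmetic. The sketch below uses invented numbers (the traffic split and the quality scores are assumptions for illustration, not measurements from either run) to show how a large quality drop on rare queries barely moves an aggregate metric.

```python
# Illustrative only: the traffic mix and quality scores are invented.
def mean_quality(segments):
    """Traffic-weighted mean quality over (query_count, quality) pairs."""
    total = sum(count for count, _ in segments)
    return sum(count * quality for count, quality in segments) / total

# Each entry: (query_count, quality score in [0, 1]).
# First segment = common queries, second = rare, high-stakes queries.
before = [(9900, 0.95), (100, 0.90)]
after = [(9900, 0.95), (100, 0.50)]  # after routing/quantization, only the tail degrades

aggregate_drop = mean_quality(before) - mean_quality(after)
tail_drop = before[1][1] - after[1][1]

print(f"aggregate drop: {aggregate_drop:.4f}")
print(f"tail drop:      {tail_drop:.2f}")
```

With these numbers the aggregate falls by about 0.004 while the rare segment falls by 0.40, which is why per-segment instrumentation has to exist before routing changes can be judged.
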
### D/P Mode Strengths
- **More concrete ideas**: 8 proposals (4 per group, with some overlap between groups) vs 3 position papers. D/P produced actionable workstreams, not just perspectives.
- **Scoring and filtering**: Arbiters scored every idea on novelty/feasibility/impact/testability and made explicit shortlist/hold/reject decisions. This structured evaluation doesn't exist in personality mode.
- **Cross-group merges were genuinely valuable**: The meta-arbiter identified 3 productive merges that neither group proposed alone:
  - Unified complexity × consequence routing (D1 + P2)
  - Stacked cache layers at different architectural levels (D2 + P1)
  - Combined prompt hygiene sprint (D3 + P1)
- **Asks between groups surfaced gaps**: D-Arbiter asked P for non-obvious classifier signals; P-Arbiter asked D for concrete consequence labeling schemes. This cross-pollination wouldn't happen in personality mode without multi-round debate.
- **Convergence signals**: Arbiters explicitly rated whether their group had found its best ideas (D: yes, P: no), which could inform whether to run another round.
- **The "consequence vs complexity" distinction** (P2) was a more novel framing than anything in the personality run. Routing on downstream impact rather than input features is a genuinely different approach.
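
The scoring-and-filtering step could be sketched as follows. The four criteria and the shortlist/hold/reject outcomes come from the run; the 1-5 scale, the equal weighting, the thresholds, and the example scores are all assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass
class ScoredIdea:
    label: str        # placeholder labels; not the ideas from the actual run
    novelty: int      # each criterion on an assumed 1-5 scale
    feasibility: int
    impact: int
    testability: int

    @property
    def total(self) -> int:
        # Equal weighting is an assumption; an arbiter may weight criteria differently.
        return self.novelty + self.feasibility + self.impact + self.testability

def decide(idea: ScoredIdea, shortlist_at: int = 15, hold_at: int = 11) -> str:
    """Map a total score to a decision. Both thresholds are hypothetical."""
    if idea.total >= shortlist_at:
        return "shortlist"
    if idea.total >= hold_at:
        return "hold"
    return "reject"

# Placeholder scores, not the scores from the actual run.
ideas = [
    ScoredIdea("idea-a", novelty=4, feasibility=4, impact=5, testability=4),
    ScoredIdea("idea-b", novelty=3, feasibility=3, impact=3, testability=3),
    ScoredIdea("idea-c", novelty=2, feasibility=2, impact=3, testability=2),
]
decisions = {idea.label: decide(idea) for idea in ideas}
print(decisions)
```

Keeping the per-criterion scores alongside the decision is what makes the output usable as a decision matrix rather than only a verdict.
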
### D/P Mode Weaknesses
- **No adversarial tension**: Neither group questioned "is 50% even the right goal?" or "will this actually work in production?" The D/P structure generates *complementary* ideas, not *opposing* ones. There's no built-in skeptic.
- **Repetition across groups**: Both groups independently proposed model routing and semantic caching. The meta-arbiter had to merge rather than synthesize genuinely different territory.
- **More expensive and slower**: ~25% more tokens, ~3x longer wall time.
- **Harder to read**: The output is a spreadsheet, not a story. Good for structured decision-making, harder for a human to quickly grok.
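
The overhead figures follow from the Structural Comparison table. A quick check, treating the rounded table values as if they were exact:

```python
# Figures taken from the Structural Comparison table (all approximate).
personality = {"tokens": 62_000, "runtime_s": 75}
dp = {"tokens": 77_000, "runtime_s": 3.5 * 60}

token_overhead = dp["tokens"] / personality["tokens"] - 1
runtime_ratio = dp["runtime_s"] / personality["runtime_s"]

print(f"extra tokens: {token_overhead:.0%}")
print(f"wall time:    {runtime_ratio:.1f}x")
```

This gives roughly 24% more tokens and 2.8x the wall time, consistent with the "~25%" and "~3x" stated above.
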
### Personality Mode Weaknesses
- **Thin on specifics**: The Pragmatist said "build a query router" but didn't propose concrete approaches. The D-Freethinker produced 4 specific router designs.
- **No scoring or prioritization**: The referee synthesized qualitatively but didn't score or rank. You get a narrative, not a decision matrix.
- **No cross-pollination mechanism**: Advisors don't build on each other in a single round. Would need multi-round debate (at higher cost) to get the interaction that D/P gets structurally.
---
## Key Insight Differences
Ideas that appeared in D/P but NOT in personality mode:
- **Consequence-based routing** (vs complexity-based) — a genuinely novel reframing
- **Prompt compression via LLMLingua** — specific tooling recommendation
- **Constrained decoding / json_schema enforcement** as a cost lever
- **Embedding classifier replacement** for classification tasks
- **Concrete sequencing timeline** (week 1 → month 3) from meta-arbiter
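
To illustrate the consequence-vs-complexity distinction, here is a minimal sketch of a router that looks at both signals. It is one possible reading of the idea, not the design from the run: the tier names, the [0, 1] scores, and the threshold are hypothetical, and how either score would be obtained is left open.

```python
def route(complexity: float, consequence: float, threshold: float = 0.5) -> str:
    """Pick a model tier from two scores in [0, 1].

    complexity  : how hard the input looks (an input feature)
    consequence : how costly a wrong answer would be downstream
    Tier names and the threshold are hypothetical.
    """
    if consequence >= threshold:
        # High-stakes queries go to the large model even when they look easy.
        return "large"
    if complexity >= threshold:
        return "medium"
    return "small"

# A complexity-only router would send the first query to the cheapest tier.
print(route(complexity=0.2, consequence=0.9))
print(route(complexity=0.8, consequence=0.1))
print(route(complexity=0.2, consequence=0.1))
```

The first call is the interesting case: the input looks simple, but the downstream impact keeps it on the largest tier.
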
Ideas that appeared in personality mode but NOT (or only weakly) in D/P:
- **"Tail-case invisibility"** — the Skeptic's insight about quality degradation being invisible in aggregate metrics
- **Speculative decoding** — the Visionary's bet on draft-model verification
- **Neuromorphic hardware** — longer-term framing
- **"Treat 'without quality loss' as hypothesis to falsify"** — epistemological reframing of the entire question
---
## When to Use Which
| Use Case | Recommended Mode |
|----------|-----------------|
| "Should we do X?" (opinion/judgment) | Personality |
| "How should we solve X?" (approaches) | D/P |
| Quick brainstorm, fast turnaround | Personality |
| Technical design with scoring/ranking | D/P |
| Need adversarial challenge / devil's advocate | Personality |
| Need complementary ideas from different optimization lenses | D/P |
| User wants a narrative they can read | Personality |
| User wants a decision matrix they can act on | D/P |
---
## Possible Improvements
1. **Hybrid mode**: Run D/P for ideation, then pass results to a Skeptic advisor for adversarial review before the meta-arbiter merges. Gets both structured ideas AND adversarial tension.
2. **Add a Skeptic role to D/P**: Add a third, adversarial evaluator alongside the two arbiters, whose job is specifically to look for failure modes, hidden assumptions, and tail risks.
3. **Multi-round D/P with bridge packets**: The arbiter "asks" are a natural bridge — running a second round where each group addresses the other's asks would likely improve both shortlists.
4. **Unified output format**: Both modes should produce a comparable final document. Currently personality mode gives a narrative and D/P gives a structured report — hard to compare directly.
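
The hybrid flow in item 1 could be wired up roughly as below. This is a sketch under stated assumptions: every role is a plain callable standing in for a subagent call, the arbiter scoring stage is omitted for brevity, and all function names and signatures are hypothetical rather than taken from the existing council tooling.

```python
from typing import Callable

Ideas = list[str]

def run_hybrid(
    topic: str,
    freethinkers: list[Callable[[str], Ideas]],
    skeptic: Callable[[Ideas], dict[str, str]],
    meta_arbiter: Callable[[Ideas, dict[str, str]], str],
) -> str:
    """Ideate, collect skeptic objections, then merge with the objections attached."""
    ideas: Ideas = []
    for think in freethinkers:
        ideas.extend(think(topic))
    objections = skeptic(ideas)  # maps each idea to a failure mode or hidden assumption
    return meta_arbiter(ideas, objections)

# Stub roles standing in for real subagent calls (hypothetical behavior).
d_group = lambda topic: [f"route {topic} by complexity"]
p_group = lambda topic: [f"route {topic} by consequence"]
skeptic = lambda ideas: {idea: "unverified on tail cases" for idea in ideas}
merge = lambda ideas, objections: "\n".join(
    f"{idea} [risk: {objections[idea]}]" for idea in ideas
)

report = run_hybrid("queries", [d_group, p_group], skeptic, merge)
print(report)
```

Passing the objections to the meta-arbiter as structured data, rather than as free text, would also move toward the unified output format in item 4.
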