docs(council): save D/P and personality run results + mode comparison

- D/P run: 5 subagents, ~77k tokens, produced scored shortlists + merges
- Personality run: 4 subagents, ~62k tokens, produced narrative + verdict
- Comparison: D/P better for concrete ideas/scoring, personality better for adversarial tension/narrative
- Key finding: D/P lacks built-in skeptic, personality lacks structured scoring
- Proposed improvement: hybrid mode combining both strengths
zap
2026-03-05 19:44:34 +00:00
parent e08e3d65e9
commit c9fa2e1d95
3 changed files with 282 additions and 0 deletions

@@ -0,0 +1,91 @@
# Council Mode Comparison — Same Topic
**Topic**: Best approach to reduce LLM inference costs by 50% without quality loss?
**Date**: 2026-03-05
**Both runs**: Tier light (Sonnet 4.6 for all subagents), parallel flow, single round
---
## Structural Comparison
| Dimension | Personality Mode | D/P Mode |
|-----------|-----------------|----------|
| Subagent calls | 4 (3 advisors + 1 referee) | 5 (2 freethinkers + 2 arbiters + 1 meta-arbiter) |
| Total runtime | ~75s | ~3.5min |
| Approximate tokens | ~62k | ~77k |
| Output structure | Opinions → Synthesis | Ideas → Scored shortlists → Cross-group merge |
| Diversity source | Personality lenses (how they think) | Cognitive style (what they optimize for) |
| Final output | Sequenced recommendation with tensions | Selected ideas with merges, rejections, experiments |
---
## What Each Mode Produced
### Personality Mode Strengths
- **The Skeptic's "tail-case invisibility" insight** was the sharpest single contribution across both runs. The concept: quality degradation from routing/quantization hits rare, high-stakes queries hardest — exactly where benchmarks don't measure and where damage is most consequential. This reframing changed the referee's entire recommendation sequence (instrument before routing).
- **Cleaner narrative arc**: Three perspectives → tensions → sequenced verdict. Easier to read and act on.
- **The Visionary pushed scope**: "50% is too conservative, architect for 10x" is a useful provocation even if the referee didn't fully adopt it. It ensured long-term options weren't ignored.
- **Faster and cheaper**: 4 subagent calls vs 5, ~75s vs ~3.5min, ~62k vs ~77k tokens, and simpler orchestration.
### D/P Mode Strengths
- **More concrete ideas**: 8 proposals (4 per group, with some overlap between groups) vs 3 position papers. D/P produced actionable workstreams, not just perspectives.
- **Scoring and filtering**: Arbiters scored every idea on novelty/feasibility/impact/testability and made explicit shortlist/hold/reject decisions. This structured evaluation doesn't exist in personality mode.
- **Cross-group merges were genuinely valuable**: The meta-arbiter identified 3 productive merges that neither group proposed alone:
  - Unified complexity × consequence routing (D1 + P2)
  - Stacked cache layers at different architectural levels (D2 + P1)
  - Combined prompt hygiene sprint (D3 + P1)
- **Asks between groups surfaced gaps**: D-Arbiter asked P for non-obvious classifier signals; P-Arbiter asked D for concrete consequence labeling schemes. This cross-pollination wouldn't happen in personality mode without multi-round debate.
- **Convergence signals**: Arbiters explicitly rated whether their group had found its best ideas (D: yes, P: no), which could inform whether to run another round.
- **The "consequence vs complexity" distinction** (P2) was a more novel framing than anything in the personality run. Routing on downstream impact rather than input features is a genuinely different approach.
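
The arbiter scoring step described above can be sketched roughly as follows. This is a minimal illustration, not the council's actual implementation: only the four criteria and the shortlist/hold/reject outcomes come from the run. The 1-5 scale, the equal weighting, the `shortlist_at` and `hold_at` thresholds, and the example scores are all assumptions.

```python
from dataclasses import dataclass
from typing import Dict

# The four criteria named in the run; the 1-5 scale is an assumption.
CRITERIA = ("novelty", "feasibility", "impact", "testability")


@dataclass
class ScoredIdea:
    idea_id: str
    scores: Dict[str, int]  # criterion name -> score on an assumed 1-5 scale

    def mean_score(self) -> float:
        # Equal weighting is an assumption; a real arbiter may weight criteria.
        return sum(self.scores[c] for c in CRITERIA) / len(CRITERIA)


def decide(idea: ScoredIdea, shortlist_at: float = 4.0, hold_at: float = 3.0) -> str:
    """Map a mean score to shortlist/hold/reject using hypothetical thresholds."""
    mean = idea.mean_score()
    if mean >= shortlist_at:
        return "shortlist"
    if mean >= hold_at:
        return "hold"
    return "reject"


# Illustrative scores only; these are not the scores from the actual run.
example = ScoredIdea("P2", {"novelty": 5, "feasibility": 3, "impact": 4, "testability": 4})
print(example.idea_id, example.mean_score(), decide(example))
```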
### D/P Mode Weaknesses
- **No adversarial tension**: Neither group questioned "is 50% even the right goal?" or "will this actually work in production?" The D/P structure generates *complementary* ideas, not *opposing* ones. There's no built-in skeptic.
- **Repetition across groups**: Both groups independently proposed model routing and semantic caching, so the meta-arbiter spent part of its effort merging overlapping proposals rather than synthesizing genuinely different territory.
- **More expensive and slower**: ~25% more tokens (~77k vs ~62k) and roughly 3x the wall time (~3.5min vs ~75s).
- **Harder to read**: The output is a spreadsheet, not a story. Good for structured decision-making, harder for a human to quickly grok.
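
The cost and runtime overhead can be checked against the approximate totals in the structural comparison table. A quick calculation, treating ~3.5min as 210 seconds (the inputs are themselves rounded, so the results are only indicative):

```python
# Approximate totals from the structural comparison table.
personality_tokens, dp_tokens = 62_000, 77_000
personality_seconds, dp_seconds = 75, 3.5 * 60

token_overhead = dp_tokens / personality_tokens - 1  # fraction of extra tokens
runtime_ratio = dp_seconds / personality_seconds     # wall-time multiple

print(f"extra tokens: {token_overhead:.0%}")   # extra tokens: 24%
print(f"runtime ratio: {runtime_ratio:.1f}x")  # runtime ratio: 2.8x
```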
### Personality Mode Weaknesses
- **Thin on specifics**: The Pragmatist said "build a query router" but didn't propose concrete approaches. The D-Freethinker produced 4 specific router designs.
- **No scoring or prioritization**: The referee synthesized qualitatively but didn't score or rank. You get a narrative, not a decision matrix.
- **No cross-pollination mechanism**: Advisors don't build on each other in a single round. Personality mode would need multi-round debate (at higher cost) to get the interaction that D/P gets structurally.
---
## Key Insight Differences
Ideas that appeared in D/P but NOT in personality mode:
- **Consequence-based routing** (vs complexity-based) — a genuinely novel reframing
- **Prompt compression via LLMLingua** — specific tooling recommendation
- **Constrained decoding / json_schema enforcement** as a cost lever
- **Embedding classifier replacement** for classification tasks
- **Concrete sequencing timeline** (week 1 → month 3) from meta-arbiter
Ideas that appeared in personality mode but NOT (or weakly) in D/P:
- **"Tail-case invisibility"** — the Skeptic's insight about quality degradation being invisible in aggregate metrics
- **Speculative decoding** — the Visionary's bet on draft-model verification
- **Neuromorphic hardware** — longer-term framing
- **"Treat 'without quality loss' as hypothesis to falsify"** — epistemological reframing of the entire question
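
To make the consequence-based routing idea concrete, and to show how it interacts with the Skeptic's tail-case concern, here is a minimal sketch. It assumes each query already carries two scores in [0, 1]; how those scores are produced is out of scope. The function name, the cutoffs, and the `"small"`/`"large"` tier labels are hypothetical and do not come from either run.

```python
def choose_model(complexity: float, consequence: float,
                 complexity_cutoff: float = 0.5,
                 consequence_cutoff: float = 0.5) -> str:
    """Pick a model tier from two assumed scores in [0, 1].

    complexity: how hard the input looks (input features).
    consequence: how costly a wrong answer would be downstream.
    """
    # A high-consequence query goes to the large model even when it looks easy,
    # so rare high-stakes cases are not silently downgraded.
    if consequence >= consequence_cutoff or complexity >= complexity_cutoff:
        return "large"
    return "small"


print(choose_model(complexity=0.1, consequence=0.9))  # easy but high-stakes
print(choose_model(complexity=0.1, consequence=0.1))  # easy and low-stakes
```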
---
## When to Use Which
| Use Case | Recommended Mode |
|----------|-----------------|
| "Should we do X?" (opinion/judgment) | Personality |
| "How should we solve X?" (approaches) | D/P |
| Quick brainstorm, fast turnaround | Personality |
| Technical design with scoring/ranking | D/P |
| Need adversarial challenge / devil's advocate | Personality |
| Need complementary ideas from different optimization lenses | D/P |
| User wants a narrative they can read | Personality |
| User wants a decision matrix they can act on | D/P |
---
## Possible Improvements
1. **Hybrid mode**: Run D/P for ideation, then pass results to a Skeptic advisor for adversarial review before the meta-arbiter merges. Gets both structured ideas AND adversarial tension.
2. **Add a Skeptic role to D/P**: A third "adversarial evaluator" alongside the two arbiters who specifically looks for failure modes, hidden assumptions, and tail risks.
3. **Multi-round D/P with bridge packets**: The arbiter "asks" are a natural bridge — running a second round where each group addresses the other's asks would likely improve both shortlists.
4. **Unified output format**: Both modes should produce a comparable final document. Currently personality mode gives a narrative and D/P gives a structured report — hard to compare directly.
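
The hybrid flow in improvement 1 could be wired up roughly as below. This is a sketch under stated assumptions, not an existing council API: the stage signatures, the `run_hybrid` name, and the stub stages are all hypothetical. Each stage is a plain callable so the example runs without any LLM client.

```python
from typing import Callable, List

# Assumed stage signatures; in practice each would wrap a subagent call.
Ideate = Callable[[str], List[str]]            # topic -> candidate ideas
Critique = Callable[[List[str]], List[str]]    # ideas -> objections / failure modes
Merge = Callable[[List[str], List[str]], str]  # ideas + objections -> final report


def run_hybrid(topic: str, ideate: Ideate, skeptic: Critique, meta_arbiter: Merge) -> str:
    """D/P ideation, then a Skeptic review, then the meta-arbiter merge."""
    ideas = ideate(topic)        # D/P groups produce shortlisted ideas
    objections = skeptic(ideas)  # adversarial pass before merging
    return meta_arbiter(ideas, objections)


# Stub stages for illustration only.
report = run_hybrid(
    "reduce LLM inference costs",
    ideate=lambda topic: [f"route queries for: {topic}", f"cache responses for: {topic}"],
    skeptic=lambda ideas: [f"what fails in the tail for '{idea}'?" for idea in ideas],
    meta_arbiter=lambda ideas, objections: f"{len(ideas)} ideas, {len(objections)} objections",
)
print(report)
```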