docs(council): add experimental findings from all 3 flow types

- Tested parallel 1-round, sequential 1-round, debate/parallel 3-round
- 3 rounds is sweet spot: positions converge, meaningful evolution
- Sequential most token-efficient; parallel 3-round best depth-to-cost
- Debate and parallel 3-round mechanically identical (prompt tone differs)
- Added cost profiles, recommended defaults by use case
- Updated TODOs: unify flows, test 2-round, test mixed model tiers
Author: zap
Date: 2026-03-05 16:39:32 +00:00
Parent: da36000050
Commit: 3e198bcbb3
2 changed files with 45 additions and 1 deletion


@@ -23,3 +23,16 @@
- Revisit advisor personality depth (richer backstories).
- Revisit skill name ("council" is placeholder).
- Experiment with different round counts and flows for optimal depth/cost tradeoffs.
## Council experiments completed
- Ran all 3 flow types on the same topic ("Should AI assistants have persistent memory?"):
1. **Parallel 1-round** (Experiment 1): Fast, clean, independent perspectives. 4 subagent calls, ~60k tokens.
2. **Sequential 1-round** (Experiment 2): Tighter dialogue — later advisors build on earlier. 4 calls, ~55k tokens. Less redundancy.
3. **Debate/Parallel 3-round** (Experiment 3): Richest output. Positions evolved significantly across rounds (Visionary backed off always-on, Skeptic softened on trajectory). 10 calls, ~130k tokens.
- Key findings:
- 3 rounds is the sweet spot for depth — positions converge by round 3.
- Sequential is most token-efficient for focused topics.
- Parallel 3-round is best depth-to-cost ratio for substantive topics.
- Debate and parallel 3-round are mechanically identical — differ only in prompt tone.
- Updated SKILL.md with experimental findings, recommended defaults by use case, cost profiles.
- New TODOs added: unify debate/parallel flows, test 2-round sufficiency, test mixed model tiers.


@@ -92,6 +92,35 @@ See `references/prompts.md` for all prompt templates. Key points:
- **Final round**: Ask for final synthesis — what changed, what held firm, final recommendation in 2-3 sentences. Keep shortest (150-250 words).
- **Referee (multi-round)**: Include the FULL debate transcript organized by round. Ask referee to note position evolution, not just final states.
## Experimental Findings
Tested all 3 flows on the same topic ("Should AI assistants have persistent memory?"):
### Parallel 1-round vs Parallel 3-round
- **1-round**: Fast, good for quick takes. Advisors give independent positions, referee synthesizes. Clean but no cross-pollination — advisors can't respond to each other's arguments.
- **3-round**: Significantly richer. Positions evolved meaningfully — the Visionary stepped back from always-on after engaging with Skeptic's arguments, the Skeptic softened on trajectory. Referee captured evolution. **Best overall depth-to-cost ratio.**
- **Takeaway**: 3 rounds is the sweet spot. 1 round works for quick brainstorms. More than 3 likely hits diminishing returns (positions converge by round 3).
### Sequential vs Parallel
- **Sequential**: Later advisors build directly on earlier ones — less redundancy, more focused rebuttals. The Skeptic (speaking last) gave the sharpest response because they could address both prior positions directly. But earlier advisors can't respond to later ones without extra rounds.
- **Parallel**: Advisors are more independent, sometimes overlapping. But each brings a genuinely uninfluenced perspective in round 1, which can surface blind spots that sequential misses.
- **Takeaway**: Sequential produces tighter dialogue in fewer total subagent calls (3 advisors + 1 referee = 4 calls). Parallel gives more independent coverage but needs multi-round for depth (3 advisors x 3 rounds + 1 referee = 10 calls).
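The call counts above follow one formula: each advisor speaks once per round, plus a single referee call at the end. A minimal sketch (the helper name is illustrative; the real orchestration lives in `scripts/council.sh`):

```shell
# Subagent calls for a council run: advisors speak once per round,
# then one referee call synthesizes. Hypothetical helper for illustration.
subagent_calls() {
  local advisors=$1 rounds=$2
  echo $(( advisors * rounds + 1 ))
}

subagent_calls 3 1   # prints 4  (sequential or parallel 1-round)
subagent_calls 3 3   # prints 10 (parallel/debate 3-round)
```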
### Debate (parallel 3-round) vs Parallel 3-round
- The flows are mechanically identical in our implementation. The distinction is mainly about prompt framing — debate prompts emphasize direct engagement ("respond to the Visionary's claim that...") while parallel rebuttal prompts are more general ("where do you agree or push back?").
- **Takeaway**: These can be unified. The "debate" label is useful for user-facing intent ("I want them to argue") but doesn't need a separate mechanical flow.
### Cost profile (approximate, per run on default model tier)
- Parallel 1-round: ~4 subagent calls, ~60k tokens total
- Sequential 1-round: ~4 subagent calls, ~55k tokens total (slightly less due to no parallel redundancy)
- Parallel/Debate 3-round: ~10 subagent calls, ~130k tokens total
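Dividing the measured totals by call counts shows per-call cost is roughly flat (~13-15k tokens), so total cost scales with call count rather than flow type:

```shell
# Per-call token cost derived from the measured totals above.
echo $(( 60000 / 4 ))     # prints 15000 (parallel 1-round, tokens/call)
echo $(( 55000 / 4 ))     # prints 13750 (sequential 1-round, tokens/call)
echo $(( 130000 / 10 ))   # prints 13000 (parallel/debate 3-round, tokens/call)
```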
### Recommended defaults by use case
- **Quick brainstorm**: `flow=parallel, rounds=1` — fast, cheap, good enough for casual topics
- **Balanced analysis**: `flow=parallel, rounds=3` — best depth-to-cost ratio, recommended default for substantive topics
- **Tight dialogue**: `flow=sequential, rounds=1` — fewest calls, good for focused topics where building on each other matters
- **Deep dive**: `flow=debate, rounds=3` — same as parallel 3-round with more combative prompting
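The defaults above could be captured as a small lookup. A sketch only — the use-case labels and the `flow=`/`rounds=` parameter names are illustrative assumptions, not the actual `scripts/council.sh` interface:

```shell
# Hypothetical mapping from use case to flow parameters;
# labels and parameter names are illustrative, not the real CLI.
default_params() {
  case "$1" in
    brainstorm) echo "flow=parallel rounds=1" ;;
    balanced)   echo "flow=parallel rounds=3" ;;
    dialogue)   echo "flow=sequential rounds=1" ;;
    deep-dive)  echo "flow=debate rounds=3" ;;
    *)          echo "flow=parallel rounds=3" ;;  # recommended default
  esac
}

default_params brainstorm   # prints: flow=parallel rounds=1
```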
## Implementation
Read `scripts/council.sh` for the orchestration logic.
@@ -106,4 +135,6 @@ Default roster and prompt templates live in `references/prompts.md`.
## TODO (revisit later)
- Revisit subagent personality depth — richer backstories, communication styles
- Revisit skill name — "council" works for now
- Experiment with different round counts and flows to find optimal depth/cost tradeoffs
- Consider unifying debate and parallel flows (mechanically identical, differ only in prompt tone)
- Explore whether 2 rounds is sufficient for most topics (vs 3)
- Test with different model tiers for advisors vs referee