- 5-phase plan: config, structured output, bridge caps, E2E run, zap integration
- Work to happen on fix/council-pipeline branch in ~/flynn
- Goal: get Flynn's dual-council working so zap can delegate to it
Flynn Council Pipeline — Fix Plan
Goal: Get Flynn's dual-council pipeline (council.run) working against real models so zap can delegate council tasks to Flynn as an external agent.
Branch: fix/council-pipeline (off main)
Status: The orchestrator code, types, schemas, tool registration, TUI /council command, and preflight check all exist. Unit tests pass (mocked). But the pipeline has never run successfully against real models.
Phase 1: Configuration & Agent Setup
Problem: The council requires 5 named agents in agent_configs that don't exist in the default config (everything is commented out).
Tasks:
- Uncomment and populate the `councils` block in `config/default.yaml` with `enabled: true`.
- Define the 5 required agent configs:
  - `council_d_arbiter`: D-group arbiter (feasibility-focused, structured JSON output)
  - `council_d_freethinker`: D-group freethinker (ideation, boring-but-true)
  - `council_p_arbiter`: P-group arbiter (novelty-focused, structured JSON output)
  - `council_p_freethinker`: P-group freethinker (ideation, weird-is-fine)
  - `council_meta_arbiter`: Meta merge agent (selects across both groups)
- Each agent needs:
  - A `system_prompt` that matches the pipeline's expected behavior (JSON-only output, role-specific framing)
  - A `model_tier` (start with `default` for all; upgrade meta to `complex` after first success)
- Decide whether to add grounder/writer agents or skip them initially (recommendation: skip, they're optional).
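The agent requirements above can be checked mechanically before running preflight. The following is a minimal sketch, not Flynn's actual preflight code: the `AgentConfig` shape and the `findUnresolvedCouncilAgents` helper are hypothetical, and only the five agent names and the `system_prompt` / `model_tier` keys come from this plan.

```typescript
// Hypothetical shape of one agent_configs entry; Flynn's real type may differ.
interface AgentConfig {
  system_prompt: string;
  model_tier: string; // the plan mentions "default" and "complex"
}

// The five agent names this plan lists as required.
const REQUIRED_COUNCIL_AGENTS = [
  "council_d_arbiter",
  "council_d_freethinker",
  "council_p_arbiter",
  "council_p_freethinker",
  "council_meta_arbiter",
] as const;

// Returns the required agents that are absent or have an empty system_prompt.
function findUnresolvedCouncilAgents(
  agentConfigs: Record<string, AgentConfig | undefined>,
): string[] {
  return REQUIRED_COUNCIL_AGENTS.filter((name) => {
    const cfg = agentConfigs[name];
    return cfg === undefined || cfg.system_prompt.trim() === "";
  });
}

// Example: only the D-group agents are configured so far (prompts are placeholders).
const partialConfig: Record<string, AgentConfig> = {
  council_d_arbiter: { system_prompt: "placeholder prompt", model_tier: "default" },
  council_d_freethinker: { system_prompt: "placeholder prompt", model_tier: "default" },
};
const unresolvedAgents = findUnresolvedCouncilAgents(partialConfig);
```

With the partial config above, `unresolvedAgents` holds the two P-group agents and the meta arbiter, which is roughly what an `[agent_missing]` flag would report.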
Acceptance: flynn tui → /council preflight shows all agents resolved, tiers probed OK, no [agent_missing] flags.
Phase 2: Structured Output Compatibility
Problem: The orchestrator demands strict JSON schema output (responseFormat: jsonSchemaFormat(...)) from every agent call. Most models handle this poorly or inconsistently. The pipeline has JSON repair + agent-based recovery, but if the underlying model doesn't support response_format: json_schema, it may fail before repair kicks in.
Tasks:
- Verify which models/providers in Flynn's config support `response_format` with `json_schema` type.
  - OpenAI GPT-4o+: yes
  - Anthropic Claude: no native `json_schema` (uses prompt-based JSON)
  - Copilot/OpenRouter: depends on underlying model
  - Ollama: partial support
- Check how Flynn's model router handles `responseFormat` for providers that don't support it: does it silently drop it, error, or adapt?
  - File: `src/models/` (check provider adapters)
- If needed, make the `responseFormat` parameter gracefully degrade:
  - For providers without `json_schema` support, rely on the system prompt directive ("Return JSON only...") + the existing `parseWithAgentRecovery` fallback
  - Don't hard-fail if the provider ignores `responseFormat`
- Test with the actual configured model to confirm JSON output parses correctly through the Zod schemas.
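One possible shape for the graceful degradation described above is sketched here. Everything in it is an assumption made for illustration: `ModelRequest`, `ProviderCapabilities`, `adaptRequest`, and `extractJson` are hypothetical names, not Flynn's provider adapter API, and the directive string is placeholder wording rather than the real prompt text.

```typescript
// Hypothetical request and capability shapes; not Flynn's actual types.
interface JsonSchemaFormat {
  type: "json_schema";
  name: string;
  schema: Record<string, unknown>;
}

interface ModelRequest {
  systemPrompt: string;
  userPrompt: string;
  responseFormat?: JsonSchemaFormat;
}

interface ProviderCapabilities {
  supportsJsonSchema: boolean;
}

// Placeholder wording; the real directive in the agent prompts is longer.
const JSON_ONLY_DIRECTIVE = "Return JSON only.";

// If the provider cannot honor json_schema, drop responseFormat and rely on
// the prompt directive instead of failing the call.
function adaptRequest(req: ModelRequest, caps: ProviderCapabilities): ModelRequest {
  if (req.responseFormat === undefined || caps.supportsJsonSchema) {
    return req;
  }
  const alreadyDirected = req.systemPrompt.includes(JSON_ONLY_DIRECTIVE);
  return {
    systemPrompt: alreadyDirected
      ? req.systemPrompt
      : `${req.systemPrompt}\n\n${JSON_ONLY_DIRECTIVE}`,
    userPrompt: req.userPrompt,
  };
}

// Models without schema enforcement often wrap JSON in a Markdown code fence.
// Strip one fence if present, then parse; JSON.parse throws on invalid input.
const FENCE = "`".repeat(3);
const FENCED_JSON = new RegExp(FENCE + "(?:json)?\\s*([\\s\\S]*?)" + FENCE);

function extractJson(raw: string): unknown {
  const fenced = raw.match(FENCED_JSON);
  const candidate = (fenced?.[1] ?? raw).trim();
  return JSON.parse(candidate);
}
```

In this sketch a parse failure still surfaces as an exception from `extractJson`, so a recovery step such as the existing `parseWithAgentRecovery` fallback would wrap it rather than be replaced by it.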
Acceptance: A single group round (D, round 1) completes without repair_failed or parse_failed using the configured model.
Phase 3: Bridge & Cap Validation
Problem: enforceBridgeCaps() throws hard on any cap violation (cap_exceeded), which kills the entire run. Real model output is likely to exceed the tight defaults (e.g., bridge_entry_max_chars: 300).
Tasks:
- Review default cap values and increase if they're too restrictive for real output:
  - `bridge_packet_max_chars: 2500` (may need 4000-5000)
  - `bridge_entry_max_chars: 300` (may need 500-800)
  - `bridge_field_max_bullets: 6` (probably fine)
- Consider making `enforceBridgeCaps` truncate rather than throw: trim entries to max chars, drop excess bullets, with a trace warning.
- Alternatively, add a `strict_bridge: false` config option that allows soft enforcement.
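The truncate-instead-of-throw option could look roughly like the sketch below. It is an illustration under stated assumptions, not the existing `enforceBridgeCaps`: the packet is modeled as a hypothetical map from field name to bullet strings, the packet budget is assumed to be the sum of entry lengths, and `applyBridgeCaps` is an invented name. Only the cap names, `strict_bridge`, and the `cap_exceeded` code come from this plan.

```typescript
// Cap names come from the plan; the packet shape is a hypothetical simplification.
interface BridgeCaps {
  bridge_packet_max_chars: number;
  bridge_entry_max_chars: number;
  bridge_field_max_bullets: number;
  strict_bridge: boolean;
}

// Each field of a bridge packet is modeled as a list of bullet strings.
type BridgePacket = Record<string, string[]>;

interface CapResult {
  packet: BridgePacket;
  warnings: string[]; // would be emitted as trace warnings
}

function applyBridgeCaps(packet: BridgePacket, caps: BridgeCaps): CapResult {
  const warnings: string[] = [];
  const out: BridgePacket = {};
  // Assumption: the packet budget counts the characters of all kept entries.
  let remaining = caps.bridge_packet_max_chars;

  for (const [field, bullets] of Object.entries(packet)) {
    const excess = bullets.length - caps.bridge_field_max_bullets;
    if (excess > 0) {
      warnings.push(`${field}: dropped ${excess} bullet(s)`);
    }
    const kept: string[] = [];
    for (const bullet of bullets.slice(0, caps.bridge_field_max_bullets)) {
      let text = bullet;
      if (text.length > caps.bridge_entry_max_chars) {
        text = text.slice(0, caps.bridge_entry_max_chars);
        warnings.push(`${field}: entry trimmed to ${caps.bridge_entry_max_chars} chars`);
      }
      if (text.length > remaining) {
        warnings.push(`${field}: entry dropped, packet budget exhausted`);
        continue;
      }
      remaining -= text.length;
      kept.push(text);
    }
    out[field] = kept;
  }

  // Strict mode keeps the hard-fail behavior: any violation aborts.
  if (caps.strict_bridge && warnings.length > 0) {
    throw new Error(`cap_exceeded: ${warnings.join("; ")}`);
  }
  return { packet: out, warnings };
}
```

Keeping the strict path in the same function means `strict_bridge: true` preserves the hard failure, while `strict_bridge: false` returns a trimmed packet plus a list of warnings.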
Acceptance: A 2-round run completes without bridge_validation_failed stop reason.
Phase 4: End-to-End Run
Tasks:
- Run `/council preflight` and confirm it is clean.
- Run `/council <simple test task>`, e.g., "What's the best approach to add persistent memory to an AI assistant?"
- Verify:
  - Pipeline reaches `max_rounds` or `convergence` stop reason (not an error).
  - Both D and P groups produce shortlists.
  - Meta merge produces `selected_primary` and `selected_secondary`.
  - Artifacts are written to `~/.local/share/flynn/councils/`.
  - Markdown summary is human-readable and useful.
- Fix any issues surfaced during the run (likely: JSON format, cap overflow, agent prompt tuning).
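The first three items of the verification checklist above lend themselves to an automated check over the run result. This is a sketch only: the `CouncilRunResult` shape, including the `stop_reason`, `d_shortlist`, and `p_shortlist` field names, is assumed for illustration and may not match the real artifact format. Whether artifacts were written and whether the summary reads well still need a manual look.

```typescript
// Hypothetical summary of a council run; field names other than
// selected_primary / selected_secondary are assumptions.
interface CouncilRunResult {
  stop_reason: string;
  d_shortlist: string[];
  p_shortlist: string[];
  selected_primary?: string;
  selected_secondary?: string;
}

// Stop reasons the plan treats as a normal finish.
const CLEAN_STOP_REASONS = ["max_rounds", "convergence"];

// Returns human-readable problems; an empty list means the checks passed.
function checkCouncilRun(result: CouncilRunResult): string[] {
  const problems: string[] = [];
  if (!CLEAN_STOP_REASONS.includes(result.stop_reason)) {
    problems.push(`unexpected stop reason: ${result.stop_reason}`);
  }
  if (result.d_shortlist.length === 0) problems.push("D group produced no shortlist");
  if (result.p_shortlist.length === 0) problems.push("P group produced no shortlist");
  if (!result.selected_primary) problems.push("meta merge missing selected_primary");
  if (!result.selected_secondary) problems.push("meta merge missing selected_secondary");
  return problems;
}

// Example: a run that stopped on a bridge validation error with no P shortlist.
const failedRunProblems = checkCouncilRun({
  stop_reason: "bridge_validation_failed",
  d_shortlist: ["idea 1"],
  p_shortlist: [],
  selected_primary: "idea 1",
  selected_secondary: "idea 2",
});
```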
Acceptance: At least one clean end-to-end run with real models, artifacts saved, readable output.
Phase 5: Integration with Zap (OpenClaw)
Goal: Let zap delegate council tasks to Flynn via external agent invocation.
Tasks:
- Determine the integration path:
  - Option A: Flynn exposes a CLI command (`flynn council run --task "..."`) that zap can call via `exec`.
  - Option B: Flynn exposes an HTTP endpoint for council runs (if gateway supports it).
  - Option C: Zap uses `sessions_spawn` to invoke Flynn as an ACP agent with a council task.
- Implement the chosen path (likely Option A as simplest):
  - Add `flynn council run --task "<task>" [--max-rounds N] [--output json|markdown]` CLI subcommand.
  - Output the markdown summary to stdout, JSON to a file.
- Update zap's council skill to support a `backend: flynn` option that delegates to Flynn instead of spawning subagents.
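For Option A, argument handling for the proposed subcommand might be sketched as follows. The flag names are taken from the proposal above; the `parseCouncilRunArgs` function, the `CouncilRunArgs` type, and the choice of `markdown` as the default output are assumptions, and the sketch does not use whatever CLI framework Flynn already has.

```typescript
type OutputFormat = "json" | "markdown";

interface CouncilRunArgs {
  task: string;
  maxRounds: number | undefined; // undefined means "use the configured default"
  output: OutputFormat;
}

// Parses the arguments that follow `flynn council run`.
// Throws on unknown flags, missing values, or a missing --task.
function parseCouncilRunArgs(argv: string[]): CouncilRunArgs {
  let task: string | undefined;
  let maxRounds: number | undefined;
  let output: OutputFormat = "markdown"; // assumed default

  // Every supported flag takes exactly one value, so step in pairs.
  for (let i = 0; i < argv.length; i += 2) {
    const flag = argv[i];
    const value = argv[i + 1];
    if (value === undefined) {
      throw new Error(`missing value for ${flag}`);
    }
    switch (flag) {
      case "--task":
        task = value;
        break;
      case "--max-rounds": {
        const n = Number(value);
        if (!Number.isInteger(n) || n < 1) {
          throw new Error("--max-rounds must be a positive integer");
        }
        maxRounds = n;
        break;
      }
      case "--output":
        if (value !== "json" && value !== "markdown") {
          throw new Error("--output must be json or markdown");
        }
        output = value;
        break;
      default:
        throw new Error(`unknown flag: ${flag}`);
    }
  }

  if (task === undefined || task.trim() === "") {
    throw new Error("--task is required");
  }
  return { task, maxRounds, output };
}

// Example invocation as zap might issue it.
const exampleArgs = parseCouncilRunArgs(["--task", "Evaluate memory designs", "--max-rounds", "2"]);
```

Failing fast on malformed flags gives zap a clear error to surface instead of a council run started with unintended settings.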
Acceptance: Zap can invoke flynn council run --task "..." and get structured output back.
Estimated Work
| Phase | Effort | Risk |
|---|---|---|
| 1. Config & agents | Small (config-only) | Low |
| 2. Structured output | Medium (may need provider adapter changes) | Medium — depends on model JSON compliance |
| 3. Bridge caps | Small (config + maybe truncation logic) | Low |
| 4. E2E run | Medium (iterative debugging) | Medium — real models are unpredictable |
| 5. Zap integration | Medium (new CLI command + skill update) | Low |
Total: ~1-2 focused sessions.
Open Questions
- Which model tier to use for council agents? Start with `default` (cheapest), upgrade after confirmed working.
- Should we keep the scaffold system or skip it for now? Recommendation: skip (`scaffold_path` unset), use system prompts only.
- Do we need the writer agents? Recommendation: skip for v1, the meta arbiter output is sufficient.
TODO (from earlier council skill work)
- Revisit subagent personality depth
- Revisit skill name ("council")
- Consider unifying debate and parallel flows
- Experiment with 2-round sufficiency
- Test with different model tiers for advisors vs referee