From 0180d4fb8f152c3590066c9cf9f79b52fabc1e81 Mon Sep 17 00:00:00 2001
From: William Valentin
Date: Fri, 6 Feb 2026 13:17:51 -0800
Subject: [PATCH] docs: add Phase 0/1 implementation plan and feature gap analysis

---
 ...026-02-06-openclaw-feature-gap-analysis.md | 306 +++++++
 .../2026-02-06-p0-p1-implementation-plan.md   | 845 ++++++++++++++++++
 2 files changed, 1151 insertions(+)
 create mode 100644 docs/plans/2026-02-06-openclaw-feature-gap-analysis.md
 create mode 100644 docs/plans/2026-02-06-p0-p1-implementation-plan.md

diff --git a/docs/plans/2026-02-06-openclaw-feature-gap-analysis.md b/docs/plans/2026-02-06-openclaw-feature-gap-analysis.md
new file mode 100644
index 0000000..a8c9de2
--- /dev/null
+++ b/docs/plans/2026-02-06-openclaw-feature-gap-analysis.md
@@ -0,0 +1,306 @@
+# Flynn vs OpenClaw — Feature Gap Analysis
+
+**Date:** 2026-02-06
+**Purpose:** Comprehensive comparison of Flynn's current implementation against OpenClaw's feature set, to guide prioritisation of future work.
+
+## Legend
+
+- **MATCH** — Flynn has equivalent functionality
+- **PARTIAL** — Flynn has some implementation but incomplete
+- **MISSING** — Not implemented in Flynn
+
+---
+
+## 1. Channels / Frontends
+
+| Feature | OpenClaw | Flynn | Status |
+|---------|----------|-------|--------|
+| Telegram | grammY bot | grammY bot | **MATCH** |
+| WhatsApp | Baileys (WhatsApp Web) | -- | **MISSING** |
+| Discord | discord.js | -- | **MISSING** |
+| Slack | Bolt SDK | -- | **MISSING** |
+| Signal | signal-cli | -- | **MISSING** |
+| iMessage / BlueBubbles | imsg + BlueBubbles | -- | **MISSING** |
+| Google Chat | Chat API | -- | **MISSING** |
+| Microsoft Teams | Bot Framework | -- | **MISSING** |
+| Matrix | Extension | -- | **MISSING** |
+| Zalo / Zalo Personal | Extension | -- | **MISSING** |
+| WebChat | Gateway-served | Gateway (stub) | **PARTIAL** |
+| TUI (terminal) | `openclaw tui` | Minimal + Fullscreen (React/Ink) | **MATCH** |
+| LINE / Feishu / Mattermost | Extensions/plugins | -- | **MISSING** |
+
+Flynn has **2 of ~15 channels**. The messaging channel ecosystem is the single biggest gap.
+
+---
+
+## 2. Model Providers
+
+| Feature | OpenClaw | Flynn | Status |
+|---------|----------|-------|--------|
+| Anthropic (Claude) | Full + OAuth | Full | **MATCH** |
+| OpenAI | Full + OAuth + Codex | Full | **MATCH** |
+| Ollama (local) | Supported | Full | **MATCH** |
+| Llama.cpp (local) | Supported | Basic | **PARTIAL** |
+| Gemini / Google | Full provider | Stub only | **PARTIAL** |
+| OpenRouter | Supported | -- | **MISSING** |
+| Amazon Bedrock | Supported | -- | **MISSING** |
+| GLM / MiniMax / Moonshot | Supported | -- | **MISSING** |
+| Vercel AI Gateway | Supported | -- | **MISSING** |
+| Z.AI | Supported | -- | **MISSING** |
+| Synthetic provider | Supported | -- | **MISSING** |
+| OAuth subscription auth | Anthropic + OpenAI | API keys only | **MISSING** |
+| Model failover chains | Full (fallback + rotation) | Fallback chains | **MATCH** |
+| Model tier routing | Per-agent, per-provider | default/fast/complex/local | **MATCH** |
+| Provider-specific tool policy | Per-provider tool filtering | -- | **MISSING** |
+
+---
+
+## 3. Agent Runtime & Tools
+
+| Feature | OpenClaw | Flynn | Status |
+|---------|----------|-------|--------|
+| Tool loop with streaming | RPC mode + block streaming | Tool loop (max 10 iter) | **MATCH** |
+| `exec` / shell | Full (background, pty, timeout, elevated) | Basic (bash -c, timeout) | **PARTIAL** |
+| `read` / file read | Full (line ranges) | Full (line offset/limit) | **MATCH** |
+| `write` / file write | Full | Full (auto-mkdir) | **MATCH** |
+| `edit` / file edit | Full | Full (exact match, replace_all) | **MATCH** |
+| `apply_patch` | Multi-hunk structured patches | -- | **MISSING** |
+| `file.list` / glob | -- | Full (glob filtering) | **MATCH** |
+| `web_fetch` | Full (markdown/text extract, caching) | Basic HTTP GET | **PARTIAL** |
+| `web_search` | Brave Search API | -- | **MISSING** |
+| Browser control | Full CDP (Chromium profiles, snapshots, actions) | -- | **MISSING** |
+| Canvas / A2UI | Agent-driven visual workspace | -- | **MISSING** |
+| `process` tool | Background exec management (poll/log/write/kill) | -- | **MISSING** |
+| `image` tool | Image analysis with configurable model | -- | **MISSING** |
+| `message` tool | Cross-channel messaging + actions | -- | **MISSING** |
+| `cron` tool | Runtime cron management | -- | **MISSING** |
+| `gateway` tool | Restart/config management | -- | **MISSING** |
+| `sessions_*` tools | List/history/send/spawn across sessions | -- | **MISSING** |
+| `agents_list` tool | Sub-agent discovery | -- | **MISSING** |
+| Tool profiles | minimal/coding/messaging/full | -- | **MISSING** |
+| Tool groups | `group:fs`, `group:runtime`, etc. | -- | **MISSING** |
+| Tool allow/deny lists | Global + per-agent + per-provider | -- | **MISSING** |
+
+---
+
+## 4. Session Management
+
+| Feature | OpenClaw | Flynn | Status |
+|---------|----------|-------|--------|
+| Session persistence | JSONL files | SQLite | **MATCH** (different storage) |
+| Session isolation | Per-sender + group isolation | `{frontend}:{userId}` | **MATCH** |
+| Session transfer | Between channels | Between frontends | **MATCH** |
+| Multi-agent routing | Isolated workspaces per agent | Single backend | **MISSING** |
+| Session pruning | Tool result trimming (in-memory) | -- | **MISSING** |
+| `/new` / `/reset` | Full | Full | **MATCH** |
+| `/status` | Full (model + tokens + cost) | Full (model + confirmations) | **MATCH** |
+
+---
+
+## 5. Context Window & Compaction
+
+| Feature | OpenClaw | Flynn | Status |
+|---------|----------|-------|--------|
+| Auto-compaction | Full (summarise older history) | -- | **MISSING** |
+| Manual `/compact` | Full (with instructions) | -- | **MISSING** |
+| Pre-compaction memory flush | Silent agentic turn | -- | **MISSING** |
+| Token tracking | Full (per-response, cost) | Input/output counters | **PARTIAL** |
+
+**Critical gap** — without compaction, long conversations will hit token limits and fail.
+
+---
+
+## 6. Memory System
+
+| Feature | OpenClaw | Flynn | Status |
+|---------|----------|-------|--------|
+| Markdown memory files | `MEMORY.md` + daily logs | -- | **MISSING** |
+| `memory_search` tool | Semantic vector search | -- | **MISSING** |
+| `memory_get` tool | Read memory files | -- | **MISSING** |
+| Vector embeddings | OpenAI/Gemini/local | -- | **MISSING** |
+| Hybrid search (BM25 + vector) | Full | -- | **MISSING** |
+| Session memory indexing | Experimental | -- | **MISSING** |
+| QMD backend | Experimental | -- | **MISSING** |
+
+OpenClaw has a sophisticated memory system. Flynn has none.
+
+---
+
+## 7. MCP (Model Context Protocol)
+
+| Feature | OpenClaw | Flynn | Status |
+|---------|----------|-------|--------|
+| MCP tool servers | Not emphasised | Full (stdio transport) | **MATCH** |
+| MCP tool bridging | Not emphasised | Full (`mcp:{server}:{tool}`) | **MATCH** |
+| MCP server lifecycle | Not emphasised | Full (start/stop/restart) | **MATCH** |
+
+Flynn actually has MCP support that OpenClaw doesn't emphasise — OpenClaw relies on its own native tool system and plugins instead.
+
+---
+
+## 8. Security & Safety
+
+| Feature | OpenClaw | Flynn | Status |
+|---------|----------|-------|--------|
+| Tool confirmation hooks | Full | Full (confirm/log/silent patterns) | **MATCH** |
+| Chat ID allowlists | Per-channel | Telegram only | **PARTIAL** |
+| DM pairing (unknown senders) | Full (pairing codes) | -- | **MISSING** |
+| Docker sandboxing | Full (per-session/agent/shared) | -- | **MISSING** |
+| Elevated mode | Host exec escape hatch | -- | **MISSING** |
+| Tool execution timeouts | Full (configurable) | 30s default | **MATCH** |
+| Output truncation | Full | 51KB | **MATCH** |
+| Gateway auth (token/password) | Full | -- | **MISSING** |
+
+---
+
+## 9. Automation & Scheduling
+
+| Feature | OpenClaw | Flynn | Status |
+|---------|----------|-------|--------|
+| Cron jobs | Full (runtime + config) | Full (YAML config) | **MATCH** |
+| Webhooks | Full (inbound triggers) | -- | **MISSING** |
+| Gmail Pub/Sub | Full | -- | **MISSING** |
+| Heartbeat | Full | -- | **MISSING** |
+
+---
+
+## 10. Apps & Companion Devices
+
+| Feature | OpenClaw | Flynn | Status |
+|---------|----------|-------|--------|
+| macOS menu bar app | Full | -- | **MISSING** |
+| iOS node | Full (Canvas, Voice, Camera) | -- | **MISSING** |
+| Android node | Full (Canvas, Talk, Camera) | -- | **MISSING** |
+| Voice Wake / Talk Mode | Full (ElevenLabs) | -- | **MISSING** |
+| Camera / screen capture | Via nodes | -- | **MISSING** |
+| Location access | Via nodes | -- | **MISSING** |
+
+---
+
+## 11. Skills & Plugins
+
+| Feature | OpenClaw | Flynn | Status |
+|---------|----------|-------|--------|
+| Skills system | Bundled/managed/workspace | Bundled/managed/workspace | **MATCH** |
+| Skill manifest | Full | Full (requirements, versioning) | **MATCH** |
+| ClawHub registry | Community skill registry | -- | **MISSING** |
+| Plugin system | Full (register tools + CLI commands) | -- | **MISSING** |
+| Workspace prompt injection | AGENTS.md, SOUL.md, TOOLS.md | -- | **MISSING** |
+
+---
+
+## 12. Gateway & Infrastructure
+
+| Feature | OpenClaw | Flynn | Status |
+|---------|----------|-------|--------|
+| WebSocket control plane | Full | WebSocket gateway (basic) | **PARTIAL** |
+| Control UI (web dashboard) | Full | -- | **MISSING** |
+| Tailscale Serve/Funnel | Full integration | -- | **MISSING** |
+| Remote gateway access | SSH tunnels + tailnet | -- | **MISSING** |
+| Health checks / doctor | 10+ checks | 10 checks | **MATCH** |
+| `onboard` wizard | Full guided setup | -- | **MISSING** |
+| Docker deployment | Full | -- | **MISSING** |
+| Nix deployment | Full | -- | **MISSING** |
+| Fly.io / Railway / Render | Supported | -- | **MISSING** |
+| Bonjour/mDNS discovery | Full | -- | **MISSING** |
+| Gateway lock | Full | -- | **MISSING** |
+
+---
+
+## 13. Chat Commands
+
+| Feature | OpenClaw | Flynn | Status |
+|---------|----------|-------|--------|
+| `/status` | Full | Full | **MATCH** |
+| `/new` / `/reset` | Full | Full | **MATCH** |
+| `/compact` | Full | -- | **MISSING** |
+| `/think <level>` | Full (off to xhigh) | -- | **MISSING** |
+| `/verbose` | Full | -- | **MISSING** |
+| `/usage` | Full (off/tokens/full) | -- | **MISSING** |
+| `/local` / `/cloud` | -- | Full | Flynn-unique |
+| `/model` | -- | Full | Flynn-unique |
+
+---
+
+## 14. Miscellaneous
+
+| Feature | OpenClaw | Flynn | Status |
+|---------|----------|-------|--------|
+| Streaming & chunking | Full (per-channel limits) | Full (streaming responses) | **MATCH** |
+| Typing indicators | Full | Telegram only | **PARTIAL** |
+| Presence tracking | Full | -- | **MISSING** |
+| Usage tracking / cost | Full | Basic token counters | **PARTIAL** |
+| Markdown rendering | Per-channel formatting | Basic (TUI + Telegram) | **PARTIAL** |
+| Media pipeline | Images/audio/video/transcription | -- | **MISSING** |
+| Group chat support | Full (mention gating, routing) | -- | **MISSING** |
+| Retry policy | Full (configurable) | -- | **MISSING** |
+| System prompt templating | AGENTS.md, SOUL.md, IDENTITY.md, USER.md | -- | **MISSING** |
+
+---
+
+## Summary Scorecard
+
+| Category | Compared | Match | Partial | Missing |
+|----------|:--------:|:-----:|:-------:|:-------:|
+| Channels | 15 | 2 | 1 | 12 |
+| Model Providers | 14 | 5 | 2 | 7 |
+| Agent & Tools | 17 | 4 | 2 | 11 |
+| Sessions | 7 | 5 | 0 | 2 |
+| Context/Compaction | 4 | 0 | 1 | 3 |
+| Memory | 7 | 0 | 0 | 7 |
+| MCP | 3 | 3 | 0 | 0 |
+| Security | 8 | 3 | 1 | 4 |
+| Automation | 4 | 1 | 0 | 3 |
+| Companion Apps | 6 | 0 | 0 | 6 |
+| Skills/Plugins | 5 | 2 | 0 | 3 |
+| Gateway/Infra | 11 | 1 | 1 | 9 |
+| Chat Commands | 8 | 2 | 0 | 4 |
+| Misc | 9 | 1 | 3 | 5 |
+| **TOTAL** | **118** | **29 (25%)** | **11 (9%)** | **78 (66%)** |
+
+---
+
+## Top Priority Gaps (recommended order)
+
+### P0 — Functionally Critical
+
+1. **Context compaction** — Without this, long conversations hit token limits and break. Blocks real-world use for extended sessions.
+
+2. **Memory system** — OpenClaw's markdown-based memory with vector search gives the assistant persistent knowledge across sessions. Flynn has nothing persistent beyond session history.
+
+### P1 — High Impact
+
+3. **Messaging channels (WhatsApp, Discord, Slack)** — Flynn has 2 of 15 channels. Adding the top 3 popular channels covers the majority of use cases.
+
+4. **Web search tool** — `web_search` (Brave API) is a commonly-used agent capability Flynn lacks entirely.
+
+5. **Background exec / process management** — OpenClaw's `process` tool lets agents manage long-running commands. Flynn's shell tool is fire-and-forget.
+
+6. **Enhanced `web_fetch`** — Flynn's is basic HTTP GET; OpenClaw extracts markdown/text, caches responses, and handles JS-heavy sites via browser fallback.
+
+### P2 — Important for Production
+
+7. **Docker sandboxing** — Tool isolation for non-main sessions. Important for any multi-user or group-facing deployment.
+
+8. **Multi-agent routing** — Isolated agents per workspace/sender with sub-agent spawning.
+
+9. **Tool allow/deny and profiles** — Fine-grained control over which tools each agent/session can use.
+
+10. **System prompt templating** — AGENTS.md, SOUL.md, IDENTITY.md, USER.md workspace injection for personality and behaviour customisation.
+
+### P3 — Nice to Have
+
+11. **Browser control (CDP)** — Powerful but complex; depends on use case.
+12. **Gemini provider (full)** — Currently a stub.
+13. **Additional model providers** — OpenRouter, Bedrock, etc.
+14. **Gateway auth** — Token/password auth for the WebSocket control plane.
+15. **Companion apps** — macOS/iOS/Android nodes (huge scope, niche audience).
+
+---
+
+## What Flynn Has That OpenClaw Doesn't Emphasise
+
+- **Full MCP protocol support** with stdio transport, tool bridging, and server lifecycle management
+- **Model tier switching** via chat commands (`/local`, `/cloud`, `/model`)
+- **Gemini provider** (stub, but in the schema — OpenClaw removed non-Pi agent paths)
+- **SQLite session storage** (vs OpenClaw's JSONL files)
diff --git a/docs/plans/2026-02-06-p0-p1-implementation-plan.md b/docs/plans/2026-02-06-p0-p1-implementation-plan.md
new file mode 100644
index 0000000..ed8aec9
--- /dev/null
+++ b/docs/plans/2026-02-06-p0-p1-implementation-plan.md
@@ -0,0 +1,845 @@
+# Flynn P0 + P1 Implementation Plan
+
+**Date:** 2026-02-06
+**Scope:** 7 features from the gap analysis — the functionally critical (P0) and high-impact (P1) items.
+**Prerequisite:** [Feature Gap Analysis](./2026-02-06-openclaw-feature-gap-analysis.md)
+
+---
+
+## Feature Summary
+
+| # | Feature | Priority | Est. Effort | Dependencies |
+|---|---------|----------|-------------|--------------|
+| 0 | Multi-model sub-agent delegation | P0 | 3–4 days | None (foundational) |
+| 1 | Context compaction | P0 | 2–3 days | #0 (uses cheap model for summaries) |
+| 2 | Memory system | P0 | 3–4 days | #0, #1 |
+| 3 | Messaging channels (WhatsApp, Discord, Slack) | P1 | 2–3 days each | None |
+| 4 | Web search tool | P1 | 0.5 day | None |
+| 5 | Background exec / process management | P1 | 1–2 days | None |
+| 6 | Enhanced web_fetch | P1 | 1 day | None |
+
+**Total estimated effort:** 15–22 days
+
+---
+
+## Phase 0: Multi-Model Sub-Agent Delegation (P0 — Foundational)
+
+### Problem
+
+Flynn currently runs a **single NativeAgent per session** that talks to one model tier at a time. The `ModelRouter` (`src/models/router.ts`) supports tiers (`fast`/`default`/`complex`/`local`) and a fallback chain, but:
+
+- There is no concept of **sub-agents** — the primary agent can't spawn a cheaper model for a subtask.
+- Model selection is **per-session** (via `/model` command), not **per-task**.
+- Compaction summaries, memory extraction, and classification tasks all use the same expensive model as the main conversation — wasteful.
+- There is no orchestrator pattern where an expensive model (Opus) plans and delegates to cheaper models (Sonnet, Haiku) for execution.
+
+### Model Tier Mapping
+
+| Tier | Model | Use For |
+|------|-------|---------|
+| **complex** (orchestrator) | Claude Opus 4.6 | Planning, orchestration, complex reasoning, multi-step decisions |
+| **default** (worker) | Claude Sonnet 4.5 | General conversation, tool use, code generation, channel adapters |
+| **fast** (utility) | Claude Haiku 4.5 | Compaction summaries, memory extraction, classification, keyword extraction, formatting |
+
+This maps directly to Flynn's existing `ModelTier` type. The infrastructure is already there — what's missing is the **delegation mechanism**.
+
+### Design
+
+#### Sub-agent spawning
+
+Add the ability for `NativeAgent` to spawn **ephemeral sub-agents** that run a single task on a specific model tier and return the result:
+
+```typescript
+interface SubAgentRequest {
+  /** Which model tier to use for this subtask. */
+  tier: ModelTier;
+  /** System prompt for the sub-agent (task-specific). */
+  systemPrompt: string;
+  /** The task message. */
+  message: string;
+  /** Max tokens for the response. */
+  maxTokens?: number;
+  /** Whether to include tools. Default: false (most subtasks are pure text). */
+  tools?: boolean;
+}
+
+interface SubAgentResult {
+  content: string;
+  usage: TokenUsage;
+  tier: ModelTier;
+}
+```
+
+The sub-agent is **stateless** — no session, no history, just a single request/response. It's a thin wrapper around `modelRouter.chat()` with a specific tier.
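As a rough illustration of the "thin wrapper" described above, the sketch below shows one way a stateless delegation call could look. Only `SubAgentRequest`, `SubAgentResult`, and the existence of `chat(request, tier)` come from the plan; the `ChatRequest`/`ChatResponse` shapes, the `ModelRouterLike` interface, the 1024-token default, and the stub router are assumptions made so the example is self-contained.

```typescript
// Hypothetical shapes for illustration; Flynn's real types in src/models may differ.
type ModelTier = "fast" | "default" | "complex" | "local";

interface TokenUsage {
  inputTokens: number;
  outputTokens: number;
}

interface ChatRequest {
  system: string;
  messages: { role: "user" | "assistant"; content: string }[];
  maxTokens: number;
}

interface ChatResponse {
  content: string;
  usage: TokenUsage;
}

// Assumed router surface: the plan only states that chat(request, tier) exists.
interface ModelRouterLike {
  chat(request: ChatRequest, tier: ModelTier): Promise<ChatResponse>;
}

interface SubAgentRequest {
  tier: ModelTier;
  systemPrompt: string;
  message: string;
  maxTokens?: number;
}

interface SubAgentResult {
  content: string;
  usage: TokenUsage;
  tier: ModelTier;
}

// Stateless single-turn delegation: no session and no history, just one
// request built from the task and sent to the requested tier.
async function delegate(
  router: ModelRouterLike,
  request: SubAgentRequest,
): Promise<SubAgentResult> {
  const response = await router.chat(
    {
      system: request.systemPrompt,
      messages: [{ role: "user", content: request.message }],
      maxTokens: request.maxTokens ?? 1024, // assumed default, not from the plan
    },
    request.tier,
  );
  return { content: response.content, usage: response.usage, tier: request.tier };
}

// Stub router used only to exercise the wrapper without calling a real model.
const echoRouter: ModelRouterLike = {
  async chat(request, tier) {
    return {
      content: `[${tier}] ${request.messages[0].content}`,
      usage: { inputTokens: 1, outputTokens: 1 },
    };
  },
};
```

Because the wrapper holds no state, any subsystem that has a router reference can call it without touching session history, which is what makes it suitable for background tasks such as summaries.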
+
+#### Where delegation happens
+
+| Task | Delegated to | Reason |
+|------|-------------|--------|
+| Compaction summary | **fast** (Haiku) | Summarisation is a well-defined extraction task; doesn't need complex reasoning |
+| Memory fact extraction | **fast** (Haiku) | Simple extraction from conversation text |
+| Message classification | **fast** (Haiku) | "Is this a command, question, or statement?" — trivial |
+| Tool result summarisation | **fast** (Haiku) | Condense verbose tool output before feeding back |
+| Primary conversation | **default** (Sonnet) | General-purpose agent work |
+| Complex planning/reasoning | **complex** (Opus) | Multi-step planning, architecture decisions, ambiguous requests |
+| Sub-agent orchestration | **complex** (Opus) | When the agent decides to break a task into subtasks |
+
+#### Automatic tier escalation
+
+Add optional **auto-escalation** where the primary agent (Sonnet) can recognise it's struggling and escalate to Opus:
+
+1. If the agent hits `maxIterations` without completing the task → escalate to `complex`.
+2. If the agent's response contains explicit uncertainty markers ("I'm not sure", "This is beyond...") → offer escalation.
+3. Configurable: `auto_escalate: true` in config.
+
+This is a **future enhancement** — start with explicit delegation points (compaction, memory extraction) and add auto-escalation later.
+
+#### AgentOrchestrator class
+
+Create a new `AgentOrchestrator` that sits between the channel message handler and the `NativeAgent`:
+
+```typescript
+class AgentOrchestrator {
+  private primaryAgent: NativeAgent; // default tier (Sonnet)
+  private modelRouter: ModelRouter;
+
+  /** Spawn a sub-agent for a single-turn task on a specific tier. */
+  async delegate(request: SubAgentRequest): Promise<SubAgentResult>;
+
+  /** Process a user message — delegates to primary agent, which may internally delegate subtasks. */
+  async process(userMessage: string): Promise;
+}
+```
+
+The orchestrator replaces the current direct `NativeAgent` usage in the message router (`src/daemon/index.ts:139-186`).
+
+#### Passing the orchestrator to tools and compaction
+
+The key insight: **compaction and memory extraction don't need a new agent class** — they just need access to `modelRouter.chat(request, 'fast')`. The orchestrator provides a `delegate()` method that any subsystem can call:
+
+```typescript
+// In compaction.ts
+const summary = await orchestrator.delegate({
+  tier: 'fast',
+  systemPrompt: COMPACTION_SYSTEM_PROMPT,
+  message: `Summarise this conversation:\n\n${messagesToCompact}`,
+  maxTokens: 1024,
+});
+
+// In memory extraction
+const facts = await orchestrator.delegate({
+  tier: 'fast',
+  systemPrompt: MEMORY_EXTRACTION_PROMPT,
+  message: `Extract key facts from:\n\n${summary}`,
+  maxTokens: 512,
+});
+```
+
+### New files
+
+| File | Purpose |
+|------|---------|
+| `src/backends/native/orchestrator.ts` | `AgentOrchestrator` — sub-agent spawning and delegation |
+| `src/backends/native/prompts.ts` | System prompts for delegated tasks (compaction, extraction, classification) |
+
+### Changes to existing files
+
+| File | Change |
+|------|--------|
+| `src/backends/native/agent.ts` | Accept optional `orchestrator` reference for internal delegation. Add `delegateSubtask()` method. |
+| `src/daemon/index.ts` | Replace direct `NativeAgent` creation in `createMessageRouter()` with `AgentOrchestrator`. |
+| `src/config/schema.ts` | Add `agents` config block for tier assignment and delegation policy. |
+| `src/models/router.ts` | No changes needed — already supports `chat(request, tier)`. |
+
+### Config additions
+
+```yaml
+agents:
+  primary_tier: default  # Model tier for main conversation (Sonnet)
+  delegation:
+    compaction: fast  # Tier for compaction summaries (Haiku)
+    memory_extraction: fast  # Tier for memory fact extraction (Haiku)
+    classification: fast  # Tier for message classification (Haiku)
+    tool_summarisation: fast  # Tier for condensing tool output (Haiku)
+    complex_reasoning: complex  # Tier for escalated reasoning (Opus)
+  auto_escalate: false  # Future: auto-escalate on failure
+  max_delegation_depth: 3  # Prevent infinite delegation chains
+```
+
+### Implementation steps
+
+1. Create `src/backends/native/orchestrator.ts`:
+   - Constructor takes `ModelRouter`, `systemPrompt`, `session`, `toolRegistry`, `toolExecutor`, delegation config.
+   - `delegate(request: SubAgentRequest): Promise<SubAgentResult>` — single-turn call to `modelRouter.chat()` with specified tier.
+   - `process(userMessage: string): Promise` — delegates to internal `NativeAgent`.
+   - Tracks delegation depth to prevent loops.
+   - Logs tier usage for cost visibility.
+2. Create `src/backends/native/prompts.ts` with task-specific system prompts.
+3. Update `createMessageRouter()` in `src/daemon/index.ts` to use `AgentOrchestrator` instead of raw `NativeAgent`.
+4. Add `agents` config block to schema.
+5. Wire delegation config through to compaction (Phase 1) and memory (Phase 2).
+6. Tests: delegation routing, tier selection, depth limiting.
+
+### Cost implications
+
+| Operation | Without delegation | With delegation |
+|-----------|-------------------|-----------------|
+| Compaction summary | Opus/Sonnet ($$$) | Haiku ($) |
+| Memory extraction | Opus/Sonnet ($$$) | Haiku ($) |
+| 10 classifications | Opus/Sonnet ($$$) | Haiku ($) |
+| Complex reasoning | Sonnet ($$) | Opus ($$$) — but only when needed |
+
+Net effect: **significant cost reduction** for background tasks, with targeted spend on complex reasoning only when it matters.
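The implementation steps for Phase 0 mention tracking delegation depth and logging tier usage for cost visibility. A minimal sketch of both follows, assuming depth is passed explicitly from parent to child rather than held in a shared counter; `DelegationContext`, `childContext`, `TierTally`, and `recordUsage` are hypothetical names, and only `max_delegation_depth` and the tier names come from the plan.

```typescript
type ModelTier = "fast" | "default" | "complex" | "local";

// Hypothetical context threaded through each delegation call.
interface DelegationContext {
  depth: number;    // number of nested delegations so far
  maxDepth: number; // corresponds to agents.max_delegation_depth
}

// Derive the context for a nested delegation, refusing once the limit is hit.
function childContext(parent: DelegationContext): DelegationContext {
  if (parent.depth >= parent.maxDepth) {
    throw new Error(`delegation depth limit (${parent.maxDepth}) reached`);
  }
  return { depth: parent.depth + 1, maxDepth: parent.maxDepth };
}

// Running token totals per tier, so background spend is visible in logs.
type TierTally = Record<ModelTier, number>;

function newTally(): TierTally {
  return { fast: 0, default: 0, complex: 0, local: 0 };
}

function recordUsage(tally: TierTally, tier: ModelTier, totalTokens: number): void {
  tally[tier] += totalTokens;
}
```

Passing the context explicitly keeps concurrent sibling delegations from counting against each other, which a single shared counter would not guarantee. With `maxDepth: 3`, three nested delegations succeed and a fourth throws.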
+
+---
+
+## Phase 1: Context Compaction (P0)
+
+### Problem
+
+Flynn sends the **entire session history** to the model on every turn. There is no summarisation, trimming, or token budgeting. Once a conversation exceeds the model's context window, it fails hard.
+
+**Current flow** (`src/backends/native/agent.ts:92-165`):
+```
+toolLoop() → loopMessages = full this.history → send to model
+```
+
+The `SessionStore` (`src/session/store.ts`) and `ManagedSession` (`src/session/manager.ts`) store every message verbatim and replay them all on load.
+
+### Design
+
+#### Token counting
+
+Add a `tokenCount` utility that estimates token counts per message. Two strategies:
+
+1. **Cheap estimate** — character-based heuristic (`chars / 4` for English). Good enough for budgeting.
+2. **Accurate count** — use the Anthropic SDK's `count_tokens` or `tiktoken` for OpenAI. Only needed if we want precise billing.
+
+Start with the cheap estimate; add accurate counting later behind a flag.
+
+#### Compaction strategy
+
+Use a **summarise-and-replace** approach (same as OpenClaw):
+
+1. When total estimated tokens exceed a **compaction threshold** (configurable, default: 80% of model's context window), trigger compaction.
+2. Take all messages **except the last N turns** (configurable, default: 4 turns).
+3. **Delegate** the summarisation request to the **fast tier (Haiku)** via `orchestrator.delegate()`: "Summarise this conversation so far, preserving key facts, decisions, and context." This is a well-defined extraction task that doesn't need complex reasoning.
+4. Replace the older messages with a single `[system_summary]` message.
+5. Persist the compacted history to SQLite (replace the old messages).
+
+#### Where compaction runs
+
+Compaction is a concern of `AgentOrchestrator` (Phase 0), not the session store. The orchestrator decides when to compact based on the model it's using, and delegates the summary generation to the **fast** tier via `orchestrator.delegate({ tier: 'fast', ... })`.
+
+#### New files
+
+| File | Purpose |
+|------|---------|
+| `src/context/tokens.ts` | Token estimation utilities |
+| `src/context/compaction.ts` | Compaction logic (summarise + replace) |
+
+#### Changes to existing files
+
+| File | Change |
+|------|--------|
+| `src/backends/native/agent.ts` | Add `compactIfNeeded()` call before building `loopMessages`. Add compaction config to `NativeAgentConfig`. |
+| `src/session/manager.ts` | Add `ManagedSession.replaceHistory(messages)` method for compaction to persist the compacted state. |
+| `src/session/store.ts` | Add `replaceMessages(sessionId, messages)` — atomic delete + re-insert in a transaction. |
+| `src/models/types.ts` | Add optional `contextWindow` field to `ChatResponse` or create a `ModelCapabilities` type. |
+| `src/config/schema.ts` | Add `compaction` config block: `{ enabled, threshold_pct, keep_turns, summary_model? }`. |
+| `src/daemon/index.ts` | Pass compaction config to agent creation. |
+
+#### Config additions
+
+```yaml
+compaction:
+  enabled: true
+  threshold_pct: 80  # Trigger at 80% of context window
+  keep_turns: 4  # Always keep the last 4 exchanges
+  # summary_tier is configured in agents.delegation.compaction (default: fast/Haiku)
+```
+
+#### Chat commands
+
+| Command | Description |
+|---------|-------------|
+| `/compact` | Force compaction of the current session immediately. |
+
+#### Implementation steps
+
+1. Create `src/context/tokens.ts` with `estimateTokens(text: string): number` and `estimateMessageTokens(messages: Message[]): number`.
+2. Create `src/context/compaction.ts` with `compactHistory(opts: CompactionOpts): Promise<Message[]>`:
+   - Takes messages, orchestrator (for delegation), keep_turns.
+   - Calls `orchestrator.delegate({ tier: 'fast', ... })` for the summary.
+   - Returns `[summaryMessage, ...recentMessages]`.
+3. Add `replaceMessages()` to `SessionStore`.
+4. Add `replaceHistory()` to `ManagedSession`.
+5. Add compaction config to schema.
+6. Wire `compactIfNeeded()` into `AgentOrchestrator.process()` — called before building the request, checks token budget.
+7. Add `/compact` command handling in the message router.
+8. Tests: token estimation accuracy, compaction trigger logic, history replacement, delegation to fast tier.
+
+#### Model context window sizes
+
+Hard-code a lookup table in `src/context/tokens.ts`:
+
+```typescript
+const CONTEXT_WINDOWS: Record<string, number> = {
+  'claude-sonnet-4-20250514': 200_000,
+  'claude-3-5-haiku-20241022': 200_000,
+  'gpt-4o': 128_000,
+  'gpt-4o-mini': 128_000,
+  // ... etc
+};
+```
+
+Allow override in config: `models.default.context_window: 128000`.
+
+---
+
+## Phase 2: Memory System (P0)
+
+### Problem
+
+Flynn has no persistent knowledge across sessions. Every new session starts blank. The agent can't remember user preferences, past decisions, or accumulated knowledge.
+
+### Design
+
+A lightweight memory system with three layers:
+
+1. **Memory files** — Markdown files that the agent can read/write (like OpenClaw's `MEMORY.md`).
+2. **Memory tools** — `memory.read`, `memory.write`, `memory.search` builtin tools.
+3. **Auto-indexing** — After compaction, key facts are extracted and appended to memory.
+
+#### Storage
+
+Use a dedicated SQLite table in the existing `sessions.db` (or a separate `memory.db`):
+
+```sql
+CREATE TABLE memory_entries (
+  id INTEGER PRIMARY KEY AUTOINCREMENT,
+  session_id TEXT, -- NULL for global memories
+  namespace TEXT NOT NULL, -- 'user', 'facts', 'preferences', etc.
+  key TEXT NOT NULL,
+  content TEXT NOT NULL,
+  embedding BLOB, -- Future: vector embedding for search
+  created_at INTEGER NOT NULL DEFAULT (unixepoch()),
+  updated_at INTEGER NOT NULL DEFAULT (unixepoch())
+);
+CREATE INDEX idx_memory_ns ON memory_entries(namespace);
+CREATE INDEX idx_memory_session ON memory_entries(session_id);
+```
+
+#### Phase 2a: File-based memory (MVP)
+
+The simplest useful memory: a markdown file per namespace in `~/.local/share/flynn/memory/`.
+
+```
+~/.local/share/flynn/memory/
+├── global.md # Cross-session knowledge
+├── user.md # User preferences, facts about the user
+└── sessions/
+    └── {session_id}.md # Per-session notes
+```
+
+#### Memory tools
+
+| Tool | Description |
+|------|-------------|
+| `memory.read` | Read a memory file by namespace. Args: `{ namespace: string }` |
+| `memory.write` | Append to or replace a memory file. Args: `{ namespace: string, content: string, mode: 'append' \| 'replace' }` |
+| `memory.search` | Search across all memory files for a keyword. Args: `{ query: string }`. Returns matching lines with context. |
+
+#### Phase 2b: Vector search (future)
+
+Defer vector embeddings and semantic search to a later phase. The file-based approach with keyword search covers 80% of use cases.
+
+When implemented:
+- Add `sqlite-vec` or similar for vector storage
+- Embed memory entries on write using the configured model's embedding API
+- Hybrid search: keyword (BM25) + vector similarity
+
+#### System prompt integration
+
+On every agent turn, inject a `[Memory Context]` section into the system prompt:
+
+```
+# Memory Context
+
+The following is your persistent memory. Use it to maintain continuity across sessions.
+
+## User
+{contents of user.md, truncated to ~1000 tokens}
+
+## Global
+{contents of global.md, truncated to ~1000 tokens}
+```
+
+This is injected dynamically by the agent before each request, not baked into the static system prompt.
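The memory-context injection described above could be sketched roughly as follows. This is an illustration only: it reuses the `chars / 4` heuristic from Phase 1 for truncation, and the function names (`truncateToTokens`, `buildMemoryContext`, `withMemory`) and the `[truncated]` marker are hypothetical rather than part of the plan.

```typescript
// Cheap token estimate from Phase 1: roughly four characters per token.
const CHARS_PER_TOKEN = 4;

// Cut text to an approximate token budget, marking the cut if one happened.
function truncateToTokens(text: string, maxTokens: number): string {
  const maxChars = maxTokens * CHARS_PER_TOKEN;
  if (text.length <= maxChars) return text;
  return text.slice(0, maxChars) + "\n[truncated]";
}

// Build the memory section from the contents of user.md and global.md.
function buildMemoryContext(
  userMemory: string,
  globalMemory: string,
  tokensPerSection: number = 1000,
): string {
  return [
    "# Memory Context",
    "",
    "The following is your persistent memory. Use it to maintain continuity across sessions.",
    "",
    "## User",
    truncateToTokens(userMemory, tokensPerSection),
    "",
    "## Global",
    truncateToTokens(globalMemory, tokensPerSection),
  ].join("\n");
}

// Append the memory section to the static system prompt for a single request.
function withMemory(systemPrompt: string, memoryContext: string): string {
  return `${systemPrompt}\n\n${memoryContext}`;
}
```

Keeping the static prompt and the memory section separate until request time means memory edits made mid-session show up on the next turn without rebuilding the agent.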
+ +#### Auto-extraction after compaction + +When compaction runs (Phase 1), add a follow-up step using the **fast tier (Haiku)** via `orchestrator.delegate()`: + +1. Along with the summary, delegate to Haiku to extract any **new facts worth remembering** (user preferences, decisions, names, etc.). This is a simple extraction task — no need for Sonnet/Opus. +2. Append extracted facts to `user.md` or `global.md`. + +This creates a natural knowledge accumulation loop: conversation → compaction (Haiku) → memory extraction (Haiku) → next session gets richer context. + +The cost of these background operations is minimal since they run on the cheapest model tier. + +#### New files + +| File | Purpose | +|------|---------| +| `src/memory/store.ts` | MemoryStore class — read/write/search markdown files | +| `src/memory/index.ts` | Exports | +| `src/tools/builtin/memory-read.ts` | `memory.read` tool | +| `src/tools/builtin/memory-write.ts` | `memory.write` tool | +| `src/tools/builtin/memory-search.ts` | `memory.search` tool | + +#### Changes to existing files + +| File | Change | +|------|--------| +| `src/tools/builtin/index.ts` | Register memory tools in `allBuiltinTools` | +| `src/backends/native/orchestrator.ts` | Inject memory context into system prompt before each request | +| `src/context/compaction.ts` | Add memory extraction step after summarisation (delegates to fast tier) | +| `src/daemon/index.ts` | Initialize MemoryStore, pass to orchestrator config | +| `src/config/schema.ts` | Add `memory` config block: `{ enabled, dir, namespaces, auto_extract }` | + +#### Config additions + +```yaml +memory: + enabled: true + dir: ~/.local/share/flynn/memory + auto_extract: true # Extract facts during compaction + max_context_tokens: 2000 # Max tokens injected per turn from memory +``` + +#### Implementation steps + +1. 
Create `src/memory/store.ts`: + - `read(namespace): string` — read file contents + - `write(namespace, content, mode): void` — append or replace + - `search(query): SearchResult[]` — line-by-line keyword match with context + - `listNamespaces(): string[]` +2. Create memory tools (3 files). +3. Register tools. +4. Add memory context injection to `NativeAgent` — load memory before building the request, inject into system prompt. +5. Add memory extraction to compaction flow. +6. Tests: memory CRUD, search, injection, extraction. + +--- + +## Phase 3: Messaging Channels (P1) + +### Problem + +Flynn has only Telegram and WebChat. The three most requested channels are WhatsApp, Discord, and Slack. + +### Design approach + +Flynn's `ChannelAdapter` interface (`src/channels/types.ts:51-69`) is clean and well-defined. Adding a new channel means: + +1. Implement `ChannelAdapter` (two properties, `name` and `status`, and four methods: `connect()`, `disconnect()`, `send()`, `onMessage()`). +2. Add config section. +3. Register in daemon startup. + +Each channel is independent — implement in any order. + +### 3a: Discord + +**Library:** `discord.js` v14 +**Effort:** 1–2 days + +#### Config + +```yaml +discord: + bot_token: ${DISCORD_BOT_TOKEN} + allowed_guild_ids: [] # Empty = all guilds + allowed_channel_ids: [] # Empty = all channels +``` + +#### New files + +| File | Purpose | +|------|---------| +| `src/channels/discord/adapter.ts` | DiscordAdapter implementing ChannelAdapter | +| `src/channels/discord/index.ts` | Exports | + +#### Key decisions + +- **Peer ID:** Use `channelId` (not `userId`) so the agent maintains separate sessions per Discord channel. +- **Message chunking:** Discord has a 2000-char limit. Chunk long responses. +- **Mentions:** Only respond when mentioned (`@Flynn`) or in DMs. Configurable. +- **Slash commands:** Register `/reset` and `/status` as Discord slash commands. + +#### Implementation steps + +1. Add `discord.js` dependency. +2. Create `DiscordAdapter` class. +3. 
Add config schema for `discord` section. +4. Register in daemon if `config.discord.bot_token` is set. +5. Export from `src/channels/index.ts`. +6. Test with a bot in a private server. + +### 3b: Slack + +**Library:** `@slack/bolt` (Bolt for JavaScript) +**Effort:** 1–2 days + +#### Config + +```yaml +slack: + bot_token: ${SLACK_BOT_TOKEN} + app_token: ${SLACK_APP_TOKEN} # For Socket Mode + signing_secret: ${SLACK_SIGNING_SECRET} + allowed_channel_ids: [] +``` + +#### New files + +| File | Purpose | +|------|---------| +| `src/channels/slack/adapter.ts` | SlackAdapter implementing ChannelAdapter | +| `src/channels/slack/index.ts` | Exports | + +#### Key decisions + +- **Socket Mode** for self-hosted deployments (no public URL needed). Falls back to HTTP events if `app_token` not set. +- **Peer ID:** `channelId:threadTs` to isolate threaded conversations. +- **Message chunking:** Slack truncates message `text` at 40,000 characters, and the text of a single section block is capped at 3,000 characters, so chunk block content accordingly. Use `mrkdwn` formatting. +- **Slash commands:** `/flynn-reset`, `/flynn-status`. + +### 3c: WhatsApp + +**Library:** `whatsapp-web.js` (or `@whiskeysockets/baileys` for full WhatsApp Web protocol) +**Effort:** 2–3 days (more complex due to QR auth) + +#### Config + +```yaml +whatsapp: + auth_dir: ~/.local/share/flynn/whatsapp-auth + allowed_numbers: [] # E.164 format, empty = all +``` + +#### Key decisions + +- **Auth flow:** WhatsApp Web requires QR code scanning on first connect. Display QR in terminal on startup. +- **Session persistence:** Store auth state in `auth_dir` so re-auth isn't needed on restart. +- **Peer ID:** Phone number (E.164). +- **Media:** Start with text-only; defer image/audio handling. + +**WhatsApp is the most complex channel.** Consider doing Discord and Slack first, then WhatsApp. + +### Shared channel infrastructure + +Before implementing individual channels, extract any common patterns: + +1. **Message chunking utility** — `src/channels/utils/chunking.ts`: `chunkMessage(text: string, maxLen: number): string[]` +2. 
**Allowlist checking** — `src/channels/utils/auth.ts`: `isAllowed(senderId: string, allowlist: string[]): boolean` +3. **Markdown adaptation** — `src/channels/utils/markdown.ts`: Platform-specific markdown conversion (Discord uses different syntax from Telegram). + +--- + +## Phase 4: Web Search Tool (P1) + +### Problem + +The agent has no way to search the web. This is one of the most commonly-used agent tools. + +### Design + +#### Provider options + +| Provider | Pros | Cons | +|----------|------|------| +| **Brave Search API** | Free tier (2k/month), clean API, good results | Requires API key signup | +| **SearXNG** | Self-hosted, no API key, already running in homelab | Results quality varies | +| **Tavily** | Purpose-built for AI agents, great results | Paid only | +| **DuckDuckGo** | No API key needed | Unofficial API, rate limits | + +**Recommendation:** Support Brave as primary, SearXNG as self-hosted alternative. Make the provider configurable. + +#### Config + +```yaml +tools: + web_search: + provider: brave # brave | searxng | tavily + api_key: ${BRAVE_SEARCH_API_KEY} + endpoint: null # Override for SearXNG: http://searxng:8080 + max_results: 5 +``` + +#### New files + +| File | Purpose | +|------|---------| +| `src/tools/builtin/web-search.ts` | `web.search` tool | + +#### Tool interface + +```typescript +{ + name: 'web.search', + description: 'Search the web for information. Returns titles, URLs, and snippets.', + inputSchema: { + type: 'object', + properties: { + query: { type: 'string', description: 'Search query' }, + count: { type: 'number', description: 'Number of results (default 5, max 20)' }, + }, + required: ['query'], + }, +} +``` + +#### Output format + +``` +1. **Title** — url + Snippet text... + +2. **Title** — url + Snippet text... +``` + +Structured as markdown so the model can easily parse and reference results. + +#### Implementation steps + +1. Create `src/tools/builtin/web-search.ts`. +2. 
Add Brave Search API client (simple `fetch` — no SDK needed). +3. Add SearXNG support as alternative backend. +4. Add tool config section to schema. +5. Register in `allBuiltinTools`. +6. Tests: mock API responses, result formatting. + +--- + +## Phase 5: Background Exec / Process Management (P1) + +### Problem + +Flynn's `shell.exec` (`src/tools/builtin/shell.ts`) is synchronous and one-shot: it runs a command, blocks until it finishes (up to a 30s timeout), and returns stdout/stderr. There's no way to: + +- Run a long-running process (e.g., `npm run dev`) +- Check on a running process +- Read its ongoing output +- Kill it + +### Design + +Add a `process` tool family that manages background processes: + +| Tool | Description | +|------|-------------| +| `process.start` | Start a command in the background. Returns a process ID. | +| `process.status` | Check if a process is running, exited, or errored. | +| `process.output` | Read recent stdout/stderr from a background process. | +| `process.kill` | Kill a background process. | +| `process.list` | List all managed background processes. | + +#### Process manager + +Create a `ProcessManager` class that maintains a registry of spawned processes: + +```typescript +interface ManagedProcess { + id: string; + command: string; + cwd?: string; + pid: number; + status: 'running' | 'exited' | 'killed' | 'error'; + exitCode?: number; + outputBuffer: RingBuffer; // Last N bytes of combined stdout+stderr + startedAt: number; +} +``` + +#### Output buffering + +Use a ring buffer (circular buffer) to keep the last 64KB of output per process. This bounds memory use, so long-running processes with verbose output cannot grow the buffer without limit. + +#### Safety + +- **Max processes:** Limit to 10 concurrent background processes. +- **Auto-cleanup:** Kill processes that have been running for more than 1 hour (configurable). +- **Shutdown cleanup:** Kill all managed processes on daemon shutdown. 
+- **Hook integration:** `process.start` should go through the confirmation engine (same as `shell.exec`). + +#### New files + +| File | Purpose | +|------|---------| +| `src/tools/builtin/process/manager.ts` | ProcessManager class | +| `src/tools/builtin/process/start.ts` | `process.start` tool | +| `src/tools/builtin/process/status.ts` | `process.status` tool | +| `src/tools/builtin/process/output.ts` | `process.output` tool | +| `src/tools/builtin/process/kill.ts` | `process.kill` tool | +| `src/tools/builtin/process/list.ts` | `process.list` tool | +| `src/tools/builtin/process/index.ts` | Exports | + +#### Changes to existing files + +| File | Change | +|------|--------| +| `src/tools/builtin/index.ts` | Register process tools | +| `src/daemon/index.ts` | Create ProcessManager, pass to tool constructors, register shutdown handler | +| `src/config/schema.ts` | Add `process` config: `{ max_concurrent, max_runtime_minutes, buffer_size }` | + +#### Implementation steps + +1. Implement `RingBuffer` utility (or use an npm package like `ringbufferjs`). +2. Create `ProcessManager` class with spawn, track, kill, cleanup methods. +3. Implement 5 process tools. +4. Register tools and wire shutdown cleanup. +5. Tests: spawn + kill lifecycle, output buffering, max process limits. + +--- + +## Phase 6: Enhanced web_fetch (P1) + +### Problem + +Flynn's `web.fetch` (`src/tools/builtin/web-fetch.ts:19-50`) is a bare `fetch()` call that returns raw HTML. This is nearly useless for LLMs — they need extracted text/markdown, not raw HTML with scripts and styles. + +### Design + +#### Enhancements + +1. **HTML-to-markdown extraction** — Strip scripts/styles, convert to markdown using `@mozilla/readability` + `turndown`. +2. **Format parameter** — Let the agent choose: `text`, `markdown` (default), or `html`. +3. **Response caching** — Cache fetched pages for 5 minutes to avoid redundant requests in tool loops. +4. 
**Redirect following** — Already handled by `fetch()`, but add a max redirect limit. +5. **Content type handling** — Return JSON prettified, plain text as-is, HTML converted. + +#### Libraries + +| Package | Purpose | +|---------|---------| +| `turndown` | HTML → Markdown converter | +| `linkedom` | Lightweight DOM implementation (for Readability) | +| `@mozilla/readability` | Extract article content from HTML | + +Using `linkedom` instead of `jsdom` — it's much lighter and sufficient for content extraction. + +#### Tool interface update + +```typescript +{ + name: 'web.fetch', + description: 'Fetch a URL and extract its content. Returns clean text/markdown by default, not raw HTML.', + inputSchema: { + type: 'object', + properties: { + url: { type: 'string', description: 'The URL to fetch' }, + format: { type: 'string', enum: ['markdown', 'text', 'html'], description: 'Output format (default: markdown)' }, + timeout: { type: 'number', description: 'Timeout in milliseconds (default 15000)' }, + }, + required: ['url'], + }, +} +``` + +#### Caching + +Simple in-memory cache with TTL: + +```typescript +const cache = new Map(); +const CACHE_TTL = 5 * 60 * 1000; // 5 minutes +``` + +#### Changes to existing files + +| File | Change | +|------|--------| +| `src/tools/builtin/web-fetch.ts` | Major rewrite — add extraction, caching, format parameter | + +#### Implementation steps + +1. Add `turndown`, `linkedom`, `@mozilla/readability` dependencies. +2. Create extraction pipeline: fetch → parse DOM → readability → turndown → clean markdown. +3. Add format parameter handling. +4. Add response caching. +5. Update tool description to reflect new capabilities. +6. Tests: extraction from sample HTML, caching behaviour, format handling. 
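The response-caching step in Phase 6 could be fleshed out along these lines. This is a sketch under assumptions: the `TtlCache` class, the `cacheKey` helper, and the injectable clock are hypothetical and not part of the plan, which only specifies a `Map` and a 5-minute TTL. Keying on format as well as URL is one way to avoid serving a cached markdown result for an `html` request.

```typescript
// Hypothetical shapes; the plan only specifies `new Map()` and CACHE_TTL.
interface CacheEntry {
  content: string;
  expiresAt: number; // epoch milliseconds
}

const CACHE_TTL = 5 * 60 * 1000; // 5 minutes, as in the plan

class TtlCache {
  private entries = new Map<string, CacheEntry>();
  private ttlMs: number;
  private now: () => number;

  // The clock is injectable so tests can control time without waiting.
  constructor(ttlMs: number = CACHE_TTL, now: () => number = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  /** Return the cached content, or undefined if missing or expired. */
  get(key: string): string | undefined {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (this.now() >= entry.expiresAt) {
      this.entries.delete(key); // evict lazily on read
      return undefined;
    }
    return entry.content;
  }

  /** Store content with an expiry of now + ttl. */
  set(key: string, content: string): void {
    this.entries.set(key, { content, expiresAt: this.now() + this.ttlMs });
  }
}

/** Build a cache key from the output format and the URL. */
function cacheKey(url: string, format: "markdown" | "text" | "html" = "markdown"): string {
  return `${format}:${url}`;
}
```

Note that this version only evicts entries when they are read, so a daemon that fetches many distinct URLs would still accumulate stale entries; a size cap or periodic sweep may be worth adding if that turns out to matter.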
+ +--- + +## Implementation Order + +``` +Week 1: Phase 0 (Multi-Model Delegation) ─────────────────────── P0 (foundational) +Week 2: Phase 1 (Context Compaction) ─────────────────────────── P0 (uses delegation) +Week 3: Phase 2 (Memory System) ──────────────────────────────── P0 (uses delegation) +Week 4: Phase 4 (Web Search) + Phase 6 (Enhanced web_fetch) ─── P1 (quick wins) +Week 5: Phase 5 (Process Management) ─────────────────────────── P1 +Week 6+: Phase 3 (Channels: Discord → Slack → WhatsApp) ──────── P1 +``` + +**Rationale:** +- **Delegation first** — Phase 0 is foundational. Compaction and memory both need to delegate subtasks to cheaper models. Building the orchestrator first means Phase 1 and 2 can use it immediately. +- Compaction and memory are sequential (memory extraction depends on compaction). +- Web search and enhanced web_fetch are small, independent, and immediately useful — do them as palate cleansers between the big features. +- Process management is self-contained. +- Channels are the largest body of work but each is independent — can be done in parallel or interleaved. + +### Model usage across all phases + +| Phase | Primary model (user-facing) | Delegated tasks | Delegation tier | +|-------|---------------------------|-----------------|-----------------| +| 0 | Sonnet (default) | Sub-agent infrastructure | N/A (infrastructure) | +| 1 | Sonnet (default) | Compaction summaries | Haiku (fast) | +| 2 | Sonnet (default) | Memory fact extraction | Haiku (fast) | +| 3 | Sonnet (default) | Message classification, markdown adaptation | Haiku (fast) | +| 4 | Sonnet (default) | None (direct API call) | N/A | +| 5 | Sonnet (default) | None | N/A | +| 6 | Sonnet (default) | None | N/A | + +Opus (complex) is reserved for **user-facing tasks** that require deep reasoning — it's never used for background operations. + +--- + +## Testing Strategy + +Each phase should include: + +1. 
**Unit tests** — Pure logic (token estimation, ring buffer, markdown extraction, memory search). +2. **Integration tests** — Tool execution with mocked model responses. +3. **Manual smoke test** — Run via TUI and Telegram to verify end-to-end. + +Key test files to create: + +| Test file | Covers | +|-----------|--------| +| `src/backends/native/orchestrator.test.ts` | Delegation routing, tier selection, depth limiting, cost tracking | +| `src/context/tokens.test.ts` | Token estimation accuracy | +| `src/context/compaction.test.ts` | Compaction trigger logic, summary replacement, fast-tier delegation | +| `src/memory/store.test.ts` | Memory CRUD, search | +| `src/tools/builtin/web-search.test.ts` | API mocking, result formatting | +| `src/tools/builtin/process/manager.test.ts` | Process lifecycle, cleanup | +| `src/tools/builtin/web-fetch.test.ts` | HTML extraction, caching | + +--- + +## Risk Assessment + +| Risk | Impact | Mitigation | +|------|--------|------------| +| Haiku summaries lose critical context vs Sonnet | High | Validate quality; use detailed extraction prompts; allow per-task tier override in config | +| Delegation depth spirals (agent delegates to agent that delegates...) 
| Medium | Hard limit `max_delegation_depth: 3`; sub-agents cannot spawn sub-agents | +| Fast tier unavailable (Haiku rate limit / outage) | Medium | Fallback to default tier for delegation; log the fallback cost increase | +| Compaction summaries lose critical context | High | Keep last 4 turns intact; allow user to adjust `keep_turns`; log what was compacted | +| Memory injection bloats system prompt | Medium | Hard cap on injected memory tokens; truncate oldest entries | +| WhatsApp auth flow is fragile | Medium | Defer WhatsApp to last; use battle-tested Baileys library | +| Brave Search free tier limits (2k/month) | Low | SearXNG as free self-hosted fallback | +| Background processes leak resources | Medium | Max process limit, auto-kill timeout, shutdown cleanup | +| HTML extraction fails on JS-heavy sites | Low | Accept graceful degradation; defer CDP/browser fallback to P3 |