Flynn P0 + P1 Implementation Plan

Date: 2026-02-06

Scope: 7 features from the gap analysis, covering the functionally critical (P0) and high-impact (P1) items.

Prerequisite: Feature Gap Analysis


Feature Summary

| # | Feature | Priority | Est. Effort | Dependencies |
| --- | --- | --- | --- | --- |
| 0 | Multi-model sub-agent delegation | P0 | 3-4 days | None (foundational) |
| 1 | Context compaction | P0 | 2-3 days | #0 (uses cheap model for summaries) |
| 2 | Memory system | P0 | 3-4 days | #0, #1 |
| 3 | Messaging channels (WhatsApp, Discord, Slack) | P1 | 2-3 days each | None |
| 4 | Web search tool | P1 | 0.5 day | None |
| 5 | Background exec / process management | P1 | 1-2 days | None |
| 6 | Enhanced web_fetch | P1 | 1 day | None |

Total estimated effort: 15-22 days


Phase 0: Multi-Model Sub-Agent Delegation (P0 — Foundational)

Problem

Flynn currently runs a single NativeAgent per session that talks to one model tier at a time. The ModelRouter (src/models/router.ts) supports tiers (fast/default/complex/local) and a fallback chain, but:

  • There is no concept of sub-agents — the primary agent can't spawn a cheaper model for a subtask.
  • Model selection is per-session (via /model command), not per-task.
  • Compaction summaries, memory extraction, and classification tasks all use the same expensive model as the main conversation — wasteful.
  • There is no orchestrator pattern where an expensive model (Opus) plans and delegates to cheaper models (Sonnet, Haiku) for execution.

Model Tier Mapping

| Tier | Model | Use For |
| --- | --- | --- |
| complex (orchestrator) | Claude Opus 4.6 | Planning, orchestration, complex reasoning, multi-step decisions |
| default (worker) | Claude Sonnet 4.5 | General conversation, tool use, code generation, channel adapters |
| fast (utility) | Claude Haiku 4.5 | Compaction summaries, memory extraction, classification, keyword extraction, formatting |

This maps directly to Flynn's existing ModelTier type. The infrastructure is already there — what's missing is the delegation mechanism.

Design

Sub-agent spawning

Add the ability for NativeAgent to spawn ephemeral sub-agents that run a single task on a specific model tier and return the result:

interface SubAgentRequest {
  /** Which model tier to use for this subtask. */
  tier: ModelTier;
  /** System prompt for the sub-agent (task-specific). */
  systemPrompt: string;
  /** The task message. */
  message: string;
  /** Max tokens for the response. */
  maxTokens?: number;
  /** Whether to include tools. Default: false (most subtasks are pure text). */
  tools?: boolean;
}

interface SubAgentResult {
  content: string;
  usage: TokenUsage;
  tier: ModelTier;
}

The sub-agent is stateless — no session, no history, just a single request/response. It's a thin wrapper around modelRouter.chat() with a specific tier.

Where delegation happens

| Task | Delegated to | Reason |
| --- | --- | --- |
| Compaction summary | fast (Haiku) | Summarisation is a well-defined extraction task; doesn't need complex reasoning |
| Memory fact extraction | fast (Haiku) | Simple extraction from conversation text |
| Message classification | fast (Haiku) | "Is this a command, question, or statement?" is a trivial call |
| Tool result summarisation | fast (Haiku) | Condense verbose tool output before feeding back |
| Primary conversation | default (Sonnet) | General-purpose agent work |
| Complex planning/reasoning | complex (Opus) | Multi-step planning, architecture decisions, ambiguous requests |
| Sub-agent orchestration | complex (Opus) | When the agent decides to break a task into subtasks |

Automatic tier escalation

Add optional auto-escalation where the primary agent (Sonnet) can recognise it's struggling and escalate to Opus:

  1. If the agent hits maxIterations without completing the task → escalate to complex.
  2. If the agent's response contains explicit uncertainty markers ("I'm not sure", "This is beyond...") → offer escalation.
  3. Configurable: auto_escalate: true in config.

This is a future enhancement — start with explicit delegation points (compaction, memory extraction) and add auto-escalation later.
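
The two escalation triggers can be sketched as a pure decision function. This is a minimal illustration, not Flynn's implementation: `EscalationInput`, `EscalationDecision`, `decideEscalation`, and the marker list are hypothetical names, and the markers are just the two examples quoted in the list.

```typescript
// Hypothetical shapes for illustration; none of these names exist in Flynn today.
type ModelTier = "fast" | "default" | "complex" | "local";

interface EscalationInput {
  autoEscalate: boolean;  // mirrors the proposed `auto_escalate` config flag
  iterations: number;     // tool-loop iterations used on this turn
  maxIterations: number;  // the agent's iteration cap
  completed: boolean;     // whether the agent finished the task
  responseText: string;   // the agent's final (or last) response
}

type EscalationDecision =
  | { action: "none" }
  | { action: "escalate"; tier: ModelTier; reason: string }
  | { action: "offer"; tier: ModelTier; reason: string };

// Assumption: a simple substring match on the two example phrases from the plan.
const UNCERTAINTY_MARKERS = ["i'm not sure", "this is beyond"];

function decideEscalation(input: EscalationInput): EscalationDecision {
  // Rule 3: everything is gated behind the config flag.
  if (!input.autoEscalate) return { action: "none" };

  // Rule 1: the loop ran out of iterations without finishing.
  if (!input.completed && input.iterations >= input.maxIterations) {
    return { action: "escalate", tier: "complex", reason: "hit maxIterations" };
  }

  // Rule 2: explicit uncertainty only offers escalation, it does not force it.
  const text = input.responseText.toLowerCase();
  const marker = UNCERTAINTY_MARKERS.find((m) => text.includes(m));
  if (marker !== undefined) {
    return { action: "offer", tier: "complex", reason: `uncertainty marker: "${marker}"` };
  }
  return { action: "none" };
}
```

Keeping the decision separate from the agent loop would make it easy to unit-test before the feature is switched on.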

AgentOrchestrator class

Create a new AgentOrchestrator that sits between the channel message handler and the NativeAgent:

class AgentOrchestrator {
  private primaryAgent: NativeAgent;   // default tier (Sonnet)
  private modelRouter: ModelRouter;

  /** Spawn a sub-agent for a single-turn task on a specific tier. */
  async delegate(request: SubAgentRequest): Promise<SubAgentResult>;

  /** Process a user message — delegates to primary agent, which may internally delegate subtasks. */
  async process(userMessage: string): Promise<string>;
}

The orchestrator replaces the current direct NativeAgent usage in the message router (src/daemon/index.ts:139-186).

Passing the orchestrator to tools and compaction

The key insight: compaction and memory extraction don't need a new agent class — they just need access to modelRouter.chat(request, 'fast'). The orchestrator provides a delegate() method that any subsystem can call:

// In compaction.ts
const summary = await orchestrator.delegate({
  tier: 'fast',
  systemPrompt: COMPACTION_SYSTEM_PROMPT,
  message: `Summarise this conversation:\n\n${messagesToCompact}`,
  maxTokens: 1024,
});

// In memory extraction
const facts = await orchestrator.delegate({
  tier: 'fast',
  systemPrompt: MEMORY_EXTRACTION_PROMPT,
  message: `Extract key facts from:\n\n${summary}`,
  maxTokens: 512,
});

New files

File Purpose
src/backends/native/orchestrator.ts AgentOrchestrator — sub-agent spawning and delegation
src/backends/native/prompts.ts System prompts for delegated tasks (compaction, extraction, classification)

Changes to existing files

File Change
src/backends/native/agent.ts Accept optional orchestrator reference for internal delegation. Add delegateSubtask() method.
src/daemon/index.ts Replace direct NativeAgent creation in createMessageRouter() with AgentOrchestrator.
src/config/schema.ts Add agents config block for tier assignment and delegation policy.
src/models/router.ts No changes needed — already supports chat(request, tier).

Config additions

agents:
  primary_tier: default              # Model tier for main conversation (Sonnet)
  delegation:
    compaction: fast                  # Tier for compaction summaries (Haiku)
    memory_extraction: fast           # Tier for memory fact extraction (Haiku)
    classification: fast              # Tier for message classification (Haiku)
    tool_summarisation: fast          # Tier for condensing tool output (Haiku)
    complex_reasoning: complex        # Tier for escalated reasoning (Opus)
  auto_escalate: false               # Future: auto-escalate on failure
  max_delegation_depth: 3            # Prevent infinite delegation chains

Implementation steps

  1. Create src/backends/native/orchestrator.ts:
    • Constructor takes ModelRouter, systemPrompt, session, toolRegistry, toolExecutor, delegation config.
    • delegate(request: SubAgentRequest): Promise<SubAgentResult> — single-turn call to modelRouter.chat() with specified tier.
    • process(userMessage: string): Promise<string> — delegates to internal NativeAgent.
    • Tracks delegation depth to prevent loops.
    • Logs tier usage for cost visibility.
  2. Create src/backends/native/prompts.ts with task-specific system prompts.
  3. Update createMessageRouter() in src/daemon/index.ts to use AgentOrchestrator instead of raw NativeAgent.
  4. Add agents config block to schema.
  5. Wire delegation config through to compaction (Phase 1) and memory (Phase 2).
  6. Tests: delegation routing, tier selection, depth limiting.
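
As a rough sketch of step 1, the delegation path with depth limiting and per-tier usage counting might look like the following. `DelegationSketch`, `ChatRouter`, and the request shape passed to `chat()` are assumptions made for this example; the real `ModelRouter.chat()` signature may differ.

```typescript
type ModelTier = "fast" | "default" | "complex" | "local";

interface TokenUsage { inputTokens: number; outputTokens: number; }

interface SubAgentRequest {
  tier: ModelTier;
  systemPrompt: string;
  message: string;
  maxTokens?: number;
}

interface SubAgentResult { content: string; usage: TokenUsage; tier: ModelTier; }

// Assumed minimal view of ModelRouter; only the (request, tier) call shape comes from the plan.
interface ChatRouter {
  chat(
    request: { system: string; messages: { role: "user"; content: string }[]; maxTokens?: number },
    tier: ModelTier,
  ): Promise<{ content: string; usage: TokenUsage }>;
}

class DelegationSketch {
  private depth = 0;
  readonly callsByTier = new Map<ModelTier, number>();
  private router: ChatRouter;
  private maxDepth: number;

  constructor(router: ChatRouter, maxDepth = 3) {
    this.router = router;
    this.maxDepth = maxDepth;
  }

  /** Single-turn, stateless call on the requested tier. */
  async delegate(request: SubAgentRequest): Promise<SubAgentResult> {
    if (this.depth >= this.maxDepth) {
      throw new Error(`delegation depth limit (${this.maxDepth}) reached`);
    }
    this.depth += 1;
    try {
      const response = await this.router.chat(
        {
          system: request.systemPrompt,
          messages: [{ role: "user", content: request.message }],
          maxTokens: request.maxTokens,
        },
        request.tier,
      );
      // Count calls per tier so cost can be reported later.
      this.callsByTier.set(request.tier, (this.callsByTier.get(request.tier) ?? 0) + 1);
      return { content: response.content, usage: response.usage, tier: request.tier };
    } finally {
      this.depth -= 1; // always unwind, even when the model call throws
    }
  }
}
```

One simplification to be aware of: a shared counter also counts delegations that run concurrently, not only nested ones. Passing the current depth along with each request would likely be more precise.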

Cost implications

| Operation | Without delegation | With delegation |
| --- | --- | --- |
| Compaction summary | Opus/Sonnet ($$$) | Haiku ($) |
| Memory extraction | Opus/Sonnet ($$$) | Haiku ($) |
| 10 classifications | Opus/Sonnet ($$$) | Haiku ($) |
| Complex reasoning | Sonnet ($$) | Opus ($$$), but only when needed |

Net effect: significant cost reduction for background tasks, with targeted spend on complex reasoning only when it matters.


Phase 1: Context Compaction (P0)

Problem

Flynn sends the entire session history to the model on every turn. There is no summarisation, trimming, or token budgeting. Once a conversation exceeds the model's context window, it fails hard.

Current flow (src/backends/native/agent.ts:92-165):

toolLoop() → loopMessages = full this.history → send to model

The SessionStore (src/session/store.ts) and ManagedSession (src/session/manager.ts) store every message verbatim and replay them all on load.

Design

Token counting

Add a tokenCount utility that estimates token counts per message. Two strategies:

  1. Cheap estimate — character-based heuristic (chars / 4 for English). Good enough for budgeting.
  2. Accurate count — use the Anthropic SDK's count_tokens or tiktoken for OpenAI. Only needed if we want precise billing.

Start with the cheap estimate; add accurate counting later behind a flag.
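
A minimal sketch of the cheap estimate, assuming a simplified `Message` shape and a small per-message overhead constant (both are assumptions for illustration, not values from Flynn):

```typescript
// Simplified message shape for illustration; Flynn's real Message type may carry more fields.
interface Message {
  role: "system" | "user" | "assistant" | "tool";
  content: string;
}

const CHARS_PER_TOKEN = 4;       // heuristic from the plan (roughly right for English)
const PER_MESSAGE_OVERHEAD = 4;  // assumption: a few tokens of role/formatting overhead per message

/** Character-based estimate; rounds up so short strings never count as zero. */
function estimateTokens(text: string): number {
  return Math.ceil(text.length / CHARS_PER_TOKEN);
}

/** Sum of per-message estimates plus a fixed overhead for each message. */
function estimateMessageTokens(messages: Message[]): number {
  return messages.reduce(
    (sum, m) => sum + estimateTokens(m.content) + PER_MESSAGE_OVERHEAD,
    0,
  );
}
```

The heuristic tends to undercount for code and non-English text, which is one reason to trigger compaction well below the hard context limit.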

Compaction strategy

Use a summarise-and-replace approach (same as OpenClaw):

  1. When total estimated tokens exceed a compaction threshold (configurable, default: 80% of model's context window), trigger compaction.
  2. Take all messages except the last N turns (configurable, default: 4 turns).
  3. Delegate the summarisation request to the fast tier (Haiku) via orchestrator.delegate(): "Summarise this conversation so far, preserving key facts, decisions, and context." This is a well-defined extraction task that doesn't need complex reasoning.
  4. Replace the older messages with a single [system_summary] message.
  5. Persist the compacted history to SQLite (replace the old messages).
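
Steps 1, 2, and 4 are plain list manipulation and can be sketched without any model call. The helper names below are hypothetical, and the definition of a "turn" (a user message plus everything up to the next user message) is an assumption, since the plan does not define it.

```typescript
type Role = "system" | "user" | "assistant" | "tool";
interface Message { role: Role; content: string; }

/** Step 1: trigger when the estimate exceeds thresholdPct of the context window. */
function shouldCompact(estimatedTokens: number, contextWindow: number, thresholdPct = 80): boolean {
  // Integer comparison avoids floating-point surprises at the boundary.
  return estimatedTokens * 100 > contextWindow * thresholdPct;
}

/** Step 2: split history into the part to summarise and the last `keepTurns` turns. */
function splitForCompaction(
  messages: Message[],
  keepTurns = 4,
): { toCompact: Message[]; toKeep: Message[] } {
  if (keepTurns <= 0) return { toCompact: messages.slice(), toKeep: [] };
  let turnsSeen = 0;
  let cut = 0; // index of the first kept message; 0 means "keep everything"
  for (let i = messages.length - 1; i >= 0; i--) {
    if (messages[i].role !== "user") continue;
    turnsSeen += 1;
    if (turnsSeen === keepTurns) {
      cut = i;
      break;
    }
  }
  return { toCompact: messages.slice(0, cut), toKeep: messages.slice(cut) };
}

/** Step 4: replace the older messages with a single summary message. */
function applySummary(summary: string, toKeep: Message[]): Message[] {
  return [{ role: "system", content: `[system_summary] ${summary}` }, ...toKeep];
}
```

When the history holds fewer than `keepTurns` user turns, `toCompact` comes back empty and the caller can skip the summary request entirely.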

Where compaction runs

Compaction is a concern of AgentOrchestrator (Phase 0), not the session store. The orchestrator decides when to compact based on the model it's using, and delegates the summary generation to the fast tier via orchestrator.delegate({ tier: 'fast', ... }).

New files

File Purpose
src/context/tokens.ts Token estimation utilities
src/context/compaction.ts Compaction logic (summarise + replace)

Changes to existing files

File Change
src/backends/native/agent.ts Add compactIfNeeded() call before building loopMessages. Add compaction config to NativeAgentConfig.
src/session/manager.ts Add ManagedSession.replaceHistory(messages) method for compaction to persist the compacted state.
src/session/store.ts Add replaceMessages(sessionId, messages) — atomic delete + re-insert in a transaction.
src/models/types.ts Add optional contextWindow field to ChatResponse or create a ModelCapabilities type.
src/config/schema.ts Add compaction config block: { enabled, threshold_pct, keep_turns, summary_model? }.
src/daemon/index.ts Pass compaction config to agent creation.

Config additions

compaction:
  enabled: true
  threshold_pct: 80          # Trigger at 80% of context window
  keep_turns: 4              # Always keep the last 4 exchanges
  # summary_tier is configured in agents.delegation.compaction (default: fast/Haiku)

Chat commands

Command Description
/compact Force compaction of the current session immediately.

Implementation steps

  1. Create src/context/tokens.ts with estimateTokens(text: string): number and estimateMessageTokens(messages: Message[]): number.
  2. Create src/context/compaction.ts with compactHistory(opts: CompactionOpts): Promise<Message[]>:
    • Takes messages, orchestrator (for delegation), keep_turns.
    • Calls orchestrator.delegate({ tier: 'fast', ... }) for the summary.
    • Returns [summaryMessage, ...recentMessages].
  3. Add replaceMessages() to SessionStore.
  4. Add replaceHistory() to ManagedSession.
  5. Add compaction config to schema.
  6. Wire compactIfNeeded() into AgentOrchestrator.process() — called before building the request, checks token budget.
  7. Add /compact command handling in the message router.
  8. Tests: token estimation accuracy, compaction trigger logic, history replacement, delegation to fast tier.

Model context window sizes

Hard-code a lookup table in src/context/tokens.ts:

const CONTEXT_WINDOWS: Record<string, number> = {
  'claude-sonnet-4-20250514': 200_000,
  'claude-3-5-haiku-20241022': 200_000,
  'gpt-4o': 128_000,
  'gpt-4o-mini': 128_000,
  // ... etc
};

Allow override in config: models.default.context_window: 128000.
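
One possible resolution order is config override first, then the lookup table, then a conservative default. The function name and the fallback value below are assumptions; the table entries repeat the ones listed above.

```typescript
const CONTEXT_WINDOWS: Record<string, number> = {
  "claude-sonnet-4-20250514": 200_000,
  "claude-3-5-haiku-20241022": 200_000,
  "gpt-4o": 128_000,
  "gpt-4o-mini": 128_000,
};

// Assumption: a deliberately small default so unknown models compact early rather than overflow.
const FALLBACK_CONTEXT_WINDOW = 32_000;

/** Resolve the context window for a model, preferring an explicit config override. */
function contextWindowFor(model: string, configOverride?: number): number {
  if (configOverride !== undefined && configOverride > 0) return configOverride;
  return CONTEXT_WINDOWS[model] ?? FALLBACK_CONTEXT_WINDOW;
}
```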


Phase 2: Memory System (P0)

Problem

Flynn has no persistent knowledge across sessions. Every new session starts blank. The agent can't remember user preferences, past decisions, or accumulated knowledge.

Design

A lightweight memory system with three layers:

  1. Memory files — Markdown files that the agent can read/write (like OpenClaw's MEMORY.md).
  2. Memory tools: memory.read, memory.write, memory.search builtin tools.
  3. Auto-indexing — After compaction, key facts are extracted and appended to memory.

Storage

Use a dedicated SQLite table in the existing sessions.db (or a separate memory.db):

CREATE TABLE memory_entries (
  id INTEGER PRIMARY KEY AUTOINCREMENT,
  session_id TEXT,           -- NULL for global memories
  namespace TEXT NOT NULL,   -- 'user', 'facts', 'preferences', etc.
  key TEXT NOT NULL,
  content TEXT NOT NULL,
  embedding BLOB,            -- Future: vector embedding for search
  created_at INTEGER NOT NULL DEFAULT (unixepoch()),
  updated_at INTEGER NOT NULL DEFAULT (unixepoch())
);
CREATE INDEX idx_memory_ns ON memory_entries(namespace);
CREATE INDEX idx_memory_session ON memory_entries(session_id);

Phase 2a: File-based memory (MVP)

The simplest useful memory: a markdown file per namespace in ~/.local/share/flynn/memory/.

~/.local/share/flynn/memory/
├── global.md          # Cross-session knowledge
├── user.md            # User preferences, facts about the user
└── sessions/
    └── {session_id}.md  # Per-session notes

Memory tools

Tool Description
memory.read Read a memory file by namespace. Args: { namespace: string }
memory.write Append to or replace a memory file. Args: { namespace: string, content: string, mode: 'append' | 'replace' }
memory.search Search across all memory files for a keyword. Args: { query: string }. Returns matching lines with context.
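
The keyword search behind memory.search could be as simple as a case-insensitive line scan. The sketch below works on in-memory file contents keyed by namespace so that it stays independent of the filesystem. The `SearchResult` fields and the `searchMemory` name are assumptions for illustration.

```typescript
interface SearchResult {
  namespace: string;  // which memory file the match came from
  line: number;       // 1-based line number of the match
  match: string;      // the matching line
  context: string[];  // the matching line plus surrounding lines
}

/** Case-insensitive substring search across memory files, with N lines of context. */
function searchMemory(
  files: Record<string, string>,
  query: string,
  contextLines = 1,
): SearchResult[] {
  const needle = query.trim().toLowerCase();
  if (needle === "") return [];
  const results: SearchResult[] = [];
  for (const namespace of Object.keys(files)) {
    const lines = files[namespace].split(/\r?\n/);
    lines.forEach((line, i) => {
      if (!line.toLowerCase().includes(needle)) return;
      const start = Math.max(0, i - contextLines);
      const end = Math.min(lines.length, i + contextLines + 1);
      results.push({ namespace, line: i + 1, match: line, context: lines.slice(start, end) });
    });
  }
  return results;
}
```

A file-backed MemoryStore could read each namespace file and pass the contents through the same function.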

Phase 2b: Vector search (future)

Defer vector embeddings and semantic search to a later phase. The file-based approach with keyword search covers 80% of use cases.

When implemented:

  • Add sqlite-vec or similar for vector storage
  • Embed memory entries on write using the configured model's embedding API
  • Hybrid search: keyword (BM25) + vector similarity

System prompt integration

On every agent turn, inject a [Memory Context] section into the system prompt:

# Memory Context

The following is your persistent memory. Use it to maintain continuity across sessions.

## User
{contents of user.md, truncated to ~1000 tokens}

## Global
{contents of global.md, truncated to ~1000 tokens}

This is injected dynamically by the agent before each request, not baked into the static system prompt.
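
A sketch of the injection step, under two assumptions: token budgets are enforced with the same chars/4 estimate used for compaction, and truncation keeps the end of each file, since appended entries put the newest facts last. The function names are hypothetical.

```typescript
const CHARS_PER_TOKEN = 4; // same rough heuristic as the token estimator

/** Keep roughly the last `maxTokens` worth of text, dropping the oldest content first. */
function truncateToTokens(text: string, maxTokens: number): string {
  const maxChars = maxTokens * CHARS_PER_TOKEN;
  if (text.length <= maxChars) return text;
  return "[older entries truncated]\n" + text.slice(text.length - maxChars);
}

/** Render the Memory Context section from the user and global memory files. */
function buildMemoryContext(
  memory: { user: string; global: string },
  tokensPerSection = 1000,
): string {
  return [
    "# Memory Context",
    "",
    "The following is your persistent memory. Use it to maintain continuity across sessions.",
    "",
    "## User",
    truncateToTokens(memory.user.trim(), tokensPerSection),
    "",
    "## Global",
    truncateToTokens(memory.global.trim(), tokensPerSection),
  ].join("\n");
}

/** Combine the static system prompt with freshly loaded memory for one request. */
function buildSystemPrompt(staticPrompt: string, memory: { user: string; global: string }): string {
  return `${staticPrompt}\n\n${buildMemoryContext(memory)}`;
}
```

Rebuilding the prompt on each request means a memory.write made mid-session is visible on the very next turn.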

Auto-extraction after compaction

When compaction runs (Phase 1), add a follow-up step using the fast tier (Haiku) via orchestrator.delegate():

  1. Along with the summary, delegate to Haiku to extract any new facts worth remembering (user preferences, decisions, names, etc.). This is a simple extraction task — no need for Sonnet/Opus.
  2. Append extracted facts to user.md or global.md.

This creates a natural knowledge accumulation loop: conversation → compaction (Haiku) → memory extraction (Haiku) → next session gets richer context.

The cost of these background operations is minimal since they run on the cheapest model tier.

New files

File Purpose
src/memory/store.ts MemoryStore class — read/write/search markdown files
src/memory/index.ts Exports
src/tools/builtin/memory-read.ts memory.read tool
src/tools/builtin/memory-write.ts memory.write tool
src/tools/builtin/memory-search.ts memory.search tool

Changes to existing files

File Change
src/tools/builtin/index.ts Register memory tools in allBuiltinTools
src/backends/native/orchestrator.ts Inject memory context into system prompt before each request
src/context/compaction.ts Add memory extraction step after summarisation (delegates to fast tier)
src/daemon/index.ts Initialize MemoryStore, pass to orchestrator config
src/config/schema.ts Add memory config block: { enabled, dir, namespaces, auto_extract }

Config additions

memory:
  enabled: true
  dir: ~/.local/share/flynn/memory
  auto_extract: true         # Extract facts during compaction
  max_context_tokens: 2000   # Max tokens injected per turn from memory

Implementation steps

  1. Create src/memory/store.ts:
    • read(namespace): string — read file contents
    • write(namespace, content, mode): void — append or replace
    • search(query): SearchResult[] — line-by-line keyword match with context
    • listNamespaces(): string[]
  2. Create memory tools (3 files).
  3. Register tools.
  4. Add memory context injection to AgentOrchestrator (src/backends/native/orchestrator.ts): load memory before building the request and inject it into the system prompt.
  5. Add memory extraction to compaction flow.
  6. Tests: memory CRUD, search, injection, extraction.

Phase 3: Messaging Channels (P1)

Problem

Flynn has only Telegram and WebChat. The three most requested channels are WhatsApp, Discord, and Slack.

Design approach

Flynn's ChannelAdapter interface (src/channels/types.ts:51-69) is clean and well-defined. Adding a new channel means:

  1. Implement ChannelAdapter (six members: name, status, connect(), disconnect(), send(), onMessage()).
  2. Add config section.
  3. Register in daemon startup.

Each channel is independent — implement in any order.

3a: Discord

Library: discord.js v14

Effort: 1-2 days

Config

discord:
  bot_token: ${DISCORD_BOT_TOKEN}
  allowed_guild_ids: []      # Empty = all guilds
  allowed_channel_ids: []    # Empty = all channels

New files

File Purpose
src/channels/discord/adapter.ts DiscordAdapter implementing ChannelAdapter
src/channels/discord/index.ts Exports

Key decisions

  • Peer ID: Use channelId (not userId) so the agent maintains separate sessions per Discord channel.
  • Message chunking: Discord has a 2000-char limit. Chunk long responses.
  • Mentions: Only respond when mentioned (@Flynn) or in DMs. Configurable.
  • Slash commands: Register /reset and /status as Discord slash commands.

Implementation steps

  1. Add discord.js dependency.
  2. Create DiscordAdapter class.
  3. Add config schema for discord section.
  4. Register in daemon if config.discord.bot_token is set.
  5. Export from src/channels/index.ts.
  6. Test with a bot in a private server.

3b: Slack

Library: @slack/bolt (Bolt for JavaScript)

Effort: 1-2 days

Config

slack:
  bot_token: ${SLACK_BOT_TOKEN}
  app_token: ${SLACK_APP_TOKEN}   # For Socket Mode
  signing_secret: ${SLACK_SIGNING_SECRET}
  allowed_channel_ids: []

New files

File Purpose
src/channels/slack/adapter.ts SlackAdapter implementing ChannelAdapter
src/channels/slack/index.ts Exports

Key decisions

  • Socket Mode for self-hosted deployments (no public URL needed). Falls back to HTTP events if app_token not set.
  • Peer ID: channelId:threadTs to isolate threaded conversations.
  • Message chunking: Slack has a 40,000-char limit with blocks. Use mrkdwn formatting.
  • Slash commands: /flynn-reset, /flynn-status.

3c: WhatsApp

Library: whatsapp-web.js (or @whiskeysockets/baileys for full WhatsApp Web protocol)

Effort: 2-3 days (more complex due to QR auth)

Config

whatsapp:
  auth_dir: ~/.local/share/flynn/whatsapp-auth
  allowed_numbers: []        # E.164 format, empty = all

Key decisions

  • Auth flow: WhatsApp Web requires QR code scanning on first connect. Display QR in terminal on startup.
  • Session persistence: Store auth state in auth_dir so re-auth isn't needed on restart.
  • Peer ID: Phone number (E.164).
  • Media: Start with text-only; defer image/audio handling.

WhatsApp is the most complex channel. Consider doing Discord and Slack first, then WhatsApp.

Shared channel infrastructure

Before implementing individual channels, extract any common patterns:

  1. Message chunking utility: src/channels/utils/chunking.ts, exporting chunkMessage(text: string, maxLen: number): string[]
  2. Allowlist checking: src/channels/utils/auth.ts, exporting isAllowed(senderId: string, allowlist: string[]): boolean
  3. Markdown adaptation: src/channels/utils/markdown.ts, with platform-specific markdown conversion (Discord uses different syntax from Telegram).
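
The first two utilities are small enough to sketch in full. The signatures come from the list above. The break-point strategy (newline, then space, then hard split) and the "empty allowlist allows everyone" rule are assumptions, the latter taken from the "Empty = all" comments in the channel configs.

```typescript
/** Split text into chunks of at most maxLen characters, preferring natural break points. */
function chunkMessage(text: string, maxLen: number): string[] {
  if (maxLen <= 0) throw new RangeError("maxLen must be positive");
  const chunks: string[] = [];
  let rest = text;
  while (rest.length > maxLen) {
    // Prefer the last newline within the limit, then the last space.
    let cut = rest.lastIndexOf("\n", maxLen);
    if (cut <= 0) cut = rest.lastIndexOf(" ", maxLen);
    if (cut <= 0) cut = maxLen; // no break point found: hard split
    chunks.push(rest.slice(0, cut));
    // Drop the whitespace we broke on so the next chunk starts cleanly.
    rest = rest.slice(cut).replace(/^\s+/, "");
  }
  if (rest.length > 0) chunks.push(rest);
  return chunks;
}

/** Empty allowlist means "allow everyone", matching the channel config comments. */
function isAllowed(senderId: string, allowlist: string[]): boolean {
  return allowlist.length === 0 || allowlist.includes(senderId);
}
```

For Discord this would be called as `chunkMessage(reply, 2000)`. Note that this sketch does not try to keep fenced code blocks intact across chunks, which a production version would probably want.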

Phase 4: Web Search Tool (P1)

Problem

The agent has no way to search the web. This is one of the most commonly-used agent tools.

Design

Provider options

| Provider | Pros | Cons |
| --- | --- | --- |
| Brave Search API | Free tier (2k/month), clean API, good results | Requires API key signup |
| SearXNG | Self-hosted, no API key, already running in homelab | Results quality varies |
| Tavily | Purpose-built for AI agents, great results | Paid only |
| DuckDuckGo | No API key needed | Unofficial API, rate limits |

Recommendation: Support Brave as primary, SearXNG as self-hosted alternative. Make the provider configurable.

Config

tools:
  web_search:
    provider: brave           # brave | searxng | tavily
    api_key: ${BRAVE_SEARCH_API_KEY}
    endpoint: null            # Override for SearXNG: http://searxng:8080
    max_results: 5

New files

File Purpose
src/tools/builtin/web-search.ts web.search tool

Tool interface

{
  name: 'web.search',
  description: 'Search the web for information. Returns titles, URLs, and snippets.',
  inputSchema: {
    type: 'object',
    properties: {
      query: { type: 'string', description: 'Search query' },
      count: { type: 'number', description: 'Number of results (default 5, max 20)' },
    },
    required: ['query'],
  },
}

Output format

1. **Title** — url
   Snippet text...

2. **Title** — url
   Snippet text...

Structured as markdown so the model can easily parse and reference results.
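
A provider-independent formatter for this output might look like the sketch below. `SearchHit` and `formatResults` are hypothetical names. The count clamp follows the "default 5, max 20" wording in the tool schema.

```typescript
// Normalised result shape; each provider backend would map its response into this.
interface SearchHit {
  title: string;
  url: string;
  snippet: string;
}

const SEPARATOR = "\u2014"; // the dash used between title and URL in the output format

/** Render hits as a numbered markdown list, clamping count to the 1..20 range. */
function formatResults(hits: SearchHit[], count = 5): string {
  const requested = Number.isFinite(count) ? Math.floor(count) : 5;
  const limit = Math.min(Math.max(1, requested), 20);
  if (hits.length === 0) return "No results found.";
  return hits
    .slice(0, limit)
    .map((h, i) => `${i + 1}. **${h.title}** ${SEPARATOR} ${h.url}\n   ${h.snippet}`)
    .join("\n\n");
}
```

Keeping the formatter separate from the Brave and SearXNG clients would let both backends share one output path and one set of tests.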

Implementation steps

  1. Create src/tools/builtin/web-search.ts.
  2. Add Brave Search API client (simple fetch — no SDK needed).
  3. Add SearXNG support as alternative backend.
  4. Add tool config section to schema.
  5. Register in allBuiltinTools.
  6. Tests: mock API responses, result formatting.

Phase 5: Background Exec / Process Management (P1)

Problem

Flynn's shell.exec (src/tools/builtin/shell.ts) is blocking: it runs a command, waits for it to finish (up to 30s timeout), and returns stdout/stderr. There's no way to:

  • Run a long-running process (e.g., npm run dev)
  • Check on a running process
  • Read its ongoing output
  • Kill it

Design

Add a process tool family that manages background processes:

Tool Description
process.start Start a command in the background. Returns a process ID.
process.status Check if a process is running, exited, or errored.
process.output Read recent stdout/stderr from a background process.
process.kill Kill a background process.
process.list List all managed background processes.

Process manager

Create a ProcessManager class that maintains a registry of spawned processes:

interface ManagedProcess {
  id: string;
  command: string;
  cwd?: string;
  pid: number;
  status: 'running' | 'exited' | 'killed' | 'error';
  exitCode?: number;
  outputBuffer: RingBuffer;  // Last N bytes of combined stdout+stderr
  startedAt: number;
}

Output buffering

Use a ring buffer (circular buffer) to keep the last 64KB of output per process. This prevents memory leaks from long-running processes with verbose output.
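
A minimal byte-oriented ring buffer could look like this. It is a sketch under the assumption that output arrives as `Uint8Array` chunks (Node's `Buffer` is a subclass, so stdout/stderr data events would fit); the method names are illustrative.

```typescript
/** Fixed-capacity byte buffer that keeps only the most recent `capacity` bytes. */
class RingBuffer {
  private buf: Uint8Array;
  private start = 0;   // index of the oldest byte
  private length = 0;  // number of valid bytes currently stored

  constructor(capacity: number = 64 * 1024) {
    if (!Number.isInteger(capacity) || capacity <= 0) {
      throw new RangeError("capacity must be a positive integer");
    }
    this.buf = new Uint8Array(capacity);
  }

  /** Append a chunk, overwriting the oldest bytes once the buffer is full. */
  write(chunk: Uint8Array): void {
    const cap = this.buf.length;
    // Only the last `cap` bytes of an oversized chunk can survive.
    const src = chunk.length > cap ? chunk.subarray(chunk.length - cap) : chunk;
    for (let i = 0; i < src.length; i++) {
      const pos = (this.start + this.length) % cap;
      this.buf[pos] = src[i];
      if (this.length < cap) this.length += 1;
      else this.start = (this.start + 1) % cap; // full: the oldest byte was just overwritten
    }
  }

  /** Return the stored bytes in chronological order. */
  read(): Uint8Array {
    const out = new Uint8Array(this.length);
    for (let i = 0; i < this.length; i++) {
      out[i] = this.buf[(this.start + i) % this.buf.length];
    }
    return out;
  }

  get size(): number {
    return this.length;
  }
}
```

Because the buffer cuts on byte boundaries, the oldest retained byte may fall in the middle of a multi-byte UTF-8 character. Decoding with a lenient decoder, or skipping ahead to the first newline, would avoid surfacing a garbled first line.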

Safety

  • Max processes: Limit to 10 concurrent background processes.
  • Auto-cleanup: Kill processes that have been running for more than 1 hour (configurable).
  • Shutdown cleanup: Kill all managed processes on daemon shutdown.
  • Hook integration: process.start should go through the confirmation engine (same as shell.exec).
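
The first two safety rules reduce to simple checks over the process registry. The sketch below uses a trimmed-down process record and hypothetical helper names; the defaults mirror the limits stated above.

```typescript
// Trimmed-down view of a managed process, for illustration only.
interface TrackedProcess {
  id: string;
  status: "running" | "exited" | "killed" | "error";
  startedAt: number; // epoch milliseconds
}

const MAX_CONCURRENT = 10;              // from the "Max processes" rule
const MAX_RUNTIME_MS = 60 * 60 * 1000;  // from the "Auto-cleanup" rule (1 hour)

/** True when another background process may be started. */
function canStart(processes: TrackedProcess[], maxConcurrent = MAX_CONCURRENT): boolean {
  return processes.filter((p) => p.status === "running").length < maxConcurrent;
}

/** Running processes that have exceeded the runtime limit and should be killed. */
function findExpired(
  processes: TrackedProcess[],
  now: number,
  maxRuntimeMs = MAX_RUNTIME_MS,
): TrackedProcess[] {
  return processes.filter((p) => p.status === "running" && now - p.startedAt > maxRuntimeMs);
}
```

A periodic timer in the ProcessManager could call `findExpired(registry, Date.now())` and kill whatever it returns. Only running processes count toward the concurrency limit, so exited entries can stay in the registry for later process.output calls.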

New files

File Purpose
src/tools/builtin/process/manager.ts ProcessManager class
src/tools/builtin/process/start.ts process.start tool
src/tools/builtin/process/status.ts process.status tool
src/tools/builtin/process/output.ts process.output tool
src/tools/builtin/process/kill.ts process.kill tool
src/tools/builtin/process/list.ts process.list tool
src/tools/builtin/process/index.ts Exports

Changes to existing files

File Change
src/tools/builtin/index.ts Register process tools
src/daemon/index.ts Create ProcessManager, pass to tool constructors, register shutdown handler
src/config/schema.ts Add process config: { max_concurrent, max_runtime_minutes, buffer_size }

Implementation steps

  1. Implement RingBuffer utility (or use an npm package like ringbufferjs).
  2. Create ProcessManager class with spawn, track, kill, cleanup methods.
  3. Implement 5 process tools.
  4. Register tools and wire shutdown cleanup.
  5. Tests: spawn + kill lifecycle, output buffering, max process limits.

Phase 6: Enhanced web_fetch (P1)

Problem

Flynn's web.fetch (src/tools/builtin/web-fetch.ts:19-50) is a bare fetch() call that returns raw HTML. This is nearly useless for LLMs — they need extracted text/markdown, not raw HTML with scripts and styles.

Design

Enhancements

  1. HTML-to-markdown extraction — Strip scripts/styles, convert to markdown using @mozilla/readability + turndown.
  2. Format parameter — Let the agent choose: text, markdown (default), or html.
  3. Response caching — Cache fetched pages for 5 minutes to avoid redundant requests in tool loops.
  4. Redirect following — Already handled by fetch(), but add a max redirect limit.
  5. Content type handling — Return JSON prettified, plain text as-is, HTML converted.
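
Items 2 and 5 combine into a small dispatch on content type and requested format. In this sketch the HTML converters are injected, so it does not depend on any particular extraction library. `Converters` and `renderBody` are hypothetical names.

```typescript
type FetchFormat = "markdown" | "text" | "html";

// The actual HTML extraction is injected; a Readability + turndown pipeline would plug in here.
interface Converters {
  toMarkdown(html: string): string;
  toText(html: string): string;
}

/** Choose how to present a response body based on its Content-Type and the requested format. */
function renderBody(
  contentType: string,
  body: string,
  format: FetchFormat,
  convert: Converters,
): string {
  // Strip parameters such as "; charset=utf-8".
  const mime = contentType.split(";")[0].trim().toLowerCase();

  if (mime === "application/json" || mime.endsWith("+json")) {
    try {
      return JSON.stringify(JSON.parse(body), null, 2); // prettify valid JSON
    } catch {
      return body; // invalid JSON: return as-is rather than fail the tool call
    }
  }

  if (mime === "text/html" || mime === "application/xhtml+xml") {
    if (format === "html") return body;
    return format === "text" ? convert.toText(body) : convert.toMarkdown(body);
  }

  // Plain text and anything unrecognised is returned unchanged.
  return body;
}
```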

Libraries

Package Purpose
turndown HTML → Markdown converter
linkedom Lightweight DOM implementation (for Readability)
@mozilla/readability Extract article content from HTML

Using linkedom instead of jsdom — it's much lighter and sufficient for content extraction.

Tool interface update

{
  name: 'web.fetch',
  description: 'Fetch a URL and extract its content. Returns clean text/markdown by default, not raw HTML.',
  inputSchema: {
    type: 'object',
    properties: {
      url: { type: 'string', description: 'The URL to fetch' },
      format: { type: 'string', enum: ['markdown', 'text', 'html'], description: 'Output format (default: markdown)' },
      timeout: { type: 'number', description: 'Timeout in milliseconds (default 15000)' },
    },
    required: ['url'],
  },
}

Caching

Simple in-memory cache with TTL:

const cache = new Map<string, { content: string; timestamp: number }>();
const CACHE_TTL = 5 * 60 * 1000; // 5 minutes
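
Building on the map above, a small wrapper can handle expiry on read. This is a sketch, not the planned implementation: the `FetchCache` name and the injectable clock (added so the TTL can be tested without waiting) are assumptions.

```typescript
interface CacheEntry {
  content: string;
  timestamp: number; // when the entry was stored, in epoch milliseconds
}

const CACHE_TTL = 5 * 60 * 1000; // 5 minutes

class FetchCache {
  private entries = new Map<string, CacheEntry>();
  private ttlMs: number;
  private now: () => number;

  constructor(ttlMs: number = CACHE_TTL, now: () => number = Date.now) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  /** Return cached content, or undefined when missing or expired. */
  get(key: string): string | undefined {
    const entry = this.entries.get(key);
    if (entry === undefined) return undefined;
    if (this.now() - entry.timestamp >= this.ttlMs) {
      this.entries.delete(key); // evict lazily on read
      return undefined;
    }
    return entry.content;
  }

  set(key: string, content: string): void {
    this.entries.set(key, { content, timestamp: this.now() });
  }
}
```

Since the same URL can be requested in different formats, the cache key would probably need to include the format, for example `${format}:${url}`. Lazy eviction also means entries that are never read again stay in memory, so a size cap or periodic sweep may be worth adding.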

Changes to existing files

File Change
src/tools/builtin/web-fetch.ts Major rewrite — add extraction, caching, format parameter

Implementation steps

  1. Add turndown, linkedom, @mozilla/readability dependencies.
  2. Create extraction pipeline: fetch → parse DOM → readability → turndown → clean markdown.
  3. Add format parameter handling.
  4. Add response caching.
  5. Update tool description to reflect new capabilities.
  6. Tests: extraction from sample HTML, caching behaviour, format handling.

Implementation Order

Week 1:  Phase 0 (Multi-Model Delegation) ─────────────────────── P0 (foundational)
Week 2:  Phase 1 (Context Compaction) ─────────────────────────── P0 (uses delegation)
Week 3:  Phase 2 (Memory System) ──────────────────────────────── P0 (uses delegation)
Week 4:  Phase 4 (Web Search) + Phase 6 (Enhanced web_fetch) ─── P1 (quick wins)
Week 5:  Phase 5 (Process Management) ─────────────────────────── P1
Week 6+: Phase 3 (Channels: Discord → Slack → WhatsApp) ──────── P1

Rationale:

  • Delegation first — Phase 0 is foundational. Compaction and memory both need to delegate subtasks to cheaper models. Building the orchestrator first means Phase 1 and 2 can use it immediately.
  • Compaction and memory are sequential (memory extraction depends on compaction).
  • Web search and enhanced web_fetch are small, independent, and immediately useful — do them as palate cleansers between the big features.
  • Process management is self-contained.
  • Channels are the largest body of work but each is independent — can be done in parallel or interleaved.

Model usage across all phases

| Phase | Primary model (user-facing) | Delegated tasks | Delegation tier |
| --- | --- | --- | --- |
| 0 | Sonnet (default) | Sub-agent infrastructure | N/A (infrastructure) |
| 1 | Sonnet (default) | Compaction summaries | Haiku (fast) |
| 2 | Sonnet (default) | Memory fact extraction | Haiku (fast) |
| 3 | Sonnet (default) | Message classification, markdown adaptation | Haiku (fast) |
| 4 | Sonnet (default) | None (direct API call) | N/A |
| 5 | Sonnet (default) | None | N/A |
| 6 | Sonnet (default) | None | N/A |

Opus (complex) is reserved for user-facing tasks that require deep reasoning — it's never used for background operations.


Testing Strategy

Each phase should include:

  1. Unit tests — Pure logic (token estimation, ring buffer, markdown extraction, memory search).
  2. Integration tests — Tool execution with mocked model responses.
  3. Manual smoke test — Run via TUI and Telegram to verify end-to-end.

Key test files to create:

Test file Covers
src/backends/native/orchestrator.test.ts Delegation routing, tier selection, depth limiting, cost tracking
src/context/tokens.test.ts Token estimation accuracy
src/context/compaction.test.ts Compaction trigger logic, summary replacement, fast-tier delegation
src/memory/store.test.ts Memory CRUD, search
src/tools/builtin/web-search.test.ts API mocking, result formatting
src/tools/builtin/process/manager.test.ts Process lifecycle, cleanup
src/tools/builtin/web-fetch.test.ts HTML extraction, caching

Risk Assessment

| Risk | Impact | Mitigation |
| --- | --- | --- |
| Haiku summaries lose critical context vs Sonnet | High | Validate quality; use detailed extraction prompts; allow per-task tier override in config |
| Delegation depth spirals (agent delegates to agent that delegates...) | Medium | Hard limit max_delegation_depth: 3; sub-agents cannot spawn sub-agents |
| Fast tier unavailable (Haiku rate limit / outage) | Medium | Fallback to default tier for delegation; log the fallback cost increase |
| Compaction summaries lose critical context | High | Keep last 4 turns intact; allow user to adjust keep_turns; log what was compacted |
| Memory injection bloats system prompt | Medium | Hard cap on injected memory tokens; truncate oldest entries |
| WhatsApp auth flow is fragile | Medium | Defer WhatsApp to last; use battle-tested Baileys library |
| Brave Search free tier limits (2k/month) | Low | SearXNG as free self-hosted fallback |
| Background processes leak resources | Medium | Max process limit, auto-kill timeout, shutdown cleanup |
| HTML extraction fails on JS-heavy sites | Low | Accept graceful degradation; defer CDP/browser fallback to P3 |