Flynn P0 + P1 Implementation Plan
Date: 2026-02-06
Scope: 7 features from the gap analysis, covering the functionally critical (P0) and high-impact (P1) items.
Prerequisite: Feature Gap Analysis
Feature Summary
| # | Feature | Priority | Est. Effort | Dependencies |
|---|---|---|---|---|
| 0 | Multi-model sub-agent delegation | P0 | 3–4 days | None (foundational) |
| 1 | Context compaction | P0 | 2–3 days | #0 (uses cheap model for summaries) |
| 2 | Memory system | P0 | 3–4 days | #0, #1 |
| 3 | Messaging channels (WhatsApp, Discord, Slack) | P1 | 1–3 days each (see Phase 3) | None |
| 4 | Web search tool | P1 | 0.5 day | None |
| 5 | Background exec / process management | P1 | 1–2 days | None |
| 6 | Enhanced web_fetch | P1 | 1 day | None |
Total estimated effort: 15–22 days
Phase 0: Multi-Model Sub-Agent Delegation (P0 — Foundational)
Problem
Flynn currently runs a single NativeAgent per session that talks to one model tier at a time. The ModelRouter (src/models/router.ts) supports tiers (fast/default/complex/local) and a fallback chain, but:
- There is no concept of sub-agents — the primary agent can't spawn a cheaper model for a subtask.
- Model selection is per-session (via the /model command), not per-task.
- Compaction summaries, memory extraction, and classification tasks all use the same expensive model as the main conversation, which is wasteful.
- There is no orchestrator pattern where an expensive model (Opus) plans and delegates to cheaper models (Sonnet, Haiku) for execution.
Model Tier Mapping
| Tier | Model | Use For |
|---|---|---|
| complex (orchestrator) | Claude Opus 4.6 | Planning, orchestration, complex reasoning, multi-step decisions |
| default (worker) | Claude Sonnet 4.5 | General conversation, tool use, code generation, channel adapters |
| fast (utility) | Claude Haiku 4.5 | Compaction summaries, memory extraction, classification, keyword extraction, formatting |
This maps directly to Flynn's existing ModelTier type. The infrastructure is already there — what's missing is the delegation mechanism.
Design
Sub-agent spawning
Add the ability for NativeAgent to spawn ephemeral sub-agents that run a single task on a specific model tier and return the result:
interface SubAgentRequest {
  /** Which model tier to use for this subtask. */
  tier: ModelTier;
  /** System prompt for the sub-agent (task-specific). */
  systemPrompt: string;
  /** The task message. */
  message: string;
  /** Max tokens for the response. */
  maxTokens?: number;
  /** Whether to include tools. Default: false (most subtasks are pure text). */
  tools?: boolean;
}

interface SubAgentResult {
  content: string;
  usage: TokenUsage;
  tier: ModelTier;
}
The sub-agent is stateless — no session, no history, just a single request/response. It's a thin wrapper around modelRouter.chat() with a specific tier.
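As a rough illustration of the "thin wrapper" idea, the sketch below implements a stateless `delegate` function. The `ModelRouterLike`, `ChatRequest`, `ChatResponse`, and `TokenUsage` shapes are assumptions made for this example; the real types in Flynn's codebase may differ, and only the `chat(request, tier)` call shape is taken from this plan.

```typescript
// Hypothetical minimal shapes; the real ModelRouter and TokenUsage types may differ.
type ModelTier = "fast" | "default" | "complex" | "local";

interface TokenUsage { inputTokens: number; outputTokens: number }

interface ChatRequest {
  system: string;
  messages: { role: "user" | "assistant"; content: string }[];
  maxTokens?: number;
}

interface ChatResponse { content: string; usage: TokenUsage }

// Assumed signature, based on the statement that the router supports chat(request, tier).
interface ModelRouterLike {
  chat(request: ChatRequest, tier: ModelTier): Promise<ChatResponse>;
}

interface SubAgentRequest {
  tier: ModelTier;
  systemPrompt: string;
  message: string;
  maxTokens?: number;
  tools?: boolean; // ignored in this sketch: sub-agents here are pure text
}

interface SubAgentResult { content: string; usage: TokenUsage; tier: ModelTier }

// Stateless single-turn delegation: no session, no history, one request and one response.
async function delegate(router: ModelRouterLike, request: SubAgentRequest): Promise<SubAgentResult> {
  const response = await router.chat(
    {
      system: request.systemPrompt,
      messages: [{ role: "user", content: request.message }],
      maxTokens: request.maxTokens,
    },
    request.tier,
  );
  return { content: response.content, usage: response.usage, tier: request.tier };
}
```

Because the function takes the router as a parameter, it can be exercised in tests with a stub router and no network access.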
Where delegation happens
| Task | Delegated to | Reason |
|---|---|---|
| Compaction summary | fast (Haiku) | Summarisation is a well-defined extraction task; doesn't need complex reasoning |
| Memory fact extraction | fast (Haiku) | Simple extraction from conversation text |
| Message classification | fast (Haiku) | "Is this a command, question, or statement?" — trivial |
| Tool result summarisation | fast (Haiku) | Condense verbose tool output before feeding back |
| Primary conversation | default (Sonnet) | General-purpose agent work |
| Complex planning/reasoning | complex (Opus) | Multi-step planning, architecture decisions, ambiguous requests |
| Sub-agent orchestration | complex (Opus) | When the agent decides to break a task into subtasks |
Automatic tier escalation
Add optional auto-escalation where the primary agent (Sonnet) can recognise it's struggling and escalate to Opus:
- If the agent hits maxIterations without completing the task → escalate to complex.
- If the agent's response contains explicit uncertainty markers ("I'm not sure", "This is beyond...") → offer escalation.
- Configurable: auto_escalate: true in config.
This is a future enhancement — start with explicit delegation points (compaction, memory extraction) and add auto-escalation later.
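A minimal sketch of how the escalation rules might be encoded, should auto-escalation be built later. The `EscalationInput` shape, the `decideEscalation` name, and the marker list are hypothetical; only the two triggers and the `auto_escalate` switch come from this plan.

```typescript
// Hypothetical input shape for the escalation check.
interface EscalationInput {
  iterations: number;
  maxIterations: number;
  completed: boolean;
  responseText: string;
  autoEscalate: boolean; // mirrors the auto_escalate config flag
}

type EscalationDecision = "none" | "offer" | "escalate";

// Illustrative markers only; a real list would need tuning against actual responses.
const UNCERTAINTY_MARKERS = ["i'm not sure", "this is beyond"];

function decideEscalation(input: EscalationInput): EscalationDecision {
  if (!input.autoEscalate) return "none";
  // Rule 1: iteration budget exhausted without finishing, so escalate to the complex tier.
  if (!input.completed && input.iterations >= input.maxIterations) return "escalate";
  // Rule 2: explicit uncertainty in the response, so offer escalation to the user.
  const text = input.responseText.toLowerCase();
  if (UNCERTAINTY_MARKERS.some((marker) => text.includes(marker))) return "offer";
  return "none";
}
```

Substring matching is a blunt heuristic and will produce false positives, which is one reason to keep the second rule as an offer rather than an automatic switch.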
AgentOrchestrator class
Create a new AgentOrchestrator that sits between the channel message handler and the NativeAgent:
class AgentOrchestrator {
  private primaryAgent: NativeAgent; // default tier (Sonnet)
  private modelRouter: ModelRouter;

  /** Spawn a sub-agent for a single-turn task on a specific tier. */
  async delegate(request: SubAgentRequest): Promise<SubAgentResult>;

  /** Process a user message. Delegates to the primary agent, which may internally delegate subtasks. */
  async process(userMessage: string): Promise<string>;
}
The orchestrator replaces the current direct NativeAgent usage in the message router (src/daemon/index.ts:139-186).
Passing the orchestrator to tools and compaction
The key insight: compaction and memory extraction don't need a new agent class — they just need access to modelRouter.chat(request, 'fast'). The orchestrator provides a delegate() method that any subsystem can call:
// In compaction.ts
const summary = await orchestrator.delegate({
  tier: 'fast',
  systemPrompt: COMPACTION_SYSTEM_PROMPT,
  message: `Summarise this conversation:\n\n${messagesToCompact}`,
  maxTokens: 1024,
});

// In memory extraction
const facts = await orchestrator.delegate({
  tier: 'fast',
  systemPrompt: MEMORY_EXTRACTION_PROMPT,
  message: `Extract key facts from:\n\n${summary}`,
  maxTokens: 512,
});
New files
| File | Purpose |
|---|---|
| src/backends/native/orchestrator.ts | AgentOrchestrator: sub-agent spawning and delegation |
| src/backends/native/prompts.ts | System prompts for delegated tasks (compaction, extraction, classification) |
Changes to existing files
| File | Change |
|---|---|
| src/backends/native/agent.ts | Accept optional orchestrator reference for internal delegation. Add delegateSubtask() method. |
| src/daemon/index.ts | Replace direct NativeAgent creation in createMessageRouter() with AgentOrchestrator. |
| src/config/schema.ts | Add agents config block for tier assignment and delegation policy. |
| src/models/router.ts | No changes needed; already supports chat(request, tier). |
Config additions
agents:
  primary_tier: default          # Model tier for main conversation (Sonnet)
  delegation:
    compaction: fast             # Tier for compaction summaries (Haiku)
    memory_extraction: fast      # Tier for memory fact extraction (Haiku)
    classification: fast         # Tier for message classification (Haiku)
    tool_summarisation: fast     # Tier for condensing tool output (Haiku)
    complex_reasoning: complex   # Tier for escalated reasoning (Opus)
  auto_escalate: false           # Future: auto-escalate on failure
  max_delegation_depth: 3        # Prevent infinite delegation chains
Implementation steps
- Create src/backends/native/orchestrator.ts:
  - Constructor takes ModelRouter, systemPrompt, session, toolRegistry, toolExecutor, delegation config.
  - delegate(request: SubAgentRequest): Promise<SubAgentResult> (single-turn call to modelRouter.chat() with specified tier).
  - process(userMessage: string): Promise<string> (delegates to internal NativeAgent).
  - Tracks delegation depth to prevent loops.
  - Logs tier usage for cost visibility.
- Create src/backends/native/prompts.ts with task-specific system prompts.
- Update createMessageRouter() in src/daemon/index.ts to use AgentOrchestrator instead of raw NativeAgent.
- Add agents config block to schema.
- Wire delegation config through to compaction (Phase 1) and memory (Phase 2).
- Tests: delegation routing, tier selection, depth limiting.
Cost implications
| Operation | Without delegation | With delegation |
|---|---|---|
| Compaction summary | Opus/Sonnet ($$$) | Haiku ($) |
| Memory extraction | Opus/Sonnet ($$$) | Haiku ($) |
| 10 classifications | Opus/Sonnet ($$$) | Haiku ($) |
| Complex reasoning | Sonnet ($$) | Opus ($$$) — but only when needed |
Net effect: significant cost reduction for background tasks, with targeted spend on complex reasoning only when it matters.
Phase 1: Context Compaction (P0)
Problem
Flynn sends the entire session history to the model on every turn. There is no summarisation, trimming, or token budgeting. Once a conversation exceeds the model's context window, it fails hard.
Current flow (src/backends/native/agent.ts:92-165):
toolLoop() → loopMessages = full this.history → send to model
The SessionStore (src/session/store.ts) and ManagedSession (src/session/manager.ts) store every message verbatim and replay them all on load.
Design
Token counting
Add a tokenCount utility that estimates token counts per message. Two strategies:
- Cheap estimate: character-based heuristic (chars / 4 for English). Good enough for budgeting.
- Accurate count: use the Anthropic SDK's count_tokens or tiktoken for OpenAI. Only needed if we want precise billing.
Start with the cheap estimate; add accurate counting later behind a flag.
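The cheap estimate could look roughly like the sketch below. The `Message` shape and the fixed per-message overhead are assumptions for illustration; only the `chars / 4` heuristic and the two function names come from this plan.

```typescript
// Hypothetical message shape; Flynn's real Message type may carry more fields.
interface Message {
  role: "system" | "user" | "assistant" | "tool";
  content: string;
}

const CHARS_PER_TOKEN = 4; // heuristic for English text
const PER_MESSAGE_OVERHEAD = 4; // assumption: small allowance for role and framing tokens

/** Estimate tokens for a string, rounding up so budgets err on the safe side. */
function estimateTokens(text: string): number {
  return Math.ceil(text.length / CHARS_PER_TOKEN);
}

/** Estimate tokens for a whole history, including the assumed per-message overhead. */
function estimateMessageTokens(messages: Message[]): number {
  return messages.reduce(
    (sum, message) => sum + estimateTokens(message.content) + PER_MESSAGE_OVERHEAD,
    0,
  );
}
```

The heuristic tends to undercount for code and non-English text, so a threshold below 100% leaves useful headroom.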
Compaction strategy
Use a summarise-and-replace approach (same as OpenClaw):
- When total estimated tokens exceed a compaction threshold (configurable, default: 80% of model's context window), trigger compaction.
- Take all messages except the last N turns (configurable, default: 4 turns).
- Delegate the summarisation request to the fast tier (Haiku) via orchestrator.delegate(): "Summarise this conversation so far, preserving key facts, decisions, and context." This is a well-defined extraction task that doesn't need complex reasoning.
- Replace the older messages with a single [system_summary] message.
- Persist the compacted history to SQLite (replace the old messages).
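The steps above can be sketched as follows. This is an illustration under stated assumptions: a "turn" is taken to begin at each user message, the `Message` shape is simplified, and the summariser is injected as a plain function rather than calling `orchestrator.delegate()` directly. Persistence to SQLite is left out.

```typescript
// Hypothetical simplified message shape.
interface Message {
  role: "system" | "user" | "assistant";
  content: string;
}

/** True when the estimated token count exceeds thresholdPct percent of the context window. */
function shouldCompact(estimatedTokens: number, contextWindow: number, thresholdPct: number): boolean {
  return estimatedTokens > contextWindow * (thresholdPct / 100);
}

/** Split history into messages to summarise and the last keepTurns turns to keep verbatim. */
function splitForCompaction(messages: Message[], keepTurns: number): { older: Message[]; recent: Message[] } {
  if (keepTurns <= 0) return { older: messages, recent: [] };
  let turnsSeen = 0;
  let cut = 0; // stays 0 when there are fewer than keepTurns turns: nothing to compact
  for (let i = messages.length - 1; i >= 0; i--) {
    if (messages[i].role === "user") {
      turnsSeen++;
      if (turnsSeen === keepTurns) {
        cut = i;
        break;
      }
    }
  }
  return { older: messages.slice(0, cut), recent: messages.slice(cut) };
}

/** Replace older messages with a single summary message produced by the injected summariser. */
async function compactHistory(
  messages: Message[],
  keepTurns: number,
  summarise: (transcript: string) => Promise<string>,
): Promise<Message[]> {
  const { older, recent } = splitForCompaction(messages, keepTurns);
  if (older.length === 0) return messages;
  const transcript = older.map((m) => `${m.role}: ${m.content}`).join("\n");
  const summary = await summarise(transcript);
  return [{ role: "system", content: `[system_summary]\n${summary}` }, ...recent];
}
```

Injecting the summariser keeps the split and threshold logic testable without a model call; in the plan's design that function would wrap a fast-tier delegation.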
Where compaction runs
Compaction is a concern of AgentOrchestrator (Phase 0), not the session store. The orchestrator decides when to compact based on the model it's using, and delegates the summary generation to the fast tier via orchestrator.delegate({ tier: 'fast', ... }).
New files
| File | Purpose |
|---|---|
| src/context/tokens.ts | Token estimation utilities |
| src/context/compaction.ts | Compaction logic (summarise + replace) |
Changes to existing files
| File | Change |
|---|---|
| src/backends/native/agent.ts | Add compactIfNeeded() call before building loopMessages. Add compaction config to NativeAgentConfig. |
| src/session/manager.ts | Add ManagedSession.replaceHistory(messages) method for compaction to persist the compacted state. |
| src/session/store.ts | Add replaceMessages(sessionId, messages): atomic delete + re-insert in a transaction. |
| src/models/types.ts | Add optional contextWindow field to ChatResponse or create a ModelCapabilities type. |
| src/config/schema.ts | Add compaction config block: { enabled, threshold_pct, keep_turns, summary_model? }. |
| src/daemon/index.ts | Pass compaction config to agent creation. |
Config additions
compaction:
  enabled: true
  threshold_pct: 80      # Trigger at 80% of context window
  keep_turns: 4          # Always keep the last 4 exchanges
  # summary_tier is configured in agents.delegation.compaction (default: fast/Haiku)
Chat commands
| Command | Description |
|---|---|
| /compact | Force compaction of the current session immediately. |
Implementation steps
- Create src/context/tokens.ts with estimateTokens(text: string): number and estimateMessageTokens(messages: Message[]): number.
- Create src/context/compaction.ts with compactHistory(opts: CompactionOpts): Promise<Message[]>:
  - Takes messages, orchestrator (for delegation), keep_turns.
  - Calls orchestrator.delegate({ tier: 'fast', ... }) for the summary.
  - Returns [summaryMessage, ...recentMessages].
- Add replaceMessages() to SessionStore.
- Add replaceHistory() to ManagedSession.
- Add compaction config to schema.
- Wire compactIfNeeded() into AgentOrchestrator.process(): called before building the request, checks token budget.
- Add /compact command handling in the message router.
- Tests: token estimation accuracy, compaction trigger logic, history replacement, delegation to fast tier.
Model context window sizes
Hard-code a lookup table in src/context/tokens.ts:
const CONTEXT_WINDOWS: Record<string, number> = {
  'claude-sonnet-4-20250514': 200_000,
  'claude-3-5-haiku-20241022': 200_000,
  'gpt-4o': 128_000,
  'gpt-4o-mini': 128_000,
  // ... etc
};
Allow override in config: models.default.context_window: 128000.
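One way the lookup and the config override might combine is sketched below. The `resolveContextWindow` and `compactionThreshold` names and the conservative fallback for unknown models are assumptions for this example, not part of the plan.

```typescript
// Subset of the lookup table above, repeated so the sketch is self-contained.
const CONTEXT_WINDOWS: Record<string, number> = {
  "claude-sonnet-4-20250514": 200_000,
  "gpt-4o": 128_000,
};

// Hypothetical fallback for models missing from the table; deliberately conservative.
const FALLBACK_CONTEXT_WINDOW = 32_000;

/** A positive config override wins; otherwise use the table, then the fallback. */
function resolveContextWindow(modelId: string, override?: number): number {
  if (override !== undefined && override > 0) return override;
  return CONTEXT_WINDOWS[modelId] ?? FALLBACK_CONTEXT_WINDOW;
}

/** Token count at which compaction should trigger for a given model. */
function compactionThreshold(modelId: string, thresholdPct: number, override?: number): number {
  return Math.floor((resolveContextWindow(modelId, override) * thresholdPct) / 100);
}
```

A small fallback means an unrecognised model compacts early rather than overflowing its window.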
Phase 2: Memory System (P0)
Problem
Flynn has no persistent knowledge across sessions. Every new session starts blank. The agent can't remember user preferences, past decisions, or accumulated knowledge.
Design
A lightweight memory system with three layers:
- Memory files: Markdown files that the agent can read/write (like OpenClaw's MEMORY.md).
- Memory tools: memory.read, memory.write, memory.search builtin tools.
- Auto-indexing: After compaction, key facts are extracted and appended to memory.
Storage
Use a dedicated SQLite table in the existing sessions.db (or a separate memory.db):
CREATE TABLE memory_entries (
  id INTEGER PRIMARY KEY AUTOINCREMENT,
  session_id TEXT,                 -- NULL for global memories
  namespace TEXT NOT NULL,         -- 'user', 'facts', 'preferences', etc.
  key TEXT NOT NULL,
  content TEXT NOT NULL,
  embedding BLOB,                  -- Future: vector embedding for search
  created_at INTEGER NOT NULL DEFAULT (unixepoch()),
  updated_at INTEGER NOT NULL DEFAULT (unixepoch())
);

CREATE INDEX idx_memory_ns ON memory_entries(namespace);
CREATE INDEX idx_memory_session ON memory_entries(session_id);
Phase 2a: File-based memory (MVP)
The simplest useful memory: a markdown file per namespace in ~/.local/share/flynn/memory/.
~/.local/share/flynn/memory/
├── global.md              # Cross-session knowledge
├── user.md                # User preferences, facts about the user
└── sessions/
    └── {session_id}.md    # Per-session notes
Memory tools
| Tool | Description |
|---|---|
| memory.read | Read a memory file by namespace. Args: { namespace: string } |
| memory.write | Append to or replace a memory file. Args: { namespace: string, content: string, mode: 'append' \| 'replace' } |
| memory.search | Search across all memory files for a keyword. Args: { query: string }. Returns matching lines with context. |
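The keyword search behind memory.search could be as simple as the sketch below. It operates on already-loaded file contents keyed by namespace so that it stays independent of the filesystem; the `SearchResult` fields and the `searchMemory` name are hypothetical.

```typescript
// Hypothetical result shape for a keyword hit.
interface SearchResult {
  namespace: string;
  line: number; // 1-based line number of the match
  text: string; // the matching line
  context: string[]; // the match plus surrounding lines
}

/** Case-insensitive, line-by-line keyword match across all namespaces. */
function searchMemory(
  files: Record<string, string>,
  query: string,
  contextLines = 1,
): SearchResult[] {
  const needle = query.trim().toLowerCase();
  if (needle === "") return [];
  const results: SearchResult[] = [];
  for (const [namespace, content] of Object.entries(files)) {
    const lines = content.split("\n");
    lines.forEach((text, index) => {
      if (!text.toLowerCase().includes(needle)) return;
      const start = Math.max(0, index - contextLines);
      const end = Math.min(lines.length, index + contextLines + 1);
      results.push({ namespace, line: index + 1, text, context: lines.slice(start, end) });
    });
  }
  return results;
}
```

A MemoryStore.search() implementation would read the markdown files from the memory directory and pass their contents to a function like this.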
Phase 2b: Vector search (future)
Defer vector embeddings and semantic search to a later phase. The file-based approach with keyword search covers 80% of use cases.
When implemented:
- Add sqlite-vec or similar for vector storage
- Embed memory entries on write using the configured model's embedding API
- Hybrid search: keyword (BM25) + vector similarity
System prompt integration
On every agent turn, inject a [Memory Context] section into the system prompt:
# Memory Context
The following is your persistent memory. Use it to maintain continuity across sessions.
## User
{contents of user.md, truncated to ~1000 tokens}
## Global
{contents of global.md, truncated to ~1000 tokens}
This is injected dynamically by the agent before each request, not baked into the static system prompt.
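A possible shape for the injection step is sketched here. The function names are hypothetical, the token budget reuses the `chars / 4` estimate from Phase 1, and truncation keeps the tail of each file on the assumption that appended (newest) entries sit at the end.

```typescript
const CHARS_PER_TOKEN = 4; // same heuristic as the Phase 1 token estimate

/** Keep roughly the last maxTokens tokens of text, dropping the oldest content first. */
function truncateToTokens(text: string, maxTokens: number): string {
  const maxChars = maxTokens * CHARS_PER_TOKEN;
  if (text.length <= maxChars) return text;
  return "[earlier entries truncated]\n" + text.slice(text.length - maxChars);
}

/** Build the Memory Context section that is prepended to the system prompt on each turn. */
function buildMemoryContext(userMemory: string, globalMemory: string, tokensPerSection = 1000): string {
  return [
    "# Memory Context",
    "The following is your persistent memory. Use it to maintain continuity across sessions.",
    "## User",
    truncateToTokens(userMemory, tokensPerSection),
    "## Global",
    truncateToTokens(globalMemory, tokensPerSection),
  ].join("\n");
}
```

With the default of 1000 tokens per section, the injected block stays near the 2000-token max_context_tokens budget from the config.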
Auto-extraction after compaction
When compaction runs (Phase 1), add a follow-up step using the fast tier (Haiku) via orchestrator.delegate():
- Along with the summary, delegate to Haiku to extract any new facts worth remembering (user preferences, decisions, names, etc.). This is a simple extraction task — no need for Sonnet/Opus.
- Append extracted facts to user.md or global.md.
This creates a natural knowledge accumulation loop: conversation → compaction (Haiku) → memory extraction (Haiku) → next session gets richer context.
The cost of these background operations is minimal since they run on the cheapest model tier.
New files
| File | Purpose |
|---|---|
| src/memory/store.ts | MemoryStore class: read/write/search markdown files |
| src/memory/index.ts | Exports |
| src/tools/builtin/memory-read.ts | memory.read tool |
| src/tools/builtin/memory-write.ts | memory.write tool |
| src/tools/builtin/memory-search.ts | memory.search tool |
Changes to existing files
| File | Change |
|---|---|
| src/tools/builtin/index.ts | Register memory tools in allBuiltinTools |
| src/backends/native/orchestrator.ts | Inject memory context into system prompt before each request |
| src/context/compaction.ts | Add memory extraction step after summarisation (delegates to fast tier) |
| src/daemon/index.ts | Initialize MemoryStore, pass to orchestrator config |
| src/config/schema.ts | Add memory config block: { enabled, dir, namespaces, auto_extract } |
Config additions
memory:
  enabled: true
  dir: ~/.local/share/flynn/memory
  auto_extract: true          # Extract facts during compaction
  max_context_tokens: 2000    # Max tokens injected per turn from memory
Implementation steps
- Create src/memory/store.ts:
  - read(namespace): string (read file contents)
  - write(namespace, content, mode): void (append or replace)
  - search(query): SearchResult[] (line-by-line keyword match with context)
  - listNamespaces(): string[]
- Create memory tools (3 files).
- Register tools.
- Add memory context injection to NativeAgent: load memory before building the request, inject into system prompt.
- Add memory extraction to compaction flow.
- Tests: memory CRUD, search, injection, extraction.
Phase 3: Messaging Channels (P1)
Problem
Flynn has only Telegram and WebChat. The three most requested channels are WhatsApp, Discord, and Slack.
Design approach
Flynn's ChannelAdapter interface (src/channels/types.ts:51-69) is clean and well-defined. Adding a new channel means:
- Implement ChannelAdapter (6 members: name, status, connect(), disconnect(), send(), onMessage()).
- Add config section.
- Register in daemon startup.
Each channel is independent — implement in any order.
3a: Discord
Library: discord.js v14
Effort: 1–2 days
Config
discord:
  bot_token: ${DISCORD_BOT_TOKEN}
  allowed_guild_ids: []       # Empty = all guilds
  allowed_channel_ids: []     # Empty = all channels
New files
| File | Purpose |
|---|---|
| src/channels/discord/adapter.ts | DiscordAdapter implementing ChannelAdapter |
| src/channels/discord/index.ts | Exports |
Key decisions
- Peer ID: Use channelId (not userId) so the agent maintains separate sessions per Discord channel.
- Message chunking: Discord has a 2000-char limit. Chunk long responses.
- Mentions: Only respond when mentioned (@Flynn) or in DMs. Configurable.
- Slash commands: Register /reset and /status as Discord slash commands.
Implementation steps
- Add discord.js dependency.
- Create DiscordAdapter class.
- Add config schema for discord section.
- Register in daemon if config.discord.bot_token is set.
- Export from src/channels/index.ts.
- Test with a bot in a private server.
3b: Slack
Library: @slack/bolt (Bolt for JavaScript)
Effort: 1–2 days
Config
slack:
  bot_token: ${SLACK_BOT_TOKEN}
  app_token: ${SLACK_APP_TOKEN}             # For Socket Mode
  signing_secret: ${SLACK_SIGNING_SECRET}
  allowed_channel_ids: []
New files
| File | Purpose |
|---|---|
| src/channels/slack/adapter.ts | SlackAdapter implementing ChannelAdapter |
| src/channels/slack/index.ts | Exports |
Key decisions
- Socket Mode for self-hosted deployments (no public URL needed). Falls back to HTTP events if app_token is not set.
- Peer ID: channelId:threadTs to isolate threaded conversations.
- Message chunking: Slack truncates message text beyond 40,000 characters, and a single section block's text is capped at 3,000 characters, so chunk long responses accordingly. Use mrkdwn formatting.
- Slash commands: /flynn-reset, /flynn-status.
3c: WhatsApp
Library: whatsapp-web.js (or @whiskeysockets/baileys for full WhatsApp Web protocol)
Effort: 2–3 days (more complex due to QR auth)
Config
whatsapp:
  auth_dir: ~/.local/share/flynn/whatsapp-auth
  allowed_numbers: []     # E.164 format, empty = all
Key decisions
- Auth flow: WhatsApp Web requires QR code scanning on first connect. Display QR in terminal on startup.
- Session persistence: Store auth state in auth_dir so re-auth isn't needed on restart.
- Peer ID: Phone number (E.164).
- Media: Start with text-only; defer image/audio handling.
WhatsApp is the most complex channel. Consider doing Discord and Slack first, then WhatsApp.
Shared channel infrastructure
Before implementing individual channels, extract any common patterns:
- Message chunking utility: src/channels/utils/chunking.ts: chunkMessage(text: string, maxLen: number): string[]
- Allowlist checking: src/channels/utils/auth.ts: isAllowed(senderId: string, allowlist: string[]): boolean
- Markdown adaptation: src/channels/utils/markdown.ts: Platform-specific markdown conversion (Discord uses different syntax from Telegram).
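The first two utilities are small enough to sketch directly. The signatures match the ones listed above; the break-point preference (newline, then space, then hard split) and the "empty allowlist allows everyone" rule are assumptions, the latter inferred from the "Empty = all" comments in the channel configs.

```typescript
/** Split text into chunks of at most maxLen characters, preferring natural break points. */
function chunkMessage(text: string, maxLen: number): string[] {
  if (maxLen <= 0) throw new RangeError("maxLen must be positive");
  const chunks: string[] = [];
  let rest = text;
  while (rest.length > maxLen) {
    // Prefer the last newline inside the window, then the last space.
    let cut = rest.lastIndexOf("\n", maxLen);
    if (cut <= 0) cut = rest.lastIndexOf(" ", maxLen);
    if (cut <= 0) cut = maxLen; // no natural break: hard split
    chunks.push(rest.slice(0, cut));
    rest = rest.slice(cut).replace(/^\s+/, ""); // drop the whitespace we broke on
  }
  if (rest.length > 0) chunks.push(rest);
  return chunks;
}

/** An empty allowlist means everyone is allowed, matching the "Empty = all" config comments. */
function isAllowed(senderId: string, allowlist: string[]): boolean {
  return allowlist.length === 0 || allowlist.includes(senderId);
}
```

For Discord the adapter would call `chunkMessage(reply, 2000)`. Note that this sketch counts UTF-16 code units and can split inside a markdown code fence, so a production version may need to track fence state.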
Phase 4: Web Search Tool (P1)
Problem
The agent has no way to search the web. This is one of the most commonly-used agent tools.
Design
Provider options
| Provider | Pros | Cons |
|---|---|---|
| Brave Search API | Free tier (2k/month), clean API, good results | Requires API key signup |
| SearXNG | Self-hosted, no API key, already running in homelab | Results quality varies |
| Tavily | Purpose-built for AI agents, great results | Paid only |
| DuckDuckGo | No API key needed | Unofficial API, rate limits |
Recommendation: Support Brave as primary, SearXNG as self-hosted alternative. Make the provider configurable.
Config
tools:
  web_search:
    provider: brave                    # brave | searxng | tavily
    api_key: ${BRAVE_SEARCH_API_KEY}
    endpoint: null                     # Override for SearXNG: http://searxng:8080
    max_results: 5
New files
| File | Purpose |
|---|---|
| src/tools/builtin/web-search.ts | web.search tool |
Tool interface
{
  name: 'web.search',
  description: 'Search the web for information. Returns titles, URLs, and snippets.',
  inputSchema: {
    type: 'object',
    properties: {
      query: { type: 'string', description: 'Search query' },
      count: { type: 'number', description: 'Number of results (default 5, max 20)' },
    },
    required: ['query'],
  },
}
Output format
1. **Title** — url
Snippet text...
2. **Title** — url
Snippet text...
Structured as markdown so the model can easily parse and reference results.
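A formatter producing that layout might look like the sketch below. The `SearchHit` shape and the `formatSearchResults` name are hypothetical; each provider backend (Brave, SearXNG) would map its own response into this common shape first.

```typescript
// Hypothetical provider-neutral result shape.
interface SearchHit {
  title: string;
  url: string;
  snippet: string;
}

/** Render hits as a numbered markdown list in the output format shown above. */
function formatSearchResults(hits: SearchHit[], maxResults = 5): string {
  if (hits.length === 0) return "No results found.";
  return hits
    .slice(0, maxResults)
    // \u2014 is the dash separator used between title and URL in the output format.
    .map((hit, i) => `${i + 1}. **${hit.title}** \u2014 ${hit.url}\n   ${hit.snippet}`)
    .join("\n");
}
```

Keeping the formatter separate from the HTTP client lets the formatting tests run against fixed data with no API mocking.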
Implementation steps
- Create src/tools/builtin/web-search.ts.
- Add Brave Search API client (simple fetch, no SDK needed).
- Add SearXNG support as alternative backend.
- Add tool config section to schema.
- Register in allBuiltinTools.
- Tests: mock API responses, result formatting.
Phase 5: Background Exec / Process Management (P1)
Problem
Flynn's shell.exec (src/tools/builtin/shell.ts) is strictly run-to-completion: it runs a command, blocks until it finishes (up to a 30s timeout), and returns stdout/stderr. There's no way to:
- Run a long-running process (e.g., npm run dev)
- Check on a running process
- Read its ongoing output
- Kill it
Design
Add a process tool family that manages background processes:
| Tool | Description |
|---|---|
| process.start | Start a command in the background. Returns a process ID. |
| process.status | Check if a process is running, exited, or errored. |
| process.output | Read recent stdout/stderr from a background process. |
| process.kill | Kill a background process. |
| process.list | List all managed background processes. |
Process manager
Create a ProcessManager class that maintains a registry of spawned processes:
interface ManagedProcess {
  id: string;
  command: string;
  cwd?: string;
  pid: number;
  status: 'running' | 'exited' | 'killed' | 'error';
  exitCode?: number;
  outputBuffer: RingBuffer; // Last N bytes of combined stdout+stderr
  startedAt: number;
}
Output buffering
Use a ring buffer (circular buffer) to keep the last 64KB of output per process. This prevents memory leaks from long-running processes with verbose output.
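A byte-oriented ring buffer along these lines is sketched below. The class shape (`write`/`read`) is an assumption; the plan only specifies that the last 64KB of combined output is kept.

```typescript
/** Fixed-capacity circular buffer that keeps only the most recent bytes written. */
class RingBuffer {
  private readonly capacity: number;
  private readonly buf: Uint8Array;
  private start = 0; // index of the oldest byte
  private length = 0; // number of valid bytes currently stored

  constructor(capacity: number) {
    if (!Number.isInteger(capacity) || capacity <= 0) {
      throw new RangeError("capacity must be a positive integer");
    }
    this.capacity = capacity;
    this.buf = new Uint8Array(capacity);
  }

  /** Append bytes, overwriting the oldest data once the buffer is full. */
  write(chunk: Uint8Array): void {
    // Only the last `capacity` bytes of an oversized chunk can survive.
    const data = chunk.length > this.capacity ? chunk.subarray(chunk.length - this.capacity) : chunk;
    for (let i = 0; i < data.length; i++) {
      const end = (this.start + this.length) % this.capacity;
      this.buf[end] = data[i];
      if (this.length < this.capacity) this.length++;
      else this.start = (this.start + 1) % this.capacity; // dropped the oldest byte
    }
  }

  /** Return the stored bytes in oldest-to-newest order. */
  read(): Uint8Array {
    const out = new Uint8Array(this.length);
    for (let i = 0; i < this.length; i++) {
      out[i] = this.buf[(this.start + i) % this.capacity];
    }
    return out;
  }
}
```

A ProcessManager would create one `new RingBuffer(64 * 1024)` per process and feed it from the child's stdout and stderr data events. Because the buffer cuts at byte boundaries, the oldest retained bytes may begin mid-character when decoded as UTF-8.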
Safety
- Max processes: Limit to 10 concurrent background processes.
- Auto-cleanup: Kill processes that have been running for more than 1 hour (configurable).
- Shutdown cleanup: Kill all managed processes on daemon shutdown.
- Hook integration: process.start should go through the confirmation engine (same as shell.exec).
New files
| File | Purpose |
|---|---|
| src/tools/builtin/process/manager.ts | ProcessManager class |
| src/tools/builtin/process/start.ts | process.start tool |
| src/tools/builtin/process/status.ts | process.status tool |
| src/tools/builtin/process/output.ts | process.output tool |
| src/tools/builtin/process/kill.ts | process.kill tool |
| src/tools/builtin/process/list.ts | process.list tool |
| src/tools/builtin/process/index.ts | Exports |
Changes to existing files
| File | Change |
|---|---|
| src/tools/builtin/index.ts | Register process tools |
| src/daemon/index.ts | Create ProcessManager, pass to tool constructors, register shutdown handler |
| src/config/schema.ts | Add process config: { max_concurrent, max_runtime_minutes, buffer_size } |
Implementation steps
- Implement RingBuffer utility (or use an npm package like ringbufferjs).
- Create ProcessManager class with spawn, track, kill, cleanup methods.
- Implement 5 process tools.
- Register tools and wire shutdown cleanup.
- Tests: spawn + kill lifecycle, output buffering, max process limits.
Phase 6: Enhanced web_fetch (P1)
Problem
Flynn's web.fetch (src/tools/builtin/web-fetch.ts:19-50) is a bare fetch() call that returns raw HTML. This is nearly useless for LLMs — they need extracted text/markdown, not raw HTML with scripts and styles.
Design
Enhancements
- HTML-to-markdown extraction: Strip scripts/styles, convert to markdown using @mozilla/readability + turndown.
- Format parameter: Let the agent choose text, markdown (default), or html.
- Response caching: Cache fetched pages for 5 minutes to avoid redundant requests in tool loops.
- Redirect following: Already handled by fetch(), but add a max redirect limit.
- Content type handling: Return JSON prettified, plain text as-is, HTML converted.
Libraries
| Package | Purpose |
|---|---|
| turndown | HTML → Markdown converter |
| linkedom | Lightweight DOM implementation (for Readability) |
| @mozilla/readability | Extract article content from HTML |
Using linkedom instead of jsdom — it's much lighter and sufficient for content extraction.
Tool interface update
{
  name: 'web.fetch',
  description: 'Fetch a URL and extract its content. Returns clean text/markdown by default, not raw HTML.',
  inputSchema: {
    type: 'object',
    properties: {
      url: { type: 'string', description: 'The URL to fetch' },
      format: { type: 'string', enum: ['markdown', 'text', 'html'], description: 'Output format (default: markdown)' },
      timeout: { type: 'number', description: 'Timeout in milliseconds (default 15000)' },
    },
    required: ['url'],
  },
}
Caching
Simple in-memory cache with TTL:
const cache = new Map<string, { content: string; timestamp: number }>();
const CACHE_TTL = 5 * 60 * 1000; // 5 minutes
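Wrapped into a small class, the cache might look like the sketch below. The `FetchCache` name and the injectable clock are assumptions added to make expiry testable; the Map-plus-timestamp structure and the 5-minute TTL follow the snippet above.

```typescript
interface CacheEntry {
  content: string;
  timestamp: number; // milliseconds since epoch at insertion
}

/** In-memory TTL cache; expired entries are evicted lazily on read. */
class FetchCache {
  private readonly entries = new Map<string, CacheEntry>();
  private readonly ttlMs: number;
  private readonly now: () => number;

  // The clock is injectable (hypothetical design choice) so tests can control time.
  constructor(ttlMs: number = 5 * 60 * 1000, now: () => number = Date.now) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  get(key: string): string | undefined {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (this.now() - entry.timestamp >= this.ttlMs) {
      this.entries.delete(key); // expired: evict and report a miss
      return undefined;
    }
    return entry.content;
  }

  set(key: string, content: string): void {
    this.entries.set(key, { content, timestamp: this.now() });
  }
}
```

Since the same URL can be requested in different formats, the cache key probably needs to combine both (for example `${format}:${url}`), or the cache should store the raw response and convert on read. Lazy eviction also means entries that are never read again stay in memory, so a size cap may be worth adding.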
Changes to existing files
| File | Change |
|---|---|
| src/tools/builtin/web-fetch.ts | Major rewrite: add extraction, caching, format parameter |
Implementation steps
- Add turndown, linkedom, @mozilla/readability dependencies.
- Create extraction pipeline: fetch → parse DOM → readability → turndown → clean markdown.
- Add format parameter handling.
- Add response caching.
- Update tool description to reflect new capabilities.
- Tests: extraction from sample HTML, caching behaviour, format handling.
Implementation Order
Week 1: Phase 0 (Multi-Model Delegation) ─────────────────────── P0 (foundational)
Week 2: Phase 1 (Context Compaction) ─────────────────────────── P0 (uses delegation)
Week 3: Phase 2 (Memory System) ──────────────────────────────── P0 (uses delegation)
Week 4: Phase 4 (Web Search) + Phase 6 (Enhanced web_fetch) ─── P1 (quick wins)
Week 5: Phase 5 (Process Management) ─────────────────────────── P1
Week 6+: Phase 3 (Channels: Discord → Slack → WhatsApp) ──────── P1
Rationale:
- Delegation first — Phase 0 is foundational. Compaction and memory both need to delegate subtasks to cheaper models. Building the orchestrator first means Phase 1 and 2 can use it immediately.
- Compaction and memory are sequential (memory extraction depends on compaction).
- Web search and enhanced web_fetch are small, independent, and immediately useful — do them as palate cleansers between the big features.
- Process management is self-contained.
- Channels are the largest body of work but each is independent — can be done in parallel or interleaved.
Model usage across all phases
| Phase | Primary model (user-facing) | Delegated tasks | Delegation tier |
|---|---|---|---|
| 0 | Sonnet (default) | Sub-agent infrastructure | N/A (infrastructure) |
| 1 | Sonnet (default) | Compaction summaries | Haiku (fast) |
| 2 | Sonnet (default) | Memory fact extraction | Haiku (fast) |
| 3 | Sonnet (default) | Message classification, markdown adaptation | Haiku (fast) |
| 4 | Sonnet (default) | None (direct API call) | N/A |
| 5 | Sonnet (default) | None | N/A |
| 6 | Sonnet (default) | None | N/A |
Opus (complex) is reserved for user-facing tasks that require deep reasoning — it's never used for background operations.
Testing Strategy
Each phase should include:
- Unit tests — Pure logic (token estimation, ring buffer, markdown extraction, memory search).
- Integration tests — Tool execution with mocked model responses.
- Manual smoke test — Run via TUI and Telegram to verify end-to-end.
Key test files to create:
| Test file | Covers |
|---|---|
| src/backends/native/orchestrator.test.ts | Delegation routing, tier selection, depth limiting, cost tracking |
| src/context/tokens.test.ts | Token estimation accuracy |
| src/context/compaction.test.ts | Compaction trigger logic, summary replacement, fast-tier delegation |
| src/memory/store.test.ts | Memory CRUD, search |
| src/tools/builtin/web-search.test.ts | API mocking, result formatting |
| src/tools/builtin/process/manager.test.ts | Process lifecycle, cleanup |
| src/tools/builtin/web-fetch.test.ts | HTML extraction, caching |
Risk Assessment
| Risk | Impact | Mitigation |
|---|---|---|
| Haiku summaries lose critical context vs Sonnet | High | Validate quality; use detailed extraction prompts; allow per-task tier override in config |
| Delegation depth spirals (agent delegates to agent that delegates...) | Medium | Hard limit max_delegation_depth: 3; sub-agents cannot spawn sub-agents |
| Fast tier unavailable (Haiku rate limit / outage) | Medium | Fallback to default tier for delegation; log the fallback cost increase |
| Compaction summaries lose critical context | High | Keep last 4 turns intact; allow user to adjust keep_turns; log what was compacted |
| Memory injection bloats system prompt | Medium | Hard cap on injected memory tokens; truncate oldest entries |
| WhatsApp auth flow is fragile | Medium | Defer WhatsApp to last; use battle-tested Baileys library |
| Brave Search free tier limits (2k/month) | Low | SearXNG as free self-hosted fallback |
| Background processes leak resources | Medium | Max process limit, auto-kill timeout, shutdown cleanup |
| HTML extraction fails on JS-heavy sites | Low | Accept graceful degradation; defer CDP/browser fallback to P3 |