# Flynn P0 + P1 Implementation Plan
**Date:** 2026-02-06

**Scope:** 7 features from the gap analysis: the functionally critical (P0) and high-impact (P1) items.

**Prerequisite:** [Feature Gap Analysis](./2026-02-06-openclaw-feature-gap-analysis.md)

---
## Feature Summary
| # | Feature | Priority | Est. Effort | Dependencies |
|---|---------|----------|-------------|--------------|
| 0 | Multi-model sub-agent delegation | P0 | 3-4 days | None (foundational) |
| 1 | Context compaction | P0 | 2-3 days | #0 (uses cheap model for summaries) |
| 2 | Memory system | P0 | 3-4 days | #0, #1 |
| 3 | Messaging channels (WhatsApp, Discord, Slack) | P1 | 2-3 days each | None |
| 4 | Web search tool | P1 | 0.5 day | None |
| 5 | Background exec / process management | P1 | 1-2 days | None |
| 6 | Enhanced web_fetch | P1 | 1 day | None |

**Total estimated effort:** 15-22 days

---
## Phase 0: Multi-Model Sub-Agent Delegation (P0 — Foundational)
### Problem
Flynn currently runs a **single NativeAgent per session** that talks to one model tier at a time. The `ModelRouter` (`src/models/router.ts`) supports tiers (`fast`/`default`/`complex`/`local`) and a fallback chain, but:
- There is no concept of **sub-agents** — the primary agent can't spawn a cheaper model for a subtask.
- Model selection is **per-session** (via `/model` command), not **per-task**.
- Compaction summaries, memory extraction, and classification tasks all use the same expensive model as the main conversation — wasteful.
- There is no orchestrator pattern where an expensive model (Opus) plans and delegates to cheaper models (Sonnet, Haiku) for execution.
### Model Tier Mapping
| Tier | Model | Use For |
|------|-------|---------|
| **complex** (orchestrator) | Claude Opus 4.6 | Planning, orchestration, complex reasoning, multi-step decisions |
| **default** (worker) | Claude Sonnet 4.5 | General conversation, tool use, code generation, channel adapters |
| **fast** (utility) | Claude Haiku 4.5 | Compaction summaries, memory extraction, classification, keyword extraction, formatting |
This maps directly to Flynn's existing `ModelTier` type. The infrastructure is already there — what's missing is the **delegation mechanism**.
### Design
#### Sub-agent spawning
Add the ability for `NativeAgent` to spawn **ephemeral sub-agents** that run a single task on a specific model tier and return the result:
```typescript
interface SubAgentRequest {
  /** Which model tier to use for this subtask. */
  tier: ModelTier;
  /** System prompt for the sub-agent (task-specific). */
  systemPrompt: string;
  /** The task message. */
  message: string;
  /** Max tokens for the response. */
  maxTokens?: number;
  /** Whether to include tools. Default: false (most subtasks are pure text). */
  tools?: boolean;
}

interface SubAgentResult {
  content: string;
  usage: TokenUsage;
  tier: ModelTier;
}
```
The sub-agent is **stateless** — no session, no history, just a single request/response. It's a thin wrapper around `modelRouter.chat()` with a specific tier.
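As a rough illustration of how thin that wrapper could be, the sketch below implements a stateless `delegate()` against an assumed router interface. `ModelRouterLike`, the shape of its `chat()` request, and the fields of `TokenUsage` are assumptions made for this example; the plan only states that `modelRouter.chat(request, tier)` exists. Depth tracking and tool support are left out.

```typescript
// Sketch only: ModelRouterLike and the request/usage shapes are assumptions,
// modelled loosely on the plan's `modelRouter.chat(request, tier)`.
type ModelTier = "fast" | "default" | "complex" | "local";

interface TokenUsage {
  inputTokens: number;
  outputTokens: number;
}

interface SubAgentRequest {
  tier: ModelTier;
  systemPrompt: string;
  message: string;
  maxTokens?: number;
}

interface SubAgentResult {
  content: string;
  usage: TokenUsage;
  tier: ModelTier;
}

interface ModelRouterLike {
  chat(
    request: {
      system: string;
      messages: { role: "user"; content: string }[];
      maxTokens?: number;
    },
    tier: ModelTier,
  ): Promise<{ content: string; usage: TokenUsage }>;
}

/** Run one request/response on the requested tier. No session, no history. */
async function delegate(
  router: ModelRouterLike,
  request: SubAgentRequest,
): Promise<SubAgentResult> {
  const response = await router.chat(
    {
      system: request.systemPrompt,
      messages: [{ role: "user", content: request.message }],
      maxTokens: request.maxTokens,
    },
    request.tier,
  );
  // Echo the tier back so callers can log which model class did the work.
  return { content: response.content, usage: response.usage, tier: request.tier };
}
```

Because the function holds no state, the same helper can serve compaction, memory extraction, and classification without any per-caller setup.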
#### Where delegation happens
| Task | Delegated to | Reason |
|------|-------------|--------|
| Compaction summary | **fast** (Haiku) | Summarisation is a well-defined extraction task; doesn't need complex reasoning |
| Memory fact extraction | **fast** (Haiku) | Simple extraction from conversation text |
| Message classification | **fast** (Haiku) | "Is this a command, question, or statement?" — trivial |
| Tool result summarisation | **fast** (Haiku) | Condense verbose tool output before feeding back |
| Primary conversation | **default** (Sonnet) | General-purpose agent work |
| Complex planning/reasoning | **complex** (Opus) | Multi-step planning, architecture decisions, ambiguous requests |
| Sub-agent orchestration | **complex** (Opus) | When the agent decides to break a task into subtasks |
#### Automatic tier escalation
Add optional **auto-escalation** where the primary agent (Sonnet) can recognise it's struggling and escalate to Opus:
1. If the agent hits `maxIterations` without completing the task → escalate to `complex`.
2. If the agent's response contains explicit uncertainty markers ("I'm not sure", "This is beyond...") → offer escalation.
3. Configurable: `auto_escalate: true` in config.
This is a **future enhancement** — start with explicit delegation points (compaction, memory extraction) and add auto-escalation later.
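If auto-escalation is added later, the two triggers above could be reduced to a small pure function. The following is a sketch under assumptions: `TurnOutcome`, `decideEscalation`, and the marker list are hypothetical names introduced here, and substring matching is only a crude stand-in for real uncertainty detection.

```typescript
// Hypothetical helper; none of these names exist in Flynn today.
const UNCERTAINTY_MARKERS = ["i'm not sure", "this is beyond"];

interface TurnOutcome {
  completed: boolean; // did the agent finish the task?
  iterations: number; // tool-loop iterations consumed
  responseText: string; // final assistant text
}

type EscalationDecision = "none" | "offer" | "escalate";

function decideEscalation(
  outcome: TurnOutcome,
  maxIterations: number,
  autoEscalate: boolean,
): EscalationDecision {
  // Mirrors `auto_escalate: false` in config: the feature is opt-in.
  if (!autoEscalate) return "none";

  // Rule 1: ran out of iterations without completing the task.
  if (!outcome.completed && outcome.iterations >= maxIterations) return "escalate";

  // Rule 2: explicit uncertainty in the response, so offer rather than force.
  const text = outcome.responseText.toLowerCase();
  if (UNCERTAINTY_MARKERS.some((marker) => text.includes(marker))) return "offer";

  return "none";
}
```

Keeping the decision separate from the agent loop makes it easy to unit-test before any escalation wiring exists.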
#### AgentOrchestrator class
Create a new `AgentOrchestrator` that sits between the channel message handler and the `NativeAgent`:
```typescript
class AgentOrchestrator {
  private primaryAgent: NativeAgent; // default tier (Sonnet)
  private modelRouter: ModelRouter;

  /** Spawn a sub-agent for a single-turn task on a specific tier. */
  async delegate(request: SubAgentRequest): Promise<SubAgentResult>;

  /** Process a user message: delegates to the primary agent, which may internally delegate subtasks. */
  async process(userMessage: string): Promise<string>;
}
```
The orchestrator replaces the current direct `NativeAgent` usage in the message router (`src/daemon/index.ts:139-186`).
#### Passing the orchestrator to tools and compaction
The key insight: **compaction and memory extraction don't need a new agent class** — they just need access to `modelRouter.chat(request, 'fast')`. The orchestrator provides a `delegate()` method that any subsystem can call:
```typescript
// In compaction.ts
const summary = await orchestrator.delegate({
  tier: 'fast',
  systemPrompt: COMPACTION_SYSTEM_PROMPT,
  message: `Summarise this conversation:\n\n${messagesToCompact}`,
  maxTokens: 1024,
});

// In memory extraction
const facts = await orchestrator.delegate({
  tier: 'fast',
  systemPrompt: MEMORY_EXTRACTION_PROMPT,
  message: `Extract key facts from:\n\n${summary}`,
  maxTokens: 512,
});
```
### New files
| File | Purpose |
|------|---------|
| `src/backends/native/orchestrator.ts` | `AgentOrchestrator` — sub-agent spawning and delegation |
| `src/backends/native/prompts.ts` | System prompts for delegated tasks (compaction, extraction, classification) |
### Changes to existing files
| File | Change |
|------|--------|
| `src/backends/native/agent.ts` | Accept optional `orchestrator` reference for internal delegation. Add `delegateSubtask()` method. |
| `src/daemon/index.ts` | Replace direct `NativeAgent` creation in `createMessageRouter()` with `AgentOrchestrator`. |
| `src/config/schema.ts` | Add `agents` config block for tier assignment and delegation policy. |
| `src/models/router.ts` | No changes needed — already supports `chat(request, tier)`. |
### Config additions
```yaml
agents:
  primary_tier: default            # Model tier for main conversation (Sonnet)
  delegation:
    compaction: fast               # Tier for compaction summaries (Haiku)
    memory_extraction: fast        # Tier for memory fact extraction (Haiku)
    classification: fast           # Tier for message classification (Haiku)
    tool_summarisation: fast       # Tier for condensing tool output (Haiku)
    complex_reasoning: complex     # Tier for escalated reasoning (Opus)
  auto_escalate: false             # Future: auto-escalate on failure
  max_delegation_depth: 3          # Prevent infinite delegation chains
```
### Implementation steps
1. Create `src/backends/native/orchestrator.ts`:
- Constructor takes `ModelRouter`, `systemPrompt`, `session`, `toolRegistry`, `toolExecutor`, delegation config.
- `delegate(request: SubAgentRequest): Promise<SubAgentResult>` — single-turn call to `modelRouter.chat()` with specified tier.
- `process(userMessage: string): Promise<string>` — delegates to internal `NativeAgent`.
- Tracks delegation depth to prevent loops.
- Logs tier usage for cost visibility.
2. Create `src/backends/native/prompts.ts` with task-specific system prompts.
3. Update `createMessageRouter()` in `src/daemon/index.ts` to use `AgentOrchestrator` instead of raw `NativeAgent`.
4. Add `agents` config block to schema.
5. Wire delegation config through to compaction (Phase 1) and memory (Phase 2).
6. Tests: delegation routing, tier selection, depth limiting.
### Cost implications
| Operation | Without delegation | With delegation |
|-----------|-------------------|-----------------|
| Compaction summary | Opus/Sonnet ($$$) | Haiku ($) |
| Memory extraction | Opus/Sonnet ($$$) | Haiku ($) |
| 10 classifications | Opus/Sonnet ($$$) | Haiku ($) |
| Complex reasoning | Sonnet ($$) | Opus ($$$) — but only when needed |
Net effect: **significant cost reduction** for background tasks, with targeted spend on complex reasoning only when it matters.

---
## Phase 1: Context Compaction (P0)
### Problem
Flynn sends the **entire session history** to the model on every turn. There is no summarisation, trimming, or token budgeting. Once a conversation exceeds the model's context window, it fails hard.
**Current flow** (`src/backends/native/agent.ts:92-165`):
```
toolLoop() → loopMessages = full this.history → send to model
```
The `SessionStore` (`src/session/store.ts`) and `ManagedSession` (`src/session/manager.ts`) store every message verbatim and replay them all on load.
### Design
#### Token counting
Add a `tokenCount` utility that estimates token counts per message. Two strategies:
1. **Cheap estimate** — character-based heuristic (`chars / 4` for English). Good enough for budgeting.
2. **Accurate count** — use the Anthropic SDK's `count_tokens` or `tiktoken` for OpenAI. Only needed if we want precise billing.
Start with the cheap estimate; add accurate counting later behind a flag.
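A minimal sketch of the cheap estimate follows. The `Message` shape and the per-message overhead constant are assumptions for illustration; only the `chars / 4` heuristic comes from the plan.

```typescript
// Assumed message shape; Flynn's real `Message` type may carry more fields.
interface Message {
  role: "system" | "user" | "assistant";
  content: string;
}

const CHARS_PER_TOKEN = 4; // heuristic from the plan, tuned for English text
const PER_MESSAGE_OVERHEAD = 4; // assumption: small allowance for role/framing tokens

/** Estimate the token count of a string by character length. */
function estimateTokens(text: string): number {
  return Math.ceil(text.length / CHARS_PER_TOKEN);
}

/** Estimate the token count of a whole history, including per-message overhead. */
function estimateMessageTokens(messages: Message[]): number {
  return messages.reduce(
    (sum, message) => sum + estimateTokens(message.content) + PER_MESSAGE_OVERHEAD,
    0,
  );
}
```

Rounding up keeps the estimate conservative, which is the safer direction for a budget check: compaction triggers slightly early rather than slightly late.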
#### Compaction strategy
Use a **summarise-and-replace** approach (same as OpenClaw):
1. When total estimated tokens exceed a **compaction threshold** (configurable, default: 80% of model's context window), trigger compaction.
2. Take all messages **except the last N turns** (configurable, default: 4 turns).
3. **Delegate** the summarisation request to the **fast tier (Haiku)** via `orchestrator.delegate()`: "Summarise this conversation so far, preserving key facts, decisions, and context." This is a well-defined extraction task that doesn't need complex reasoning.
4. Replace the older messages with a single `[system_summary]` message.
5. Persist the compacted history to SQLite (replace the old messages).
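The trigger check and the split in steps 1, 2, and 4 can be sketched as pure functions, leaving the delegated summary call and the SQLite write to the caller. The helper names are hypothetical, and treating one turn as exactly two messages (user plus assistant) is an assumption that would need adjusting for tool-call messages.

```typescript
// Assumed message shape for the sketch.
interface Message {
  role: "system" | "user" | "assistant";
  content: string;
}

/** Step 1: compact once the estimate exceeds the configured share of the window. */
function shouldCompact(
  estimatedTokens: number,
  contextWindow: number,
  thresholdPct = 80,
): boolean {
  return estimatedTokens > (contextWindow * thresholdPct) / 100;
}

/** Step 2: everything except the last `keepTurns` turns is eligible for summarising. */
function splitForCompaction(
  history: Message[],
  keepTurns = 4,
): { toCompact: Message[]; recent: Message[] } {
  // Assumption: one turn = one user message + one assistant reply.
  const keepCount = Math.min(history.length, keepTurns * 2);
  const cut = history.length - keepCount;
  return { toCompact: history.slice(0, cut), recent: history.slice(cut) };
}

/** Step 4: replace the older messages with a single summary message. */
function applySummary(summary: string, recent: Message[]): Message[] {
  return [{ role: "system", content: `[system_summary]\n${summary}` }, ...recent];
}
```

With a 200,000-token window and the default 80% threshold, compaction would start once the estimate passes 160,000 tokens.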
#### Where compaction runs
Compaction is a concern of `AgentOrchestrator` (Phase 0), not the session store. The orchestrator decides when to compact based on the model it's using, and delegates the summary generation to the **fast** tier via `orchestrator.delegate({ tier: 'fast', ... })`.
#### New files
| File | Purpose |
|------|---------|
| `src/context/tokens.ts` | Token estimation utilities |
| `src/context/compaction.ts` | Compaction logic (summarise + replace) |
#### Changes to existing files
| File | Change |
|------|--------|
| `src/backends/native/agent.ts` | Add compaction config to `NativeAgentConfig`. The `compactIfNeeded()` call itself lives in `AgentOrchestrator.process()` (see implementation step 6) and runs before `loopMessages` is built. |
| `src/session/manager.ts` | Add `ManagedSession.replaceHistory(messages)` method for compaction to persist the compacted state. |
| `src/session/store.ts` | Add `replaceMessages(sessionId, messages)` — atomic delete + re-insert in a transaction. |
| `src/models/types.ts` | Add optional `contextWindow` field to `ChatResponse` or create a `ModelCapabilities` type. |
| `src/config/schema.ts` | Add `compaction` config block: `{ enabled, threshold_pct, keep_turns, summary_model? }`. |
| `src/daemon/index.ts` | Pass compaction config to agent creation. |
#### Config additions
```yaml
compaction:
  enabled: true
  threshold_pct: 80        # Trigger at 80% of context window
  keep_turns: 4            # Always keep the last 4 exchanges
  # summary_tier is configured in agents.delegation.compaction (default: fast/Haiku)
```
#### Chat commands
| Command | Description |
|---------|-------------|
| `/compact` | Force compaction of the current session immediately. |
#### Implementation steps
1. Create `src/context/tokens.ts` with `estimateTokens(text: string): number` and `estimateMessageTokens(messages: Message[]): number`.
2. Create `src/context/compaction.ts` with `compactHistory(opts: CompactionOpts): Promise<Message[]>`:
- Takes messages, orchestrator (for delegation), keep_turns.
- Calls `orchestrator.delegate({ tier: 'fast', ... })` for the summary.
- Returns `[summaryMessage, ...recentMessages]`.
3. Add `replaceMessages()` to `SessionStore`.
4. Add `replaceHistory()` to `ManagedSession`.
5. Add compaction config to schema.
6. Wire `compactIfNeeded()` into `AgentOrchestrator.process()` — called before building the request, checks token budget.
7. Add `/compact` command handling in the message router.
8. Tests: token estimation accuracy, compaction trigger logic, history replacement, delegation to fast tier.
#### Model context window sizes
Hard-code a lookup table in `src/context/tokens.ts`:
```typescript
const CONTEXT_WINDOWS: Record<string, number> = {
  'claude-sonnet-4-20250514': 200_000,
  'claude-3-5-haiku-20241022': 200_000,
  'gpt-4o': 128_000,
  'gpt-4o-mini': 128_000,
  // ... etc
};
```
Allow override in config: `models.default.context_window: 128000`.

---
## Phase 2: Memory System (P0)
### Problem
Flynn has no persistent knowledge across sessions. Every new session starts blank. The agent can't remember user preferences, past decisions, or accumulated knowledge.
### Design
A lightweight memory system with three layers:
1. **Memory files**: Markdown files that the agent can read/write (like OpenClaw's `MEMORY.md`).
2. **Memory tools**: `memory.read`, `memory.write`, `memory.search` builtin tools.
3. **Auto-indexing**: After compaction, key facts are extracted and appended to memory.
#### Storage
Use a dedicated SQLite table in the existing `sessions.db` (or a separate `memory.db`):
```sql
CREATE TABLE memory_entries (
  id INTEGER PRIMARY KEY AUTOINCREMENT,
  session_id TEXT,              -- NULL for global memories
  namespace TEXT NOT NULL,      -- 'user', 'facts', 'preferences', etc.
  key TEXT NOT NULL,
  content TEXT NOT NULL,
  embedding BLOB,               -- Future: vector embedding for search
  created_at INTEGER NOT NULL DEFAULT (unixepoch()),
  updated_at INTEGER NOT NULL DEFAULT (unixepoch())
);

CREATE INDEX idx_memory_ns ON memory_entries(namespace);
CREATE INDEX idx_memory_session ON memory_entries(session_id);
```
#### Phase 2a: File-based memory (MVP)
The simplest useful memory: a markdown file per namespace in `~/.local/share/flynn/memory/`.
```
~/.local/share/flynn/memory/
├── global.md              # Cross-session knowledge
├── user.md                # User preferences, facts about the user
└── sessions/
    └── {session_id}.md    # Per-session notes
```
#### Memory tools
| Tool | Description |
|------|-------------|
| `memory.read` | Read a memory file by namespace. Args: `{ namespace: string }` |
| `memory.write` | Append to or replace a memory file. Args: `{ namespace: string, content: string, mode: 'append' \| 'replace' }` |
| `memory.search` | Search across all memory files for a keyword. Args: `{ query: string }`. Returns matching lines with context. |
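The three tools could all sit on top of one small file-backed store. The sketch below is one possible shape, not the planned implementation: the `MemoryStore` method signatures follow the implementation steps in this phase, while `SearchHit`, `DEFAULT_MEMORY_DIR`, and the namespace validation rule are assumptions. It handles only top-level namespaces (not `sessions/{session_id}`) and returns matching lines without surrounding context.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

type WriteMode = "append" | "replace";

interface SearchHit {
  namespace: string;
  line: number; // 1-based line number within the namespace file
  text: string;
}

// Default location taken from the plan's `memory.dir` setting.
const DEFAULT_MEMORY_DIR = path.join(os.homedir(), ".local", "share", "flynn", "memory");

class MemoryStore {
  private readonly dir: string;

  constructor(dir: string = DEFAULT_MEMORY_DIR) {
    this.dir = dir;
    fs.mkdirSync(dir, { recursive: true });
  }

  private fileFor(namespace: string): string {
    // Assumption: restrict names so a tool call cannot escape the memory directory.
    if (!/^[A-Za-z0-9_-]+$/.test(namespace)) {
      throw new Error(`Invalid namespace: ${namespace}`);
    }
    return path.join(this.dir, `${namespace}.md`);
  }

  read(namespace: string): string {
    const file = this.fileFor(namespace);
    return fs.existsSync(file) ? fs.readFileSync(file, "utf8") : "";
  }

  write(namespace: string, content: string, mode: WriteMode): void {
    const file = this.fileFor(namespace);
    if (mode === "append") fs.appendFileSync(file, content + "\n");
    else fs.writeFileSync(file, content + "\n");
  }

  listNamespaces(): string[] {
    return fs
      .readdirSync(this.dir)
      .filter((name) => name.endsWith(".md"))
      .map((name) => name.slice(0, -3))
      .sort();
  }

  /** Case-insensitive substring match, line by line, across all namespaces. */
  search(query: string): SearchHit[] {
    const needle = query.toLowerCase();
    const hits: SearchHit[] = [];
    for (const namespace of this.listNamespaces()) {
      this.read(namespace)
        .split("\n")
        .forEach((text, index) => {
          if (text.toLowerCase().includes(needle)) {
            hits.push({ namespace, line: index + 1, text });
          }
        });
    }
    return hits;
  }
}
```

Synchronous file calls keep the sketch short; an implementation inside the daemon would more likely use `fs.promises` to avoid blocking the event loop.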
#### Phase 2b: Vector search (future)
Defer vector embeddings and semantic search to a later phase. The file-based approach with keyword search covers 80% of use cases.
When implemented:
- Add `sqlite-vec` or similar for vector storage
- Embed memory entries on write using the configured model's embedding API
- Hybrid search: keyword (BM25) + vector similarity
#### System prompt integration
On every agent turn, inject a `[Memory Context]` section into the system prompt:
```
# Memory Context
The following is your persistent memory. Use it to maintain continuity across sessions.
## User
{contents of user.md, truncated to ~1000 tokens}
## Global
{contents of global.md, truncated to ~1000 tokens}
```
This is injected dynamically by the agent before each request, not baked into the static system prompt.
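A sketch of that dynamic injection is shown below. The function names are hypothetical, and the budget reuses the `chars / 4` estimate from Phase 1. When a file is over budget the sketch keeps the tail, on the assumption that appended entries are newest-last, which lines up with the "truncate oldest entries" mitigation in the risk table.

```typescript
const CHARS_PER_TOKEN = 4; // same rough heuristic proposed for token budgeting

/** Keep at most `maxTokens` worth of text, dropping the oldest (leading) content. */
function truncateToTokens(text: string, maxTokens: number): string {
  const maxChars = Math.max(0, maxTokens) * CHARS_PER_TOKEN;
  if (text.length <= maxChars) return text;
  if (maxChars === 0) return "";
  return "[older entries truncated]\n" + text.slice(-maxChars);
}

/** Build the memory section using the layout described above. */
function buildMemoryContext(
  userMemory: string,
  globalMemory: string,
  perSectionTokens = 1000,
): string {
  return [
    "# Memory Context",
    "The following is your persistent memory. Use it to maintain continuity across sessions.",
    "## User",
    truncateToTokens(userMemory.trim(), perSectionTokens),
    "## Global",
    truncateToTokens(globalMemory.trim(), perSectionTokens),
  ].join("\n");
}

/** Append the memory section to the static system prompt for a single request. */
function withMemory(baseSystemPrompt: string, userMemory: string, globalMemory: string): string {
  return `${baseSystemPrompt}\n\n${buildMemoryContext(userMemory, globalMemory)}`;
}
```

Because `withMemory()` returns a new string per request, the static system prompt stays untouched and memory edits take effect on the next turn.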
#### Auto-extraction after compaction
When compaction runs (Phase 1), add a follow-up step using the **fast tier (Haiku)** via `orchestrator.delegate()`:
1. Along with the summary, delegate to Haiku to extract any **new facts worth remembering** (user preferences, decisions, names, etc.). This is a simple extraction task — no need for Sonnet/Opus.
2. Append extracted facts to `user.md` or `global.md`.
This creates a natural knowledge accumulation loop: conversation → compaction (Haiku) → memory extraction (Haiku) → next session gets richer context.
The cost of these background operations is minimal since they run on the cheapest model tier.
#### New files
| File | Purpose |
|------|---------|
| `src/memory/store.ts` | MemoryStore class — read/write/search markdown files |
| `src/memory/index.ts` | Exports |
| `src/tools/builtin/memory-read.ts` | `memory.read` tool |
| `src/tools/builtin/memory-write.ts` | `memory.write` tool |
| `src/tools/builtin/memory-search.ts` | `memory.search` tool |
#### Changes to existing files
| File | Change |
|------|--------|
| `src/tools/builtin/index.ts` | Register memory tools in `allBuiltinTools` |
| `src/backends/native/orchestrator.ts` | Inject memory context into system prompt before each request |
| `src/context/compaction.ts` | Add memory extraction step after summarisation (delegates to fast tier) |
| `src/daemon/index.ts` | Initialize MemoryStore, pass to orchestrator config |
| `src/config/schema.ts` | Add `memory` config block: `{ enabled, dir, namespaces, auto_extract }` |
#### Config additions
```yaml
memory:
  enabled: true
  dir: ~/.local/share/flynn/memory
  auto_extract: true          # Extract facts during compaction
  max_context_tokens: 2000    # Max tokens injected per turn from memory
```
#### Implementation steps
1. Create `src/memory/store.ts`:
- `read(namespace): string` — read file contents
- `write(namespace, content, mode): void` — append or replace
- `search(query): SearchResult[]` — line-by-line keyword match with context
- `listNamespaces(): string[]`
2. Create memory tools (3 files).
3. Register tools.
4. Add memory context injection to `AgentOrchestrator` (as listed in the changes table above): load memory before building the request and inject it into the system prompt.
5. Add memory extraction to compaction flow.
6. Tests: memory CRUD, search, injection, extraction.
---
## Phase 3: Messaging Channels (P1)
### Problem
Flynn has only Telegram and WebChat. The three most requested channels are WhatsApp, Discord, and Slack.
### Design approach
Flynn's `ChannelAdapter` interface (`src/channels/types.ts:51-69`) is clean and well-defined. Adding a new channel means:
1. Implement `ChannelAdapter` (6 members: `name`, `status`, `connect()`, `disconnect()`, `send()`, `onMessage()`).
2. Add config section.
3. Register in daemon startup.
Each channel is independent — implement in any order.
### 3a: Discord
**Library:** `discord.js` v14
**Effort:** 1-2 days
#### Config
```yaml
discord:
  bot_token: ${DISCORD_BOT_TOKEN}
  allowed_guild_ids: []       # Empty = all guilds
  allowed_channel_ids: []     # Empty = all channels
```
#### New files
| File | Purpose |
|------|---------|
| `src/channels/discord/adapter.ts` | DiscordAdapter implementing ChannelAdapter |
| `src/channels/discord/index.ts` | Exports |
#### Key decisions
- **Peer ID:** Use `channelId` (not `userId`) so the agent maintains separate sessions per Discord channel.
- **Message chunking:** Discord has a 2000-char limit. Chunk long responses.
- **Mentions:** Only respond when mentioned (`@Flynn`) or in DMs. Configurable.
- **Slash commands:** Register `/reset` and `/status` as Discord slash commands.
#### Implementation steps
1. Add `discord.js` dependency.
2. Create `DiscordAdapter` class.
3. Add config schema for `discord` section.
4. Register in daemon if `config.discord.bot_token` is set.
5. Export from `src/channels/index.ts`.
6. Test with a bot in a private server.
### 3b: Slack
**Library:** `@slack/bolt` (Bolt for JavaScript)
**Effort:** 1-2 days
#### Config
```yaml
slack:
  bot_token: ${SLACK_BOT_TOKEN}
  app_token: ${SLACK_APP_TOKEN}    # For Socket Mode
  signing_secret: ${SLACK_SIGNING_SECRET}
  allowed_channel_ids: []
```
#### New files
| File | Purpose |
|------|---------|
| `src/channels/slack/adapter.ts` | SlackAdapter implementing ChannelAdapter |
| `src/channels/slack/index.ts` | Exports |
#### Key decisions
- **Socket Mode** for self-hosted deployments (no public URL needed). Falls back to HTTP events if `app_token` not set.
- **Peer ID:** `channelId:threadTs` to isolate threaded conversations.
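- **Message chunking:** Slack truncates message `text` beyond 40,000 characters, and the text of a single section block is capped at 3,000 characters, so long responses still need chunking. Use `mrkdwn` formatting.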
- **Slash commands:** `/flynn-reset`, `/flynn-status`.
### 3c: WhatsApp
**Library:** `whatsapp-web.js` (or `@whiskeysockets/baileys` for full WhatsApp Web protocol)
**Effort:** 2-3 days (more complex due to QR auth)
#### Config
```yaml
whatsapp:
  auth_dir: ~/.local/share/flynn/whatsapp-auth
  allowed_numbers: []    # E.164 format, empty = all
```
#### Key decisions
- **Auth flow:** WhatsApp Web requires QR code scanning on first connect. Display QR in terminal on startup.
- **Session persistence:** Store auth state in `auth_dir` so re-auth isn't needed on restart.
- **Peer ID:** Phone number (E.164).
- **Media:** Start with text-only; defer image/audio handling.
**WhatsApp is the most complex channel.** Consider doing Discord and Slack first, then WhatsApp.
### Shared channel infrastructure
Before implementing individual channels, extract any common patterns:
1. **Message chunking utility** (`src/channels/utils/chunking.ts`): `chunkMessage(text: string, maxLen: number): string[]`
2. **Allowlist checking** (`src/channels/utils/auth.ts`): `isAllowed(senderId: string, allowlist: string[]): boolean`
3. **Markdown adaptation** (`src/channels/utils/markdown.ts`): platform-specific markdown conversion (Discord uses different syntax from Telegram).
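The first two utilities are small enough to sketch in full. The signatures match the ones listed above; the break-point preference (newline, then space, then hard split) and the "empty allowlist means allow everyone" rule are assumptions, the latter taken from the `# Empty = all` comments in the channel configs. The chunker does not try to keep fenced code blocks intact.

```typescript
/** Split `text` into pieces of at most `maxLen` characters, preferring natural breaks. */
function chunkMessage(text: string, maxLen: number): string[] {
  if (maxLen <= 0) throw new Error("maxLen must be positive");
  const chunks: string[] = [];
  let rest = text;

  while (rest.length > maxLen) {
    // Prefer the last newline inside the window, then the last space.
    let cut = rest.lastIndexOf("\n", maxLen);
    if (cut <= 0) cut = rest.lastIndexOf(" ", maxLen);
    if (cut <= 0) cut = maxLen; // no natural break: hard split

    const piece = rest.slice(0, cut).trimEnd();
    if (piece.length > 0) chunks.push(piece);
    rest = rest.slice(cut).trimStart();
  }

  if (rest.length > 0) chunks.push(rest);
  return chunks;
}

/** An empty allowlist allows every sender; otherwise the sender must be listed. */
function isAllowed(senderId: string, allowlist: string[]): boolean {
  return allowlist.length === 0 || allowlist.includes(senderId);
}
```

For Discord, a caller would pass `2000` as `maxLen` and send the returned pieces in order.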
---
## Phase 4: Web Search Tool (P1)
### Problem
The agent has no way to search the web. This is one of the most commonly-used agent tools.
### Design
#### Provider options
| Provider | Pros | Cons |
|----------|------|------|
| **Brave Search API** | Free tier (2k/month), clean API, good results | Requires API key signup |
| **SearXNG** | Self-hosted, no API key, already running in homelab | Results quality varies |
| **Tavily** | Purpose-built for AI agents, great results | Paid only |
| **DuckDuckGo** | No API key needed | Unofficial API, rate limits |
**Recommendation:** Support Brave as primary, SearXNG as self-hosted alternative. Make the provider configurable.
#### Config
```yaml
tools:
  web_search:
    provider: brave               # brave | searxng | tavily
    api_key: ${BRAVE_SEARCH_API_KEY}
    endpoint: null                # Override for SearXNG: http://searxng:8080
    max_results: 5
```
#### New files
| File | Purpose |
|------|---------|
| `src/tools/builtin/web-search.ts` | `web.search` tool |
#### Tool interface
```typescript
{
  name: 'web.search',
  description: 'Search the web for information. Returns titles, URLs, and snippets.',
  inputSchema: {
    type: 'object',
    properties: {
      query: { type: 'string', description: 'Search query' },
      count: { type: 'number', description: 'Number of results (default 5, max 20)' },
    },
    required: ['query'],
  },
}
```
#### Output format
```
1. **Title** — url
Snippet text...
2. **Title** — url
Snippet text...
```
Structured as markdown so the model can easily parse and reference results.
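A provider-independent formatter for that output could look like the sketch below. `SearchResult` is an assumed normalised shape that each provider's raw response would be mapped into; the provider clients themselves are not shown. The separator is written as a Unicode escape so it matches the dash used in the output format above.

```typescript
// Assumed normalised result shape, independent of Brave/SearXNG/Tavily responses.
interface SearchResult {
  title: string;
  url: string;
  snippet: string;
}

const SEPARATOR = " \u2014 "; // the dash between title and URL in the plan's format

/** Render results as a numbered markdown list: title, URL, then indented snippet. */
function formatResults(results: SearchResult[], maxResults = 5): string {
  if (results.length === 0) return "No results found.";
  return results
    .slice(0, maxResults)
    .map(
      (result, index) =>
        `${index + 1}. **${result.title}**${SEPARATOR}${result.url}\n   ${result.snippet}`,
    )
    .join("\n");
}
```

Returning an explicit "No results found." string, rather than an empty one, gives the model something unambiguous to react to.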
#### Implementation steps
1. Create `src/tools/builtin/web-search.ts`.
2. Add Brave Search API client (simple `fetch` — no SDK needed).
3. Add SearXNG support as alternative backend.
4. Add tool config section to schema.
5. Register in `allBuiltinTools`.
6. Tests: mock API responses, result formatting.
---
## Phase 5: Background Exec / Process Management (P1)
### Problem
Flynn's `shell.exec` (`src/tools/builtin/shell.ts`) is synchronous and blocking: it runs a command, waits for it to finish (up to a 30s timeout), and returns stdout/stderr. There's no way to:
- Run a long-running process (e.g., `npm run dev`)
- Check on a running process
- Read its ongoing output
- Kill it
### Design
Add a `process` tool family that manages background processes:
| Tool | Description |
|------|-------------|
| `process.start` | Start a command in the background. Returns a process ID. |
| `process.status` | Check if a process is running, exited, or errored. |
| `process.output` | Read recent stdout/stderr from a background process. |
| `process.kill` | Kill a background process. |
| `process.list` | List all managed background processes. |
#### Process manager
Create a `ProcessManager` class that maintains a registry of spawned processes:
```typescript
interface ManagedProcess {
  id: string;
  command: string;
  cwd?: string;
  pid: number;
  status: 'running' | 'exited' | 'killed' | 'error';
  exitCode?: number;
  outputBuffer: RingBuffer; // Last N bytes of combined stdout+stderr
  startedAt: number;
}
```
#### Output buffering
Use a ring buffer (circular buffer) to keep the last 64KB of output per process. This prevents memory leaks from long-running processes with verbose output.
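A byte-oriented ring buffer along these lines would be enough for the `outputBuffer` field. This is a minimal sketch, not a tuned implementation: it copies byte by byte for clarity, and it returns raw bytes, so the caller decodes them and should tolerate a partial UTF-8 sequence at the start after a wrap.

```typescript
class RingBuffer {
  private readonly buf: Uint8Array;
  private start = 0; // index of the oldest byte
  private length = 0; // number of valid bytes currently stored

  constructor(capacity: number) {
    if (capacity <= 0) throw new Error("capacity must be positive");
    this.buf = new Uint8Array(capacity);
  }

  /** Append bytes, overwriting the oldest data once the buffer is full. */
  write(chunk: Uint8Array): void {
    const cap = this.buf.length;
    // Only the last `cap` bytes of an oversized chunk can survive anyway.
    const data = chunk.length > cap ? chunk.subarray(chunk.length - cap) : chunk;
    for (let i = 0; i < data.length; i++) {
      const end = (this.start + this.length) % cap;
      this.buf[end] = data[i];
      if (this.length < cap) this.length++;
      else this.start = (this.start + 1) % cap; // dropped the oldest byte
    }
  }

  /** Return the buffered bytes in arrival order (oldest first). */
  read(): Uint8Array {
    const out = new Uint8Array(this.length);
    for (let i = 0; i < this.length; i++) {
      out[i] = this.buf[(this.start + i) % this.buf.length];
    }
    return out;
  }
}
```

A process entry would construct it as `new RingBuffer(64 * 1024)` and feed it every stdout and stderr chunk as it arrives.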
#### Safety
- **Max processes:** Limit to 10 concurrent background processes.
- **Auto-cleanup:** Kill processes that have been running for more than 1 hour (configurable).
- **Shutdown cleanup:** Kill all managed processes on daemon shutdown.
- **Hook integration:** `process.start` should go through the confirmation engine (same as `shell.exec`).
#### New files
| File | Purpose |
|------|---------|
| `src/tools/builtin/process/manager.ts` | ProcessManager class |
| `src/tools/builtin/process/start.ts` | `process.start` tool |
| `src/tools/builtin/process/status.ts` | `process.status` tool |
| `src/tools/builtin/process/output.ts` | `process.output` tool |
| `src/tools/builtin/process/kill.ts` | `process.kill` tool |
| `src/tools/builtin/process/list.ts` | `process.list` tool |
| `src/tools/builtin/process/index.ts` | Exports |
#### Changes to existing files
| File | Change |
|------|--------|
| `src/tools/builtin/index.ts` | Register process tools |
| `src/daemon/index.ts` | Create ProcessManager, pass to tool constructors, register shutdown handler |
| `src/config/schema.ts` | Add `process` config: `{ max_concurrent, max_runtime_minutes, buffer_size }` |
#### Implementation steps
1. Implement `RingBuffer` utility (or use an npm package like `ringbufferjs`).
2. Create `ProcessManager` class with spawn, track, kill, cleanup methods.
3. Implement 5 process tools.
4. Register tools and wire shutdown cleanup.
5. Tests: spawn + kill lifecycle, output buffering, max process limits.
---
## Phase 6: Enhanced web_fetch (P1)
### Problem
Flynn's `web.fetch` (`src/tools/builtin/web-fetch.ts:19-50`) is a bare `fetch()` call that returns raw HTML. This is nearly useless for LLMs — they need extracted text/markdown, not raw HTML with scripts and styles.
### Design
#### Enhancements
1. **HTML-to-markdown extraction** — Strip scripts/styles, convert to markdown using `@mozilla/readability` + `turndown`.
2. **Format parameter** — Let the agent choose: `text`, `markdown` (default), or `html`.
3. **Response caching** — Cache fetched pages for 5 minutes to avoid redundant requests in tool loops.
4. **Redirect following** — Already handled by `fetch()`, but add a max redirect limit.
5. **Content type handling** — Return JSON prettified, plain text as-is, HTML converted.
#### Libraries
| Package | Purpose |
|---------|---------|
| `turndown` | HTML → Markdown converter |
| `linkedom` | Lightweight DOM implementation (for Readability) |
| `@mozilla/readability` | Extract article content from HTML |
Using `linkedom` instead of `jsdom` — it's much lighter and sufficient for content extraction.
#### Tool interface update
```typescript
{
  name: 'web.fetch',
  description: 'Fetch a URL and extract its content. Returns clean text/markdown by default, not raw HTML.',
  inputSchema: {
    type: 'object',
    properties: {
      url: { type: 'string', description: 'The URL to fetch' },
      format: { type: 'string', enum: ['markdown', 'text', 'html'], description: 'Output format (default: markdown)' },
      timeout: { type: 'number', description: 'Timeout in milliseconds (default 15000)' },
    },
    required: ['url'],
  },
}
```
#### Caching
Simple in-memory cache with TTL:
```typescript
const cache = new Map<string, { content: string; timestamp: number }>();
const CACHE_TTL = 5 * 60 * 1000; // 5 minutes
```
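Wrapped in a small class, the same map can handle expiry on read. The sketch below is one way to do it; `FetchCache` and `cacheKey` are hypothetical names, and the injectable clock exists only so the TTL logic can be tested without waiting. Including the format in the key is an assumption that follows from the `format` parameter: the same URL produces different output per format.

```typescript
interface CacheEntry {
  content: string;
  timestamp: number;
}

const CACHE_TTL = 5 * 60 * 1000; // 5 minutes, as in the plan

class FetchCache {
  private readonly entries = new Map<string, CacheEntry>();
  private readonly ttlMs: number;
  private readonly now: () => number;

  constructor(ttlMs: number = CACHE_TTL, now: () => number = Date.now) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  /** Return cached content, or undefined when missing or older than the TTL. */
  get(key: string): string | undefined {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (this.now() - entry.timestamp >= this.ttlMs) {
      this.entries.delete(key); // expired: evict lazily on read
      return undefined;
    }
    return entry.content;
  }

  set(key: string, content: string): void {
    this.entries.set(key, { content, timestamp: this.now() });
  }
}

/** Key on format as well as URL so markdown and html results do not collide. */
function cacheKey(url: string, format: "markdown" | "text" | "html"): string {
  return `${format}:${url}`;
}
```

Lazy eviction only removes entries that are read again, so a long-running daemon would also want a size cap or a periodic sweep.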
#### Changes to existing files
| File | Change |
|------|--------|
| `src/tools/builtin/web-fetch.ts` | Major rewrite — add extraction, caching, format parameter |
#### Implementation steps
1. Add `turndown`, `linkedom`, `@mozilla/readability` dependencies.
2. Create extraction pipeline: fetch → parse DOM → readability → turndown → clean markdown.
3. Add format parameter handling.
4. Add response caching.
5. Update tool description to reflect new capabilities.
6. Tests: extraction from sample HTML, caching behaviour, format handling.
---
## Implementation Order
```
Week 1: Phase 0 (Multi-Model Delegation) ─────────────────────── P0 (foundational)
Week 2: Phase 1 (Context Compaction) ─────────────────────────── P0 (uses delegation)
Week 3: Phase 2 (Memory System) ──────────────────────────────── P0 (uses delegation)
Week 4: Phase 4 (Web Search) + Phase 6 (Enhanced web_fetch) ─── P1 (quick wins)
Week 5: Phase 5 (Process Management) ─────────────────────────── P1
Week 6+: Phase 3 (Channels: Discord → Slack → WhatsApp) ──────── P1
```
**Rationale:**
- **Delegation first** — Phase 0 is foundational. Compaction and memory both need to delegate subtasks to cheaper models. Building the orchestrator first means Phase 1 and 2 can use it immediately.
- Compaction and memory are sequential (memory extraction depends on compaction).
- Web search and enhanced web_fetch are small, independent, and immediately useful — do them as palate cleansers between the big features.
- Process management is self-contained.
- Channels are the largest body of work but each is independent — can be done in parallel or interleaved.
### Model usage across all phases
| Phase | Primary model (user-facing) | Delegated tasks | Delegation tier |
|-------|---------------------------|-----------------|-----------------|
| 0 | Sonnet (default) | Sub-agent infrastructure | N/A (infrastructure) |
| 1 | Sonnet (default) | Compaction summaries | Haiku (fast) |
| 2 | Sonnet (default) | Memory fact extraction | Haiku (fast) |
| 3 | Sonnet (default) | Message classification, markdown adaptation | Haiku (fast) |
| 4 | Sonnet (default) | None (direct API call) | N/A |
| 5 | Sonnet (default) | None | N/A |
| 6 | Sonnet (default) | None | N/A |
Opus (complex) is reserved for **user-facing tasks** that require deep reasoning; it is never used for background operations.

---
## Testing Strategy
Each phase should include:
1. **Unit tests** — Pure logic (token estimation, ring buffer, markdown extraction, memory search).
2. **Integration tests** — Tool execution with mocked model responses.
3. **Manual smoke test** — Run via TUI and Telegram to verify end-to-end.
Key test files to create:
| Test file | Covers |
|-----------|--------|
| `src/backends/native/orchestrator.test.ts` | Delegation routing, tier selection, depth limiting, cost tracking |
| `src/context/tokens.test.ts` | Token estimation accuracy |
| `src/context/compaction.test.ts` | Compaction trigger logic, summary replacement, fast-tier delegation |
| `src/memory/store.test.ts` | Memory CRUD, search |
| `src/tools/builtin/web-search.test.ts` | API mocking, result formatting |
| `src/tools/builtin/process/manager.test.ts` | Process lifecycle, cleanup |
| `src/tools/builtin/web-fetch.test.ts` | HTML extraction, caching |
---
## Risk Assessment
| Risk | Impact | Mitigation |
|------|--------|------------|
| Haiku summaries lose critical context vs Sonnet | High | Validate quality; use detailed extraction prompts; allow per-task tier override in config |
| Delegation depth spirals (agent delegates to agent that delegates...) | Medium | Hard limit `max_delegation_depth: 3`; sub-agents cannot spawn sub-agents |
| Fast tier unavailable (Haiku rate limit / outage) | Medium | Fallback to default tier for delegation; log the fallback cost increase |
| Compaction summaries lose critical context | High | Keep last 4 turns intact; allow user to adjust `keep_turns`; log what was compacted |
| Memory injection bloats system prompt | Medium | Hard cap on injected memory tokens; truncate oldest entries |
| WhatsApp auth flow is fragile | Medium | Defer WhatsApp to last; use battle-tested Baileys library |
| Brave Search free tier limits (2k/month) | Low | SearXNG as free self-hosted fallback |
| Background processes leak resources | Medium | Max process limit, auto-kill timeout, shutdown cleanup |
| HTML extraction fails on JS-heavy sites | Low | Accept graceful degradation; defer CDP/browser fallback to P3 |