flynn/.planning/codebase/CONCERNS.md

# Codebase Concerns

**Analysis Date:** 2025-02-09

## Tech Debt

**God File: `src/daemon/index.ts` (1087 lines):**
- Issue: Single file handles all service wiring — model client creation, channel setup, agent factory, memory initialization, vector indexer, session pruning, lifecycle management, and graceful shutdown. This is the most complex file in the codebase.
- Files: `src/daemon/index.ts`
- Impact: Any change to wiring logic requires editing this file. High risk of merge conflicts. Difficult to test individual setup phases in isolation.
- Fix approach: Extract service factories into separate modules (e.g., `src/daemon/models.ts`, `src/daemon/channels.ts`, `src/daemon/memory.ts`). Keep `index.ts` as a thin composition root that calls into extracted modules.

**No ESLint Configuration:**
- Issue: No `eslint.config.js`, `.eslintrc.*`, or any ESLint configuration file exists. The `pnpm lint` command exists in package.json but has no config to enforce rules.
- Files: Project root (missing `eslint.config.js`)
- Impact: No automated enforcement of code style, unused imports, or common error patterns. Relying solely on TypeScript compiler checks and developer discipline.
- Fix approach: Add `eslint.config.js` with flat config format. Start with `@typescript-eslint/recommended` and add project-specific rules incrementally.

**Pervasive `as any` Casting (~100+ instances):**
- Issue: Heavy use of `as any` and `as unknown as` type casts, concentrated in two areas:
  1. **Test files** (~70%): Mocking objects with `{} as any` or `mockObj as unknown as RealType`. This is common in test code but makes refactoring fragile — tests won't fail when interfaces change.
  2. **Model clients** (~15%): SDK types don't cover all features (e.g., Ollama thinking, WhatsApp `getChat()`, Discord `sendTyping()`).
  3. **Production code** (~15%): Type system workarounds like `loopMessages as unknown as Message[]` in `src/backends/native/agent.ts:136`.
- Files: `src/models/local/ollama.ts` (lines 36, 92, 159, 173), `src/channels/whatsapp/adapter.ts` (lines 105, 224, 263, 280), `src/channels/discord/adapter.ts` (lines 208, 219), `src/backends/native/agent.ts` (line 136), `src/models/anthropic.ts` (line 83), `src/models/bedrock.ts` (line 180)
- Impact: Type safety gaps. Refactoring won't catch breakage in cast sites. SDK upgrades may silently break at runtime.
- Fix approach: For test files, create proper mock factories with typed interfaces. For model clients, use module augmentation or wrapper types. For agent.ts, define a proper `LoopMessage` → `Message` adapter.

**Agent Cache Grows Unboundedly:**
- Issue: The `agents` Map in `src/daemon/index.ts:342` caches `AgentOrchestrator` instances keyed by `channel:senderId:agentConfig`. Entries are never evicted. Each entry holds a full agent with session history, token counts, and tool state.
- Files: `src/daemon/index.ts:342-453`
- Impact: Memory grows linearly with unique senders. In a multi-channel deployment with many users, this becomes a memory leak. Long-running daemon will accumulate stale agent instances.
- Fix approach: Add LRU eviction or TTL-based pruning (similar to session store's existing TTL pattern). Consider flushing agent state to SQLite and lazy-loading.

**Tool Executor Timer Leak:**
- Issue: `Promise.race` timeout in tool executor creates a `setTimeout` that is never cleared when the tool completes before timeout. The rejected promise's setTimeout continues to fire after resolution.
- Files: `src/tools/executor.ts:61-65`
- Impact: Minor: each tool execution leaks one timer until it fires (default timeout). Under heavy tool use, this creates unnecessary timer churn. The timeout promise's rejection is unhandled (but swallowed by Promise.race semantics).
- Fix approach: Use `AbortController` + `AbortSignal.timeout()` pattern, or explicitly `clearTimeout` in a `.finally()` block. See `src/tools/builtin/web-fetch.ts:200` for the correct pattern already used elsewhere in the codebase.

**Agent Cancellation Not Implemented:**
- Issue: `TODO: Wire AbortController into NativeAgent for actual cancellation` — the gateway handler sets a flag but the agent loop has no mechanism to check it or abort mid-iteration.
- Files: `src/gateway/handlers/agent.ts:89`, `src/backends/native/agent.ts`
- Impact: Users cannot cancel long-running agent tasks. The "cancel" API endpoint exists but is a no-op beyond setting a flag. This affects gateway/TUI UX.
- Fix approach: Pass an `AbortSignal` into `NativeAgent.run()`. Check `signal.aborted` between loop iterations and before each tool execution. Propagate signal to model client `chat()` calls.

**PairingManager State is Ephemeral:** *RESOLVED*
- Resolution: `PairingStore` interface added to `PairingManager` with SQLite implementation via `SessionStore.getPairingStore()`. Approved senders are persisted in the `pairing_approved` table and survive daemon restarts. Pending codes remain in-memory (short-lived by design).
- Commits: `1e1a689`, `ecd3aca`, `62331c3`
- Files: `src/channels/pairing.ts`, `src/session/store.ts`, `src/daemon/services.ts`, `src/daemon/index.ts`

**Hardcoded Anthropic → GitHub Model Mapping:**
- Issue: `anthropicToGitHubModel()` contains a hardcoded mapping table that must be manually updated for each new Anthropic model release. A generic fallback regex exists but only handles the date-suffix stripping pattern.
- Files: `src/daemon/index.ts:155-175`
- Impact: New Anthropic model releases require a code change to work with GitHub Models provider. The fallback regex may produce incorrect names for models that don't follow the `name-N-N-YYYYMMDD` convention.
- Fix approach: Move mapping to config file or allow user overrides in YAML config. The generic fallback is a good start but should be validated against known GitHub model names.

**Duplicated Content Conversion Logic Across Model Clients:**
- Issue: Each model client independently implements `Message` → provider-specific format conversion, tool call parsing, and response normalization. The same JSON.parse pattern for tool arguments appears in 3+ clients.
- Files: `src/models/github.ts`, `src/models/openai.ts`, `src/models/local/llamacpp.ts`, `src/models/gemini.ts`, `src/models/anthropic.ts`, `src/models/bedrock.ts`
- Impact: Bug fixes or format changes must be applied across all clients. Easy to miss one. Different error handling behavior per client.
- Fix approach: Extract shared conversion utilities (e.g., `parseToolArguments()`, `convertMessages()`) into `src/models/shared.ts` or `src/models/utils.ts`.

## Known Bugs

**Tool Executor Timeout Creates Orphaned Promise:**
- Symptoms: When a tool completes before the timeout, the `setTimeout` callback fires after the race resolves and creates an unhandled rejection (the `reject` from the timeout promise).
- Files: `src/tools/executor.ts:63-65`
- Trigger: Any tool that completes normally while a timeout is pending.
- Workaround: Node.js swallows the unhandled rejection from the losing Promise.race branch in practice, but it's technically incorrect.

## Security Considerations

**File Tools Have No Path Restrictions (CRITICAL):**
- Risk: `file.read`, `file.write`, `file.edit`, `file.list` tools can access ANY path on the filesystem. An AI agent (or a prompt injection via user input) could read sensitive files (`/etc/shadow`, `~/.ssh/id_rsa`, `.env`) or write to critical system paths.
- Files: `src/tools/builtin/file-read.ts`, `src/tools/builtin/file-write.ts`, `src/tools/builtin/file-edit.ts`, `src/tools/builtin/file-list.ts`
- Current mitigation: Tool policy system (`src/tools/policy.ts`) can deny specific tools entirely, but there is NO path-level restriction. Hooks can prompt for confirmation on specific tools.
- Recommendations: Add configurable path allowlist/denylist to file tools. At minimum, deny access to common sensitive paths (`/etc/shadow`, `~/.ssh/`, any `.env` file). Consider a sandbox directory approach where file tools only operate within configured working directories.

**`browser.evaluate` Uses `eval()` (HIGH):**
- Risk: The `browser.evaluate` tool calls `eval(expr)` in the browser page context. While this executes in the Puppeteer browser sandbox (not the Node.js process), it allows arbitrary JavaScript execution on whatever page the browser has navigated to.
- Files: `src/tools/builtin/browser/tools.ts:226`
- Current mitigation: Tool policy can deny `browser.evaluate` entirely. The browser runs in a sandboxed Puppeteer context.
- Recommendations: Consider restricting to a safer evaluation mechanism or adding URL allowlists for browser navigation.

**Gateway Token Comparison Vulnerable to Timing Attacks (MEDIUM):**
- Risk: Token authentication in `src/gateway/auth.ts` uses `===` for string comparison (lines 38, 43). This is theoretically vulnerable to timing attacks where an attacker can deduce the token character by character based on comparison timing.
- Files: `src/gateway/auth.ts:38,43`
- Current mitigation: The gateway is typically accessed over Tailscale (private network), reducing attack surface.
- Recommendations: Use `crypto.timingSafeEqual()` for token comparison. The codebase already uses this correctly in `src/automation/webhooks.ts` — apply the same pattern here.

**Unguarded JSON.parse on Tool Call Arguments (MEDIUM):**
- Risk: Multiple model clients call `JSON.parse(tc.function.arguments)` without try-catch. If a model returns malformed JSON in tool arguments, this throws an unhandled exception that could crash the agent loop.
- Files: `src/models/github.ts:151,268`, `src/models/openai.ts:96`, `src/models/local/llamacpp.ts:127,262`
- Current mitigation: None. The agent loop's outer try-catch may catch this, but the error message won't be helpful.
- Recommendations: Wrap all tool argument parsing in try-catch with a fallback to `{}` or a descriptive error. Extract a shared `safeParseToolArgs()` utility.

**No Rate Limiting on Gateway (LOW-MEDIUM):**
- Risk: The WebSocket and HTTP gateway endpoints have no rate limiting. A malicious or misconfigured client could flood the daemon with requests, causing resource exhaustion.
- Files: `src/gateway/server.ts`
- Current mitigation: Authentication is required (token or Tailscale identity). The lane queue serializes per-session, which limits concurrency per sender.
- Recommendations: Add connection-level rate limiting (max connections per IP, max messages per minute per session). Consider using the `ws` server's built-in `maxPayload` option and adding message rate tracking.

## Performance Bottlenecks

**Memory Context Rebuilt on Every Message:**
- Problem: `_injectMemoryContext()` is called at the start of every `process()` call in the orchestrator. It reads from the memory store (which reads files from disk via `getContextForPrompt()`) and rebuilds the system prompt string by concatenating base prompt + memory context.
- Files: `src/backends/native/orchestrator.ts:218,347-360`, `src/memory/store.ts`
- Cause: No caching or dirty-checking. Even if memory hasn't changed since last message, the full read + concatenation happens again.
- Improvement path: Cache the enriched system prompt and only rebuild when memory store signals a change (e.g., via a dirty flag or version counter on the store).

**Background Vector Indexer Runs Unconditionally Every 30 Seconds:**
- Problem: The vector indexer `setInterval` fires every 30 seconds regardless of whether any namespaces are dirty. While `getDirtyNamespaces()` returns an empty array quickly, the interval itself and the function call overhead are unnecessary when idle.
- Files: `src/daemon/index.ts:606-637`
- Cause: Simple polling pattern with fixed interval.
- Improvement path: Use event-driven approach — trigger indexing when `memoryStore.write()` is called. Or at minimum, skip the loop body immediately when dirty set is empty (which it already does, so this is low priority).

**Web-Fetch Cache Has No Size Limit:**
- Problem: The module-level `cache` Map in web-fetch grows without bound. Entries are only evicted by TTL (5 minutes) via lazy expiry on cache hits. If many unique URLs are fetched, the cache grows until TTL-based eviction catches up.
- Files: `src/tools/builtin/web-fetch.ts:37`
- Cause: No max-size constraint on the cache Map. Eviction only happens on read (lazy).
- Improvement path: Add a max entry count (e.g., 100). Use LRU eviction when limit is reached. Or add periodic eviction via the existing `evictExpired()` function on a timer.

## Fragile Areas

**Model Client Content Format Handling:**
- Files: `src/backends/native/agent.ts:127-136`, `src/models/anthropic.ts:83`, `src/models/gemini.ts:206`
- Why fragile: The `LoopMessage` type carries structured content (tool_use blocks, tool_result blocks) that must be cast to `Message[]` via `as unknown as`. Each model client then re-interprets this structure differently. Adding a new content block type (e.g., images) requires changes in every client.
- Safe modification: When adding new content types, test against ALL model providers, not just the primary one. The cast at agent.ts:136 hides type mismatches.
- Test coverage: Agent loop is tested (`agent.test.ts`) but individual model client content conversion is only partially tested.

**Session Store Stores Structured Content as TEXT:**
- Files: `src/session/store.ts`
- Why fragile: Session messages with structured content (tool_use, tool_result blocks) are serialized to TEXT columns in SQLite. Deserialization must exactly match the serialization format. Schema changes to message content structure require migration.
- Safe modification: Always verify round-trip serialization when changing message content types. Add integration tests for session save/restore with tool_use content.
- Test coverage: Session store has tests but they may not cover all content block variations.

**Compaction Delegation Has No Circuit Breaker:**
- Files: `src/context/compaction.ts:72-91`, `src/backends/native/orchestrator.ts:366-402`
- Why fragile: Compaction delegates to a sub-agent (fast tier) for summarization and memory extraction. If the sub-agent call fails (model error, timeout, rate limit), the error is caught and logged but the original (uncompacted) history remains. Repeated failures will cause the context to grow until it exceeds the model's context window, at which point messages will fail entirely.
- Safe modification: Add retry logic or a fallback compaction strategy (e.g., simple truncation) when delegation fails.
- Test coverage: Compaction has unit tests but no integration test for delegation failure scenarios.

## Scaling Limits

**Agent Cache (Memory):**
- Current capacity: One `AgentOrchestrator` + `OutboundAttachmentCollector` per unique `channel:senderId:agentConfig` combination.
- Limit: Memory-bound. Each agent holds full conversation history (until compacted). With 100+ active users, memory usage becomes significant.
- Scaling path: LRU eviction with configurable max entries. Persist agent state to SQLite for cold sessions.

**SQLite Session Store:**
- Current capacity: Single SQLite database for all sessions across all channels.
- Limit: SQLite write throughput (~100 writes/sec). With high-volume multi-channel deployment, write contention may occur.
- Scaling path: WAL mode is likely already enabled. For higher scale, consider per-channel databases or migration to a server database.

## Dependencies at Risk

**WhatsApp Web.js (`whatsapp-web.js`):**
- Risk: Unofficial WhatsApp client library that reverse-engineers the WhatsApp Web protocol. Can break on any WhatsApp update without warning. Heavy `as any` casting in the adapter suggests the types are incomplete.
- Impact: WhatsApp channel becomes unavailable until library is updated.
- Migration plan: No official WhatsApp Node.js SDK alternative. Consider the WhatsApp Business API (Cloud API) for a more stable integration.

## Missing Critical Features

**No Structured Logging:**
- Problem: All logging uses `console.log`, `console.error`, `console.warn` throughout the codebase. No structured logging framework (winston, pino, etc.).
- Blocks: Log aggregation, filtering by severity, JSON log parsing by monitoring tools, correlation IDs for request tracing.

**No Health Check Endpoint:**
- Problem: The gateway server has no `/health` or `/ready` endpoint for container orchestration or monitoring.
- Blocks: Kubernetes liveness/readiness probes, load balancer health checks, uptime monitoring.

## Test Coverage Gaps

**File Tools (ALL untested):**
- What's not tested: `file.read`, `file.write`, `file.edit`, `file.list` — the tools that interact with the filesystem.
- Files: `src/tools/builtin/file-read.ts`, `src/tools/builtin/file-write.ts`, `src/tools/builtin/file-edit.ts`, `src/tools/builtin/file-list.ts`
- Risk: Path handling edge cases, permission errors, large file handling, and (critically) any path restriction logic added in the future.
- Priority: High — these are security-sensitive tools.

**Memory Tools (ALL untested):**
- What's not tested: `memory.read`, `memory.write`, `memory.search` tool wrappers.
- Files: `src/tools/builtin/memory-read.ts`, `src/tools/builtin/memory-write.ts`, `src/tools/builtin/memory-search.ts`
- Risk: Namespace validation, content truncation, error handling for missing namespaces.
- Priority: Medium

**Process Tools (ALL untested):**
- What's not tested: `process.start`, `process.kill`, `process.list`, `process.output`, `process.status` — tools for managing background processes.
- Files: `src/tools/builtin/process/start.ts`, `src/tools/builtin/process/kill.ts`, `src/tools/builtin/process/list.ts`, `src/tools/builtin/process/output.ts`, `src/tools/builtin/process/status.ts`
- Risk: Process lifecycle edge cases, PID reuse, zombie processes. Note: `src/tools/builtin/process/manager.ts` IS tested.
- Priority: Medium

**Gateway Handlers (ALL untested individually):**
- What's not tested: Individual handler modules — agent, config, pairing, sessions, system, tools handlers.
- Files: `src/gateway/handlers/agent.ts`, `src/gateway/handlers/config.ts`, `src/gateway/handlers/pairing.ts`, `src/gateway/handlers/sessions.ts`, `src/gateway/handlers/system.ts`, `src/gateway/handlers/tools.ts`
- Risk: Request validation, error responses, edge cases in each handler. Note: `src/gateway/handlers/handlers.test.ts` exists and tests the composed handler set, but individual handler logic is not unit-tested.
- Priority: Low-Medium (integration tests via handlers.test.ts provide some coverage)

**MCP Client (untested):**
- What's not tested: MCP protocol client for connecting to external tool servers.
- Files: `src/mcp/client.ts`
- Risk: Connection lifecycle, protocol errors, tool discovery failures. Note: `src/mcp/bridge.test.ts` tests the bridge layer with mocked clients.
- Priority: Medium

**GitHub Models Client (untested):**
- What's not tested: GitHub Models provider with streaming and non-streaming chat.
- Files: `src/models/github.ts`
- Risk: Unguarded `JSON.parse` on tool arguments (lines 151, 268), streaming chunk assembly, error handling for API failures.
- Priority: Medium — this is a primary model provider.

---

*Concerns audit: 2025-02-09*