From 157e99ccb57e2de60c4dd5ac62cde8ef0fee765a Mon Sep 17 00:00:00 2001
From: William Valentin <william.valentin.info@gmail.com>
Date: Sun, 15 Feb 2026 21:37:04 -0800
Subject: [PATCH] docs: add consolidated codebase audit report

---
 .../2026-02-16-codebase-audit-report.md       | 419 ++++++++++++++++++
 1 file changed, 419 insertions(+)
 create mode 100644 docs/plans/analysis/2026-02-16-codebase-audit-report.md

diff --git a/docs/plans/analysis/2026-02-16-codebase-audit-report.md b/docs/plans/analysis/2026-02-16-codebase-audit-report.md
new file mode 100644
index 0000000..1bdafab
--- /dev/null
+++ b/docs/plans/analysis/2026-02-16-codebase-audit-report.md
@@ -0,0 +1,419 @@
+# Flynn Codebase Audit Report
+
+Date: 2026-02-16  
+Scope: Production-risk-first audit of bugs, code improvements, and feature opportunities.
+
+## Executive Summary
+
+Current health snapshot:
+- `pnpm typecheck`: passing
+- `pnpm build`: passing
+- `pnpm test:run`: passing (`137/137` files, `1735/1735` tests)
+- `pnpm lint`: failing (`148 errors`, `530 warnings`)
+
+Top conclusions:
+- A critical Web UI security issue exists in markdown rendering (unsanitized HTML insertion).
+- Runtime configuration edits from the settings page appear non-persistent across restart.
+- Tool timeout behavior likely allows underlying side effects to continue after timeout.
+- Gateway request-body handling and WebSocket ingress controls need abuse protections.
+- Lint quality gates are currently broken at scale, reducing CI signal quality.
+
+## Methodology and Scope
+
+Method used:
+- Static inspection of core runtime, gateway, UI, and tooling modules.
+- Command validation via:
+  - `pnpm -s typecheck`
+  - `pnpm -s build`
+  - `pnpm -s test:run`
+  - `pnpm -s lint`
+
+Scope priorities:
+1. Security and production reliability
+2. Runtime correctness and operational safety
+3. Maintainability and engineering throughput
+4. New features with measurable user/ops value
+
+Non-goals:
+- No code changes in this step
+- No architecture rewrite proposals without incremental path
+
+## Findings (Bugs and Risks)
+
+### F-001 Critical: Web UI markdown XSS surface in chat rendering
+
+- Severity: Critical
+- Impact: Arbitrary script/HTML injection in browser clients if model/tool output includes malicious markup.
+- Evidence:
+  - `src/gateway/ui/pages/chat.js:63`
+  - `src/gateway/ui/pages/chat.js:610`
+- Detail:
+  - Assistant/system content is rendered via `marked.parse(...)` then inserted using `innerHTML`.
+  - No sanitizer or trusted-types/CSP-backed cleanup is applied before insertion.
+- Repro trigger:
+  - Assistant response containing HTML payload (for example, event handlers or scriptable tags) in markdown output path.
+- Recommended fix:
+  - Introduce HTML sanitization (DOMPurify or equivalent) between markdown parse and DOM insertion.
+  - Add restrictive CSP for UI surface.
+  - Disable raw HTML passthrough in markdown parser if not required.
+- Tests to add:
+  - Unit tests for sanitization of script/event-handler payloads.
+  - Integration test confirming rendered output excludes executable content.
+
+### F-002 High: Settings hook edits are runtime-only and likely not durable
+
+- Severity: High
+- Impact: Operators may believe settings were saved permanently when they only changed in-memory state.
+- Evidence:
+  - `src/gateway/ui/pages/settings.js:148`
+  - `src/gateway/handlers/config.ts:135`
+- Detail:
+  - `config.patch` mutates the active `config` object.
+  - No file write-back or persistence layer call is present in handler path.
+- Repro trigger:
+  - Save hook patterns from UI, restart process, observe values reset to file-backed config.
+- Recommended fix:
+  - Add persisted config patch pipeline (atomic write + validation + backup).
+  - Return `persisted: true|false` in response to avoid misleading UI status.
+- Tests to add:
+  - Restart-persistence integration test for `hooks.confirm/log/silent`.
+  - Error path test for rejected file write with unchanged runtime config.
+
+### F-003 High: Tool timeout does not cancel underlying execution
+
+- Severity: High
+- Impact: Timed-out tools can continue executing side effects after caller has received timeout failure.
+- Evidence:
+  - `src/tools/executor.ts:228`
+- Detail:
+  - Timeout is implemented with `Promise.race(...)`.
+  - No cancellation token/abort path is propagated to long-running tool implementations.
+- Repro trigger:
+  - Invoke a long-running side-effecting tool, timeout occurs, side effect completes anyway.
+- Recommended fix:
+  - Add abort signal support to tool contract and propagate cancellation in executor.
+  - Enforce cancellation support for high-risk tools (`shell.exec`, `process.start`, web/browser mutating tools).
+- Tests to add:
+  - Timeout + cancellation conformance tests for cancellable tools.
+  - Regression tests verifying no post-timeout side effects.
+
+### F-004 Medium: Lint quality gate is broken and concentrated in key runtime files
+
+- Severity: Medium
+- Impact: CI noise, reduced confidence in static analysis, and slower defect detection.
+- Evidence:
+  - `pnpm -s lint` => `148 errors`, `530 warnings`
+  - Error concentration:
+    - `src/daemon/models.ts` (90 errors)
+    - `src/cli/tui.ts` (25 errors)
+    - `src/daemon/routing.ts` (14 errors)
+    - `src/gateway/ui/pages/settings.js` (14 errors)
+- Detail:
+  - Most errors are style/indentation, but gate failure blocks meaningful lint signal.
+  - High warning volume includes many `any` and non-null assertions in runtime paths.
+- Recommended fix:
+  - Two-stage lint recovery:
+    1. Clear all error-level violations in top files.
+    2. Burn down warning debt in runtime (non-test) modules by policy.
+- Tests to add:
+  - CI check enforcing `eslint` errors = 0.
+  - Secondary threshold check for warning reduction trend.
+
+### F-005 Medium: ESLint browser globals mismatch causes avoidable UI lint failures
+
+- Severity: Medium
+- Impact: False-positive lint failures and friction for frontend changes.
+- Evidence:
+  - `src/gateway/ui/pages/chat.js:127` (`FileReader` not defined in lint)
+  - `eslint.config.js` browser globals block does not include `FileReader`.
+- Recommended fix:
+  - Add missing browser globals used by UI pages, or set UI files to browser env with parser options.
+- Tests to add:
+  - Lint regression case for UI page using file attachment APIs.
+
+### F-006 High: Request body reader has no size limit (memory DoS risk)
+
+- Severity: High
+- Impact: Large POST bodies can force unbounded buffering and memory pressure/crash.
+- Evidence:
+  - `src/gateway/server.ts:446`
+- Detail:
+  - `readRequestBody` concatenates all chunks without max-size guard.
+  - Path is used by inbound handlers (webhooks/Gmail push route handling).
+- Repro trigger:
+  - Send very large request body to exposed POST endpoint.
+- Recommended fix:
+  - Add strict max body size (for example 1 MB default, configurable).
+  - Stop reading and destroy request when threshold is exceeded.
+  - Return `413 Payload Too Large`.
+- Tests to add:
+  - Body-under-limit succeeds.
+  - Body-over-limit terminates early and returns 413.
+
+### F-007 Medium: Tool timeout race leaves orphan timers
+
+- Severity: Medium
+- Impact: Pending timeouts accumulate under load, increasing event-loop overhead.
+- Evidence:
+  - `src/tools/executor.ts:228`
+- Detail:
+  - Timeout promise created in `Promise.race` is never cleared when tool resolves quickly.
+  - This is distinct from the cancellation gap in F-003.
+- Recommended fix:
+  - Store timer handle and `clearTimeout(...)` in `finally` after race settles.
+- Tests to add:
+  - Repeated short tool calls do not grow active timer count.
+
+### F-008 Medium: WhatsApp channel launches Chromium without sandbox
+
+- Severity: Medium
+- Impact: Browser compromise has higher blast radius when sandbox is disabled.
+- Evidence:
+  - `src/channels/whatsapp/adapter.ts:95`
+- Detail:
+  - Puppeteer launch args include `--no-sandbox` and `--disable-setuid-sandbox`.
+- Recommended fix:
+  - Default to sandboxed mode.
+  - Gate no-sandbox behind explicit high-trust config with warning logs.
+- Tests to add:
+  - Config test verifies secure default launch args.
+
+### F-009 Medium: Gateway WebSocket ingress lacks rate limiting
+
+- Severity: Medium
+- Impact: Abuse can cause resource starvation and model API cost spikes.
+- Evidence:
+  - `src/gateway/server.ts:291`
+- Detail:
+  - Incoming WebSocket messages are processed without per-connection throttling.
+- Recommended fix:
+  - Add token-bucket or leaky-bucket rate limit by connection/session/IP.
+  - Add backoff/close behavior for repeated violations.
+- Tests to add:
+  - Burst traffic triggers throttling and does not starve normal traffic.
+
+### F-010 Medium: Compaction audit event logs token counts in message-count fields
+
+- Severity: Medium
+- Impact: Audit accuracy and observability metrics are corrupted.
+- Evidence:
+  - `src/backends/native/orchestrator.ts:311`
+- Detail:
+  - `sessionCompact` event uses token counts where message counts are expected.
+- Recommended fix:
+  - Emit actual pre/post message counts from compaction result.
+- Tests to add:
+  - Audit event schema/value validation for compaction events.
+
+### F-011 Low: Slack user-name cache has no eviction strategy
+
+- Severity: Low
+- Impact: Slow unbounded memory growth in long-running Slack deployments.
+- Evidence:
+  - `src/channels/slack/adapter.ts:66`
+- Recommended fix:
+  - Use LRU/TTL cache with maximum entry count.
+
+### F-012 Low-Medium: Synthetic tool nudge uses invalid `tool_use_id`
+
+- Severity: Low-Medium
+- Impact: Strict providers may reject tool_result blocks that reference unknown IDs.
+- Evidence:
+  - `src/backends/native/agent.ts:347`
+- Recommended fix:
+  - Emit nudge as plain assistant text or tie to valid preceding tool-use ID.
+
+### F-013 Low: Channel adapters duplicate pairing/reset/mention/chunking logic
+
+- Severity: Low
+- Impact: Behavior drift and higher maintenance cost across channel implementations.
+- Evidence:
+  - `src/channels/discord/adapter.ts`
+  - `src/channels/slack/adapter.ts`
+  - `src/channels/whatsapp/adapter.ts`
+- Recommended fix:
+  - Extract shared middleware utilities for common inbound/outbound behaviors.
+
+### F-014 Low: ModelRouter listener API has destructive setter footgun
+
+- Severity: Low
+- Impact: `setOnTierChange` silently replaces listeners added through append API.
+- Evidence:
+  - `src/models/router.ts:69`
+- Recommended fix:
+  - Deprecate/remove setter or explicitly document destructive semantics.
+
+### F-015 Low: Retry defaults classify timeouts as non-retryable
+
+- Severity: Low
+- Impact: Reduces resilience for transient timeout failures.
+- Evidence:
+  - `src/models/retry.ts:29`
+- Recommended fix:
+  - Remove timeout markers from non-retryable defaults or make behavior configurable.
+
+### F-016 Low: Duplicate request-body utility logic in two modules
+
+- Severity: Low
+- Impact: Duplicate maintenance surface and inconsistent future behavior.
+- Evidence:
+  - `src/gateway/server.ts:446`
+  - `src/automation/webhooks.ts:17`
+- Recommended fix:
+  - Centralize in shared utility with size-limit support.
+
+### F-017 Low: NativeAgent history getter returns mixed mutability semantics
+
+- Severity: Low
+- Impact: Caller behavior differs by storage path (by-reference vs copied array).
+- Evidence:
+  - `src/backends/native/agent.ts:72`
+- Recommended fix:
+  - Return immutable copy consistently across both paths.
+
+### F-018 Low: Session pruning interval has no jitter
+
+- Severity: Low
+- Impact: Multi-instance deployments can synchronize prune spikes.
+- Evidence:
+  - `src/daemon/index.ts:91`
+- Recommended fix:
+  - Add randomized jitter window to pruning schedule.
+
+### F-019 Low: `_restoreHistory` no-op when no session is attached
+
+- Severity: Low
+- Impact: Sessionless recovery path cannot rollback conversational state.
+- Evidence:
+  - `src/backends/native/orchestrator.ts:499`
+- Recommended fix:
+  - Keep explicit in-memory rollback fallback for sessionless mode.
+
+## Code Improvements (Prioritized Backlog)
+
+| Priority | Item | Why | Effort | Dependencies | Acceptance Criteria |
+|---|---|---|---|---|---|
+| P0 | Secure markdown render pipeline in gateway UI | Closes critical XSS class | M | UI markdown stack + sanitizer lib | All assistant/tool markdown is sanitized before `innerHTML`; security tests pass |
+| P0 | Add request-body hard limits in gateway/webhooks | Prevents memory DoS on exposed POST endpoints | S | Shared body reader utility | Oversized payloads are rejected with 413 and no unbounded buffering |
+| P0 | Persistent config patch service | Prevents config drift and operator confusion | M | Config loader/writer + schema validation | `config.patch` writes atomically, survives restart, reports persistence status |
+| P1 | Tool execution cancellation contract | Prevents post-timeout side effects | L | Tool type updates + executor + tool impls | Timeout triggers cancellation and verified no continued side effects |
+| P1 | Fix timeout timer leak in ToolExecutor | Prevents event-loop timer buildup under load | S | ToolExecutor timeout refactor | Timer is always cleared when tool execution settles |
+| P1 | WebSocket ingress rate limiting | Mitigates abuse and cost/resource exhaustion | M | Gateway connection/session identity | Excess message bursts are throttled or rejected deterministically |
+| P1 | Lint gate restoration program | Recovers static-analysis trust | M | ESLint config + high-error files | `pnpm lint` returns 0 errors on main branch |
+| P2 | Reduce runtime `any`/non-null usage | Lowers hidden runtime risk | M | Typed wrappers for channel/model SDKs | Warning count reduced in non-test runtime files with no behavior regressions |
+| P2 | Harden WhatsApp browser sandbox defaults | Reduce browser compromise blast radius | M | Channel config schema + adapter launch options | Secure-by-default launch path without `--no-sandbox` |
+| P2 | Decompose oversized modules | Improves maintainability and review velocity | L | Module boundary design | `src/daemon/routing.ts`, `src/frontends/tui/minimal.ts`, `src/cli/skills.ts` split into cohesive units |
+
+## Potential New Features (Prioritized Roadmap)
+
+### R-001 Runtime Config Persistence and Versioning
+- Problem: Runtime edits are not clearly durable.
+- Feature: Versioned config store with atomic patching and rollback.
+- User value: Safe live updates with auditability.
+- Effort: Medium
+- Dependencies: Config writer, schema validator, backup policy.
+- Acceptance criteria:
+  - Every patch yields version ID.
+  - Restart restores latest committed version.
+  - Rollback endpoint reverts to previous version safely.
+
+### R-002 Tool Cancellation and Kill-Switch API
+- Problem: Timeout without cancellation can leave unsafe side effects.
+- Feature: `AbortSignal`-aware tool interface + `tool.cancel` control path.
+- User value: Safer automation under failure conditions.
+- Effort: Large
+- Dependencies: Executor, tool contracts, process/browser tools.
+- Acceptance criteria:
+  - Cancel request stops active cancellable tools within SLA.
+  - Audit events include cancel reason and final state.
+
+### R-003 Security-Hardened Markdown Renderer
+- Problem: Model output is untrusted content.
+- Feature: Sanitized markdown renderer with strict allowed-tags policy.
+- User value: Safe web chat experience by default.
+- Effort: Medium
+- Dependencies: UI renderer + CSP headers.
+- Acceptance criteria:
+  - Red-team payload test suite passes.
+  - No executable markup reaches final DOM.
+
+### R-004 Quality Health Endpoint
+- Problem: No first-class quality telemetry for lint/build/test status.
+- Feature: `system.quality` endpoint exposing latest CI/local gate results.
+- User value: Faster operator diagnosis and release confidence.
+- Effort: Small
+- Dependencies: Build metadata collection.
+- Acceptance criteria:
+  - Endpoint reports lint/typecheck/test/build state with timestamp.
+  - Dashboard view consumes and displays status.
+
+### R-005 Config Drift Indicator in Dashboard
+- Problem: Runtime and file-backed config can diverge invisibly.
+- Feature: Drift warning badge and diff preview in settings page.
+- User value: Reduces operational mistakes.
+- Effort: Medium
+- Dependencies: Persisted config support and hash/diff helper.
+- Acceptance criteria:
+  - UI shows `in-sync` vs `drifted` state.
+  - Operator can preview and reconcile diff.
+
+### R-006 Tool Safety Simulator (Dry-Run Policy Check)
+- Problem: Policy/hook changes are hard to validate before production use.
+- Feature: Simulate tool calls against policy, autonomy, and hooks without execution.
+- User value: Safer policy rollout.
+- Effort: Medium
+- Dependencies: Tool policy engine and hook decision surface.
+- Acceptance criteria:
+  - Given tool name/args/context, simulator returns allow/deny + reason chain.
+  - Simulation traces can be exported for review.
+
+### R-007 Gateway Abuse Protection Pack
+- Problem: WebSocket and POST ingress currently lack explicit anti-abuse controls.
+- Feature: Unified ingress controls (body-size limits, per-connection rate limits, adaptive penalties).
+- User value: More predictable reliability and lower abuse-induced cost.
+- Effort: Medium
+- Dependencies: Gateway server, shared request parsing, metrics.
+- Acceptance criteria:
+  - Oversized payloads rejected early.
+  - Rate violations are metered and observable.
+  - Healthy traffic retains low latency under synthetic abuse tests.
+
+## 30/60/90-Day Execution Sequence
+
+### 0-30 Days
+- Ship markdown sanitization and CSP hardening.
+- Add POST body-size limits and shared body reader.
+- Restore lint error baseline to zero in mainline.
+- Add config patch persistence contract proposal and tests.
+
+### 31-60 Days
+- Implement persisted `config.patch` with versioning and rollback.
+- Introduce cancellable tool interface and migrate high-risk tools first.
+- Implement WebSocket rate limiting and abuse observability.
+- Add dashboard drift indicator.
+
+### 61-90 Days
+- Complete cancellation coverage across remaining tools.
+- Launch quality health endpoint and dashboard integration.
+- Add policy simulation tool and operator workflow docs.
+
+## Appendix: Command Evidence
+
+Commands run:
+
+```bash
+pnpm -s typecheck
+pnpm -s build
+pnpm -s test:run
+pnpm -s lint
+```
+
+Observed outcomes:
+- Typecheck/build/test: passing.
+- Lint: failing with `148 errors` and `530 warnings`.
+
+Top lint error concentration snapshot:
+- `src/daemon/models.ts`: 90 errors
+- `src/cli/tui.ts`: 25 errors
+- `src/daemon/routing.ts`: 14 errors
+- `src/gateway/ui/pages/settings.js`: 14 errors