From 157e99ccb57e2de60c4dd5ac62cde8ef0fee765a Mon Sep 17 00:00:00 2001 From: William Valentin Date: Sun, 15 Feb 2026 21:37:04 -0800 Subject: [PATCH] docs: add consolidated codebase audit report --- .../2026-02-16-codebase-audit-report.md | 419 ++++++++++++++++++ 1 file changed, 419 insertions(+) create mode 100644 docs/plans/analysis/2026-02-16-codebase-audit-report.md diff --git a/docs/plans/analysis/2026-02-16-codebase-audit-report.md b/docs/plans/analysis/2026-02-16-codebase-audit-report.md new file mode 100644 index 0000000..1bdafab --- /dev/null +++ b/docs/plans/analysis/2026-02-16-codebase-audit-report.md @@ -0,0 +1,419 @@ +# Flynn Codebase Audit Report + +Date: 2026-02-16 +Scope: Production-risk-first audit of bugs, code improvements, and feature opportunities. + +## Executive Summary + +Current health snapshot: +- `pnpm typecheck`: passing +- `pnpm build`: passing +- `pnpm test:run`: passing (`137/137` files, `1735/1735` tests) +- `pnpm lint`: failing (`148 errors`, `530 warnings`) + +Top conclusions: +- A critical Web UI security issue exists in markdown rendering (unsanitized HTML insertion). +- Runtime configuration edits from the settings page appear non-persistent across restart. +- Tool timeout behavior likely allows underlying side effects to continue after timeout. +- Gateway request-body handling and WebSocket ingress controls need abuse protections. +- Lint quality gates are currently broken at scale, reducing CI signal quality. + +## Methodology and Scope + +Method used: +- Static inspection of core runtime, gateway, UI, and tooling modules. +- Command validation via: + - `pnpm -s typecheck` + - `pnpm -s build` + - `pnpm -s test:run` + - `pnpm -s lint` + +Scope priorities: +1. Security and production reliability +2. Runtime correctness and operational safety +3. Maintainability and engineering throughput +4. New features with measurable user/ops value + +Non-goals: +- No code changes in this step +- No architecture rewrite proposals without incremental path + +## Findings (Bugs and Risks) + +### F-001 Critical: Web UI markdown XSS surface in chat rendering + +- Severity: Critical +- Impact: Arbitrary script/HTML injection in browser clients if model/tool output includes malicious markup. +- Evidence: + - `src/gateway/ui/pages/chat.js:63` + - `src/gateway/ui/pages/chat.js:610` +- Detail: + - Assistant/system content is rendered via `marked.parse(...)` then inserted using `innerHTML`. + - No sanitizer or trusted-types/CSP-backed cleanup is applied before insertion. +- Repro trigger: + - Assistant response containing HTML payload (for example, event handlers or scriptable tags) in markdown output path. +- Recommended fix: + - Introduce HTML sanitization (DOMPurify or equivalent) between markdown parse and DOM insertion. + - Add restrictive CSP for UI surface. + - Disable raw HTML passthrough in markdown parser if not required. +- Tests to add: + - Unit tests for sanitization of script/event-handler payloads. + - Integration test confirming rendered output excludes executable content. + +### F-002 High: Settings hook edits are runtime-only and likely not durable + +- Severity: High +- Impact: Operators may believe settings were saved permanently when they only changed in-memory state. +- Evidence: + - `src/gateway/ui/pages/settings.js:148` + - `src/gateway/handlers/config.ts:135` +- Detail: + - `config.patch` mutates the active `config` object. + - No file write-back or persistence layer call is present in handler path. +- Repro trigger: + - Save hook patterns from UI, restart process, observe values reset to file-backed config. +- Recommended fix: + - Add persisted config patch pipeline (atomic write + validation + backup). + - Return `persisted: true|false` in response to avoid misleading UI status. +- Tests to add: + - Restart-persistence integration test for `hooks.confirm/log/silent`. + - Error path test for rejected file write with unchanged runtime config. + +### F-003 High: Tool timeout does not cancel underlying execution + +- Severity: High +- Impact: Timed-out tools can continue executing side effects after caller has received timeout failure. +- Evidence: + - `src/tools/executor.ts:228` +- Detail: + - Timeout is implemented with `Promise.race(...)`. + - No cancellation token/abort path is propagated to long-running tool implementations. +- Repro trigger: + - Invoke a long-running side-effecting tool, timeout occurs, side effect completes anyway. +- Recommended fix: + - Add abort signal support to tool contract and propagate cancellation in executor. + - Enforce cancellation support for high-risk tools (`shell.exec`, `process.start`, web/browser mutating tools). +- Tests to add: + - Timeout + cancellation conformance tests for cancellable tools. + - Regression tests verifying no post-timeout side effects. + +### F-004 Medium: Lint quality gate is broken and concentrated in key runtime files + +- Severity: Medium +- Impact: CI noise, reduced confidence in static analysis, and slower defect detection. +- Evidence: + - `pnpm -s lint` => `148 errors`, `530 warnings` + - Error concentration: + - `src/daemon/models.ts` (90 errors) + - `src/cli/tui.ts` (25 errors) + - `src/daemon/routing.ts` (14 errors) + - `src/gateway/ui/pages/settings.js` (14 errors) +- Detail: + - Most errors are style/indentation, but gate failure blocks meaningful lint signal. + - High warning volume includes many `any` and non-null assertions in runtime paths. +- Recommended fix: + - Two-stage lint recovery: + 1. Clear all error-level violations in top files. + 2. Burn down warning debt in runtime (non-test) modules by policy. +- Tests to add: + - CI check enforcing `eslint` errors = 0. + - Secondary threshold check for warning reduction trend. + +### F-005 Medium: ESLint browser globals mismatch causes avoidable UI lint failures + +- Severity: Medium +- Impact: False-positive lint failures and friction for frontend changes. +- Evidence: + - `src/gateway/ui/pages/chat.js:127` (`FileReader` not defined in lint) + - `eslint.config.js` browser globals block does not include `FileReader`. +- Recommended fix: + - Add missing browser globals used by UI pages, or set UI files to browser env with parser options. +- Tests to add: + - Lint regression case for UI page using file attachment APIs. + +### F-006 High: Request body reader has no size limit (memory DoS risk) + +- Severity: High +- Impact: Large POST bodies can force unbounded buffering and memory pressure/crash. +- Evidence: + - `src/gateway/server.ts:446` +- Detail: + - `readRequestBody` concatenates all chunks without max-size guard. + - Path is used by inbound handlers (webhooks/Gmail push route handling). +- Repro trigger: + - Send very large request body to exposed POST endpoint. +- Recommended fix: + - Add strict max body size (for example 1 MB default, configurable). + - Stop reading and destroy request when threshold is exceeded. + - Return `413 Payload Too Large`. +- Tests to add: + - Body-under-limit succeeds. + - Body-over-limit terminates early and returns 413. + +### F-007 Medium: Tool timeout race leaves orphan timers + +- Severity: Medium +- Impact: Pending timeouts accumulate under load, increasing event-loop overhead. +- Evidence: + - `src/tools/executor.ts:228` +- Detail: + - Timeout promise created in `Promise.race` is never cleared when tool resolves quickly. + - This is distinct from the cancellation gap in F-003. +- Recommended fix: + - Store timer handle and `clearTimeout(...)` in `finally` after race settles. +- Tests to add: + - Repeated short tool calls do not grow active timer count. + +### F-008 Medium: WhatsApp channel launches Chromium without sandbox + +- Severity: Medium +- Impact: Browser compromise has higher blast radius when sandbox is disabled. +- Evidence: + - `src/channels/whatsapp/adapter.ts:95` +- Detail: + - Puppeteer launch args include `--no-sandbox` and `--disable-setuid-sandbox`. +- Recommended fix: + - Default to sandboxed mode. + - Gate no-sandbox behind explicit high-trust config with warning logs. +- Tests to add: + - Config test verifies secure default launch args. + +### F-009 Medium: Gateway WebSocket ingress lacks rate limiting + +- Severity: Medium +- Impact: Abuse can cause resource starvation and model API cost spikes. +- Evidence: + - `src/gateway/server.ts:291` +- Detail: + - Incoming WebSocket messages are processed without per-connection throttling. +- Recommended fix: + - Add token-bucket or leaky-bucket rate limit by connection/session/IP. + - Add backoff/close behavior for repeated violations. +- Tests to add: + - Burst traffic triggers throttling and does not starve normal traffic. + +### F-010 Medium: Compaction audit event logs token counts in message-count fields + +- Severity: Medium +- Impact: Audit accuracy and observability metrics are corrupted. +- Evidence: + - `src/backends/native/orchestrator.ts:311` +- Detail: + - `sessionCompact` event uses token counts where message counts are expected. +- Recommended fix: + - Emit actual pre/post message counts from compaction result. +- Tests to add: + - Audit event schema/value validation for compaction events. + +### F-011 Low: Slack user-name cache has no eviction strategy + +- Severity: Low +- Impact: Slow unbounded memory growth in long-running Slack deployments. +- Evidence: + - `src/channels/slack/adapter.ts:66` +- Recommended fix: + - Use LRU/TTL cache with maximum entry count. + +### F-012 Low-Medium: Synthetic tool nudge uses invalid `tool_use_id` + +- Severity: Low-Medium +- Impact: Strict providers may reject tool_result blocks that reference unknown IDs. +- Evidence: + - `src/backends/native/agent.ts:347` +- Recommended fix: + - Emit nudge as plain assistant text or tie to valid preceding tool-use ID. + +### F-013 Low: Channel adapters duplicate pairing/reset/mention/chunking logic + +- Severity: Low +- Impact: Behavior drift and higher maintenance cost across channel implementations. +- Evidence: + - `src/channels/discord/adapter.ts` + - `src/channels/slack/adapter.ts` + - `src/channels/whatsapp/adapter.ts` +- Recommended fix: + - Extract shared middleware utilities for common inbound/outbound behaviors. + +### F-014 Low: ModelRouter listener API has destructive setter footgun + +- Severity: Low +- Impact: `setOnTierChange` silently replaces listeners added through append API. +- Evidence: + - `src/models/router.ts:69` +- Recommended fix: + - Deprecate/remove setter or explicitly document destructive semantics. + +### F-015 Low: Retry defaults classify timeouts as non-retryable + +- Severity: Low +- Impact: Reduces resilience for transient timeout failures. +- Evidence: + - `src/models/retry.ts:29` +- Recommended fix: + - Remove timeout markers from non-retryable defaults or make behavior configurable. + +### F-016 Low: Duplicate request-body utility logic in two modules + +- Severity: Low +- Impact: Duplicate maintenance surface and inconsistent future behavior. +- Evidence: + - `src/gateway/server.ts:446` + - `src/automation/webhooks.ts:17` +- Recommended fix: + - Centralize in shared utility with size-limit support. + +### F-017 Low: NativeAgent history getter returns mixed mutability semantics + +- Severity: Low +- Impact: Caller behavior differs by storage path (by-reference vs copied array). +- Evidence: + - `src/backends/native/agent.ts:72` +- Recommended fix: + - Return immutable copy consistently across both paths. + +### F-018 Low: Session pruning interval has no jitter + +- Severity: Low +- Impact: Multi-instance deployments can synchronize prune spikes. +- Evidence: + - `src/daemon/index.ts:91` +- Recommended fix: + - Add randomized jitter window to pruning schedule. + +### F-019 Low: `_restoreHistory` no-op when no session is attached + +- Severity: Low +- Impact: Sessionless recovery path cannot rollback conversational state. +- Evidence: + - `src/backends/native/orchestrator.ts:499` +- Recommended fix: + - Keep explicit in-memory rollback fallback for sessionless mode. + +## Code Improvements (Prioritized Backlog) + +| Priority | Item | Why | Effort | Dependencies | Acceptance Criteria | +|---|---|---|---|---|---| +| P0 | Secure markdown render pipeline in gateway UI | Closes critical XSS class | M | UI markdown stack + sanitizer lib | All assistant/tool markdown is sanitized before `innerHTML`; security tests pass | +| P0 | Add request-body hard limits in gateway/webhooks | Prevents memory DoS on exposed POST endpoints | S | Shared body reader utility | Oversized payloads are rejected with 413 and no unbounded buffering | +| P0 | Persistent config patch service | Prevents config drift and operator confusion | M | Config loader/writer + schema validation | `config.patch` writes atomically, survives restart, reports persistence status | +| P1 | Tool execution cancellation contract | Prevents post-timeout side effects | L | Tool type updates + executor + tool impls | Timeout triggers cancellation and verified no continued side effects | +| P1 | Fix timeout timer leak in ToolExecutor | Prevents event-loop timer buildup under load | S | ToolExecutor timeout refactor | Timer is always cleared when tool execution settles | +| P1 | WebSocket ingress rate limiting | Mitigates abuse and cost/resource exhaustion | M | Gateway connection/session identity | Excess message bursts are throttled or rejected deterministically | +| P1 | Lint gate restoration program | Recovers static-analysis trust | M | ESLint config + high-error files | `pnpm lint` returns 0 errors on main branch | +| P2 | Reduce runtime `any`/non-null usage | Lowers hidden runtime risk | M | Typed wrappers for channel/model SDKs | Warning count reduced in non-test runtime files with no behavior regressions | +| P2 | Harden WhatsApp browser sandbox defaults | Reduce browser compromise blast radius | M | Channel config schema + adapter launch options | Secure-by-default launch path without `--no-sandbox` | +| P2 | Decompose oversized modules | Improves maintainability and review velocity | L | Module boundary design | `src/daemon/routing.ts`, `src/frontends/tui/minimal.ts`, `src/cli/skills.ts` split into cohesive units | + +## Potential New Features (Prioritized Roadmap) + +### R-001 Runtime Config Persistence and Versioning +- Problem: Runtime edits are not clearly durable. +- Feature: Versioned config store with atomic patching and rollback. +- User value: Safe live updates with auditability. +- Effort: Medium +- Dependencies: Config writer, schema validator, backup policy. +- Acceptance criteria: + - Every patch yields version ID. + - Restart restores latest committed version. + - Rollback endpoint reverts to previous version safely. + +### R-002 Tool Cancellation and Kill-Switch API +- Problem: Timeout without cancellation can leave unsafe side effects. +- Feature: `AbortSignal`-aware tool interface + `tool.cancel` control path. +- User value: Safer automation under failure conditions. +- Effort: Large +- Dependencies: Executor, tool contracts, process/browser tools. +- Acceptance criteria: + - Cancel request stops active cancellable tools within SLA. + - Audit events include cancel reason and final state. + +### R-003 Security-Hardened Markdown Renderer +- Problem: Model output is untrusted content. +- Feature: Sanitized markdown renderer with strict allowed-tags policy. +- User value: Safe web chat experience by default. +- Effort: Medium +- Dependencies: UI renderer + CSP headers. +- Acceptance criteria: + - Red-team payload test suite passes. + - No executable markup reaches final DOM. + +### R-004 Quality Health Endpoint +- Problem: No first-class quality telemetry for lint/build/test status. +- Feature: `system.quality` endpoint exposing latest CI/local gate results. +- User value: Faster operator diagnosis and release confidence. +- Effort: Small +- Dependencies: Build metadata collection. +- Acceptance criteria: + - Endpoint reports lint/typecheck/test/build state with timestamp. + - Dashboard view consumes and displays status. + +### R-005 Config Drift Indicator in Dashboard +- Problem: Runtime and file-backed config can diverge invisibly. +- Feature: Drift warning badge and diff preview in settings page. +- User value: Reduces operational mistakes. +- Effort: Medium +- Dependencies: Persisted config support and hash/diff helper. +- Acceptance criteria: + - UI shows `in-sync` vs `drifted` state. + - Operator can preview and reconcile diff. + +### R-006 Tool Safety Simulator (Dry-Run Policy Check) +- Problem: Policy/hook changes are hard to validate before production use. +- Feature: Simulate tool calls against policy, autonomy, and hooks without execution. +- User value: Safer policy rollout. +- Effort: Medium +- Dependencies: Tool policy engine and hook decision surface. +- Acceptance criteria: + - Given tool name/args/context, simulator returns allow/deny + reason chain. + - Simulation traces can be exported for review. + +### R-007 Gateway Abuse Protection Pack +- Problem: WebSocket and POST ingress currently lack explicit anti-abuse controls. +- Feature: Unified ingress controls (body-size limits, per-connection rate limits, adaptive penalties). +- User value: More predictable reliability and lower abuse-induced cost. +- Effort: Medium +- Dependencies: Gateway server, shared request parsing, metrics. +- Acceptance criteria: + - Oversized payloads rejected early. + - Rate violations are metered and observable. + - Healthy traffic retains low latency under synthetic abuse tests. + +## 30/60/90-Day Execution Sequence + +### 0-30 Days +- Ship markdown sanitization and CSP hardening. +- Add POST body-size limits and shared body reader. +- Restore lint error baseline to zero in mainline. +- Add config patch persistence contract proposal and tests. + +### 31-60 Days +- Implement persisted `config.patch` with versioning and rollback. +- Introduce cancellable tool interface and migrate high-risk tools first. +- Implement WebSocket rate limiting and abuse observability. +- Add dashboard drift indicator. + +### 61-90 Days +- Complete cancellation coverage across remaining tools. +- Launch quality health endpoint and dashboard integration. +- Add policy simulation tool and operator workflow docs. + +## Appendix: Command Evidence + +Commands run: + +```bash +pnpm -s typecheck +pnpm -s build +pnpm -s test:run +pnpm -s lint +``` + +Observed outcomes: +- Typecheck/build/test: passing. +- Lint: failing with `148 errors` and `530 warnings`. + +Top lint error concentration snapshot: +- `src/daemon/models.ts`: 90 errors +- `src/cli/tui.ts`: 25 errors +- `src/daemon/routing.ts`: 14 errors +- `src/gateway/ui/pages/settings.js`: 14 errors