will/flynn

Fork 0

Files

T

William Valentin e16c0bc2c7 audit follow-up: reduce warning hotspots in automation and gateway tests

2026-02-15 22:49:45 -08:00

22 KiB

Raw Blame History

Flynn Codebase Audit Report

Date: 2026-02-16
Scope: Production-risk-first audit of bugs, code improvements, and feature opportunities.

Remediation Status (2026-02-16)

✅ F-001 addressed: chat markdown rendering now sanitizes HTML before DOM insertion in src/gateway/ui/pages/chat.js (and legacy src/gateway/ui/chat.html).
✅ F-006 addressed: inbound HTTP request bodies now enforce a configurable max-size limit (server.max_request_body_bytes) with 413 Payload Too Large responses.
✅ F-007 addressed: ToolExecutor timeout timer handles are now cleared in finally, preventing orphan timers on fast/failed tool calls.
✅ F-016 addressed: gateway + webhook body readers are consolidated in shared utility src/utils/httpBody.ts with size-limit enforcement.
✅ F-005 addressed: ESLint JS globals now include FileReader, removing UI false-positive lint failures for attachment handling code.
✅ F-010 addressed: session.compact audit events now emit actual message counts for messages_before/messages_after (tokens remain in token fields).
✅ F-012 addressed: synthetic repeated-tool nudge no longer emits invalid tool_result.tool_use_id; nudge is injected as plain user text guidance.
✅ F-009 addressed: gateway now enforces per-connection WebSocket ingress rate limits with deterministic throttle errors and close-on-repeated-violation behavior.
✅ F-008 addressed: WhatsApp Chromium launch is now sandboxed by default; no-sandbox mode is behind explicit whatsapp.no_sandbox: true opt-in.
✅ F-014 addressed: ModelRouter.setOnTierChange now preserves existing listeners instead of replacing them, removing destructive listener-setter behavior.
✅ F-002 addressed: config.patch now supports durable persistence via atomic write + backup when daemon has a concrete config path, and response includes persisted/persistError so UI can distinguish runtime-only vs disk-persisted updates.
✅ F-003 addressed: tool execution now has an AbortSignal contract, executor triggers abort on timeout, high-risk tools (shell.exec, sandbox docker exec, process.start, browser tools, web.fetch, web.search) respond to cancellation, and executor regression tests verify cancellable tools do not apply side effects after timeout.
✅ F-015 addressed: retry defaults no longer classify timeout-style failures as non-retryable, improving resilience for transient timeout conditions.
✅ F-011 addressed: Slack user-name resolution now uses bounded TTL+LRU caching to prevent unbounded growth.
✅ F-013 addressed: shared channel utilities now cover reset normalization/building plus reusable allowlist, mention-gating, and pairing-gating flows across Discord/Slack/WhatsApp adapters.
◑ F-004 partially addressed: lint error baseline is restored (pnpm lint now passes with 0 errors) and warning burn-down has progressed from 466 to 247; additional warning debt remains.

Executive Summary

Current health snapshot:

pnpm typecheck: passing
pnpm build: passing
pnpm test:run: passing (140/140 files, 1773/1773 tests)
pnpm lint: passing with warnings only (0 errors, 247 warnings)

Top conclusions:

A critical Web UI security issue exists in markdown rendering (unsanitized HTML insertion).
Runtime configuration edits from the settings page appear non-persistent across restart.
Tool timeout behavior likely allows underlying side effects to continue after timeout.
Gateway request-body handling and WebSocket ingress controls need abuse protections.
Lint error-level gate is restored, and warning debt is trending down but still high.

Methodology and Scope

Method used:

Static inspection of core runtime, gateway, UI, and tooling modules.
Command validation via:
- pnpm -s typecheck
- pnpm -s build
- pnpm -s test:run
- pnpm -s lint

Scope priorities:

Security and production reliability
Runtime correctness and operational safety
Maintainability and engineering throughput
New features with measurable user/ops value

Non-goals:

No code changes in this step
No architecture rewrite proposals without incremental path

Findings (Bugs and Risks)

F-001 Critical: Web UI markdown XSS surface in chat rendering

Severity: Critical
Impact: Arbitrary script/HTML injection in browser clients if model/tool output includes malicious markup.
Evidence:
- src/gateway/ui/pages/chat.js:63
- src/gateway/ui/pages/chat.js:610
Detail:
- Assistant/system content is rendered via marked.parse(...) then inserted using innerHTML.
- No sanitizer or trusted-types/CSP-backed cleanup is applied before insertion.
Repro trigger:
- Assistant response containing HTML payload (for example, event handlers or scriptable tags) in markdown output path.
Recommended fix:
- Introduce HTML sanitization (DOMPurify or equivalent) between markdown parse and DOM insertion.
- Add restrictive CSP for UI surface.
- Disable raw HTML passthrough in markdown parser if not required.
Tests to add:
- Unit tests for sanitization of script/event-handler payloads.
- Integration test confirming rendered output excludes executable content.

F-002 High: Settings hook edits are runtime-only and likely not durable

Severity: High
Impact: Operators may believe settings were saved permanently when they only changed in-memory state.
Evidence:
- src/gateway/ui/pages/settings.js:148
- src/gateway/handlers/config.ts:135
Detail:
- config.patch mutates the active config object.
- No file write-back or persistence layer call is present in handler path.
Repro trigger:
- Save hook patterns from UI, restart process, observe values reset to file-backed config.
Recommended fix:
- Add persisted config patch pipeline (atomic write + validation + backup).
- Return persisted: true|false in response to avoid misleading UI status.
Tests to add:
- Restart-persistence integration test for hooks.confirm/log/silent.
- Error path test for rejected file write with unchanged runtime config.

F-003 High: Tool timeout does not cancel underlying execution

Severity: High
Impact: Timed-out tools can continue executing side effects after caller has received timeout failure.
Evidence:
- src/tools/executor.ts:228
Detail:
- Timeout is implemented with Promise.race(...).
- No cancellation token/abort path is propagated to long-running tool implementations.
Repro trigger:
- Invoke a long-running side-effecting tool, timeout occurs, side effect completes anyway.
Recommended fix:
- Add abort signal support to tool contract and propagate cancellation in executor.
- Enforce cancellation support for high-risk tools (shell.exec, process.start, web/browser mutating tools).
Tests to add:
- Timeout + cancellation conformance tests for cancellable tools.
- Regression tests verifying no post-timeout side effects.

Remediation update (2026-02-16):

Abort propagation now covers executor -> context signal -> process.start, browser tools, and web fetch/search tools.
Added aborted-signal regression tests for these tool paths.
Added timeout regression coverage in src/tools/executor.test.ts to verify side effects are prevented after timeout-triggered abort for cancellable tools.

F-004 Medium: Lint quality gate is broken and concentrated in key runtime files

Severity: Medium
Impact: CI noise, reduced confidence in static analysis, and slower defect detection.
Evidence:
- pnpm -s lint => 0 errors, 247 warnings
- Error concentration:
  - src/daemon/models.ts (90 errors)
  - src/cli/tui.ts (25 errors)
  - src/daemon/routing.ts (14 errors)
  - src/gateway/ui/pages/settings.js (14 errors)
Detail:
- Most errors are style/indentation, but gate failure blocks meaningful lint signal.
- High warning volume includes many any and non-null assertions in runtime paths.
Recommended fix:
- Two-stage lint recovery:
  1. Clear all error-level violations in top files.
  2. Burn down warning debt in runtime (non-test) modules by policy.
Tests to add:
- CI check enforcing eslint errors = 0.
- Secondary threshold check for warning reduction trend.

Remediation update (2026-02-16):

Stage 1 complete: fixed all error-level ESLint violations in impacted high-error files so pnpm lint now passes with 0 errors.
Stage 2 in progress: warning-burn-down reduced to 247 warnings via targeted hotspot cleanup in:
- src/gateway/handlers/handlers.test.ts
- src/daemon/routing.test.ts
- src/frontends/tui/minimal.test.ts
- src/gateway/tailscale.test.ts
- src/automation/webhooks.test.ts
- src/automation/cron.test.ts
- src/automation/heartbeat.test.ts
- src/frontends/tui/minimal.login.test.ts

F-005 Medium: ESLint browser globals mismatch causes avoidable UI lint failures

Severity: Medium
Impact: False-positive lint failures and friction for frontend changes.
Evidence:
- src/gateway/ui/pages/chat.js:127 (FileReader not defined in lint)
- eslint.config.js browser globals block does not include FileReader.
Recommended fix:
- Add missing browser globals used by UI pages, or set UI files to browser env with parser options.
Tests to add:
- Lint regression case for UI page using file attachment APIs.

F-006 High: Request body reader has no size limit (memory DoS risk)

Severity: High
Impact: Large POST bodies can force unbounded buffering and memory pressure/crash.
Evidence:
- src/gateway/server.ts:446
Detail:
- readRequestBody concatenates all chunks without max-size guard.
- Path is used by inbound handlers (webhooks/Gmail push route handling).
Repro trigger:
- Send very large request body to exposed POST endpoint.
Recommended fix:
- Add strict max body size (for example 1 MB default, configurable).
- Stop reading and destroy request when threshold is exceeded.
- Return 413 Payload Too Large.
Tests to add:
- Body-under-limit succeeds.
- Body-over-limit terminates early and returns 413.

F-007 Medium: Tool timeout race leaves orphan timers

Severity: Medium
Impact: Pending timeouts accumulate under load, increasing event-loop overhead.
Evidence:
- src/tools/executor.ts:228
Detail:
- Timeout promise created in Promise.race is never cleared when tool resolves quickly.
- This is distinct from the cancellation gap in F-003.
Recommended fix:
- Store timer handle and clearTimeout(...) in finally after race settles.
Tests to add:
- Repeated short tool calls do not grow active timer count.

F-008 Medium: WhatsApp channel launches Chromium without sandbox

Severity: Medium
Impact: Browser compromise has higher blast radius when sandbox is disabled.
Evidence:
- src/channels/whatsapp/adapter.ts:95
Detail:
- Puppeteer launch args include --no-sandbox and --disable-setuid-sandbox.
Recommended fix:
- Default to sandboxed mode.
- Gate no-sandbox behind explicit high-trust config with warning logs.
Tests to add:
- Config test verifies secure default launch args.

F-009 Medium: Gateway WebSocket ingress lacks rate limiting

Severity: Medium
Impact: Abuse can cause resource starvation and model API cost spikes.
Evidence:
- src/gateway/server.ts:291
Detail:
- Incoming WebSocket messages are processed without per-connection throttling.
Recommended fix:
- Add token-bucket or leaky-bucket rate limit by connection/session/IP.
- Add backoff/close behavior for repeated violations.
Tests to add:
- Burst traffic triggers throttling and does not starve normal traffic.

F-010 Medium: Compaction audit event logs token counts in message-count fields

Severity: Medium
Impact: Audit accuracy and observability metrics are corrupted.
Evidence:
- src/backends/native/orchestrator.ts:311
Detail:
- sessionCompact event uses token counts where message counts are expected.
Recommended fix:
- Emit actual pre/post message counts from compaction result.
Tests to add:
- Audit event schema/value validation for compaction events.

F-011 Low: Slack user-name cache has no eviction strategy

Severity: Low
Impact: Slow unbounded memory growth in long-running Slack deployments.
Evidence:
- src/channels/slack/adapter.ts:66
Recommended fix:
- Use LRU/TTL cache with maximum entry count.

Remediation update (2026-02-16):

SlackAdapter user-name cache now has TTL expiry and max-entry LRU eviction behavior.
Added cache regression tests for cache-hit reuse and TTL refresh in src/channels/slack/adapter.test.ts.

F-012 Low-Medium: Synthetic tool nudge uses invalid `tool_use_id`

Severity: Low-Medium
Impact: Strict providers may reject tool_result blocks that reference unknown IDs.
Evidence:
- src/backends/native/agent.ts:347
Recommended fix:
- Emit nudge as plain assistant text or tie to valid preceding tool-use ID.

F-013 Low: Channel adapters duplicate pairing/reset/mention/chunking logic

Severity: Low
Impact: Behavior drift and higher maintenance cost across channel implementations.
Evidence:
- src/channels/discord/adapter.ts
- src/channels/slack/adapter.ts
- src/channels/whatsapp/adapter.ts
Recommended fix:
- Extract shared middleware utilities for common inbound/outbound behaviors.

Remediation update (2026-02-16):

Added shared normalizeResetCommandText() + buildResetInboundMessage() utilities and migrated Discord/Slack/WhatsApp adapters to use them.
Added shared isAllowedByAllowlist(), shouldIgnoreForMissingMention(), and allowTrustedOrPairedSender() channel utilities.
Migrated Discord/Slack/WhatsApp adapters to use shared allowlist, mention-gating, and pairing-gating flows with adapter-specific transport hooks.

F-014 Low: ModelRouter listener API has destructive setter footgun

Severity: Low
Impact: setOnTierChange silently replaces listeners added through append API.
Evidence:
- src/models/router.ts:69
Recommended fix:
- Deprecate/remove setter or explicitly document destructive semantics.

F-015 Low: Retry defaults classify timeouts as non-retryable

Severity: Low
Impact: Reduces resilience for transient timeout failures.
Evidence:
- src/models/retry.ts:29
Recommended fix:
- Remove timeout markers from non-retryable defaults or make behavior configurable.

F-016 Low: Duplicate request-body utility logic in two modules

Severity: Low
Impact: Duplicate maintenance surface and inconsistent future behavior.
Evidence:
- src/gateway/server.ts:446
- src/automation/webhooks.ts:17
Recommended fix:
- Centralize in shared utility with size-limit support.

Remediation update (2026-02-16):

Request body parsing is centralized in src/utils/httpBody.ts and consumed by both gateway server and webhook handler paths.

F-017 Low: NativeAgent history getter returns mixed mutability semantics

Severity: Low
Impact: Caller behavior differs by storage path (by-reference vs copied array).
Evidence:
- src/backends/native/agent.ts:72
Recommended fix:
- Return immutable copy consistently across both paths.

F-018 Low: Session pruning interval has no jitter

Severity: Low
Impact: Multi-instance deployments can synchronize prune spikes.
Evidence:
- src/daemon/index.ts:91
Recommended fix:
- Add randomized jitter window to pruning schedule.

F-019 Low: `_restoreHistory` no-op when no session is attached

Severity: Low
Impact: Sessionless recovery path cannot rollback conversational state.
Evidence:
- src/backends/native/orchestrator.ts:499
Recommended fix:
- Keep explicit in-memory rollback fallback for sessionless mode.

Code Improvements (Prioritized Backlog)

Priority	Item	Why	Effort	Dependencies	Acceptance Criteria
P0	Secure markdown render pipeline in gateway UI	Closes critical XSS class	M	UI markdown stack + sanitizer lib	All assistant/tool markdown is sanitized before `innerHTML`; security tests pass
P0	Add request-body hard limits in gateway/webhooks	Prevents memory DoS on exposed POST endpoints	S	Shared body reader utility	Oversized payloads are rejected with 413 and no unbounded buffering
P0	Persistent config patch service	Prevents config drift and operator confusion	M	Config loader/writer + schema validation	`config.patch` writes atomically, survives restart, reports persistence status
P1	Tool execution cancellation contract	Prevents post-timeout side effects	L	Tool type updates + executor + tool impls	Timeout triggers cancellation and verified no continued side effects
P1	Fix timeout timer leak in ToolExecutor	Prevents event-loop timer buildup under load	S	ToolExecutor timeout refactor	Timer is always cleared when tool execution settles
P1	WebSocket ingress rate limiting	Mitigates abuse and cost/resource exhaustion	M	Gateway connection/session identity	Excess message bursts are throttled or rejected deterministically
P1	Lint gate restoration program	Recovers static-analysis trust	M	ESLint config + high-error files	`pnpm lint` returns 0 errors on main branch
P2	Reduce runtime `any`/non-null usage	Lowers hidden runtime risk	M	Typed wrappers for channel/model SDKs	Warning count reduced in non-test runtime files with no behavior regressions
P2	Harden WhatsApp browser sandbox defaults	Reduce browser compromise blast radius	M	Channel config schema + adapter launch options	Secure-by-default launch path without `--no-sandbox`
P2	Decompose oversized modules	Improves maintainability and review velocity	L	Module boundary design	`src/daemon/routing.ts`, `src/frontends/tui/minimal.ts`, `src/cli/skills.ts` split into cohesive units

Potential New Features (Prioritized Roadmap)

R-001 Runtime Config Persistence and Versioning

Problem: Runtime edits are not clearly durable.
Feature: Versioned config store with atomic patching and rollback.
User value: Safe live updates with auditability.
Effort: Medium
Dependencies: Config writer, schema validator, backup policy.
Acceptance criteria:
- Every patch yields version ID.
- Restart restores latest committed version.
- Rollback endpoint reverts to previous version safely.

R-002 Tool Cancellation and Kill-Switch API

Problem: Timeout without cancellation can leave unsafe side effects.
Feature: AbortSignal-aware tool interface + tool.cancel control path.
User value: Safer automation under failure conditions.
Effort: Large
Dependencies: Executor, tool contracts, process/browser tools.
Acceptance criteria:
- Cancel request stops active cancellable tools within SLA.
- Audit events include cancel reason and final state.

R-003 Security-Hardened Markdown Renderer

Problem: Model output is untrusted content.
Feature: Sanitized markdown renderer with strict allowed-tags policy.
User value: Safe web chat experience by default.
Effort: Medium
Dependencies: UI renderer + CSP headers.
Acceptance criteria:
- Red-team payload test suite passes.
- No executable markup reaches final DOM.

R-004 Quality Health Endpoint

Problem: No first-class quality telemetry for lint/build/test status.
Feature: system.quality endpoint exposing latest CI/local gate results.
User value: Faster operator diagnosis and release confidence.
Effort: Small
Dependencies: Build metadata collection.
Acceptance criteria:
- Endpoint reports lint/typecheck/test/build state with timestamp.
- Dashboard view consumes and displays status.

R-005 Config Drift Indicator in Dashboard

Problem: Runtime and file-backed config can diverge invisibly.
Feature: Drift warning badge and diff preview in settings page.
User value: Reduces operational mistakes.
Effort: Medium
Dependencies: Persisted config support and hash/diff helper.
Acceptance criteria:
- UI shows in-sync vs drifted state.
- Operator can preview and reconcile diff.

R-006 Tool Safety Simulator (Dry-Run Policy Check)

Problem: Policy/hook changes are hard to validate before production use.
Feature: Simulate tool calls against policy, autonomy, and hooks without execution.
User value: Safer policy rollout.
Effort: Medium
Dependencies: Tool policy engine and hook decision surface.
Acceptance criteria:
- Given tool name/args/context, simulator returns allow/deny + reason chain.
- Simulation traces can be exported for review.

R-007 Gateway Abuse Protection Pack

Problem: WebSocket and POST ingress currently lack explicit anti-abuse controls.
Feature: Unified ingress controls (body-size limits, per-connection rate limits, adaptive penalties).
User value: More predictable reliability and lower abuse-induced cost.
Effort: Medium
Dependencies: Gateway server, shared request parsing, metrics.
Acceptance criteria:
- Oversized payloads rejected early.
- Rate violations are metered and observable.
- Healthy traffic retains low latency under synthetic abuse tests.

30/60/90-Day Execution Sequence

0-30 Days

Ship markdown sanitization and CSP hardening.
Add POST body-size limits and shared body reader.
Restore lint error baseline to zero in mainline.
Add config patch persistence contract proposal and tests.

31-60 Days

Implement persisted config.patch with versioning and rollback.
Introduce cancellable tool interface and migrate high-risk tools first.
Implement WebSocket rate limiting and abuse observability.
Add dashboard drift indicator.

61-90 Days

Complete cancellation coverage across remaining tools.
Launch quality health endpoint and dashboard integration.
Add policy simulation tool and operator workflow docs.

Appendix: Command Evidence

Commands run:

pnpm -s typecheck
pnpm -s build
pnpm -s test:run
pnpm -s lint

Observed outcomes:

Typecheck/build/test: passing.
Lint: passing with warnings only (0 errors, 247 warnings).

Historical pre-remediation lint error concentration snapshot:

src/daemon/models.ts: 90 errors
src/cli/tui.ts: 25 errors
src/daemon/routing.ts: 14 errors
src/gateway/ui/pages/settings.js: 14 errors

22 KiB Raw Blame History

Flynn Codebase Audit Report

Remediation Status (2026-02-16)

Executive Summary

Methodology and Scope

Findings (Bugs and Risks)

F-001 Critical: Web UI markdown XSS surface in chat rendering

F-002 High: Settings hook edits are runtime-only and likely not durable

F-003 High: Tool timeout does not cancel underlying execution

F-004 Medium: Lint quality gate is broken and concentrated in key runtime files

F-005 Medium: ESLint browser globals mismatch causes avoidable UI lint failures

F-006 High: Request body reader has no size limit (memory DoS risk)

F-007 Medium: Tool timeout race leaves orphan timers

F-008 Medium: WhatsApp channel launches Chromium without sandbox

F-009 Medium: Gateway WebSocket ingress lacks rate limiting

F-010 Medium: Compaction audit event logs token counts in message-count fields

F-011 Low: Slack user-name cache has no eviction strategy

F-012 Low-Medium: Synthetic tool nudge uses invalid tool_use_id

F-013 Low: Channel adapters duplicate pairing/reset/mention/chunking logic

F-014 Low: ModelRouter listener API has destructive setter footgun

F-015 Low: Retry defaults classify timeouts as non-retryable

F-016 Low: Duplicate request-body utility logic in two modules

F-017 Low: NativeAgent history getter returns mixed mutability semantics

F-018 Low: Session pruning interval has no jitter

F-019 Low: _restoreHistory no-op when no session is attached

Code Improvements (Prioritized Backlog)

Potential New Features (Prioritized Roadmap)

R-001 Runtime Config Persistence and Versioning

R-002 Tool Cancellation and Kill-Switch API

R-003 Security-Hardened Markdown Renderer

R-004 Quality Health Endpoint

R-005 Config Drift Indicator in Dashboard

R-006 Tool Safety Simulator (Dry-Run Policy Check)

R-007 Gateway Abuse Protection Pack

30/60/90-Day Execution Sequence

0-30 Days

31-60 Days

61-90 Days

Appendix: Command Evidence

22 KiB

Raw Blame History

F-012 Low-Medium: Synthetic tool nudge uses invalid `tool_use_id`

F-019 Low: `_restoreHistory` no-op when no session is attached