flynn/.planning/phases/03-live-ops-dashboard/03-01-PLAN.md
2026-02-09 21:10:03 -08:00

---
phase: 03-live-ops-dashboard
plan: 01
type: execute
wave: 1
depends_on: []
files_modified:
  - src/gateway/metrics.ts
  - src/gateway/metrics.test.ts
  - src/gateway/handlers/system.ts
  - src/gateway/server.ts
  - src/daemon/services.ts
autonomous: true
must_haves:
  truths:
    - "MetricsCollector accumulates counters (messages processed, errors) and model call metrics (latency, tokens/sec, provider)"
    - "Gateway exposes system.metrics and system.events RPC methods returning accumulated data"
    - "GET /health returns JSON with daemon status, uptime, and component readiness without WebSocket"
    - "Errors and significant events are captured in a ring buffer accessible via RPC"
    - "Active agent requests are tracked (in-flight count, tool executions, session IDs)"
  artifacts:
    - path: "src/gateway/metrics.ts"
      provides: "MetricsCollector class, the single source of truth for all ops metrics"
      exports: ["MetricsCollector"]
    - path: "src/gateway/metrics.test.ts"
      provides: "Tests for MetricsCollector"
      contains: "describe.*MetricsCollector"
    - path: "src/gateway/handlers/system.ts"
      provides: "New system.metrics, system.events, system.activeRequests RPC handlers"
      contains: "system.metrics"
    - path: "src/gateway/server.ts"
      provides: "HTTP /health endpoint and MetricsCollector wiring"
      contains: "/health"
    - path: "src/daemon/services.ts"
      provides: "MetricsCollector creation and wiring into gateway"
      contains: "MetricsCollector"
  key_links:
    - from: "src/gateway/server.ts"
      to: "src/gateway/metrics.ts"
      via: "GatewayServer holds MetricsCollector ref, passes to handlers"
      pattern: "metrics.*MetricsCollector"
    - from: "src/gateway/handlers/system.ts"
      to: "src/gateway/metrics.ts"
      via: "System handlers read from MetricsCollector"
      pattern: "getMetrics|getEvents|getActiveRequests"
    - from: "src/daemon/services.ts"
      to: "src/gateway/metrics.ts"
      via: "createGateway instantiates MetricsCollector"
      pattern: "new MetricsCollector"
---

Create the metrics collection backend and wire it into the gateway server with new RPC handlers and an HTTP /health endpoint.

Purpose: Provide the data layer that the dashboard UI (Plan 02) will consume. Without collected metrics, the dashboard has nothing to show beyond what system.health already provides.

Output: MetricsCollector class, 3 new RPC methods (system.metrics, system.events, system.activeRequests), HTTP GET /health endpoint, tests.

<execution_context> @/home/will/.config/opencode/get-shit-done/workflows/execute-plan.md @/home/will/.config/opencode/get-shit-done/templates/summary.md </execution_context>

@.planning/PROJECT.md @.planning/ROADMAP.md @.planning/STATE.md @src/gateway/server.ts @src/gateway/handlers/system.ts @src/gateway/handlers/index.ts @src/gateway/protocol.ts @src/gateway/router.ts @src/gateway/handlers/agent.ts @src/gateway/session-bridge.ts @src/daemon/services.ts @src/daemon/index.ts

Task 1: Create MetricsCollector and wire into gateway

Files: src/gateway/metrics.ts, src/gateway/metrics.test.ts, src/gateway/server.ts, src/gateway/handlers/system.ts, src/gateway/handlers/index.ts, src/daemon/services.ts

Create `src/gateway/metrics.ts` with a `MetricsCollector` class that tracks:

Counters (simple incrementing numbers):

  • messagesProcessed — incremented each time an agent.send completes (success or error)
  • errors — incremented on agent.send errors and any other recorded errors
  • activeRequests — gauge (increment on start, decrement on end)

Model call metrics (ring buffer of recent calls, max 200 entries): Each entry: { timestamp: number, provider: string, latency: number, inputTokens: number, outputTokens: number, tokensPerSec: number, error?: string }

  • recordModelCall(entry) — push to ring buffer
  • getModelMetrics() — return the array
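
A bounded buffer with FIFO eviction, as required for the model call metrics, could be sketched roughly as follows. The `RingBuffer` class and its `toArray()` method are hypothetical names chosen for illustration; the plan itself only specifies the 200-entry cap, `recordModelCall(entry)`, and `getModelMetrics()`.

```typescript
// Hypothetical helper: the plan asks for "a ring buffer, max N entries"
// but does not prescribe this class or its method names.
class RingBuffer<T> {
  private readonly capacity: number;
  private items: T[] = [];

  constructor(capacity: number) {
    if (!Number.isInteger(capacity) || capacity <= 0) {
      throw new RangeError("capacity must be a positive integer");
    }
    this.capacity = capacity;
  }

  // Append an entry; once full, drop the oldest entry (FIFO eviction).
  push(item: T): void {
    this.items.push(item);
    if (this.items.length > this.capacity) {
      this.items.shift();
    }
  }

  // Return a copy in insertion order (oldest first).
  toArray(): T[] {
    return this.items.slice();
  }
}

// Entry shape taken from the plan.
interface ModelCallEntry {
  timestamp: number;
  provider: string;
  latency: number;
  inputTokens: number;
  outputTokens: number;
  tokensPerSec: number;
  error?: string;
}

const modelCalls = new RingBuffer<ModelCallEntry>(200);

function recordModelCall(entry: ModelCallEntry): void {
  modelCalls.push(entry);
}

function getModelMetrics(): ModelCallEntry[] {
  return modelCalls.toArray();
}
```

`Array.prototype.shift` is linear in the buffer length, which should be acceptable at 200 to 500 entries; a fixed array with a moving head index would avoid the copy if that ever matters.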

Event stream (ring buffer of recent events, max 500 entries): Each entry: { timestamp: number, level: 'info' | 'warn' | 'error', source: string, message: string, context?: Record<string, unknown> }

  • recordEvent(event) — push to ring buffer
  • getEvents(opts?: { level?: string, limit?: number }) — return filtered/limited array (newest first)
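
The filtering and ordering behaviour of `getEvents` might look like the sketch below. It assumes events are stored oldest first (insertion order) and that `limit` is applied after the level filter; `selectEvents` and `MetricsEvent` are hypothetical names, since the plan only fixes the entry shape and the `opts` parameters.

```typescript
type EventLevel = "info" | "warn" | "error";

// Entry shape taken from the plan; the interface name is an assumption.
interface MetricsEvent {
  timestamp: number;
  level: EventLevel;
  source: string;
  message: string;
  context?: Record<string, unknown>;
}

// Hypothetical standalone version of getEvents(opts).
// `events` is expected in insertion order (oldest first).
function selectEvents(
  events: readonly MetricsEvent[],
  opts: { level?: string; limit?: number } = {},
): MetricsEvent[] {
  // Copy before reversing so the stored buffer is not mutated.
  const newestFirst = events.slice().reverse();
  const filtered =
    opts.level === undefined
      ? newestFirst
      : newestFirst.filter((e) => e.level === opts.level);
  // Assumption: the limit counts matching events, not scanned events.
  return opts.limit === undefined
    ? filtered
    : filtered.slice(0, Math.max(0, opts.limit));
}
```

For example, `selectEvents(buffer, { level: "error", limit: 10 })` would return at most the ten most recent error events.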

Active request tracking:

  • startRequest(id: string, info: { sessionId: string, channel: string }) — records start time + info
  • endRequest(id: string) — removes from active map
  • getActiveRequests() — returns array of { id, sessionId, channel, startedAt, durationMs }
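
One way to implement the active request map is sketched below. The `ActiveRequestTracker` class and the injectable `now` clock are assumptions made so the duration calculation can be tested deterministically; in the plan these methods live directly on `MetricsCollector`.

```typescript
interface ActiveRequestInfo {
  sessionId: string;
  channel: string;
}

interface ActiveRequestView extends ActiveRequestInfo {
  id: string;
  startedAt: number;
  durationMs: number;
}

// Hypothetical standalone tracker; the method names follow the plan.
class ActiveRequestTracker {
  private readonly active = new Map<string, ActiveRequestInfo & { startedAt: number }>();
  private readonly now: () => number;

  // The clock is injectable (an assumption) to keep tests deterministic.
  constructor(now: () => number = () => Date.now()) {
    this.now = now;
  }

  startRequest(id: string, info: ActiveRequestInfo): void {
    this.active.set(id, { ...info, startedAt: this.now() });
  }

  // Ending an unknown id is a no-op.
  endRequest(id: string): void {
    this.active.delete(id);
  }

  // durationMs is computed at read time, not stored.
  getActiveRequests(): ActiveRequestView[] {
    const now = this.now();
    return Array.from(this.active, ([id, r]) => ({
      id,
      sessionId: r.sessionId,
      channel: r.channel,
      startedAt: r.startedAt,
      durationMs: now - r.startedAt,
    }));
  }

  // Doubles as the activeRequests gauge.
  get count(): number {
    return this.active.size;
  }
}
```

Deriving the gauge from the map size, rather than keeping a separate counter, avoids the two values drifting apart if `endRequest` is called twice for the same id.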

Snapshot method:

  • getSnapshot() — returns { messagesProcessed, errors, activeRequests: number, uptime: number, modelCalls: { total, avgLatency, errorRate, recentCalls }, queueDepth: number }
  • Accept a getQueueDepth callback in constructor for LaneQueue integration
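
The `modelCalls` part of the snapshot could be derived from the buffered calls roughly as follows. This is a sketch under stated assumptions: `avgLatency` is the arithmetic mean in the same unit as `latency`, `errorRate` is a fraction between 0 and 1, and `total` counts only calls still held in the ring buffer. The plan does not pin down any of these, and `summarizeModelCalls` is a hypothetical helper name.

```typescript
// Only the fields needed for aggregation; a subset of the plan's entry shape.
interface ModelCallSample {
  latency: number;
  error?: string;
}

interface ModelCallSummary {
  total: number;
  avgLatency: number;
  errorRate: number;
}

// Hypothetical helper for the modelCalls section of getSnapshot().
function summarizeModelCalls(calls: readonly ModelCallSample[]): ModelCallSummary {
  const total = calls.length;
  if (total === 0) {
    // Avoid dividing by zero before any call has been recorded.
    return { total: 0, avgLatency: 0, errorRate: 0 };
  }
  let latencySum = 0;
  let errorCount = 0;
  for (const call of calls) {
    latencySum += call.latency;
    if (call.error !== undefined) {
      errorCount += 1;
    }
  }
  return {
    total,
    avgLatency: latencySum / total,
    errorRate: errorCount / total,
  };
}
```

If `total` is meant to be a lifetime count rather than the buffer size, a separate counter incremented in `recordModelCall` would be needed.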

The class should be simple, synchronous (no async), and have NO external dependencies beyond Node.js builtins. Export the class and all relevant types.

Wire MetricsCollector into the gateway:

  1. In src/gateway/server.ts:

    • Add metrics?: MetricsCollector to GatewayServerConfig interface
    • Store the metrics instance on the GatewayServer class
    • In handleHttpRequest, add a handler for GET /health BEFORE the auth check (health endpoint should be unauthenticated for Docker HEALTHCHECK). Return JSON: { status: 'ok', uptime: <seconds>, version: <string>, sessions: <count>, connections: <count>, tools: <count>, channels: <channelList> }. Use the same data sources as system.health RPC handler. Set Content-Type: application/json.
    • In the agent.send flow: the GatewayServer doesn't handle agent.send directly (it's in the handler), so instead expose getMetrics() accessor on GatewayServer so handlers can access it.
  2. In src/gateway/handlers/system.ts:

    • Add getMetrics?: () => { messagesProcessed: number, errors: number, activeRequests: number, uptime: number, modelCalls: { total: number, avgLatency: number, errorRate: number, recentCalls: unknown[] }, queueDepth: number } to SystemHandlerDeps
    • Add getEvents?: () => unknown[] and getActiveRequests?: () => unknown[] to SystemHandlerDeps
    • Add system.metrics handler: returns getMetrics() snapshot
    • Add system.events handler: returns getEvents() with optional level and limit params
    • Add system.activeRequests handler: returns getActiveRequests() array
    • Update the re-exports in src/gateway/handlers/index.ts if any new types need exporting
  3. In src/gateway/handlers/agent.ts (or via a wrapper in server.ts):

    • The agent handler does not import MetricsCollector directly. In src/gateway/server.ts, when registering handlers, wrap the system handlers construction so that the GatewayServer passes the metrics callbacks via SystemHandlerDeps. For request tracking in agent.send, add onRequestStart and onRequestEnd callbacks to AgentHandlerDeps so the server can hook MetricsCollector in.
  4. In src/daemon/services.ts:

    • In createGateway(), instantiate new MetricsCollector({ getQueueDepth: () => 0 }) (queue depth from LaneQueue is internal to GatewayServer; we'll wire it there).
    • Pass it to the GatewayServer config as metrics.
    • Actually, better approach: let GatewayServer create the MetricsCollector itself in its constructor using its own LaneQueue. This keeps it self-contained. In GatewayServerConfig, just add metricsEnabled?: boolean (default true). The GatewayServer constructor creates this.metrics = new MetricsCollector({ getQueueDepth: () => this.laneQueue.totalPending() }).
    • Add a totalPending() method to LaneQueue that sums all queue lengths across lanes.
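
The unauthenticated GET /health check from step 1 could be factored as a small function that runs before the auth check and reports whether it handled the request. Everything named here (`HealthSources`, `JsonResponse`, `tryHandleHealth`) is hypothetical; the real data sources are whatever the system.health RPC handler already uses, and the response interface is only a minimal stand-in shaped after Node's `ServerResponse`.

```typescript
// Hypothetical view of the data behind system.health.
interface HealthSources {
  startedAtMs: number;
  version: string;
  sessionCount: () => number;
  connectionCount: () => number;
  toolCount: () => number;
  channelList: () => string[];
}

// Minimal structural stand-in for the HTTP response object.
interface JsonResponse {
  writeHead(status: number, headers: Record<string, string>): unknown;
  end(body: string): unknown;
}

// Returns true when the request was a health check and has been answered,
// so the caller can return before running authentication.
function tryHandleHealth(
  method: string | undefined,
  url: string | undefined,
  res: JsonResponse,
  src: HealthSources,
  nowMs: number = Date.now(),
): boolean {
  if (method !== "GET" || url === undefined) return false;
  // Ignore any query string when matching the path.
  if (url.split("?")[0] !== "/health") return false;

  const payload = {
    status: "ok",
    uptime: Math.floor((nowMs - src.startedAtMs) / 1000), // seconds
    version: src.version,
    sessions: src.sessionCount(),
    connections: src.connectionCount(),
    tools: src.toolCount(),
    channels: src.channelList(),
  };
  res.writeHead(200, { "Content-Type": "application/json" });
  res.end(JSON.stringify(payload));
  return true;
}
```

Inside `handleHttpRequest`, the call would sit at the top, for example `if (tryHandleHealth(req.method, req.url, res, sources)) return;`, followed by the existing auth check.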

Simpler approach, which avoids changing too many files and replaces the config-based wiring described in steps 1 and 4:

  • Create MetricsCollector in GatewayServer constructor (it already has LaneQueue). No config change needed in services.ts.
  • GatewayServer passes metrics callbacks to system handler deps and agent handler deps.
  • This keeps the metrics concern entirely within the gateway module.

Use this simpler approach. Changes to src/daemon/services.ts are minimal or unnecessary — just ensure the GatewayServer starts collecting metrics automatically.

Update LaneQueue (src/gateway/lane-queue.ts):

  • Add a totalPending(): number method that returns the sum of all lane queue lengths (iterate over lanes, sum queue.length).
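
A minimal sketch of `totalPending()` and the `getQueueDepth` callback follows. The internal layout of the real LaneQueue is not shown in this plan, so the `lanes` map, the `queue` array per lane, and the `enqueue` method are all assumptions made only to give the method something to sum over.

```typescript
// Hypothetical stand-in for src/gateway/lane-queue.ts; the real class
// may store lanes differently.
class LaneQueueSketch {
  private readonly lanes = new Map<string, { queue: unknown[] }>();

  // Assumed enqueue method, present only so the sketch has data to sum.
  enqueue(laneId: string, job: unknown): void {
    let lane = this.lanes.get(laneId);
    if (lane === undefined) {
      lane = { queue: [] };
      this.lanes.set(laneId, lane);
    }
    lane.queue.push(job);
  }

  // Sum of pending items across all lanes.
  totalPending(): number {
    let total = 0;
    this.lanes.forEach((lane) => {
      total += lane.queue.length;
    });
    return total;
  }
}

// The collector receives a callback rather than the queue itself, so it
// stays free of gateway internals.
const laneQueue = new LaneQueueSketch();
const getQueueDepth = (): number => laneQueue.totalPending();
```

In the GatewayServer constructor this callback would be passed as `new MetricsCollector({ getQueueDepth })`, and `getSnapshot()` would call it each time to report the current `queueDepth`.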

Tests in src/gateway/metrics.test.ts:

  • Test counter increment/decrement
  • Test model call ring buffer (max 200, FIFO eviction)
  • Test event ring buffer (max 500, FIFO eviction, filtering by level)
  • Test active request tracking (start, end, duration calculation)
  • Test getSnapshot returns correct shape
  • Test getEvents with level filter and limit

Run `pnpm test:run` to verify zero regressions plus new tests pass. Run `pnpm typecheck` to verify no type errors.

Verify:

  • `pnpm test:run`: all existing 1077 tests pass plus new metrics tests pass.
  • `pnpm typecheck`: no type errors.
  • `grep -r "system.metrics\|system.events\|system.activeRequests" src/gateway/handlers/system.ts`: confirms new RPC methods exist.
  • `grep -r "GET.*health\|/health" src/gateway/server.ts`: confirms HTTP health endpoint exists.

Done: MetricsCollector created with counters, model call ring buffer, event ring buffer, and active request tracking. Three new RPC handlers registered (system.metrics, system.events, system.activeRequests). GET /health returns unauthenticated JSON health status. LaneQueue has totalPending() method. All tests pass with zero regressions.

Task 2: Hook metrics recording into agent request flow

Files: src/gateway/server.ts, src/gateway/handlers/agent.ts, src/gateway/lane-queue.ts

Wire the MetricsCollector into the actual agent request flow so metrics are populated with real data as messages flow through the system.

  1. In src/gateway/server.ts registerHandlers():

    • When creating agent handlers, pass onRequestStart and onRequestEnd callbacks that call this.metrics.startRequest() and this.metrics.endRequest() respectively.
    • When creating agent handlers, pass onRequestComplete callback that calls this.metrics.incrementMessages() and optionally this.metrics.incrementErrors() on error.
    • Pass onModelCall callback that the agent handler can call with latency/token data.
    • Actually, the simpler pattern: pass the MetricsCollector instance directly to agent handler deps: metrics?: MetricsCollector. The agent handler can then call the methods directly. This is cleaner than a bag of callbacks.
  2. In src/gateway/handlers/agent.ts:

    • Add metrics?: MetricsCollector to AgentHandlerDeps (import type from ../metrics.js)
    • In agent.send handler:
      • At start: const requestId = request.id.toString(); deps.metrics?.startRequest(requestId, { sessionId: laneId, channel: 'ws' });
      • In the try block after agent.process() resolves: deps.metrics?.incrementMessages();
      • In the catch block: deps.metrics?.incrementErrors(); deps.metrics?.recordEvent({ timestamp: Date.now(), level: 'error', source: 'agent.send', message: err.message || 'Unknown error', context: { sessionId: laneId } });
      • In finally: deps.metrics?.endRequest(requestId);
    • For tool use events, record them to metrics: when event.type === 'end' and event.result and !event.result.success, increment error counter and record error event.
  3. In src/gateway/server.ts registerHandlers():

    • Pass metrics: this.metrics when constructing agent handler deps.
    • Update the system handlers construction to pass the metrics accessors:
      getMetrics: () => this.metrics.getSnapshot(),
      getEvents: (opts) => this.metrics.getEvents(opts),
      getActiveRequests: () => this.metrics.getActiveRequests(),
      
  4. Test the wiring:

    • In the existing src/gateway/server.test.ts or src/gateway/handlers/handlers.test.ts, verify that sending a message through agent.send increments the metrics counters. If the existing test infrastructure doesn't easily support this, at minimum verify through the type system that the wiring is correct.
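
The start / success / error / finally sequence described for the agent.send handler can be sketched as a wrapper around the processing call. `MetricsLike` and `sendWithMetrics` are hypothetical names; the method names on the interface are the ones this plan calls on `deps.metrics`, and the `run` callback stands in for `agent.process()`.

```typescript
// Subset of MetricsCollector used by agent.send, per the steps above.
interface MetricsLike {
  startRequest(id: string, info: { sessionId: string; channel: string }): void;
  endRequest(id: string): void;
  incrementMessages(): void;
  incrementErrors(): void;
  recordEvent(event: {
    timestamp: number;
    level: "info" | "warn" | "error";
    source: string;
    message: string;
    context?: Record<string, unknown>;
  }): void;
}

// Hypothetical wrapper; `metrics` is optional, mirroring `deps.metrics?`.
async function sendWithMetrics<T>(
  requestId: string,
  laneId: string,
  metrics: MetricsLike | undefined,
  run: () => Promise<T>,
): Promise<T> {
  metrics?.startRequest(requestId, { sessionId: laneId, channel: "ws" });
  try {
    const result = await run();
    metrics?.incrementMessages();
    return result;
  } catch (err) {
    metrics?.incrementErrors();
    metrics?.recordEvent({
      timestamp: Date.now(),
      level: "error",
      source: "agent.send",
      // Caught values are not guaranteed to be Error instances.
      message: err instanceof Error ? err.message : "Unknown error",
      context: { sessionId: laneId },
    });
    throw err;
  } finally {
    // Runs on both paths, so the active-request entry is always removed.
    metrics?.endRequest(requestId);
  }
}
```

This follows the Task 2 steps literally, so `incrementMessages()` runs only on success. Task 1 defines messagesProcessed as counting completions "success or error"; if that definition is the intended one, the call would move into the `finally` block instead.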

Run `pnpm test:run` and `pnpm typecheck`.

Verify:

  • `pnpm test:run`: all tests pass (1077 existing + new metrics tests).
  • `pnpm typecheck`: no type errors.
  • `grep -r "metrics\." src/gateway/handlers/agent.ts`: confirms metrics calls in agent handler.

Done: Agent request flow records messagesProcessed counter, error counter, active request tracking, and error events. Tool failures are recorded as error events. System handlers return live metrics data from MetricsCollector. All tests pass, no type errors.

  1. `pnpm test:run`: all 1077+ tests pass
  2. `pnpm typecheck`: zero type errors
  3. New system.metrics, system.events, system.activeRequests RPC methods registered (check via getMethods())
  4. GET /health returns valid JSON with status, uptime, version fields
  5. MetricsCollector ring buffers enforce size limits

<success_criteria>

  • MetricsCollector exists with counters, model call buffer, event buffer, active request tracking
  • Three new RPC handlers return metrics data
  • GET /health endpoint returns unauthenticated JSON health status
  • Agent request flow records messagesProcessed, errors, active requests, and error events
  • Zero test regressions, all new tests pass

</success_criteria>

After completion, create `.planning/phases/03-live-ops-dashboard/03-01-SUMMARY.md`