2026-02-09 21:10:03 -08:00

---
phase: 03-live-ops-dashboard
plan: 01
type: execute
wave: 1
depends_on: []
files_modified:
  - src/gateway/metrics.ts
  - src/gateway/metrics.test.ts
  - src/gateway/handlers/system.ts
  - src/gateway/server.ts
  - src/daemon/services.ts
autonomous: true
must_haves:
  truths:
    - "MetricsCollector accumulates counters (messages processed, errors) and model call metrics (latency, tokens/sec, provider)"
    - "Gateway exposes system.metrics and system.events RPC methods returning accumulated data"
    - "GET /health returns JSON with daemon status, uptime, and component readiness without WebSocket"
    - "Errors and significant events are captured in a ring buffer accessible via RPC"
    - "Active agent requests are tracked (in-flight count, tool executions, session IDs)"
  artifacts:
    - path: "src/gateway/metrics.ts"
      provides: "MetricsCollector class: single source of truth for all ops metrics"
      exports: ["MetricsCollector"]
    - path: "src/gateway/metrics.test.ts"
      provides: "Tests for MetricsCollector"
      contains: "describe.*MetricsCollector"
    - path: "src/gateway/handlers/system.ts"
      provides: "New system.metrics, system.events, system.activeRequests RPC handlers"
      contains: "system.metrics"
    - path: "src/gateway/server.ts"
      provides: "HTTP /health endpoint and MetricsCollector wiring"
      contains: "/health"
    - path: "src/daemon/services.ts"
      provides: "MetricsCollector creation and wiring into gateway"
      contains: "MetricsCollector"
  key_links:
    - from: "src/gateway/server.ts"
      to: "src/gateway/metrics.ts"
      via: "GatewayServer holds MetricsCollector ref, passes to handlers"
      pattern: "metrics.*MetricsCollector"
    - from: "src/gateway/handlers/system.ts"
      to: "src/gateway/metrics.ts"
      via: "System handlers read from MetricsCollector"
      pattern: "getMetrics|getEvents|getActiveRequests"
    - from: "src/daemon/services.ts"
      to: "src/gateway/metrics.ts"
      via: "createGateway instantiates MetricsCollector"
      pattern: "new MetricsCollector"
---
<objective>
Create the metrics collection backend and wire it into the gateway server with new RPC handlers and an HTTP /health endpoint.
Purpose: Provide the data layer that the dashboard UI (Plan 02) will consume. Without collected metrics, the dashboard has nothing to show beyond what system.health already provides.
Output: MetricsCollector class, 3 new RPC methods (system.metrics, system.events, system.activeRequests), HTTP GET /health endpoint, tests.
</objective>
<execution_context>
@/home/will/.config/opencode/get-shit-done/workflows/execute-plan.md
@/home/will/.config/opencode/get-shit-done/templates/summary.md
</execution_context>
<context>
@.planning/PROJECT.md
@.planning/ROADMAP.md
@.planning/STATE.md
@src/gateway/server.ts
@src/gateway/handlers/system.ts
@src/gateway/handlers/index.ts
@src/gateway/protocol.ts
@src/gateway/router.ts
@src/gateway/handlers/agent.ts
@src/gateway/session-bridge.ts
@src/daemon/services.ts
@src/daemon/index.ts
</context>
<tasks>
<task type="auto">
<name>Task 1: Create MetricsCollector and wire into gateway</name>
<files>
src/gateway/metrics.ts
src/gateway/metrics.test.ts
src/gateway/server.ts
src/gateway/handlers/system.ts
src/gateway/handlers/index.ts
src/daemon/services.ts
</files>
<action>
Create `src/gateway/metrics.ts` with a `MetricsCollector` class that tracks:
**Counters and gauges:**
- `messagesProcessed`: counter, incremented via `incrementMessages()` each time an agent.send completes (success or error)
- `errors`: counter, incremented via `incrementErrors()` on agent.send errors and any other recorded errors
- `activeRequests`: gauge (increment on start, decrement on end)
**Model call metrics (ring buffer of recent calls, max 200 entries):**
Each entry: `{ timestamp: number, provider: string, latency: number, inputTokens: number, outputTokens: number, tokensPerSec: number, error?: string }`
- `recordModelCall(entry)` — push to ring buffer
- `getModelMetrics()` — return the array
**Event stream (ring buffer of recent events, max 500 entries):**
Each entry: `{ timestamp: number, level: 'info' | 'warn' | 'error', source: string, message: string, context?: Record<string, unknown> }`
- `recordEvent(event)` — push to ring buffer
- `getEvents(opts?: { level?: string, limit?: number })` — return filtered/limited array (newest first)
**Active request tracking:**
- `startRequest(id: string, info: { sessionId: string, channel: string })` — records start time + info
- `endRequest(id: string)` — removes from active map
- `getActiveRequests()` — returns array of `{ id, sessionId, channel, startedAt, durationMs }`
**Snapshot method:**
- `getSnapshot()` — returns `{ messagesProcessed, errors, activeRequests: number, uptime: number, modelCalls: { total, avgLatency, errorRate, recentCalls }, queueDepth: number }`
- Accept a `getQueueDepth` callback in constructor for LaneQueue integration
The class should be simple, synchronous (no async), and have NO external dependencies beyond Node.js builtins. Export the class and all relevant types.
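The specification above could be sketched roughly as follows. This is a minimal illustration, not the final implementation: the `RingBuffer` helper is hypothetical, uptime is assumed to be in seconds, and `modelCalls.total` is assumed to count only the calls still held in the buffer. The real module would also export the class and its types.

```typescript
// Sketch only. Names follow the plan; RingBuffer is a hypothetical helper.
type EventLevel = 'info' | 'warn' | 'error';
interface MetricsEvent {
  timestamp: number; level: EventLevel; source: string; message: string;
  context?: Record<string, unknown>;
}
interface ModelCall {
  timestamp: number; provider: string; latency: number;
  inputTokens: number; outputTokens: number; tokensPerSec: number; error?: string;
}

/** Fixed-capacity FIFO buffer: pushing past capacity evicts the oldest entry. */
class RingBuffer<T> {
  private items: T[] = [];
  private readonly capacity: number;
  constructor(capacity: number) { this.capacity = capacity; }
  push(item: T): void {
    this.items.push(item);
    if (this.items.length > this.capacity) this.items.shift();
  }
  toArray(): T[] { return this.items.slice(); } // oldest first
}

class MetricsCollector {
  private messagesProcessed = 0;
  private errors = 0;
  private readonly startedAt = Date.now();
  private readonly modelCalls = new RingBuffer<ModelCall>(200);
  private readonly events = new RingBuffer<MetricsEvent>(500);
  private readonly active = new Map<string, { sessionId: string; channel: string; startedAt: number }>();
  private readonly getQueueDepth: () => number;

  constructor(opts: { getQueueDepth: () => number }) { this.getQueueDepth = opts.getQueueDepth; }

  incrementMessages(): void { this.messagesProcessed++; }
  incrementErrors(): void { this.errors++; }
  recordModelCall(entry: ModelCall): void { this.modelCalls.push(entry); }
  getModelMetrics(): ModelCall[] { return this.modelCalls.toArray(); }
  recordEvent(event: MetricsEvent): void { this.events.push(event); }

  getEvents(opts: { level?: string; limit?: number } = {}): MetricsEvent[] {
    let result = this.events.toArray().reverse(); // newest first
    if (opts.level !== undefined) result = result.filter((e) => e.level === opts.level);
    return opts.limit !== undefined ? result.slice(0, opts.limit) : result;
  }

  startRequest(id: string, info: { sessionId: string; channel: string }): void {
    this.active.set(id, { sessionId: info.sessionId, channel: info.channel, startedAt: Date.now() });
  }
  endRequest(id: string): void { this.active.delete(id); }
  getActiveRequests() {
    const now = Date.now();
    return Array.from(this.active.entries()).map(([id, r]) => ({
      id, sessionId: r.sessionId, channel: r.channel,
      startedAt: r.startedAt, durationMs: now - r.startedAt,
    }));
  }

  getSnapshot() {
    const calls = this.modelCalls.toArray();
    const total = calls.length; // assumption: only calls still in the buffer
    const failed = calls.filter((c) => c.error !== undefined).length;
    const latencySum = calls.reduce((sum, c) => sum + c.latency, 0);
    return {
      messagesProcessed: this.messagesProcessed,
      errors: this.errors,
      activeRequests: this.active.size,
      uptime: Math.floor((Date.now() - this.startedAt) / 1000), // seconds (assumed)
      modelCalls: {
        total,
        avgLatency: total === 0 ? 0 : latencySum / total,
        errorRate: total === 0 ? 0 : failed / total,
        recentCalls: calls,
      },
      queueDepth: this.getQueueDepth(),
    };
  }
}
```

Deriving `activeRequests` from the size of the active-request map, rather than keeping a separate gauge, is one way to stop the two values from drifting apart.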
**Wire MetricsCollector into the gateway:**
1. In `src/gateway/server.ts`:
- Store a `MetricsCollector` instance on the `GatewayServer` class. The server creates it in its own constructor (see step 4), so `GatewayServerConfig` does not need a `metrics` field.
- In `handleHttpRequest`, handle `GET /health` BEFORE the auth check, because the endpoint must be unauthenticated for Docker HEALTHCHECK. Return JSON: `{ status: 'ok', uptime: <seconds>, version: <string>, sessions: <count>, connections: <count>, tools: <count>, channels: <channelList> }`. Use the same data sources as the `system.health` RPC handler and set `Content-Type: application/json`.
- `agent.send` is handled in the agent handler rather than in GatewayServer itself, so expose a `getMetrics()` accessor on GatewayServer through which handlers can reach the collector.
2. In `src/gateway/handlers/system.ts`:
- Add `getMetrics?: () => { messagesProcessed: number, errors: number, activeRequests: number, uptime: number, modelCalls: { total: number, avgLatency: number, errorRate: number, recentCalls: unknown[] }, queueDepth: number }` to `SystemHandlerDeps`
- Add `getEvents?: (opts?: { level?: string, limit?: number }) => unknown[]` and `getActiveRequests?: () => unknown[]` to `SystemHandlerDeps`
- Add `system.metrics` handler: returns `getMetrics()` snapshot
- Add `system.events` handler: passes the optional `level` and `limit` params to `getEvents(opts)` and returns the result
- Add `system.activeRequests` handler: returns `getActiveRequests()` array
- Update the re-exports in `src/gateway/handlers/index.ts` if any new types need exporting
3. In `src/gateway/handlers/agent.ts` (wired from `src/gateway/server.ts`):
- The agent handler does not create or own the MetricsCollector. When registering handlers, GatewayServer passes the metrics accessors to the system handlers through `SystemHandlerDeps`, and gives the agent handler a hook for request tracking through `AgentHandlerDeps`. This can be a pair of `onRequestStart` and `onRequestEnd` callbacks; Task 2 settles on passing the collector itself as `metrics?: MetricsCollector`.
4. In `src/daemon/services.ts`:
- No changes should be needed. Instantiating `new MetricsCollector({ getQueueDepth: () => 0 })` in `createGateway()` and passing it through the GatewayServer config was considered, but queue depth lives in the LaneQueue, which is internal to GatewayServer, so the collector could not report a real value from there.
- Instead, create the MetricsCollector in the GatewayServer constructor, which already has the LaneQueue: `this.metrics = new MetricsCollector({ getQueueDepth: () => this.laneQueue.totalPending() })`. This relies on the new `totalPending()` method on LaneQueue. A `metricsEnabled?: boolean` option (default true) on `GatewayServerConfig` was also floated but is not required by this approach.
- GatewayServer passes metrics callbacks to system handler deps and agent handler deps.
- This keeps the metrics concern entirely within the gateway module and avoids changing too many files. Just ensure the GatewayServer starts collecting metrics automatically.
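The unauthenticated `GET /health` handler from step 1 might look something like the sketch below. The `HealthSources` interface and both function names are hypothetical stand-ins for whatever data the `system.health` RPC handler already reads, and the request and response types are minimal structural interfaces intended to be compatible with Node's `http` types rather than the real ones.

```typescript
// Sketch only: HealthSources, buildHealthPayload and tryHandleHealth are
// hypothetical names, not existing gateway APIs.
interface HealthSources {
  startedAt: number; // epoch milliseconds when the daemon started
  version: string;
  sessionCount: () => number;
  connectionCount: () => number;
  toolCount: () => number;
  channels: () => string[];
}

// Minimal structural stand-ins for http.IncomingMessage / http.ServerResponse.
interface MinimalRequest { method?: string; url?: string }
interface MinimalResponse {
  writeHead(status: number, headers: Record<string, string>): unknown;
  end(body: string): unknown;
}

function buildHealthPayload(src: HealthSources, now: number = Date.now()) {
  return {
    status: 'ok' as const,
    uptime: Math.floor((now - src.startedAt) / 1000), // seconds
    version: src.version,
    sessions: src.sessionCount(),
    connections: src.connectionCount(),
    tools: src.toolCount(),
    channels: src.channels(),
  };
}

/**
 * Returns true when the request was a GET /health and has been answered.
 * Call this before the auth check so the endpoint stays unauthenticated.
 */
function tryHandleHealth(req: MinimalRequest, res: MinimalResponse, src: HealthSources): boolean {
  const path = (req.url ?? '').split('?')[0];
  if (req.method !== 'GET' || path !== '/health') return false;
  res.writeHead(200, { 'Content-Type': 'application/json' });
  res.end(JSON.stringify(buildHealthPayload(src)));
  return true;
}
```

Returning a boolean lets `handleHttpRequest` fall through to its existing auth check and routing whenever the request is not a health probe.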
**Update LaneQueue** (`src/gateway/lane-queue.ts`):
- Add a `totalPending(): number` method that returns the sum of all lane queue lengths (iterate over lanes, sum queue.length).
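As a rough illustration of `totalPending()`, the sketch below assumes LaneQueue keeps one array of pending tasks per lane in a `Map`. The real class may store its lanes differently, so `LaneQueueSketch`, its `lanes` field, and `enqueue` are hypothetical.

```typescript
// Sketch only: the internal layout of the real LaneQueue is an assumption.
type PendingTask = () => Promise<void>;

class LaneQueueSketch {
  // Hypothetical storage: lane id -> tasks waiting to run in that lane.
  private readonly lanes = new Map<string, PendingTask[]>();

  enqueue(laneId: string, task: PendingTask): void {
    const queue = this.lanes.get(laneId) ?? [];
    queue.push(task);
    this.lanes.set(laneId, queue);
  }

  /** Sum of pending tasks across all lanes. */
  totalPending(): number {
    let total = 0;
    this.lanes.forEach((queue) => { total += queue.length; });
    return total;
  }
}
```

The method is synchronous and leaves the queues unchanged, so it is safe to call from `getSnapshot()` on every `system.metrics` request.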
**Tests in `src/gateway/metrics.test.ts`:**
- Test counter increment/decrement
- Test model call ring buffer (max 200, FIFO eviction)
- Test event ring buffer (max 500, FIFO eviction, filtering by level)
- Test active request tracking (start, end, duration calculation)
- Test getSnapshot returns correct shape
- Test getEvents with level filter and limit
Run `pnpm test:run` to verify zero regressions plus new tests pass.
Run `pnpm typecheck` to verify no type errors.
</action>
<verify>
`pnpm test:run` — all existing 1077 tests pass plus new metrics tests pass.
`pnpm typecheck` — no type errors.
`grep -r "system.metrics\|system.events\|system.activeRequests" src/gateway/handlers/system.ts` — confirms new RPC methods exist.
`grep -r "GET.*health\|/health" src/gateway/server.ts` — confirms HTTP health endpoint exists.
</verify>
<done>
MetricsCollector created with counters, model call ring buffer, event ring buffer, and active request tracking.
Three new RPC handlers registered (system.metrics, system.events, system.activeRequests).
GET /health returns unauthenticated JSON health status.
LaneQueue has totalPending() method.
All tests pass with zero regressions.
</done>
</task>
<task type="auto">
<name>Task 2: Hook metrics recording into agent request flow</name>
<files>
src/gateway/server.ts
src/gateway/handlers/agent.ts
src/gateway/lane-queue.ts
</files>
<action>
Wire the MetricsCollector into the actual agent request flow so metrics are populated with real data as messages flow through the system.
1. **In `src/gateway/server.ts` registerHandlers():**
- When creating agent handlers, pass the MetricsCollector instance directly in the agent handler deps as `metrics?: MetricsCollector`. The agent handler then calls `startRequest()`, `endRequest()`, `incrementMessages()`, `incrementErrors()`, and `recordModelCall()` (with latency/token data) itself.
- This is cleaner than the alternative of a bag of callbacks (`onRequestStart`, `onRequestEnd`, `onRequestComplete`, `onModelCall`).
2. **In `src/gateway/handlers/agent.ts`:**
- Add `metrics?: MetricsCollector` to `AgentHandlerDeps` (import type from `../metrics.js`)
- In `agent.send` handler:
- At start: `const requestId = request.id.toString(); deps.metrics?.startRequest(requestId, { sessionId: laneId, channel: 'ws' });`
- In the `catch` block: `deps.metrics?.incrementErrors(); deps.metrics?.recordEvent({ timestamp: Date.now(), level: 'error', source: 'agent.send', message: err instanceof Error ? err.message : 'Unknown error', context: { sessionId: laneId } });`
- In `finally`: `deps.metrics?.incrementMessages(); deps.metrics?.endRequest(requestId);` so that `messagesProcessed` counts every completed agent.send, success or error, as defined in Task 1.
- For tool use events, record them to metrics: when `event.type === 'end'` and `event.result` and `!event.result.success`, increment error counter and record error event.
3. **In `src/gateway/server.ts` registerHandlers():**
- Pass `metrics: this.metrics` when constructing agent handler deps.
- Update the system handlers construction to pass the metrics accessors:
```
getMetrics: () => this.metrics.getSnapshot(),
getEvents: (opts) => this.metrics.getEvents(opts),
getActiveRequests: () => this.metrics.getActiveRequests(),
```
4. **Test the wiring:**
- In the existing `src/gateway/server.test.ts` or `src/gateway/handlers/handlers.test.ts`, verify that sending a message through agent.send increments the metrics counters. If the existing test infrastructure doesn't easily support this, at minimum verify through the type system that the wiring is correct.
Run `pnpm test:run` and `pnpm typecheck`.
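The request-flow instrumentation from step 2 could be sketched as a small wrapper. `sendWithMetrics` and the `RequestMetrics` interface are hypothetical names used only for illustration, and `process` stands in for the `agent.process()` call. The sketch places `incrementMessages()` in `finally` so that the counter covers both success and error, following the Task 1 definition of `messagesProcessed`; that placement is an assumption of this sketch.

```typescript
// Sketch only: RequestMetrics mirrors the MetricsCollector methods used here.
interface RequestMetrics {
  startRequest(id: string, info: { sessionId: string; channel: string }): void;
  endRequest(id: string): void;
  incrementMessages(): void;
  incrementErrors(): void;
  recordEvent(event: {
    timestamp: number;
    level: 'info' | 'warn' | 'error';
    source: string;
    message: string;
    context?: Record<string, unknown>;
  }): void;
}

// Hypothetical helper: runs one agent.send and records metrics around it.
async function sendWithMetrics<T>(
  metrics: RequestMetrics | undefined,
  requestId: string,
  laneId: string,
  process: () => Promise<T>,
): Promise<T> {
  metrics?.startRequest(requestId, { sessionId: laneId, channel: 'ws' });
  try {
    return await process();
  } catch (err) {
    metrics?.incrementErrors();
    metrics?.recordEvent({
      timestamp: Date.now(),
      level: 'error',
      source: 'agent.send',
      message: err instanceof Error ? err.message : 'Unknown error',
      context: { sessionId: laneId },
    });
    throw err; // rethrow so the RPC layer still reports the failure
  } finally {
    metrics?.incrementMessages(); // runs on success and on error
    metrics?.endRequest(requestId); // never leave a request marked active
  }
}
```

Because every call goes through optional chaining, the handler behaves the same when no collector is supplied, which keeps existing tests that construct the handler without metrics working.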
</action>
<verify>
`pnpm test:run` — all tests pass (1077 existing + new metrics tests).
`pnpm typecheck` — no type errors.
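`grep -nF "metrics?." src/gateway/handlers/agent.ts` — confirms metrics calls in agent handler. The calls use optional chaining (`deps.metrics?.`), so the pattern `metrics\.` would match only the `../metrics.js` import line.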
</verify>
<done>
Agent request flow records: messagesProcessed counter, error counter, active request tracking, and error events.
Tool failures are recorded as error events.
System handlers return live metrics data from MetricsCollector.
All tests pass, no type errors.
</done>
</task>
</tasks>
<verification>
1. `pnpm test:run` — all 1077+ tests pass
2. `pnpm typecheck` — zero type errors
3. New system.metrics, system.events, system.activeRequests RPC methods registered (check via getMethods())
4. GET /health returns valid JSON with status, uptime, version fields
5. MetricsCollector ring buffers enforce size limits
</verification>
<success_criteria>
- MetricsCollector exists with counters, model call buffer, event buffer, active request tracking
- Three new RPC handlers return metrics data
- GET /health endpoint returns unauthenticated JSON health status
- Agent request flow records messagesProcessed, errors, active requests, and error events
- Zero test regressions, all new tests pass
</success_criteria>
<output>
After completion, create `.planning/phases/03-live-ops-dashboard/03-01-SUMMARY.md`
</output>