| phase | plan | type | wave | depends_on | files_modified | autonomous | must_haves |
|---|---|---|---|---|---|---|---|
| 03-live-ops-dashboard | 01 | execute | 1 | | | true | |
Purpose: Provide the data layer that the dashboard UI (Plan 02) will consume. Without collected metrics, the dashboard has nothing to show beyond what system.health already provides.
Output: MetricsCollector class, 3 new RPC methods (system.metrics, system.events, system.activeRequests), HTTP GET /health endpoint, tests.
<execution_context> @/home/will/.config/opencode/get-shit-done/workflows/execute-plan.md @/home/will/.config/opencode/get-shit-done/templates/summary.md </execution_context>
@.planning/PROJECT.md @.planning/ROADMAP.md @.planning/STATE.md @src/gateway/server.ts @src/gateway/handlers/system.ts @src/gateway/handlers/index.ts @src/gateway/protocol.ts @src/gateway/router.ts @src/gateway/handlers/agent.ts @src/gateway/session-bridge.ts @src/daemon/services.ts @src/daemon/index.ts

Task 1: Create MetricsCollector and wire into gateway

Files: src/gateway/metrics.ts, src/gateway/metrics.test.ts, src/gateway/server.ts, src/gateway/handlers/system.ts, src/gateway/handlers/index.ts, src/daemon/services.ts

Create `src/gateway/metrics.ts` with a `MetricsCollector` class that tracks:

Counters (simple incrementing numbers):
- `messagesProcessed`: incremented each time an agent.send completes (success or error)
- `errors`: incremented on agent.send errors and any other recorded errors
- `activeRequests`: gauge (increment on start, decrement on end)
Model call metrics (ring buffer of recent calls, max 200 entries):
- Each entry: `{ timestamp: number, provider: string, latency: number, inputTokens: number, outputTokens: number, tokensPerSec: number, error?: string }`
- `recordModelCall(entry)`: push to ring buffer
- `getModelMetrics()`: return the array
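The bounded buffer behind `recordModelCall` and `getModelMetrics` could be sketched as below. `RingBuffer` is a hypothetical helper name, not something the plan prescribes, and the array-with-`shift()` approach is one assumption among several workable ones for a buffer this small.

```typescript
// Hypothetical helper: the plan only asks for "ring buffer, max 200 entries".
class RingBuffer<T> {
  private readonly items: T[] = [];
  private readonly capacity: number;

  constructor(capacity: number) {
    if (!Number.isInteger(capacity) || capacity <= 0) {
      throw new RangeError("capacity must be a positive integer");
    }
    this.capacity = capacity;
  }

  /** Append an entry, evicting the oldest one once the buffer is full. */
  push(item: T): void {
    if (this.items.length === this.capacity) {
      this.items.shift(); // FIFO eviction; O(n), acceptable for a few hundred entries
    }
    this.items.push(item);
  }

  /** Return a copy (oldest first) so callers cannot mutate internal state. */
  toArray(): T[] {
    return [...this.items];
  }
}

// Entry shape taken from the plan above.
interface ModelCallEntry {
  timestamp: number;
  provider: string;
  latency: number;
  inputTokens: number;
  outputTokens: number;
  tokensPerSec: number;
  error?: string;
}

const modelCalls = new RingBuffer<ModelCallEntry>(200);
```

The same helper would serve the event stream with a capacity of 500.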
Event stream (ring buffer of recent events, max 500 entries):
- Each entry: `{ timestamp: number, level: 'info' | 'warn' | 'error', source: string, message: string, context?: Record<string, unknown> }`
- `recordEvent(event)`: push to ring buffer
- `getEvents(opts?: { level?: string, limit?: number })`: return filtered/limited array (newest first)
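The filtering behaviour of `getEvents` might look like the following sketch. It assumes events are stored oldest first and that `limit` is applied after the level filter; `filterEvents` is a hypothetical free function standing in for the method body.

```typescript
type EventLevel = "info" | "warn" | "error";

interface MetricsEvent {
  timestamp: number;
  level: EventLevel;
  source: string;
  message: string;
  context?: Record<string, unknown>;
}

// Assumption: `events` is ordered oldest first, as a push-only buffer would be.
function filterEvents(
  events: readonly MetricsEvent[],
  opts: { level?: string; limit?: number } = {},
): MetricsEvent[] {
  // Reverse a copy so the newest event comes first.
  let result = [...events].reverse();
  if (opts.level !== undefined) {
    result = result.filter((e) => e.level === opts.level);
  }
  if (opts.limit !== undefined) {
    // Clamp negative limits to zero instead of slicing from the end.
    result = result.slice(0, Math.max(0, opts.limit));
  }
  return result;
}
```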
Active request tracking:
- `startRequest(id: string, info: { sessionId: string, channel: string })`: records start time + info
- `endRequest(id: string)`: removes from active map
- `getActiveRequests()`: returns array of `{ id, sessionId, channel, startedAt, durationMs }`
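A minimal sketch of the active-request map follows. The class name `ActiveRequestTracker` and the injectable `now` clock are assumptions added to make the duration calculation easy to test; the plan places these methods directly on `MetricsCollector`.

```typescript
interface ActiveRequestInfo {
  sessionId: string;
  channel: string;
}

interface ActiveRequest extends ActiveRequestInfo {
  id: string;
  startedAt: number;
  durationMs: number;
}

// Hypothetical standalone class; in the plan these are MetricsCollector methods.
class ActiveRequestTracker {
  private readonly active = new Map<string, ActiveRequestInfo & { startedAt: number }>();
  private readonly now: () => number;

  // Assumption: the clock is injectable so tests can control elapsed time.
  constructor(now: () => number = () => Date.now()) {
    this.now = now;
  }

  startRequest(id: string, info: ActiveRequestInfo): void {
    this.active.set(id, { ...info, startedAt: this.now() });
  }

  endRequest(id: string): void {
    this.active.delete(id); // no-op when the id is unknown
  }

  getActiveRequests(): ActiveRequest[] {
    const current = this.now();
    return [...this.active.entries()].map(([id, r]) => ({
      id,
      sessionId: r.sessionId,
      channel: r.channel,
      startedAt: r.startedAt,
      durationMs: current - r.startedAt, // computed at read time, not stored
    }));
  }

  /** Size of the map; could back the activeRequests gauge. */
  get count(): number {
    return this.active.size;
  }
}
```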
Snapshot method:
- `getSnapshot()`: returns `{ messagesProcessed, errors, activeRequests: number, uptime: number, modelCalls: { total, avgLatency, errorRate, recentCalls }, queueDepth: number }`
- Accept a `getQueueDepth` callback in constructor for LaneQueue integration
The class should be simple, synchronous (no async), and have NO external dependencies beyond Node.js builtins. Export the class and all relevant types.
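The `modelCalls` part of the snapshot needs a small aggregation step, sketched below. The plan does not define the aggregates precisely, so this assumes `total` counts only buffered calls, `avgLatency` is an arithmetic mean in the same unit as `latency`, and `errorRate` is a fraction between 0 and 1. `summarizeModelCalls` is a hypothetical helper name.

```typescript
interface ModelCallEntry {
  timestamp: number;
  provider: string;
  latency: number;
  inputTokens: number;
  outputTokens: number;
  tokensPerSec: number;
  error?: string;
}

interface ModelCallSummary {
  total: number;
  avgLatency: number;
  errorRate: number;
  recentCalls: ModelCallEntry[];
}

// Assumption: aggregates cover only the calls currently held in the buffer.
function summarizeModelCalls(calls: readonly ModelCallEntry[]): ModelCallSummary {
  const total = calls.length;
  if (total === 0) {
    // Avoid dividing by zero when no calls have been recorded yet.
    return { total: 0, avgLatency: 0, errorRate: 0, recentCalls: [] };
  }
  const latencySum = calls.reduce((sum, c) => sum + c.latency, 0);
  const errorCount = calls.filter((c) => c.error !== undefined).length;
  return {
    total,
    avgLatency: latencySum / total,
    errorRate: errorCount / total,
    recentCalls: [...calls],
  };
}
```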
Wire MetricsCollector into the gateway:
- In `src/gateway/server.ts`:
  - Add `metrics?: MetricsCollector` to the `GatewayServerConfig` interface.
  - Store the metrics instance on the GatewayServer class.
  - In `handleHttpRequest`, add a handler for `GET /health` BEFORE the auth check (health endpoint should be unauthenticated for Docker HEALTHCHECK). Return JSON: `{ status: 'ok', uptime: <seconds>, version: <string>, sessions: <count>, connections: <count>, tools: <count>, channels: <channelList> }`. Use the same data sources as the `system.health` RPC handler. Set `Content-Type: application/json`.
  - In the agent.send flow: the GatewayServer doesn't handle agent.send directly (it's in the handler), so instead expose a `getMetrics()` accessor on GatewayServer so handlers can access it.
- In `src/gateway/handlers/system.ts`:
  - Add `getMetrics?: () => { messagesProcessed: number, errors: number, activeRequests: number, uptime: number, modelCalls: { total: number, avgLatency: number, errorRate: number, recentCalls: unknown[] }, queueDepth: number }` to `SystemHandlerDeps`.
  - Add `getEvents?: (opts?: { level?: string, limit?: number }) => unknown[]` and `getActiveRequests?: () => unknown[]` to `SystemHandlerDeps`. The `opts` parameter is needed so the `level` and `limit` params can reach `MetricsCollector.getEvents()`.
  - Add `system.metrics` handler: returns the `getMetrics()` snapshot.
  - Add `system.events` handler: returns `getEvents()` with optional `level` and `limit` params.
  - Add `system.activeRequests` handler: returns the `getActiveRequests()` array.
  - Update the re-exports in `src/gateway/handlers/index.ts` if any new types need exporting.
- In `src/gateway/handlers/agent.ts` (or via a wrapper in server.ts):
  - The metrics recording for agent.send happens naturally. In `src/gateway/server.ts`, when registering handlers, wrap the system handlers construction to pass the metrics callbacks. The MetricsCollector is NOT directly imported by the agent handler; instead, the GatewayServer passes metrics callbacks via `SystemHandlerDeps`. For request tracking in agent.send, add `onRequestStart` and `onRequestEnd` callbacks to `AgentHandlerDeps` so the server can hook MetricsCollector in.
- In `src/daemon/services.ts`:
  - In `createGateway()`, instantiate `new MetricsCollector({ getQueueDepth: () => 0 })` (queue depth from LaneQueue is internal to GatewayServer; we'll wire it there).
  - Pass it to the GatewayServer config as `metrics`.
  - Actually, better approach: let GatewayServer create the MetricsCollector itself in its constructor using its own LaneQueue. This keeps it self-contained. In `GatewayServerConfig`, just add `metricsEnabled?: boolean` (default true). The GatewayServer constructor creates `this.metrics = new MetricsCollector({ getQueueDepth: () => this.laneQueue.totalPending() })`.
  - Add a `totalPending()` method to LaneQueue that sums all queue lengths across lanes.
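The unauthenticated `GET /health` step from the `src/gateway/server.ts` item above could be sketched as a pure function that `handleHttpRequest` consults before its auth check. `tryHandleHealth`, `HealthSources`, and `HttpReply` are hypothetical names; the real handler would write to the Node.js response object and read from the same sources as `system.health`.

```typescript
// Hypothetical accessors standing in for the data sources behind system.health.
interface HealthSources {
  uptimeSeconds: () => number;
  version: string;
  sessionCount: () => number;
  connectionCount: () => number;
  toolCount: () => number;
  channelList: () => string[];
}

interface HttpReply {
  statusCode: number;
  headers: Record<string, string>;
  body: string;
}

// Returns a reply for GET /health, or null so the caller falls through to auth.
function tryHandleHealth(
  method: string | undefined,
  url: string | undefined,
  sources: HealthSources,
): HttpReply | null {
  const path = (url ?? "").split("?")[0]; // ignore any query string
  if (method !== "GET" || path !== "/health") {
    return null;
  }
  const payload = {
    status: "ok",
    uptime: sources.uptimeSeconds(),
    version: sources.version,
    sessions: sources.sessionCount(),
    connections: sources.connectionCount(),
    tools: sources.toolCount(),
    channels: sources.channelList(),
  };
  return {
    statusCode: 200,
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(payload),
  };
}
```

Keeping the check ahead of authentication is what lets a Docker HEALTHCHECK probe the endpoint without credentials.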
Wait — simpler approach that avoids changing too many files:
- Create MetricsCollector in GatewayServer constructor (it already has LaneQueue). No config change needed in services.ts.
- GatewayServer passes metrics callbacks to system handler deps and agent handler deps.
- This keeps the metrics concern entirely within the gateway module.
Use this simpler approach. Changes to src/daemon/services.ts are minimal or unnecessary — just ensure the GatewayServer starts collecting metrics automatically.
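With the collector owned by the gateway, the system handlers only need the three accessor callbacks. A sketch under assumptions: the handler signature `RpcHandler` and the factory name `createSystemMetricsHandlers` are hypothetical, since the real registration API in `src/gateway/handlers/system.ts` is not shown here, and the fallback values for missing deps are a design guess.

```typescript
interface MetricsSnapshot {
  messagesProcessed: number;
  errors: number;
  activeRequests: number;
  uptime: number;
  modelCalls: { total: number; avgLatency: number; errorRate: number; recentCalls: unknown[] };
  queueDepth: number;
}

interface EventQuery {
  level?: string;
  limit?: number;
}

interface SystemHandlerDeps {
  getMetrics?: () => MetricsSnapshot;
  getEvents?: (opts?: EventQuery) => unknown[];
  getActiveRequests?: () => unknown[];
}

// Hypothetical handler shape; the gateway's real RPC signature may differ.
type RpcHandler = (params?: Record<string, unknown>) => unknown;

function createSystemMetricsHandlers(deps: SystemHandlerDeps): Record<string, RpcHandler> {
  return {
    // Assumption: respond with null / empty arrays when metrics are not wired.
    "system.metrics": () => deps.getMetrics?.() ?? null,
    "system.events": (params) => {
      // Only forward params that have the expected type.
      const opts: EventQuery = {};
      const level = params?.["level"];
      const limit = params?.["limit"];
      if (typeof level === "string") opts.level = level;
      if (typeof limit === "number") opts.limit = limit;
      return deps.getEvents?.(opts) ?? [];
    },
    "system.activeRequests": () => deps.getActiveRequests?.() ?? [],
  };
}
```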
Update LaneQueue (src/gateway/lane-queue.ts):
- Add a `totalPending(): number` method that returns the sum of all lane queue lengths (iterate over lanes, sum `queue.length`).
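Because the internals of `src/gateway/lane-queue.ts` are not shown in this plan, the following is only a stand-in that models lanes as a map of arrays. `LaneQueueSketch` and `enqueue` are hypothetical; only `totalPending()` and the `getQueueDepth` callback name come from the plan.

```typescript
// Stand-in for the real LaneQueue: only the shape needed for totalPending() is modeled.
class LaneQueueSketch<T> {
  private readonly lanes = new Map<string, { queue: T[] }>();

  // Hypothetical enqueue, present only so the sketch has something to count.
  enqueue(laneId: string, item: T): void {
    let lane = this.lanes.get(laneId);
    if (lane === undefined) {
      lane = { queue: [] };
      this.lanes.set(laneId, lane);
    }
    lane.queue.push(item);
  }

  /** Sum of pending items across every lane. */
  totalPending(): number {
    let total = 0;
    for (const lane of this.lanes.values()) {
      total += lane.queue.length;
    }
    return total;
  }
}

// The collector would receive the depth lazily through a callback like this one.
const laneQueue = new LaneQueueSketch<string>();
const getQueueDepth = (): number => laneQueue.totalPending();
```

Passing a callback rather than a number means each snapshot reads the queue depth at the moment it is taken.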
Tests in src/gateway/metrics.test.ts:
- Test counter increment/decrement
- Test model call ring buffer (max 200, FIFO eviction)
- Test event ring buffer (max 500, FIFO eviction, filtering by level)
- Test active request tracking (start, end, duration calculation)
- Test getSnapshot returns correct shape
- Test getEvents with level filter and limit
Run pnpm test:run to verify zero regressions plus new tests pass.
Run pnpm typecheck to verify no type errors.
pnpm test:run — all existing 1077 tests pass plus new metrics tests pass.
pnpm typecheck — no type errors.
grep -r "system.metrics\|system.events\|system.activeRequests" src/gateway/handlers/system.ts — confirms new RPC methods exist.
grep -r "GET.*health\|/health" src/gateway/server.ts — confirms HTTP health endpoint exists.
MetricsCollector created with counters, model call ring buffer, event ring buffer, and active request tracking.
Three new RPC handlers registered (system.metrics, system.events, system.activeRequests).
GET /health returns unauthenticated JSON health status.
LaneQueue has totalPending() method.
All tests pass with zero regressions.
- In `src/gateway/server.ts` `registerHandlers()`:
  - When creating agent handlers, pass `onRequestStart` and `onRequestEnd` callbacks that call `this.metrics.startRequest()` and `this.metrics.endRequest()` respectively.
  - When creating agent handlers, pass an `onRequestComplete` callback that calls `this.metrics.incrementMessages()` and optionally `this.metrics.incrementErrors()` on error.
  - Pass an `onModelCall` callback that the agent handler can call with latency/token data.
  - Actually, the simpler pattern: pass the MetricsCollector instance directly to agent handler deps: `metrics?: MetricsCollector`. The agent handler can then call the methods directly. This is cleaner than a bag of callbacks.
- In `src/gateway/handlers/agent.ts`:
  - Add `metrics?: MetricsCollector` to `AgentHandlerDeps` (import type from `../metrics.js`).
  - In the `agent.send` handler:
    - At start: `const requestId = request.id.toString(); deps.metrics?.startRequest(requestId, { sessionId: laneId, channel: 'ws' });`
    - In the `try` block after `agent.process()` resolves: `deps.metrics?.incrementMessages();`
    - In the `catch` block: `deps.metrics?.incrementErrors(); deps.metrics?.recordEvent({ timestamp: Date.now(), level: 'error', source: 'agent.send', message: err.message || 'Unknown error', context: { sessionId: laneId } });`
    - In `finally`: `deps.metrics?.endRequest(requestId);`
  - For tool use events, record them to metrics: when `event.type === 'end'` and `event.result` and `!event.result.success`, increment the error counter and record an error event.
- In `src/gateway/server.ts` `registerHandlers()`:
  - Pass `metrics: this.metrics` when constructing agent handler deps.
  - Update the system handlers construction to pass the metrics accessors: `getMetrics: () => this.metrics.getSnapshot(), getEvents: (opts) => this.metrics.getEvents(opts), getActiveRequests: () => this.metrics.getActiveRequests(),`
- Test the wiring:
  - In the existing `src/gateway/server.test.ts` or `src/gateway/handlers/handlers.test.ts`, verify that sending a message through agent.send increments the metrics counters. If the existing test infrastructure doesn't easily support this, at minimum verify through the type system that the wiring is correct.
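The `try`/`catch`/`finally` placement described for the `agent.send` handler can be sketched as a wrapper. `instrumentedSend` and the `AgentMetrics` interface are hypothetical; in the plan the calls live inline in the handler and go through `deps.metrics`. This follows the steps as written, so `incrementMessages()` runs only on success; counting failed sends as processed too would mean moving that call into `finally`.

```typescript
// Hypothetical subset of MetricsCollector that the agent handler relies on.
interface AgentMetrics {
  startRequest(id: string, info: { sessionId: string; channel: string }): void;
  endRequest(id: string): void;
  incrementMessages(): void;
  incrementErrors(): void;
  recordEvent(event: {
    timestamp: number;
    level: "info" | "warn" | "error";
    source: string;
    message: string;
    context?: Record<string, unknown>;
  }): void;
}

// `run` stands in for the agent.process() call inside the real handler.
async function instrumentedSend<T>(
  metrics: AgentMetrics | undefined,
  requestId: string,
  laneId: string,
  run: () => Promise<T>,
): Promise<T> {
  metrics?.startRequest(requestId, { sessionId: laneId, channel: "ws" });
  try {
    const result = await run();
    metrics?.incrementMessages();
    return result;
  } catch (err) {
    metrics?.incrementErrors();
    metrics?.recordEvent({
      timestamp: Date.now(),
      level: "error",
      source: "agent.send",
      // `err` is unknown in strict TypeScript, so narrow before reading .message
      message: err instanceof Error && err.message ? err.message : "Unknown error",
      context: { sessionId: laneId },
    });
    throw err; // metrics must not swallow the failure
  } finally {
    // Runs on both paths, so the active-request map cannot leak entries.
    metrics?.endRequest(requestId);
  }
}
```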
Run pnpm test:run and pnpm typecheck.
pnpm test:run — all tests pass (1077 existing + new metrics tests).
pnpm typecheck — no type errors.
grep -r "metrics\." src/gateway/handlers/agent.ts — confirms metrics calls in agent handler.
Agent request flow records: messagesProcessed counter, error counter, active request tracking, and error events.
Tool failures are recorded as error events.
System handlers return live metrics data from MetricsCollector.
All tests pass, no type errors.
<success_criteria>
- MetricsCollector exists with counters, model call buffer, event buffer, active request tracking
- Three new RPC handlers return metrics data
- GET /health endpoint returns unauthenticated JSON health status
- Agent request flow records messagesProcessed, errors, active requests, and error events
- Zero test regressions, all new tests pass </success_criteria>