docs(03): create phase plan for live ops dashboard
@@ -0,0 +1,255 @@
---
phase: 03-live-ops-dashboard
plan: 01
type: execute
wave: 1
depends_on: []
files_modified:
  - src/gateway/metrics.ts
  - src/gateway/metrics.test.ts
  - src/gateway/handlers/system.ts
  - src/gateway/server.ts
  - src/daemon/services.ts
autonomous: true

must_haves:
  truths:
    - "MetricsCollector accumulates counters (messages processed, errors) and model call metrics (latency, tokens/sec, provider)"
    - "Gateway exposes system.metrics and system.events RPC methods returning accumulated data"
    - "GET /health returns JSON with daemon status, uptime, and component readiness without WebSocket"
    - "Errors and significant events are captured in a ring buffer accessible via RPC"
    - "Active agent requests are tracked (in-flight count, tool executions, session IDs)"
  artifacts:
    - path: "src/gateway/metrics.ts"
      provides: "MetricsCollector class: single source of truth for all ops metrics"
      exports: ["MetricsCollector"]
    - path: "src/gateway/metrics.test.ts"
      provides: "Tests for MetricsCollector"
      contains: "describe.*MetricsCollector"
    - path: "src/gateway/handlers/system.ts"
      provides: "New system.metrics, system.events, system.activeRequests RPC handlers"
      contains: "system.metrics"
    - path: "src/gateway/server.ts"
      provides: "HTTP /health endpoint and MetricsCollector wiring"
      contains: "/health"
    - path: "src/daemon/services.ts"
      provides: "MetricsCollector creation and wiring into gateway"
      contains: "MetricsCollector"
  key_links:
    - from: "src/gateway/server.ts"
      to: "src/gateway/metrics.ts"
      via: "GatewayServer holds MetricsCollector ref, passes to handlers"
      pattern: "metrics.*MetricsCollector"
    - from: "src/gateway/handlers/system.ts"
      to: "src/gateway/metrics.ts"
      via: "System handlers read from MetricsCollector"
      pattern: "getMetrics|getEvents|getActiveRequests"
    - from: "src/daemon/services.ts"
      to: "src/gateway/metrics.ts"
      via: "createGateway instantiates MetricsCollector"
      pattern: "new MetricsCollector"
---

<objective>
Create the metrics collection backend and wire it into the gateway server with new RPC handlers and an HTTP /health endpoint.

Purpose: Provide the data layer that the dashboard UI (Plan 02) will consume. Without collected metrics, the dashboard has nothing to show beyond what system.health already provides.

Output: MetricsCollector class, 3 new RPC methods (system.metrics, system.events, system.activeRequests), HTTP GET /health endpoint, tests.
</objective>

<execution_context>
@/home/will/.config/opencode/get-shit-done/workflows/execute-plan.md
@/home/will/.config/opencode/get-shit-done/templates/summary.md
</execution_context>

<context>
@.planning/PROJECT.md
@.planning/ROADMAP.md
@.planning/STATE.md
@src/gateway/server.ts
@src/gateway/handlers/system.ts
@src/gateway/handlers/index.ts
@src/gateway/protocol.ts
@src/gateway/router.ts
@src/gateway/handlers/agent.ts
@src/gateway/session-bridge.ts
@src/daemon/services.ts
@src/daemon/index.ts
</context>

<tasks>

<task type="auto">
<name>Task 1: Create MetricsCollector and wire into gateway</name>
<files>
src/gateway/metrics.ts
src/gateway/metrics.test.ts
src/gateway/server.ts
src/gateway/handlers/system.ts
src/gateway/handlers/index.ts
src/gateway/lane-queue.ts
src/daemon/services.ts
</files>
<action>
Create `src/gateway/metrics.ts` with a `MetricsCollector` class that tracks:

**Counters (simple incrementing numbers):**
- `messagesProcessed`: incremented each time an agent.send completes (success or error)
- `errors`: incremented on agent.send errors and any other recorded errors
- `activeRequests`: gauge (increment on start, decrement on end)

**Model call metrics (ring buffer of recent calls, max 200 entries):**
Each entry: `{ timestamp: number, provider: string, latency: number, inputTokens: number, outputTokens: number, tokensPerSec: number, error?: string }`
- `recordModelCall(entry)`: push to the ring buffer, evicting the oldest entry once 200 is exceeded
- `getModelMetrics()`: return the buffered entries as an array

**Event stream (ring buffer of recent events, max 500 entries):**
Each entry: `{ timestamp: number, level: 'info' | 'warn' | 'error', source: string, message: string, context?: Record<string, unknown> }`
- `recordEvent(event)`: push to the ring buffer, evicting the oldest entry once 500 is exceeded
- `getEvents(opts?: { level?: string, limit?: number })`: return the events filtered by `level` and capped at `limit`, newest first
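
Both buffers described above rely on the same bounded, oldest-first eviction. The following is a minimal sketch of that behaviour, not the final implementation: `RingBuffer`, `MetricsEvent`, and `queryEvents` are hypothetical names introduced here for illustration, and the plan does not prescribe how the buffer is stored internally.

```typescript
// Hypothetical helper: the plan only requires a bounded buffer with FIFO eviction.
class RingBuffer<T> {
  private items: T[] = [];
  private readonly capacity: number;

  constructor(capacity: number) {
    if (!Number.isInteger(capacity) || capacity <= 0) {
      throw new RangeError("capacity must be a positive integer");
    }
    this.capacity = capacity;
  }

  // Append an entry, evicting the oldest one once the buffer is over capacity.
  push(item: T): void {
    this.items.push(item);
    if (this.items.length > this.capacity) {
      this.items.shift();
    }
  }

  // Return a copy in insertion order (oldest first).
  toArray(): T[] {
    return this.items.slice();
  }
}

// Entry shape taken from the event stream description in this plan.
interface MetricsEvent {
  timestamp: number;
  level: "info" | "warn" | "error";
  source: string;
  message: string;
  context?: Record<string, unknown>;
}

// Newest first, optionally filtered by level and capped at `limit`.
function queryEvents(
  buffer: RingBuffer<MetricsEvent>,
  opts: { level?: string; limit?: number } = {},
): MetricsEvent[] {
  let result = buffer.toArray().reverse();
  if (opts.level !== undefined) {
    result = result.filter((e) => e.level === opts.level);
  }
  if (opts.limit !== undefined) {
    result = result.slice(0, Math.max(0, opts.limit));
  }
  return result;
}
```

`Array.prototype.shift()` is linear in the buffer length, which should be acceptable at 200 and 500 entries; a head-index implementation would only matter for much larger buffers.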

**Active request tracking:**
- `startRequest(id: string, info: { sessionId: string, channel: string })`: records start time + info
- `endRequest(id: string)`: removes from active map
- `getActiveRequests()`: returns array of `{ id, sessionId, channel, startedAt, durationMs }`
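
The tracking methods above can be sketched as a small map keyed by request id. This is an illustrative sketch only: `ActiveRequestTracker` is a hypothetical stand-in for the relevant part of `MetricsCollector`, and the injectable `now` clock is an assumption added to make duration calculation easy to test.

```typescript
interface ActiveRequestInfo {
  sessionId: string;
  channel: string;
}

interface ActiveRequestView extends ActiveRequestInfo {
  id: string;
  startedAt: number;
  durationMs: number;
}

// Hypothetical stand-in for the active-request part of MetricsCollector.
class ActiveRequestTracker {
  private readonly active = new Map<string, ActiveRequestInfo & { startedAt: number }>();
  private readonly now: () => number;

  // Assumption: the clock is injectable; it defaults to Date.now.
  constructor(now: () => number = Date.now) {
    this.now = now;
  }

  // Record the start time together with the session and channel.
  startRequest(id: string, info: ActiveRequestInfo): void {
    this.active.set(id, { sessionId: info.sessionId, channel: info.channel, startedAt: this.now() });
  }

  // Remove the request; unknown ids are ignored.
  endRequest(id: string): void {
    this.active.delete(id);
  }

  // Durations are computed at read time, so no timers are needed.
  getActiveRequests(): ActiveRequestView[] {
    const now = this.now();
    return Array.from(this.active.entries()).map(([id, r]) => ({
      id,
      sessionId: r.sessionId,
      channel: r.channel,
      startedAt: r.startedAt,
      durationMs: now - r.startedAt,
    }));
  }

  // In-flight count, usable as the activeRequests gauge.
  get count(): number {
    return this.active.size;
  }
}
```

Deriving the gauge from the map size, rather than keeping a separate counter, means a missed `endRequest` cannot leave the two values out of step.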

**Snapshot method:**
- `getSnapshot()`: returns `{ messagesProcessed, errors, activeRequests: number, uptime: number, modelCalls: { total, avgLatency, errorRate, recentCalls }, queueDepth: number }`
- Accept a `getQueueDepth` callback in constructor for LaneQueue integration
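
The plan does not say how the `modelCalls` aggregates are derived. One possible reading, sketched below, computes them over the entries currently held in the ring buffer. The function name `summarizeModelCalls`, the choice of `total` as the buffered count, and the zero values for an empty buffer are all assumptions.

```typescript
// Entry shape taken from the model call metrics description in this plan.
interface ModelCallEntry {
  timestamp: number;
  provider: string;
  latency: number;
  inputTokens: number;
  outputTokens: number;
  tokensPerSec: number;
  error?: string;
}

interface ModelCallSummary {
  total: number;
  avgLatency: number;
  errorRate: number;
  recentCalls: ModelCallEntry[];
}

// Assumption: aggregates cover only the buffered calls, not lifetime totals.
function summarizeModelCalls(calls: ModelCallEntry[]): ModelCallSummary {
  const total = calls.length;
  if (total === 0) {
    // Avoid dividing by zero before any call has been recorded.
    return { total: 0, avgLatency: 0, errorRate: 0, recentCalls: [] };
  }
  const latencySum = calls.reduce((sum, c) => sum + c.latency, 0);
  const failed = calls.filter((c) => c.error !== undefined).length;
  return {
    total,
    avgLatency: latencySum / total,
    errorRate: failed / total, // fraction in the range 0..1
    recentCalls: calls.slice(),
  };
}
```

If lifetime totals are wanted instead, a separate counter would be needed, because the buffer forgets anything older than its last 200 entries.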

The class should be simple, synchronous (no async), and have NO external dependencies beyond Node.js builtins. Export the class and all relevant types.

**Wire MetricsCollector into the gateway:**

1. In `src/gateway/server.ts`:
   - Add `metrics?: MetricsCollector` to the `GatewayServerConfig` interface
   - Store the metrics instance on the GatewayServer class
   - In `handleHttpRequest`, add a handler for `GET /health` BEFORE the auth check (the health endpoint should be unauthenticated so that Docker HEALTHCHECK can call it). Return JSON: `{ status: 'ok', uptime: <seconds>, version: <string>, sessions: <count>, connections: <count>, tools: <count>, channels: <channelList> }`. Use the same data sources as the `system.health` RPC handler. Set `Content-Type: application/json`.
   - GatewayServer does not handle agent.send itself (the handler does), so expose a `getMetrics()` accessor on GatewayServer that handlers can use.
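
The `GET /health` step could look roughly like the sketch below. It is written against minimal structural interfaces rather than Node's `http` types so that it stays self-contained. `HealthSources`, `buildHealthPayload`, and `tryHandleHealth` are hypothetical names, and the real data sources are whatever `system.health` already uses.

```typescript
// Minimal structural stand-ins for Node's http.IncomingMessage and http.ServerResponse.
interface RequestLike {
  method?: string;
  url?: string;
}

interface ResponseLike {
  writeHead(status: number, headers: Record<string, string>): void;
  end(body: string): void;
}

// Hypothetical data sources; the plan says to reuse those behind system.health.
interface HealthSources {
  startedAt: number; // epoch milliseconds at server start
  version: string;
  sessionCount(): number;
  connectionCount(): number;
  toolCount(): number;
  channels(): string[];
  now?: () => number; // injectable clock, an assumption for testability
}

function buildHealthPayload(src: HealthSources) {
  const now = (src.now ?? Date.now)();
  return {
    status: "ok",
    uptime: Math.floor((now - src.startedAt) / 1000), // whole seconds
    version: src.version,
    sessions: src.sessionCount(),
    connections: src.connectionCount(),
    tools: src.toolCount(),
    channels: src.channels(),
  };
}

// Returns true when the request was handled, so the caller can return
// before reaching the auth check and the normal routing.
function tryHandleHealth(req: RequestLike, res: ResponseLike, src: HealthSources): boolean {
  const path = (req.url ?? "").split("?")[0];
  if (req.method !== "GET" || path !== "/health") {
    return false;
  }
  res.writeHead(200, { "Content-Type": "application/json" });
  res.end(JSON.stringify(buildHealthPayload(src)));
  return true;
}
```

Calling `tryHandleHealth` as the first statement of `handleHttpRequest` and returning when it yields `true` keeps the endpoint ahead of authentication, as the step requires.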

2. In `src/gateway/handlers/system.ts`:
   - Add `getMetrics?: () => { messagesProcessed: number, errors: number, activeRequests: number, uptime: number, modelCalls: { total: number, avgLatency: number, errorRate: number, recentCalls: unknown[] }, queueDepth: number }` to `SystemHandlerDeps`
   - Add `getEvents?: (opts?: { level?: string, limit?: number }) => unknown[]` and `getActiveRequests?: () => unknown[]` to `SystemHandlerDeps`
   - Add `system.metrics` handler: returns the `getMetrics()` snapshot
   - Add `system.events` handler: returns `getEvents()`, passing through the optional `level` and `limit` params
   - Add `system.activeRequests` handler: returns the `getActiveRequests()` array
   - Update the re-exports in `src/gateway/handlers/index.ts` if any new types need exporting
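
As a rough illustration of the three handlers listed above, the sketch below builds them from optional deps. The `Handler` signature, the `createMetricsHandlers` factory, and the fallback values returned when a dep is missing are assumptions; the project's actual handler registration API is not shown in this plan.

```typescript
interface EventQuery {
  level?: string;
  limit?: number;
}

// Subset of SystemHandlerDeps relevant to metrics, with loosely typed results.
interface SystemMetricsDeps {
  getMetrics?: () => unknown;
  getEvents?: (opts?: EventQuery) => unknown[];
  getActiveRequests?: () => unknown[];
}

// Assumed handler shape: params in, JSON-serializable result out.
type Handler = (params?: Record<string, unknown>) => unknown;

function createMetricsHandlers(deps: SystemMetricsDeps): Record<string, Handler> {
  return {
    "system.metrics": () => deps.getMetrics?.() ?? null,

    "system.events": (params = {}) => {
      // Forward only well-typed params; anything else is ignored.
      const opts: EventQuery = {};
      if (typeof params.level === "string") {
        opts.level = params.level;
      }
      if (typeof params.limit === "number" && params.limit >= 0) {
        opts.limit = Math.floor(params.limit);
      }
      return deps.getEvents?.(opts) ?? [];
    },

    "system.activeRequests": () => deps.getActiveRequests?.() ?? [],
  };
}
```

Keeping the deps optional lets existing tests that construct system handlers without metrics continue to work unchanged.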

3. In `src/gateway/handlers/agent.ts` (or via a wrapper in server.ts):
   - Metrics recording for agent.send is wired through handler deps; the agent handler does NOT construct or own a MetricsCollector. In `src/gateway/server.ts`, when registering handlers, build the system handlers so that the metrics callbacks are passed via `SystemHandlerDeps`. For request tracking in agent.send, add `onRequestStart` and `onRequestEnd` callbacks to `AgentHandlerDeps` so the server can hook MetricsCollector in. Task 2 settles the final shape of this hook.

4. In `src/daemon/services.ts`:
   - Initial option: in `createGateway()`, instantiate `new MetricsCollector({ getQueueDepth: () => 0 })` and pass it to the GatewayServer config as `metrics`. This leaves queue depth unwired, because the LaneQueue is internal to GatewayServer.
   - Alternative: let GatewayServer create the MetricsCollector itself in its constructor using its own LaneQueue, which keeps it self-contained. In `GatewayServerConfig`, just add `metricsEnabled?: boolean` (default true). The GatewayServer constructor creates `this.metrics = new MetricsCollector({ getQueueDepth: () => this.laneQueue.totalPending() })`.
   - Either way, add a `totalPending()` method to LaneQueue that sums all queue lengths across lanes.

Chosen approach (the simpler one, which avoids changing too many files):
- Create MetricsCollector in the GatewayServer constructor (it already has the LaneQueue). No config change is needed in services.ts.
- GatewayServer passes metrics callbacks to system handler deps and agent handler deps.
- This keeps the metrics concern entirely within the gateway module.

Changes to `src/daemon/services.ts` are minimal or unnecessary; just ensure the GatewayServer starts collecting metrics automatically.

**Update LaneQueue** (`src/gateway/lane-queue.ts`):
- Add a `totalPending(): number` method that returns the sum of all lane queue lengths (iterate over lanes, sum queue.length).
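
A minimal sketch of `totalPending()` and of the queue-depth callback wiring follows. The real LaneQueue internals are not shown in this plan, so `LaneQueueSketch` assumes a `Map` of per-lane arrays; `QueueDepthCollector` and `GatewaySketch` are likewise hypothetical, reduced stand-ins for `MetricsCollector` and `GatewayServer`.

```typescript
// Hypothetical stand-in for LaneQueue: assumes one array of pending items per lane.
class LaneQueueSketch<T> {
  private readonly lanes = new Map<string, T[]>();

  enqueue(laneId: string, item: T): void {
    const queue = this.lanes.get(laneId);
    if (queue) {
      queue.push(item);
    } else {
      this.lanes.set(laneId, [item]);
    }
  }

  // Sum the pending items across every lane.
  totalPending(): number {
    let total = 0;
    this.lanes.forEach((queue) => {
      total += queue.length;
    });
    return total;
  }
}

// Reduced collector: only the queue-depth callback is modelled here.
class QueueDepthCollector {
  private readonly getQueueDepth: () => number;

  constructor(opts: { getQueueDepth: () => number }) {
    this.getQueueDepth = opts.getQueueDepth;
  }

  // The callback is invoked at snapshot time, so the value is always current.
  getSnapshot(): { queueDepth: number } {
    return { queueDepth: this.getQueueDepth() };
  }
}

// Mirrors the chosen approach: the server owns both the queue and the collector.
class GatewaySketch {
  readonly laneQueue = new LaneQueueSketch<string>();
  readonly metrics = new QueueDepthCollector({
    getQueueDepth: () => this.laneQueue.totalPending(),
  });
}
```

Passing a callback instead of the queue itself keeps the collector free of any dependency on LaneQueue, which fits the requirement that it has no dependencies beyond Node.js builtins.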

**Tests in `src/gateway/metrics.test.ts`:**
- Test counter increment/decrement
- Test model call ring buffer (max 200, FIFO eviction)
- Test event ring buffer (max 500, FIFO eviction, filtering by level)
- Test active request tracking (start, end, duration calculation)
- Test getSnapshot returns correct shape
- Test getEvents with level filter and limit

Run `pnpm test:run` to verify zero regressions plus new tests pass.
Run `pnpm typecheck` to verify no type errors.
</action>
<verify>
`pnpm test:run`: all 1077 existing tests pass, plus the new metrics tests.
`pnpm typecheck`: no type errors.
`grep -r "system.metrics\|system.events\|system.activeRequests" src/gateway/handlers/system.ts`: confirms the new RPC methods exist.
`grep -r "GET.*health\|/health" src/gateway/server.ts`: confirms the HTTP health endpoint exists.
</verify>
<done>
MetricsCollector created with counters, model call ring buffer, event ring buffer, and active request tracking.
Three new RPC handlers registered (system.metrics, system.events, system.activeRequests).
GET /health returns unauthenticated JSON health status.
LaneQueue has totalPending() method.
All tests pass with zero regressions.
</done>
</task>

<task type="auto">
<name>Task 2: Hook metrics recording into agent request flow</name>
<files>
src/gateway/server.ts
src/gateway/handlers/agent.ts
src/gateway/lane-queue.ts
</files>
<action>
Wire the MetricsCollector into the actual agent request flow so metrics are populated with real data as messages flow through the system.

1. **In `src/gateway/server.ts` registerHandlers():**
   - Callback option (considered, not used): pass `onRequestStart` and `onRequestEnd` callbacks that call `this.metrics.startRequest()` and `this.metrics.endRequest()` respectively, an `onRequestComplete` callback that calls `this.metrics.incrementMessages()` and, on error, `this.metrics.incrementErrors()`, and an `onModelCall` callback that the agent handler can call with latency/token data.
   - Preferred: pass the MetricsCollector instance directly in the agent handler deps as `metrics?: MetricsCollector`. The agent handler can then call the methods directly, which is cleaner than a bag of callbacks.

2. **In `src/gateway/handlers/agent.ts`:**
   - Add `metrics?: MetricsCollector` to `AgentHandlerDeps` (import type from `../metrics.js`)
   - In `agent.send` handler:
     - At start: `const requestId = request.id.toString(); deps.metrics?.startRequest(requestId, { sessionId: laneId, channel: 'ws' });`
     - In the `catch` block: `deps.metrics?.incrementErrors(); deps.metrics?.recordEvent({ timestamp: Date.now(), level: 'error', source: 'agent.send', message: err.message || 'Unknown error', context: { sessionId: laneId } });`
     - In `finally`: `deps.metrics?.incrementMessages(); deps.metrics?.endRequest(requestId);` so that `messagesProcessed` counts every completed send (success or error), as the counter is defined in Task 1
     - For tool use events, record failures to metrics: when `event.type === 'end'` and `event.result` and `!event.result.success`, increment the error counter and record an error event.
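
The try/catch/finally shape of the agent.send instrumentation can be sketched in isolation as below. `sendWithMetrics` and the `RequestMetrics` interface are hypothetical; `process` stands in for the `agent.process()` call. One assumption to note: `incrementMessages()` is placed in `finally` so that failed sends are counted too, matching the "success or error" definition of `messagesProcessed` in Task 1. Move it into the `try` block if only successful sends should count.

```typescript
// The subset of MetricsCollector that the agent handler would call.
interface RequestMetrics {
  startRequest(id: string, info: { sessionId: string; channel: string }): void;
  endRequest(id: string): void;
  incrementMessages(): void;
  incrementErrors(): void;
  recordEvent(event: {
    timestamp: number;
    level: "info" | "warn" | "error";
    source: string;
    message: string;
    context?: Record<string, unknown>;
  }): void;
}

// Hypothetical wrapper around the agent call; `metrics` is optional, as in the plan.
async function sendWithMetrics<T>(
  requestId: string,
  laneId: string,
  process: () => Promise<T>,
  metrics?: RequestMetrics,
): Promise<T> {
  metrics?.startRequest(requestId, { sessionId: laneId, channel: "ws" });
  try {
    return await process();
  } catch (err) {
    const message = err instanceof Error && err.message ? err.message : "Unknown error";
    metrics?.incrementErrors();
    metrics?.recordEvent({
      timestamp: Date.now(),
      level: "error",
      source: "agent.send",
      message,
      context: { sessionId: laneId },
    });
    throw err; // the caller still sees the original failure
  } finally {
    // Runs on both paths, so the active-request gauge never leaks.
    metrics?.incrementMessages();
    metrics?.endRequest(requestId);
  }
}
```

Because every call goes through optional chaining, the handler behaves exactly as before when no `metrics` instance is supplied.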

3. **In `src/gateway/server.ts` registerHandlers():**
   - Pass `metrics: this.metrics` when constructing agent handler deps.
   - Update the system handlers construction to pass the metrics accessors:
     ```
     getMetrics: () => this.metrics.getSnapshot(),
     getEvents: (opts) => this.metrics.getEvents(opts),
     getActiveRequests: () => this.metrics.getActiveRequests(),
     ```

4. **Test the wiring:**
   - In the existing `src/gateway/server.test.ts` or `src/gateway/handlers/handlers.test.ts`, verify that sending a message through agent.send increments the metrics counters. If the existing test infrastructure does not easily support this, at minimum rely on `pnpm typecheck` to confirm that the deps are wired with the correct types.

Run `pnpm test:run` and `pnpm typecheck`.
</action>
<verify>
`pnpm test:run`: all tests pass (1077 existing + new metrics tests).
`pnpm typecheck`: no type errors.
`grep -r "metrics\." src/gateway/handlers/agent.ts`: confirms metrics calls in agent handler.
</verify>
<done>
Agent request flow records: messagesProcessed counter, error counter, active request tracking, and error events.
Tool failures are recorded as error events.
System handlers return live metrics data from MetricsCollector.
All tests pass, no type errors.
</done>
</task>

</tasks>

<verification>
1. `pnpm test:run`: all 1077+ tests pass
2. `pnpm typecheck`: zero type errors
3. New system.metrics, system.events, system.activeRequests RPC methods registered (check via getMethods())
4. GET /health returns valid JSON with status, uptime, version fields
5. MetricsCollector ring buffers enforce size limits
</verification>

<success_criteria>
- MetricsCollector exists with counters, model call buffer, event buffer, active request tracking
- Three new RPC handlers return metrics data
- GET /health endpoint returns unauthenticated JSON health status
- Agent request flow records messagesProcessed, errors, active requests, and error events
- Zero test regressions, all new tests pass
</success_criteria>

<output>
After completion, create `.planning/phases/03-live-ops-dashboard/03-01-SUMMARY.md`
</output>