docs(03): create phase plan for live ops dashboard

William Valentin
2026-02-09 21:10:03 -08:00
parent fa4d6a057b
commit 94946eb7a8
3 changed files with 526 additions and 0 deletions
@@ -0,0 +1,255 @@
---
phase: 03-live-ops-dashboard
plan: 01
type: execute
wave: 1
depends_on: []
files_modified:
- src/gateway/metrics.ts
- src/gateway/metrics.test.ts
- src/gateway/handlers/system.ts
- src/gateway/handlers/index.ts
- src/gateway/handlers/agent.ts
- src/gateway/server.ts
- src/gateway/lane-queue.ts
- src/daemon/services.ts
autonomous: true
must_haves:
  truths:
    - "MetricsCollector accumulates counters (messages processed, errors) and model call metrics (latency, tokens/sec, provider)"
    - "Gateway exposes system.metrics and system.events RPC methods returning accumulated data"
    - "GET /health returns JSON with daemon status, uptime, and component readiness without WebSocket"
    - "Errors and significant events are captured in a ring buffer accessible via RPC"
    - "Active agent requests are tracked (in-flight count, tool executions, session IDs)"
  artifacts:
    - path: "src/gateway/metrics.ts"
      provides: "MetricsCollector class — single source of truth for all ops metrics"
      exports: ["MetricsCollector"]
    - path: "src/gateway/metrics.test.ts"
      provides: "Tests for MetricsCollector"
      contains: "describe.*MetricsCollector"
    - path: "src/gateway/handlers/system.ts"
      provides: "New system.metrics, system.events, system.activeRequests RPC handlers"
      contains: "system.metrics"
    - path: "src/gateway/server.ts"
      provides: "HTTP /health endpoint and MetricsCollector wiring"
      contains: "/health"
    - path: "src/daemon/services.ts"
      provides: "MetricsCollector creation and wiring into gateway"
      contains: "MetricsCollector"
  key_links:
    - from: "src/gateway/server.ts"
      to: "src/gateway/metrics.ts"
      via: "GatewayServer holds MetricsCollector ref, passes to handlers"
      pattern: "metrics.*MetricsCollector"
    - from: "src/gateway/handlers/system.ts"
      to: "src/gateway/metrics.ts"
      via: "System handlers read from MetricsCollector"
      pattern: "getMetrics|getEvents|getActiveRequests"
    - from: "src/daemon/services.ts"
      to: "src/gateway/metrics.ts"
      via: "createGateway instantiates MetricsCollector"
      pattern: "new MetricsCollector"
---
<objective>
Create the metrics collection backend and wire it into the gateway server with new RPC handlers and an HTTP /health endpoint.
Purpose: Provide the data layer that the dashboard UI (Plan 02) will consume. Without collected metrics, the dashboard has nothing to show beyond what system.health already provides.
Output: MetricsCollector class, 3 new RPC methods (system.metrics, system.events, system.activeRequests), HTTP GET /health endpoint, tests.
</objective>
<execution_context>
@/home/will/.config/opencode/get-shit-done/workflows/execute-plan.md
@/home/will/.config/opencode/get-shit-done/templates/summary.md
</execution_context>
<context>
@.planning/PROJECT.md
@.planning/ROADMAP.md
@.planning/STATE.md
@src/gateway/server.ts
@src/gateway/handlers/system.ts
@src/gateway/handlers/index.ts
@src/gateway/protocol.ts
@src/gateway/router.ts
@src/gateway/handlers/agent.ts
@src/gateway/session-bridge.ts
@src/daemon/services.ts
@src/daemon/index.ts
</context>
<tasks>
<task type="auto">
<name>Task 1: Create MetricsCollector and wire into gateway</name>
<files>
src/gateway/metrics.ts
src/gateway/metrics.test.ts
src/gateway/server.ts
src/gateway/handlers/system.ts
src/gateway/handlers/index.ts
src/daemon/services.ts
</files>
<action>
Create `src/gateway/metrics.ts` with a `MetricsCollector` class that tracks:
**Counters and gauges:**
- `messagesProcessed`: counter, incremented via `incrementMessages()` each time an agent.send completes (success or error)
- `errors`: counter, incremented via `incrementErrors()` on agent.send errors and any other recorded errors
- `activeRequests`: gauge (incremented on request start, decremented on request end)
**Model call metrics (ring buffer of recent calls, max 200 entries):**
Each entry: `{ timestamp: number, provider: string, latency: number, inputTokens: number, outputTokens: number, tokensPerSec: number, error?: string }`
- `recordModelCall(entry)` — push to ring buffer
- `getModelMetrics()` — return the array
**Event stream (ring buffer of recent events, max 500 entries):**
Each entry: `{ timestamp: number, level: 'info' | 'warn' | 'error', source: string, message: string, context?: Record<string, unknown> }`
- `recordEvent(event)` — push to ring buffer
- `getEvents(opts?: { level?: string, limit?: number })` — return filtered/limited array (newest first)
**Active request tracking:**
- `startRequest(id: string, info: { sessionId: string, channel: string })` — records start time + info
- `endRequest(id: string)` — removes from active map
- `getActiveRequests()` — returns array of `{ id, sessionId, channel, startedAt, durationMs }`
**Snapshot method:**
- `getSnapshot()` — returns `{ messagesProcessed, errors, activeRequests: number, uptime: number, modelCalls: { total, avgLatency, errorRate, recentCalls }, queueDepth: number }`
- Accept a `getQueueDepth` callback in constructor for LaneQueue integration
The class should be simple, synchronous (no async), and have NO external dependencies beyond Node.js builtins. Export the class and all relevant types.
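The specification above could be sketched roughly as follows. This is a minimal illustration rather than the final implementation: the `RingBuffer` helper, the `ActiveRequestInfo` type, the `startedAtMs` field, reporting `uptime` in seconds, and treating `modelCalls.total` as the number of calls currently buffered are all assumptions that the plan does not fix.

```typescript
// Sketch only. RingBuffer, ActiveRequestInfo, startedAtMs, and the reading of
// modelCalls.total as "calls currently buffered" are assumptions.
export interface ModelCallEntry {
  timestamp: number; provider: string; latency: number;
  inputTokens: number; outputTokens: number; tokensPerSec: number; error?: string;
}
export interface MetricsEvent {
  timestamp: number; level: 'info' | 'warn' | 'error'; source: string;
  message: string; context?: Record<string, unknown>;
}
export interface ActiveRequestInfo { sessionId: string; channel: string; }

/** Fixed-capacity buffer; once full, each push evicts the oldest entry (FIFO). */
class RingBuffer<T> {
  private items: T[] = [];
  private capacity: number;
  constructor(capacity: number) { this.capacity = capacity; }
  push(item: T): void {
    this.items.push(item);
    if (this.items.length > this.capacity) this.items.shift();
  }
  toArray(): T[] { return this.items.slice(); }
}

export class MetricsCollector {
  private messagesProcessed = 0;
  private errors = 0;
  private startedAtMs = Date.now();
  private modelCalls = new RingBuffer<ModelCallEntry>(200);
  private events = new RingBuffer<MetricsEvent>(500);
  private active = new Map<string, ActiveRequestInfo & { startedAt: number }>();
  private getQueueDepth: () => number;

  constructor(opts: { getQueueDepth: () => number }) {
    this.getQueueDepth = opts.getQueueDepth;
  }

  incrementMessages(): void { this.messagesProcessed++; }
  incrementErrors(): void { this.errors++; }
  recordModelCall(entry: ModelCallEntry): void { this.modelCalls.push(entry); }
  getModelMetrics(): ModelCallEntry[] { return this.modelCalls.toArray(); }
  recordEvent(event: MetricsEvent): void { this.events.push(event); }

  /** Newest first, optionally filtered by level and truncated to `limit`. */
  getEvents(opts: { level?: string; limit?: number } = {}): MetricsEvent[] {
    let out = this.events.toArray().reverse();
    if (opts.level !== undefined) out = out.filter((e) => e.level === opts.level);
    return opts.limit !== undefined ? out.slice(0, opts.limit) : out;
  }

  startRequest(id: string, info: ActiveRequestInfo): void {
    this.active.set(id, { sessionId: info.sessionId, channel: info.channel, startedAt: Date.now() });
  }
  endRequest(id: string): void { this.active.delete(id); }
  getActiveRequests() {
    const now = Date.now();
    return Array.from(this.active.entries()).map(([id, r]) => ({
      id, sessionId: r.sessionId, channel: r.channel,
      startedAt: r.startedAt, durationMs: now - r.startedAt,
    }));
  }

  getSnapshot() {
    const calls = this.modelCalls.toArray();
    const total = calls.length;
    const failed = calls.filter((c) => c.error !== undefined).length;
    const latencySum = calls.reduce((sum, c) => sum + c.latency, 0);
    return {
      messagesProcessed: this.messagesProcessed,
      errors: this.errors,
      activeRequests: this.active.size,
      uptime: Math.floor((Date.now() - this.startedAtMs) / 1000), // seconds (assumed unit)
      modelCalls: {
        total,
        avgLatency: total > 0 ? latencySum / total : 0,
        errorRate: total > 0 ? failed / total : 0,
        recentCalls: calls,
      },
      queueDepth: this.getQueueDepth(),
    };
  }
}
```

An array with `shift()` keeps the sketch short; at capacities of 200 and 500 the linear eviction cost is unlikely to matter, though a circular index would avoid it.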
**Wire MetricsCollector into the gateway:**
1. In `src/gateway/server.ts`:
- Add `metrics?: MetricsCollector` to `GatewayServerConfig` interface
- Store the metrics instance on the GatewayServer class
- In `handleHttpRequest`, add a handler for `GET /health` BEFORE the auth check (health endpoint should be unauthenticated for Docker HEALTHCHECK). Return JSON: `{ status: 'ok', uptime: <seconds>, version: <string>, sessions: <count>, connections: <count>, tools: <count>, channels: <channelList> }`. Use the same data sources as `system.health` RPC handler. Set `Content-Type: application/json`.
- In the agent.send flow: the GatewayServer doesn't handle agent.send directly (it's in the handler), so instead expose `getMetrics()` accessor on GatewayServer so handlers can access it.
2. In `src/gateway/handlers/system.ts`:
- Add `getMetrics?: () => { messagesProcessed: number, errors: number, activeRequests: number, uptime: number, modelCalls: { total: number, avgLatency: number, errorRate: number, recentCalls: unknown[] }, queueDepth: number }` to `SystemHandlerDeps`
- Add `getEvents?: (opts?: { level?: string, limit?: number }) => unknown[]` and `getActiveRequests?: () => unknown[]` to `SystemHandlerDeps`
- Add `system.metrics` handler: returns `getMetrics()` snapshot
- Add `system.events` handler: returns `getEvents()` with optional `level` and `limit` params
- Add `system.activeRequests` handler: returns `getActiveRequests()` array
- Update the re-exports in `src/gateway/handlers/index.ts` if any new types need exporting
3. In `src/gateway/handlers/agent.ts` (or via a wrapper in server.ts):
- The agent handler does not import or construct the MetricsCollector itself. When `src/gateway/server.ts` registers handlers, it passes the metrics callbacks to the system handlers via `SystemHandlerDeps` and extends `AgentHandlerDeps` so the server can hook the MetricsCollector into agent.send request tracking (Task 2 covers the details).
4. MetricsCollector ownership:
- Rather than instantiating the collector in `createGateway()` in `src/daemon/services.ts` and passing it through the config, create it in the GatewayServer constructor, which already has the LaneQueue: `this.metrics = new MetricsCollector({ getQueueDepth: () => this.laneQueue.totalPending() })`. No config change is needed in services.ts.
- GatewayServer passes metrics callbacks to system handler deps and agent handler deps.
- This keeps the metrics concern entirely within the gateway module and avoids changing too many files.
- Changes to `src/daemon/services.ts` are minimal or unnecessary; just ensure the GatewayServer starts collecting metrics automatically.
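The unauthenticated `GET /health` branch described in step 1 might look roughly like the sketch below. `HealthSources`, `buildHealthBody`, and `handleHealth` are hypothetical names, and the request/response interfaces are minimal stand-ins for Node's `IncomingMessage` and `ServerResponse`; the actual `handleHttpRequest` signature and data sources may differ.

```typescript
// Sketch only. All names here are hypothetical; the plan specifies only the
// JSON fields, the Content-Type header, and that the check runs before auth.
interface HealthSources {
  startedAtMs: number;
  version: string;
  sessionCount: () => number;
  connectionCount: () => number;
  toolCount: () => number;
  channelList: () => string[];
}
interface HttpRequestLike { method?: string; url?: string; }
interface HttpResponseLike {
  writeHead(status: number, headers: Record<string, string>): unknown;
  end(body: string): unknown;
}

function buildHealthBody(src: HealthSources, nowMs: number) {
  return {
    status: 'ok',
    uptime: Math.floor((nowMs - src.startedAtMs) / 1000), // seconds
    version: src.version,
    sessions: src.sessionCount(),
    connections: src.connectionCount(),
    tools: src.toolCount(),
    channels: src.channelList(),
  };
}

/** Returns true when the request was a health probe and has been answered. */
function handleHealth(
  req: HttpRequestLike,
  res: HttpResponseLike,
  src: HealthSources,
  nowMs: number = Date.now(),
): boolean {
  const path = (req.url ?? '').split('?')[0];
  if (req.method !== 'GET' || path !== '/health') return false;
  res.writeHead(200, { 'Content-Type': 'application/json' });
  res.end(JSON.stringify(buildHealthBody(src, nowMs)));
  return true;
}
```

In this sketch the caller would invoke `handleHealth` first and fall through to the auth check only when it returns `false`.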
**Update LaneQueue** (`src/gateway/lane-queue.ts`):
- Add a `totalPending(): number` method that returns the sum of all lane queue lengths (iterate over lanes, sum queue.length).
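A minimal sketch of `totalPending()`, assuming each lane keeps its pending tasks in an array stored in a `Map` keyed by lane ID. The class name `LaneQueueSketch`, the `lanes` field, the `enqueue` helper, and the `Task` type are hypothetical; the real `LaneQueue` internals may be organized differently.

```typescript
// Sketch only: a stand-in for LaneQueue with just enough state to show the sum.
type Task = () => Promise<void>;

class LaneQueueSketch {
  private lanes = new Map<string, { queue: Task[] }>();

  /** Hypothetical helper: append a task to a lane, creating the lane on first use. */
  enqueue(laneId: string, task: Task): void {
    let lane = this.lanes.get(laneId);
    if (!lane) {
      lane = { queue: [] };
      this.lanes.set(laneId, lane);
    }
    lane.queue.push(task);
  }

  /** Sum of pending tasks across every lane. */
  totalPending(): number {
    let total = 0;
    this.lanes.forEach((lane) => { total += lane.queue.length; });
    return total;
  }
}
```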
**Tests in `src/gateway/metrics.test.ts`:**
- Test counter increment/decrement
- Test model call ring buffer (max 200, FIFO eviction)
- Test event ring buffer (max 500, FIFO eviction, filtering by level)
- Test active request tracking (start, end, duration calculation)
- Test getSnapshot returns correct shape
- Test getEvents with level filter and limit
Run `pnpm test:run` to verify zero regressions plus new tests pass.
Run `pnpm typecheck` to verify no type errors.
</action>
<verify>
`pnpm test:run` — all existing 1077 tests pass plus new metrics tests pass.
`pnpm typecheck` — no type errors.
`grep -r "system.metrics\|system.events\|system.activeRequests" src/gateway/handlers/system.ts` — confirms new RPC methods exist.
`grep -r "GET.*health\|/health" src/gateway/server.ts` — confirms HTTP health endpoint exists.
</verify>
<done>
MetricsCollector created with counters, model call ring buffer, event ring buffer, and active request tracking.
Three new RPC handlers registered (system.metrics, system.events, system.activeRequests).
GET /health returns unauthenticated JSON health status.
LaneQueue has totalPending() method.
All tests pass with zero regressions.
</done>
</task>
<task type="auto">
<name>Task 2: Hook metrics recording into agent request flow</name>
<files>
src/gateway/server.ts
src/gateway/handlers/agent.ts
src/gateway/lane-queue.ts
</files>
<action>
Wire the MetricsCollector into the actual agent request flow so metrics are populated with real data as messages flow through the system.
1. **In `src/gateway/server.ts` registerHandlers():**
- Pass the MetricsCollector instance directly to the agent handler deps as `metrics?: MetricsCollector`, so the agent handler can call `startRequest()`, `endRequest()`, `incrementMessages()`, `incrementErrors()`, and `recordModelCall()` itself.
- This is cleaner than passing a bag of callbacks (`onRequestStart`, `onRequestEnd`, `onRequestComplete`, `onModelCall`).
2. **In `src/gateway/handlers/agent.ts`:**
- Add `metrics?: MetricsCollector` to `AgentHandlerDeps` (import type from `../metrics.js`)
- In `agent.send` handler:
- At start: `const requestId = request.id.toString(); deps.metrics?.startRequest(requestId, { sessionId: laneId, channel: 'ws' });`
- In the `try` block after `agent.process()` resolves: `deps.metrics?.incrementMessages();`
- In the `catch` block: `deps.metrics?.incrementErrors(); deps.metrics?.recordEvent({ timestamp: Date.now(), level: 'error', source: 'agent.send', message: err.message || 'Unknown error', context: { sessionId: laneId } });`
- In `finally`: `deps.metrics?.endRequest(requestId);`
- For tool use events, record them to metrics: when `event.type === 'end'` and `event.result` and `!event.result.success`, increment error counter and record error event.
3. **In `src/gateway/server.ts` registerHandlers():**
- Pass `metrics: this.metrics` when constructing agent handler deps.
- Update the system handlers construction to pass the metrics accessors:
```
getMetrics: () => this.metrics.getSnapshot(),
getEvents: (opts) => this.metrics.getEvents(opts),
getActiveRequests: () => this.metrics.getActiveRequests(),
```
4. **Test the wiring:**
- In the existing `src/gateway/server.test.ts` or `src/gateway/handlers/handlers.test.ts`, verify that sending a message through agent.send increments the metrics counters. If the existing test infrastructure doesn't easily support this, at minimum verify through the type system that the wiring is correct.
Run `pnpm test:run` and `pnpm typecheck`.
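The try/catch/finally hooks from step 2 could be arranged roughly as in the sketch below. `MetricsLike`, `AgentSendDeps`, `handleAgentSend`, and the `process` callback are hypothetical stand-ins for the real handler and `agent.process()`; only the order of the metrics calls is taken from the steps above.

```typescript
// Sketch only. The handler shape and `process` callback are assumptions.
interface MetricsLike {
  startRequest(id: string, info: { sessionId: string; channel: string }): void;
  endRequest(id: string): void;
  incrementMessages(): void;
  incrementErrors(): void;
  recordEvent(event: {
    timestamp: number; level: 'info' | 'warn' | 'error'; source: string;
    message: string; context?: Record<string, unknown>;
  }): void;
}
interface AgentSendDeps {
  metrics?: MetricsLike;
  process: (laneId: string, text: string) => Promise<string>;
}

async function handleAgentSend(
  deps: AgentSendDeps,
  request: { id: number | string },
  laneId: string,
  text: string,
): Promise<string> {
  const requestId = request.id.toString();
  deps.metrics?.startRequest(requestId, { sessionId: laneId, channel: 'ws' });
  try {
    const reply = await deps.process(laneId, text);
    deps.metrics?.incrementMessages(); // counted once the call resolves
    return reply;
  } catch (err) {
    const message = err instanceof Error && err.message ? err.message : 'Unknown error';
    deps.metrics?.incrementErrors();
    deps.metrics?.recordEvent({
      timestamp: Date.now(), level: 'error', source: 'agent.send',
      message, context: { sessionId: laneId },
    });
    throw err; // let the caller turn the failure into an RPC error response
  } finally {
    deps.metrics?.endRequest(requestId); // runs on both success and failure
  }
}
```

Because every call goes through `deps.metrics?.`, the handler behaves the same when no collector is supplied.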
</action>
<verify>
`pnpm test:run` — all tests pass (1077 existing + new metrics tests).
`pnpm typecheck` — no type errors.
`grep -r "metrics\." src/gateway/handlers/agent.ts` — confirms metrics calls in agent handler.
</verify>
<done>
Agent request flow records: messagesProcessed counter, error counter, active request tracking, and error events.
Tool failures are recorded as error events.
System handlers return live metrics data from MetricsCollector.
All tests pass, no type errors.
</done>
</task>
</tasks>
<verification>
1. `pnpm test:run` — all 1077+ tests pass
2. `pnpm typecheck` — zero type errors
3. New system.metrics, system.events, system.activeRequests RPC methods registered (check via getMethods())
4. GET /health returns valid JSON with status, uptime, version fields
5. MetricsCollector ring buffers enforce size limits
</verification>
<success_criteria>
- MetricsCollector exists with counters, model call buffer, event buffer, active request tracking
- Three new RPC handlers return metrics data
- GET /health endpoint returns unauthenticated JSON health status
- Agent request flow records messagesProcessed, errors, active requests, and error events
- Zero test regressions, all new tests pass
</success_criteria>
<output>
After completion, create `.planning/phases/03-live-ops-dashboard/03-01-SUMMARY.md`
</output>
@@ -0,0 +1,260 @@
---
phase: 03-live-ops-dashboard
plan: 02
type: execute
wave: 2
depends_on: ["03-01"]
files_modified:
- src/gateway/ui/pages/dashboard.js
- src/gateway/ui/style.css
- src/gateway/ui/index.html
- src/gateway/ui/lib/ws-client.js
autonomous: false
must_haves:
  truths:
    - "Dashboard shows live-updating counters: messages processed, active sessions, queue depth, daemon uptime — values change in real time"
    - "Dashboard shows model call metrics: per-call latency, tokens/sec throughput, error rates by provider"
    - "Dashboard shows live event stream: scrollable log of errors and events with timestamps, auto-scrolls on new entries"
    - "Dashboard shows active request tracking: in-flight requests with duration and session info"
    - "Dashboard auto-refreshes every 3 seconds for counters and events, maintaining live feel"
  artifacts:
    - path: "src/gateway/ui/pages/dashboard.js"
      provides: "Enhanced dashboard page with metrics, events, and active request sections"
      min_lines: 200
    - path: "src/gateway/ui/style.css"
      provides: "New CSS classes for event stream, metrics cards, active requests table"
      contains: "event-stream"
    - path: "src/gateway/ui/index.html"
      provides: "Unchanged structure (dashboard page already registered)"
    - path: "src/gateway/ui/lib/ws-client.js"
      provides: "No changes needed (call() method already supports the new RPC methods)"
  key_links:
    - from: "src/gateway/ui/pages/dashboard.js"
      to: "system.metrics"
      via: "client.call('system.metrics')"
      pattern: "client\\.call.*system\\.metrics"
    - from: "src/gateway/ui/pages/dashboard.js"
      to: "system.events"
      via: "client.call('system.events')"
      pattern: "client\\.call.*system\\.events"
    - from: "src/gateway/ui/pages/dashboard.js"
      to: "system.activeRequests"
      via: "client.call('system.activeRequests')"
      pattern: "client\\.call.*system\\.activeRequests"
---
<objective>
Extend the existing vanilla JS dashboard with live ops sections: core counters, model call metrics, event stream, and active request tracking.
Purpose: This is the user-facing deliverable — the operator opens the dashboard and sees real-time system health without tailing logs. All data comes from the RPC handlers created in Plan 01.
Output: Enhanced dashboard.js with four new sections, supporting CSS, human-verified live dashboard.
</objective>
<execution_context>
@/home/will/.config/opencode/get-shit-done/workflows/execute-plan.md
@/home/will/.config/opencode/get-shit-done/templates/summary.md
</execution_context>
<context>
@.planning/PROJECT.md
@.planning/ROADMAP.md
@.planning/STATE.md
@.planning/phases/03-live-ops-dashboard/03-01-SUMMARY.md
@src/gateway/ui/pages/dashboard.js
@src/gateway/ui/style.css
@src/gateway/ui/index.html
@src/gateway/ui/app.js
@src/gateway/ui/lib/ws-client.js
</context>
<tasks>
<task type="auto">
<name>Task 1: Extend dashboard page with live ops sections</name>
<files>
src/gateway/ui/pages/dashboard.js
src/gateway/ui/style.css
</files>
<action>
**IMPORTANT: Extend the existing vanilla JS dashboard — do NOT replace with React or any framework. This is a locked user decision.**
Rewrite `src/gateway/ui/pages/dashboard.js` to show five sections, four new plus the existing channels grid (replacing the current simple health/channels/usage layout):
**Section 1: Core Counters (top row of stat cards)**
- Messages Processed (from `system.metrics` → messagesProcessed)
- Active Sessions (from `system.health` → sessions)
- Queue Depth (from `system.metrics` → queueDepth)
- Daemon Uptime (from `system.metrics` → uptime, formatted as "Xd Xh Xm Xs")
- Active Requests (from `system.metrics` → activeRequests)
- Errors (from `system.metrics` → errors, colored red if > 0)
Use the existing `.stats-grid` and `.stat-card` CSS classes.
**Section 2: Model Performance (table of recent model calls)**
- Show the most recent 20 model calls from `system.metrics` → modelCalls.recentCalls
- Table columns: Time (relative, e.g. "3s ago"), Provider, Latency (ms), Tokens/sec, In/Out tokens, Status (✓ or ✗)
- Summary row above the table: Total calls, Avg latency, Error rate %
- Use existing table CSS classes
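For the relative Time column, a small helper along these lines may be enough. The name `formatRelativeTime` and the unit thresholds are assumptions; it is written in TypeScript for illustration, and removing the type annotations should leave plain JavaScript suitable for the vanilla JS dashboard.

```typescript
// Sketch only: hypothetical helper for the "3s ago" style Time column.
function formatRelativeTime(timestampMs: number, nowMs: number = Date.now()): string {
  // Clamp to zero so a slightly skewed clock never renders a negative age.
  const seconds = Math.max(0, Math.floor((nowMs - timestampMs) / 1000));
  if (seconds < 60) return `${seconds}s ago`;
  const minutes = Math.floor(seconds / 60);
  if (minutes < 60) return `${minutes}m ago`;
  const hours = Math.floor(minutes / 60);
  if (hours < 24) return `${hours}h ago`;
  return `${Math.floor(hours / 24)}d ago`;
}
```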
**Section 3: Event Stream (scrollable log)**
- Fetch from `system.events` with `{ limit: 50 }`
- Each event rendered as a row: `[HH:MM:SS] [LEVEL] source: message`
- Color-code: error=red, warn=yellow, info=default
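- Container has max-height with overflow-y: auto and auto-scrolls to bottom on new entries. `system.events` returns newest first, so reverse the array before rendering to keep the newest entry at the bottom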
- New class `.event-stream` for the container, `.event-row` for each entry, `.event-level-error`, `.event-level-warn`, `.event-level-info` for coloring
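The row format and ordering for the event stream could be sketched as below. `formatEventRow`, `eventLevelClass`, `toDisplayOrder`, and `pad2` are hypothetical helpers, and upper-casing the level is an assumption about how `[LEVEL]` should render; the code is TypeScript for illustration and should reduce to plain JavaScript once the annotations are removed.

```typescript
// Sketch only: hypothetical helpers for rendering system.events entries.
interface DashboardEvent {
  timestamp: number;
  level: 'info' | 'warn' | 'error';
  source: string;
  message: string;
}

function pad2(n: number): string {
  return (n < 10 ? '0' : '') + n;
}

/** Renders `[HH:MM:SS] [LEVEL] source: message` using the browser's local time. */
function formatEventRow(e: DashboardEvent): string {
  const d = new Date(e.timestamp);
  const time = `${pad2(d.getHours())}:${pad2(d.getMinutes())}:${pad2(d.getSeconds())}`;
  return `[${time}] [${e.level.toUpperCase()}] ${e.source}: ${e.message}`;
}

/** Maps a level to the CSS class named in the plan. */
function eventLevelClass(level: DashboardEvent['level']): string {
  return `event-level-${level}`;
}

/** The RPC returns newest first; the log reads top to bottom, so flip a copy. */
function toDisplayOrder<T>(newestFirst: T[]): T[] {
  return newestFirst.slice().reverse();
}
```

After appending rows, setting the container's `scrollTop` to its `scrollHeight` is the usual way to keep the newest entry in view.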
**Section 4: Active Requests (table when requests are in flight, placeholder text otherwise)**
- Fetch from `system.activeRequests`
- Table columns: Session, Channel, Duration (live-updating), Started
- If no active requests, show "No active requests" muted text
- Use existing table CSS
**Section 5: Channels (keep existing)**
- Keep the existing channels grid showing connected/disconnected channel adapters
**Refresh strategy:**
- Replace the current 10-second interval with a 3-second interval for the core data (system.metrics, system.events, system.activeRequests)
- Fetch system.health and system.channels every 10 seconds (less dynamic data)
- Use `Promise.all` to batch the frequent calls together
- Keep the existing `teardown()` pattern with `clearInterval`
**Implementation approach:**
- Keep the same module pattern: `loadDashboard(el, client)` function + `DashboardPage` export with `render`/`teardown`
- Use two timers: `_fastTimer` (3s) for metrics/events/requests, `_slowTimer` (10s) for health/channels
- On first render, fetch everything with `Promise.all`
- On subsequent fast ticks, only update the dynamic sections (don't re-render the whole page — use targeted DOM updates via `getElementById` for each section)
- Generate unique section IDs: `#ops-counters`, `#ops-model-table`, `#ops-events`, `#ops-requests`, `#ops-channels`
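The two-timer refresh could be structured roughly as follows. `startDashboardPolling`, `RpcClient`, and the `onFast`/`onSlow` callbacks are hypothetical names, and passing params as a second argument to `client.call` is an assumption about the existing ws-client; the code is TypeScript for illustration and should reduce to plain JavaScript once the annotations are removed.

```typescript
// Sketch only: hypothetical polling helper with a fast and a slow timer.
interface RpcClient {
  call(method: string, params?: unknown): Promise<unknown>;
}
interface FastData { metrics: unknown; events: unknown; requests: unknown; }
interface SlowData { health: unknown; channels: unknown; }

/** Starts both timers, fetches everything once immediately, returns a teardown function. */
function startDashboardPolling(
  client: RpcClient,
  onFast: (data: FastData) => void,
  onSlow: (data: SlowData) => void,
  fastMs: number = 3000,
  slowMs: number = 10000,
): () => void {
  const fastTick = async (): Promise<void> => {
    try {
      // Batch the three frequent calls so one tick costs one round of waiting.
      const [metrics, events, requests] = await Promise.all([
        client.call('system.metrics'),
        client.call('system.events', { limit: 50 }),
        client.call('system.activeRequests'),
      ]);
      onFast({ metrics, events, requests });
    } catch (err) {
      console.error('fast refresh failed', err);
    }
  };
  const slowTick = async (): Promise<void> => {
    try {
      const [health, channels] = await Promise.all([
        client.call('system.health'),
        client.call('system.channels'),
      ]);
      onSlow({ health, channels });
    } catch (err) {
      console.error('slow refresh failed', err);
    }
  };

  void fastTick(); // first render: fetch everything right away
  void slowTick();
  const fastTimer = setInterval(fastTick, fastMs);
  const slowTimer = setInterval(slowTick, slowMs);

  return () => {
    clearInterval(fastTimer);
    clearInterval(slowTimer);
  };
}
```

In this arrangement `render` would keep the returned function and `teardown()` would call it, with `onFast` and `onSlow` performing the targeted DOM updates for their sections.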
**CSS additions in `src/gateway/ui/style.css`:**
Add near the end of the file, immediately before the responsive section:
```css
/* ── Event Stream ──────────────────────────────────────── */
.event-stream {
  max-height: 300px;
  overflow-y: auto;
  background-color: var(--bg-secondary);
  border: 1px solid var(--border);
  border-radius: var(--radius);
  padding: 8px;
  font-size: var(--font-size-sm);
  font-family: var(--font-mono);
}
.event-row {
  padding: 4px 8px;
  border-bottom: 1px solid var(--border-light);
  white-space: pre-wrap;
  word-break: break-word;
}
.event-row:last-child {
  border-bottom: none;
}
.event-level-error { color: var(--error); }
.event-level-warn { color: var(--warning); }
.event-level-info { color: var(--text-secondary); }
/* ── Model Metrics Summary ─────────────────────────────── */
.metrics-summary {
  display: flex;
  gap: 24px;
  margin-bottom: 12px;
  font-size: var(--font-size-sm);
  color: var(--text-secondary);
}
.metrics-summary .metric {
  display: flex;
  gap: 6px;
}
.metrics-summary .metric-value {
  font-weight: 600;
  color: var(--text-primary);
}
```
**Keep the formatUptime helper** — it already exists and works perfectly.
**Avoid:** Do NOT add animations or transitions. Do NOT import external libraries. On the fast-update path, do NOT rebuild the whole page with a template literal assigned to innerHTML; update only the affected elements (textContent for single values, innerHTML scoped to one section container) to avoid flicker.
</action>
<verify>
`pnpm typecheck` — no type errors (vanilla JS won't affect this, but ensures no TS regressions).
`pnpm build` — builds successfully (UI files are served as static assets, not compiled).
Manual check: Open `src/gateway/ui/pages/dashboard.js` and verify it:
- Calls `client.call('system.metrics')`
- Calls `client.call('system.events')`
- Calls `client.call('system.activeRequests')`
- Has 3-second and 10-second refresh timers
- Has `teardown()` that cleans up both timers
</verify>
<done>
Dashboard page shows five sections: core counters, model performance table, event stream, active requests, and channels.
Counters and events refresh every 3 seconds.
Health and channels refresh every 10 seconds.
Event stream auto-scrolls and is color-coded by level.
Active requests section shows in-flight requests or "no active requests" message.
All existing stat-card and table CSS reused; new event-stream CSS added.
</done>
</task>
<task type="checkpoint:human-verify" gate="blocking">
<name>Task 2: Verify live dashboard in browser</name>
<files>src/gateway/ui/pages/dashboard.js</files>
<action>
Human verification of the live dashboard. What was built:
- Live ops dashboard with real-time metrics, event stream, model performance table, active request tracking, and HTTP /health endpoint
- Extended the existing vanilla JS dashboard (no framework replacement)
Steps to verify:
1. Start Flynn: `pnpm dev`
2. Open the dashboard in a browser (default: http://localhost:3100 or configured port)
3. Verify the dashboard shows:
- Core counters row: Messages Processed, Active Sessions, Queue Depth, Uptime, Active Requests, Errors
- Model Performance section: table of recent model calls (may be empty if no messages sent yet)
- Event Stream section: scrollable log (may show startup events)
- Active Requests section: "No active requests" or table
- Channels section: connected channel adapters
4. Send a message through the chat page (or via a connected channel) and verify:
- Messages Processed counter increments within 3 seconds
- Model Performance table shows the new call with latency and tokens/sec
- Event stream shows relevant entries
5. Trigger an error (e.g., send a message that causes a tool error) and verify it appears in the event stream in red
6. Test HTTP /health: `curl http://localhost:3100/health` — should return JSON with status, uptime, version
7. Run `pnpm test:run` — all tests pass
Resume signal: Type "approved" or describe issues.
</action>
<verify>Human confirms dashboard displays correctly and updates in real-time.</verify>
<done>Dashboard visually confirmed working with live-updating metrics, event stream, and model performance data.</done>
</task>
</tasks>
<verification>
1. Dashboard loads without errors in browser console
2. All five sections render with real data
3. Counters update within 3 seconds of events occurring
4. Event stream is scrollable and color-coded
5. `curl /health` returns valid JSON
6. `pnpm test:run` — all tests pass
7. `pnpm typecheck` — zero type errors
</verification>
<success_criteria>
- Dashboard shows live-updating counters that change as messages flow (DASH-01)
- Model call metrics visible with latency and tokens/sec (DASH-02)
- Event stream shows errors with timestamps and context (DASH-03)
- Active requests tracked and displayed (DASH-04)
- GET /health returns JSON status (DASH-05)
- Existing dashboard pages (chat, sessions, usage, settings) unaffected
- Zero test regressions
</success_criteria>
<output>
After completion, create `.planning/phases/03-live-ops-dashboard/03-02-SUMMARY.md`
</output>