docs(03): create phase plan for live ops dashboard

2026-02-09 21:10:03 -08:00
parent fa4d6a057b
commit 94946eb7a8
3 changed files with 526 additions and 0 deletions
@@ -0,0 +1,255 @@
+---
+phase: 03-live-ops-dashboard
+plan: 01
+type: execute
+wave: 1
+depends_on: []
+files_modified:
+  - src/gateway/metrics.ts
+  - src/gateway/metrics.test.ts
+  - src/gateway/handlers/system.ts
+  - src/gateway/handlers/index.ts
+  - src/gateway/handlers/agent.ts
+  - src/gateway/lane-queue.ts
+  - src/gateway/server.ts
+  - src/daemon/services.ts
+autonomous: true
+
+must_haves:
+  truths:
+    - "MetricsCollector accumulates counters (messages processed, errors) and model call metrics (latency, tokens/sec, provider)"
+    - "Gateway exposes system.metrics and system.events RPC methods returning accumulated data"
+    - "GET /health returns JSON with daemon status, uptime, and component readiness without WebSocket"
+    - "Errors and significant events are captured in a ring buffer accessible via RPC"
+    - "Active agent requests are tracked (in-flight count, tool executions, session IDs)"
+  artifacts:
+    - path: "src/gateway/metrics.ts"
+      provides: "MetricsCollector class — single source of truth for all ops metrics"
+      exports: ["MetricsCollector"]
+    - path: "src/gateway/metrics.test.ts"
+      provides: "Tests for MetricsCollector"
+      contains: "describe.*MetricsCollector"
+    - path: "src/gateway/handlers/system.ts"
+      provides: "New system.metrics, system.events, system.activeRequests RPC handlers"
+      contains: "system.metrics"
+    - path: "src/gateway/server.ts"
+      provides: "HTTP /health endpoint and MetricsCollector wiring"
+      contains: "/health"
+    - path: "src/gateway/lane-queue.ts"
+      provides: "LaneQueue.totalPending() for the queue depth metric"
+      contains: "totalPending"
+  key_links:
+    - from: "src/gateway/server.ts"
+      to: "src/gateway/metrics.ts"
+      via: "GatewayServer holds MetricsCollector ref, passes to handlers"
+      pattern: "metrics.*MetricsCollector"
+    - from: "src/gateway/handlers/system.ts"
+      to: "src/gateway/metrics.ts"
+      via: "System handlers read from MetricsCollector"
+      pattern: "getMetrics|getEvents|getActiveRequests"
+    - from: "src/gateway/server.ts"
+      to: "src/gateway/metrics.ts"
+      via: "GatewayServer constructor instantiates MetricsCollector"
+      pattern: "new MetricsCollector"
+---
+
+<objective>
+Create the metrics collection backend and wire it into the gateway server with new RPC handlers and an HTTP /health endpoint.
+
+Purpose: Provide the data layer that the dashboard UI (Plan 02) will consume. Without collected metrics, the dashboard has nothing to show beyond what system.health already provides.
+
+Output: MetricsCollector class, 3 new RPC methods (system.metrics, system.events, system.activeRequests), HTTP GET /health endpoint, tests.
+</objective>
+
+<execution_context>
+@/home/will/.config/opencode/get-shit-done/workflows/execute-plan.md
+@/home/will/.config/opencode/get-shit-done/templates/summary.md
+</execution_context>
+
+<context>
+@.planning/PROJECT.md
+@.planning/ROADMAP.md
+@.planning/STATE.md
+@src/gateway/server.ts
+@src/gateway/handlers/system.ts
+@src/gateway/handlers/index.ts
+@src/gateway/protocol.ts
+@src/gateway/router.ts
+@src/gateway/handlers/agent.ts
+@src/gateway/session-bridge.ts
+@src/daemon/services.ts
+@src/daemon/index.ts
+</context>
+
+<tasks>
+
+<task type="auto">
+  <name>Task 1: Create MetricsCollector and wire into gateway</name>
+  <files>
+    src/gateway/metrics.ts
+    src/gateway/metrics.test.ts
+    src/gateway/server.ts
+    src/gateway/handlers/system.ts
+    src/gateway/handlers/index.ts
+    src/gateway/lane-queue.ts
+    src/daemon/services.ts
+  </files>
+  <action>
+Create `src/gateway/metrics.ts` with a `MetricsCollector` class that tracks:
+
+**Counters (simple incrementing numbers):**
+- `messagesProcessed` — incremented each time an agent.send completes (success or error)
+- `errors` — incremented on agent.send errors and any other recorded errors
+- `activeRequests` — gauge (increment on start, decrement on end)
+
+**Model call metrics (ring buffer of recent calls, max 200 entries):**
+Each entry: `{ timestamp: number, provider: string, latency: number, inputTokens: number, outputTokens: number, tokensPerSec: number, error?: string }`
+- `recordModelCall(entry)` — push to ring buffer
+- `getModelMetrics()` — return the array
+
+**Event stream (ring buffer of recent events, max 500 entries):**
+Each entry: `{ timestamp: number, level: 'info' | 'warn' | 'error', source: string, message: string, context?: Record<string, unknown> }`
+- `recordEvent(event)` — push to ring buffer
+- `getEvents(opts?: { level?: string, limit?: number })` — return filtered/limited array (newest first)
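The ring buffers and the `getEvents` filter described above could be sketched roughly as follows. This is only an illustration under stated assumptions: `RingBuffer`, `OpsEvent`, and `EventLog` are hypothetical names, and a plain array with `shift()` eviction is assumed to be fast enough at these sizes (200 and 500 entries).

```typescript
// Hypothetical sketch: names and structure are assumptions, not the project's code.
type EventLevel = 'info' | 'warn' | 'error';

interface OpsEvent {
  timestamp: number;
  level: EventLevel;
  source: string;
  message: string;
  context?: Record<string, unknown>;
}

// Fixed-capacity FIFO buffer: once full, the oldest entry is evicted.
class RingBuffer<T> {
  private items: T[] = [];
  private readonly capacity: number;

  constructor(capacity: number) {
    this.capacity = capacity;
  }

  push(item: T): void {
    this.items.push(item);
    if (this.items.length > this.capacity) {
      this.items.shift(); // drop the oldest entry
    }
  }

  // Returns a copy in insertion order (oldest first).
  toArray(): T[] {
    return this.items.slice();
  }
}

class EventLog {
  private readonly events = new RingBuffer<OpsEvent>(500);

  recordEvent(event: OpsEvent): void {
    this.events.push(event);
  }

  // Newest first, optionally filtered by level and capped at `limit`.
  getEvents(opts: { level?: string; limit?: number } = {}): OpsEvent[] {
    let result = this.events.toArray().reverse();
    if (opts.level !== undefined) {
      result = result.filter((e) => e.level === opts.level);
    }
    if (opts.limit !== undefined) {
      result = result.slice(0, opts.limit);
    }
    return result;
  }
}
```

The filter runs before the limit here, so `{ level: 'error', limit: 10 }` yields the ten most recent errors rather than the errors among the ten most recent events. The plan does not say which is intended, so that ordering is an assumption.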
+
+**Active request tracking:**
+- `startRequest(id: string, info: { sessionId: string, channel: string })` — records start time + info
+- `endRequest(id: string)` — removes from active map
+- `getActiveRequests()` — returns array of `{ id, sessionId, channel, startedAt, durationMs }`
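A minimal sketch of the active request tracking above, assuming a `Map` keyed by request id. `ActiveRequestTracker` and the injectable `now` clock are hypothetical additions made for testability; the plan itself only names `startRequest`, `endRequest`, and `getActiveRequests`.

```typescript
// Hypothetical sketch: the class name and injectable clock are assumptions.
interface ActiveRequestInfo {
  sessionId: string;
  channel: string;
}

interface ActiveRequestView extends ActiveRequestInfo {
  id: string;
  startedAt: number;
  durationMs: number;
}

class ActiveRequestTracker {
  private readonly active = new Map<string, ActiveRequestInfo & { startedAt: number }>();
  private readonly now: () => number;

  constructor(now: () => number = () => Date.now()) {
    this.now = now;
  }

  startRequest(id: string, info: ActiveRequestInfo): void {
    this.active.set(id, { sessionId: info.sessionId, channel: info.channel, startedAt: this.now() });
  }

  endRequest(id: string): void {
    this.active.delete(id); // no-op if the id is unknown
  }

  // durationMs is computed at read time, so it grows between polls.
  getActiveRequests(): ActiveRequestView[] {
    const now = this.now();
    return Array.from(this.active, ([id, r]) => ({
      id,
      sessionId: r.sessionId,
      channel: r.channel,
      startedAt: r.startedAt,
      durationMs: now - r.startedAt,
    }));
  }

  // Current gauge value for the activeRequests counter.
  get count(): number {
    return this.active.size;
  }
}
```

Deriving the `activeRequests` gauge from the map size, rather than keeping a separate counter, avoids the two drifting apart if `endRequest` is ever called twice for the same id.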
+
+**Snapshot method:**
+- `getSnapshot()` — returns `{ messagesProcessed, errors, activeRequests: number, uptime: number, modelCalls: { total, avgLatency, errorRate, recentCalls }, queueDepth: number }`
+- Accept a `getQueueDepth` callback in constructor for LaneQueue integration
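The `modelCalls` part of the snapshot could be derived from the model call buffer along these lines. The plan does not define the aggregates precisely, so this sketch assumes that `total` counts the entries currently in the buffer, that `errorRate` is a fraction between 0 and 1, and that an empty buffer yields zeros. `summarizeModelCalls` is a hypothetical helper name.

```typescript
// Hypothetical sketch: the aggregate definitions are assumptions (see lead-in).
interface ModelCall {
  timestamp: number;
  provider: string;
  latency: number;
  inputTokens: number;
  outputTokens: number;
  tokensPerSec: number;
  error?: string;
}

interface ModelCallSummary {
  total: number;
  avgLatency: number;
  errorRate: number;
  recentCalls: ModelCall[];
}

function summarizeModelCalls(calls: ModelCall[]): ModelCallSummary {
  const total = calls.length;
  if (total === 0) {
    // Avoid dividing by zero when no calls have been recorded yet.
    return { total: 0, avgLatency: 0, errorRate: 0, recentCalls: [] };
  }
  const latencySum = calls.reduce((sum, c) => sum + c.latency, 0);
  const failed = calls.filter((c) => c.error !== undefined).length;
  return {
    total,
    avgLatency: latencySum / total,
    errorRate: failed / total,
    recentCalls: calls.slice(),
  };
}
```

If the dashboard shows the error rate as a percentage, the multiplication by 100 would happen on the UI side under this assumption.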
+
+The class should be simple, synchronous (no async), and have NO external dependencies beyond Node.js builtins. Export the class and all relevant types.
+
+**Wire MetricsCollector into the gateway:**
+
+1. In `src/gateway/server.ts`:
+   - Create the `MetricsCollector` in the `GatewayServer` constructor and store it on the instance (no `GatewayServerConfig` change is needed; see step 4)
+   - In `handleHttpRequest`, add a handler for `GET /health` BEFORE the auth check (health endpoint should be unauthenticated for Docker HEALTHCHECK). Return JSON: `{ status: 'ok', uptime: <seconds>, version: <string>, sessions: <count>, connections: <count>, tools: <count>, channels: <channelList> }`. Use the same data sources as `system.health` RPC handler. Set `Content-Type: application/json`.
+   - In the agent.send flow: the GatewayServer doesn't handle agent.send directly (it's in the handler), so instead expose `getMetrics()` accessor on GatewayServer so handlers can access it.
+
+2. In `src/gateway/handlers/system.ts`:
+   - Add `getMetrics?: () => { messagesProcessed: number, errors: number, activeRequests: number, uptime: number, modelCalls: { total: number, avgLatency: number, errorRate: number, recentCalls: unknown[] }, queueDepth: number }` to `SystemHandlerDeps`
+   - Add `getEvents?: (opts?: { level?: string, limit?: number }) => unknown[]` and `getActiveRequests?: () => unknown[]` to `SystemHandlerDeps`
+   - Add `system.metrics` handler: returns `getMetrics()` snapshot
+   - Add `system.events` handler: returns `getEvents()` with optional `level` and `limit` params
+   - Add `system.activeRequests` handler: returns `getActiveRequests()` array
+   - Update the re-exports in `src/gateway/handlers/index.ts` if any new types need exporting
+
+3. In `src/gateway/handlers/agent.ts`:
+   - No metrics calls are added in this task. The agent handler gets access to metrics through its deps when `GatewayServer` registers handlers; the request-tracking hooks in `agent.send` are added in Task 2.
+
+4. MetricsCollector ownership:
+   - Create the MetricsCollector in the `GatewayServer` constructor, which already owns the LaneQueue: `this.metrics = new MetricsCollector({ getQueueDepth: () => this.laneQueue.totalPending() })`.
+   - GatewayServer passes metrics access to the system handler deps and the agent handler deps.
+   - This keeps the metrics concern entirely within the gateway module. No config change is needed, and changes to `src/daemon/services.ts` are minimal or unnecessary: just ensure the GatewayServer starts collecting metrics automatically.
+
+**Update LaneQueue** (`src/gateway/lane-queue.ts`):
+- Add a `totalPending(): number` method that returns the sum of all lane queue lengths (iterate over lanes, sum queue.length).
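For illustration, `totalPending()` might look like the sketch below. The real `LaneQueue` internals are not shown in this plan, so the `Map<string, T[]>` layout, the `enqueue` method, and the class name `LaneQueueSketch` are all assumptions made only to give the sum something to iterate over.

```typescript
// Hypothetical stand-in for LaneQueue: the internal Map layout is an assumption.
class LaneQueueSketch<T> {
  private readonly lanes = new Map<string, T[]>();

  enqueue(laneId: string, item: T): void {
    const queue = this.lanes.get(laneId);
    if (queue === undefined) {
      this.lanes.set(laneId, [item]);
    } else {
      queue.push(item);
    }
  }

  // Sum of pending items across every lane.
  totalPending(): number {
    let total = 0;
    this.lanes.forEach((queue) => {
      total += queue.length;
    });
    return total;
  }
}
```

Because the method only reads lengths, it stays synchronous and cheap enough to call from `getSnapshot()` on every dashboard poll.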
+
+**Tests in `src/gateway/metrics.test.ts`:**
+- Test counter increment/decrement
+- Test model call ring buffer (max 200, FIFO eviction)
+- Test event ring buffer (max 500, FIFO eviction, filtering by level)
+- Test active request tracking (start, end, duration calculation)
+- Test getSnapshot returns correct shape
+- Test getEvents with level filter and limit
+
+Run `pnpm test:run` to verify zero regressions plus new tests pass.
+Run `pnpm typecheck` to verify no type errors.
+  </action>
+  <verify>
+`pnpm test:run` — all existing 1077 tests pass plus new metrics tests pass.
+`pnpm typecheck` — no type errors.
+`grep -E "system\.(metrics|events|activeRequests)" src/gateway/handlers/system.ts` — confirms new RPC methods exist.
+`grep -r "GET.*health\|/health" src/gateway/server.ts` — confirms HTTP health endpoint exists.
+  </verify>
+  <done>
+MetricsCollector created with counters, model call ring buffer, event ring buffer, and active request tracking.
+Three new RPC handlers registered (system.metrics, system.events, system.activeRequests).
+GET /health returns unauthenticated JSON health status.
+LaneQueue has totalPending() method.
+All tests pass with zero regressions.
+  </done>
+</task>
+
+<task type="auto">
+  <name>Task 2: Hook metrics recording into agent request flow</name>
+  <files>
+    src/gateway/server.ts
+    src/gateway/handlers/agent.ts
+    src/gateway/lane-queue.ts
+  </files>
+  <action>
+Wire the MetricsCollector into the actual agent request flow so metrics are populated with real data as messages flow through the system.
+
+1. **In `src/gateway/server.ts` registerHandlers():**
+   - Pass the MetricsCollector instance directly to the agent handler deps as `metrics?: MetricsCollector`, and let the agent handler call its methods (`startRequest()`, `endRequest()`, `incrementMessages()`, `incrementErrors()`, `recordEvent()`, `recordModelCall()`).
+   - This is cleaner than passing a bag of callbacks (`onRequestStart`, `onRequestEnd`, `onRequestComplete`, `onModelCall`).
+
+2. **In `src/gateway/handlers/agent.ts`:**
+   - Add `metrics?: MetricsCollector` to `AgentHandlerDeps` (import type from `../metrics.js`)
+   - In `agent.send` handler:
+     - At start: `const requestId = request.id.toString(); deps.metrics?.startRequest(requestId, { sessionId: laneId, channel: 'ws' });`
+     - In the `try` block after `agent.process()` resolves: `deps.metrics?.incrementMessages();`
+     - In the `catch` block: `deps.metrics?.incrementErrors(); deps.metrics?.recordEvent({ timestamp: Date.now(), level: 'error', source: 'agent.send', message: err instanceof Error ? err.message : 'Unknown error', context: { sessionId: laneId } });`
+     - In `finally`: `deps.metrics?.endRequest(requestId);`
+   - For tool use events, record them to metrics: when `event.type === 'end'` and `event.result` and `!event.result.success`, increment error counter and record error event.
+
+3. **In `src/gateway/server.ts` registerHandlers():**
+   - Pass `metrics: this.metrics` when constructing agent handler deps.
+   - Update the system handlers construction to pass the metrics accessors:
+     ```
+     getMetrics: () => this.metrics.getSnapshot(),
+     getEvents: (opts) => this.metrics.getEvents(opts),
+     getActiveRequests: () => this.metrics.getActiveRequests(),
+     ```
+
+4. **Test the wiring:**
+   - In the existing `src/gateway/server.test.ts` or `src/gateway/handlers/handlers.test.ts`, verify that sending a message through agent.send increments the metrics counters. If the existing test infrastructure doesn't easily support this, at minimum verify through the type system that the wiring is correct.
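The start / success / error / end sequence described for `agent.send` can be sketched as a small wrapper. This is an illustration rather than the handler's actual shape: `MetricsLike` and `withRequestMetrics` are hypothetical names, and the `run` callback stands in for the call to `agent.process()`.

```typescript
// Hypothetical sketch: MetricsLike mirrors only the methods the plan names.
interface MetricsLike {
  startRequest(id: string, info: { sessionId: string; channel: string }): void;
  endRequest(id: string): void;
  incrementMessages(): void;
  incrementErrors(): void;
  recordEvent(event: {
    timestamp: number;
    level: 'info' | 'warn' | 'error';
    source: string;
    message: string;
    context?: Record<string, unknown>;
  }): void;
}

async function withRequestMetrics<T>(
  metrics: MetricsLike | undefined,
  requestId: string,
  laneId: string,
  run: () => Promise<T>,
): Promise<T> {
  metrics?.startRequest(requestId, { sessionId: laneId, channel: 'ws' });
  try {
    const result = await run();
    metrics?.incrementMessages(); // counted only after the work resolves
    return result;
  } catch (err) {
    // A thrown value is not guaranteed to be an Error, so narrow before reading .message.
    const message = err instanceof Error ? err.message : String(err);
    metrics?.incrementErrors();
    metrics?.recordEvent({
      timestamp: Date.now(),
      level: 'error',
      source: 'agent.send',
      message,
      context: { sessionId: laneId },
    });
    throw err; // let the handler's normal error path run
  } finally {
    metrics?.endRequest(requestId); // always clear the in-flight entry
  }
}
```

A fake `MetricsLike` that records method names is enough to check the call order in a unit test, which may be an easier route than driving a full WebSocket request through the server.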
+
+Run `pnpm test:run` and `pnpm typecheck`.
+  </action>
+  <verify>
+`pnpm test:run` — all tests pass (1077 existing + new metrics tests).
+`pnpm typecheck` — no type errors.
+`grep -r "metrics\." src/gateway/handlers/agent.ts` — confirms metrics calls in agent handler.
+  </verify>
+  <done>
+Agent request flow records: messagesProcessed counter, error counter, active request tracking, and error events.
+Tool failures are recorded as error events.
+System handlers return live metrics data from MetricsCollector.
+All tests pass, no type errors.
+  </done>
+</task>
+
+</tasks>
+
+<verification>
+1. `pnpm test:run` — all 1077+ tests pass
+2. `pnpm typecheck` — zero type errors
+3. New system.metrics, system.events, system.activeRequests RPC methods registered (check via getMethods())
+4. GET /health returns valid JSON with status, uptime, version fields
+5. MetricsCollector ring buffers enforce size limits
+</verification>
+
+<success_criteria>
+- MetricsCollector exists with counters, model call buffer, event buffer, active request tracking
+- Three new RPC handlers return metrics data
+- GET /health endpoint returns unauthenticated JSON health status
+- Agent request flow records messagesProcessed, errors, active requests, and error events
+- Zero test regressions, all new tests pass
+</success_criteria>
+
+<output>
+After completion, create `.planning/phases/03-live-ops-dashboard/03-01-SUMMARY.md`
+</output>
@@ -0,0 +1,260 @@
+---
+phase: 03-live-ops-dashboard
+plan: 02
+type: execute
+wave: 2
+depends_on: ["03-01"]
+files_modified:
+  - src/gateway/ui/pages/dashboard.js
+  - src/gateway/ui/style.css
+  - src/gateway/ui/index.html
+  - src/gateway/ui/lib/ws-client.js
+autonomous: false
+
+must_haves:
+  truths:
+    - "Dashboard shows live-updating counters: messages processed, active sessions, queue depth, daemon uptime — values change in real time"
+    - "Dashboard shows model call metrics: per-call latency, tokens/sec throughput, error rates by provider"
+    - "Dashboard shows live event stream: scrollable log of errors and events with timestamps, auto-scrolls on new entries"
+    - "Dashboard shows active request tracking: in-flight requests with duration and session info"
+    - "Dashboard auto-refreshes every 3 seconds for counters and events, maintaining live feel"
+  artifacts:
+    - path: "src/gateway/ui/pages/dashboard.js"
+      provides: "Enhanced dashboard page with metrics, events, and active request sections"
+      min_lines: 200
+    - path: "src/gateway/ui/style.css"
+      provides: "New CSS classes for event stream, metrics cards, active requests table"
+      contains: "event-stream"
+    - path: "src/gateway/ui/index.html"
+      provides: "Unchanged structure (dashboard page already registered)"
+    - path: "src/gateway/ui/lib/ws-client.js"
+      provides: "No changes needed (call() method already supports the new RPC methods)"
+  key_links:
+    - from: "src/gateway/ui/pages/dashboard.js"
+      to: "system.metrics"
+      via: "client.call('system.metrics')"
+      pattern: "client\\.call.*system\\.metrics"
+    - from: "src/gateway/ui/pages/dashboard.js"
+      to: "system.events"
+      via: "client.call('system.events')"
+      pattern: "client\\.call.*system\\.events"
+    - from: "src/gateway/ui/pages/dashboard.js"
+      to: "system.activeRequests"
+      via: "client.call('system.activeRequests')"
+      pattern: "client\\.call.*system\\.activeRequests"
+---
+
+<objective>
+Extend the existing vanilla JS dashboard with live ops sections: core counters, model call metrics, event stream, and active request tracking.
+
+Purpose: This is the user-facing deliverable — the operator opens the dashboard and sees real-time system health without tailing logs. All data comes from the RPC handlers created in Plan 01.
+
+Output: Enhanced dashboard.js with four new sections, supporting CSS, human-verified live dashboard.
+</objective>
+
+<execution_context>
+@/home/will/.config/opencode/get-shit-done/workflows/execute-plan.md
+@/home/will/.config/opencode/get-shit-done/templates/summary.md
+</execution_context>
+
+<context>
+@.planning/PROJECT.md
+@.planning/ROADMAP.md
+@.planning/STATE.md
+@.planning/phases/03-live-ops-dashboard/03-01-SUMMARY.md
+@src/gateway/ui/pages/dashboard.js
+@src/gateway/ui/style.css
+@src/gateway/ui/index.html
+@src/gateway/ui/app.js
+@src/gateway/ui/lib/ws-client.js
+</context>
+
+<tasks>
+
+<task type="auto">
+  <name>Task 1: Extend dashboard page with live ops sections</name>
+  <files>
+    src/gateway/ui/pages/dashboard.js
+    src/gateway/ui/style.css
+  </files>
+  <action>
+**IMPORTANT: Extend the existing vanilla JS dashboard — do NOT replace with React or any framework. This is a locked user decision.**
+
+Rewrite `src/gateway/ui/pages/dashboard.js` to show five sections: four new ones plus the existing channels grid (replacing the current simple health/channels/usage layout):
+
+**Section 1: Core Counters (top row of stat cards)**
+- Messages Processed (from `system.metrics` → messagesProcessed)
+- Active Sessions (from `system.health` → sessions)
+- Queue Depth (from `system.metrics` → queueDepth)
+- Daemon Uptime (from `system.metrics` → uptime, formatted as "Xd Xh Xm Xs")
+- Active Requests (from `system.metrics` → activeRequests)
+- Errors (from `system.metrics` → errors, colored red if > 0)
+
+Use the existing `.stats-grid` and `.stat-card` CSS classes.
+
+**Section 2: Model Performance (table of recent model calls)**
+- Show the most recent 20 model calls from `system.metrics` → modelCalls.recentCalls
+- Table columns: Time (relative, e.g. "3s ago"), Provider, Latency (ms), Tokens/sec, In/Out tokens, Status (✓ or ✗)
+- Summary row above the table: Total calls, Avg latency, Error rate %
+- Use existing table CSS classes
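The relative "Time" column (for example "3s ago") could be produced by a helper like the one below. `formatRelativeTime` is a hypothetical name, and the minute, hour, and day cut-offs are assumptions, since the plan only gives the seconds example. It is written in TypeScript to match the other sketches; dropping the type annotations leaves plain JavaScript suitable for `dashboard.js`.

```typescript
// Hypothetical helper: the bucket boundaries are assumptions.
function formatRelativeTime(timestampMs: number, nowMs: number = Date.now()): string {
  // Clamp to zero so small clock skew never renders a negative age.
  const seconds = Math.max(0, Math.floor((nowMs - timestampMs) / 1000));
  if (seconds < 60) return `${seconds}s ago`;
  const minutes = Math.floor(seconds / 60);
  if (minutes < 60) return `${minutes}m ago`;
  const hours = Math.floor(minutes / 60);
  if (hours < 24) return `${hours}h ago`;
  return `${Math.floor(hours / 24)}d ago`;
}
```

Passing `nowMs` explicitly keeps the function pure, so every row in one refresh can share the same reference time.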
+
+**Section 3: Event Stream (scrollable log)**
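+- Fetch from `system.events` with `{ limit: 50 }`
+- `system.events` returns newest first; reverse the list before rendering so the newest entry sits at the bottom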
+- Each event rendered as a row: `[HH:MM:SS] [LEVEL] source: message`
+- Color-code: error=red, warn=yellow, info=default
+- Container has max-height with overflow-y: auto and auto-scrolls to bottom on new entries
+- New class `.event-stream` for the container, `.event-row` for each entry, `.event-level-error`, `.event-level-warn`, `.event-level-info` for coloring
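One way to build each event row is to compute the text and CSS class separately and assign the text via `textContent`, which keeps event messages from being interpreted as HTML. The helper names below (`formatEventRow`, `toDisplayOrder`, `pad2`) are hypothetical, and the sketch assumes local time for `HH:MM:SS`. Since `getEvents` returns newest first and the container scrolls to the bottom, the list is reversed for display.

```typescript
// Hypothetical helpers: names and the local-time choice are assumptions.
type DashboardEventLevel = 'info' | 'warn' | 'error';

interface DashboardEvent {
  timestamp: number;
  level: DashboardEventLevel;
  source: string;
  message: string;
}

function pad2(n: number): string {
  return n < 10 ? `0${n}` : String(n);
}

// Produces `[HH:MM:SS] [LEVEL] source: message` plus the classes named in the plan.
function formatEventRow(e: DashboardEvent): { text: string; className: string } {
  const d = new Date(e.timestamp);
  const time = `${pad2(d.getHours())}:${pad2(d.getMinutes())}:${pad2(d.getSeconds())}`;
  return {
    text: `[${time}] [${e.level.toUpperCase()}] ${e.source}: ${e.message}`,
    className: `event-row event-level-${e.level}`,
  };
}

// Newest-first input becomes oldest-first output, so the latest row renders last.
function toDisplayOrder<T>(newestFirst: readonly T[]): T[] {
  return newestFirst.slice().reverse();
}
```

In the page, each result would typically become a `div` with `className` and `textContent` set from these fields, followed by `container.scrollTop = container.scrollHeight` to pin the view to the newest entry.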
+
+**Section 4: Active Requests (table shown only when requests are in flight)**
+- Fetch from `system.activeRequests`
+- Table columns: Session, Channel, Duration (live-updating), Started
+- If no active requests, show "No active requests" muted text
+- Use existing table CSS
+
+**Section 5: Channels (keep existing)**
+- Keep the existing channels grid showing connected/disconnected channel adapters
+
+**Refresh strategy:**
+- Replace the current 10-second interval with a 3-second interval for the core data (system.metrics, system.events, system.activeRequests)
+- Fetch system.health and system.channels every 10 seconds (less dynamic data)
+- Use `Promise.all` to batch the frequent calls together
+- Keep the existing `teardown()` pattern with `clearInterval`
+
+**Implementation approach:**
+- Keep the same module pattern: `loadDashboard(el, client)` function + `DashboardPage` export with `render`/`teardown`
+- Use two timers: `_fastTimer` (3s) for metrics/events/requests, `_slowTimer` (10s) for health/channels
+- On first render, fetch everything with `Promise.all`
+- On subsequent fast ticks, only update the dynamic sections (don't re-render the whole page — use targeted DOM updates via `getElementById` for each section)
+- Generate unique section IDs: `#ops-counters`, `#ops-model-table`, `#ops-events`, `#ops-requests`, `#ops-channels`
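The two-timer pattern with a single teardown could be sketched as below. `startRefresh` and its parameters are hypothetical; the real page keeps `_fastTimer` and `_slowTimer` on the module and fetches through `client.call`, so the callbacks here stand in for those section updates and are expected to start their own `Promise.all` fetches.

```typescript
// Hypothetical sketch of the fast/slow refresh loop; callbacks stand in for RPC fetches.
interface RefreshHandle {
  teardown(): void;
}

function startRefresh(
  refreshFast: () => void, // metrics, events, active requests
  refreshSlow: () => void, // health, channels
  fastMs: number = 3000,
  slowMs: number = 10000,
): RefreshHandle {
  // First render: fetch everything once before the timers take over.
  refreshFast();
  refreshSlow();

  let fastTimer: ReturnType<typeof setInterval> | undefined = setInterval(refreshFast, fastMs);
  let slowTimer: ReturnType<typeof setInterval> | undefined = setInterval(refreshSlow, slowMs);

  return {
    // Safe to call more than once; both timers are cleared on the first call.
    teardown(): void {
      if (fastTimer !== undefined) clearInterval(fastTimer);
      if (slowTimer !== undefined) clearInterval(slowTimer);
      fastTimer = undefined;
      slowTimer = undefined;
    },
  };
}
```

Returning a handle keeps both timer ids in one place, which makes it harder to clear one interval and forget the other when the page is torn down.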
+
+**CSS additions in `src/gateway/ui/style.css`:**
+Add near the end of the file, immediately before the responsive section:
+
+```css
+/* ── Event Stream ──────────────────────────────────────── */
+.event-stream {
+  max-height: 300px;
+  overflow-y: auto;
+  background-color: var(--bg-secondary);
+  border: 1px solid var(--border);
+  border-radius: var(--radius);
+  padding: 8px;
+  font-size: var(--font-size-sm);
+  font-family: var(--font-mono);
+}
+
+.event-row {
+  padding: 4px 8px;
+  border-bottom: 1px solid var(--border-light);
+  white-space: pre-wrap;
+  word-break: break-word;
+}
+
+.event-row:last-child {
+  border-bottom: none;
+}
+
+.event-level-error { color: var(--error); }
+.event-level-warn { color: var(--warning); }
+.event-level-info { color: var(--text-secondary); }
+
+/* ── Model Metrics Summary ─────────────────────────────── */
+.metrics-summary {
+  display: flex;
+  gap: 24px;
+  margin-bottom: 12px;
+  font-size: var(--font-size-sm);
+  color: var(--text-secondary);
+}
+
+.metrics-summary .metric {
+  display: flex;
+  gap: 6px;
+}
+
+.metrics-summary .metric-value {
+  font-weight: 600;
+  color: var(--text-primary);
+}
+```
+
+**Keep the formatUptime helper** — it already exists and works perfectly.
+
+**Avoid:** Do NOT add animations or transitions. Do NOT import external libraries. Do NOT re-render the whole page with a template-literal `innerHTML` assignment on the fast-update path; update only the affected elements (prefer `textContent`, and scope any `innerHTML` write to a single section) to avoid flicker.
+  </action>
+  <verify>
+`pnpm typecheck` — no type errors (vanilla JS won't affect this, but ensures no TS regressions).
+`pnpm build` — builds successfully (UI files are served as static assets, not compiled).
+Manual check: Open `src/gateway/ui/pages/dashboard.js` and verify it:
+  - Calls `client.call('system.metrics')`
+  - Calls `client.call('system.events')`
+  - Calls `client.call('system.activeRequests')`
+  - Has 3-second and 10-second refresh timers
+  - Has `teardown()` that cleans up both timers
+  </verify>
+  <done>
+Dashboard page shows five sections: core counters, model performance table, event stream, active requests, and channels.
+Counters and events refresh every 3 seconds.
+Health and channels refresh every 10 seconds.
+Event stream auto-scrolls and is color-coded by level.
+Active requests section shows in-flight requests or "no active requests" message.
+All existing stat-card and table CSS reused; new event-stream CSS added.
+  </done>
+</task>
+
+<task type="checkpoint:human-verify" gate="blocking">
+  <name>Task 2: Verify live dashboard in browser</name>
+  <files>src/gateway/ui/pages/dashboard.js</files>
+  <action>
+Human verification of the live dashboard. What was built:
+- Live ops dashboard with real-time metrics, event stream, model performance table, active request tracking, and HTTP /health endpoint
+- Extended the existing vanilla JS dashboard (no framework replacement)
+
+Steps to verify:
+1. Start Flynn: `pnpm dev`
+2. Open the dashboard in a browser (default: http://localhost:3100 or configured port)
+3. Verify the dashboard shows:
+   - Core counters row: Messages Processed, Active Sessions, Queue Depth, Uptime, Active Requests, Errors
+   - Model Performance section: table of recent model calls (may be empty if no messages sent yet)
+   - Event Stream section: scrollable log (may show startup events)
+   - Active Requests section: "No active requests" or table
+   - Channels section: connected channel adapters
+4. Send a message through the chat page (or via a connected channel) and verify:
+   - Messages Processed counter increments within 3 seconds
+   - Model Performance table shows the new call with latency and tokens/sec
+   - Event stream shows relevant entries
+5. Trigger an error (e.g., send a message that causes a tool error) and verify it appears in the event stream in red
+6. Test HTTP /health: `curl http://localhost:3100/health` — should return JSON with status, uptime, version
+7. Run `pnpm test:run` — all tests pass
+
+Resume signal: Type "approved" or describe issues.
+  </action>
+  <verify>Human confirms dashboard displays correctly and updates in real-time.</verify>
+  <done>Dashboard visually confirmed working with live-updating metrics, event stream, and model performance data.</done>
+</task>
+
+</tasks>
+
+<verification>
+1. Dashboard loads without errors in browser console
+2. All five sections render with real data
+3. Counters update within 3 seconds of events occurring
+4. Event stream is scrollable and color-coded
+5. `curl /health` returns valid JSON
+6. `pnpm test:run` — all tests pass
+7. `pnpm typecheck` — zero type errors
+</verification>
+
+<success_criteria>
+- Dashboard shows live-updating counters that change as messages flow (DASH-01)
+- Model call metrics visible with latency and tokens/sec (DASH-02)
+- Event stream shows errors with timestamps and context (DASH-03)
+- Active requests tracked and displayed (DASH-04)
+- GET /health returns JSON status (DASH-05)
+- Existing dashboard pages (chat, sessions, usage, settings) unaffected
+- Zero test regressions
+</success_criteria>
+
+<output>
+After completion, create `.planning/phases/03-live-ops-dashboard/03-02-SUMMARY.md`
+</output>