diff --git a/.planning/ROADMAP.md b/.planning/ROADMAP.md index af46fbf..b0471c1 100644 --- a/.planning/ROADMAP.md +++ b/.planning/ROADMAP.md @@ -64,6 +64,17 @@ Plans: **Requirements:** DASH-01, DASH-02, DASH-03, DASH-04, DASH-05 +**Plans:** 2 plans in 2 waves + +Plans: +- [ ] 03-01-PLAN.md — Backend metrics collector, RPC handlers, HTTP /health endpoint +- [ ] 03-02-PLAN.md — Dashboard UI with live counters, model metrics, event stream, active requests + +| Plan | Wave | Objective | Tasks | +|------|------|-----------|-------| +| 03-01 | 1 | MetricsCollector + RPC handlers + /health endpoint | 2 | +| 03-02 | 2 | Extend vanilla JS dashboard with live ops sections | 2 | + **Success Criteria:** 1. Opening the dashboard shows live-updating counters for messages processed, active sessions, queue depth, and uptime — values change in real time as messages flow 2. After sending a message through any channel, the model call appears in the dashboard with latency, tokens/sec, and provider name within seconds diff --git a/.planning/phases/03-live-ops-dashboard/03-01-PLAN.md b/.planning/phases/03-live-ops-dashboard/03-01-PLAN.md new file mode 100644 index 0000000..a5374b7 --- /dev/null +++ b/.planning/phases/03-live-ops-dashboard/03-01-PLAN.md @@ -0,0 +1,255 @@ +--- +phase: 03-live-ops-dashboard +plan: 01 +type: execute +wave: 1 +depends_on: [] +files_modified: + - src/gateway/metrics.ts + - src/gateway/metrics.test.ts + - src/gateway/handlers/system.ts + - src/gateway/server.ts + - src/daemon/services.ts +autonomous: true + +must_haves: + truths: + - "MetricsCollector accumulates counters (messages processed, errors) and model call metrics (latency, tokens/sec, provider)" + - "Gateway exposes system.metrics and system.events RPC methods returning accumulated data" + - "GET /health returns JSON with daemon status, uptime, and component readiness without WebSocket" + - "Errors and significant events are captured in a ring buffer accessible via RPC" + - "Active agent requests are 
tracked (in-flight count, tool executions, session IDs)" + artifacts: + - path: "src/gateway/metrics.ts" + provides: "MetricsCollector class — single source of truth for all ops metrics" + exports: ["MetricsCollector"] + - path: "src/gateway/metrics.test.ts" + provides: "Tests for MetricsCollector" + contains: "describe.*MetricsCollector" + - path: "src/gateway/handlers/system.ts" + provides: "New system.metrics, system.events, system.activeRequests RPC handlers" + contains: "system.metrics" + - path: "src/gateway/server.ts" + provides: "HTTP /health endpoint and MetricsCollector wiring" + contains: "/health" + - path: "src/daemon/services.ts" + provides: "MetricsCollector creation and wiring into gateway" + contains: "MetricsCollector" + key_links: + - from: "src/gateway/server.ts" + to: "src/gateway/metrics.ts" + via: "GatewayServer holds MetricsCollector ref, passes to handlers" + pattern: "metrics.*MetricsCollector" + - from: "src/gateway/handlers/system.ts" + to: "src/gateway/metrics.ts" + via: "System handlers read from MetricsCollector" + pattern: "getMetrics|getEvents|getActiveRequests" + - from: "src/daemon/services.ts" + to: "src/gateway/metrics.ts" + via: "createGateway instantiates MetricsCollector" + pattern: "new MetricsCollector" +--- + + +Create the metrics collection backend and wire it into the gateway server with new RPC handlers and an HTTP /health endpoint. + +Purpose: Provide the data layer that the dashboard UI (Plan 02) will consume. Without collected metrics, the dashboard has nothing to show beyond what system.health already provides. + +Output: MetricsCollector class, 3 new RPC methods (system.metrics, system.events, system.activeRequests), HTTP GET /health endpoint, tests. 
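The ring buffers named in the must-haves could be sketched roughly as follows. This is a minimal illustration, not the planned implementation: `RingBuffer`, `OpsEvent`, and `queryEvents` are hypothetical names, and only the event shape, FIFO eviction, and newest-first filtering come from the plan.

```typescript
// Hypothetical sketch; RingBuffer, OpsEvent and queryEvents are illustrative names.
type Level = 'info' | 'warn' | 'error';

interface OpsEvent {
  timestamp: number;
  level: Level;
  source: string;
  message: string;
}

// Fixed-capacity buffer: once full, each push evicts the oldest entry (FIFO).
class RingBuffer<T> {
  private items: T[] = [];
  private capacity: number;

  constructor(capacity: number) {
    this.capacity = capacity;
  }

  push(item: T): void {
    this.items.push(item);
    if (this.items.length > this.capacity) {
      this.items.shift(); // drop the oldest entry
    }
  }

  // Oldest-first copy, so callers cannot mutate internal state.
  toArray(): T[] {
    return [...this.items];
  }
}

// Newest-first view with an optional level filter and limit.
function queryEvents(
  buffer: RingBuffer<OpsEvent>,
  opts: { level?: Level; limit?: number } = {},
): OpsEvent[] {
  let result = buffer.toArray().reverse();
  if (opts.level !== undefined) {
    result = result.filter((e) => e.level === opts.level);
  }
  if (opts.limit !== undefined) {
    result = result.slice(0, opts.limit);
  }
  return result;
}
```

An array with `shift()` is O(n) per eviction, which is likely acceptable at the 200 and 500 entry caps the plan specifies; a circular index would be the alternative if the caps grew.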
+ + + +@/home/will/.config/opencode/get-shit-done/workflows/execute-plan.md +@/home/will/.config/opencode/get-shit-done/templates/summary.md + + + +@.planning/PROJECT.md +@.planning/ROADMAP.md +@.planning/STATE.md +@src/gateway/server.ts +@src/gateway/handlers/system.ts +@src/gateway/handlers/index.ts +@src/gateway/protocol.ts +@src/gateway/router.ts +@src/gateway/handlers/agent.ts +@src/gateway/session-bridge.ts +@src/daemon/services.ts +@src/daemon/index.ts + + + + + + Task 1: Create MetricsCollector and wire into gateway + + src/gateway/metrics.ts + src/gateway/metrics.test.ts + src/gateway/server.ts + src/gateway/handlers/system.ts + src/gateway/handlers/index.ts + src/daemon/services.ts + + +Create `src/gateway/metrics.ts` with a `MetricsCollector` class that tracks: + +**Counters (simple incrementing numbers):** +- `messagesProcessed` — incremented each time an agent.send completes (success or error) +- `errors` — incremented on agent.send errors and any other recorded errors +- `activeRequests` — gauge (increment on start, decrement on end) + +**Model call metrics (ring buffer of recent calls, max 200 entries):** +Each entry: `{ timestamp: number, provider: string, latency: number, inputTokens: number, outputTokens: number, tokensPerSec: number, error?: string }` +- `recordModelCall(entry)` — push to ring buffer +- `getModelMetrics()` — return the array + +**Event stream (ring buffer of recent events, max 500 entries):** +Each entry: `{ timestamp: number, level: 'info' | 'warn' | 'error', source: string, message: string, context?: Record }` +- `recordEvent(event)` — push to ring buffer +- `getEvents(opts?: { level?: string, limit?: number })` — return filtered/limited array (newest first) + +**Active request tracking:** +- `startRequest(id: string, info: { sessionId: string, channel: string })` — records start time + info +- `endRequest(id: string)` — removes from active map +- `getActiveRequests()` — returns array of `{ id, sessionId, channel, startedAt, 
durationMs }` + +**Snapshot method:** +- `getSnapshot()` — returns `{ messagesProcessed, errors, activeRequests: number, uptime: number, modelCalls: { total, avgLatency, errorRate, recentCalls }, queueDepth: number }` +- Accept a `getQueueDepth` callback in constructor for LaneQueue integration + +The class should be simple, synchronous (no async), and have NO external dependencies beyond Node.js builtins. Export the class and all relevant types. + +**Wire MetricsCollector into the gateway:** + +1. In `src/gateway/server.ts`: + - Add `metrics?: MetricsCollector` to `GatewayServerConfig` interface + - Store the metrics instance on the GatewayServer class + - In `handleHttpRequest`, add a handler for `GET /health` BEFORE the auth check (health endpoint should be unauthenticated for Docker HEALTHCHECK). Return JSON: `{ status: 'ok', uptime: , version: , sessions: , connections: , tools: , channels: }`. Use the same data sources as `system.health` RPC handler. Set `Content-Type: application/json`. + - In the agent.send flow: the GatewayServer doesn't handle agent.send directly (it's in the handler), so instead expose `getMetrics()` accessor on GatewayServer so handlers can access it. + +2. In `src/gateway/handlers/system.ts`: + - Add `getMetrics?: () => { messagesProcessed: number, errors: number, activeRequests: number, uptime: number, modelCalls: { total: number, avgLatency: number, errorRate: number, recentCalls: unknown[] }, queueDepth: number }` to `SystemHandlerDeps` + - Add `getEvents?: () => unknown[]` and `getActiveRequests?: () => unknown[]` to `SystemHandlerDeps` + - Add `system.metrics` handler: returns `getMetrics()` snapshot + - Add `system.events` handler: returns `getEvents()` with optional `level` and `limit` params + - Add `system.activeRequests` handler: returns `getActiveRequests()` array + - Update the re-exports in `src/gateway/handlers/index.ts` if any new types need exporting + +3. 
In `src/gateway/handlers/agent.ts` (or via a wrapper in server.ts): + - The metrics recording for agent.send happens naturally. In `src/gateway/server.ts`, when registering handlers, wrap the system handlers construction to pass the metrics callbacks. The MetricsCollector is NOT directly imported by the agent handler; instead, the GatewayServer passes metrics callbacks via SystemHandlerDeps. For request tracking in agent.send, add `onRequestStart` and `onRequestEnd` callbacks to `AgentHandlerDeps` so the server can hook MetricsCollector in. + +4. In `src/daemon/services.ts` (three options were considered; use the last one): + - Option A (rejected): in `createGateway()`, instantiate `new MetricsCollector({ getQueueDepth: () => 0 })` (queue depth from LaneQueue is internal to GatewayServer, so it would have to be wired there) and pass it to the GatewayServer config as `metrics`. + - Option B (rejected): let GatewayServer create the MetricsCollector itself in its constructor using its own LaneQueue. This keeps it self-contained. In `GatewayServerConfig`, just add `metricsEnabled?: boolean` (default true). The GatewayServer constructor creates `this.metrics = new MetricsCollector({ getQueueDepth: () => this.laneQueue.totalPending() })`. + - Both B and the chosen approach need a `totalPending()` method on LaneQueue that sums all queue lengths across lanes. + +Chosen approach, which avoids changing too many files: +- Create MetricsCollector in the GatewayServer constructor (it already has LaneQueue). No config change is needed in services.ts. +- GatewayServer passes metrics callbacks to system handler deps and agent handler deps. +- This keeps the metrics concern entirely within the gateway module. + +Use this simpler approach. Changes to `src/daemon/services.ts` are minimal or unnecessary; just ensure the GatewayServer starts collecting metrics automatically. + +**Update LaneQueue** (`src/gateway/lane-queue.ts`): +- Add a `totalPending(): number` method that returns the sum of all lane queue lengths (iterate over lanes, sum queue.length).
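A rough sketch of the `totalPending()` idea and the `getQueueDepth` callback wiring, under stated assumptions: the real LaneQueue's internal storage is not shown in this plan, so `LaneQueueSketch` (a map of lane id to array) and `QueueDepthReader` are hypothetical stand-ins rather than the actual classes.

```typescript
// Hypothetical stand-in for the real LaneQueue; its internal storage is an assumption.
class LaneQueueSketch<T> {
  private lanes = new Map<string, T[]>();

  enqueue(laneId: string, item: T): void {
    const queue = this.lanes.get(laneId) ?? [];
    queue.push(item);
    this.lanes.set(laneId, queue);
  }

  dequeue(laneId: string): T | undefined {
    return this.lanes.get(laneId)?.shift();
  }

  // Sum of pending items across all lanes, as the plan describes.
  totalPending(): number {
    let total = 0;
    this.lanes.forEach((queue) => {
      total += queue.length;
    });
    return total;
  }
}

// Hypothetical slice of a collector: it only sees a callback, never the queue itself.
class QueueDepthReader {
  private getQueueDepth: () => number;

  constructor(opts: { getQueueDepth: () => number }) {
    this.getQueueDepth = opts.getQueueDepth;
  }

  queueDepth(): number {
    return this.getQueueDepth(); // read lazily so the value is current at snapshot time
  }
}
```

Passing a callback rather than the queue keeps the collector free of gateway imports, which matches the plan's requirement that MetricsCollector have no dependencies beyond Node.js builtins.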
+ +**Tests in `src/gateway/metrics.test.ts`:** +- Test counter increment/decrement +- Test model call ring buffer (max 200, FIFO eviction) +- Test event ring buffer (max 500, FIFO eviction, filtering by level) +- Test active request tracking (start, end, duration calculation) +- Test getSnapshot returns correct shape +- Test getEvents with level filter and limit + +Run `pnpm test:run` to verify zero regressions plus new tests pass. +Run `pnpm typecheck` to verify no type errors. + + +`pnpm test:run` — all existing 1077 tests pass plus new metrics tests pass. +`pnpm typecheck` — no type errors. +`grep -r "system.metrics\|system.events\|system.activeRequests" src/gateway/handlers/system.ts` — confirms new RPC methods exist. +`grep -r "GET.*health\|/health" src/gateway/server.ts` — confirms HTTP health endpoint exists. + + +MetricsCollector created with counters, model call ring buffer, event ring buffer, and active request tracking. +Three new RPC handlers registered (system.metrics, system.events, system.activeRequests). +GET /health returns unauthenticated JSON health status. +LaneQueue has totalPending() method. +All tests pass with zero regressions. + + + + + Task 2: Hook metrics recording into agent request flow + + src/gateway/server.ts + src/gateway/handlers/agent.ts + src/gateway/lane-queue.ts + + +Wire the MetricsCollector into the actual agent request flow so metrics are populated with real data as messages flow through the system. + +1. **In `src/gateway/server.ts` registerHandlers():** + - When creating agent handlers, pass `onRequestStart` and `onRequestEnd` callbacks that call `this.metrics.startRequest()` and `this.metrics.endRequest()` respectively. + - When creating agent handlers, pass `onRequestComplete` callback that calls `this.metrics.incrementMessages()` and optionally `this.metrics.incrementErrors()` on error. + - Pass `onModelCall` callback that the agent handler can call with latency/token data. 
+ - Actually, the simpler pattern: pass the MetricsCollector instance directly to agent handler deps: `metrics?: MetricsCollector`. The agent handler can then call the methods directly. This is cleaner than a bag of callbacks. + +2. **In `src/gateway/handlers/agent.ts`:** + - Add `metrics?: MetricsCollector` to `AgentHandlerDeps` (import type from `../metrics.js`) + - In `agent.send` handler: + - At start: `const requestId = request.id.toString(); deps.metrics?.startRequest(requestId, { sessionId: laneId, channel: 'ws' });` + - In the `try` block after `agent.process()` resolves: `deps.metrics?.incrementMessages();` + - In the `catch` block: `deps.metrics?.incrementErrors(); deps.metrics?.recordEvent({ timestamp: Date.now(), level: 'error', source: 'agent.send', message: err.message || 'Unknown error', context: { sessionId: laneId } });` + - In `finally`: `deps.metrics?.endRequest(requestId);` + - For tool use events, record them to metrics: when `event.type === 'end'` and `event.result` and `!event.result.success`, increment error counter and record error event. + +3. **In `src/gateway/server.ts` registerHandlers():** + - Pass `metrics: this.metrics` when constructing agent handler deps. + - Update the system handlers construction to pass the metrics accessors: + ``` + getMetrics: () => this.metrics.getSnapshot(), + getEvents: (opts) => this.metrics.getEvents(opts), + getActiveRequests: () => this.metrics.getActiveRequests(), + ``` + +4. **Test the wiring:** + - In the existing `src/gateway/server.test.ts` or `src/gateway/handlers/handlers.test.ts`, verify that sending a message through agent.send increments the metrics counters. If the existing test infrastructure doesn't easily support this, at minimum verify through the type system that the wiring is correct. + +Run `pnpm test:run` and `pnpm typecheck`. + + +`pnpm test:run` — all tests pass (1077 existing + new metrics tests). +`pnpm typecheck` — no type errors. +`grep -r "metrics\." 
src/gateway/handlers/agent.ts` — confirms metrics calls in agent handler. + + +Agent request flow records: messagesProcessed counter, error counter, active request tracking, and error events. +Tool failures are recorded as error events. +System handlers return live metrics data from MetricsCollector. +All tests pass, no type errors. + + + + + + +1. `pnpm test:run` — all 1077+ tests pass +2. `pnpm typecheck` — zero type errors +3. New system.metrics, system.events, system.activeRequests RPC methods registered (check via getMethods()) +4. GET /health returns valid JSON with status, uptime, version fields +5. MetricsCollector ring buffers enforce size limits + + + +- MetricsCollector exists with counters, model call buffer, event buffer, active request tracking +- Three new RPC handlers return metrics data +- GET /health endpoint returns unauthenticated JSON health status +- Agent request flow records messagesProcessed, errors, active requests, and error events +- Zero test regressions, all new tests pass + + + +After completion, create `.planning/phases/03-live-ops-dashboard/03-01-SUMMARY.md` + diff --git a/.planning/phases/03-live-ops-dashboard/03-02-PLAN.md b/.planning/phases/03-live-ops-dashboard/03-02-PLAN.md new file mode 100644 index 0000000..57b2cbe --- /dev/null +++ b/.planning/phases/03-live-ops-dashboard/03-02-PLAN.md @@ -0,0 +1,260 @@ +--- +phase: 03-live-ops-dashboard +plan: 02 +type: execute +wave: 2 +depends_on: ["03-01"] +files_modified: + - src/gateway/ui/pages/dashboard.js + - src/gateway/ui/style.css + - src/gateway/ui/index.html + - src/gateway/ui/lib/ws-client.js +autonomous: false + +must_haves: + truths: + - "Dashboard shows live-updating counters: messages processed, active sessions, queue depth, daemon uptime — values change in real time" + - "Dashboard shows model call metrics: per-call latency, tokens/sec throughput, error rates by provider" + - "Dashboard shows live event stream: scrollable log of errors and events with timestamps, 
auto-scrolls on new entries" + - "Dashboard shows active request tracking: in-flight requests with duration and session info" + - "Dashboard auto-refreshes every 3 seconds for counters and events, maintaining live feel" + artifacts: + - path: "src/gateway/ui/pages/dashboard.js" + provides: "Enhanced dashboard page with metrics, events, and active request sections" + min_lines: 200 + - path: "src/gateway/ui/style.css" + provides: "New CSS classes for event stream, metrics cards, active requests table" + contains: "event-stream" + - path: "src/gateway/ui/index.html" + provides: "Unchanged structure (dashboard page already registered)" + - path: "src/gateway/ui/lib/ws-client.js" + provides: "No changes needed (call() method already supports the new RPC methods)" + key_links: + - from: "src/gateway/ui/pages/dashboard.js" + to: "system.metrics" + via: "client.call('system.metrics')" + pattern: "client\\.call.*system\\.metrics" + - from: "src/gateway/ui/pages/dashboard.js" + to: "system.events" + via: "client.call('system.events')" + pattern: "client\\.call.*system\\.events" + - from: "src/gateway/ui/pages/dashboard.js" + to: "system.activeRequests" + via: "client.call('system.activeRequests')" + pattern: "client\\.call.*system\\.activeRequests" +--- + + +Extend the existing vanilla JS dashboard with live ops sections: core counters, model call metrics, event stream, and active request tracking. + +Purpose: This is the user-facing deliverable — the operator opens the dashboard and sees real-time system health without tailing logs. All data comes from the RPC handlers created in Plan 01. + +Output: Enhanced dashboard.js with four new sections, supporting CSS, human-verified live dashboard. 
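For the event stream rows, the formatting could look something like the sketch below. It is written in TypeScript for consistency with the backend examples; the dashboard itself is vanilla JS, so the type annotations would be dropped there. `formatEventRow` and `eventRowClass` are hypothetical helper names, and rendering the time in UTC is an assumption made so the output is deterministic (the plan does not say which timezone to use).

```typescript
// Hypothetical helpers; only the row layout and CSS class names come from the plan.
type EventLevel = 'info' | 'warn' | 'error';

interface DashboardEvent {
  timestamp: number; // epoch milliseconds
  level: EventLevel;
  source: string;
  message: string;
}

function pad2(n: number): string {
  return String(n).padStart(2, '0');
}

// Renders "[HH:MM:SS] [LEVEL] source: message" (UTC is an assumption).
function formatEventRow(event: DashboardEvent): string {
  const d = new Date(event.timestamp);
  const time = `${pad2(d.getUTCHours())}:${pad2(d.getUTCMinutes())}:${pad2(d.getUTCSeconds())}`;
  return `[${time}] [${event.level.toUpperCase()}] ${event.source}: ${event.message}`;
}

// Maps a level to the CSS classes used for color-coding.
function eventRowClass(level: EventLevel): string {
  return `event-row event-level-${level}`;
}
```

Assigning the formatted string via `textContent` rather than `innerHTML` would also avoid injecting markup from event messages.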
+ + + +@/home/will/.config/opencode/get-shit-done/workflows/execute-plan.md +@/home/will/.config/opencode/get-shit-done/templates/summary.md + + + +@.planning/PROJECT.md +@.planning/ROADMAP.md +@.planning/STATE.md +@.planning/phases/03-live-ops-dashboard/03-01-SUMMARY.md +@src/gateway/ui/pages/dashboard.js +@src/gateway/ui/style.css +@src/gateway/ui/index.html +@src/gateway/ui/app.js +@src/gateway/ui/lib/ws-client.js + + + + + + Task 1: Extend dashboard page with live ops sections + + src/gateway/ui/pages/dashboard.js + src/gateway/ui/style.css + + +**IMPORTANT: Extend the existing vanilla JS dashboard; do NOT replace it with React or any framework. This is a locked user decision.** + +Rewrite `src/gateway/ui/pages/dashboard.js` to show five sections (four new ones plus the existing channels grid, replacing the current simple health/channels/usage layout): + +**Section 1: Core Counters (top row of stat cards)** +- Messages Processed (from `system.metrics` → messagesProcessed) +- Active Sessions (from `system.health` → sessions) +- Queue Depth (from `system.metrics` → queueDepth) +- Daemon Uptime (from `system.metrics` → uptime, formatted as "Xd Xh Xm Xs") +- Active Requests (from `system.metrics` → activeRequests) +- Errors (from `system.metrics` → errors, colored red if > 0) + +Use the existing `.stats-grid` and `.stat-card` CSS classes. + +**Section 2: Model Performance (table of recent model calls)** +- Show the most recent 20 model calls from `system.metrics` → modelCalls.recentCalls +- Table columns: Time (relative, e.g.
"3s ago"), Provider, Latency (ms), Tokens/sec, In/Out tokens, Status (✓ or ✗) +- Summary row above the table: Total calls, Avg latency, Error rate % +- Use existing table CSS classes + +**Section 3: Event Stream (scrollable log)** +- Fetch from `system.events` with `{ limit: 50 }` +- Each event rendered as a row: `[HH:MM:SS] [LEVEL] source: message` +- Color-code: error=red, warn=yellow, info=default +- Container has max-height with overflow-y: auto and auto-scrolls to bottom on new entries +- New class `.event-stream` for the container, `.event-row` for each entry, `.event-level-error`, `.event-level-warn`, `.event-level-info` for coloring + +**Section 4: Active Requests (table, only shown when requests in flight)** +- Fetch from `system.activeRequests` +- Table columns: Session, Channel, Duration (live-updating), Started +- If no active requests, show "No active requests" muted text +- Use existing table CSS + +**Section 5: Channels (keep existing)** +- Keep the existing channels grid showing connected/disconnected channel adapters + +**Refresh strategy:** +- Replace the current 10-second interval with a 3-second interval for the core data (system.metrics, system.events, system.activeRequests) +- Fetch system.health and system.channels every 10 seconds (less dynamic data) +- Use `Promise.all` to batch the frequent calls together +- Keep the existing `teardown()` pattern with `clearInterval` + +**Implementation approach:** +- Keep the same module pattern: `loadDashboard(el, client)` function + `DashboardPage` export with `render`/`teardown` +- Use two timers: `_fastTimer` (3s) for metrics/events/requests, `_slowTimer` (10s) for health/channels +- On first render, fetch everything with `Promise.all` +- On subsequent fast ticks, only update the dynamic sections (don't re-render the whole page — use targeted DOM updates via `getElementById` for each section) +- Generate unique section IDs: `#ops-counters`, `#ops-model-table`, `#ops-events`, `#ops-requests`, 
`#ops-channels` + +**CSS additions in `src/gateway/ui/style.css`:** +Add at the end of the file (before the responsive section): + +```css +/* ── Event Stream ──────────────────────────────────────── */ +.event-stream { + max-height: 300px; + overflow-y: auto; + background-color: var(--bg-secondary); + border: 1px solid var(--border); + border-radius: var(--radius); + padding: 8px; + font-size: var(--font-size-sm); + font-family: var(--font-mono); +} + +.event-row { + padding: 4px 8px; + border-bottom: 1px solid var(--border-light); + white-space: pre-wrap; + word-break: break-word; +} + +.event-row:last-child { + border-bottom: none; +} + +.event-level-error { color: var(--error); } +.event-level-warn { color: var(--warning); } +.event-level-info { color: var(--text-secondary); } + +/* ── Model Metrics Summary ─────────────────────────────── */ +.metrics-summary { + display: flex; + gap: 24px; + margin-bottom: 12px; + font-size: var(--font-size-sm); + color: var(--text-secondary); +} + +.metrics-summary .metric { + display: flex; + gap: 6px; +} + +.metrics-summary .metric-value { + font-weight: 600; + color: var(--text-primary); +} +``` + +**Keep the formatUptime helper** — it already exists and works perfectly. + +**Avoid:** Do NOT add animations or transitions. Do NOT import external libraries. Do NOT use template literals with innerHTML for the fast-update path — use targeted textContent/innerHTML updates on specific elements to avoid flicker. + + +`pnpm typecheck` — no type errors (vanilla JS won't affect this, but ensures no TS regressions). +`pnpm build` — builds successfully (UI files are served as static assets, not compiled). 
+Manual check: Open `src/gateway/ui/pages/dashboard.js` and verify it: + - Calls `client.call('system.metrics')` + - Calls `client.call('system.events')` + - Calls `client.call('system.activeRequests')` + - Has 3-second and 10-second refresh timers + - Has `teardown()` that cleans up both timers + + +Dashboard page shows five sections: core counters, model performance table, event stream, active requests, and channels. +Counters and events refresh every 3 seconds. +Health and channels refresh every 10 seconds. +Event stream auto-scrolls and is color-coded by level. +Active requests section shows in-flight requests or "no active requests" message. +All existing stat-card and table CSS reused; new event-stream CSS added. + + + + + Task 2: Verify live dashboard in browser + src/gateway/ui/pages/dashboard.js + +Human verification of the live dashboard. What was built: +- Live ops dashboard with real-time metrics, event stream, model performance table, active request tracking, and HTTP /health endpoint +- Extended the existing vanilla JS dashboard (no framework replacement) + +Steps to verify: +1. Start Flynn: `pnpm dev` +2. Open the dashboard in a browser (default: http://localhost:3100 or configured port) +3. Verify the dashboard shows: + - Core counters row: Messages Processed, Active Sessions, Queue Depth, Uptime, Active Requests, Errors + - Model Performance section: table of recent model calls (may be empty if no messages sent yet) + - Event Stream section: scrollable log (may show startup events) + - Active Requests section: "No active requests" or table + - Channels section: connected channel adapters +4. Send a message through the chat page (or via a connected channel) and verify: + - Messages Processed counter increments within 3 seconds + - Model Performance table shows the new call with latency and tokens/sec + - Event stream shows relevant entries +5. 
Trigger an error (e.g., send a message that causes a tool error) and verify it appears in the event stream in red +6. Test HTTP /health: `curl http://localhost:3100/health` — should return JSON with status, uptime, version +7. Run `pnpm test:run` — all tests pass + +Resume signal: Type "approved" or describe issues. + + Human confirms dashboard displays correctly and updates in real-time. + Dashboard visually confirmed working with live-updating metrics, event stream, and model performance data. + + + + + +1. Dashboard loads without errors in browser console +2. All five sections render with real data +3. Counters update within 3 seconds of events occurring +4. Event stream is scrollable and color-coded +5. `curl /health` returns valid JSON +6. `pnpm test:run` — all tests pass +7. `pnpm typecheck` — zero type errors + + + +- Dashboard shows live-updating counters that change as messages flow (DASH-01) +- Model call metrics visible with latency and tokens/sec (DASH-02) +- Event stream shows errors with timestamps and context (DASH-03) +- Active requests tracked and displayed (DASH-04) +- GET /health returns JSON status (DASH-05) +- Existing dashboard pages (chat, sessions, usage, settings) unaffected +- Zero test regressions + + + +After completion, create `.planning/phases/03-live-ops-dashboard/03-02-SUMMARY.md` +