flynn/.planning/phases/03-live-ops-dashboard/03-02-PLAN.md
2026-02-09 21:10:03 -08:00

---
phase: 03-live-ops-dashboard
plan: 02
type: execute
wave: 2
depends_on: ["03-01"]
files_modified:
- src/gateway/ui/pages/dashboard.js
- src/gateway/ui/style.css
- src/gateway/ui/index.html
- src/gateway/ui/lib/ws-client.js
autonomous: false
must_haves:
  truths:
    - "Dashboard shows live-updating counters: messages processed, active sessions, queue depth, daemon uptime; values change in real time"
    - "Dashboard shows model call metrics: per-call latency, tokens/sec throughput, error rates by provider"
    - "Dashboard shows live event stream: scrollable log of errors and events with timestamps, auto-scrolls on new entries"
    - "Dashboard shows active request tracking: in-flight requests with duration and session info"
    - "Dashboard auto-refreshes every 3 seconds for counters and events, maintaining live feel"
  artifacts:
    - path: "src/gateway/ui/pages/dashboard.js"
      provides: "Enhanced dashboard page with metrics, events, and active request sections"
      min_lines: 200
    - path: "src/gateway/ui/style.css"
      provides: "New CSS classes for event stream, metrics cards, active requests table"
      contains: "event-stream"
    - path: "src/gateway/ui/index.html"
      provides: "Unchanged structure (dashboard page already registered)"
    - path: "src/gateway/ui/lib/ws-client.js"
      provides: "No changes needed (call() method already supports the new RPC methods)"
  key_links:
    - from: "src/gateway/ui/pages/dashboard.js"
      to: "system.metrics"
      via: "client.call('system.metrics')"
      pattern: "client\\.call.*system\\.metrics"
    - from: "src/gateway/ui/pages/dashboard.js"
      to: "system.events"
      via: "client.call('system.events')"
      pattern: "client\\.call.*system\\.events"
    - from: "src/gateway/ui/pages/dashboard.js"
      to: "system.activeRequests"
      via: "client.call('system.activeRequests')"
      pattern: "client\\.call.*system\\.activeRequests"
---
<objective>
Extend the existing vanilla JS dashboard with live ops sections: core counters, model call metrics, event stream, and active request tracking.
Purpose: This is the user-facing deliverable — the operator opens the dashboard and sees real-time system health without tailing logs. All data comes from the RPC handlers created in Plan 01.
Output: Enhanced dashboard.js with four new sections, supporting CSS, human-verified live dashboard.
</objective>
<execution_context>
@/home/will/.config/opencode/get-shit-done/workflows/execute-plan.md
@/home/will/.config/opencode/get-shit-done/templates/summary.md
</execution_context>
<context>
@.planning/PROJECT.md
@.planning/ROADMAP.md
@.planning/STATE.md
@.planning/phases/03-live-ops-dashboard/03-01-SUMMARY.md
@src/gateway/ui/pages/dashboard.js
@src/gateway/ui/style.css
@src/gateway/ui/index.html
@src/gateway/ui/app.js
@src/gateway/ui/lib/ws-client.js
</context>
<tasks>
<task type="auto">
<name>Task 1: Extend dashboard page with live ops sections</name>
<files>
src/gateway/ui/pages/dashboard.js
src/gateway/ui/style.css
</files>
<action>
**IMPORTANT: Extend the existing vanilla JS dashboard — do NOT replace with React or any framework. This is a locked user decision.**
Rewrite `src/gateway/ui/pages/dashboard.js` to show five sections, four new ones plus the existing channels grid (replacing the current simple health/channels/usage layout):
**Section 1: Core Counters (top row of stat cards)**
- Messages Processed (from `system.metrics` → messagesProcessed)
- Active Sessions (from `system.health` → sessions; this value refreshes on the 10-second timer, not the 3-second one)
- Queue Depth (from `system.metrics` → queueDepth)
- Daemon Uptime (from `system.metrics` → uptime, formatted as "Xd Xh Xm Xs")
- Active Requests (from `system.metrics` → activeRequests)
- Errors (from `system.metrics` → errors, colored red if > 0)
Use the existing `.stats-grid` and `.stat-card` CSS classes.
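The mapping from the two RPC payloads to the six stat cards could look roughly like the sketch below. It is an illustration under assumptions, not the required implementation: `buildCounters` is a hypothetical name, the payload shapes are guessed from the field names listed above, `errors` is assumed to be a number, and `uptime` is assumed to be in seconds. The `formatUptime` shown here is only a stand-in for the helper that already exists in `dashboard.js`.

```javascript
// Stand-in for the existing formatUptime helper (ASSUMPTION: uptime is in seconds).
function formatUptime(totalSeconds) {
  const s = Math.max(0, Math.floor(totalSeconds));
  const days = Math.floor(s / 86400);
  const hours = Math.floor((s % 86400) / 3600);
  const minutes = Math.floor((s % 3600) / 60);
  const seconds = s % 60;
  return `${days}d ${hours}h ${minutes}m ${seconds}s`;
}

// Hypothetical helper: turn the system.metrics and system.health payloads
// into one view-model entry per stat card. Missing fields fall back to 0.
function buildCounters(metrics, health) {
  const errors = metrics.errors ?? 0; // ASSUMPTION: errors is a numeric count
  return [
    { label: 'Messages Processed', value: String(metrics.messagesProcessed ?? 0), isError: false },
    { label: 'Active Sessions', value: String(health.sessions ?? 0), isError: false },
    { label: 'Queue Depth', value: String(metrics.queueDepth ?? 0), isError: false },
    { label: 'Daemon Uptime', value: formatUptime(metrics.uptime ?? 0), isError: false },
    { label: 'Active Requests', value: String(metrics.activeRequests ?? 0), isError: false },
    // The Errors card is flagged so the renderer can color it red when > 0.
    { label: 'Errors', value: String(errors), isError: errors > 0 },
  ];
}
```

Keeping this step free of DOM access makes it easy to check in isolation; the renderer then only has to write each `value` into its `.stat-card`.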
**Section 2: Model Performance (table of recent model calls)**
- Show the most recent 20 model calls from `system.metrics` → modelCalls.recentCalls
- Table columns: Time (relative, e.g. "3s ago"), Provider, Latency (ms), Tokens/sec, In/Out tokens, Status (✓ or ✗)
- Summary row above the table: Total calls, Avg latency, Error rate %
- Use existing table CSS classes
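The summary row and the derived table columns can be computed with small pure functions, sketched below. The per-call field names (`latencyMs`, `outputTokens`, `ok`) are assumptions, since the shape of `modelCalls.recentCalls` is defined by Plan 01 and not shown here. The summary is computed over whatever calls are passed in; if `system.metrics` already returns aggregate totals, those should be preferred.

```javascript
// Hypothetical helper: summary row values for a list of model calls.
// ASSUMPTION: each call has numeric latencyMs and a boolean ok.
function summarizeCalls(calls) {
  const total = calls.length;
  if (total === 0) return { total: 0, avgLatencyMs: 0, errorRatePct: 0 };
  const latencySum = calls.reduce((sum, c) => sum + c.latencyMs, 0);
  const failures = calls.filter((c) => !c.ok).length;
  return {
    total,
    avgLatencyMs: Math.round(latencySum / total),
    // One decimal place, e.g. 12.5 (%).
    errorRatePct: Math.round((failures / total) * 1000) / 10,
  };
}

// Tokens/sec for one call, rounded to one decimal place.
// ASSUMPTION: throughput is output tokens divided by total call latency.
function tokensPerSecond(call) {
  if (!call.latencyMs || call.latencyMs <= 0) return 0;
  return Math.round((call.outputTokens / call.latencyMs) * 10000) / 10;
}

// Relative time label such as "3s ago"; both arguments are epoch milliseconds.
function relativeTime(timestampMs, nowMs) {
  const seconds = Math.max(0, Math.floor((nowMs - timestampMs) / 1000));
  if (seconds < 60) return `${seconds}s ago`;
  if (seconds < 3600) return `${Math.floor(seconds / 60)}m ago`;
  return `${Math.floor(seconds / 3600)}h ago`;
}
```

Passing `nowMs` explicitly, instead of calling `Date.now()` inside, keeps the label deterministic for a given tick.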
**Section 3: Event Stream (scrollable log)**
- Fetch from `system.events` with `{ limit: 50 }`
- Each event rendered as a row: `[HH:MM:SS] [LEVEL] source: message`
- Color-code: error=red, warn=yellow, info=default
- Container has max-height with overflow-y: auto and auto-scrolls to bottom on new entries
- New class `.event-stream` for the container, `.event-row` for each entry, `.event-level-error`, `.event-level-warn`, `.event-level-info` for coloring
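A minimal sketch of the row formatting and auto-scroll follows. The event fields (`timestamp` in epoch milliseconds, `level`, `source`, `message`) are assumptions about what `system.events` returns, and `formatEventRow` and `scrollToBottom` are hypothetical names. Unknown levels fall back to the `info` styling so a new level on the server side does not produce an unstyled class.

```javascript
const KNOWN_LEVELS = new Set(['error', 'warn', 'info']);

function pad2(n) {
  return String(n).padStart(2, '0');
}

// Build the display text and CSS classes for one event.
// ASSUMPTION: event = { timestamp: epoch ms, level, source, message }.
function formatEventRow(event) {
  const d = new Date(event.timestamp);
  const time = `${pad2(d.getHours())}:${pad2(d.getMinutes())}:${pad2(d.getSeconds())}`;
  const level = KNOWN_LEVELS.has(event.level) ? event.level : 'info';
  return {
    className: `event-row event-level-${level}`,
    text: `[${time}] [${level.toUpperCase()}] ${event.source}: ${event.message}`,
  };
}

// Pin the scroll position to the newest entry. Works on any element-like
// object exposing scrollTop and scrollHeight.
function scrollToBottom(container) {
  container.scrollTop = container.scrollHeight;
}
```

Assigning `text` through `textContent` rather than `innerHTML` means event messages that happen to contain markup are shown literally instead of being parsed.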
**Section 4: Active Requests (table, only shown when requests in flight)**
- Fetch from `system.activeRequests`
- Table columns: Session, Channel, Duration (live-updating), Started
- If no active requests, show "No active requests" muted text
- Use existing table CSS
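One way to keep the Duration column live is to recompute it from each request's start time on every tick, as sketched here. The request fields (`sessionId`, `channel`, `startedAt` in epoch milliseconds) are assumptions about the `system.activeRequests` payload, and both function names are hypothetical.

```javascript
// Compact duration label: "42s" below one minute, "1m 5s" above.
function formatDuration(ms) {
  const totalSeconds = Math.max(0, Math.floor(ms / 1000));
  const minutes = Math.floor(totalSeconds / 60);
  const seconds = totalSeconds % 60;
  return minutes > 0 ? `${minutes}m ${seconds}s` : `${seconds}s`;
}

// One view-model row per in-flight request.
// ASSUMPTION: request = { sessionId, channel, startedAt: epoch ms }.
function buildRequestRows(requests, nowMs) {
  return requests.map((r) => ({
    session: r.sessionId,
    channel: r.channel,
    duration: formatDuration(nowMs - r.startedAt),
    started: new Date(r.startedAt).toLocaleTimeString(),
  }));
}
```

An empty result is the signal for the renderer to show the muted "No active requests" text instead of a table.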
**Section 5: Channels (keep existing)**
- Keep the existing channels grid showing connected/disconnected channel adapters
**Refresh strategy:**
- Replace the current 10-second interval with a 3-second interval for the core data (system.metrics, system.events, system.activeRequests)
- Fetch system.health and system.channels every 10 seconds (less dynamic data)
- Use `Promise.all` to batch the frequent calls together
- Keep the existing `teardown()` pattern with `clearInterval`
**Implementation approach:**
- Keep the same module pattern: `loadDashboard(el, client)` function + `DashboardPage` export with `render`/`teardown`
- Use two timers: `_fastTimer` (3s) for metrics/events/requests, `_slowTimer` (10s) for health/channels
- On first render, fetch everything with `Promise.all`
- On subsequent fast ticks, only update the dynamic sections (don't re-render the whole page — use targeted DOM updates via `getElementById` for each section)
- Generate unique section IDs: `#ops-counters`, `#ops-model-table`, `#ops-events`, `#ops-requests`, `#ops-channels`
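The two-timer refresh described above can be sketched as follows. This is an outline under assumptions rather than the final module: `createDashboardPoller` and the callback names are hypothetical, and `client.call(method, params)` is assumed to return a promise, with the params argument optional. Only `setInterval`, `clearInterval`, and `Promise.all` are used.

```javascript
// Hypothetical poller: fast tick every 3 s, slow tick every 10 s.
// ASSUMPTION: client.call(method, params) returns a Promise.
function createDashboardPoller(client, onFast, onSlow, onError = () => {}) {
  let fastTimer = null;
  let slowTimer = null;

  // Batch the three frequently changing calls into one round.
  function fastTick() {
    return Promise.all([
      client.call('system.metrics'),
      client.call('system.events', { limit: 50 }),
      client.call('system.activeRequests'),
    ])
      .then(([metrics, events, requests]) => onFast(metrics, events, requests))
      .catch(onError);
  }

  // Less dynamic data is fetched on the slower cadence.
  function slowTick() {
    return Promise.all([client.call('system.health'), client.call('system.channels')])
      .then(([health, channels]) => onSlow(health, channels))
      .catch(onError);
  }

  function teardown() {
    if (fastTimer !== null) clearInterval(fastTimer);
    if (slowTimer !== null) clearInterval(slowTimer);
    fastTimer = null;
    slowTimer = null;
  }

  function start() {
    teardown(); // avoid stacking timers if start() is called twice
    fastTick(); // first render fetches everything immediately
    slowTick();
    fastTimer = setInterval(fastTick, 3000);
    slowTimer = setInterval(slowTick, 10000);
  }

  return { start, teardown, isRunning: () => fastTimer !== null };
}
```

In this shape, `render` would call `start()` and the page's `teardown()` would delegate to the poller's `teardown()`, so both intervals are cleared when the operator navigates away. A failed tick is routed to `onError` and does not stop later ticks.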
**CSS additions in `src/gateway/ui/style.css`:**
Add near the end of the file, immediately before the responsive section:
```css
/* ── Event Stream ──────────────────────────────────────── */
.event-stream {
  max-height: 300px;
  overflow-y: auto;
  background-color: var(--bg-secondary);
  border: 1px solid var(--border);
  border-radius: var(--radius);
  padding: 8px;
  font-size: var(--font-size-sm);
  font-family: var(--font-mono);
}

.event-row {
  padding: 4px 8px;
  border-bottom: 1px solid var(--border-light);
  white-space: pre-wrap;
  word-break: break-word;
}

.event-row:last-child {
  border-bottom: none;
}

.event-level-error { color: var(--error); }
.event-level-warn { color: var(--warning); }
.event-level-info { color: var(--text-secondary); }

/* ── Model Metrics Summary ─────────────────────────────── */
.metrics-summary {
  display: flex;
  gap: 24px;
  margin-bottom: 12px;
  font-size: var(--font-size-sm);
  color: var(--text-secondary);
}

.metrics-summary .metric {
  display: flex;
  gap: 6px;
}

.metrics-summary .metric-value {
  font-weight: 600;
  color: var(--text-primary);
}
```
**Keep the `formatUptime` helper.** It already exists; reuse it unchanged for the Daemon Uptime counter.
**Avoid:** Do NOT add animations or transitions. Do NOT import external libraries. Do NOT rebuild the whole page from a template literal via innerHTML on the fast-update path. Instead, update only the specific elements that changed (textContent for single values, innerHTML scoped to one table body or the event container) to avoid flicker.
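A targeted update for the fast path might look like the sketch below. The helper names and element IDs are hypothetical, and the `doc` parameter stands in for `document` so the logic can be exercised without a browser. Each element is written only when its value actually changed.

```javascript
// Write textContent only when the value differs; returns true if it wrote.
function setTextIfChanged(el, value) {
  const next = String(value);
  if (el.textContent === next) return false;
  el.textContent = next;
  return true;
}

// Apply a batch of { id, value } updates; returns the number of DOM writes.
// `doc` is anything exposing getElementById (normally `document`).
function updateCounters(doc, updates) {
  let writes = 0;
  for (const { id, value } of updates) {
    const el = doc.getElementById(id);
    // Missing elements are skipped, e.g. after the page was torn down.
    if (el && setTextIfChanged(el, value)) writes += 1;
  }
  return writes;
}
```

On a tick where nothing changed this performs no DOM writes, which is what keeps the 3-second refresh from flickering.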
</action>
<verify>
`pnpm typecheck`: no type errors (the vanilla JS changes do not affect type checking, but this confirms there are no TS regressions).
`pnpm build`: builds successfully (UI files are served as static assets, not compiled).
Manual check: Open `src/gateway/ui/pages/dashboard.js` and verify it:
- Calls `client.call('system.metrics')`
- Calls `client.call('system.events')`
- Calls `client.call('system.activeRequests')`
- Has 3-second and 10-second refresh timers
- Has `teardown()` that cleans up both timers
</verify>
<done>
Dashboard page shows five sections: core counters, model performance table, event stream, active requests, and channels.
Counters and events refresh every 3 seconds.
Health and channels refresh every 10 seconds.
Event stream auto-scrolls and is color-coded by level.
Active requests section shows in-flight requests or "no active requests" message.
All existing stat-card and table CSS reused; new event-stream CSS added.
</done>
</task>
<task type="checkpoint:human-verify" gate="blocking">
<name>Task 2: Verify live dashboard in browser</name>
<files>src/gateway/ui/pages/dashboard.js</files>
<action>
Human verification of the live dashboard. What was built:
- Live ops dashboard with real-time metrics, event stream, model performance table, active request tracking, and HTTP /health endpoint
- Extended the existing vanilla JS dashboard (no framework replacement)
Steps to verify:
1. Start Flynn: `pnpm dev`
2. Open the dashboard in a browser (default: http://localhost:3100 or configured port)
3. Verify the dashboard shows:
- Core counters row: Messages Processed, Active Sessions, Queue Depth, Uptime, Active Requests, Errors
- Model Performance section: table of recent model calls (may be empty if no messages sent yet)
- Event Stream section: scrollable log (may show startup events)
- Active Requests section: "No active requests" or table
- Channels section: connected channel adapters
4. Send a message through the chat page (or via a connected channel) and verify:
- Messages Processed counter increments within 3 seconds
- Model Performance table shows the new call with latency and tokens/sec
- Event stream shows relevant entries
5. Trigger an error (e.g., send a message that causes a tool error) and verify it appears in the event stream in red
6. Test HTTP /health: `curl http://localhost:3100/health` — should return JSON with status, uptime, version
7. Run `pnpm test:run` — all tests pass
Resume signal: Type "approved" or describe issues.
</action>
<verify>Human confirms dashboard displays correctly and updates in real-time.</verify>
<done>Dashboard visually confirmed working with live-updating metrics, event stream, and model performance data.</done>
</task>
</tasks>
<verification>
1. Dashboard loads without errors in browser console
2. All five sections render with real data
3. Counters update within 3 seconds of events occurring
4. Event stream is scrollable and color-coded
5. `curl /health` returns valid JSON
6. `pnpm test:run` — all tests pass
7. `pnpm typecheck` — zero type errors
</verification>
<success_criteria>
- Dashboard shows live-updating counters that change as messages flow (DASH-01)
- Model call metrics visible with latency and tokens/sec (DASH-02)
- Event stream shows errors with timestamps and context (DASH-03)
- Active requests tracked and displayed (DASH-04)
- GET /health returns JSON status (DASH-05)
- Existing dashboard pages (chat, sessions, usage, settings) unaffected
- Zero test regressions
</success_criteria>
<output>
After completion, create `.planning/phases/03-live-ops-dashboard/03-02-SUMMARY.md`
</output>