diff --git a/.planning/STATE.md b/.planning/STATE.md
index 14bb889..20fddf8 100644
--- a/.planning/STATE.md
+++ b/.planning/STATE.md
@@ -8,10 +8,10 @@
 
 ## Current Position
 
-**Phase:** 2 — Config Overlays
-**Plan:** 2 of 2 complete (02-02 done)
-**Status:** complete
-**Progress:** ███████░░░ 2/3 phases complete
+**Phase:** 3 — Live Ops Dashboard
+**Plan:** 1 of 2 complete (03-01 done)
+**Status:** in_progress
+**Progress:** ██████████ 2.5/3 phases (Phase 3: 1/2 plans)
 
 ## Phase Status
 
@@ -19,13 +19,13 @@
 |-------|--------|-------|
 | 1 — Daemon Decomposition | **complete** | 3/3 plans complete |
 | 2 — Config Overlays | **complete** | 2/2 plans complete |
-| 3 — Live Ops Dashboard | not_started | — |
+| 3 — Live Ops Dashboard | **in_progress** | 1/2 plans complete |
 
 ## Performance Metrics
 
 | Metric | Value |
 |--------|-------|
-| Test count | 1077 (baseline, verified across all plans) |
+| Test count | 1107 (verified after 03-01, +20 metrics tests from 1087 baseline) |
 | daemon/index.ts lines | 140 (from 1087 baseline, -87%) |
 | Total daemon modules | 9 files, 1271 lines |
 | Plan 01-01 duration | 9 min |
@@ -38,6 +38,8 @@
 | Plan 02-01 tasks | 2/2 |
 | Plan 02-02 duration | ~1 min |
 | Plan 02-02 tasks | 1/1 |
+| Plan 03-01 duration | ~2 min |
+| Plan 03-01 tasks | 2/2 |
 
 ## Accumulated Context
 
@@ -61,6 +63,12 @@
 - Doctor overlay check placed after checkConfigExists, before checkConfigParses — natural validation order
 - Skip status when FLYNN_ENV not set — no noise for users without overlays
 
+- MetricsCollector created inside GatewayServer constructor (self-contained, no services.ts changes needed)
+- Ring buffers: 200 model calls, 500 events — configurable, bounded memory for long-running daemon
+- Metrics passed as optional MetricsCollector instance to agent handler deps (not individual callbacks)
+- startRequest before laneQueue.enqueue, endRequest in finally — tracks full queuing + execution time
+- /health endpoint unauthenticated, placed before auth check for Docker HEALTHCHECK compatibility
+
 ### Technical Notes
 - daemon/index.ts now 140 lines — thin composition root: imports → init calls → wire → return DaemonContext
 - 8 extracted modules: models.ts (251), memory.ts (99), tools.ts (89), routing.ts (239), agents.ts (48), channels.ts (102), services.ts (269), lifecycle.ts (34)
@@ -71,6 +79,9 @@
 - deepMerge + overlay-aware loadConfig in loader.ts; resolveOverlayPath + overlay-aware loadConfigSafe in cli/shared.ts
 - FLYNN_ENV maps to {configDir}/{env}.yaml sibling file; no env = no overlay (backward compatible)
 - checkOverlayExists in doctor.ts: skip (no FLYNN_ENV) / pass (file found) / fail (file missing)
+- MetricsCollector at src/gateway/metrics.ts: counters, model call ring buffer (200), event ring buffer (500), active request tracking
+- 3 new RPC handlers: system.metrics (snapshot), system.events (filtered/limited), system.activeRequests (in-flight details)
+- GET /health HTTP endpoint returns JSON status without auth (for monitoring/healthcheck)
 
 ### TODOs
 _(none)_
@@ -80,10 +91,10 @@ _(none)_
 
 ## Session Continuity
 
-**Last session:** Plan 02-02 (doctor overlay validation) completed
-**Stopped at:** Completed 02-02-PLAN.md — Phase 2 complete
-**Next action:** Plan Phase 3 (Live Ops Dashboard)
+**Last session:** Plan 03-01 (metrics collection backend) completed
+**Stopped at:** Completed 03-01-PLAN.md — Phase 3 plan 1 of 2 done
+**Next action:** Execute 03-02-PLAN.md (Dashboard UI)
 
 ---
 *State initialized: 2026-02-09*
-*Last updated: 2026-02-10*
+*Last updated: 2026-02-10T05:29Z*
diff --git a/.planning/phases/03-live-ops-dashboard/03-01-SUMMARY.md b/.planning/phases/03-live-ops-dashboard/03-01-SUMMARY.md
new file mode 100644
index 0000000..c0019c4
--- /dev/null
+++ b/.planning/phases/03-live-ops-dashboard/03-01-SUMMARY.md
@@ -0,0 +1,112 @@
+---
+phase: 03-live-ops-dashboard
+plan: 01
+subsystem: gateway
+tags: [metrics, ring-buffer, rpc, health-endpoint, monitoring]
+
+# Dependency graph
+requires:
+  - phase: 01-daemon-decomposition
+    provides: GatewayServer, LaneQueue, handler architecture, session bridge
+provides:
+  - MetricsCollector class with counters, model call ring buffer, event ring buffer, active request tracking
+  - system.metrics, system.events, system.activeRequests RPC handlers
+  - GET /health unauthenticated HTTP endpoint
+  - Metrics recording wired into agent.send request flow
+affects: [03-02-PLAN (dashboard UI consumes these RPC methods)]
+
+# Tech tracking
+tech-stack:
+  added: []
+  patterns: [ring-buffer with FIFO eviction, optional-chaining metrics injection, gauge counters]
+
+key-files:
+  created:
+    - src/gateway/metrics.ts
+    - src/gateway/metrics.test.ts
+  modified:
+    - src/gateway/server.ts
+    - src/gateway/handlers/system.ts
+    - src/gateway/handlers/agent.ts
+    - src/gateway/lane-queue.ts
+
+key-decisions:
+  - "MetricsCollector created inside GatewayServer constructor (self-contained, no services.ts changes needed)"
+  - "Ring buffers: 200 model calls, 500 events — reasonable for dashboard display without unbounded growth"
+  - "Metrics passed to agent handler as optional MetricsCollector instance (not individual callbacks)"
+  - "startRequest called before laneQueue.enqueue, endRequest in finally block — tracks full queuing + execution time"
+
+patterns-established:
+  - "Optional metrics injection: deps.metrics?.method() pattern for zero-cost when metrics disabled"
+  - "Ring buffer with shift() eviction for bounded memory in long-running daemon"
+  - "Unauthenticated /health endpoint before auth check for Docker HEALTHCHECK compatibility"
+
+# Metrics
+duration: 2min
+completed: 2026-02-10
+---
+
+# Phase 3 Plan 1: Metrics Collection Backend Summary
+
+**MetricsCollector with counters, ring buffers, and active request tracking, exposed via 3 RPC handlers and /health HTTP endpoint, wired into agent.send flow**
+
+## Performance
+
+- **Duration:** ~2 min
+- **Started:** 2026-02-10T05:27:59Z
+- **Completed:** 2026-02-10T05:29:33Z
+- **Tasks:** 2/2
+- **Files modified:** 6
+
+## Accomplishments
+- MetricsCollector class tracking messages processed, errors, active requests, model call latency, and event stream
+- Three new RPC handlers (system.metrics, system.events, system.activeRequests) for dashboard consumption
+- GET /health unauthenticated endpoint returning JSON status for Docker HEALTHCHECK
+- Agent request flow records metrics: message counts, error events, tool failure events, active request tracking
+
+## Task Commits
+
+Each task was committed atomically:
+
+1. **Task 1: Create MetricsCollector and wire into gateway** - `bd1880a` (feat)
+2. **Task 2: Hook metrics recording into agent request flow** - `a0feff9` (feat)
+
+## Files Created/Modified
+- `src/gateway/metrics.ts` - MetricsCollector class with counters, ring buffers, active request map, snapshot method
+- `src/gateway/metrics.test.ts` - 20 tests covering counters, ring buffer limits, event filtering, active request tracking, snapshot shape
+- `src/gateway/server.ts` - MetricsCollector creation in constructor, /health HTTP endpoint, metrics callbacks to handlers
+- `src/gateway/handlers/system.ts` - system.metrics, system.events, system.activeRequests RPC handlers
+- `src/gateway/handlers/agent.ts` - Metrics recording in agent.send: startRequest/endRequest, message/error counters, error events, tool failure events
+- `src/gateway/lane-queue.ts` - totalPending() method for queue depth metric
+
+## Decisions Made
+- MetricsCollector self-contained in GatewayServer constructor — no changes to services.ts needed
+- Ring buffer sizes: 200 model calls, 500 events (configurable via constructor)
+- Passed MetricsCollector instance directly to agent handler deps instead of individual callbacks — cleaner API
+- startRequest called before laneQueue.enqueue to track full queuing + execution duration
+- Tool failures recorded as separate error events with tool name context
+
+## Deviations from Plan
+
+None - plan executed exactly as written.
+
+## Issues Encountered
+None
+
+## User Setup Required
+None - no external service configuration required.
+
+## Next Phase Readiness
+- All metrics RPC endpoints ready for Plan 02 (Dashboard UI) to consume
+- system.metrics returns snapshot with counters, model call stats, queue depth
+- system.events returns filtered/limited events (newest first)
+- system.activeRequests returns in-flight request details
+- GET /health available for external monitoring integration
+
+## Self-Check: PASSED
+
+All 7 files verified present. Both task commits (bd1880a, a0feff9) verified in git log.
+
+---
+*Phase: 03-live-ops-dashboard*
+*Completed: 2026-02-10*
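
Review note on the patterns named in 03-01-SUMMARY.md: the summary describes a ring buffer with `shift()` eviction, optional metrics injection via `deps.metrics?.method()`, and `startRequest` before queuing with `endRequest` in a `finally` block, but the diff does not include `src/gateway/metrics.ts` itself. The following is a minimal TypeScript sketch of how those three ideas could fit together. It is an assumption-laden illustration, not the committed code: `RingBuffer`, `MetricsSketch`, `MetricEvent`, `recordError`, `recentEvents`, and `handleSend` are hypothetical names, and only `startRequest`, `endRequest`, and the 500-event buffer size come from the summary.

```typescript
// Illustrative sketch only. RingBuffer, MetricsSketch, and handleSend are
// hypothetical names; the real MetricsCollector may look different.

interface MetricEvent {
  type: string;
  message: string;
  at: number; // epoch milliseconds
}

class RingBuffer<T> {
  private items: T[] = [];
  private capacity: number;

  constructor(capacity: number) {
    this.capacity = capacity;
  }

  push(item: T): void {
    this.items.push(item);
    // FIFO eviction with shift(): drop the oldest entry once over capacity.
    if (this.items.length > this.capacity) {
      this.items.shift();
    }
  }

  // Returns up to `limit` entries, newest first.
  newestFirst(limit: number = this.capacity): T[] {
    if (limit <= 0) return [];
    return this.items.slice(-limit).reverse();
  }

  get size(): number {
    return this.items.length;
  }
}

class MetricsSketch {
  messagesProcessed = 0;
  errors = 0;
  private active = new Map<string, number>(); // request id -> start time
  private events = new RingBuffer<MetricEvent>(500); // size from the summary

  startRequest(id: string): void {
    this.active.set(id, Date.now());
  }

  endRequest(id: string): void {
    this.active.delete(id);
  }

  recordError(message: string): void {
    this.errors += 1;
    this.events.push({ type: "error", message, at: Date.now() });
  }

  recentEvents(limit: number): MetricEvent[] {
    return this.events.newestFirst(limit);
  }

  get activeCount(): number {
    return this.active.size;
  }
}

// `work` stands in for the queued agent call (laneQueue.enqueue in the summary).
async function handleSend(
  id: string,
  work: () => Promise<void>,
  metrics?: MetricsSketch,
): Promise<void> {
  metrics?.startRequest(id); // before queuing, so wait time is included
  try {
    await work();
    if (metrics) metrics.messagesProcessed += 1;
  } catch (err) {
    metrics?.recordError(err instanceof Error ? err.message : String(err));
    throw err;
  } finally {
    metrics?.endRequest(id); // runs on success and on failure
  }
}
```

Because every call goes through `metrics?.`, a caller that passes no collector skips the bookkeeping entirely, which is presumably what the summary means by "zero-cost when metrics disabled". Note that `shift()` on a JavaScript array is linear in the array length; at 200 and 500 entries that is unlikely to matter, but a larger buffer might prefer a fixed array with a moving head index.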