docs(03-01): complete metrics collection backend plan

- SUMMARY.md with task commits, decisions, self-check
- STATE.md updated: phase 3 in_progress, 1/2 plans, test count 1107
William Valentin
2026-02-09 21:31:07 -08:00
parent a0feff9637
commit 982dcee5e0
2 changed files with 133 additions and 10 deletions
+21 -10
@@ -8,10 +8,10 @@
## Current Position
**Phase:** 2 - Config Overlays
**Plan:** 2 of 2 complete (02-02 done)
**Status:** complete
**Progress:** ███████░░░ 2/3 phases complete
**Phase:** 3 - Live Ops Dashboard
**Plan:** 1 of 2 complete (03-01 done)
**Status:** in_progress
**Progress:** ██████████ 2.5/3 phases (Phase 3: 1/2 plans)
## Phase Status
@@ -19,13 +19,13 @@
|-------|--------|-------|
| 1 — Daemon Decomposition | **complete** | 3/3 plans complete |
| 2 — Config Overlays | **complete** | 2/2 plans complete |
| 3 — Live Ops Dashboard | not_started | — |
| 3 — Live Ops Dashboard | **in_progress** | 1/2 plans complete |
## Performance Metrics
| Metric | Value |
|--------|-------|
| Test count | 1077 (baseline, verified across all plans) |
| Test count | 1107 (verified after 03-01, +20 metrics tests from 1087 baseline) |
| daemon/index.ts lines | 140 (from 1087 baseline, -87%) |
| Total daemon modules | 9 files, 1271 lines |
| Plan 01-01 duration | 9 min |
@@ -38,6 +38,8 @@
| Plan 02-01 tasks | 2/2 |
| Plan 02-02 duration | ~1 min |
| Plan 02-02 tasks | 1/1 |
| Plan 03-01 duration | ~2 min |
| Plan 03-01 tasks | 2/2 |
## Accumulated Context
@@ -61,6 +63,12 @@
- Doctor overlay check placed after checkConfigExists, before checkConfigParses — natural validation order
- Skip status when FLYNN_ENV not set — no noise for users without overlays
- MetricsCollector created inside GatewayServer constructor (self-contained, no services.ts changes needed)
- Ring buffers: 200 model calls, 500 events — configurable, bounded memory for long-running daemon
- Metrics passed as optional MetricsCollector instance to agent handler deps (not individual callbacks)
- startRequest before laneQueue.enqueue, endRequest in finally — tracks full queuing + execution time
- /health endpoint unauthenticated, placed before auth check for Docker HEALTHCHECK compatibility
### Technical Notes
- daemon/index.ts now 140 lines — thin composition root: imports → init calls → wire → return DaemonContext
- 8 extracted modules: models.ts (251), memory.ts (99), tools.ts (89), routing.ts (239), agents.ts (48), channels.ts (102), services.ts (269), lifecycle.ts (34)
@@ -71,6 +79,9 @@
- deepMerge + overlay-aware loadConfig in loader.ts; resolveOverlayPath + overlay-aware loadConfigSafe in cli/shared.ts
- FLYNN_ENV maps to {configDir}/{env}.yaml sibling file; no env = no overlay (backward compatible)
- checkOverlayExists in doctor.ts: skip (no FLYNN_ENV) / pass (file found) / fail (file missing)
- MetricsCollector at src/gateway/metrics.ts: counters, model call ring buffer (200), event ring buffer (500), active request tracking
- 3 new RPC handlers: system.metrics (snapshot), system.events (filtered/limited), system.activeRequests (in-flight details)
- GET /health HTTP endpoint returns JSON status without auth (for monitoring/healthcheck)
### TODOs
_(none)_
@@ -80,10 +91,10 @@ _(none)_
## Session Continuity
**Last session:** Plan 02-02 (doctor overlay validation) completed
**Stopped at:** Completed 02-02-PLAN.md — Phase 2 complete
**Next action:** Plan Phase 3 (Live Ops Dashboard)
**Last session:** Plan 03-01 (metrics collection backend) completed
**Stopped at:** Completed 03-01-PLAN.md — Phase 3 plan 1 of 2 done
**Next action:** Execute 03-02-PLAN.md (Dashboard UI)
---
*State initialized: 2026-02-09*
*Last updated: 2026-02-10*
*Last updated: 2026-02-10T05:29Z*
@@ -0,0 +1,112 @@
---
phase: 03-live-ops-dashboard
plan: 01
subsystem: gateway
tags: [metrics, ring-buffer, rpc, health-endpoint, monitoring]
# Dependency graph
requires:
- phase: 01-daemon-decomposition
provides: GatewayServer, LaneQueue, handler architecture, session bridge
provides:
- MetricsCollector class with counters, model call ring buffer, event ring buffer, active request tracking
- system.metrics, system.events, system.activeRequests RPC handlers
- GET /health unauthenticated HTTP endpoint
- Metrics recording wired into agent.send request flow
affects: [03-02-PLAN (dashboard UI consumes these RPC methods)]
# Tech tracking
tech-stack:
added: []
patterns: [ring-buffer with FIFO eviction, optional-chaining metrics injection, gauge counters]
key-files:
created:
- src/gateway/metrics.ts
- src/gateway/metrics.test.ts
modified:
- src/gateway/server.ts
- src/gateway/handlers/system.ts
- src/gateway/handlers/agent.ts
- src/gateway/lane-queue.ts
key-decisions:
- "MetricsCollector created inside GatewayServer constructor (self-contained, no services.ts changes needed)"
- "Ring buffers: 200 model calls, 500 events — reasonable for dashboard display without unbounded growth"
- "Metrics passed to agent handler as optional MetricsCollector instance (not individual callbacks)"
- "startRequest called before laneQueue.enqueue, endRequest in finally block — tracks full queuing + execution time"
patterns-established:
- "Optional metrics injection: deps.metrics?.method() pattern for zero-cost when metrics disabled"
- "Ring buffer with shift() eviction for bounded memory in long-running daemon"
- "Unauthenticated /health endpoint before auth check for Docker HEALTHCHECK compatibility"
# Metrics
duration: 2min
completed: 2026-02-10
---
# Phase 3 Plan 1: Metrics Collection Backend Summary
**MetricsCollector with counters, ring buffers, and active request tracking, exposed via 3 RPC handlers and /health HTTP endpoint, wired into agent.send flow**
## Performance
- **Duration:** ~2 min
- **Started:** 2026-02-10T05:27:59Z
- **Completed:** 2026-02-10T05:29:33Z
- **Tasks:** 2/2
- **Files modified:** 6
## Accomplishments
- MetricsCollector class tracking messages processed, errors, active requests, model call latency, and event stream
- Three new RPC handlers (system.metrics, system.events, system.activeRequests) for dashboard consumption
- GET /health unauthenticated endpoint returning JSON status for Docker HEALTHCHECK
- Agent request flow records metrics: message counts, error events, tool failure events, active request tracking
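
To make the shape of the collector concrete, here is a minimal sketch of counters, two bounded ring buffers with `shift()` eviction, and an active request map. Only `MetricsCollector`, `startRequest`, `endRequest`, and the 200/500 defaults come from this summary. The `RingBuffer` helper, the other method names, and the field names are assumptions for illustration, not the contents of `src/gateway/metrics.ts`.

```typescript
// Hypothetical shapes for illustration; the real src/gateway/metrics.ts may differ.
interface ModelCall { model: string; latencyMs: number; at: number }
interface MetricEvent { type: string; message: string; at: number }

class RingBuffer<T> {
  private readonly items: T[] = [];
  private readonly capacity: number;

  constructor(capacity: number) {
    this.capacity = capacity;
  }

  push(item: T): void {
    this.items.push(item);
    // FIFO eviction: drop the oldest entry once capacity is exceeded.
    if (this.items.length > this.capacity) this.items.shift();
  }

  toArray(): T[] {
    return this.items.slice();
  }
}

class MetricsCollector {
  private messagesProcessed = 0;
  private errors = 0;
  private readonly modelCalls: RingBuffer<ModelCall>;
  private readonly events: RingBuffer<MetricEvent>;
  // Request id -> start timestamp (ms) for in-flight requests.
  private readonly active = new Map<string, number>();

  constructor(maxModelCalls = 200, maxEvents = 500) {
    this.modelCalls = new RingBuffer<ModelCall>(maxModelCalls);
    this.events = new RingBuffer<MetricEvent>(maxEvents);
  }

  recordMessage(): void { this.messagesProcessed += 1; }

  recordError(message: string): void {
    this.errors += 1;
    this.events.push({ type: "error", message, at: Date.now() });
  }

  recordModelCall(model: string, latencyMs: number): void {
    this.modelCalls.push({ model, latencyMs, at: Date.now() });
  }

  startRequest(id: string): void { this.active.set(id, Date.now()); }
  endRequest(id: string): void { this.active.delete(id); }

  // Newest first, optionally filtered by type, capped at `limit`.
  getEvents(type?: string, limit = 50): MetricEvent[] {
    const all = this.events.toArray().reverse();
    const filtered = type === undefined ? all : all.filter((e) => e.type === type);
    return filtered.slice(0, limit);
  }

  snapshot() {
    const calls = this.modelCalls.toArray();
    const totalLatency = calls.reduce((sum, c) => sum + c.latencyMs, 0);
    return {
      messagesProcessed: this.messagesProcessed,
      errors: this.errors,
      activeRequests: this.active.size,
      modelCalls: {
        count: calls.length,
        avgLatencyMs: calls.length === 0 ? 0 : totalLatency / calls.length,
      },
    };
  }
}
```

In this sketch the counters keep growing while the buffers stay bounded, so the error total can exceed the number of retained error events.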
## Task Commits
Each task was committed atomically:
1. **Task 1: Create MetricsCollector and wire into gateway** - `bd1880a` (feat)
2. **Task 2: Hook metrics recording into agent request flow** - `a0feff9` (feat)
## Files Created/Modified
- `src/gateway/metrics.ts` - MetricsCollector class with counters, ring buffers, active request map, snapshot method
- `src/gateway/metrics.test.ts` - 20 tests covering counters, ring buffer limits, event filtering, active request tracking, snapshot shape
- `src/gateway/server.ts` - MetricsCollector creation in constructor, /health HTTP endpoint, metrics callbacks to handlers
- `src/gateway/handlers/system.ts` - system.metrics, system.events, system.activeRequests RPC handlers
- `src/gateway/handlers/agent.ts` - Metrics recording in agent.send: startRequest/endRequest, message/error counters, error events, tool failure events
- `src/gateway/lane-queue.ts` - totalPending() method for queue depth metric
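
As a rough illustration of the queue depth metric, the sketch below shows one way a per-lane queue could expose `totalPending()`. The internals here (a map of lane name to waiting tasks, serial execution per lane) are assumptions, since this summary does not describe how `LaneQueue` is implemented. In this sketch "pending" means queued but not yet started.

```typescript
// Hypothetical per-lane serial queue; the real lane-queue.ts may be structured differently.
class LaneQueue {
  private readonly lanes = new Map<string, Array<() => Promise<void>>>();
  private readonly running = new Set<string>();

  enqueue<T>(lane: string, task: () => Promise<T>): Promise<T> {
    return new Promise<T>((resolve, reject) => {
      const queue = this.lanes.get(lane) ?? [];
      // Wrap the task so its outcome settles the promise returned to the caller.
      queue.push(() => task().then(resolve, reject));
      this.lanes.set(lane, queue);
      void this.drain(lane);
    });
  }

  // Queue depth: tasks waiting across all lanes, excluding the ones currently running.
  totalPending(): number {
    let total = 0;
    this.lanes.forEach((queue) => {
      total += queue.length;
    });
    return total;
  }

  private async drain(lane: string): Promise<void> {
    if (this.running.has(lane)) return; // this lane already has a worker
    this.running.add(lane);
    try {
      let next = this.lanes.get(lane)?.shift();
      while (next !== undefined) {
        await next();
        next = this.lanes.get(lane)?.shift();
      }
    } finally {
      this.running.delete(lane);
      this.lanes.delete(lane);
    }
  }
}
```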
## Decisions Made
- MetricsCollector self-contained in GatewayServer constructor — no changes to services.ts needed
- Ring buffer sizes: 200 model calls, 500 events (configurable via constructor)
- Passed MetricsCollector instance directly to agent handler deps instead of individual callbacks — cleaner API
- startRequest called before laneQueue.enqueue to track full queuing + execution duration
- Tool failures recorded as separate error events with tool name context
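
The ordering decision above (start before enqueue, end in `finally`) combined with the optional `deps.metrics?.method()` pattern can be sketched as follows. The `RequestMetrics` and `AgentDeps` interfaces, the `handleAgentSend` function, and its parameters are hypothetical names chosen for this example. The actual handler in `src/gateway/handlers/agent.ts` is not shown in this summary.

```typescript
// Hypothetical subset of the collector surface used by the handler.
interface RequestMetrics {
  startRequest(id: string): void;
  endRequest(id: string): void;
  recordMessage(): void;
  recordError(message: string): void;
}

// Hypothetical handler dependencies; metrics is optional so it can be left out entirely.
interface AgentDeps {
  enqueue<T>(lane: string, task: () => Promise<T>): Promise<T>;
  metrics?: RequestMetrics;
}

async function handleAgentSend(
  deps: AgentDeps,
  requestId: string,
  lane: string,
  run: () => Promise<string>,
): Promise<string> {
  // Start the clock before enqueueing so time spent waiting in the lane is counted.
  deps.metrics?.startRequest(requestId);
  try {
    const reply = await deps.enqueue(lane, run);
    deps.metrics?.recordMessage();
    return reply;
  } catch (err) {
    deps.metrics?.recordError(err instanceof Error ? err.message : String(err));
    throw err;
  } finally {
    // Runs on success and failure, so the active request entry is always cleared.
    deps.metrics?.endRequest(requestId);
  }
}
```

Passing one optional collector keeps the handler signature small. When `metrics` is undefined, each call site reduces to a single undefined check.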
## Deviations from Plan
None - plan executed exactly as written.
## Issues Encountered
None
## User Setup Required
None - no external service configuration required.
## Next Phase Readiness
- All metrics RPC endpoints ready for Plan 02 (Dashboard UI) to consume
- system.metrics returns snapshot with counters, model call stats, queue depth
- system.events returns filtered/limited events (newest first)
- system.activeRequests returns in-flight request details
- GET /health available for external monitoring integration
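
The "health check before auth" ordering can be illustrated with a small pure routing function. This is a sketch under stated assumptions: the request and response shapes, the bearer token scheme, and the `status`/`uptimeMs` fields in the JSON body are invented for the example and are not taken from `src/gateway/server.ts`.

```typescript
// Hypothetical request/response shapes, kept independent of any HTTP library.
interface HttpRequestLike {
  method: string;
  url: string;
  headers: Record<string, string | undefined>;
}

interface HttpResponseLike {
  status: number;
  body: string;
}

function handleHttpRequest(
  req: HttpRequestLike,
  expectedToken: string,
  startedAt: number,
  now: number = Date.now(),
): HttpResponseLike {
  // Health check comes first so probes without credentials still get a 200.
  if (req.method === "GET" && req.url === "/health") {
    return { status: 200, body: JSON.stringify({ status: "ok", uptimeMs: now - startedAt }) };
  }

  // Everything else requires a valid token (bearer scheme assumed for this sketch).
  if (req.headers.authorization !== `Bearer ${expectedToken}`) {
    return { status: 401, body: JSON.stringify({ error: "unauthorized" }) };
  }

  return { status: 404, body: JSON.stringify({ error: "not found" }) };
}
```

Because the health branch returns before the token comparison, a probe such as a Docker HEALTHCHECK does not need access to gateway credentials.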
## Self-Check: PASSED
All 7 files verified present. Both task commits (bd1880a, a0feff9) verified in git log.
---
*Phase: 03-live-ops-dashboard*
*Completed: 2026-02-10*