docs(03-01): complete metrics collection backend plan

- SUMMARY.md with task commits, decisions, self-check
- STATE.md updated: phase 3 in_progress, 1/2 plans, test count 1107
William Valentin
2026-02-09 21:31:07 -08:00
parent a0feff9637
commit 982dcee5e0
2 changed files with 133 additions and 10 deletions
@@ -0,0 +1,112 @@
---
phase: 03-live-ops-dashboard
plan: 01
subsystem: gateway
tags: [metrics, ring-buffer, rpc, health-endpoint, monitoring]
# Dependency graph
requires:
- phase: 01-daemon-decomposition
provides: GatewayServer, LaneQueue, handler architecture, session bridge
provides:
- MetricsCollector class with counters, model call ring buffer, event ring buffer, active request tracking
- system.metrics, system.events, system.activeRequests RPC handlers
- GET /health unauthenticated HTTP endpoint
- Metrics recording wired into agent.send request flow
affects: [03-02-PLAN (dashboard UI consumes these RPC methods)]
# Tech tracking
tech-stack:
added: []
patterns: [ring-buffer with FIFO eviction, optional-chaining metrics injection, gauge counters]
key-files:
created:
- src/gateway/metrics.ts
- src/gateway/metrics.test.ts
modified:
- src/gateway/server.ts
- src/gateway/handlers/system.ts
- src/gateway/handlers/agent.ts
- src/gateway/lane-queue.ts
key-decisions:
- "MetricsCollector created inside GatewayServer constructor (self-contained, no services.ts changes needed)"
- "Ring buffers: 200 model calls, 500 events — reasonable for dashboard display without unbounded growth"
- "Metrics passed to agent handler as optional MetricsCollector instance (not individual callbacks)"
- "startRequest called before laneQueue.enqueue, endRequest in finally block — tracks full queuing + execution time"
patterns-established:
- "Optional metrics injection: deps.metrics?.method() pattern for zero-cost when metrics disabled"
- "Ring buffer with shift() eviction for bounded memory in long-running daemon"
- "Unauthenticated /health endpoint before auth check for Docker HEALTHCHECK compatibility"
# Metrics
duration: 2min
completed: 2026-02-10
---
# Phase 3 Plan 1: Metrics Collection Backend Summary
**MetricsCollector with counters, ring buffers, and active request tracking, exposed via 3 RPC handlers and a /health HTTP endpoint, and wired into the agent.send flow**
## Performance
- **Duration:** ~2 min
- **Started:** 2026-02-10T05:27:59Z
- **Completed:** 2026-02-10T05:29:33Z
- **Tasks:** 2/2
- **Files modified:** 6
## Accomplishments
- MetricsCollector class tracking messages processed, errors, active requests, model call latency, and event stream
- Three new RPC handlers (system.metrics, system.events, system.activeRequests) for dashboard consumption
- GET /health unauthenticated endpoint returning JSON status for Docker HEALTHCHECK
- Agent request flow records metrics: message counts, error events, tool failure events, active request tracking
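
The ordering behind the unauthenticated health check can be sketched as below. This is a minimal illustration, not the code in `src/gateway/server.ts`: the `SimpleRequest` and `SimpleResponse` shapes, the `routeRequest` function, the token comparison, and the `{ status: "ok" }` body are all assumptions made for the example.

```typescript
// Hypothetical request/response shapes, for illustration only.
interface SimpleRequest {
  method: string;
  path: string;
  authToken?: string;
}

interface SimpleResponse {
  status: number;
  body: string;
}

function routeRequest(req: SimpleRequest, expectedToken: string): SimpleResponse {
  // The health check is answered before any auth check, so a probe such as
  // Docker HEALTHCHECK needs no credentials. The JSON body is an assumed shape.
  if (req.method === "GET" && req.path === "/health") {
    return { status: 200, body: JSON.stringify({ status: "ok" }) };
  }

  // Every other route requires a matching token.
  if (req.authToken !== expectedToken) {
    return { status: 401, body: JSON.stringify({ error: "unauthorized" }) };
  }

  // Authenticated routes would be dispatched here; none exist in this sketch.
  return { status: 404, body: JSON.stringify({ error: "not found" }) };
}
```

Placing the `/health` branch first keeps the probe path independent of whatever authentication scheme the gateway uses.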
## Task Commits
Each task was committed atomically:
1. **Task 1: Create MetricsCollector and wire into gateway** - `bd1880a` (feat)
2. **Task 2: Hook metrics recording into agent request flow** - `a0feff9` (feat)
## Files Created/Modified
- `src/gateway/metrics.ts` - MetricsCollector class with counters, ring buffers, active request map, snapshot method
- `src/gateway/metrics.test.ts` - 20 tests covering counters, ring buffer limits, event filtering, active request tracking, snapshot shape
- `src/gateway/server.ts` - MetricsCollector creation in constructor, /health HTTP endpoint, metrics callbacks to handlers
- `src/gateway/handlers/system.ts` - system.metrics, system.events, system.activeRequests RPC handlers
- `src/gateway/handlers/agent.ts` - Metrics recording in agent.send: startRequest/endRequest, message/error counters, error events, tool failure events
- `src/gateway/lane-queue.ts` - totalPending() method for queue depth metric
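
As a rough picture of what a collector with counters, a bounded ring buffer, and a snapshot method might look like, consider the sketch below. All names (`RingBuffer`, `MetricsCollectorSketch`, `recordModelCall`, the snapshot fields) are hypothetical and are not taken from `src/gateway/metrics.ts`; only the `shift()` eviction and the default of 200 model calls come from this summary.

```typescript
// Hypothetical sketch; the real MetricsCollector may differ in names and shape.
interface ModelCall {
  model: string;
  latencyMs: number;
}

class RingBuffer<T> {
  private items: T[] = [];
  private capacity: number;

  constructor(capacity: number) {
    this.capacity = capacity;
  }

  push(item: T): void {
    this.items.push(item);
    // FIFO eviction: drop the oldest entry once capacity is exceeded.
    if (this.items.length > this.capacity) {
      this.items.shift();
    }
  }

  toArray(): T[] {
    return this.items.slice();
  }
}

class MetricsCollectorSketch {
  private messagesProcessed = 0;
  private errors = 0;
  private modelCalls: RingBuffer<ModelCall>;

  // Buffer size is configurable via the constructor; 200 is the stated default.
  constructor(maxModelCalls: number = 200) {
    this.modelCalls = new RingBuffer<ModelCall>(maxModelCalls);
  }

  recordMessage(): void {
    this.messagesProcessed += 1;
  }

  recordError(): void {
    this.errors += 1;
  }

  recordModelCall(call: ModelCall): void {
    this.modelCalls.push(call);
  }

  snapshot() {
    const calls = this.modelCalls.toArray();
    const totalLatency = calls.reduce((sum, c) => sum + c.latencyMs, 0);
    return {
      messagesProcessed: this.messagesProcessed,
      errors: this.errors,
      modelCallCount: calls.length,
      avgLatencyMs: calls.length > 0 ? totalLatency / calls.length : 0,
    };
  }
}
```

`shift()` is O(n) on an array, which is a reasonable trade-off at a few hundred entries; a circular index would avoid the copy if the buffers grew much larger.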
## Decisions Made
- MetricsCollector self-contained in GatewayServer constructor — no changes to services.ts needed
- Ring buffer sizes: 200 model calls, 500 events (configurable via constructor)
- Passed MetricsCollector instance directly to agent handler deps instead of individual callbacks — cleaner API
- startRequest called before laneQueue.enqueue to track full queuing + execution duration
- Tool failures recorded as separate error events with tool name context
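
The request-tracking decision above can be sketched as follows. This is an illustration under assumptions: `RequestMetrics`, `AgentDeps`, `handleAgentSend`, and the `enqueue` signature are hypothetical stand-ins, not the actual handler or `LaneQueue` API. What it shows is `startRequest` running before the enqueue call, `endRequest` in a `finally` block, and optional chaining so the handler works when no collector is supplied.

```typescript
// Hypothetical interfaces, for illustration only.
interface RequestMetrics {
  startRequest(id: string): void;
  endRequest(id: string): void;
  recordError(message: string): void;
}

interface AgentDeps {
  enqueue<T>(task: () => Promise<T>): Promise<T>;
  metrics?: RequestMetrics; // optional: handler must work without it
}

async function handleAgentSend<T>(
  deps: AgentDeps,
  requestId: string,
  task: () => Promise<T>,
): Promise<T> {
  // Start tracking before enqueueing so time spent waiting in the queue
  // counts toward the request's measured duration.
  deps.metrics?.startRequest(requestId);
  try {
    return await deps.enqueue(task);
  } catch (err) {
    deps.metrics?.recordError(err instanceof Error ? err.message : String(err));
    throw err;
  } finally {
    // Runs on success and on failure, so no request stays marked active.
    deps.metrics?.endRequest(requestId);
  }
}
```

When `deps.metrics` is undefined, each `?.` call short-circuits, so the only overhead is a property check per call site.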
## Deviations from Plan
None - plan executed exactly as written.
## Issues Encountered
None
## User Setup Required
None - no external service configuration required.
## Next Phase Readiness
- All metrics RPC endpoints ready for Plan 02 (Dashboard UI) to consume
- system.metrics returns snapshot with counters, model call stats, queue depth
- system.events returns filtered/limited events (newest first)
- system.activeRequests returns in-flight request details
- GET /health available for external monitoring integration
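
For the dashboard side, the "filtered/limited events (newest first)" behavior could look roughly like the sketch below. The `MetricsEvent` shape, the `queryEvents` name, and the `type`/`limit` options are assumptions for illustration; the actual `system.events` parameters are not specified in this summary.

```typescript
// Hypothetical event shape and query options, for illustration only.
interface MetricsEvent {
  type: string;
  message: string;
  timestamp: number;
}

interface EventQuery {
  type?: string;
  limit?: number;
}

function queryEvents(events: MetricsEvent[], query: EventQuery = {}): MetricsEvent[] {
  const wantedType = query.type;
  // Optional filter by event type.
  const filtered =
    wantedType === undefined ? events : events.filter((e) => e.type === wantedType);

  // The buffer is assumed to be stored oldest-first; copy, then reverse,
  // so the caller sees the newest events first and the buffer is untouched.
  const newestFirst = filtered.slice().reverse();

  // Optional cap on the number of events returned.
  return query.limit === undefined ? newestFirst : newestFirst.slice(0, query.limit);
}
```

Filtering before applying the limit means a request for the last N errors returns up to N errors even when other event types are more recent.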
## Self-Check: PASSED
All 7 files verified present. Both task commits (bd1880a, a0feff9) verified in git log.
---
*Phase: 03-live-ops-dashboard*
*Completed: 2026-02-10*