docs(03-01): complete metrics collection backend plan
- SUMMARY.md with task commits, decisions, self-check
- STATE.md updated: phase 3 in_progress, 1/2 plans, test count 1107
@@ -0,0 +1,112 @@
---
phase: 03-live-ops-dashboard
plan: 01
subsystem: gateway
tags: [metrics, ring-buffer, rpc, health-endpoint, monitoring]

# Dependency graph
requires:
  - phase: 01-daemon-decomposition
    provides: GatewayServer, LaneQueue, handler architecture, session bridge
provides:
  - MetricsCollector class with counters, model call ring buffer, event ring buffer, active request tracking
  - system.metrics, system.events, system.activeRequests RPC handlers
  - GET /health unauthenticated HTTP endpoint
  - Metrics recording wired into agent.send request flow
affects: [03-02-PLAN (dashboard UI consumes these RPC methods)]

# Tech tracking
tech-stack:
  added: []
  patterns: [ring-buffer with FIFO eviction, optional-chaining metrics injection, gauge counters]

key-files:
  created:
    - src/gateway/metrics.ts
    - src/gateway/metrics.test.ts
  modified:
    - src/gateway/server.ts
    - src/gateway/handlers/system.ts
    - src/gateway/handlers/agent.ts
    - src/gateway/lane-queue.ts

key-decisions:
  - "MetricsCollector created inside GatewayServer constructor (self-contained, no services.ts changes needed)"
  - "Ring buffers: 200 model calls, 500 events; reasonable for dashboard display without unbounded growth"
  - "Metrics passed to agent handler as optional MetricsCollector instance (not individual callbacks)"
  - "startRequest called before laneQueue.enqueue, endRequest in finally block; tracks full queuing + execution time"

patterns-established:
  - "Optional metrics injection: deps.metrics?.method() pattern for zero-cost when metrics disabled"
  - "Ring buffer with shift() eviction for bounded memory in long-running daemon"
  - "Unauthenticated /health endpoint before auth check for Docker HEALTHCHECK compatibility"

# Metrics
duration: 2min
completed: 2026-02-10
---

# Phase 3 Plan 1: Metrics Collection Backend Summary

**MetricsCollector with counters, ring buffers, and active request tracking, exposed via 3 RPC handlers and /health HTTP endpoint, wired into agent.send flow**

## Performance

- **Duration:** ~2 min
- **Started:** 2026-02-10T05:27:59Z
- **Completed:** 2026-02-10T05:29:33Z
- **Tasks:** 2/2
- **Files modified:** 6

## Accomplishments
- MetricsCollector class tracking messages processed, errors, active requests, model call latency, and event stream
- Three new RPC handlers (system.metrics, system.events, system.activeRequests) for dashboard consumption
- GET /health unauthenticated endpoint returning JSON status for Docker HEALTHCHECK
- Agent request flow records metrics: message counts, error events, tool failure events, active request tracking
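
The request-flow wiring described above can be sketched roughly as follows. This is an illustration under assumptions, not the code in `src/gateway/handlers/agent.ts`: `startRequest` and `endRequest` are named in this summary, while `RequestMetrics`, `AgentDeps`, `handleSend`, `recordMessage`, and `recordError` are hypothetical names chosen for the example.

```typescript
// Hypothetical shapes: the real MetricsCollector and handler deps may differ.
interface RequestMetrics {
  startRequest(id: string): void;
  endRequest(id: string): void;
  recordMessage(): void; // assumed name for the message counter
  recordError(message: string): void; // assumed name for the error event
}

interface AgentDeps {
  // Stand-in for laneQueue.enqueue: runs the job when its turn comes.
  enqueue<T>(job: () => Promise<T>): Promise<T>;
  // Optional: leaving this undefined disables metrics entirely.
  metrics?: RequestMetrics;
}

async function handleSend(
  deps: AgentDeps,
  requestId: string,
  run: () => Promise<string>,
): Promise<string> {
  // Start tracking before enqueueing so time spent waiting in the queue counts.
  deps.metrics?.startRequest(requestId);
  try {
    const reply = await deps.enqueue(run);
    deps.metrics?.recordMessage();
    return reply;
  } catch (err) {
    deps.metrics?.recordError(err instanceof Error ? err.message : String(err));
    throw err;
  } finally {
    // Runs on success and on failure, so no request stays marked active.
    deps.metrics?.endRequest(requestId);
  }
}
```

With `metrics` left undefined, each `deps.metrics?.…` call short-circuits to a no-op, which is the optional-injection pattern listed under patterns-established.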

## Task Commits

Each task was committed atomically:

1. **Task 1: Create MetricsCollector and wire into gateway** - `bd1880a` (feat)
2. **Task 2: Hook metrics recording into agent request flow** - `a0feff9` (feat)

## Files Created/Modified
- `src/gateway/metrics.ts` - MetricsCollector class with counters, ring buffers, active request map, snapshot method
- `src/gateway/metrics.test.ts` - 20 tests covering counters, ring buffer limits, event filtering, active request tracking, snapshot shape
- `src/gateway/server.ts` - MetricsCollector creation in constructor, /health HTTP endpoint, metrics callbacks to handlers
- `src/gateway/handlers/system.ts` - system.metrics, system.events, system.activeRequests RPC handlers
- `src/gateway/handlers/agent.ts` - Metrics recording in agent.send: startRequest/endRequest, message/error counters, error events, tool failure events
- `src/gateway/lane-queue.ts` - totalPending() method for queue depth metric
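
As a rough illustration of the queue depth metric, a `totalPending()` method could sum the waiting jobs across all lanes. The sketch below is an assumption about the shape of such a method: `LaneQueueSketch`, its per-lane `Map`, and `runNext` are hypothetical and do not reflect the actual `LaneQueue` internals.

```typescript
// Hypothetical stand-in for LaneQueue: only pending-count bookkeeping is shown.
class LaneQueueSketch {
  private readonly lanes = new Map<string, Array<() => void>>();

  // Add a job to the end of the named lane, creating the lane if needed.
  enqueue(lane: string, job: () => void): void {
    const queue = this.lanes.get(lane) ?? [];
    queue.push(job);
    this.lanes.set(lane, queue);
  }

  // Run and remove the oldest job in a lane; returns false if nothing is waiting.
  runNext(lane: string): boolean {
    const job = this.lanes.get(lane)?.shift();
    if (!job) return false;
    job();
    return true;
  }

  // Queue depth: number of jobs still waiting across every lane.
  totalPending(): number {
    let total = 0;
    this.lanes.forEach((queue) => {
      total += queue.length;
    });
    return total;
  }
}
```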

## Decisions Made
- MetricsCollector is self-contained in the GatewayServer constructor, so no changes to services.ts were needed
- Ring buffer sizes: 200 model calls, 500 events (configurable via constructor)
- Passed the MetricsCollector instance directly to agent handler deps instead of individual callbacks, for a cleaner API
- startRequest is called before laneQueue.enqueue to track full queuing + execution duration
- Tool failures are recorded as separate error events with tool name context
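
The ring buffer decision can be illustrated with a minimal sketch of a bounded buffer that evicts with `shift()`. It assumes nothing about the real `metrics.ts` beyond what this summary states: the `RingBuffer` class, its `recent` method, and the filter parameter are hypothetical names for the example.

```typescript
// Hypothetical sketch: names and shapes are assumptions, not the real metrics.ts.
class RingBuffer<T> {
  private readonly items: T[] = [];
  private readonly capacity: number;

  constructor(capacity: number) {
    if (!Number.isInteger(capacity) || capacity <= 0) {
      throw new RangeError("capacity must be a positive integer");
    }
    this.capacity = capacity;
  }

  // Append an item; once over capacity, drop the oldest entry (FIFO eviction).
  push(item: T): void {
    this.items.push(item);
    if (this.items.length > this.capacity) {
      this.items.shift();
    }
  }

  // Return up to `limit` matching items, newest first.
  recent(limit: number, keep: (item: T) => boolean = () => true): T[] {
    const out: T[] = [];
    for (const item of [...this.items].reverse()) {
      if (out.length >= limit) break;
      if (keep(item)) out.push(item);
    }
    return out;
  }

  get size(): number {
    return this.items.length;
  }
}
```

Under this sketch, the sizes chosen above would correspond to `new RingBuffer(200)` for model calls and `new RingBuffer(500)` for events. `shift()` is linear in the array length, which is a reasonable trade-off at a few hundred entries.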

## Deviations from Plan

None - plan executed exactly as written.

## Issues Encountered
None

## User Setup Required
None - no external service configuration required.

## Next Phase Readiness
- All metrics RPC endpoints ready for Plan 02 (Dashboard UI) to consume
- system.metrics returns snapshot with counters, model call stats, queue depth
- system.events returns filtered/limited events (newest first)
- system.activeRequests returns in-flight request details
- GET /health available for external monitoring integration
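
The ordering that keeps `GET /health` unauthenticated can be sketched as a pure routing function. This is an assumption-laden illustration and not the code in `src/gateway/server.ts`: the `Req` and `Res` shapes, the `route` function, the token comparison, and the JSON fields (`status`, `uptimeMs`) are all hypothetical.

```typescript
// Hypothetical request/response shapes; the real server works with Node HTTP objects.
interface Req {
  method: string;
  path: string;
  authToken?: string;
}

interface Res {
  status: number;
  body: string;
}

function route(req: Req, expectedToken: string, startedAt: number, now: number): Res {
  // Health check comes first so it succeeds without credentials,
  // which is what a Docker HEALTHCHECK probe needs.
  if (req.method === "GET" && req.path === "/health") {
    return {
      status: 200,
      body: JSON.stringify({ status: "ok", uptimeMs: now - startedAt }),
    };
  }

  // Everything else must pass the auth check.
  if (req.authToken !== expectedToken) {
    return { status: 401, body: JSON.stringify({ error: "unauthorized" }) };
  }

  // Placeholder for the authenticated handlers (RPC dispatch and so on).
  return { status: 200, body: JSON.stringify({ routed: true }) };
}
```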

## Self-Check: PASSED

All 7 files verified present. Both task commits (bd1880a, a0feff9) verified in git log.

---
*Phase: 03-live-ops-dashboard*
*Completed: 2026-02-10*