Tool spans already carry duration_ms and status, but the metrics layer
only counted them. Expose that data:
- GetTopTools now returns avg/p95 duration and error count per tool.
- Timeseries buckets gain tool_avg_ms / tool_p95_ms (filtered
percentile_cont over tool spans).
- Dashboard Top Tools shows avg latency per tool; the Latency panel,
previously always empty (it read run-level duration that is never
emitted), now plots real tool-span latency (min/avg/p95).
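
The per-tool aggregation above can be sketched in Go as follows. This is an illustrative sketch only, not the actual GetTopTools code: the `span` and `ToolStats` types, the `summarize` helper, and the `"error"` status value are assumptions. `percentileCont` mirrors the linear-interpolation semantics of SQL `percentile_cont`, in memory rather than in the database.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// span is a hypothetical, minimal view of a tool span.
type span struct {
	Tool       string
	DurationMs float64
	Status     string // assumed values: "ok" or "error"
}

// ToolStats is a hypothetical per-tool summary row.
type ToolStats struct {
	Count  int
	Errors int
	AvgMs  float64
	P95Ms  float64
}

// percentileCont interpolates linearly between the two nearest ranks,
// the same way SQL percentile_cont does. It returns NaN for empty input.
func percentileCont(values []float64, p float64) float64 {
	if len(values) == 0 {
		return math.NaN()
	}
	sorted := append([]float64(nil), values...) // do not mutate the caller's slice
	sort.Float64s(sorted)
	pos := p * float64(len(sorted)-1)
	lo := int(math.Floor(pos))
	hi := int(math.Ceil(pos))
	return sorted[lo] + (sorted[hi]-sorted[lo])*(pos-float64(lo))
}

// summarize groups spans by tool and computes count, errors, avg and p95.
func summarize(spans []span) map[string]ToolStats {
	durations := map[string][]float64{}
	stats := map[string]ToolStats{}
	for _, s := range spans {
		durations[s.Tool] = append(durations[s.Tool], s.DurationMs)
		st := stats[s.Tool]
		st.Count++
		if s.Status == "error" {
			st.Errors++
		}
		stats[s.Tool] = st
	}
	for tool, ds := range durations {
		sum := 0.0
		for _, d := range ds {
			sum += d
		}
		st := stats[tool]
		st.AvgMs = sum / float64(len(ds))
		st.P95Ms = percentileCont(ds, 0.95)
		stats[tool] = st
	}
	return stats
}

func main() {
	spans := []span{
		{Tool: "grep", DurationMs: 10, Status: "ok"},
		{Tool: "grep", DurationMs: 30, Status: "error"},
		{Tool: "read", DurationMs: 5, Status: "ok"},
	}
	for tool, st := range summarize(spans) {
		fmt.Printf("%s: %+v\n", tool, st)
	}
}
```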
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Label all agentmon docker-compose services with agentmon.monitor=true
and agentmon.group=agentmon so the swarm-monitor picks them up.
Add a Group field to ServiceSnapshot, probe /healthz for api/web roles,
and render a separate "Agentmon" section below Swarm Services on the
Infrastructure page, with new api and worker card renderers.
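
A rough Go sketch of how the two labels could drive grouping. Only the label keys `agentmon.monitor` and `agentmon.group` and the `Group` field come from this change. The reduced `ServiceSnapshot` shape and the `snapshotFromLabels` and `splitByGroup` helpers are hypothetical, and the real swarm-monitor may work differently.

```go
package main

import "fmt"

// ServiceSnapshot is reduced here to the fields needed for the example;
// the real struct presumably carries more.
type ServiceSnapshot struct {
	Name  string
	Group string
}

// snapshotFromLabels returns a snapshot only for services that opted in
// via agentmon.monitor=true, copying agentmon.group into Group.
func snapshotFromLabels(name string, labels map[string]string) (ServiceSnapshot, bool) {
	if labels["agentmon.monitor"] != "true" {
		return ServiceSnapshot{}, false
	}
	return ServiceSnapshot{Name: name, Group: labels["agentmon.group"]}, true
}

// splitByGroup separates snapshots in the given group from all others,
// so each set can be rendered in its own section.
func splitByGroup(snaps []ServiceSnapshot, group string) (in, rest []ServiceSnapshot) {
	for _, s := range snaps {
		if s.Group == group {
			in = append(in, s)
		} else {
			rest = append(rest, s)
		}
	}
	return in, rest
}

func main() {
	// Hypothetical service names and labels, for illustration only.
	services := map[string]map[string]string{
		"api":    {"agentmon.monitor": "true", "agentmon.group": "agentmon"},
		"worker": {"agentmon.monitor": "true", "agentmon.group": "agentmon"},
		"proxy":  {"agentmon.monitor": "true"},
		"cache":  {},
	}
	var snaps []ServiceSnapshot
	for name, labels := range services {
		if s, ok := snapshotFromLabels(name, labels); ok {
			snaps = append(snaps, s)
		}
	}
	agentmon, swarm := splitByGroup(snaps, "agentmon")
	fmt.Println(len(agentmon), "agentmon services,", len(swarm), "other monitored services")
}
```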
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Active sessions query now finds truly active sessions (started at
any time, with no session.end event) instead of only today's sessions
- Use uPlot setData() for live WS updates instead of destroying
and recreating the chart on every event
- Destroy chart only on window change so it recreates with new scale
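
The new active-session rule from the first bullet, expressed as an in-memory Go sketch. The real change is a database query. The `event` type, the `activeSessions` helper, and the `session.start` event name are assumptions; only `session.end` appears in this change.

```go
package main

import (
	"fmt"
	"sort"
)

// event is a hypothetical, minimal session event.
type event struct {
	SessionID string
	Type      string // "session.start" is assumed; "session.end" is from the change
}

// activeSessions returns the IDs of sessions that have started, at any
// time, and have never emitted session.end. IDs are sorted for stable output.
func activeSessions(events []event) []string {
	started := map[string]bool{}
	ended := map[string]bool{}
	for _, e := range events {
		switch e.Type {
		case "session.start":
			started[e.SessionID] = true
		case "session.end":
			ended[e.SessionID] = true
		}
	}
	var active []string
	for id := range started {
		if !ended[id] {
			active = append(active, id)
		}
	}
	sort.Strings(active)
	return active
}

func main() {
	events := []event{
		{SessionID: "a", Type: "session.start"}, // could have started days ago
		{SessionID: "b", Type: "session.start"},
		{SessionID: "b", Type: "session.end"},
		{SessionID: "c", Type: "session.start"},
	}
	fmt.Println(activeSessions(events))
}
```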
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>