agentmon/docs/plans/2026-03-13-agent-monitoring-design.md
William Valentin 3434db3c59 feat: complete agent monitoring - hook, UI, and backend filter
- Add event_type and framework filters to events query endpoint
- Add /agents SPA route to web-ui server
- Add Agents nav link and route in frontend
- Add agents page CSS (timeline, VM pills, stats panel)
- Build VM status strip, activity timeline, and real-time stats
- Add agentmon hook for OpenClaw (HOOK.md + handler.ts)
- Add docker-compose, Dockerfile, and supporting infra files

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-14 00:26:42 -07:00


# Agent Activity Monitoring via OpenClaw Hooks
**Date:** 2026-03-13
**Status:** Approved
## Goal
Monitor all OpenClaw agent and subagent activity across the three VMs (zap, orb, sun) — tool calls, conversation flow, token usage, session lifecycle, and errors — and display it in a real-time dashboard in the agentmon web UI.
## Architecture
```
┌────────────────────────────────────────────────────────┐
│  VM (zap / orb / sun)                                  │
│                                                        │
│  OpenClaw Gateway                                      │
│    ├── agent loop (messages, tools, sessions)          │
│    └── agentmon-hook (TypeScript)                      │
│         │  listens to: message:received/sent,          │
│         │  tool_result_persist, command:*, session:*   │
│         │                                              │
│         └──── POST /v1/events ─────────────────┐       │
│                                                │       │
└────────────────────────────────────────────────│───────┘
┌────────────────────────────────────────────────────────┐
│  Host                                                  │
│  agentmon ingest-gateway (:8080)                       │
│    → NATS → event-processor → Postgres                 │
│    → query-api → web-ui (new "Agents" page)            │
└────────────────────────────────────────────────────────┘
```
One hook deployed to all three VMs captures everything and ships it to the existing agentmon pipeline. No changes needed to ingest, NATS, or storage.
## Event Mapping
| OpenClaw Event | agentmon Event | What it captures |
|---|---|---|
| `command:new` | `session.start` | Agent session begins |
| `command:stop` / `command:reset` | `session.end` | Session ends |
| `message:received` | `run.start` | Inbound message starts a turn |
| `message:sent` | `run.end` | Agent response completes the turn |
| `tool_result_persist` | `span.start` + `span.end` | Tool call with result |
| `session:compact:before/after` | `span` (kind: `internal`) | Context window management |
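
The table above could be encoded as a lookup, roughly as follows. This is a minimal sketch, not the actual `handler.ts`: the names `MappedEvent`, `EVENT_MAP`, and `mapEvent` are hypothetical, and splitting the compaction span into a start on `before` and an end on `after` is an assumption the table does not spell out.

```typescript
// Sketch only: type and function names are hypothetical.
type AgentmonEventType =
  | "session.start"
  | "session.end"
  | "run.start"
  | "run.end"
  | "span.start"
  | "span.end";

interface MappedEvent {
  type: AgentmonEventType;
  kind?: "internal"; // set only for context-compaction spans
}

const EVENT_MAP = new Map<string, MappedEvent[]>([
  ["command:new", [{ type: "session.start" }]],
  ["command:stop", [{ type: "session.end" }]],
  ["command:reset", [{ type: "session.end" }]],
  ["message:received", [{ type: "run.start" }]],
  ["message:sent", [{ type: "run.end" }]],
  // One OpenClaw event yields both span events, as in the table.
  ["tool_result_persist", [{ type: "span.start" }, { type: "span.end" }]],
  // Assumption: "before" opens the internal span and "after" closes it.
  ["session:compact:before", [{ type: "span.start", kind: "internal" }]],
  ["session:compact:after", [{ type: "span.end", kind: "internal" }]],
]);

// Returns the agentmon events for one OpenClaw event, or [] if unmapped.
function mapEvent(openclawEvent: string): MappedEvent[] {
  return EVENT_MAP.get(openclawEvent) ?? [];
}
```

Returning an empty array for unknown event names lets the caller skip them without a special case.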
### Correlation
- `session_id` = OpenClaw `sessionKey`
- `run_id` = generated UUID per inbound message, carried through to `message:sent`
- `framework` = `"openclaw"`
- `client_id` = VM name (zap / orb / sun)
Token usage and cost are attached via `WithLLMUsage` attributes on `run.end` events when the `message:sent` payload includes usage metadata.
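
The `run_id` hand-off between `message:received` and `message:sent` might look like the sketch below. The `RunTracker` class and its method names are hypothetical, and it assumes at most one in-flight run per session; the field names come from the list above.

```typescript
import { randomUUID } from "node:crypto";

// Correlation fields as listed in this section.
interface Correlation {
  session_id: string;
  run_id: string;
  framework: "openclaw";
  client_id: string;
}

// Hypothetical helper. Assumption: one in-flight run per session.
class RunTracker {
  private readonly clientId: string;
  private readonly openRuns = new Map<string, string>();

  constructor(clientId: string) {
    this.clientId = clientId;
  }

  // message:received starts a run and remembers its generated UUID.
  onMessageReceived(sessionKey: string): Correlation {
    const runId = randomUUID();
    this.openRuns.set(sessionKey, runId);
    return this.build(sessionKey, runId);
  }

  // message:sent closes the run opened by the matching inbound message.
  // Returns undefined when no run is open for the session.
  onMessageSent(sessionKey: string): Correlation | undefined {
    const runId = this.openRuns.get(sessionKey);
    if (runId === undefined) return undefined;
    this.openRuns.delete(sessionKey);
    return this.build(sessionKey, runId);
  }

  private build(sessionKey: string, runId: string): Correlation {
    return {
      session_id: sessionKey,
      run_id: runId,
      framework: "openclaw",
      client_id: this.clientId,
    };
  }
}
```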
## Hook Design
### Directory Structure
```
~/.openclaw/hooks/agentmon/
├── HOOK.md # metadata: events, requirements
├── handler.ts # event capture + HTTP emit
└── package.json # minimal deps
```
### Deployment
SCP the directory to each VM. OpenClaw's hook loader discovers the hook automatically, so no config changes are needed beyond having hooks enabled.
The hook POSTs to the host machine's ingest gateway. The VMs sit on the libvirt bridge (192.168.122.x), so the gateway URL is either set through an environment variable or built from the host's bridge IP.
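
One way to resolve the gateway URL is sketched below. The variable name `AGENTMON_URL` and the function `resolveIngestUrl` are hypothetical; the port and path are taken from the architecture diagram, and the fallback host is passed in because this document does not fix the bridge address.

```typescript
// Hypothetical helper: prefer an explicit env var, else fall back to the
// host's bridge IP on the ingest-gateway port shown in the diagram.
function resolveIngestUrl(
  env: Record<string, string | undefined>,
  fallbackHost: string,
): string {
  const configured = env["AGENTMON_URL"]?.trim(); // assumed variable name
  const base = configured ? configured : `http://${fallbackHost}:8080`;
  // Strip trailing slashes so the path is joined exactly once.
  return `${base.replace(/\/+$/, "")}/v1/events`;
}
```

In the hook this would be called once at load time, for example with `process.env` and whatever address the host holds on the libvirt bridge.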
### Resilience
- Fire-and-forget with a small in-memory buffer (batch up to 10 events or 2s, whichever comes first)
- 500ms timeout on fetch calls — if agentmon is slow, skip and move on
- Events that fail to send are logged locally but not retried
- The hook must never slow down the OpenClaw agent loop
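
The buffering and timeout rules above can be sketched as follows. `EventBuffer` and `postBatch` are hypothetical names, and sending the batch as a JSON array in one request is an assumption about what `/v1/events` accepts.

```typescript
// Hypothetical buffer: flushes at maxBatch events or after maxDelayMs.
class EventBuffer<T> {
  private readonly send: (batch: T[]) => Promise<void>;
  private readonly maxBatch: number;
  private readonly maxDelayMs: number;
  private queue: T[] = [];
  private timer: ReturnType<typeof setTimeout> | undefined;

  constructor(send: (batch: T[]) => Promise<void>, maxBatch = 10, maxDelayMs = 2000) {
    this.send = send;
    this.maxBatch = maxBatch;
    this.maxDelayMs = maxDelayMs;
  }

  push(event: T): void {
    this.queue.push(event);
    if (this.queue.length >= this.maxBatch) {
      this.flush();
    } else if (this.timer === undefined) {
      // The first event of a batch starts the 2s clock.
      this.timer = setTimeout(() => this.flush(), this.maxDelayMs);
    }
  }

  flush(): void {
    if (this.timer !== undefined) {
      clearTimeout(this.timer);
      this.timer = undefined;
    }
    if (this.queue.length === 0) return;
    const batch = this.queue;
    this.queue = [];
    // Fire-and-forget: failures are logged locally, never retried or rethrown.
    void (async () => {
      try {
        await this.send(batch);
      } catch (err) {
        console.error("agentmon: dropped batch of", batch.length, err);
      }
    })();
  }
}

// Assumption: the ingest endpoint accepts a JSON array of events.
async function postBatch(url: string, batch: unknown[]): Promise<void> {
  const res = await fetch(url, {
    method: "POST",
    headers: { "content-type": "application/json" },
    body: JSON.stringify(batch),
    signal: AbortSignal.timeout(500), // give up after 500ms
  });
  if (!res.ok) throw new Error(`ingest responded ${res.status}`);
}
```

Because `push` and `flush` return immediately and the send runs detached, a slow or unreachable gateway costs the agent loop nothing beyond queueing the event.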
## Error Handling
### In the hook
- All HTTP POSTs wrapped in try/catch — never throw, never block
- Malformed event payloads (missing `sessionKey`, etc.) are dropped with only a debug-level log entry
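
A guard along these lines would cover both rules. The names `hasSessionKey` and `safely` are hypothetical, and the payload shape is assumed to be a plain object carrying a string `sessionKey`.

```typescript
// True only for object payloads with a non-empty string sessionKey.
function hasSessionKey(payload: unknown): payload is { sessionKey: string } {
  if (typeof payload !== "object" || payload === null) return false;
  const key = (payload as Record<string, unknown>)["sessionKey"];
  return typeof key === "string" && key.length > 0;
}

// Wraps a handler so that it never throws into the caller.
function safely<T>(handler: (payload: T) => void): (payload: T) => void {
  return (payload: T): void => {
    try {
      handler(payload);
    } catch (err) {
      console.debug("agentmon: handler error ignored", err);
    }
  };
}
```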
### In the pipeline
- Ingest gateway deduplicates by event ID — safe if a hook sends twice
- Events with `framework: "openclaw"` but missing correlation IDs get stored but won't appear in the agents timeline
### Edge cases
- VM reboots mid-session: no `session.end` emitted — UI shows session as "ongoing" until a new `command:new` arrives
- OpenClaw compacts context before hook fires: `session:compact:after` still fires, captured as internal span
- Network partition between VM and host: events silently lost, no backfill — acceptable for monitoring
## UI — Agents Page
### Layout
A live activity dashboard at `/agents` with three sections:
1. **Top strip**: Three VM pill indicators (zap / orb / sun) showing online/offline with a subtle pulse when active
2. **Activity timeline**: Vertical feed of events across all agents — messages, tool calls, errors — with VM name color-coded, monospace timestamps, and collapsible tool call detail rows. Real-time via existing WebSocket.
3. **Side stats panel**: Aggregate metrics — messages/hour, tool calls today, error rate, most-used tools
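
The side-panel metrics could be derived from a list of events roughly as shown here. This is a sketch under several assumptions: the `tool` and `error` fields are hypothetical, a "message" is counted as a `run.start`, a "tool call" as a `span.end` with a tool name, and "today" is taken in UTC.

```typescript
interface TimelineEvent {
  type: string;      // agentmon event type, e.g. "run.start" or "span.end"
  timestamp: number; // epoch milliseconds
  tool?: string;     // hypothetical: tool name on span events
  error?: boolean;   // hypothetical: true when the event records a failure
}

interface AgentStats {
  messagesLastHour: number;
  toolCallsToday: number;
  errorRate: number;  // fraction of all events flagged as errors
  topTools: string[]; // most-used first, ties broken by name
}

function computeStats(events: TimelineEvent[], now: number, topN = 3): AgentStats {
  const hourAgo = now - 60 * 60 * 1000;
  const midnight = new Date(now);
  midnight.setUTCHours(0, 0, 0, 0); // "today" is taken in UTC here
  const dayStart = midnight.getTime();

  let messagesLastHour = 0;
  let toolCallsToday = 0;
  let errors = 0;
  const toolCounts = new Map<string, number>();

  for (const e of events) {
    if (e.error) errors++;
    if (e.type === "run.start" && e.timestamp >= hourAgo) messagesLastHour++;
    if (e.type === "span.end" && e.tool !== undefined) {
      if (e.timestamp >= dayStart) toolCallsToday++;
      toolCounts.set(e.tool, (toolCounts.get(e.tool) ?? 0) + 1);
    }
  }

  const topTools = Array.from(toolCounts.entries())
    .sort((a, b) => b[1] - a[1] || a[0].localeCompare(b[0]))
    .slice(0, topN)
    .map(([name]) => name);

  return {
    messagesLastHour,
    toolCallsToday,
    errorRate: events.length === 0 ? 0 : errors / events.length,
    topTools,
  };
}
```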
### Aesthetic
Matches the refined dark theme already in place:
- Timeline cards with glassmorphism
- Color-coded VM badges
- Monospace timestamps (Fira Code)
- Syne display font for headings
- Fade-in animations on new events
- Status pill badges consistent with existing design system
## Implementation Plan
### Phase 1: OpenClaw Hook
1. Create hook directory structure (`HOOK.md`, `handler.ts`)
2. Implement event-to-agentmon mapper — translate each OpenClaw event type to the agentmon envelope schema
3. HTTP emitter with buffering (batch up to 10 events or 2s, whichever comes first) and 500ms timeout
4. Unit test the mapper logic locally
### Phase 2: Agentmon UI — Agents Page
5. Add `/agents` route to the SPA router in `app.js`
6. Add "Agents" nav link in header
7. Build the top strip — three VM status pills pulling from existing `openclaw.snapshot` data
8. Build the activity timeline — subscribe to WebSocket, filter for `framework: "openclaw"` events, render as vertical feed with collapsible tool call details
9. Build the side stats panel — aggregate counts from the query API (messages/hour, tool calls, error rate, top tools)
10. Style with the refined dark aesthetic — glassmorphism timeline cards, color-coded VM badges, monospace timestamps, fade-in animations
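
The timeline filter from step 8 might reduce to a small predicate like the one below. It assumes each WebSocket message is one JSON-encoded event envelope, which this document does not confirm; `parseAgentEvent` is a hypothetical name. Events without a `session_id` are rejected, in line with the rule that uncorrelated events do not appear in the timeline.

```typescript
type Envelope = Record<string, unknown>;

// Returns the event if it belongs in the agents timeline, else undefined.
function parseAgentEvent(raw: string): Envelope | undefined {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return undefined; // not JSON, ignore
  }
  if (typeof data !== "object" || data === null || Array.isArray(data)) {
    return undefined;
  }
  const event = data as Envelope;
  if (event["framework"] !== "openclaw") return undefined;
  // Events without a correlation ID are stored but not shown.
  const sessionId = event["session_id"];
  if (typeof sessionId !== "string" || sessionId === "") return undefined;
  return event;
}
```

A `message` listener on the existing socket would pass each message's data through this function and render only the defined results.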
### Phase 3: Deploy
11. SCP hook to all three VMs, verify auto-discovery
12. Send a test message to one agent, confirm events flow end-to-end
## Not in Scope (Future)
- Token/cost dashboard (needs usage data verification in `message:sent` payloads)
- Historical analytics and aggregation queries
- Hook auto-deployment via openclaw-monitor
- Alerting on error rate spikes