agentmon/docs/plans/2026-03-13-agent-monitoring-design.md
William Valentin 3434db3c59 feat: complete agent monitoring - hook, UI, and backend filter
- Add event_type and framework filters to events query endpoint
- Add /agents SPA route to web-ui server
- Add Agents nav link and route in frontend
- Add agents page CSS (timeline, VM pills, stats panel)
- Build VM status strip, activity timeline, and real-time stats
- Add agentmon hook for OpenClaw (HOOK.md + handler.ts)
- Add docker-compose, Dockerfile, and supporting infra files

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-14 00:26:42 -07:00

Agent Activity Monitoring via OpenClaw Hooks

Date: 2026-03-13
Status: Approved

Goal

Monitor all OpenClaw agent and subagent activity across the three VMs (zap, orb, sun) — tool calls, conversation flow, token usage, session lifecycle, and errors — and display it in a real-time dashboard in the agentmon web UI.

Architecture

┌─────────────────────────────────────────────────────┐
│  VM (zap / orb / sun)                               │
│                                                     │
│  OpenClaw Gateway                                   │
│    ├── agent loop (messages, tools, sessions)        │
│    └── agentmon-hook (TypeScript)                    │
│          │  listens to: message:received/sent,       │
│          │  tool_result_persist, command:*, session:* │
│          │                                           │
│          └──── POST /v1/events ─────────────────┐   │
│                                                  │   │
└──────────────────────────────────────────────────│───┘
                                                   │
                                                   ▼
┌──────────────────────────────────────────────────────┐
│  Host                                                │
│  agentmon ingest-gateway (:8080)                     │
│    → NATS → event-processor → Postgres               │
│    → query-api → web-ui (new "Agents" page)          │
└──────────────────────────────────────────────────────┘

One hook, deployed to all three VMs, captures everything and ships it to the existing agentmon pipeline. No changes are needed to ingest, NATS, or storage.

Event Mapping

OpenClaw Event                 agentmon Event          What it captures
command:new                    session.start           Agent session begins
command:stop / command:reset   session.end             Session ends
message:received               run.start               Inbound message starts a turn
message:sent                   run.end                 Agent response completes the turn
tool_result_persist            span.start + span.end   Tool call with result
session:compact:before/after   span (kind: internal)   Context window management
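
The mapping above could be expressed as a lookup table inside handler.ts. What follows is a minimal sketch rather than the actual handler: `MAPPING`, `MappedEvent`, and `mapEvent` are hypothetical names, and splitting session:compact:before/after into span.start/span.end is an assumption (the table only says the pair is recorded as an internal span).

```typescript
// Sketch only: type and function names are illustrative assumptions.
type AgentmonEventType =
  | "session.start"
  | "session.end"
  | "run.start"
  | "run.end"
  | "span.start"
  | "span.end";

interface MappedEvent {
  event_type: AgentmonEventType;
  span_kind?: "internal"; // only set for context-compaction spans
}

// One OpenClaw event can expand to several agentmon events.
const MAPPING = new Map<string, MappedEvent[]>([
  ["command:new", [{ event_type: "session.start" }]],
  ["command:stop", [{ event_type: "session.end" }]],
  ["command:reset", [{ event_type: "session.end" }]],
  ["message:received", [{ event_type: "run.start" }]],
  ["message:sent", [{ event_type: "run.end" }]],
  ["tool_result_persist", [{ event_type: "span.start" }, { event_type: "span.end" }]],
  // Assumption: "before" opens the internal span and "after" closes it.
  ["session:compact:before", [{ event_type: "span.start", span_kind: "internal" }]],
  ["session:compact:after", [{ event_type: "span.end", span_kind: "internal" }]],
]);

// Unknown OpenClaw events map to an empty list, so the caller emits nothing.
function mapEvent(openclawEvent: string): MappedEvent[] {
  return MAPPING.get(openclawEvent) ?? [];
}
```

Returning a list keeps the caller uniform: it loops over whatever comes back, whether that is zero, one, or two agentmon events.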

Correlation

  • session_id = OpenClaw sessionKey
  • run_id = generated UUID per inbound message, carried through to message:sent
  • framework = "openclaw"
  • client_id = VM name (zap / orb / sun)

Token usage and cost are attached via WithLLMUsage attributes on run.end events when the message:sent payload includes usage metadata.
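
To make the run_id rule concrete, here is a rough sketch of how the hook might carry a generated UUID from message:received through to message:sent. `RunTracker`, `startRun`, and `endRun` are hypothetical names, and the sketch assumes at most one in-flight run per session and a Node.js runtime (for `randomUUID` from `node:crypto`).

```typescript
import { randomUUID } from "node:crypto";

// Correlation fields as listed above; the surrounding envelope is not shown.
interface Correlation {
  session_id: string;
  run_id: string;
  framework: "openclaw";
  client_id: string;
}

// Hypothetical helper: remembers the open run for each session.
class RunTracker {
  private readonly activeRuns = new Map<string, string>();
  private readonly clientId: string;
  private readonly newId: () => string;

  constructor(clientId: string, newId: () => string = () => randomUUID()) {
    this.clientId = clientId;
    this.newId = newId;
  }

  // Call on message:received. Generates a fresh run_id for the turn.
  startRun(sessionKey: string): Correlation {
    const runId = this.newId();
    this.activeRuns.set(sessionKey, runId);
    return this.correlate(sessionKey, runId);
  }

  // Call on message:sent. Returns undefined if no run was open.
  endRun(sessionKey: string): Correlation | undefined {
    const runId = this.activeRuns.get(sessionKey);
    if (runId === undefined) return undefined;
    this.activeRuns.delete(sessionKey);
    return this.correlate(sessionKey, runId);
  }

  private correlate(sessionKey: string, runId: string): Correlation {
    return {
      session_id: sessionKey,
      run_id: runId,
      framework: "openclaw",
      client_id: this.clientId,
    };
  }
}
```

The ID generator is injectable so the mapper can be unit tested with deterministic IDs.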

Hook Design

Directory Structure

~/.openclaw/hooks/agentmon/
├── HOOK.md          # metadata: events, requirements
├── handler.ts       # event capture + HTTP emit
└── package.json     # minimal deps

Deployment

SCP the directory to each VM. OpenClaw's hook loader discovers the hook automatically, so no config changes are needed beyond having hooks enabled.

The hook POSTs to the host machine's ingest gateway. VMs are on the libvirt bridge (192.168.122.x), so the gateway URL is either read from an environment variable or built from the host's bridge IP.
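
One way the URL lookup might work is sketched below. The variable name `AGENTMON_URL` and the fallback address `192.168.122.1` are assumptions (that address is commonly the host on libvirt's default network, but this document does not state it); port 8080 and the /v1/events path come from the architecture diagram.

```typescript
// Assumption: the host is reachable at the usual libvirt default-network address.
const DEFAULT_GATEWAY_BASE = "http://192.168.122.1:8080";

// Resolve the ingest endpoint from an env-like object (for example process.env).
// AGENTMON_URL is a hypothetical variable name, not one defined by this design.
function resolveGatewayUrl(env: Record<string, string | undefined>): string {
  const configured = env["AGENTMON_URL"]?.trim();
  const base = configured ? configured : DEFAULT_GATEWAY_BASE;
  try {
    return new URL("/v1/events", base).toString();
  } catch {
    // A malformed override must not break the hook; fall back to the default.
    return new URL("/v1/events", DEFAULT_GATEWAY_BASE).toString();
  }
}
```

In the hook this would be called once at load time, for example as `resolveGatewayUrl(process.env)`.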

Resilience

  • Fire-and-forget with a small in-memory buffer (batch up to 10 events or 2s, whichever comes first)
  • 500ms timeout on fetch calls — if agentmon is slow, skip and move on
  • Events that fail to send are logged locally but not retried
  • The hook must never slow down the OpenClaw agent loop
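
The buffering and timeout rules above might look roughly like this. It is a sketch under assumptions: `EventBuffer` and `postBatch` are hypothetical names, the `{ events: [...] }` request body is a guess at what the ingest gateway accepts, and the runtime is assumed to provide global `fetch` and `AbortController` (Node.js 18 or later).

```typescript
// Hypothetical event shape; the real agentmon envelope may differ.
interface BufferedEvent {
  event_id: string;
  event_type: string;
  [key: string]: unknown;
}

type SendBatch = (batch: BufferedEvent[]) => void;

// Flushes when maxBatch events are queued or maxWaitMs after the first queued event.
class EventBuffer {
  private queue: BufferedEvent[] = [];
  private timer: ReturnType<typeof setTimeout> | undefined;
  private readonly send: SendBatch;
  private readonly maxBatch: number;
  private readonly maxWaitMs: number;

  constructor(send: SendBatch, maxBatch = 10, maxWaitMs = 2000) {
    this.send = send;
    this.maxBatch = maxBatch;
    this.maxWaitMs = maxWaitMs;
  }

  push(event: BufferedEvent): void {
    this.queue.push(event);
    if (this.queue.length >= this.maxBatch) {
      this.flush();
    } else if (this.timer === undefined) {
      this.timer = setTimeout(() => this.flush(), this.maxWaitMs);
    }
  }

  flush(): void {
    if (this.timer !== undefined) {
      clearTimeout(this.timer);
      this.timer = undefined;
    }
    if (this.queue.length === 0) return;
    const batch = this.queue;
    this.queue = [];
    try {
      this.send(batch);
    } catch {
      // Never let a sender failure reach the agent loop; the batch is dropped.
    }
  }
}

// POST one batch with a hard timeout. Failures are logged, never retried or thrown.
async function postBatch(url: string, batch: BufferedEvent[], timeoutMs = 500): Promise<boolean> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    const res = await fetch(url, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ events: batch }), // assumed body shape
      signal: controller.signal,
    });
    if (!res.ok) console.error(`agentmon: gateway responded ${res.status}`);
    return res.ok;
  } catch (err) {
    console.error(`agentmon: dropped ${batch.length} event(s): ${String(err)}`);
    return false;
  } finally {
    clearTimeout(timer);
  }
}
```

Wiring the two together as `new EventBuffer((batch) => { void postBatch(url, batch); })` keeps the send fire-and-forget: the buffer never awaits the request, so a slow gateway costs the agent loop nothing beyond the enqueue.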

Error Handling

In the hook

  • All HTTP POSTs wrapped in try/catch — never throw, never block
  • Malformed event payloads (missing sessionKey, etc.) silently dropped with debug log
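
A guard along these lines would cover both rules. This is a sketch only: `isValidPayload` and `safeHandle` are hypothetical names, and sessionKey is the only field checked because it is the only required field this document names.

```typescript
// Minimal shape the hook needs before it can correlate anything.
interface HookPayload {
  sessionKey: string;
  [key: string]: unknown;
}

function isValidPayload(payload: unknown): payload is HookPayload {
  if (typeof payload !== "object" || payload === null) return false;
  const sessionKey = (payload as Record<string, unknown>)["sessionKey"];
  return typeof sessionKey === "string" && sessionKey !== "";
}

// Wraps the emit path so that nothing thrown inside it reaches OpenClaw.
// Returns true when the payload was handed to `emit` without error.
function safeHandle(
  payload: unknown,
  emit: (payload: HookPayload) => void,
  debug: (message: string) => void = () => {},
): boolean {
  try {
    if (!isValidPayload(payload)) {
      debug("agentmon: dropping malformed payload (missing sessionKey)");
      return false;
    }
    emit(payload);
    return true;
  } catch (err) {
    debug(`agentmon: emit failed: ${String(err)}`);
    return false;
  }
}
```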

In the pipeline

  • Ingest gateway deduplicates by event ID — safe if a hook sends twice
  • Events with framework: "openclaw" but missing correlation IDs are stored but do not appear in the agents timeline
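
Deduplication by event ID only helps if the hook assigns each event its ID once, when the event is created, rather than at send time. The toy model below illustrates the idea; it is not the ingest gateway's code (which this document does not show), and `Deduplicator` is a hypothetical name. A real implementation would also need to bound or expire the set of seen IDs.

```typescript
// Illustrative in-memory model of "deduplicate by event ID".
class Deduplicator {
  private readonly seen = new Set<string>();

  // Returns true the first time an ID is seen, false for any repeat.
  accept(eventId: string): boolean {
    if (this.seen.has(eventId)) return false;
    this.seen.add(eventId);
    return true;
  }
}
```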

Edge cases

  • VM reboots mid-session: no session.end emitted — UI shows session as "ongoing" until a new command:new arrives
  • OpenClaw compacts context before hook fires: session:compact:after still fires, captured as internal span
  • Network partition between VM and host: events silently lost, no backfill — acceptable for monitoring
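
The reboot case implies that the UI infers session state from whatever events did arrive. A possible derivation is sketched below, assuming one active session per VM, so that a new session.start on the same client_id implicitly closes the previous one. That one-per-VM rule and the `sessionStatuses` name are assumptions, not part of this design.

```typescript
interface SessionEvent {
  event_type: string;
  session_id: string;
  client_id: string; // VM name
}

type SessionStatus = "ongoing" | "ended";

// Events must be supplied in arrival order.
function sessionStatuses(events: SessionEvent[]): Map<string, SessionStatus> {
  const status = new Map<string, SessionStatus>();
  const openByClient = new Map<string, string>(); // client_id -> open session_id

  for (const e of events) {
    if (e.event_type === "session.start") {
      // A previous session with no session.end (e.g. VM reboot) is closed here.
      const previous = openByClient.get(e.client_id);
      if (previous !== undefined && previous !== e.session_id) {
        status.set(previous, "ended");
      }
      status.set(e.session_id, "ongoing");
      openByClient.set(e.client_id, e.session_id);
    } else if (e.event_type === "session.end") {
      status.set(e.session_id, "ended");
      if (openByClient.get(e.client_id) === e.session_id) {
        openByClient.delete(e.client_id);
      }
    }
  }
  return status;
}
```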

UI — Agents Page

Layout

A live activity dashboard at /agents with three sections:

  1. Top strip: Three VM pill indicators (zap / orb / sun) showing online/offline with a subtle pulse when active
  2. Activity timeline: Vertical feed of events across all agents — messages, tool calls, errors — with VM name color-coded, monospace timestamps, and collapsible tool call detail rows. Real-time via existing WebSocket.
  3. Side stats panel: Aggregate metrics — messages/hour, tool calls today, error rate, most-used tools
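
For illustration, the side-panel aggregates could be derived from a list of events as sketched below, wherever that computation ends up living. The field names `ts`, `tool_name`, and `status` are hypothetical, and so are the definitions of "error rate" (errored events over all events) and "tool call" (a span.end carrying a tool name).

```typescript
interface StatEvent {
  event_type: string;
  ts: number; // epoch milliseconds
  tool_name?: string;
  status?: "ok" | "error";
}

interface AgentStats {
  messagesPerHour: number; // run.start events in the hour before `now`
  toolCallsToday: number; // tool spans ended at or after `dayStart`
  errorRate: number; // fraction of all events with status "error"
  topTools: Array<[string, number]>; // [tool name, call count], most used first
}

const HOUR_MS = 60 * 60 * 1000;

function computeStats(events: StatEvent[], now: number, dayStart: number, topN = 5): AgentStats {
  let messages = 0;
  let toolCalls = 0;
  let errors = 0;
  const toolCounts = new Map<string, number>();

  for (const e of events) {
    if (e.status === "error") errors++;
    if (e.event_type === "run.start" && e.ts > now - HOUR_MS && e.ts <= now) messages++;
    if (e.event_type === "span.end" && e.tool_name !== undefined) {
      toolCounts.set(e.tool_name, (toolCounts.get(e.tool_name) ?? 0) + 1);
      if (e.ts >= dayStart) toolCalls++;
    }
  }

  // Sort by count descending, then by name so ties are deterministic.
  const topTools = Array.from(toolCounts.entries())
    .sort((a, b) => b[1] - a[1] || a[0].localeCompare(b[0]))
    .slice(0, topN);

  return {
    messagesPerHour: messages,
    toolCallsToday: toolCalls,
    errorRate: events.length === 0 ? 0 : errors / events.length,
    topTools,
  };
}
```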

Aesthetic

Matches the refined dark theme already in place:

  • Timeline cards with glassmorphism
  • Color-coded VM badges
  • Monospace timestamps (Fira Code)
  • Syne display font for headings
  • Fade-in animations on new events
  • Status pill badges consistent with existing design system

Implementation Plan

Phase 1: OpenClaw Hook

  1. Create hook directory structure (HOOK.md, handler.ts)
  2. Implement event-to-agentmon mapper — translate each OpenClaw event type to the agentmon envelope schema
  3. HTTP emitter with buffering (flush after 10 events or 2s, whichever comes first) and a 500ms timeout
  4. Unit test the mapper logic locally

Phase 2: Agentmon UI — Agents Page

  1. Add /agents route to the SPA router in app.js
  2. Add "Agents" nav link in header
  3. Build the top strip — three VM status pills pulling from existing openclaw.snapshot data
  4. Build the activity timeline — subscribe to WebSocket, filter for framework: "openclaw" events, render as vertical feed with collapsible tool call details
  5. Build the side stats panel — aggregate counts from the query API (messages/hour, tool calls, error rate, top tools)
  6. Style with the refined dark aesthetic — glassmorphism timeline cards, color-coded VM badges, monospace timestamps, fade-in animations
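
The timeline filter in step 4 could be a small parse-and-validate function applied to each WebSocket message. The sketch below is in TypeScript for consistency with the hook, although app.js is plain JavaScript. The message shape is an assumption, and so is the choice of session_id as the correlation ID an event must carry to be shown.

```typescript
interface TimelineEvent {
  framework: "openclaw";
  session_id: string;
  event_type: string;
  client_id: string | undefined; // VM name, when present
}

// Returns the event if it belongs in the agents timeline, otherwise null.
function parseTimelineEvent(raw: string): TimelineEvent | null {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return null; // not JSON: ignore
  }
  if (typeof data !== "object" || data === null) return null;

  const fields = data as Record<string, unknown>;
  const sessionId = fields["session_id"];
  const eventType = fields["event_type"];
  const clientId = fields["client_id"];

  if (fields["framework"] !== "openclaw") return null;
  if (typeof sessionId !== "string" || sessionId === "") return null;
  if (typeof eventType !== "string") return null;

  return {
    framework: "openclaw",
    session_id: sessionId,
    event_type: eventType,
    client_id: typeof clientId === "string" ? clientId : undefined,
  };
}
```

A WebSocket `message` handler would call this on each incoming frame's data and append any non-null result to the feed.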

Phase 3: Deploy

  1. SCP hook to all three VMs, verify auto-discovery
  2. Send a test message to one agent, confirm events flow end-to-end

Not in Scope (Future)

  • Token/cost dashboard (needs usage data verification in message:sent payloads)
  • Historical analytics and aggregation queries
  • Hook auto-deployment via openclaw-monitor
  • Alerting on error rate spikes