William Valentin 3434db3c59 feat: complete agent monitoring - hook, UI, and backend filter

- Add event_type and framework filters to events query endpoint
- Add /agents SPA route to web-ui server
- Add Agents nav link and route in frontend
- Add agents page CSS (timeline, VM pills, stats panel)
- Build VM status strip, activity timeline, and real-time stats
- Add agentmon hook for OpenClaw (HOOK.md + handler.ts)
- Add docker-compose, Dockerfile, and supporting infra files

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

2026-03-14 00:26:42 -07:00

6.8 KiB

Agent Activity Monitoring via OpenClaw Hooks

Date: 2026-03-13
Status: Approved

Goal

Monitor all OpenClaw agent and subagent activity across the three VMs (zap, orb, sun) — tool calls, conversation flow, token usage, session lifecycle, and errors — and display it in a real-time dashboard in the agentmon web UI.

Architecture

┌─────────────────────────────────────────────────────┐
│  VM (zap / orb / sun)                               │
│                                                     │
│  OpenClaw Gateway                                   │
│    ├── agent loop (messages, tools, sessions)        │
│    └── agentmon-hook (TypeScript)                    │
│          │  listens to: message:received/sent,       │
│          │  tool_result_persist, command:*, session:* │
│          │                                           │
│          └──── POST /v1/events ─────────────────┐   │
│                                                  │   │
└──────────────────────────────────────────────────│───┘
                                                   │
                                                   ▼
┌──────────────────────────────────────────────────────┐
│  Host                                                │
│  agentmon ingest-gateway (:8080)                     │
│    → NATS → event-processor → Postgres               │
│    → query-api → web-ui (new "Agents" page)          │
└──────────────────────────────────────────────────────┘

One hook, deployed to all three VMs, captures the OpenClaw events shown in the diagram and ships them to the existing agentmon pipeline. No changes are needed to ingest, NATS, or storage.

Event Mapping

OpenClaw Event	agentmon Event	What it captures
`command:new`	`session.start`	Agent session begins
`command:stop` / `command:reset`	`session.end`	Session ends
`message:received`	`run.start`	Inbound message starts a turn
`message:sent`	`run.end`	Agent response completes the turn
`tool_result_persist`	`span.start` + `span.end`	Tool call with result
`session:compact:before/after`	`span` (kind: `internal`)	Context window management
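
The mapping in the table could be expressed as a single lookup function. This is a minimal sketch, not the actual handler.ts: the type and function names are hypothetical, and splitting `session:compact:before` / `session:compact:after` into `span.start` / `span.end` is an assumption about how the internal span is delimited.

```typescript
// Hypothetical names; only the event names come from the mapping table.
type AgentmonType =
  | "session.start" | "session.end"
  | "run.start" | "run.end"
  | "span.start" | "span.end";

interface MappedEvent {
  type: AgentmonType;
  kind?: "internal"; // set only for context-compaction spans
}

// Translate one OpenClaw event name into zero or more agentmon events.
function mapOpenClawEvent(name: string): MappedEvent[] {
  switch (name) {
    case "command:new":
      return [{ type: "session.start" }];
    case "command:stop":
    case "command:reset":
      return [{ type: "session.end" }];
    case "message:received":
      return [{ type: "run.start" }];
    case "message:sent":
      return [{ type: "run.end" }];
    case "tool_result_persist":
      // One OpenClaw event yields both span events, as in the table.
      return [{ type: "span.start" }, { type: "span.end" }];
    case "session:compact:before":
      return [{ type: "span.start", kind: "internal" }]; // assumption: opens the span
    case "session:compact:after":
      return [{ type: "span.end", kind: "internal" }]; // assumption: closes the span
    default:
      return []; // unmapped events are ignored
  }
}
```

Returning an array keeps the one-to-many case (`tool_result_persist`) and the ignored case (unknown events) in the same code path.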

Correlation

session_id = OpenClaw sessionKey
run_id = generated UUID per inbound message, carried through to message:sent
framework = "openclaw"
client_id = VM name (zap / orb / sun)

Token usage and cost are attached via WithLLMUsage attributes on run.end events, but only if the message:sent payload includes usage metadata.
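
One way to carry run_id from message:received through to message:sent is a small per-session map. The sketch below is illustrative only: `RunTracker`, `Correlation`, and the injected `newId` generator are hypothetical names, and it assumes at most one open run per sessionKey at a time.

```typescript
interface Correlation {
  session_id: string;
  run_id: string;
  framework: "openclaw";
  client_id: string;
}

class RunTracker {
  private readonly clientId: string;
  private readonly newId: () => string;
  private readonly openRuns = new Map<string, string>(); // sessionKey -> run_id

  constructor(clientId: string, newId: () => string) {
    this.clientId = clientId;
    this.newId = newId;
  }

  // message:received: start a new run and remember its id for this session.
  onMessageReceived(sessionKey: string): Correlation {
    const runId = this.newId();
    this.openRuns.set(sessionKey, runId);
    return this.build(sessionKey, runId);
  }

  // message:sent: reuse the id from the matching message:received, if any.
  onMessageSent(sessionKey: string): Correlation | undefined {
    const runId = this.openRuns.get(sessionKey);
    if (runId === undefined) return undefined;
    this.openRuns.delete(sessionKey);
    return this.build(sessionKey, runId);
  }

  private build(sessionKey: string, runId: string): Correlation {
    return {
      session_id: sessionKey,
      run_id: runId,
      framework: "openclaw",
      client_id: this.clientId,
    };
  }
}
```

In the hook itself, `newId` could be `randomUUID` from `node:crypto`; injecting it keeps the tracker deterministic under test.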

Hook Design

Directory Structure

~/.openclaw/hooks/agentmon/
├── HOOK.md          # metadata: events, requirements
├── handler.ts       # event capture + HTTP emit
└── package.json     # minimal deps

Deployment

SCP the directory to each VM. OpenClaw's hook loader discovers the hook automatically, so no config changes are needed beyond having hooks enabled.

The hook POSTs to the host machine's ingest gateway. VMs are on the libvirt bridge (192.168.122.x), so the gateway URL is either set through an environment variable or falls back to the host's bridge IP.
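
A sketch of how the hook might resolve its target URL. The variable name `AGENTMON_GATEWAY_URL` and the function name are hypothetical, and the fallback is passed in rather than hard-coded because the design does not state the host's exact bridge address.

```typescript
// AGENTMON_GATEWAY_URL is a hypothetical variable name, not taken from the design.
function resolveEventsUrl(
  env: Record<string, string | undefined>,
  fallbackBaseUrl: string,
): string {
  const configured = env["AGENTMON_GATEWAY_URL"]?.trim();
  const base = configured ? configured : fallbackBaseUrl;
  // Drop trailing slashes so the path is joined with exactly one "/".
  return `${base.replace(/\/+$/, "")}/v1/events`;
}
```

In the hook this would be called once at load time with `process.env` and the host's bridge address on port 8080.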

Resilience

Fire-and-forget with a small in-memory buffer (batch up to 10 events or 2s, whichever comes first)
500ms timeout on fetch calls — if agentmon is slow, skip and move on
Events that fail to send are logged locally but not retried
The hook must never slow down the OpenClaw agent loop
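
The "10 events or 2s" rule can be sketched as a buffer that takes the current time as an argument, which keeps the logic free of timers and easy to test. `EventBatcher` is a hypothetical name; a real hook would presumably call `tick` from an interval timer and hand each returned batch to the HTTP emitter without awaiting it.

```typescript
class EventBatcher<T> {
  private pending: T[] = [];
  private oldestAt: number | null = null; // arrival time of the oldest buffered event
  private readonly maxSize: number;
  private readonly maxAgeMs: number;

  constructor(maxSize: number = 10, maxAgeMs: number = 2000) {
    this.maxSize = maxSize;
    this.maxAgeMs = maxAgeMs;
  }

  // Buffer one event; returns a full batch once maxSize is reached, else null.
  push(event: T, nowMs: number): T[] | null {
    if (this.pending.length === 0) this.oldestAt = nowMs;
    this.pending.push(event);
    return this.pending.length >= this.maxSize ? this.drain() : null;
  }

  // Call periodically; returns the batch once the oldest event has waited maxAgeMs.
  tick(nowMs: number): T[] | null {
    if (this.oldestAt !== null && nowMs - this.oldestAt >= this.maxAgeMs) {
      return this.drain();
    }
    return null;
  }

  // Hand over everything buffered and reset the age clock.
  private drain(): T[] {
    const batch = this.pending;
    this.pending = [];
    this.oldestAt = null;
    return batch;
  }
}
```

Because neither method performs I/O, the agent loop only ever pays for an array push; sending happens elsewhere.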

Error Handling

In the hook

All HTTP POSTs wrapped in try/catch — never throw, never block
Malformed event payloads (missing sessionKey, etc.) silently dropped with debug log
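
Both rules can be sketched as two small helpers: a guard that rejects payloads without a sessionKey, and a wrapper that turns any send failure into a logged `false`. The names `hasSessionKey` and `safeEmit` are hypothetical, and the guard checks only sessionKey because that is the one field the design names.

```typescript
interface HookEvent {
  sessionKey: string;
}

// Type guard: true only for objects carrying a non-empty string sessionKey.
function hasSessionKey(payload: unknown): payload is HookEvent {
  if (typeof payload !== "object" || payload === null) return false;
  const key = (payload as Record<string, unknown>)["sessionKey"];
  return typeof key === "string" && key.length > 0;
}

// Runs a send callback and converts any failure into a logged `false`.
// `send` is where a real hook would call fetch, for example with
// `signal: AbortSignal.timeout(500)` to enforce the 500ms limit.
async function safeEmit(
  send: () => Promise<void>,
  debug: (message: string) => void,
): Promise<boolean> {
  try {
    await send();
    return true;
  } catch (err) {
    debug(`agentmon emit failed: ${String(err)}`);
    return false;
  }
}
```

The caller would not await `safeEmit` from inside the agent loop; the returned promise never rejects, so dropping it is safe.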

In the pipeline

Ingest gateway deduplicates by event ID — safe if a hook sends twice
Events with framework: "openclaw" but missing correlation IDs get stored but won't appear in the agents timeline

Edge cases

VM reboots mid-session: no session.end emitted — UI shows session as "ongoing" until a new command:new arrives
OpenClaw compacts context before hook fires: session:compact:after still fires, captured as internal span
Network partition between VM and host: events silently lost, no backfill — acceptable for monitoring

UI — Agents Page

Layout

A live activity dashboard at /agents with three sections:

Top strip: Three VM pill indicators (zap / orb / sun) showing online/offline with a subtle pulse when active
Activity timeline: Vertical feed of events across all agents — messages, tool calls, errors — with VM name color-coded, monospace timestamps, and collapsible tool call detail rows. Real-time via existing WebSocket.
Side stats panel: Aggregate metrics — messages/hour, tool calls today, error rate, most-used tools
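
Two of the side-panel metrics could be derived as follows. This is a sketch under assumptions: the field names `ts` (epoch milliseconds) and `tool` are hypothetical, the design has these counts coming from the query API rather than being computed in the browser, and tool calls are counted on `span.end` so that each call is counted once.

```typescript
interface StatEvent {
  type: string;  // agentmon event type, e.g. "run.start" or "span.end"
  ts: number;    // epoch milliseconds (assumed field)
  tool?: string; // tool name on span events (assumed field)
}

// Messages per hour: inbound turns (run.start) within the last 60 minutes.
function messagesInLastHour(events: StatEvent[], nowMs: number): number {
  const cutoff = nowMs - 60 * 60 * 1000;
  return events.filter(
    (e) => e.type === "run.start" && e.ts > cutoff && e.ts <= nowMs,
  ).length;
}

// Most-used tools: count span.end events per tool name, highest first.
function topTools(events: StatEvent[], limit: number): Array<[string, number]> {
  const counts = new Map<string, number>();
  for (const e of events) {
    if (e.type !== "span.end" || e.tool === undefined) continue;
    counts.set(e.tool, (counts.get(e.tool) ?? 0) + 1);
  }
  // Sort by count descending, then by name so ties have a stable order.
  return Array.from(counts.entries())
    .sort((a, b) => b[1] - a[1] || a[0].localeCompare(b[0]))
    .slice(0, limit);
}
```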

Aesthetic

Matches the refined dark theme already in place:

Timeline cards with glassmorphism
Color-coded VM badges
Monospace timestamps (Fira Code)
Syne display font for headings
Fade-in animations on new events
Status pill badges consistent with existing design system

Implementation Plan

Phase 1: OpenClaw Hook

Create hook directory structure (HOOK.md, handler.ts)
Implement event-to-agentmon mapper — translate each OpenClaw event type to the agentmon envelope schema
Build the HTTP emitter with buffering (batch up to 10 events or 2s, whichever comes first) and a 500ms timeout
Unit test the mapper logic locally

Phase 2: Agentmon UI — Agents Page

Add /agents route to the SPA router in app.js
Add "Agents" nav link in header
Build the top strip — three VM status pills pulling from existing openclaw.snapshot data
Build the activity timeline — subscribe to WebSocket, filter for framework: "openclaw" events, render as vertical feed with collapsible tool call details
Build the side stats panel — aggregate counts from the query API (messages/hour, tool calls, error rate, top tools)
Style with the refined dark aesthetic — glassmorphism timeline cards, color-coded VM badges, monospace timestamps, fade-in animations
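
The timeline's WebSocket filter might look like the sketch below. It assumes each message is a JSON object with top-level `framework` and `session_id` fields, which the design does not confirm, and `parseTimelineMessage` is a hypothetical name. Events without a session_id are dropped, matching the rule that events missing correlation IDs do not appear in the timeline.

```typescript
interface TimelineEvent {
  framework: "openclaw";
  session_id: string;
  [extra: string]: unknown;
}

// Parse one WebSocket message; return null for anything the timeline should skip.
function parseTimelineMessage(raw: string): TimelineEvent | null {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return null; // not JSON, ignore
  }
  if (typeof parsed !== "object" || parsed === null) return null;
  const record = parsed as Record<string, unknown>;
  if (record["framework"] !== "openclaw") return null;
  const sessionId = record["session_id"];
  if (typeof sessionId !== "string" || sessionId.length === 0) return null;
  return { ...record, framework: "openclaw", session_id: sessionId };
}
```

A handler such as `socket.onmessage = (m) => { const e = parseTimelineMessage(String(m.data)); if (e) render(e); }` would then feed the vertical feed, with `render` standing in for the page's own rendering function.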

Phase 3: Deploy

SCP hook to all three VMs, verify auto-discovery
Send a test message to one agent, confirm events flow end-to-end

Not in Scope (Future)

Token/cost dashboard (needs usage data verification in message:sent payloads)
Historical analytics and aggregation queries
Hook auto-deployment via openclaw-monitor
Alerting on error rate spikes
