William Valentin ebc944702f chore: drop retired orb and sun VMs
Only the zap VM remains in the fleet. Remove orb/sun from the README
architecture/config docs, the getVMClassName allowlist, and their
.timeline-vm-tag color styles.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-27 10:38:04 -07:00

agentmon

Telemetry and monitoring system for AI agent activity across OpenClaw instances running on KVM virtual machines. Captures sessions, runs, tool calls, errors, and VM health metrics — viewable in a real-time web dashboard.

Architecture

                       ┌──────────────────────────┐
                       │   OpenClaw VMs           │
                       │   (zap)                  │
                       │                          │
                       │   hooks/agentmon/        │
                       │     → handler.ts         │
                       └────────────┬─────────────┘
                                    │ HTTP POST
                                    ▼
                           ┌──────────────────┐
                           │ ingest-gateway   │
                           │ :8080            │
                           └────────┬─────────┘
                                    │ publish
                                    ▼
┌─────────────┐   publish    ┌──────────────┐
│  openclaw-  │─────────────▶│    NATS      │
│  monitor    │              │   :4222      │
│ (VM polls)  │              └──────┬───────┘
└─────────────┘                     │ subscribe
                                    ▼
                           ┌──────────────────┐
                           │ event-processor  │
                           └────────┬─────────┘
                                    │ INSERT
                                    ▼
                           ┌──────────────────┐
                           │   PostgreSQL     │
                           │   :5432          │
                           └──────────────────┘
                                    ▲
                                    │ query
                             ┌──────┴───────┐   proxy   ┌─────────────┐
                             │  query-api   │◀──────────│  web-ui     │
                             │  :8081       │           │  :8082      │
                             └──────────────┘           └─────────────┘
                                                               ▲
                                                               │ HTTP
                                                        ┌──────┴───────┐
                                                        │  browser     │
                                                        └──────────────┘

Data flow: OpenClaw hooks emit telemetry events over HTTP to the ingest gateway, which publishes them to NATS. The event processor subscribes and persists events to PostgreSQL. The query API serves aggregated data (sessions, runs, spans) to the web UI. A separate openclaw-monitor polls VM health metrics (CPU, memory, disk, service status) via libvirt and SSH.

Real-time updates flow through NATS → query-api → WebSocket → browser.

Services

Service           Port  Description
ingest-gateway    8080  HTTP + WebSocket event ingestion, publishes to NATS
query-api         8081  REST API for sessions, runs, spans; WebSocket live feed
web-ui            8082  SPA frontend with reverse proxy to query-api
event-processor   -     NATS subscriber, persists events to Postgres
openclaw-monitor  -     Polls VM instances via libvirt/SSH, emits snapshots
postgres          5432  Event storage
nats              4222  Message queue (JetStream)

Quick Start

cp .env.example .env
make up

This starts Postgres, NATS, and all application services via Docker Compose. Open http://localhost:8082.

For local development, start infrastructure only and run services manually:

make up                    # postgres + nats
make run-ingest            # terminal 1
make run-query             # terminal 2
make run-ui                # terminal 3
make run-processor         # terminal 4
make run-openclaw-monitor  # terminal 5

Or use the convenience scripts:

./start-all.sh    # start everything
./stop-all.sh     # stop everything

Configuration

Environment variables (see .env.example):

Variable             Default                                  Description
DATABASE_URL         (none)                                   Postgres connection string (required)
NATS_URL             nats://nats:4222                         NATS server address
NATS_TOPIC           agentmon.events.v1                       NATS topic for events
AGENTMON_ADDR        :8080                                    Ingest gateway listen address
AGENTMON_QUERY_ADDR  :8081                                    Query API listen address
AGENTMON_UI_ADDR     :8082                                    Web UI listen address
AGENTMON_QUERY_BASE  http://query-api                         Query API URL (for web-ui proxy)
OPENCLAW_REGISTRY    ~/.claude/state/openclaw-instances.json  VM instance registry
POLL_INTERVAL        30s                                      VM polling interval
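
As an illustration of how a service might consume these variables, the following Go sketch applies the documented defaults and rejects a missing DATABASE_URL. The Config struct, the LoadConfig function, and the lookup parameter are hypothetical names chosen for this example; the actual services may load their configuration differently.

```go
package main

import (
	"fmt"
	"os"
	"time"
)

// Config holds a subset of the variables from the table above.
// Hypothetical type: not taken from the agentmon source.
type Config struct {
	DatabaseURL  string
	NATSURL      string
	NATSTopic    string
	IngestAddr   string
	PollInterval time.Duration
}

// LoadConfig reads variables through lookup (os.Getenv in production),
// falls back to the documented defaults, and fails when DATABASE_URL is unset.
func LoadConfig(lookup func(string) string) (Config, error) {
	get := func(key, fallback string) string {
		if v := lookup(key); v != "" {
			return v
		}
		return fallback
	}

	cfg := Config{
		DatabaseURL: lookup("DATABASE_URL"),
		NATSURL:     get("NATS_URL", "nats://nats:4222"),
		NATSTopic:   get("NATS_TOPIC", "agentmon.events.v1"),
		IngestAddr:  get("AGENTMON_ADDR", ":8080"),
	}
	if cfg.DatabaseURL == "" {
		return Config{}, fmt.Errorf("DATABASE_URL is required")
	}

	// POLL_INTERVAL uses Go duration syntax, e.g. "30s" or "1m".
	interval, err := time.ParseDuration(get("POLL_INTERVAL", "30s"))
	if err != nil {
		return Config{}, fmt.Errorf("invalid POLL_INTERVAL: %w", err)
	}
	cfg.PollInterval = interval
	return cfg, nil
}

func main() {
	cfg, err := LoadConfig(os.Getenv)
	if err != nil {
		fmt.Println("config error:", err)
		return
	}
	fmt.Printf("%+v\n", cfg)
}
```

Passing the lookup function as a parameter keeps the loader easy to exercise with a plain map instead of the process environment.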

API

Ingest Gateway (:8080)

GET  /healthz              Health check
POST /v1/events            Batch event ingestion (JSON array)
GET  /v1/ws                WebSocket event stream

Query API (:8081)

GET  /healthz              Health check
GET  /v1/events            List events (?event_type=&framework=&limit=)
GET  /v1/sessions          List sessions (?from=&to=&framework=&host=&cursor=&limit=)
GET  /v1/sessions/{id}     Session detail with runs
GET  /v1/runs/{id}         Run detail with spans
GET  /v1/stats/summary     Today's aggregate stats (active sessions, runs, tools, errors by framework)
GET  /v1/stats/timeseries  Bucketed event counts (?window=1h|6h|24h|7d)
GET  /v1/ws                WebSocket live event broadcast
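
The query parameters above can be assembled with net/url. This sketch shows one way to build the session list and timeseries URLs. SessionFilter, SessionsURL, and TimeseriesURL are hypothetical names, and formatting from/to as RFC 3339 timestamps is an assumption, since the expected time format is not specified here.

```go
package main

import (
	"fmt"
	"net/url"
	"strconv"
	"time"
)

// SessionFilter mirrors the query parameters of GET /v1/sessions.
// Zero-valued fields are omitted from the URL.
type SessionFilter struct {
	From, To  time.Time
	Framework string
	Host      string
	Cursor    string
	Limit     int
}

// SessionsURL builds the session list URL for the given filter.
func SessionsURL(base string, f SessionFilter) string {
	q := url.Values{}
	if !f.From.IsZero() {
		q.Set("from", f.From.UTC().Format(time.RFC3339)) // assumed format
	}
	if !f.To.IsZero() {
		q.Set("to", f.To.UTC().Format(time.RFC3339)) // assumed format
	}
	if f.Framework != "" {
		q.Set("framework", f.Framework)
	}
	if f.Host != "" {
		q.Set("host", f.Host)
	}
	if f.Cursor != "" {
		q.Set("cursor", f.Cursor)
	}
	if f.Limit > 0 {
		q.Set("limit", strconv.Itoa(f.Limit))
	}
	u := base + "/v1/sessions"
	if enc := q.Encode(); enc != "" {
		u += "?" + enc
	}
	return u
}

// TimeseriesURL accepts only the documented window values.
func TimeseriesURL(base, window string) (string, error) {
	switch window {
	case "1h", "6h", "24h", "7d":
		return base + "/v1/stats/timeseries?window=" + window, nil
	}
	return "", fmt.Errorf("unsupported window %q (want 1h, 6h, 24h, or 7d)", window)
}

func main() {
	fmt.Println(SessionsURL("http://localhost:8081", SessionFilter{
		Framework: "openclaw",
		Host:      "zap",
		Limit:     50,
	}))
	u, err := TimeseriesURL("http://localhost:8081", "24h")
	fmt.Println(u, err)
}
```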

Event Schema

Events follow the agentmon.event envelope format:

{
  "schema": { "name": "agentmon.event", "version": 1 },
  "event": {
    "id": "uuid",
    "type": "session.start",
    "ts": "2026-03-13T12:00:00Z",
    "source": {
      "framework": "openclaw",
      "client_id": "zap",
      "host": "zap"
    }
  },
  "correlation": {
    "session_id": "uuid",
    "run_id": "uuid",
    "span_id": "uuid"
  },
  "attributes": {},
  "payload": {}
}

Event types: session.start, session.end, run.start, run.end, span.start, span.end, error, metric.snapshot, openclaw.snapshot
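
For readers consuming the envelope from Go, the JSON above could be decoded into a struct along these lines. This is a sketch based only on the sample shown here; the real types live in internal/event and may differ. ParseEnvelope and its validation rules (rejecting unknown event types and empty IDs) are assumptions for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// Envelope mirrors the sample agentmon.event JSON shown above.
type Envelope struct {
	Schema struct {
		Name    string `json:"name"`
		Version int    `json:"version"`
	} `json:"schema"`
	Event struct {
		ID     string    `json:"id"`
		Type   string    `json:"type"`
		TS     time.Time `json:"ts"` // RFC 3339 timestamp
		Source struct {
			Framework string `json:"framework"`
			ClientID  string `json:"client_id"`
			Host      string `json:"host"`
		} `json:"source"`
	} `json:"event"`
	Correlation struct {
		SessionID string `json:"session_id"`
		RunID     string `json:"run_id"`
		SpanID    string `json:"span_id"`
	} `json:"correlation"`
	Attributes map[string]any `json:"attributes"`
	Payload    map[string]any `json:"payload"`
}

// knownTypes lists the event types named in this README.
var knownTypes = map[string]bool{
	"session.start": true, "session.end": true,
	"run.start": true, "run.end": true,
	"span.start": true, "span.end": true,
	"error": true, "metric.snapshot": true, "openclaw.snapshot": true,
}

// ParseEnvelope decodes one envelope and applies basic checks.
func ParseEnvelope(data []byte) (*Envelope, error) {
	var env Envelope
	if err := json.Unmarshal(data, &env); err != nil {
		return nil, fmt.Errorf("decode envelope: %w", err)
	}
	if env.Schema.Name != "agentmon.event" {
		return nil, fmt.Errorf("unexpected schema name %q", env.Schema.Name)
	}
	if env.Event.ID == "" {
		return nil, fmt.Errorf("missing event id")
	}
	if !knownTypes[env.Event.Type] {
		return nil, fmt.Errorf("unknown event type %q", env.Event.Type)
	}
	return &env, nil
}

func main() {
	raw := []byte(`{
	  "schema": {"name": "agentmon.event", "version": 1},
	  "event": {"id": "e-1", "type": "session.start", "ts": "2026-03-13T12:00:00Z",
	            "source": {"framework": "openclaw", "client_id": "zap", "host": "zap"}},
	  "correlation": {"session_id": "s-1"},
	  "attributes": {}, "payload": {}
	}`)
	env, err := ParseEnvelope(raw)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(env.Event.Type, env.Event.Source.Host, env.Correlation.SessionID)
}
```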

Database Schema

CREATE TABLE events (
  event_id         TEXT PRIMARY KEY,
  ts               TIMESTAMPTZ NOT NULL,
  type             TEXT NOT NULL,
  session_id       TEXT,
  run_id           TEXT,
  trace_id         TEXT,
  span_id          TEXT,
  parent_span_id   TEXT,
  source_framework TEXT,
  client_id        TEXT,
  payload          JSONB NOT NULL
);
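
To show how an event might map onto this table, the sketch below builds a parameterized INSERT and its arguments. EventRow, InsertArgs, and nullable are hypothetical names. The ON CONFLICT (event_id) DO NOTHING clause is an assumption, added here so that a redelivered message would not fail on the primary key; it is not confirmed to be how the event processor behaves.

```go
package main

import (
	"fmt"
	"time"
)

// EventRow holds one row for the events table. Empty optional strings
// are stored as NULL.
type EventRow struct {
	EventID         string
	TS              time.Time
	Type            string
	SessionID       string
	RunID           string
	TraceID         string
	SpanID          string
	ParentSpanID    string
	SourceFramework string
	ClientID        string
	Payload         string // JSON text for the JSONB column
}

// insertEvent is a parameterized statement in PostgreSQL placeholder syntax.
const insertEvent = `
INSERT INTO events (
  event_id, ts, type, session_id, run_id, trace_id, span_id,
  parent_span_id, source_framework, client_id, payload
) VALUES ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11::jsonb)
ON CONFLICT (event_id) DO NOTHING`

// nullable converts an empty string to nil so the driver writes NULL.
func nullable(s string) any {
	if s == "" {
		return nil
	}
	return s
}

// InsertArgs returns the arguments for insertEvent in placeholder order.
func InsertArgs(r EventRow) []any {
	return []any{
		r.EventID, r.TS, r.Type,
		nullable(r.SessionID), nullable(r.RunID), nullable(r.TraceID),
		nullable(r.SpanID), nullable(r.ParentSpanID),
		nullable(r.SourceFramework), nullable(r.ClientID),
		r.Payload,
	}
}

func main() {
	row := EventRow{
		EventID:         "e-1",
		TS:              time.Now().UTC(),
		Type:            "run.start",
		SessionID:       "s-1",
		RunID:           "r-1",
		SourceFramework: "openclaw",
		ClientID:        "zap",
		Payload:         "{}",
	}
	fmt.Println(insertEvent)
	fmt.Println(len(InsertArgs(row)), "arguments")
}
```

With database/sql and a PostgreSQL driver, the statement would be executed as db.ExecContext(ctx, insertEvent, InsertArgs(row)...).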

OpenClaw Hook

The hooks/agentmon/ directory contains a TypeScript hook that captures agent activity from OpenClaw instances and emits it to the ingest gateway. It maps OpenClaw events to agentmon's session/run/span model:

OpenClaw Event          agentmon Event               Description
command:new             session.start                New conversation started
command:stop            session.end                  Conversation ended
command:reset           session.end + session.start  Conversation reset
message:received        run.start                    User message received
message:sent            run.end                      Agent response sent
tool_result_persist     span.end                     Tool call completed
session:compact:before  span.start                   Context compaction started
session:compact:after   span.end                     Context compaction finished
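
The mapping above is a lookup in which one OpenClaw event can produce more than one agentmon event, in order. The actual hook is written in TypeScript; the Go sketch below only illustrates the table, and MapOpenClawEvent is a hypothetical name.

```go
package main

import "fmt"

// openClawToAgentmon lists, for each OpenClaw event, the agentmon event
// types to emit in order. command:reset closes the old session before
// opening a new one.
var openClawToAgentmon = map[string][]string{
	"command:new":            {"session.start"},
	"command:stop":           {"session.end"},
	"command:reset":          {"session.end", "session.start"},
	"message:received":       {"run.start"},
	"message:sent":           {"run.end"},
	"tool_result_persist":    {"span.end"},
	"session:compact:before": {"span.start"},
	"session:compact:after":  {"span.end"},
}

// MapOpenClawEvent returns the agentmon event types for an OpenClaw
// event name, and false when the event is not mapped.
func MapOpenClawEvent(name string) ([]string, bool) {
	types, ok := openClawToAgentmon[name]
	return types, ok
}

func main() {
	for _, name := range []string{"command:reset", "message:sent", "unknown:event"} {
		types, ok := MapOpenClawEvent(name)
		fmt.Println(name, types, ok)
	}
}
```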

Deploying the hook

The hook is deployed to each VM at ~/.openclaw/hooks/agentmon/. Two environment variables are required in ~/.openclaw/.env:

AGENTMON_INGEST_URL=http://192.168.122.1:8080
AGENTMON_VM_NAME=zap

Deployment is automated with Ansible; see the swarm Ansible playbook playbooks/customize.yml.

Codex Hook

The hooks/codex/ directory contains a TypeScript handler for Codex CLI telemetry. Current Codex support is session/run oriented:

  • sessionStart and sessionEnd map to session.start, run.start, run.end, and session.end
  • notify maps turn-complete notifications into run.end
  • prompt-submit hooks map user prompts into the next run.start
  • usage payloads emit both run.end.payload.usage and a metric.snapshot event

The Codex handler persists lightweight session state across hook subprocesses. If Codex only delivers later-stage hooks for a session, the handler can recover by emitting synthetic session.start/run.start events before the first run.end or usage snapshot. Full-fidelity lifecycle tracking still depends on configuring Codex session lifecycle hooks, not just notify.
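
The recovery behavior described above can be pictured as a small state machine. The real handler is TypeScript and its state format is not shown in this README, so the Go sketch below is only an illustration under assumptions: SessionState, its fields, and Expand are hypothetical, and persistence between hook subprocesses is left out.

```go
package main

import "fmt"

// SessionState is the lightweight per-session state that would be
// persisted between hook subprocesses (persistence omitted here).
type SessionState struct {
	SessionStarted bool `json:"session_started"`
	RunOpen        bool `json:"run_open"`
	RunSeen        bool `json:"run_seen"`
}

// Expand returns the event types to emit for one incoming event type,
// prepending synthetic session.start/run.start events when earlier
// lifecycle hooks were never delivered.
func (s *SessionState) Expand(incoming string) []string {
	var out []string

	// Any event other than session.start implies a session already exists.
	if incoming != "session.start" && !s.SessionStarted {
		out = append(out, "session.start")
		s.SessionStarted = true
	}

	switch incoming {
	case "session.start":
		s.SessionStarted = true
	case "run.start":
		s.RunOpen, s.RunSeen = true, true
	case "run.end":
		// A run.end with no open run needs a synthetic run.start first.
		if !s.RunOpen {
			out = append(out, "run.start")
			s.RunSeen = true
		}
		s.RunOpen = false
	case "metric.snapshot":
		// A usage snapshot before any run also gets a synthetic run.start.
		if !s.RunSeen {
			out = append(out, "run.start")
			s.RunOpen, s.RunSeen = true, true
		}
	case "session.end":
		*s = SessionState{} // the next event starts a fresh session
	}
	return append(out, incoming)
}

func main() {
	state := &SessionState{}
	// Only a turn-complete notification arrives for this session.
	fmt.Println(state.Expand("run.end"))
	fmt.Println(state.Expand("metric.snapshot"))
}
```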

Sample Codex hook configuration lives in hooks/codex/hooks.json. On the Codex CLI version we checked locally (0.116.0), the notify hook is confirmed to work. Online reports suggest the prompt-submit hook may be named userpromptsubmit or userPromptSubmit, so the sample config includes both aliases.

The current Codex integration does not assume tool or subagent span hooks exist. If a newer Codex CLI exposes official tool/span hooks, they can be added separately without changing the run/session flow above.

Gemini Hook

The hooks/gemini/ directory contains a TypeScript handler for Gemini CLI telemetry. The current integration maps Gemini hook events into agentmon's session/run/span model:

  • onStart maps to session.start and an initial run.start
  • onStop maps to run.end and session.end
  • onToolCall maps to span.start
  • onToolResult maps to span.end

Sample Gemini hook configuration lives in hooks/gemini/hooks.json. Install the handler from that directory so the agentmon-gemini-handler binary is available, then point Gemini CLI at the sample hook config and set AGENTMON_INGEST_URL to your ingest gateway.

Hermes Hook

The hooks/hermes/ directory contains a TypeScript handler for Hermes Agent shell-hook telemetry. The current integration maps Hermes hook events into agentmon's session/run/span model:

  • on_session_start maps to session.start
  • pre_llm_call maps to run.start
  • post_llm_call maps to run.end
  • pre_tool_call maps to span.start
  • post_tool_call maps to span.end
  • post_api_request maps usage payloads to metric.snapshot
  • on_session_finalize maps to session.end

Sample Hermes hook configuration lives in hooks/hermes/hooks.yaml. Install the handler from that directory so the agentmon-hermes-handler binary is available, then merge the sample hooks: block into ~/.hermes/config.yaml and set AGENTMON_INGEST_URL to your ingest gateway.

Go SDK

Emit events from Go applications:

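// Requires the context and log packages plus the agentmon sdk package (internal/sdk).
ctx := context.Background()

// Create an emitter that sends events to the ingest gateway.
emitter, err := sdk.NewEmitter(sdk.Config{
    ServerURL: "http://localhost:8080",
    Framework: "my-agent",
    ClientID:  "client-001",
    Host:      "localhost",
})
if err != nil {
    log.Fatalf("create emitter: %v", err)
}
defer emitter.Close(ctx)

// sessionID and runID are identifiers supplied by the caller.
emitter.Emit(ctx, sdk.NewSessionStart(sessionID, sdk.WithSource(emitter)))
emitter.Emit(ctx, sdk.NewRunStart(sessionID, runID))
emitter.Emit(ctx, sdk.NewRunEnd(sessionID, runID, sdk.WithPayload(map[string]any{
    "status":      "success",
    "duration_ms": 1234,
})))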

Web UI

The web UI has five views:

  • Dashboard (/) — real-time overview with summary stats (active sessions, runs, tools, errors), uPlot time-series charts with selectable windows (1h/6h/24h/7d), framework breakdown bars, live activity feed, and top tools ranking. All sections update live via WebSocket.
  • Sessions (/sessions) — browse all agent sessions with date range, framework, and host filters
  • Session Detail (/sessions/{id}) — view runs within a session, drill into individual runs and spans
  • Agents (/agents) — live timeline of OpenClaw agent events with VM status pills and statistics
  • OpenClaw (/openclaw) — real-time grid of VM health cards (state, CPU, memory, disk, gateway status, issues)

Development

make test     # run tests
make tidy     # go mod tidy
make logs     # docker compose logs
make down     # stop everything

Project Structure

cmd/
├── ingest-gateway/      HTTP event ingestion service
├── query-api/           REST API for querying events
├── web-ui/              SPA frontend + static assets
│   └── static/          HTML, CSS, JS
├── event-processor/     NATS → Postgres persistence
└── openclaw-monitor/    VM health polling
internal/
├── event/               Envelope types and validation
├── httpx/               HTTP response helpers
├── queue/nats/          NATS publisher and subscriber
├── store/postgres/      Database queries (sessions, runs, spans, stats)
├── sdk/                 Go client library for emitting events
└── monitor/openclaw/    VM metrics collection (libvirt, SSH)
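hooks/
├── agentmon/            OpenClaw hook (TypeScript)
├── codex/               Codex CLI handler (TypeScript)
├── gemini/              Gemini CLI handler (TypeScript)
└── hermes/              Hermes Agent handler (TypeScript)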
deploy/
└── k8s/                 Database schema (postgres.sql)