# agentmon
Telemetry and monitoring system for AI agent activity across [OpenClaw](https://openclaw.ai/) instances running on KVM virtual machines. Captures sessions, runs, tool calls, errors, and VM health metrics — viewable in a real-time web dashboard.
## Architecture
```
                        ┌──────────────────────────┐
                        │       OpenClaw VMs       │
                        │     (zap, orb, sun)      │
                        │                          │
                        │  hooks/agentmon/         │
                        │    → handler.ts          │
                        └──────────┬───────────────┘
                                   │ HTTP POST
┌─────────────┐   publish   ┌──────────────┐
│ openclaw-   │────────────▶│     NATS     │
│ monitor     │             │    :4222     │
│ (VM polls)  │             └──────┬───────┘
└─────────────┘                    │ subscribe
                          ┌──────────────────┐
                          │ event-processor  │
                          └────────┬─────────┘
                                   │ INSERT
┌─────────────┐  query   ┌──────────────┐   proxy   ┌──────────────┐
│   web-ui    │◀────────▶│  query-api   │◀──────────│   browser    │
│    :8082    │          │    :8081     │           └──────────────┘
└─────────────┘          └──────────────┘
                          ┌────────┴───────┐
                          │   PostgreSQL   │
                          │     :5432      │
                          └────────────────┘
```
**Data flow:** OpenClaw hooks emit telemetry events over HTTP to the **ingest gateway**, which publishes them to **NATS**. The **event processor** subscribes and persists events to **PostgreSQL**. The **query API** serves aggregated data (sessions, runs, spans) to the **web UI**. A separate **openclaw-monitor** polls VM health metrics (CPU, memory, disk, service status) via libvirt and SSH.

Real-time updates flow through NATS → query-api → WebSocket → browser.
## Services
| Service | Port | Description |
|---------|------|-------------|
| **ingest-gateway** | 8080 | HTTP + WebSocket event ingestion, publishes to NATS |
| **query-api** | 8081 | REST API for sessions, runs, spans; WebSocket live feed |
| **web-ui** | 8082 | SPA frontend with reverse proxy to query-api |
| **event-processor** | — | NATS subscriber, persists events to Postgres |
| **openclaw-monitor** | — | Polls VM instances via libvirt/SSH, emits snapshots |
| **postgres** | 5432 | Event storage |
| **nats** | 4222 | Message queue (JetStream) |
## Quick Start
```bash
cp .env.example .env
make up
```
This starts Postgres, NATS, and all application services via Docker Compose. Open http://localhost:8082.

For local development, start infrastructure only and run services manually:
```bash
make up # postgres + nats
make run-ingest # terminal 1
make run-query # terminal 2
make run-ui # terminal 3
make run-processor # terminal 4
make run-openclaw-monitor # terminal 5
```
Or use the convenience scripts:
```bash
./start-all.sh # start everything
./stop-all.sh # stop everything
```
## Configuration
Environment variables (see `.env.example`):
| Variable | Default | Description |
|----------|---------|-------------|
| `DATABASE_URL` | — | Postgres connection string (required) |
| `NATS_URL` | `nats://nats:4222` | NATS server address |
| `NATS_TOPIC` | `agentmon.events.v1` | NATS subject that events are published to |
| `AGENTMON_ADDR` | `:8080` | Ingest gateway listen address |
| `AGENTMON_QUERY_ADDR` | `:8081` | Query API listen address |
| `AGENTMON_UI_ADDR` | `:8082` | Web UI listen address |
| `AGENTMON_QUERY_BASE` | `http://query-api` | Query API URL (for web-ui proxy) |
| `OPENCLAW_REGISTRY` | `~/.claude/state/openclaw-instances.json` | VM instance registry |
| `POLL_INTERVAL` | `30s` | VM polling interval |
## API
### Ingest Gateway (`:8080`)
```
GET  /healthz     Health check
POST /v1/events   Batch event ingestion (JSON array)
GET  /v1/ws       WebSocket event stream
```
### Query API (`:8081`)
```
GET /healthz              Health check
GET /v1/events            List events (?event_type=&framework=&limit=)
GET /v1/sessions          List sessions (?from=&to=&framework=&host=&cursor=&limit=)
GET /v1/sessions/{id}     Session detail with runs
GET /v1/runs/{id}         Run detail with spans
GET /v1/stats/summary     Today's aggregate stats (active sessions, runs, tools, errors by framework)
GET /v1/stats/timeseries  Bucketed event counts (?window=1h|6h|24h|7d)
GET /v1/ws                WebSocket live event broadcast
```
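
To show how the `/v1/sessions` filters compose, here is a small sketch that builds a request URL from optional parameters. `SessionFilter` and `sessionsURL` are hypothetical helpers, and encoding `from`/`to` as RFC 3339 timestamps is an assumption, because the expected time format is not stated above.

```go
package main

import (
	"fmt"
	"net/url"
	"strconv"
	"time"
)

// SessionFilter holds the optional query parameters of GET /v1/sessions.
// Zero values are omitted from the URL.
type SessionFilter struct {
	From, To  time.Time
	Framework string
	Host      string
	Cursor    string
	Limit     int
}

// sessionsURL builds the URL for listing sessions on the query API at base.
func sessionsURL(base string, f SessionFilter) (string, error) {
	u, err := url.Parse(base)
	if err != nil {
		return "", fmt.Errorf("parse base URL: %w", err)
	}
	u.Path = "/v1/sessions"

	q := url.Values{}
	if !f.From.IsZero() {
		q.Set("from", f.From.UTC().Format(time.RFC3339)) // assumed format
	}
	if !f.To.IsZero() {
		q.Set("to", f.To.UTC().Format(time.RFC3339)) // assumed format
	}
	if f.Framework != "" {
		q.Set("framework", f.Framework)
	}
	if f.Host != "" {
		q.Set("host", f.Host)
	}
	if f.Cursor != "" {
		q.Set("cursor", f.Cursor)
	}
	if f.Limit > 0 {
		q.Set("limit", strconv.Itoa(f.Limit))
	}
	u.RawQuery = q.Encode() // Encode sorts parameters by key
	return u.String(), nil
}

func main() {
	u, err := sessionsURL("http://localhost:8081", SessionFilter{
		Framework: "openclaw",
		Host:      "zap",
		Limit:     50,
	})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(u) // http://localhost:8081/v1/sessions?framework=openclaw&host=zap&limit=50
}
```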
## Event Schema
Events follow the `agentmon.event` envelope format:
```json
{
  "schema": { "name": "agentmon.event", "version": 1 },
  "event": {
    "id": "uuid",
    "type": "session.start",
    "ts": "2026-03-13T12:00:00Z",
    "source": {
      "framework": "openclaw",
      "client_id": "zap",
      "host": "zap"
    }
  },
  "correlation": {
    "session_id": "uuid",
    "run_id": "uuid",
    "span_id": "uuid"
  },
  "attributes": {},
  "payload": {}
}
```
**Event types:** `session.start`, `session.end`, `run.start`, `run.end`, `span.start`, `span.end`, `error`, `metric.snapshot`, `openclaw.snapshot`
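
The envelope above could be modeled in Go roughly as follows. This is a sketch, not the types from `internal/event`: the struct names, the `parseEnvelope` helper, and the validation rules (required ID and timestamp, event type limited to the list above) are assumptions made for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// Envelope mirrors the JSON layout of an agentmon.event message.
type Envelope struct {
	Schema struct {
		Name    string `json:"name"`
		Version int    `json:"version"`
	} `json:"schema"`
	Event struct {
		ID     string    `json:"id"`
		Type   string    `json:"type"`
		TS     time.Time `json:"ts"`
		Source struct {
			Framework string `json:"framework"`
			ClientID  string `json:"client_id"`
			Host      string `json:"host"`
		} `json:"source"`
	} `json:"event"`
	Correlation struct {
		SessionID string `json:"session_id,omitempty"`
		RunID     string `json:"run_id,omitempty"`
		SpanID    string `json:"span_id,omitempty"`
	} `json:"correlation"`
	Attributes map[string]any `json:"attributes"`
	Payload    map[string]any `json:"payload"`
}

// knownTypes lists the event types named in this README.
var knownTypes = map[string]bool{
	"session.start": true, "session.end": true,
	"run.start": true, "run.end": true,
	"span.start": true, "span.end": true,
	"error": true, "metric.snapshot": true, "openclaw.snapshot": true,
}

// Validate applies illustrative checks; the real rules may differ.
func (e Envelope) Validate() error {
	if e.Schema.Name != "agentmon.event" || e.Schema.Version != 1 {
		return fmt.Errorf("unsupported schema %q v%d", e.Schema.Name, e.Schema.Version)
	}
	if e.Event.ID == "" {
		return fmt.Errorf("event.id is empty")
	}
	if !knownTypes[e.Event.Type] {
		return fmt.Errorf("unknown event type %q", e.Event.Type)
	}
	if e.Event.TS.IsZero() {
		return fmt.Errorf("event.ts is missing")
	}
	return nil
}

// parseEnvelope decodes and validates a single envelope.
func parseEnvelope(data []byte) (Envelope, error) {
	var env Envelope
	if err := json.Unmarshal(data, &env); err != nil {
		return Envelope{}, fmt.Errorf("decode envelope: %w", err)
	}
	return env, env.Validate()
}

func main() {
	raw := `{"schema":{"name":"agentmon.event","version":1},
	  "event":{"id":"e-1","type":"session.start","ts":"2026-03-13T12:00:00Z",
	    "source":{"framework":"openclaw","client_id":"zap","host":"zap"}},
	  "correlation":{"session_id":"s-1"},"attributes":{},"payload":{}}`
	env, err := parseEnvelope([]byte(raw))
	if err != nil {
		fmt.Println("invalid:", err)
		return
	}
	fmt.Println(env.Event.Type, env.Event.Source.Host) // session.start zap
}
```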
## Database Schema
```sql
CREATE TABLE events (
    event_id         TEXT PRIMARY KEY,
    ts               TIMESTAMPTZ NOT NULL,
    type             TEXT NOT NULL,
    session_id       TEXT,
    run_id           TEXT,
    trace_id         TEXT,
    span_id          TEXT,
    parent_span_id   TEXT,
    source_framework TEXT,
    client_id        TEXT,
    payload          JSONB NOT NULL
);
```
## OpenClaw Hook
The `hooks/agentmon/` directory contains a TypeScript hook that captures agent activity from OpenClaw instances and emits it to the ingest gateway. It maps OpenClaw events to agentmon's session/run/span model:
| OpenClaw Event | agentmon Event | Description |
|----------------|----------------|-------------|
| `command:new` | `session.start` | New conversation started |
| `command:stop` | `session.end` | Conversation ended |
| `command:reset` | `session.end` + `session.start` | Conversation reset |
| `message:received` | `run.start` | User message received |
| `message:sent` | `run.end` | Agent response sent |
| `tool_result_persist` | `span.end` | Tool call completed |
| `session:compact:before` | `span.start` | Context compaction started |
| `session:compact:after` | `span.end` | Context compaction finished |
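
The mapping in the table can be restated as a lookup table. The actual hook is the TypeScript `handler.ts`; the Go version below is only a sketch for readers working on the Go side, and both `hookEventMap` and `agentmonEvents` are hypothetical names. Treating unlisted OpenClaw events as ignored is also an assumption.

```go
package main

import "fmt"

// hookEventMap restates the table above: each OpenClaw hook event maps to
// one or more agentmon event types, listed in emission order.
var hookEventMap = map[string][]string{
	"command:new":            {"session.start"},
	"command:stop":           {"session.end"},
	"command:reset":          {"session.end", "session.start"},
	"message:received":       {"run.start"},
	"message:sent":           {"run.end"},
	"tool_result_persist":    {"span.end"},
	"session:compact:before": {"span.start"},
	"session:compact:after":  {"span.end"},
}

// agentmonEvents returns the agentmon event types for an OpenClaw event,
// and false when the event does not appear in the table.
func agentmonEvents(openclawEvent string) ([]string, bool) {
	types, ok := hookEventMap[openclawEvent]
	return types, ok
}

func main() {
	for _, name := range []string{"command:reset", "tool_result_persist", "unknown:event"} {
		if types, ok := agentmonEvents(name); ok {
			fmt.Printf("%s -> %v\n", name, types)
		} else {
			fmt.Printf("%s -> (ignored)\n", name)
		}
	}
}
```

Note that `command:reset` is the only entry that emits two events, which closes the old session before opening the new one.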
### Deploying the hook
The hook is deployed to each VM at `~/.openclaw/hooks/agentmon/`. Two environment variables are required in `~/.openclaw/.env`:
```bash
AGENTMON_INGEST_URL=http://192.168.122.1:8080
AGENTMON_VM_NAME=zap # or orb, sun
```
Deployment is automated via Ansible — see the [swarm ansible playbook](https://gitea-http.taildb3494.ts.net/will/swarm) `playbooks/customize.yml`.
## Codex Hook
The `hooks/codex/` directory contains a TypeScript handler for Codex CLI telemetry. Current Codex support is session/run oriented:
- `sessionStart` and `sessionEnd` map to `session.start`, `run.start`, `run.end`, and `session.end`
- `notify` maps turn-complete notifications into `run.end`
- prompt-submit hooks map user prompts into the next `run.start`
- usage payloads emit both `run.end.payload.usage` and a `metric.snapshot` event

The Codex handler persists lightweight session state across hook subprocesses. If Codex only delivers later-stage hooks for a session, the handler can recover by emitting synthetic `session.start`/`run.start` events before the first `run.end` or usage snapshot. Full-fidelity lifecycle tracking still depends on configuring Codex session lifecycle hooks, not just `notify`.

Sample Codex hook configuration lives in [hooks/codex/hooks.json](hooks/codex/hooks.json). On the Codex CLI version we tested locally (`0.116.0`), `notify` is confirmed. Online reports suggest prompt-submit hooks may appear as `userpromptsubmit` or `userPromptSubmit`, so the sample config includes both aliases.

The current Codex integration does not assume tool or subagent span hooks exist. If a newer Codex CLI exposes official tool/span hooks, they can be added separately without changing the run/session flow above.
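
The recovery behavior described for the Codex handler, emitting synthetic `session.start`/`run.start` events when only later-stage hooks arrive, can be sketched as a small state machine. The real handler is TypeScript and persists its state between subprocesses; this Go sketch keeps state in memory, and `sessionState`, `onPromptSubmit`, and `onRunEnd` are hypothetical names. Synthesizing `session.start` on a prompt-submit hook is an additional assumption.

```go
package main

import "fmt"

// sessionState is the per-session state a handler would persist between
// hook subprocesses. Field names are hypothetical.
type sessionState struct {
	SessionStarted bool // a session.start has been emitted
	RunOpen        bool // a run.start has been emitted without a run.end
}

// onPromptSubmit returns the event types to emit when a prompt is submitted.
// Assumption: a missing session.start is synthesized here as well.
func onPromptSubmit(st *sessionState) []string {
	var out []string
	if !st.SessionStarted {
		out = append(out, "session.start") // synthetic
		st.SessionStarted = true
	}
	out = append(out, "run.start")
	st.RunOpen = true
	return out
}

// onRunEnd returns the event types to emit for a turn-complete notification.
// Missing lifecycle events are synthesized first so that every run.end is
// preceded by a session.start and a run.start.
func onRunEnd(st *sessionState) []string {
	var out []string
	if !st.SessionStarted {
		out = append(out, "session.start") // synthetic
		st.SessionStarted = true
	}
	if !st.RunOpen {
		out = append(out, "run.start") // synthetic
	}
	out = append(out, "run.end")
	st.RunOpen = false
	return out
}

func main() {
	st := &sessionState{}
	// Only a notify hook arrives: both start events are synthesized.
	fmt.Println(onRunEnd(st)) // [session.start run.start run.end]
	// A normal prompt/notify pair afterwards needs no synthetic events.
	fmt.Println(onPromptSubmit(st)) // [run.start]
	fmt.Println(onRunEnd(st))       // [run.end]
}
```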
## Gemini Hook
The `hooks/gemini/` directory contains a TypeScript handler for Gemini CLI telemetry. The current integration maps Gemini hook events into agentmon's session/run/span model:
- `onStart` maps to `session.start` and an initial `run.start`
- `onStop` maps to `run.end` and `session.end`
- `onToolCall` maps to `span.start`
- `onToolResult` maps to `span.end`

Sample Gemini hook configuration lives in [hooks/gemini/hooks.json](hooks/gemini/hooks.json). Install the handler from that directory so the `agentmon-gemini-handler` binary is available, then point Gemini CLI at the sample hook config and set `AGENTMON_INGEST_URL` to your ingest gateway.
## Go SDK
Emit events from Go applications:
```go
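// Requires the standard "context" and "log" packages plus agentmon's sdk package.
ctx := context.Background()

emitter, err := sdk.NewEmitter(sdk.Config{
	ServerURL: "http://localhost:8080", // ingest gateway address
	Framework: "my-agent",
	ClientID:  "client-001",
	Host:      "localhost",
})
if err != nil {
	log.Fatalf("create emitter: %v", err)
}
// Defer Close only after the error check, so emitter is known to be valid.
defer emitter.Close(ctx)

// sessionID and runID are identifiers supplied by the caller.
emitter.Emit(ctx, sdk.NewSessionStart(sessionID, sdk.WithSource(emitter)))
emitter.Emit(ctx, sdk.NewRunStart(sessionID, runID))
emitter.Emit(ctx, sdk.NewRunEnd(sessionID, runID, sdk.WithPayload(map[string]any{
	"status":      "success",
	"duration_ms": 1234,
})))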
```
## Web UI
The web UI has five views:
- **Dashboard** (`/`) — real-time overview with summary stats (active sessions, runs, tools, errors), uPlot time-series charts with selectable windows (1h/6h/24h/7d), framework breakdown bars, live activity feed, and top tools ranking. All sections update live via WebSocket.
- **Sessions** (`/sessions`) — browse all agent sessions with date range, framework, and host filters
- **Session Detail** (`/sessions/{id}`) — view runs within a session, drill into individual runs and spans
- **Agents** (`/agents`) — live timeline of OpenClaw agent events with VM status pills and statistics
- **OpenClaw** (`/openclaw`) — real-time grid of VM health cards (state, CPU, memory, disk, gateway status, issues)
## Development
```bash
make test # run tests
make tidy # go mod tidy
make logs # docker compose logs
make down # stop everything
```
## Project Structure
```
cmd/
├── ingest-gateway/       HTTP event ingestion service
├── query-api/            REST API for querying events
├── web-ui/               SPA frontend + static assets
│   └── static/           HTML, CSS, JS
├── event-processor/      NATS → Postgres persistence
└── openclaw-monitor/     VM health polling
internal/
├── event/                Envelope types and validation
├── httpx/                HTTP response helpers
├── queue/nats/           NATS publisher and subscriber
├── store/postgres/       Database queries (sessions, runs, spans, stats)
├── sdk/                  Go client library for emitting events
└── monitor/openclaw/     VM metrics collection (libvirt, SSH)
hooks/
├── agentmon/             OpenClaw hook (TypeScript)
├── codex/                Codex CLI hook (TypeScript)
└── gemini/               Gemini CLI hook (TypeScript)
deploy/
└── k8s/                  Database schema (postgres.sql)
```