# agentmon

Telemetry and monitoring system for AI agent activity across [OpenClaw](https://openclaw.ai/) instances running on KVM virtual machines. Captures sessions, runs, tool calls, errors, and VM health metrics, all viewable in a real-time web dashboard.

## Architecture

```
                        ┌──────────────────────────┐
                        │       OpenClaw VMs       │
                        │     (zap, orb, sun)      │
                        │                          │
                        │     hooks/agentmon/      │
                        │       → handler.ts       │
                        └──────────┬───────────────┘
                                   │ HTTP POST
                                   ▼
┌─────────────┐   publish   ┌──────────────┐
│  openclaw-  │────────────▶│     NATS     │
│   monitor   │             │    :4222     │
│ (VM polls)  │             └──────┬───────┘
└─────────────┘                    │ subscribe
                                   ▼
                          ┌──────────────────┐
                          │ event-processor  │
                          └────────┬─────────┘
                                   │ INSERT
                                   ▼
┌─────────────┐  query   ┌──────────────┐   proxy   ┌──────────────┐
│   web-ui    │◀────────▶│  query-api   │◀──────────│   browser    │
│    :8082    │          │    :8081     │           └──────────────┘
└─────────────┘          └──────────────┘
                                 ▲
                                 │
                        ┌────────┴───────┐
                        │   PostgreSQL   │
                        │     :5432      │
                        └────────────────┘
```

**Data flow:** OpenClaw hooks emit telemetry events over HTTP to the **ingest gateway**, which publishes them to **NATS**. The **event processor** subscribes and persists events to **PostgreSQL**. The **query API** serves aggregated data (sessions, runs, spans) to the **web UI**. A separate **openclaw-monitor** polls VM health metrics (CPU, memory, disk, service status) via libvirt and SSH.

Real-time updates flow through NATS → query-api → WebSocket → browser.

## Services

| Service | Port | Description |
|---------|------|-------------|
| **ingest-gateway** | 8080 | HTTP + WebSocket event ingestion, publishes to NATS |
| **query-api** | 8081 | REST API for sessions, runs, spans; WebSocket live feed |
| **web-ui** | 8082 | SPA frontend with reverse proxy to query-api |
| **event-processor** | n/a | NATS subscriber, persists events to Postgres |
| **openclaw-monitor** | n/a | Polls VM instances via libvirt/SSH, emits snapshots |
| **postgres** | 5432 | Event storage |
| **nats** | 4222 | Message queue (JetStream) |

## Quick Start

```bash
cp .env.example .env
make up
```

This starts Postgres, NATS, and all application services via Docker Compose. Open http://localhost:8082.

For local development, start infrastructure only and run services manually:

```bash
make up                    # postgres + nats
make run-ingest            # terminal 1
make run-query             # terminal 2
make run-ui                # terminal 3
make run-processor         # terminal 4
make run-openclaw-monitor  # terminal 5
```

Or use the convenience scripts:

```bash
./start-all.sh   # start everything
./stop-all.sh    # stop everything
```

## Configuration

Environment variables (see `.env.example`):

| Variable | Default | Description |
|----------|---------|-------------|
| `DATABASE_URL` | (none) | Postgres connection string (required) |
| `NATS_URL` | `nats://nats:4222` | NATS server address |
| `NATS_TOPIC` | `agentmon.events.v1` | NATS subject that events are published to |
| `AGENTMON_ADDR` | `:8080` | Ingest gateway listen address |
| `AGENTMON_QUERY_ADDR` | `:8081` | Query API listen address |
| `AGENTMON_UI_ADDR` | `:8082` | Web UI listen address |
| `AGENTMON_QUERY_BASE` | `http://query-api` | Query API URL (for web-ui proxy) |
| `OPENCLAW_REGISTRY` | `~/.claude/state/openclaw-instances.json` | VM instance registry |
| `POLL_INTERVAL` | `30s` | VM polling interval |

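
As an illustration of how a service might consume these variables, here is a minimal Go sketch that applies the documented defaults and rejects a missing `DATABASE_URL`. The `Config` struct and `loadConfig` function are hypothetical names for this example, not the project's actual loader, and parsing `POLL_INTERVAL` with `time.ParseDuration` is an assumption based on the `30s` default.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"time"
)

// Config is a hypothetical holder for a subset of the documented variables.
type Config struct {
	DatabaseURL  string
	NATSURL      string
	NATSTopic    string
	IngestAddr   string
	PollInterval time.Duration
}

// loadConfig reads variables through getenv and falls back to the defaults
// listed in the table. DATABASE_URL has no default and is required.
func loadConfig(getenv func(string) string) (Config, error) {
	get := func(key, fallback string) string {
		if v := getenv(key); v != "" {
			return v
		}
		return fallback
	}

	cfg := Config{
		DatabaseURL: getenv("DATABASE_URL"),
		NATSURL:     get("NATS_URL", "nats://nats:4222"),
		NATSTopic:   get("NATS_TOPIC", "agentmon.events.v1"),
		IngestAddr:  get("AGENTMON_ADDR", ":8080"),
	}
	if cfg.DatabaseURL == "" {
		return Config{}, errors.New("DATABASE_URL is required")
	}

	// Assumption: the interval uses Go duration syntax such as "30s".
	d, err := time.ParseDuration(get("POLL_INTERVAL", "30s"))
	if err != nil {
		return Config{}, fmt.Errorf("POLL_INTERVAL: %w", err)
	}
	cfg.PollInterval = d
	return cfg, nil
}

func main() {
	cfg, err := loadConfig(os.Getenv)
	if err != nil {
		fmt.Println("config error:", err)
		return
	}
	fmt.Printf("%+v\n", cfg)
}
```

Passing the lookup function as a parameter keeps the loader easy to exercise with a plain map instead of the process environment.
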
## API

### Ingest Gateway (`:8080`)

```
GET  /healthz      Health check
POST /v1/events    Batch event ingestion (JSON array)
GET  /v1/ws        WebSocket event stream
```

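
To make the batch endpoint concrete, the following Go sketch posts a JSON array to `POST /v1/events`. It runs against a stand-in `httptest` server rather than the real gateway, so the `202 Accepted` status and the abbreviated event bodies are assumptions of the example; `postBatch` and `demo` are hypothetical helper names.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// postBatch sends events as one JSON array to POST /v1/events and
// returns the HTTP status code of the response.
func postBatch(client *http.Client, baseURL string, events []map[string]any) (int, error) {
	body, err := json.Marshal(events)
	if err != nil {
		return 0, err
	}
	resp, err := client.Post(baseURL+"/v1/events", "application/json", bytes.NewReader(body))
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	return resp.StatusCode, nil
}

// demo starts a stand-in server, posts two events to it, and reports
// the status code plus how many events the server decoded.
func demo() (status int, received int, err error) {
	var count atomic.Int64
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.Method != http.MethodPost || r.URL.Path != "/v1/events" {
			http.NotFound(w, r)
			return
		}
		var batch []map[string]any
		if err := json.NewDecoder(r.Body).Decode(&batch); err != nil {
			http.Error(w, "expected a JSON array", http.StatusBadRequest)
			return
		}
		count.Store(int64(len(batch)))
		// Assumption: the stand-in answers 202; the real gateway may differ.
		w.WriteHeader(http.StatusAccepted)
	}))
	defer srv.Close()

	events := []map[string]any{
		{"event": map[string]any{"id": "evt-1", "type": "run.start"}},
		{"event": map[string]any{"id": "evt-2", "type": "run.end"}},
	}
	status, err = postBatch(srv.Client(), srv.URL, events)
	return status, int(count.Load()), err
}

func main() {
	status, received, err := demo()
	if err != nil {
		panic(err)
	}
	fmt.Println("status:", status, "events received:", received)
}
```

Against a running stack, `postBatch` would be pointed at the ingest gateway address (for example `http://localhost:8080`) with full event envelopes in place of the abbreviated maps.
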
### Query API (`:8081`)

```
GET /healthz              Health check
GET /v1/events            List events (?event_type=&framework=&limit=)
GET /v1/sessions          List sessions (?from=&to=&framework=&host=&cursor=&limit=)
GET /v1/sessions/{id}     Session detail with runs
GET /v1/runs/{id}         Run detail with spans
GET /v1/stats/summary     Today's aggregate stats (active sessions, runs, tools, errors by framework)
GET /v1/stats/timeseries  Bucketed event counts (?window=1h|6h|24h|7d)
GET /v1/ws                WebSocket live event broadcast
```

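
The session listing takes several optional filters. As a rough sketch, a Go client could assemble the request URL as shown below; `SessionFilter` and `sessionsURL` are hypothetical names, only the query parameter names come from the endpoint list, and the value formats (for instance timestamps for `from` and `to`) are not specified here.

```go
package main

import (
	"fmt"
	"net/url"
	"strconv"
)

// SessionFilter mirrors the query parameters of GET /v1/sessions.
// Empty fields are left out of the request.
type SessionFilter struct {
	From      string
	To        string
	Framework string
	Host      string
	Cursor    string
	Limit     int
}

// sessionsURL builds the listing URL for the given query-api base address.
func sessionsURL(base string, f SessionFilter) (string, error) {
	u, err := url.Parse(base)
	if err != nil {
		return "", err
	}
	u.Path = "/v1/sessions"

	q := url.Values{}
	set := func(key, value string) {
		if value != "" {
			q.Set(key, value)
		}
	}
	set("from", f.From)
	set("to", f.To)
	set("framework", f.Framework)
	set("host", f.Host)
	set("cursor", f.Cursor)
	if f.Limit > 0 {
		q.Set("limit", strconv.Itoa(f.Limit))
	}
	u.RawQuery = q.Encode() // Encode sorts parameters by key

	return u.String(), nil
}

func main() {
	got, err := sessionsURL("http://localhost:8081", SessionFilter{
		Framework: "openclaw",
		Host:      "zap",
		Limit:     50,
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(got)
}
```
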
## Event Schema

Events follow the `agentmon.event` envelope format:

```json
{
  "schema": { "name": "agentmon.event", "version": 1 },
  "event": {
    "id": "uuid",
    "type": "session.start",
    "ts": "2026-03-13T12:00:00Z",
    "source": {
      "framework": "openclaw",
      "client_id": "zap",
      "host": "zap"
    }
  },
  "correlation": {
    "session_id": "uuid",
    "run_id": "uuid",
    "span_id": "uuid"
  },
  "attributes": {},
  "payload": {}
}
```

**Event types:** `session.start`, `session.end`, `run.start`, `run.end`, `span.start`, `span.end`, `error`, `metric.snapshot`, `openclaw.snapshot`

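
For readers working in Go, the envelope can be modelled with plain structs and `encoding/json`. This is a sketch derived only from the JSON example above: the type names are hypothetical and are not taken from `internal/event`, and dropping empty correlation IDs with `omitempty` is an assumption.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// Schema identifies the envelope format and its version.
type Schema struct {
	Name    string `json:"name"`
	Version int    `json:"version"`
}

// Source describes where an event originated.
type Source struct {
	Framework string `json:"framework"`
	ClientID  string `json:"client_id"`
	Host      string `json:"host"`
}

// Event carries the identity, type, and timestamp of one event.
type Event struct {
	ID     string    `json:"id"`
	Type   string    `json:"type"`
	TS     time.Time `json:"ts"`
	Source Source    `json:"source"`
}

// Correlation links an event to its session, run, and span.
// Assumption: IDs that do not apply are omitted from the JSON.
type Correlation struct {
	SessionID string `json:"session_id,omitempty"`
	RunID     string `json:"run_id,omitempty"`
	SpanID    string `json:"span_id,omitempty"`
}

// Envelope mirrors the top-level JSON object.
type Envelope struct {
	Schema      Schema         `json:"schema"`
	Event       Event          `json:"event"`
	Correlation Correlation    `json:"correlation"`
	Attributes  map[string]any `json:"attributes"`
	Payload     map[string]any `json:"payload"`
}

// sample builds a session.start envelope with placeholder IDs.
func sample() Envelope {
	return Envelope{
		Schema: Schema{Name: "agentmon.event", Version: 1},
		Event: Event{
			ID:     "evt-1",
			Type:   "session.start",
			TS:     time.Date(2026, 3, 13, 12, 0, 0, 0, time.UTC),
			Source: Source{Framework: "openclaw", ClientID: "zap", Host: "zap"},
		},
		Correlation: Correlation{SessionID: "sess-1"},
		Attributes:  map[string]any{},
		Payload:     map[string]any{},
	}
}

// roundTrip encodes the sample to JSON and decodes it back into a struct.
func roundTrip() (Envelope, string, error) {
	raw, err := json.Marshal(sample())
	if err != nil {
		return Envelope{}, "", err
	}
	var out Envelope
	err = json.Unmarshal(raw, &out)
	return out, string(raw), err
}

func main() {
	_, raw, err := roundTrip()
	if err != nil {
		panic(err)
	}
	fmt.Println(raw)
}
```
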
## Database Schema

```sql
CREATE TABLE events (
  event_id TEXT PRIMARY KEY,
  ts TIMESTAMPTZ NOT NULL,
  type TEXT NOT NULL,
  session_id TEXT,
  run_id TEXT,
  trace_id TEXT,
  span_id TEXT,
  parent_span_id TEXT,
  source_framework TEXT,
  client_id TEXT,
  payload JSONB NOT NULL
);
```

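
A hedged sketch of how a row for this table could be prepared in Go follows. The `Row` type, `insertArgs`, and the `ON CONFLICT (event_id) DO NOTHING` clause are assumptions made for illustration (the clause would make redelivered events harmless given the primary key); the actual event processor may persist events differently. The statement is only built here, not executed, so no database driver is needed.

```go
package main

import "fmt"

// Row holds one value per column of the events table, in column order.
// Optional columns use the empty string to mean "not set".
type Row struct {
	EventID, TS, Type                               string
	SessionID, RunID, TraceID, SpanID, ParentSpanID string
	SourceFramework, ClientID                       string
	Payload                                         string // JSON text
}

// insertSQL is a parameterized insert with one placeholder per column.
// Assumption: duplicates are skipped rather than treated as errors.
const insertSQL = `INSERT INTO events (
  event_id, ts, type, session_id, run_id, trace_id,
  span_id, parent_span_id, source_framework, client_id, payload
) VALUES ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11)
ON CONFLICT (event_id) DO NOTHING`

// nullable maps an empty string to nil so the column is stored as NULL.
func nullable(s string) any {
	if s == "" {
		return nil
	}
	return s
}

// insertArgs returns the placeholder arguments in the order used by insertSQL.
func insertArgs(r Row) []any {
	return []any{
		r.EventID, r.TS, r.Type,
		nullable(r.SessionID), nullable(r.RunID), nullable(r.TraceID),
		nullable(r.SpanID), nullable(r.ParentSpanID),
		nullable(r.SourceFramework), nullable(r.ClientID),
		r.Payload,
	}
}

func main() {
	args := insertArgs(Row{
		EventID:   "evt-1",
		TS:        "2026-03-13T12:00:00Z",
		Type:      "run.start",
		SessionID: "sess-1",
		RunID:     "run-1",
		Payload:   "{}",
	})
	fmt.Println(insertSQL)
	fmt.Println(len(args), "arguments")
}
```
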
## OpenClaw Hook

The `hooks/agentmon/` directory contains a TypeScript hook that captures agent activity from OpenClaw instances and emits it to the ingest gateway. It maps OpenClaw events to agentmon's session/run/span model:

| OpenClaw Event | agentmon Event | Description |
|----------------|----------------|-------------|
| `command:new` | `session.start` | New conversation started |
| `command:stop` | `session.end` | Conversation ended |
| `command:reset` | `session.end` + `session.start` | Conversation reset |
| `message:received` | `run.start` | User message received |
| `message:sent` | `run.end` | Agent response sent |
| `tool_result_persist` | `span.end` | Tool call completed |
| `session:compact:before` | `span.start` | Context compaction started |
| `session:compact:after` | `span.end` | Context compaction finished |

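
The mapping in the table can be expressed as a simple lookup. The Go sketch below only mirrors the table for illustration; the real hook is the TypeScript `handler.ts`, and the names `openClawToAgentmon` and `translate` are hypothetical.

```go
package main

import "fmt"

// openClawToAgentmon lists, for each OpenClaw hook event, the agentmon
// event types emitted in order. A reset closes one session and opens another.
var openClawToAgentmon = map[string][]string{
	"command:new":            {"session.start"},
	"command:stop":           {"session.end"},
	"command:reset":          {"session.end", "session.start"},
	"message:received":       {"run.start"},
	"message:sent":           {"run.end"},
	"tool_result_persist":    {"span.end"},
	"session:compact:before": {"span.start"},
	"session:compact:after":  {"span.end"},
}

// translate returns the agentmon event types for a hook event.
// The boolean is false for events the table does not cover.
func translate(hookEvent string) ([]string, bool) {
	types, ok := openClawToAgentmon[hookEvent]
	return types, ok
}

func main() {
	for _, name := range []string{"command:reset", "message:sent", "unknown:event"} {
		types, ok := translate(name)
		fmt.Println(name, types, ok)
	}
}
```
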
### Deploying the hook

The hook is deployed to each VM at `~/.openclaw/hooks/agentmon/`. Two environment variables are required in `~/.openclaw/.env`:

```bash
AGENTMON_INGEST_URL=http://192.168.122.1:8080
AGENTMON_VM_NAME=zap  # or orb, sun
```

Deployment is automated via Ansible; see `playbooks/customize.yml` in the [swarm ansible playbook](https://gitea-http.taildb3494.ts.net/will/swarm) repository.

## Codex Hook

The `hooks/codex/` directory contains a TypeScript handler for Codex CLI telemetry. Current Codex support is session/run oriented:

- `sessionStart` and `sessionEnd` map to `session.start`, `run.start`, `run.end`, and `session.end`
- `notify` maps turn-complete notifications into `run.end`
- prompt-submit hooks map user prompts into the next `run.start`
- usage payloads emit both `run.end.payload.usage` and a `metric.snapshot` event

The Codex handler persists lightweight session state across hook subprocesses. If Codex only delivers later-stage hooks for a session, the handler can recover by emitting synthetic `session.start`/`run.start` events before the first `run.end` or usage snapshot. Full-fidelity lifecycle tracking still depends on configuring Codex session lifecycle hooks, not just `notify`.

Sample Codex hook configuration lives in [hooks/codex/hooks.json](hooks/codex/hooks.json). On the local Codex CLI version we checked (`0.116.0`), `notify` is confirmed. Online reports suggest prompt-submit hooks may appear as `userpromptsubmit` or `userPromptSubmit`, so the sample config includes those aliases.

The current Codex integration does not assume tool or subagent span hooks exist. If a newer Codex CLI exposes official tool/span hooks, they can be added separately without changing the run/session flow above.

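
The recovery behaviour described for late-stage hooks can be sketched as a small state machine. This Go version is illustrative only: the actual handler is TypeScript and persists its state between subprocesses, whereas the hypothetical `sessionState`, `onPrompt`, and `onTurnComplete` below keep state in memory.

```go
package main

import "fmt"

// sessionState is the minimal per-session state the sketch tracks.
type sessionState struct {
	SessionStarted bool // a session.start has been emitted
	RunOpen        bool // a run.start has been emitted without a run.end
}

// ensureSession emits session.start once per session.
func ensureSession(st *sessionState, out []string) []string {
	if !st.SessionStarted {
		st.SessionStarted = true
		out = append(out, "session.start")
	}
	return out
}

// onPrompt handles a prompt-submit hook: it opens the next run.
func onPrompt(st *sessionState) []string {
	out := ensureSession(st, nil)
	st.RunOpen = true
	return append(out, "run.start")
}

// onTurnComplete handles a turn-complete notification. If earlier hooks
// were never delivered, synthetic start events are emitted first so the
// run.end always closes a known session and run.
func onTurnComplete(st *sessionState) []string {
	out := ensureSession(st, nil)
	if !st.RunOpen {
		out = append(out, "run.start") // synthetic
	}
	st.RunOpen = false
	return append(out, "run.end")
}

func main() {
	st := &sessionState{}
	fmt.Println(onTurnComplete(st)) // only a late hook arrived
	fmt.Println(onPrompt(st))       // normal flow afterwards
	fmt.Println(onTurnComplete(st))
}
```
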
## Gemini Hook

The `hooks/gemini/` directory contains a TypeScript handler for Gemini CLI telemetry. The current integration maps Gemini hook events into agentmon's session/run/span model:

- `onStart` maps to `session.start` and an initial `run.start`
- `onStop` maps to `run.end` and `session.end`
- `onToolCall` maps to `span.start`
- `onToolResult` maps to `span.end`

Sample Gemini hook configuration lives in [hooks/gemini/hooks.json](hooks/gemini/hooks.json). Install the handler from that directory so the `agentmon-gemini-handler` binary is available, then point Gemini CLI at the sample hook config and set `AGENTMON_INGEST_URL` to your ingest gateway.

## Hermes Hook

The `hooks/hermes/` directory contains a TypeScript handler for Hermes Agent shell-hook telemetry. The current integration maps Hermes hook events into agentmon's session/run/span model:

- `on_session_start` maps to `session.start`
- `pre_llm_call` maps to `run.start`
- `post_llm_call` maps to `run.end`
- `pre_tool_call` maps to `span.start`
- `post_tool_call` maps to `span.end`
- `post_api_request` maps usage payloads to `metric.snapshot`
- `on_session_finalize` maps to `session.end`

Sample Hermes hook configuration lives in [hooks/hermes/hooks.yaml](hooks/hermes/hooks.yaml). Install the handler from that directory so the `agentmon-hermes-handler` binary is available, then merge the sample `hooks:` block into `~/.hermes/config.yaml` and set `AGENTMON_INGEST_URL` to your ingest gateway.

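## Go SDK

Emit events from Go applications. The snippet assumes the standard `context` and `log` packages and the project's `sdk` package are imported, and that `sessionID` and `runID` are supplied by the caller:

```go
ctx := context.Background()

emitter, err := sdk.NewEmitter(sdk.Config{
    ServerURL: "http://localhost:8080", // ingest gateway address
    Framework: "my-agent",
    ClientID:  "client-001",
    Host:      "localhost",
})
if err != nil {
    log.Fatalf("create emitter: %v", err)
}
defer emitter.Close(ctx) // close the emitter when done

// Report one session containing a single successful run.
emitter.Emit(ctx, sdk.NewSessionStart(sessionID, sdk.WithSource(emitter)))
emitter.Emit(ctx, sdk.NewRunStart(sessionID, runID))
emitter.Emit(ctx, sdk.NewRunEnd(sessionID, runID, sdk.WithPayload(map[string]any{
    "status":      "success",
    "duration_ms": 1234,
})))
```
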
## Web UI

The web UI has five views:

- **Dashboard** (`/`): real-time overview with summary stats (active sessions, runs, tools, errors), uPlot time-series charts with selectable windows (1h/6h/24h/7d), framework breakdown bars, live activity feed, and top tools ranking. All sections update live via WebSocket.
- **Sessions** (`/sessions`): browse all agent sessions with date range, framework, and host filters
- **Session Detail** (`/sessions/{id}`): view runs within a session, drill into individual runs and spans
- **Agents** (`/agents`): live timeline of OpenClaw agent events with VM status pills and statistics
- **OpenClaw** (`/openclaw`): real-time grid of VM health cards (state, CPU, memory, disk, gateway status, issues)

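
The dashboard's selectable windows (1h/6h/24h/7d) cannot all be parsed with Go's `time.ParseDuration`, which has no day unit. A lookup table is one simple way to handle them, sketched below; `parseWindow` and `windowStart` are hypothetical helpers and do not describe how the query API actually buckets events.

```go
package main

import (
	"fmt"
	"time"
)

// windows maps each supported window label to its duration.
var windows = map[string]time.Duration{
	"1h":  time.Hour,
	"6h":  6 * time.Hour,
	"24h": 24 * time.Hour,
	"7d":  7 * 24 * time.Hour,
}

// parseWindow converts a window label into a duration and rejects
// labels outside the supported set.
func parseWindow(label string) (time.Duration, error) {
	d, ok := windows[label]
	if !ok {
		return 0, fmt.Errorf("unsupported window %q", label)
	}
	return d, nil
}

// windowStart returns the earliest timestamp covered by the window
// ending at now.
func windowStart(now time.Time, label string) (time.Time, error) {
	d, err := parseWindow(label)
	if err != nil {
		return time.Time{}, err
	}
	return now.Add(-d), nil
}

func main() {
	now := time.Date(2026, 3, 13, 12, 0, 0, 0, time.UTC)
	for _, label := range []string{"1h", "7d", "2h"} {
		start, err := windowStart(now, label)
		if err != nil {
			fmt.Println(label, "error:", err)
			continue
		}
		fmt.Println(label, "starts at", start.Format(time.RFC3339))
	}
}
```
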
## Development

```bash
make test   # run tests
make tidy   # go mod tidy
make logs   # docker compose logs
make down   # stop everything
```

## Project Structure

```
cmd/
├── ingest-gateway/       HTTP event ingestion service
├── query-api/            REST API for querying events
├── web-ui/               SPA frontend + static assets
│   └── static/           HTML, CSS, JS
├── event-processor/      NATS → Postgres persistence
└── openclaw-monitor/     VM health polling
internal/
├── event/                Envelope types and validation
├── httpx/                HTTP response helpers
├── queue/nats/           NATS publisher and subscriber
├── store/postgres/       Database queries (sessions, runs, spans, stats)
├── sdk/                  Go client library for emitting events
└── monitor/openclaw/     VM metrics collection (libvirt, SSH)
hooks/
├── agentmon/             OpenClaw hook (TypeScript)
├── codex/                Codex CLI handler (TypeScript)
├── gemini/               Gemini CLI handler (TypeScript)
└── hermes/               Hermes Agent handler (TypeScript)
deploy/
└── k8s/                  Database schema (postgres.sql)
```