feat: complete agent monitoring - hook, UI, and backend filter

- Add event_type and framework filters to events query endpoint
- Add /agents SPA route to web-ui server
- Add Agents nav link and route in frontend
- Add agents page CSS (timeline, VM pills, stats panel)
- Build VM status strip, activity timeline, and real-time stats
- Add agentmon hook for OpenClaw (HOOK.md + handler.ts)
- Add docker-compose, Dockerfile, and supporting infra files

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
William Valentin
2026-03-14 00:26:42 -07:00
parent 1927ec6622
commit 3434db3c59
29 changed files with 6228 additions and 231 deletions
@@ -0,0 +1,253 @@
# agentmon
Telemetry and monitoring system for AI agent activity across [OpenClaw](https://openclaw.ai/) instances running on KVM virtual machines. Captures sessions, runs, tool calls, errors, and VM health metrics — viewable in a real-time web dashboard.
## Architecture
```
                        ┌──────────────────────────┐
                        │      OpenClaw VMs        │
                        │    (zap, orb, sun)       │
                        │                          │
                        │  hooks/agentmon/         │
                        │    → handler.ts          │
                        └──────────┬───────────────┘
                                   │ HTTP POST
┌─────────────┐   publish   ┌──────────────┐
│  openclaw-  │────────────▶│    NATS      │
│  monitor    │             │    :4222     │
│  (VM polls) │             └──────┬───────┘
└─────────────┘                    │ subscribe
                          ┌──────────────────┐
                          │ event-processor  │
                          └────────┬─────────┘
                                   │ INSERT
┌─────────────┐  query   ┌──────────────┐   proxy   ┌──────────────┐
│   web-ui    │◀────────▶│  query-api   │◀──────────│   browser    │
│   :8082     │          │   :8081      │           └──────────────┘
└─────────────┘          └──────────────┘
                          ┌────────┴───────┐
                          │  PostgreSQL    │
                          │    :5432       │
                          └────────────────┘
```
**Data flow:** OpenClaw hooks emit telemetry events over HTTP to the **ingest gateway**, which publishes them to **NATS**. The **event processor** subscribes and persists events to **PostgreSQL**. The **query API** serves aggregated data (sessions, runs, spans) to the **web UI**. A separate **openclaw-monitor** polls VM health metrics (CPU, memory, disk, service status) via libvirt and SSH.
Real-time updates flow through NATS → query-api → WebSocket → browser.
## Services
| Service | Port | Description |
|---------|------|-------------|
| **ingest-gateway** | 8080 | HTTP + WebSocket event ingestion, publishes to NATS |
| **query-api** | 8081 | REST API for sessions, runs, spans; WebSocket live feed |
| **web-ui** | 8082 | SPA frontend with reverse proxy to query-api |
| **event-processor** | — | NATS subscriber, persists events to Postgres |
| **openclaw-monitor** | — | Polls VM instances via libvirt/SSH, emits snapshots |
| **postgres** | 5432 | Event storage |
| **nats** | 4222 | Message queue (JetStream) |
## Quick Start
```bash
cp .env.example .env
make up
```
This starts Postgres, NATS, and all application services via Docker Compose. Open http://localhost:8082.
For local development, start infrastructure only and run services manually:
```bash
make up # postgres + nats
make run-ingest # terminal 1
make run-query # terminal 2
make run-ui # terminal 3
make run-processor # terminal 4
make run-openclaw-monitor # terminal 5
```
Or use the convenience scripts:
```bash
./start-all.sh # start everything
./stop-all.sh # stop everything
```
## Configuration
Environment variables (see `.env.example`):
| Variable | Default | Description |
|----------|---------|-------------|
| `DATABASE_URL` | — | Postgres connection string (required) |
| `NATS_URL` | `nats://nats:4222` | NATS server address |
| `NATS_TOPIC` | `agentmon.events.v1` | NATS subject that events are published to and consumed from |
| `AGENTMON_ADDR` | `:8080` | Ingest gateway listen address |
| `AGENTMON_QUERY_ADDR` | `:8081` | Query API listen address |
| `AGENTMON_UI_ADDR` | `:8082` | Web UI listen address |
| `AGENTMON_QUERY_BASE` | `http://query-api` | Query API URL (for web-ui proxy) |
| `OPENCLAW_REGISTRY` | `~/.claude/state/openclaw-instances.json` | VM instance registry |
| `POLL_INTERVAL` | `30s` | VM polling interval |
## API
### Ingest Gateway (`:8080`)
```
GET  /healthz     Health check
POST /v1/events   Batch event ingestion (JSON array)
GET  /v1/ws       WebSocket event stream
```
### Query API (`:8081`)
```
GET /healthz            Health check
GET /v1/events          List events (?event_type=&framework=&limit=)
GET /v1/sessions        List sessions (?from=&to=&framework=&host=&cursor=&limit=)
GET /v1/sessions/{id}   Session detail with runs
GET /v1/runs/{id}       Run detail with spans
GET /v1/ws              WebSocket live event broadcast
```
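
To show how the `event_type`, `framework`, and `limit` parameters compose, here is a small sketch that builds a `/v1/events` URL with `net/url`. `eventsURL` is a hypothetical helper; omitting empty filters so the server applies its own defaults is an assumption about the endpoint's behavior.

```go
package main

import (
	"fmt"
	"net/url"
	"os"
	"strconv"
)

// eventsURL builds a GET /v1/events URL against the query API at base.
// Empty filters and a non-positive limit are left out of the query string.
func eventsURL(base, eventType, framework string, limit int) (string, error) {
	u, err := url.Parse(base)
	if err != nil {
		return "", fmt.Errorf("parse base URL: %w", err)
	}
	u.Path = "/v1/events"

	q := url.Values{}
	if eventType != "" {
		q.Set("event_type", eventType)
	}
	if framework != "" {
		q.Set("framework", framework)
	}
	if limit > 0 {
		q.Set("limit", strconv.Itoa(limit))
	}
	u.RawQuery = q.Encode() // keys are emitted in sorted order
	return u.String(), nil
}

func main() {
	u, err := eventsURL("http://localhost:8081", "run.start", "openclaw", 50)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println(u)
	// prints: http://localhost:8081/v1/events?event_type=run.start&framework=openclaw&limit=50
}
```
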
## Event Schema
Events follow the `agentmon.event` envelope format:
```json
{
"schema": { "name": "agentmon.event", "version": 1 },
"event": {
"id": "uuid",
"type": "session.start",
"ts": "2026-03-13T12:00:00Z",
"source": {
"framework": "openclaw",
"client_id": "zap",
"host": "zap"
}
},
"correlation": {
"session_id": "uuid",
"run_id": "uuid",
"span_id": "uuid"
},
"attributes": {},
"payload": {}
}
```
**Event types:** `session.start`, `session.end`, `run.start`, `run.end`, `span.start`, `span.end`, `error`, `metric.snapshot`, `openclaw.snapshot`
## Database Schema
```sql
CREATE TABLE events (
event_id TEXT PRIMARY KEY,
ts TIMESTAMPTZ NOT NULL,
type TEXT NOT NULL,
session_id TEXT,
run_id TEXT,
trace_id TEXT,
span_id TEXT,
parent_span_id TEXT,
source_framework TEXT,
client_id TEXT,
payload JSONB NOT NULL
);
```
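
Given this table, the `event_type` and `framework` filters on `GET /v1/events` could translate into a parameterized query along these lines. `buildEventsQuery`, the selected columns, the `ORDER BY ts DESC` ordering, and the default limit of 100 are assumptions for illustration, not the query that `internal/store/postgres` runs.

```go
package main

import (
	"fmt"
	"strings"
)

// buildEventsQuery returns a parameterized SELECT over the events table plus
// its arguments. Empty filters are skipped; values are always passed as
// placeholders ($1, $2, ...) rather than interpolated into the SQL text.
func buildEventsQuery(eventType, framework string, limit int) (string, []any) {
	var (
		where []string
		args  []any
	)
	if eventType != "" {
		args = append(args, eventType)
		where = append(where, fmt.Sprintf("type = $%d", len(args)))
	}
	if framework != "" {
		args = append(args, framework)
		where = append(where, fmt.Sprintf("source_framework = $%d", len(args)))
	}
	if limit <= 0 {
		limit = 100 // assumed default, not taken from the source
	}
	args = append(args, limit)

	var sb strings.Builder
	sb.WriteString("SELECT event_id, ts, type, session_id, run_id, payload FROM events")
	if len(where) > 0 {
		sb.WriteString(" WHERE " + strings.Join(where, " AND "))
	}
	fmt.Fprintf(&sb, " ORDER BY ts DESC LIMIT $%d", len(args))
	return sb.String(), args
}

func main() {
	query, args := buildEventsQuery("error", "openclaw", 50)
	fmt.Println(query)
	fmt.Println(args...)
}
```

The returned string and argument slice can be passed directly to `database/sql`'s `QueryContext` with a Postgres driver.
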
## OpenClaw Hook
The `hooks/agentmon/` directory contains a TypeScript hook that captures agent activity from OpenClaw instances and emits it to the ingest gateway. It maps OpenClaw events to agentmon's session/run/span model:
| OpenClaw Event | agentmon Event | Description |
|----------------|----------------|-------------|
| `command:new` | `session.start` | New conversation started |
| `command:stop` | `session.end` | Conversation ended |
| `command:reset` | `session.end` + `session.start` | Conversation reset |
| `message:received` | `run.start` | User message received |
| `message:sent` | `run.end` | Agent response sent |
| `tool_result_persist` | `span.end` | Tool call completed |
| `session:compact:before` | `span.start` | Context compaction started |
| `session:compact:after` | `span.end` | Context compaction finished |
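
The mapping in the table is essentially a lookup from one OpenClaw event name to one or more agentmon event types. The actual hook is written in TypeScript (`handler.ts`); the Go sketch below only restates the table as data, and `eventMap` and `mapEvent` are hypothetical names.

```go
package main

import "fmt"

// eventMap restates the table: each OpenClaw event yields one or more
// agentmon event types, in emission order.
var eventMap = map[string][]string{
	"command:new":            {"session.start"},
	"command:stop":           {"session.end"},
	"command:reset":          {"session.end", "session.start"},
	"message:received":       {"run.start"},
	"message:sent":           {"run.end"},
	"tool_result_persist":    {"span.end"},
	"session:compact:before": {"span.start"},
	"session:compact:after":  {"span.end"},
}

// mapEvent returns the agentmon event types for an OpenClaw event,
// or nil when the event is not part of the mapping.
func mapEvent(openclawEvent string) []string {
	return eventMap[openclawEvent]
}

func main() {
	for _, name := range []string{"command:reset", "message:received", "unknown:event"} {
		fmt.Printf("%-18s -> %v\n", name, mapEvent(name))
	}
}
```

A reset closes the current session before opening a new one, which is why `command:reset` is the only entry that maps to two events.
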
### Deploying the hook
The hook is deployed to each VM at `~/.openclaw/hooks/agentmon/`. Two environment variables are required in `~/.openclaw/.env`:
```bash
AGENTMON_INGEST_URL=http://192.168.122.1:8080
AGENTMON_VM_NAME=zap # or orb, sun
```
Deployment is automated via Ansible — see the [swarm ansible playbook](https://gitea-http.taildb3494.ts.net/will/swarm) `playbooks/customize.yml`.
## Go SDK
Emit events from Go applications:
```go
ctx := context.Background()

emitter, err := sdk.NewEmitter(sdk.Config{
	ServerURL: "http://localhost:8080", // ingest gateway address
	Framework: "my-agent",
	ClientID:  "client-001",
	Host:      "localhost",
})
if err != nil {
	log.Fatalf("create emitter: %v", err)
}
defer emitter.Close(ctx) // deferred only after the error check

// sessionID and runID are identifiers supplied by the caller.
emitter.Emit(ctx, sdk.NewSessionStart(sessionID, sdk.WithSource(emitter)))
emitter.Emit(ctx, sdk.NewRunStart(sessionID, runID))
emitter.Emit(ctx, sdk.NewRunEnd(sessionID, runID, sdk.WithPayload(map[string]any{
	"status":      "success",
	"duration_ms": 1234,
})))
```
## Web UI
The dashboard has four views:
- **Sessions** — browse all agent sessions with date range and framework filters
- **Session Detail** — view runs within a session, drill into individual runs
- **OpenClaw** — real-time grid of VM health cards (state, CPU, memory, disk, issues)
- **Agents** — live timeline of agent events with statistics (message counts, tool usage, errors)
## Development
```bash
make test # run tests
make tidy # go mod tidy
make logs # docker compose logs
make down # stop everything
```
## Project Structure
```
cmd/
├── ingest-gateway/      HTTP event ingestion service
├── query-api/           REST API for querying events
├── web-ui/              SPA frontend + static assets
│   └── static/          HTML, CSS, JS
├── event-processor/     NATS → Postgres persistence
└── openclaw-monitor/    VM health polling
internal/
├── event/               Envelope types and validation
├── httpx/               HTTP response helpers
├── queue/nats/          NATS publisher and subscriber
├── store/postgres/      Database queries (sessions, runs, spans)
├── sdk/                 Go client library for emitting events
└── monitor/openclaw/    VM metrics collection (libvirt, SSH)
hooks/
└── agentmon/            OpenClaw hook (TypeScript)
deploy/
└── k8s/                 Database schema (postgres.sql)
```