# agentmon — design (2026-01-16)

## Goals

agentmon is a self-hosted (homelab K8s) telemetry/analytics system for local agent runs (OpenCode + Claude Code). It captures structured events/spans and produces a custom web UI focused on: token usage, cost, latency, errors, and efficiency across sessions, agents, skills/commands, and models.

Primary requirements
- Collect telemetry from OpenCode and Claude Code on local machines.
- Ship telemetry to the cluster over Tailnet + LAN.
- Ingestion supports **WebSocket + HTTP**.
- UI is a **custom web UI** (not Grafana).
- Storage: SQLite for dev/small; Postgres for production.
- Retention: keep forever by default; allow optional cleanup tooling.
- No app-level auth (trusted network); enforce access with network-layer controls.

Non-goals
- Distributed tracing compatibility (OTel) in v1; we may map later.
- “Perfect” cost computation across all providers in v1; we store enough fields to recompute.

## Event Model

### Core concepts
- **Session**: a human-initiated interactive timeframe (e.g., one terminal session).
- **Run**: a single user invocation that triggers an agent workflow (a command, task, or skill execution).
- **Trace**: a logical end-to-end chain of work (often equals a run, but can span multiple runs).
- **Span**: a timed unit of work (tool call, model call, indexing job, etc.).
- **Event**: an append-only record (span start/end, run start/end, error, metric snapshot).

### Envelope (common to all events)

All events are JSON objects. Fields marked **required** are required for every event type.
- `schema` (**required**): `{ name: "agentmon.event", version: 1 }`
- `event` (**required**):
  - `id` (**required**): stable event UUID (UUIDv7 recommended)
  - `type` (**required**): one of:
    - `session.start`, `session.end`
    - `run.start`, `run.end`
    - `span.start`, `span.end`
    - `error`
    - `metric.snapshot`
  - `ts` (**required**): event timestamp (RFC3339 or unix-ms; choose one and standardize in SDK)
  - `seq` (optional): monotonic sequence per connection (enables gap detection + ACKing)
- `source` (**required**):
  - `framework` (**required**): `opencode` | `claude-code` | `other`
  - `client_id` (**required**): stable id for the emitter install (per-machine is fine)
  - `host` (**required**): hostname
  - `user` (optional): local username
  - `version` (optional): emitter/SDK version
- `correlation` (optional but strongly recommended):
  - `session_id` (recommended)
  - `run_id` (recommended)
  - `trace_id` (recommended)
  - `span_id` (required for span events)
  - `parent_span_id` (optional)
- `attributes` (optional): freeform map for tags (agent name, skill, tool, model, repo, branch, etc.)
- `payload` (optional): event-type-specific object (see below)

### Event types (minimum v1)

#### `session.start` / `session.end`
- Required: `correlation.session_id`
- Recommended `attributes`: `framework_session` (native id if exists), `cwd`, `repo`, `branch`

#### `run.start` / `run.end`
- Required: `correlation.session_id`, `correlation.run_id`
- Recommended `attributes`: `command`, `agent`, `workflow`, `prompt_hash?`
- `run.end` may include aggregate `payload.usage` (token totals, cost totals, status)

#### `span.start` / `span.end`
- Required: `correlation.trace_id`, `correlation.span_id`
- Recommended `attributes`: `span_kind` (`llm`|`tool`|`skill`|`internal`), `name`
- `span.start` `payload`: `{ start_ts }` (or omit if you trust `event.ts`)
- `span.end` `payload`: `{ end_ts, status, duration_ms?, llm?, error? }`

#### `error`
- Required: `payload.error`
- Recommended: attach correlation ids when known (session/run/trace/span)

#### `metric.snapshot`
- Used for periodic gauges (queue lag, emitter buffer size, etc.)
- `payload.metrics`: map of numeric values

### LLM usage fields (when applicable)

Stored on `span.end` (and optionally on `run.end` aggregates):
- `llm.model`: provider/model string
- `llm.usage`: `{ input_tokens, output_tokens, cache_write_tokens?, cache_read_tokens? }`
- `llm.cost`: `{ input_usd?, output_usd?, total_usd? }`
- `llm.finish_reason?`

### Errors
- `error`: `{ type, message, code?, retryable?, source? }`
- Errors can be standalone `error` events or embedded on `span.end`/`run.end`.

## Ingestion

### WebSocket (primary/live)
- Endpoint: `GET /v1/ws`
- Client sends one JSON event per WS message (simplest) or small batches.
- Server acks in-order sequences:
  - Client includes `event.seq` (monotonic per connection)
  - Server responds: `{ "ack": { "up_to_seq": N } }`
- Reconnect behavior:
  - Client reconnects with `?client_id=...`
  - Client replays any unacked events (idempotent by `event.id`)

### HTTP (batch/backfill)
- Endpoint: `POST /v1/events`
- Body: JSON array of events.
- Response: `{ accepted: <count>, rejected: <count>, errors?: [...] }`

### Validation
- Gateway validates `schema.name/version`, required envelope fields, and basic typing.
- Non-strict mode (v1): allow unknown fields, store raw `payload`/`attributes`.

### Idempotency
- Every event has stable `event.id` (UUIDv7 recommended).
- Storage enforces unique `(event_id)` so retries + replays are safe.

### Backpressure
- WS gateway may send `{ "control": { "slow_down_ms": 250 } }`.
- If overloaded, close WS with a retryable close code; client backs off.
- Emitters cap in-memory buffers and may optionally spool to disk.

## Service Architecture (microservices)

### v1 recommended services

1) **ingest-gateway**
   - Exposes WS + HTTP endpoints.
   - Validates schema, assigns arrival timestamps, and publishes to queue.
   - Stateless; horizontal scaling.
2) **event-processor**
   - Consumes from queue.
   - Deduplicates by `event.id` and writes to DB.
   - Builds/updates rollup tables.
3) **query-api**
   - Read-only API used by the UI.
   - Performs filtered queries and serves aggregates.
4) **web-ui**
   - Custom frontend.
   - Talks only to query-api.
5) **retention-job** (CronJob)
   - Optional cleanup/compression policies.

### Queue choice
- Preferred: **NATS JetStream** (durable stream, consumer groups).
  - Stream: `agentmon_events`
  - Subject pattern: `agentmon.events.v1`

Decision: queue-based to decouple ingest spikes from DB writes and to support reliable WS replay without holding DB transactions open.

## Kubernetes deployment outline

Namespace
- `agentmon`

Workloads
- Deployments:
  - `ingest-gateway` (Service `ingest-gateway`)
  - `event-processor` (no Service; talks to NATS + DB)
  - `query-api` (Service `query-api`)
  - `web-ui` (Service `web-ui`)
- Stateful:
  - Postgres (or external managed); PVC via `longhorn`
  - NATS JetStream (or external); PVC via `longhorn`
- CronJob:
  - `retention-job`

Ingress / exposure
- `web-ui` and `ingest-gateway` exposed via Ingress with Tailnet/LAN allowlisting.
- `query-api` internal-only (ClusterIP), only reachable from `web-ui`.

NetworkPolicies (default-deny)
- Allow `web-ui` -> `query-api` TCP 80/443
- Allow `query-api` -> Postgres TCP 5432
- Allow `ingest-gateway` -> NATS TCP 4222 (and JetStream as needed)
- Allow `event-processor` -> NATS + Postgres
- Allow ingress-controller namespace -> `web-ui`/`ingest-gateway` Services

Config/Secrets
- `DATABASE_URL` for processor/query-api
- NATS connection URL for ingest/processor
- (Optional) per-emitter shared secret, if you later decide to add lightweight auth

## Storage

### Postgres (production)

Suggested tables (minimal):
- `events`: append-only canonical storage
  - columns: `event_id` (pk), `ts`, `type`, `session_id`, `run_id`, `trace_id`, `span_id`, `parent_span_id`, `source_framework`, `client_id`, `payload_jsonb`
  - indexes on `(ts)`, `(session_id)`, `(run_id)`, `(type, ts)`
- `runs`: derived/rollup
  - status, start/end ts, token totals, cost totals, error counts
- `sessions`: derived/rollup
  - start/end ts, host/user, run counts

### SQLite (dev/small)
- Same logical schema; keep SQL portable.
- JSON stored as TEXT; use generated columns where supported.

## UI (MVP pages)

1) Overview
   - Total tokens/cost today/7d/30d
   - Latency and error rate trends
   - Top agents/skills by cost and failures
2) Sessions
   - Filter by date, host, agent, framework
   - Drilldown into a session timeline
3) Run detail
   - Waterfall view of spans
   - Per-span tokens/cost/latency
   - Error stack and retry chain
4) Errors
   - Aggregations by error type/source
   - Most common failing skills/tools
5) Agents
   - Leaderboard: cost, tokens/sec, success rate
   - Regression detection (compare last 24h vs baseline)

## Security model (no app auth)
- Expose services only on Tailnet/LAN.
- Ingress restricted to known source CIDRs / tailscale ingress.
- K8s NetworkPolicies:
  - allow UI -> query-api
  - allow query-api -> DB
  - allow ingest-gateway from tailnet ingress only
  - deny all else by default

## Operational concerns

### Observability

agentmon should expose its own metrics:
- ingest rate, queue lag, processor throughput
- DB write latency
- dropped events, invalid schema counts

### Testing
- Contract tests for event schema validation.
- Ingestion idempotency tests (replay same event).
- Processor tests for rollups.
- UI smoke tests for main dashboards.

## Open questions
- Do we want to support partial OTel mapping (trace/span ids) early?
- Do we want per-client local disk spool as part of the “official” SDK?
- How strict should schema validation be (reject unknown fields vs allow)?
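
To make the last question concrete, here is a minimal Python sketch of envelope validation with a switch between the non-strict v1 behavior (unknown fields allowed) and a stricter variant (unknown top-level fields rejected). The function name `validate_envelope`, the `strict` flag, and the error strings are illustrative assumptions, not part of the design; only the field names, the allowed values, and the required/optional split are taken from the Envelope and Event types sections.

```python
from typing import Any

# Top-level keys and allowed values, copied from the Envelope section.
KNOWN_TOP_LEVEL = {"schema", "event", "source", "correlation", "attributes", "payload"}
EVENT_TYPES = {
    "session.start", "session.end", "run.start", "run.end",
    "span.start", "span.end", "error", "metric.snapshot",
}
FRAMEWORKS = {"opencode", "claude-code", "other"}


def validate_envelope(evt: dict[str, Any], strict: bool = False) -> list[str]:
    """Return a list of problems; an empty list means the event is accepted.

    Hypothetical helper: strict=True additionally rejects unknown top-level keys.
    """
    errors: list[str] = []

    schema = evt.get("schema")
    if (not isinstance(schema, dict)
            or schema.get("name") != "agentmon.event"
            or schema.get("version") != 1):
        errors.append("schema: expected name 'agentmon.event', version 1")

    event = evt.get("event")
    event_type = None
    if not isinstance(event, dict):
        errors.append("event: missing or not an object")
    else:
        event_type = event.get("type")
        if not isinstance(event.get("id"), str) or not event["id"]:
            errors.append("event.id: required string")
        if not isinstance(event_type, str) or event_type not in EVENT_TYPES:
            errors.append("event.type: unknown or missing")
        if "ts" not in event:
            errors.append("event.ts: required")

    source = evt.get("source")
    if not isinstance(source, dict):
        errors.append("source: missing or not an object")
    else:
        framework = source.get("framework")
        if not isinstance(framework, str) or framework not in FRAMEWORKS:
            errors.append("source.framework: unknown or missing")
        for key in ("client_id", "host"):
            if not isinstance(source.get(key), str) or not source[key]:
                errors.append(f"source.{key}: required string")

    # Span events additionally require trace_id and span_id.
    if isinstance(event_type, str) and event_type.startswith("span."):
        corr = evt.get("correlation")
        for key in ("trace_id", "span_id"):
            if not isinstance(corr, dict) or not corr.get(key):
                errors.append(f"correlation.{key}: required for span events")

    # Strict mode only: anything outside the documented envelope is an error.
    if strict:
        for key in sorted(set(evt) - KNOWN_TOP_LEVEL):
            errors.append(f"{key}: unknown top-level field")

    return errors


# Made-up example event with one undocumented top-level key ("debug").
example = {
    "schema": {"name": "agentmon.event", "version": 1},
    "event": {"id": "evt-1", "type": "span.end", "ts": "2026-01-16T12:00:00Z"},
    "source": {"framework": "opencode", "client_id": "laptop-1", "host": "laptop"},
    "correlation": {"trace_id": "trace-1", "span_id": "span-1"},
    "debug": True,
}
print(validate_envelope(example))               # non-strict: extra key tolerated
print(validate_envelope(example, strict=True))  # strict: extra key reported
```

Returning a list of problems rather than raising keeps the same function usable for the HTTP batch endpoint, where per-event errors would need to be collected into the `errors` array of the response.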