agentmon/docs/plans/2026-01-16-agentmon-design.md

# agentmon — design (2026-01-16)

## Goals

agentmon is a self-hosted (homelab K8s) telemetry/analytics system for local agent runs (OpenCode + Claude Code). It captures structured events/spans and produces a custom web UI focused on: token usage, cost, latency, errors, and efficiency across sessions, agents, skills/commands, and models.

Primary requirements
- Collect telemetry from OpenCode and Claude Code on local machines.
- Ship telemetry to cluster over Tailnet + LAN.
- Ingestion supports **WebSocket + HTTP**.
- UI is a **custom web UI** (not Grafana).
- Storage: SQLite for dev/small; Postgres for production.
- Retention: keep forever by default; allow optional cleanup tooling.
- No app-level auth (trusted network); enforce access with network-layer controls.

Non-goals
- Distributed tracing compatibility (OTel) in v1; we may map later.
- “Perfect” cost computation across all providers in v1; we store enough fields to recompute.

## Event Model

### Core concepts
- **Session**: a human-initiated interactive timeframe (e.g., one terminal session).
- **Run**: a single user invocation that triggers an agent workflow (a command, task, or skill execution).
- **Trace**: a logical end-to-end chain of work (often equals a run, but can span).
- **Span**: a timed unit of work (tool call, model call, indexing job, etc.).
- **Event**: an append-only record (span start/end, run start/end, error, metric snapshot).

### Envelope (common to all events)
All events are JSON objects. Fields marked **required** are required for every event type.

- `schema` (**required**): `{ name: "agentmon.event", version: 1 }`
- `event` (**required**):
  - `id` (**required**): stable event UUID (UUIDv7 recommended)
  - `type` (**required**): one of:
    - `session.start`, `session.end`
    - `run.start`, `run.end`
    - `span.start`, `span.end`
    - `error`
    - `metric.snapshot`
  - `ts` (**required**): event timestamp (RFC3339 or unix-ms; choose one and standardize in SDK)
  - `seq` (optional): monotonic sequence per connection (enables gap detection + ACKing)
  - `source` (**required**):
    - `framework` (**required**): `opencode` | `claude-code` | `other`
    - `client_id` (**required**): stable id for the emitter install (per-machine is fine)
    - `host` (**required**): hostname
    - `user` (optional): local username
    - `version` (optional): emitter/SDK version
- `correlation` (optional but strongly recommended):
  - `session_id` (recommended)
  - `run_id` (recommended)
  - `trace_id` (recommended)
  - `span_id` (required for span events)
  - `parent_span_id` (optional)
- `attributes` (optional): freeform map for tags (agent name, skill, tool, model, repo, branch, etc.)
- `payload` (optional): event-type-specific object (see below)

### Event types (minimum v1)

#### `session.start` / `session.end`
- Required: `correlation.session_id`
- Recommended `attributes`: `framework_session` (native id if exists), `cwd`, `repo`, `branch`

#### `run.start` / `run.end`
- Required: `correlation.session_id`, `correlation.run_id`
- Recommended `attributes`: `command`, `agent`, `workflow`, `prompt_hash?`
- `run.end` may include aggregate `payload.usage` (token totals, cost totals, status)

#### `span.start` / `span.end`
- Required: `correlation.trace_id`, `correlation.span_id`
- Recommended `attributes`: `span_kind` (`llm`|`tool`|`skill`|`internal`), `name`
- `span.start` `payload`: `{ start_ts }` (or omit if you trust `event.ts`)
- `span.end` `payload`: `{ end_ts, status, duration_ms?, llm?, error? }`

#### `error`
- Required: `payload.error`
- Recommended: attach correlation ids when known (session/run/trace/span)

#### `metric.snapshot`
- Used for periodic gauges (queue lag, emitter buffer size, etc.)
- `payload.metrics`: map of numeric values

### LLM usage fields (when applicable)
Stored on `span.end` (and optionally on `run.end` aggregates):
- `llm.model`: provider/model string
- `llm.usage`: `{ input_tokens, output_tokens, cache_write_tokens?, cache_read_tokens? }`
- `llm.cost`: `{ input_usd?, output_usd?, total_usd? }`
- `llm.finish_reason?`

### Errors
- `error`: `{ type, message, code?, retryable?, source? }`
- Errors can be standalone `error` events or embedded on `span.end`/`run.end`.

## Ingestion

### WebSocket (primary/live)
- Endpoint: `GET /v1/ws`
- Client sends one JSON event per WS message (simplest) or small batches.
- Server acks in-order sequences:
  - Client includes `event.seq` (monotonic per connection)
  - Server responds: `{ "ack": { "up_to_seq": N } }`
- Reconnect behavior:
  - Client reconnects with `?client_id=...`
  - Client replays any unacked events (idempotent by `event.id`)

### HTTP (batch/backfill)
- Endpoint: `POST /v1/events`
- Body: JSON array of events.
- Response: `{ accepted: <n>, rejected: <n>, errors?: [...] }`

### Validation
- Gateway validates `schema.name/version`, required envelope fields, and basic typing.
- Non-strict mode (v1): allow unknown fields, store raw `payload`/`attributes`.

### Idempotency
- Every event has stable `event.id` (UUIDv7 recommended).
- Storage enforces unique `(event_id)` so retries + replays are safe.

### Backpressure
- WS gateway may send `{ "control": { "slow_down_ms": 250 } }`.
- If overloaded, close WS with a retryable close code; client backs off.
- Emitters cap in-memory buffers and may optionally spool to disk.

## Service Architecture (microservices)

### v1 recommended services
1) **ingest-gateway**
- Exposes WS + HTTP endpoints.
- Validates schema, assigns arrival timestamps, and publishes to queue.
- Stateless; horizontal scaling.

2) **event-processor**
- Consumes from queue.
- Deduplicates by `event.id` and writes to DB.
- Builds/updates rollup tables.

3) **query-api**
- Read-only API used by the UI.
- Performs filtered queries and serves aggregates.

4) **web-ui**
- Custom frontend.
- Talks only to query-api.

5) **retention-job** (CronJob)
- Optional cleanup/compression policies.

### Queue choice
- Preferred: **NATS JetStream** (durable stream, consumer groups).
- Stream: `agentmon_events`
- Subject pattern: `agentmon.events.v1`

Decision: queue-based to decouple ingest spikes from DB writes and to support reliable WS replay without holding DB transactions open.

## Kubernetes deployment outline

Namespace
- `agentmon`

Workloads
- Deployments:
  - `ingest-gateway` (Service `ingest-gateway`)
  - `event-processor` (no Service; talks to NATS + DB)
  - `query-api` (Service `query-api`)
  - `web-ui` (Service `web-ui`)
- Stateful:
  - Postgres (or external managed); PVC via `longhorn`
  - NATS JetStream (or external); PVC via `longhorn`
- CronJob:
  - `retention-job`

Ingress / exposure
- `web-ui` and `ingest-gateway` exposed via Ingress with Tailnet/LAN allowlisting.
- `query-api` internal-only (ClusterIP), only reachable from `web-ui`.

NetworkPolicies (default-deny)
- Allow `web-ui` -> `query-api` TCP 80/443
- Allow `query-api` -> Postgres TCP 5432
- Allow `ingest-gateway` -> NATS TCP 4222 (and JetStream as needed)
- Allow `event-processor` -> NATS + Postgres
- Allow ingress-controller namespace -> `web-ui`/`ingest-gateway` Services

Config/Secrets
- `DATABASE_URL` for processor/query-api
- NATS connection URL for ingest/processor
- (Optional) per-emitter shared secret, if you later decide to add lightweight auth

## Storage

### Postgres (production)
Suggested tables (minimal):
- `events`: append-only canonical storage
  - columns: `event_id` (pk), `ts`, `type`, `session_id`, `run_id`, `trace_id`, `span_id`, `parent_span_id`, `source_framework`, `client_id`, `payload_jsonb`
  - indexes on `(ts)`, `(session_id)`, `(run_id)`, `(type, ts)`
- `runs`: derived/rollup
  - status, start/end ts, token totals, cost totals, error counts
- `sessions`: derived/rollup
  - start/end ts, host/user, run counts

### SQLite (dev/small)
- Same logical schema; keep SQL portable.
- JSON stored as TEXT; use generated columns where supported.

## UI (MVP pages)

1) Overview
- Total tokens/cost today/7d/30d
- Latency and error rate trends
- Top agents/skills by cost and failures

2) Sessions
- Filter by date, host, agent, framework
- Drilldown into a session timeline

3) Run detail
- Waterfall view of spans
- Per-span tokens/cost/latency
- Error stack and retry chain

4) Errors
- Aggregations by error type/source
- Most common failing skills/tools

5) Agents
- Leaderboard: cost, tokens/sec, success rate
- Regression detection (compare last 24h vs baseline)

## Security model (no app auth)
- Expose services only on Tailnet/LAN.
- Ingress restricted to known source CIDRs / tailscale ingress.
- K8s NetworkPolicies:
  - allow UI -> query-api
  - allow query-api -> DB
  - allow ingest-gateway from tailnet ingress only
  - deny all else by default

## Operational concerns

### Observability
agentmon should expose its own metrics:
- ingest rate, queue lag, processor throughput
- DB write latency
- dropped events, invalid schema counts

### Testing
- Contract tests for event schema validation.
- Ingestion idempotency tests (replay same event).
- Processor tests for rollups.
- UI smoke tests for main dashboards.

## Open questions
- Do we want to support partial OTel mapping (trace/span ids) early?
- Do we want per-client local disk spool as part of the “official” SDK?
- How strict should schema validation be (reject unknown fields vs allow)?