Files
agentmon/docs/plans/2026-01-16-agentmon-design.md
T
William Valentin 256b841cbf feat: scaffold agentmon services and k8s deploy
Adds Go microservices (ingest-gateway, event-processor, query-api, web-ui), NATS+Postgres wiring, initial schema/init job, ingress manifests for LAN+tailnet, and a multi-arch image build script.
2026-01-17 01:06:57 -08:00

255 lines
9.1 KiB
Markdown

# agentmon — design (2026-01-16)
## Goals
agentmon is a self-hosted (homelab K8s) telemetry/analytics system for local agent runs (OpenCode + Claude Code). It captures structured events/spans and produces a custom web UI focused on: token usage, cost, latency, errors, and efficiency across sessions, agents, skills/commands, and models.
Primary requirements
- Collect telemetry from OpenCode and Claude Code on local machines.
- Ship telemetry to cluster over Tailnet + LAN.
- Ingestion supports **WebSocket + HTTP**.
- UI is a **custom web UI** (not Grafana).
- Storage: SQLite for dev/small; Postgres for production.
- Retention: keep forever by default; allow optional cleanup tooling.
- No app-level auth (trusted network); enforce access with network-layer controls.
Non-goals
- Distributed tracing compatibility (OTel) in v1; we may map later.
- “Perfect” cost computation across all providers in v1; we store enough fields to recompute.
## Event Model
### Core concepts
- **Session**: a human-initiated interactive timeframe (e.g., one terminal session).
- **Run**: a single user invocation that triggers an agent workflow (a command, task, or skill execution).
- **Trace**: a logical end-to-end chain of work (often equals a run, but can span).
- **Span**: a timed unit of work (tool call, model call, indexing job, etc.).
- **Event**: an append-only record (span start/end, run start/end, error, metric snapshot).
### Envelope (common to all events)
All events are JSON objects. Fields marked **required** are required for every event type.
- `schema` (**required**): `{ name: "agentmon.event", version: 1 }`
- `event` (**required**):
- `id` (**required**): stable event UUID (UUIDv7 recommended)
- `type` (**required**): one of:
- `session.start`, `session.end`
- `run.start`, `run.end`
- `span.start`, `span.end`
- `error`
- `metric.snapshot`
- `ts` (**required**): event timestamp (RFC3339 or unix-ms; choose one and standardize in SDK)
- `seq` (optional): monotonic sequence per connection (enables gap detection + ACKing)
- `source` (**required**):
- `framework` (**required**): `opencode` | `claude-code` | `other`
- `client_id` (**required**): stable id for the emitter install (per-machine is fine)
- `host` (**required**): hostname
- `user` (optional): local username
- `version` (optional): emitter/SDK version
- `correlation` (optional but strongly recommended):
- `session_id` (recommended)
- `run_id` (recommended)
- `trace_id` (recommended)
- `span_id` (required for span events)
- `parent_span_id` (optional)
- `attributes` (optional): freeform map for tags (agent name, skill, tool, model, repo, branch, etc.)
- `payload` (optional): event-type-specific object (see below)
### Event types (minimum v1)
#### `session.start` / `session.end`
- Required: `correlation.session_id`
- Recommended `attributes`: `framework_session` (native id if exists), `cwd`, `repo`, `branch`
#### `run.start` / `run.end`
- Required: `correlation.session_id`, `correlation.run_id`
- Recommended `attributes`: `command`, `agent`, `workflow`, `prompt_hash?`
- `run.end` may include aggregate `payload.usage` (token totals, cost totals, status)
#### `span.start` / `span.end`
- Required: `correlation.trace_id`, `correlation.span_id`
- Recommended `attributes`: `span_kind` (`llm`|`tool`|`skill`|`internal`), `name`
- `span.start` `payload`: `{ start_ts }` (or omit if you trust `event.ts`)
- `span.end` `payload`: `{ end_ts, status, duration_ms?, llm?, error? }`
#### `error`
- Required: `payload.error`
- Recommended: attach correlation ids when known (session/run/trace/span)
#### `metric.snapshot`
- Used for periodic gauges (queue lag, emitter buffer size, etc.)
- `payload.metrics`: map of numeric values
### LLM usage fields (when applicable)
Stored on `span.end` (and optionally on `run.end` aggregates):
- `llm.model`: provider/model string
- `llm.usage`: `{ input_tokens, output_tokens, cache_write_tokens?, cache_read_tokens? }`
- `llm.cost`: `{ input_usd?, output_usd?, total_usd? }`
- `llm.finish_reason?`
### Errors
- `error`: `{ type, message, code?, retryable?, source? }`
- Errors can be standalone `error` events or embedded on `span.end`/`run.end`.
## Ingestion
### WebSocket (primary/live)
- Endpoint: `GET /v1/ws`
- Client sends one JSON event per WS message (simplest) or small batches.
- Server acks in-order sequences:
- Client includes `event.seq` (monotonic per connection)
- Server responds: `{ "ack": { "up_to_seq": N } }`
- Reconnect behavior:
- Client reconnects with `?client_id=...`
- Client replays any unacked events (idempotent by `event.id`)
### HTTP (batch/backfill)
- Endpoint: `POST /v1/events`
- Body: JSON array of events.
- Response: `{ accepted: <n>, rejected: <n>, errors?: [...] }`
### Validation
- Gateway validates `schema.name/version`, required envelope fields, and basic typing.
- Non-strict mode (v1): allow unknown fields, store raw `payload`/`attributes`.
### Idempotency
- Every event has stable `event.id` (UUIDv7 recommended).
- Storage enforces unique `(event_id)` so retries + replays are safe.
### Backpressure
- WS gateway may send `{ "control": { "slow_down_ms": 250 } }`.
- If overloaded, close WS with a retryable close code; client backs off.
- Emitters cap in-memory buffers and may optionally spool to disk.
## Service Architecture (microservices)
### v1 recommended services
1) **ingest-gateway**
- Exposes WS + HTTP endpoints.
- Validates schema, assigns arrival timestamps, and publishes to queue.
- Stateless; horizontal scaling.
2) **event-processor**
- Consumes from queue.
- Deduplicates by `event.id` and writes to DB.
- Builds/updates rollup tables.
3) **query-api**
- Read-only API used by the UI.
- Performs filtered queries and serves aggregates.
4) **web-ui**
- Custom frontend.
- Talks only to query-api.
5) **retention-job** (CronJob)
- Optional cleanup/compression policies.
### Queue choice
- Preferred: **NATS JetStream** (durable stream, consumer groups).
- Stream: `agentmon_events`
- Subject pattern: `agentmon.events.v1`
Decision: queue-based to decouple ingest spikes from DB writes and to support reliable WS replay without holding DB transactions open.
## Kubernetes deployment outline
Namespace
- `agentmon`
Workloads
- Deployments:
- `ingest-gateway` (Service `ingest-gateway`)
- `event-processor` (no Service; talks to NATS + DB)
- `query-api` (Service `query-api`)
- `web-ui` (Service `web-ui`)
- Stateful:
- Postgres (or external managed); PVC via `longhorn`
- NATS JetStream (or external); PVC via `longhorn`
- CronJob:
- `retention-job`
Ingress / exposure
- `web-ui` and `ingest-gateway` exposed via Ingress with Tailnet/LAN allowlisting.
- `query-api` internal-only (ClusterIP), only reachable from `web-ui`.
NetworkPolicies (default-deny)
- Allow `web-ui` -> `query-api` TCP 80/443
- Allow `query-api` -> Postgres TCP 5432
- Allow `ingest-gateway` -> NATS TCP 4222 (and JetStream as needed)
- Allow `event-processor` -> NATS + Postgres
- Allow ingress-controller namespace -> `web-ui`/`ingest-gateway` Services
Config/Secrets
- `DATABASE_URL` for processor/query-api
- NATS connection URL for ingest/processor
- (Optional) per-emitter shared secret, if you later decide to add lightweight auth
## Storage
### Postgres (production)
Suggested tables (minimal):
- `events`: append-only canonical storage
- columns: `event_id` (pk), `ts`, `type`, `session_id`, `run_id`, `trace_id`, `span_id`, `parent_span_id`, `source_framework`, `client_id`, `payload_jsonb`
- indexes on `(ts)`, `(session_id)`, `(run_id)`, `(type, ts)`
- `runs`: derived/rollup
- status, start/end ts, token totals, cost totals, error counts
- `sessions`: derived/rollup
- start/end ts, host/user, run counts
### SQLite (dev/small)
- Same logical schema; keep SQL portable.
- JSON stored as TEXT; use generated columns where supported.
## UI (MVP pages)
1) Overview
- Total tokens/cost today/7d/30d
- Latency and error rate trends
- Top agents/skills by cost and failures
2) Sessions
- Filter by date, host, agent, framework
- Drilldown into a session timeline
3) Run detail
- Waterfall view of spans
- Per-span tokens/cost/latency
- Error stack and retry chain
4) Errors
- Aggregations by error type/source
- Most common failing skills/tools
5) Agents
- Leaderboard: cost, tokens/sec, success rate
- Regression detection (compare last 24h vs baseline)
## Security model (no app auth)
- Expose services only on Tailnet/LAN.
- Ingress restricted to known source CIDRs / tailscale ingress.
- K8s NetworkPolicies:
- allow UI -> query-api
- allow query-api -> DB
- allow ingest-gateway from tailnet ingress only
- deny all else by default
## Operational concerns
### Observability
agentmon should expose its own metrics:
- ingest rate, queue lag, processor throughput
- DB write latency
- dropped events, invalid schema counts
### Testing
- Contract tests for event schema validation.
- Ingestion idempotency tests (replay same event).
- Processor tests for rollups.
- UI smoke tests for main dashboards.
## Open questions
- Do we want to support partial OTel mapping (trace/span ids) early?
- Do we want per-client local disk spool as part of the “official” SDK?
- How strict should schema validation be (reject unknown fields vs allow)?