Adds Go microservices (ingest-gateway, event-processor, query-api, web-ui), NATS+Postgres wiring, initial schema/init job, ingress manifests for LAN+tailnet, and a multi-arch image build script.
agentmon — design (2026-01-16)
Goals
agentmon is a self-hosted (homelab K8s) telemetry/analytics system for local agent runs (OpenCode + Claude Code). It captures structured events/spans and presents them in a custom web UI focused on token usage, cost, latency, errors, and efficiency across sessions, agents, skills/commands, and models.
Primary requirements
- Collect telemetry from OpenCode and Claude Code on local machines.
- Ship telemetry to cluster over Tailnet + LAN.
- Ingestion supports WebSocket + HTTP.
- UI is a custom web UI (not Grafana).
- Storage: SQLite for dev/small; Postgres for production.
- Retention: keep forever by default; allow optional cleanup tooling.
- No app-level auth (trusted network); enforce access with network-layer controls.
Non-goals
- Distributed tracing compatibility (OTel) in v1; we may map later.
- “Perfect” cost computation across all providers in v1; we store enough fields to recompute.
Event Model
Core concepts
- Session: a human-initiated interactive timeframe (e.g., one terminal session).
- Run: a single user invocation that triggers an agent workflow (a command, task, or skill execution).
- Trace: a logical end-to-end chain of work (often equal to a run, but can span multiple runs).
- Span: a timed unit of work (tool call, model call, indexing job, etc.).
- Event: an append-only record (span start/end, run start/end, error, metric snapshot).
Envelope (common to all events)
All events are JSON objects. Fields marked required are required for every event type.
- schema (required): { name: "agentmon.event", version: 1 }
- event (required):
  - id (required): stable event UUID (UUIDv7 recommended)
  - type (required): one of:
    - session.start, session.end
    - run.start, run.end
    - span.start, span.end
    - error
    - metric.snapshot
  - ts (required): event timestamp (RFC3339 or unix-ms; choose one and standardize in SDK)
  - seq (optional): monotonic sequence per connection (enables gap detection + ACKing)
- source (required):
  - framework (required): opencode | claude-code | other
  - client_id (required): stable id for the emitter install (per-machine is fine)
  - host (required): hostname
  - user (optional): local username
  - version (optional): emitter/SDK version
- correlation (optional but strongly recommended):
  - session_id (recommended)
  - run_id (recommended)
  - trace_id (recommended)
  - span_id (required for span events)
  - parent_span_id (optional)
- attributes (optional): freeform map for tags (agent name, skill, tool, model, repo, branch, etc.)
- payload (optional): event-type-specific object (see below)
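As a rough illustration of the envelope above, the following Go sketch maps the listed JSON keys onto structs and round-trips a sample span.end event. The Go type names, the helper functions, and every id and value in the sample are made up for illustration; it also assumes ts and seq live under event (as the references to event.ts and event.seq suggest) and that ts is an RFC3339 string, which the design has not yet fixed.

```go
package main

import (
	"encoding/json"
	"fmt"
)

type Schema struct {
	Name    string `json:"name"`
	Version int    `json:"version"`
}

type EventMeta struct {
	ID   string  `json:"id"`
	Type string  `json:"type"`
	TS   string  `json:"ts"` // assumption: RFC3339; the design leaves the format open
	Seq  *uint64 `json:"seq,omitempty"`
}

type Source struct {
	Framework string `json:"framework"`
	ClientID  string `json:"client_id"`
	Host      string `json:"host"`
	User      string `json:"user,omitempty"`
	Version   string `json:"version,omitempty"`
}

type Correlation struct {
	SessionID    string `json:"session_id,omitempty"`
	RunID        string `json:"run_id,omitempty"`
	TraceID      string `json:"trace_id,omitempty"`
	SpanID       string `json:"span_id,omitempty"`
	ParentSpanID string `json:"parent_span_id,omitempty"`
}

// Envelope holds the fields common to all events. Payload stays raw so that
// event-type-specific content can be stored without interpretation.
type Envelope struct {
	Schema      Schema          `json:"schema"`
	Event       EventMeta       `json:"event"`
	Source      Source          `json:"source"`
	Correlation *Correlation    `json:"correlation,omitempty"`
	Attributes  map[string]any  `json:"attributes,omitempty"`
	Payload     json.RawMessage `json:"payload,omitempty"`
}

// sampleSpanEnd builds an illustrative span.end event; all ids are placeholders.
func sampleSpanEnd() Envelope {
	seq := uint64(7)
	return Envelope{
		Schema:      Schema{Name: "agentmon.event", Version: 1},
		Event:       EventMeta{ID: "example-event-id", Type: "span.end", TS: "2026-01-16T12:00:00Z", Seq: &seq},
		Source:      Source{Framework: "opencode", ClientID: "example-client", Host: "example-host"},
		Correlation: &Correlation{TraceID: "example-trace", SpanID: "example-span"},
		Attributes:  map[string]any{"span_kind": "llm", "name": "chat"},
		Payload:     json.RawMessage(`{"end_ts":"2026-01-16T12:00:01Z","status":"ok","duration_ms":1000}`),
	}
}

// roundTrip encodes an envelope to JSON and decodes it again.
func roundTrip(e Envelope) (Envelope, error) {
	b, err := json.Marshal(e)
	if err != nil {
		return Envelope{}, err
	}
	var out Envelope
	err = json.Unmarshal(b, &out)
	return out, err
}

func main() {
	b, err := json.MarshalIndent(sampleSpanEnd(), "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(b))
}
```

Keeping payload as raw JSON matches the non-strict storage approach described later: the gateway can validate the envelope without needing a Go type for every payload shape.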
Event types (minimum v1)
session.start / session.end
- Required: correlation.session_id
- Recommended attributes: framework_session (native id if exists), cwd, repo, branch
run.start / run.end
- Required: correlation.session_id, correlation.run_id
- Recommended attributes: command, agent, workflow, prompt_hash?
- run.end may include aggregate payload.usage (token totals, cost totals, status)
span.start / span.end
- Required: correlation.trace_id, correlation.span_id
- Recommended attributes: span_kind (llm|tool|skill|internal), name
- span.start payload: { start_ts } (or omit if you trust event.ts)
- span.end payload: { end_ts, status, duration_ms?, llm?, error? }
error
- Required: payload.error
- Recommended: attach correlation ids when known (session/run/trace/span)
metric.snapshot
- Used for periodic gauges (queue lag, emitter buffer size, etc.)
- payload.metrics: map of numeric values
LLM usage fields (when applicable)
Stored on span.end (and optionally on run.end aggregates):
- llm.model: provider/model string
- llm.usage: { input_tokens, output_tokens, cache_write_tokens?, cache_read_tokens? }
- llm.cost: { input_usd?, output_usd?, total_usd? }
- llm.finish_reason?
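Because v1 only promises to store enough fields to recompute cost later, a recomputation step might look like the sketch below. The Price rate card, its per-million-token unit, and the choice to fold cache token cost into total_usd only are assumptions made for this example; no real provider prices are used.

```go
package main

import "fmt"

// Usage mirrors llm.usage; the optional cache fields default to zero.
type Usage struct {
	InputTokens      int64 `json:"input_tokens"`
	OutputTokens     int64 `json:"output_tokens"`
	CacheWriteTokens int64 `json:"cache_write_tokens,omitempty"`
	CacheReadTokens  int64 `json:"cache_read_tokens,omitempty"`
}

// Price is a hypothetical rate card in USD per million tokens.
type Price struct {
	InputPerM, OutputPerM, CacheWritePerM, CacheReadPerM float64
}

// Cost mirrors llm.cost.
type Cost struct {
	InputUSD  float64 `json:"input_usd"`
	OutputUSD float64 `json:"output_usd"`
	TotalUSD  float64 `json:"total_usd"`
}

// Recompute derives cost from stored token counts and a caller-supplied rate card.
// Assumption: cache token cost is included in TotalUSD only.
func Recompute(u Usage, p Price) Cost {
	const perM = 1_000_000.0
	in := float64(u.InputTokens) * p.InputPerM / perM
	out := float64(u.OutputTokens) * p.OutputPerM / perM
	cache := float64(u.CacheWriteTokens)*p.CacheWritePerM/perM +
		float64(u.CacheReadTokens)*p.CacheReadPerM/perM
	return Cost{InputUSD: in, OutputUSD: out, TotalUSD: in + out + cache}
}

func main() {
	// Placeholder numbers, chosen only to make the arithmetic easy to follow.
	c := Recompute(
		Usage{InputTokens: 1_000_000, OutputTokens: 500_000},
		Price{InputPerM: 2, OutputPerM: 10},
	)
	fmt.Printf("input=%.2f output=%.2f total=%.2f\n", c.InputUSD, c.OutputUSD, c.TotalUSD)
}
```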
Errors
- error: { type, message, code?, retryable?, source? }
- Errors can be standalone error events or embedded on span.end / run.end.
Ingestion
WebSocket (primary/live)
- Endpoint: GET /v1/ws
- Client sends one JSON event per WS message (simplest) or small batches.
- Server acks in-order sequences:
  - Client includes event.seq (monotonic per connection)
  - Server responds: { "ack": { "up_to_seq": N } }
- Reconnect behavior:
  - Client reconnects with ?client_id=...
  - Client replays any unacked events (idempotent by event.id)
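The ack and replay rules above can be sketched on the client side as a small outbox that remembers what has been sent but not yet acknowledged. This is a minimal sketch with no real WebSocket I/O; the Outbox type and its methods are hypothetical, and it assumes that sequence numbers restart at 1 on each new connection because seq is defined per connection.

```go
package main

import "fmt"

// pending is one event that was sent but not yet acknowledged.
type pending struct {
	seq  uint64
	id   string // event.id, stable across replays
	body []byte // encoded event, kept so it can be resent
}

// Outbox tracks unacknowledged events for one client. It is not safe for
// concurrent use; a real emitter would guard it with a mutex.
type Outbox struct {
	nextSeq uint64
	unacked []pending // ordered by seq
}

// Add records an outgoing event and returns the seq to send with it.
func (o *Outbox) Add(id string, body []byte) uint64 {
	o.nextSeq++
	o.unacked = append(o.unacked, pending{seq: o.nextSeq, id: id, body: body})
	return o.nextSeq
}

// Ack drops every event with seq <= upTo, mirroring {"ack":{"up_to_seq":N}}.
func (o *Outbox) Ack(upTo uint64) {
	i := 0
	for i < len(o.unacked) && o.unacked[i].seq <= upTo {
		i++
	}
	o.unacked = o.unacked[i:]
}

// Reconnect renumbers the unacked events for a fresh connection and returns
// their event ids in replay order. The ids do not change, so the server can
// deduplicate replays by event.id.
func (o *Outbox) Reconnect() []string {
	ids := make([]string, 0, len(o.unacked))
	for i := range o.unacked {
		o.unacked[i].seq = uint64(i + 1)
		ids = append(ids, o.unacked[i].id)
	}
	o.nextSeq = uint64(len(o.unacked))
	return ids
}

func main() {
	var o Outbox
	o.Add("a", nil)
	o.Add("b", nil)
	o.Add("c", nil)
	o.Ack(2) // server confirmed seq 1 and 2
	fmt.Println("replay after reconnect:", o.Reconnect())
}
```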
HTTP (batch/backfill)
- Endpoint: POST /v1/events
- Body: JSON array of events.
- Response: { accepted: <n>, rejected: <n>, errors?: [...] }
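A batch handler with this request and response shape could be sketched with the standard library as follows. The per-event check here is deliberately minimal (only event.id and event.type), the error strings are invented, and returning HTTP 200 for a partially rejected batch is an assumption; the publish-to-queue step is left as a comment.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// batchResult mirrors the response shape { accepted, rejected, errors? }.
type batchResult struct {
	Accepted int      `json:"accepted"`
	Rejected int      `json:"rejected"`
	Errors   []string `json:"errors,omitempty"`
}

// handleEvents accepts a JSON array of events and reports per-batch counts.
func handleEvents(w http.ResponseWriter, r *http.Request) {
	if r.Method != http.MethodPost {
		http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
		return
	}
	var batch []json.RawMessage
	if err := json.NewDecoder(r.Body).Decode(&batch); err != nil {
		http.Error(w, "body must be a JSON array of events", http.StatusBadRequest)
		return
	}
	var res batchResult
	for i, raw := range batch {
		var e struct {
			Event struct {
				ID   string `json:"id"`
				Type string `json:"type"`
			} `json:"event"`
		}
		if err := json.Unmarshal(raw, &e); err != nil || e.Event.ID == "" || e.Event.Type == "" {
			res.Rejected++
			res.Errors = append(res.Errors, fmt.Sprintf("index %d: missing event.id or event.type", i))
			continue
		}
		res.Accepted++ // a real gateway would publish the raw event to the queue here
	}
	w.Header().Set("Content-Type", "application/json")
	json.NewEncoder(w).Encode(res)
}

// postBatch exercises the handler in-process and returns status plus decoded body.
func postBatch(body string) (int, batchResult) {
	req := httptest.NewRequest(http.MethodPost, "/v1/events", strings.NewReader(body))
	rec := httptest.NewRecorder()
	handleEvents(rec, req)
	var res batchResult
	_ = json.NewDecoder(rec.Body).Decode(&res)
	return rec.Code, res
}

func main() {
	code, res := postBatch(`[{"event":{"id":"e1","type":"run.start"}},{"event":{"type":"run.end"}}]`)
	fmt.Println(code, res.Accepted, res.Rejected, res.Errors)
}
```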
Validation
- Gateway validates schema.name/version, required envelope fields, and basic typing.
- Non-strict mode (v1): allow unknown fields, store raw payload/attributes.
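One way the gateway check could look is sketched below: it verifies schema name/version, the required envelope fields, and the per-type required correlation ids listed under "Event types". The function name, struct, and error wording are illustrative, and ts is assumed to be an RFC3339 string. Go's json.Unmarshal ignores unknown fields by default, which happens to match non-strict mode.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"strings"
)

// event decodes only the fields that validation needs to inspect.
type event struct {
	Schema struct {
		Name    string `json:"name"`
		Version int    `json:"version"`
	} `json:"schema"`
	Event struct {
		ID   string `json:"id"`
		Type string `json:"type"`
		TS   string `json:"ts"` // assumption: RFC3339 string
	} `json:"event"`
	Source struct {
		Framework string `json:"framework"`
		ClientID  string `json:"client_id"`
		Host      string `json:"host"`
	} `json:"source"`
	Correlation struct {
		SessionID string `json:"session_id"`
		RunID     string `json:"run_id"`
		TraceID   string `json:"trace_id"`
		SpanID    string `json:"span_id"`
	} `json:"correlation"`
	Payload map[string]json.RawMessage `json:"payload"`
}

var knownTypes = map[string]bool{
	"session.start": true, "session.end": true,
	"run.start": true, "run.end": true,
	"span.start": true, "span.end": true,
	"error": true, "metric.snapshot": true,
}

// validate returns nil when the raw event satisfies the v1 envelope rules.
func validate(raw []byte) error {
	var e event
	if err := json.Unmarshal(raw, &e); err != nil {
		return fmt.Errorf("malformed event: %w", err)
	}
	if e.Schema.Name != "agentmon.event" || e.Schema.Version != 1 {
		return errors.New("unsupported schema name/version")
	}
	if !knownTypes[e.Event.Type] {
		return fmt.Errorf("unknown event.type %q", e.Event.Type)
	}
	var missing []string
	need := func(name, value string) {
		if value == "" {
			missing = append(missing, name)
		}
	}
	need("event.id", e.Event.ID)
	need("event.ts", e.Event.TS)
	need("source.framework", e.Source.Framework)
	need("source.client_id", e.Source.ClientID)
	need("source.host", e.Source.Host)
	switch {
	case strings.HasPrefix(e.Event.Type, "session."):
		need("correlation.session_id", e.Correlation.SessionID)
	case strings.HasPrefix(e.Event.Type, "run."):
		need("correlation.session_id", e.Correlation.SessionID)
		need("correlation.run_id", e.Correlation.RunID)
	case strings.HasPrefix(e.Event.Type, "span."):
		need("correlation.trace_id", e.Correlation.TraceID)
		need("correlation.span_id", e.Correlation.SpanID)
	case e.Event.Type == "error":
		if _, ok := e.Payload["error"]; !ok {
			missing = append(missing, "payload.error")
		}
	}
	if len(missing) > 0 {
		return fmt.Errorf("missing required fields: %s", strings.Join(missing, ", "))
	}
	return nil
}

func main() {
	raw := []byte(`{"schema":{"name":"agentmon.event","version":1},
		"event":{"id":"e1","type":"span.end","ts":"2026-01-16T12:00:00Z"},
		"source":{"framework":"opencode","client_id":"c1","host":"h1"},
		"correlation":{"trace_id":"t1"}}`)
	fmt.Println(validate(raw))
}
```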
Idempotency
- Every event has stable event.id (UUIDv7 recommended).
- Storage enforces unique (event_id) so retries + replays are safe.
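To show why a uniqueness constraint makes retries harmless, here is an in-memory stand-in for the events table. The Store type is hypothetical and only models the behaviour; against Postgres the equivalent would typically be an INSERT with ON CONFLICT (event_id) DO NOTHING, which is mentioned here as one option rather than as the design's chosen statement.

```go
package main

import (
	"fmt"
	"sync"
)

// Store is an in-memory stand-in for a table with a unique event_id column.
type Store struct {
	mu   sync.Mutex
	seen map[string]struct{}
}

func NewStore() *Store {
	return &Store{seen: make(map[string]struct{})}
}

// Insert reports whether the event was newly stored. A duplicate event_id is
// ignored without error, so replays and retries have no effect.
func (s *Store) Insert(eventID string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if _, dup := s.seen[eventID]; dup {
		return false
	}
	s.seen[eventID] = struct{}{}
	return true
}

// Len returns the number of distinct events stored.
func (s *Store) Len() int {
	s.mu.Lock()
	defer s.mu.Unlock()
	return len(s.seen)
}

func main() {
	s := NewStore()
	fmt.Println(s.Insert("e1")) // first delivery
	fmt.Println(s.Insert("e1")) // replay of the same event
	fmt.Println(s.Len())
}
```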
Backpressure
- WS gateway may send { "control": { "slow_down_ms": 250 } }.
- If overloaded, close WS with a retryable close code; client backs off.
- Emitters cap in-memory buffers and may optionally spool to disk.
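On the emitter side, the backpressure rules above might translate into something like the following sketch: a parser for the slow_down_ms control message and a capped in-memory buffer. The names are hypothetical, and dropping the oldest event when the buffer is full is an assumption; the design does not say which events to drop, and a disk spool would be an alternative.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// slowDown returns the pause requested by a control message, or 0 if the
// message is not a slow-down control message.
func slowDown(msg []byte) time.Duration {
	var c struct {
		Control struct {
			SlowDownMs int `json:"slow_down_ms"`
		} `json:"control"`
	}
	if err := json.Unmarshal(msg, &c); err != nil || c.Control.SlowDownMs <= 0 {
		return 0
	}
	return time.Duration(c.Control.SlowDownMs) * time.Millisecond
}

// Buffer is a capped in-memory queue of encoded events.
type Buffer struct {
	limit   int
	items   [][]byte
	Dropped int // how many events were discarded because the buffer was full
}

// NewBuffer creates a buffer holding at most limit events (minimum 1).
func NewBuffer(limit int) *Buffer {
	if limit < 1 {
		limit = 1
	}
	return &Buffer{limit: limit}
}

// Push appends an event, discarding the oldest one if the cap is reached.
func (b *Buffer) Push(ev []byte) {
	if len(b.items) >= b.limit {
		b.items = b.items[1:]
		b.Dropped++
	}
	b.items = append(b.items, ev)
}

// Len returns the number of buffered events.
func (b *Buffer) Len() int { return len(b.items) }

func main() {
	fmt.Println(slowDown([]byte(`{"control":{"slow_down_ms":250}}`)))

	b := NewBuffer(2)
	for _, ev := range []string{"a", "b", "c"} {
		b.Push([]byte(ev))
	}
	fmt.Println(b.Len(), b.Dropped)
}
```

Counting drops in the buffer also gives the emitter a number to report in a metric.snapshot event.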
Service Architecture (microservices)
v1 recommended services
- ingest-gateway
  - Exposes WS + HTTP endpoints.
  - Validates schema, assigns arrival timestamps, and publishes to queue.
  - Stateless; horizontal scaling.
- event-processor
  - Consumes from queue.
  - Deduplicates by event.id and writes to DB.
  - Builds/updates rollup tables.
- query-api
  - Read-only API used by the UI.
  - Performs filtered queries and serves aggregates.
- web-ui
  - Custom frontend.
  - Talks only to query-api.
- retention-job (CronJob)
  - Optional cleanup/compression policies.
Queue choice
- Preferred: NATS JetStream (durable stream, consumer groups).
- Stream: agentmon_events
- Subject pattern: agentmon.events.v1
Decision: queue-based to decouple ingest spikes from DB writes and to support reliable WS replay without holding DB transactions open.
Kubernetes deployment outline
Namespace
agentmon
Workloads
- Deployments:
  - ingest-gateway (Service ingest-gateway)
  - event-processor (no Service; talks to NATS + DB)
  - query-api (Service query-api)
  - web-ui (Service web-ui)
- Stateful:
  - Postgres (or external managed); PVC via longhorn
  - NATS JetStream (or external); PVC via longhorn
- CronJob:
  - retention-job
Ingress / exposure
- web-ui and ingest-gateway exposed via Ingress with Tailnet/LAN allowlisting.
- query-api internal-only (ClusterIP), only reachable from web-ui.
NetworkPolicies (default-deny)
- Allow web-ui -> query-api TCP 80/443
- Allow query-api -> Postgres TCP 5432
- Allow ingest-gateway -> NATS TCP 4222 (and JetStream as needed)
- Allow event-processor -> NATS + Postgres
- Allow ingress-controller namespace -> web-ui / ingest-gateway Services
Config/Secrets
- DATABASE_URL for processor/query-api
- NATS connection URL for ingest/processor
- (Optional) per-emitter shared secret, if you later decide to add lightweight auth
Storage
Postgres (production)
Suggested tables (minimal):
- events: append-only canonical storage
  - columns: event_id (pk), ts, type, session_id, run_id, trace_id, span_id, parent_span_id, source_framework, client_id, payload_jsonb
  - indexes on (ts), (session_id), (run_id), (type, ts)
- runs: derived/rollup
  - status, start/end ts, token totals, cost totals, error counts
- sessions: derived/rollup
  - start/end ts, host/user, run counts
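The runs rollup could be derived from span.end events roughly as in this sketch, which aggregates in memory instead of writing to a table. The SpanEnd and RunRollup types are hypothetical simplifications, and treating a status of "error" as a failure is an assumption, since the design does not enumerate status values.

```go
package main

import "fmt"

// SpanEnd is a flattened view of the span.end fields a rollup needs.
type SpanEnd struct {
	RunID        string
	Status       string // assumption: "error" marks a failed span
	InputTokens  int64
	OutputTokens int64
	TotalUSD     float64
}

// RunRollup holds per-run totals, matching the columns suggested for runs.
type RunRollup struct {
	Spans        int
	InputTokens  int64
	OutputTokens int64
	TotalUSD     float64
	Errors       int
}

// Rollup aggregates span.end events into per-run totals keyed by run id.
func Rollup(spans []SpanEnd) map[string]RunRollup {
	out := make(map[string]RunRollup)
	for _, s := range spans {
		r := out[s.RunID] // zero value when the run is first seen
		r.Spans++
		r.InputTokens += s.InputTokens
		r.OutputTokens += s.OutputTokens
		r.TotalUSD += s.TotalUSD
		if s.Status == "error" {
			r.Errors++
		}
		out[s.RunID] = r
	}
	return out
}

func main() {
	// Placeholder data for illustration only.
	runs := Rollup([]SpanEnd{
		{RunID: "r1", Status: "ok", InputTokens: 100, OutputTokens: 20, TotalUSD: 0.5},
		{RunID: "r1", Status: "error", InputTokens: 50},
		{RunID: "r2", Status: "ok", InputTokens: 10, OutputTokens: 5, TotalUSD: 0.25},
	})
	fmt.Printf("%+v\n", runs["r1"])
}
```

Because the events table is append-only and canonical, rollups like this can always be rebuilt from scratch if the aggregation rules change.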
SQLite (dev/small)
- Same logical schema; keep SQL portable.
- JSON stored as TEXT; use generated columns where supported.
UI (MVP pages)
- Overview
  - Total tokens/cost today/7d/30d
  - Latency and error rate trends
  - Top agents/skills by cost and failures
- Sessions
  - Filter by date, host, agent, framework
  - Drilldown into a session timeline
- Run detail
  - Waterfall view of spans
  - Per-span tokens/cost/latency
  - Error stack and retry chain
- Errors
  - Aggregations by error type/source
  - Most common failing skills/tools
- Agents
  - Leaderboard: cost, tokens/sec, success rate
  - Regression detection (compare last 24h vs baseline)
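The "last 24h vs baseline" comparison on the Agents page could start as simple as a relative change in a mean, as sketched here. The function names, the use of mean latency as the metric, and the threshold are all assumptions; the design does not specify how regressions are detected.

```go
package main

import "fmt"

// mean returns the arithmetic mean, or 0 for an empty slice.
func mean(xs []float64) float64 {
	if len(xs) == 0 {
		return 0
	}
	sum := 0.0
	for _, x := range xs {
		sum += x
	}
	return sum / float64(len(xs))
}

// Regressed compares recent samples (for example, latencies from the last 24h)
// with baseline samples. It returns the relative change of the mean and
// whether that change exceeds threshold (0.2 means 20% worse).
func Regressed(baseline, recent []float64, threshold float64) (float64, bool) {
	b := mean(baseline)
	if b == 0 {
		return 0, false // no usable baseline, so nothing to compare against
	}
	change := (mean(recent) - b) / b
	return change, change > threshold
}

func main() {
	// Placeholder latencies in milliseconds.
	change, bad := Regressed([]float64{100, 100}, []float64{150}, 0.2)
	fmt.Printf("change=%.0f%% regressed=%v\n", change*100, bad)
}
```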
Security model (no app auth)
- Expose services only on Tailnet/LAN.
- Ingress restricted to known source CIDRs / tailscale ingress.
- K8s NetworkPolicies:
  - allow UI -> query-api
  - allow query-api -> DB
  - allow ingest-gateway from tailnet ingress only
  - deny all else by default
Operational concerns
Observability
agentmon should expose its own metrics:
- ingest rate, queue lag, processor throughput
- DB write latency
- dropped events, invalid schema counts
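As one low-dependency option for the counters above, Go's standard expvar package can hold them; it also registers a /debug/vars handler on http.DefaultServeMux when a server using that mux is running. The metric names below are invented for this sketch, and the design does not state which metrics library the services use.

```go
package main

import (
	"expvar"
	"fmt"
)

// Hypothetical metric names; expvar publishes them as JSON under /debug/vars.
var (
	ingested = expvar.NewInt("agentmon_ingest_events_total")
	invalid  = expvar.NewInt("agentmon_invalid_schema_total")
)

// recordIngest counts one received event as either accepted or invalid.
func recordIngest(valid bool) {
	if valid {
		ingested.Add(1)
		return
	}
	invalid.Add(1)
}

func main() {
	recordIngest(true)
	recordIngest(false)
	fmt.Println(ingested.Value(), invalid.Value())
}
```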
Testing
- Contract tests for event schema validation.
- Ingestion idempotency tests (replay same event).
- Processor tests for rollups.
- UI smoke tests for main dashboards.
Open questions
- Do we want to support partial OTel mapping (trace/span ids) early?
- Do we want per-client local disk spool as part of the “official” SDK?
- How strict should schema validation be (reject unknown fields vs allow)?