Files
agentmon/docs/plans/2026-01-16-agentmon-design.md
William Valentin 256b841cbf feat: scaffold agentmon services and k8s deploy
Adds Go microservices (ingest-gateway, event-processor, query-api, web-ui), NATS+Postgres wiring, initial schema/init job, ingress manifests for LAN+tailnet, and a multi-arch image build script.
2026-01-17 01:06:57 -08:00

9.1 KiB

agentmon — design (2026-01-16)

Goals

agentmon is a self-hosted (homelab K8s) telemetry/analytics system for local agent runs (OpenCode + Claude Code). It captures structured events/spans and produces a custom web UI focused on: token usage, cost, latency, errors, and efficiency across sessions, agents, skills/commands, and models.

Primary requirements

  • Collect telemetry from OpenCode and Claude Code on local machines.
  • Ship telemetry to cluster over Tailnet + LAN.
  • Ingestion supports WebSocket + HTTP.
  • UI is a custom web UI (not Grafana).
  • Storage: SQLite for dev/small; Postgres for production.
  • Retention: keep forever by default; allow optional cleanup tooling.
  • No app-level auth (trusted network); enforce access with network-layer controls.

Non-goals

  • Distributed tracing compatibility (OTel) in v1; we may map later.
  • “Perfect” cost computation across all providers in v1; we store enough fields to recompute.

Event Model

Core concepts

  • Session: a human-initiated interactive timeframe (e.g., one terminal session).
  • Run: a single user invocation that triggers an agent workflow (a command, task, or skill execution).
  • Trace: a logical end-to-end chain of work (often equals a run, but can span).
  • Span: a timed unit of work (tool call, model call, indexing job, etc.).
  • Event: an append-only record (span start/end, run start/end, error, metric snapshot).

Envelope (common to all events)

All events are JSON objects. Fields marked required are required for every event type.

  • schema (required): { name: "agentmon.event", version: 1 }
  • event (required):
    • id (required): stable event UUID (UUIDv7 recommended)
    • type (required): one of:
      • session.start, session.end
      • run.start, run.end
      • span.start, span.end
      • error
      • metric.snapshot
    • ts (required): event timestamp (RFC3339 or unix-ms; choose one and standardize in SDK)
    • seq (optional): monotonic sequence per connection (enables gap detection + ACKing)
    • source (required):
      • framework (required): opencode | claude-code | other
      • client_id (required): stable id for the emitter install (per-machine is fine)
      • host (required): hostname
      • user (optional): local username
      • version (optional): emitter/SDK version
  • correlation (optional but strongly recommended):
    • session_id (recommended)
    • run_id (recommended)
    • trace_id (recommended)
    • span_id (required for span events)
    • parent_span_id (optional)
  • attributes (optional): freeform map for tags (agent name, skill, tool, model, repo, branch, etc.)
  • payload (optional): event-type-specific object (see below)

Event types (minimum v1)

session.start / session.end

  • Required: correlation.session_id
  • Recommended attributes: framework_session (native id if exists), cwd, repo, branch

run.start / run.end

  • Required: correlation.session_id, correlation.run_id
  • Recommended attributes: command, agent, workflow, prompt_hash?
  • run.end may include aggregate payload.usage (token totals, cost totals, status)

span.start / span.end

  • Required: correlation.trace_id, correlation.span_id
  • Recommended attributes: span_kind (llm|tool|skill|internal), name
  • span.start payload: { start_ts } (or omit if you trust event.ts)
  • span.end payload: { end_ts, status, duration_ms?, llm?, error? }

error

  • Required: payload.error
  • Recommended: attach correlation ids when known (session/run/trace/span)

metric.snapshot

  • Used for periodic gauges (queue lag, emitter buffer size, etc.)
  • payload.metrics: map of numeric values

LLM usage fields (when applicable)

Stored on span.end (and optionally on run.end aggregates):

  • llm.model: provider/model string
  • llm.usage: { input_tokens, output_tokens, cache_write_tokens?, cache_read_tokens? }
  • llm.cost: { input_usd?, output_usd?, total_usd? }
  • llm.finish_reason?

Errors

  • error: { type, message, code?, retryable?, source? }
  • Errors can be standalone error events or embedded on span.end/run.end.

Ingestion

WebSocket (primary/live)

  • Endpoint: GET /v1/ws
  • Client sends one JSON event per WS message (simplest) or small batches.
  • Server acks in-order sequences:
    • Client includes event.seq (monotonic per connection)
    • Server responds: { "ack": { "up_to_seq": N } }
  • Reconnect behavior:
    • Client reconnects with ?client_id=...
    • Client replays any unacked events (idempotent by event.id)

HTTP (batch/backfill)

  • Endpoint: POST /v1/events
  • Body: JSON array of events.
  • Response: { accepted: <n>, rejected: <n>, errors?: [...] }

Validation

  • Gateway validates schema.name/version, required envelope fields, and basic typing.
  • Non-strict mode (v1): allow unknown fields, store raw payload/attributes.

Idempotency

  • Every event has stable event.id (UUIDv7 recommended).
  • Storage enforces unique (event_id) so retries + replays are safe.

Backpressure

  • WS gateway may send { "control": { "slow_down_ms": 250 } }.
  • If overloaded, close WS with a retryable close code; client backs off.
  • Emitters cap in-memory buffers and may optionally spool to disk.

Service Architecture (microservices)

  1. ingest-gateway
  • Exposes WS + HTTP endpoints.
  • Validates schema, assigns arrival timestamps, and publishes to queue.
  • Stateless; horizontal scaling.
  1. event-processor
  • Consumes from queue.
  • Deduplicates by event.id and writes to DB.
  • Builds/updates rollup tables.
  1. query-api
  • Read-only API used by the UI.
  • Performs filtered queries and serves aggregates.
  1. web-ui
  • Custom frontend.
  • Talks only to query-api.
  1. retention-job (CronJob)
  • Optional cleanup/compression policies.

Queue choice

  • Preferred: NATS JetStream (durable stream, consumer groups).
  • Stream: agentmon_events
  • Subject pattern: agentmon.events.v1

Decision: queue-based to decouple ingest spikes from DB writes and to support reliable WS replay without holding DB transactions open.

Kubernetes deployment outline

Namespace

  • agentmon

Workloads

  • Deployments:
    • ingest-gateway (Service ingest-gateway)
    • event-processor (no Service; talks to NATS + DB)
    • query-api (Service query-api)
    • web-ui (Service web-ui)
  • Stateful:
    • Postgres (or external managed); PVC via longhorn
    • NATS JetStream (or external); PVC via longhorn
  • CronJob:
    • retention-job

Ingress / exposure

  • web-ui and ingest-gateway exposed via Ingress with Tailnet/LAN allowlisting.
  • query-api internal-only (ClusterIP), only reachable from web-ui.

NetworkPolicies (default-deny)

  • Allow web-ui -> query-api TCP 80/443
  • Allow query-api -> Postgres TCP 5432
  • Allow ingest-gateway -> NATS TCP 4222 (and JetStream as needed)
  • Allow event-processor -> NATS + Postgres
  • Allow ingress-controller namespace -> web-ui/ingest-gateway Services

Config/Secrets

  • DATABASE_URL for processor/query-api
  • NATS connection URL for ingest/processor
  • (Optional) per-emitter shared secret, if you later decide to add lightweight auth

Storage

Postgres (production)

Suggested tables (minimal):

  • events: append-only canonical storage
    • columns: event_id (pk), ts, type, session_id, run_id, trace_id, span_id, parent_span_id, source_framework, client_id, payload_jsonb
    • indexes on (ts), (session_id), (run_id), (type, ts)
  • runs: derived/rollup
    • status, start/end ts, token totals, cost totals, error counts
  • sessions: derived/rollup
    • start/end ts, host/user, run counts

SQLite (dev/small)

  • Same logical schema; keep SQL portable.
  • JSON stored as TEXT; use generated columns where supported.

UI (MVP pages)

  1. Overview
  • Total tokens/cost today/7d/30d
  • Latency and error rate trends
  • Top agents/skills by cost and failures
  1. Sessions
  • Filter by date, host, agent, framework
  • Drilldown into a session timeline
  1. Run detail
  • Waterfall view of spans
  • Per-span tokens/cost/latency
  • Error stack and retry chain
  1. Errors
  • Aggregations by error type/source
  • Most common failing skills/tools
  1. Agents
  • Leaderboard: cost, tokens/sec, success rate
  • Regression detection (compare last 24h vs baseline)

Security model (no app auth)

  • Expose services only on Tailnet/LAN.
  • Ingress restricted to known source CIDRs / tailscale ingress.
  • K8s NetworkPolicies:
    • allow UI -> query-api
    • allow query-api -> DB
    • allow ingest-gateway from tailnet ingress only
    • deny all else by default

Operational concerns

Observability

agentmon should expose its own metrics:

  • ingest rate, queue lag, processor throughput
  • DB write latency
  • dropped events, invalid schema counts

Testing

  • Contract tests for event schema validation.
  • Ingestion idempotency tests (replay same event).
  • Processor tests for rollups.
  • UI smoke tests for main dashboards.

Open questions

  • Do we want to support partial OTel mapping (trace/span ids) early?
  • Do we want per-client local disk spool as part of the “official” SDK?
  • How strict should schema validation be (reject unknown fields vs allow)?