Files

T

William Valentin 256b841cbf feat: scaffold agentmon services and k8s deploy

Adds Go microservices (ingest-gateway, event-processor, query-api, web-ui), NATS+Postgres wiring, initial schema/init job, ingress manifests for LAN+tailnet, and a multi-arch image build script.

2026-01-17 01:06:57 -08:00

9.1 KiB

Raw Permalink Blame History

agentmon — design (2026-01-16)

Goals

agentmon is a self-hosted (homelab K8s) telemetry/analytics system for local agent runs (OpenCode + Claude Code). It captures structured events/spans and produces a custom web UI focused on: token usage, cost, latency, errors, and efficiency across sessions, agents, skills/commands, and models.

Primary requirements

Collect telemetry from OpenCode and Claude Code on local machines.
Ship telemetry to cluster over Tailnet + LAN.
Ingestion supports WebSocket + HTTP.
UI is a custom web UI (not Grafana).
Storage: SQLite for dev/small; Postgres for production.
Retention: keep forever by default; allow optional cleanup tooling.
No app-level auth (trusted network); enforce access with network-layer controls.

Non-goals

Distributed tracing compatibility (OTel) in v1; we may map later.
“Perfect” cost computation across all providers in v1; we store enough fields to recompute.

Event Model

Core concepts

Session: a human-initiated interactive timeframe (e.g., one terminal session).
Run: a single user invocation that triggers an agent workflow (a command, task, or skill execution).
Trace: a logical end-to-end chain of work (often equals a run, but can span).
Span: a timed unit of work (tool call, model call, indexing job, etc.).
Event: an append-only record (span start/end, run start/end, error, metric snapshot).

Envelope (common to all events)

All events are JSON objects. Fields marked required are required for every event type.

schema (required): { name: "agentmon.event", version: 1 }
event (required):
- id (required): stable event UUID (UUIDv7 recommended)
- type (required): one of:
  - session.start, session.end
  - run.start, run.end
  - span.start, span.end
  - error
  - metric.snapshot
- ts (required): event timestamp (RFC3339 or unix-ms; choose one and standardize in SDK)
- seq (optional): monotonic sequence per connection (enables gap detection + ACKing)
- source (required):
  - framework (required): opencode | claude-code | other
  - client_id (required): stable id for the emitter install (per-machine is fine)
  - host (required): hostname
  - user (optional): local username
  - version (optional): emitter/SDK version
correlation (optional but strongly recommended):
- session_id (recommended)
- run_id (recommended)
- trace_id (recommended)
- span_id (required for span events)
- parent_span_id (optional)
attributes (optional): freeform map for tags (agent name, skill, tool, model, repo, branch, etc.)
payload (optional): event-type-specific object (see below)

Event types (minimum v1)

`session.start` / `session.end`

Required: correlation.session_id
Recommended attributes: framework_session (native id if exists), cwd, repo, branch

`run.start` / `run.end`

Required: correlation.session_id, correlation.run_id
Recommended attributes: command, agent, workflow, prompt_hash?
run.end may include aggregate payload.usage (token totals, cost totals, status)

`span.start` / `span.end`

Required: correlation.trace_id, correlation.span_id
Recommended attributes: span_kind (llm|tool|skill|internal), name
span.start payload: { start_ts } (or omit if you trust event.ts)
span.end payload: { end_ts, status, duration_ms?, llm?, error? }

`error`

Required: payload.error
Recommended: attach correlation ids when known (session/run/trace/span)

`metric.snapshot`

Used for periodic gauges (queue lag, emitter buffer size, etc.)
payload.metrics: map of numeric values

LLM usage fields (when applicable)

Stored on span.end (and optionally on run.end aggregates):

llm.model: provider/model string
llm.usage: { input_tokens, output_tokens, cache_write_tokens?, cache_read_tokens? }
llm.cost: { input_usd?, output_usd?, total_usd? }
llm.finish_reason?

Errors

error: { type, message, code?, retryable?, source? }
Errors can be standalone error events or embedded on span.end/run.end.

Ingestion

WebSocket (primary/live)

Endpoint: GET /v1/ws
Client sends one JSON event per WS message (simplest) or small batches.
Server acks in-order sequences:
- Client includes event.seq (monotonic per connection)
- Server responds: { "ack": { "up_to_seq": N } }
Reconnect behavior:
- Client reconnects with ?client_id=...
- Client replays any unacked events (idempotent by event.id)

HTTP (batch/backfill)

Endpoint: POST /v1/events
Body: JSON array of events.
Response: { accepted: <n>, rejected: <n>, errors?: [...] }

Validation

Gateway validates schema.name/version, required envelope fields, and basic typing.
Non-strict mode (v1): allow unknown fields, store raw payload/attributes.

Idempotency

Every event has stable event.id (UUIDv7 recommended).
Storage enforces unique (event_id) so retries + replays are safe.

Backpressure

WS gateway may send { "control": { "slow_down_ms": 250 } }.
If overloaded, close WS with a retryable close code; client backs off.
Emitters cap in-memory buffers and may optionally spool to disk.

Service Architecture (microservices)

v1 recommended services

ingest-gateway

Exposes WS + HTTP endpoints.
Validates schema, assigns arrival timestamps, and publishes to queue.
Stateless; horizontal scaling.

event-processor

Consumes from queue.
Deduplicates by event.id and writes to DB.
Builds/updates rollup tables.

query-api

Read-only API used by the UI.
Performs filtered queries and serves aggregates.

web-ui

Custom frontend.
Talks only to query-api.

retention-job (CronJob)

Optional cleanup/compression policies.

Queue choice

Preferred: NATS JetStream (durable stream, consumer groups).
Stream: agentmon_events
Subject pattern: agentmon.events.v1

Decision: queue-based to decouple ingest spikes from DB writes and to support reliable WS replay without holding DB transactions open.

Kubernetes deployment outline

Namespace

agentmon

Workloads

Deployments:
- ingest-gateway (Service ingest-gateway)
- event-processor (no Service; talks to NATS + DB)
- query-api (Service query-api)
- web-ui (Service web-ui)
Stateful:
- Postgres (or external managed); PVC via longhorn
- NATS JetStream (or external); PVC via longhorn
CronJob:
- retention-job

Ingress / exposure

web-ui and ingest-gateway exposed via Ingress with Tailnet/LAN allowlisting.
query-api internal-only (ClusterIP), only reachable from web-ui.

NetworkPolicies (default-deny)

Allow web-ui -> query-api TCP 80/443
Allow query-api -> Postgres TCP 5432
Allow ingest-gateway -> NATS TCP 4222 (and JetStream as needed)
Allow event-processor -> NATS + Postgres
Allow ingress-controller namespace -> web-ui/ingest-gateway Services

Config/Secrets

DATABASE_URL for processor/query-api
NATS connection URL for ingest/processor
(Optional) per-emitter shared secret, if you later decide to add lightweight auth

Storage

Postgres (production)

Suggested tables (minimal):

events: append-only canonical storage
- columns: event_id (pk), ts, type, session_id, run_id, trace_id, span_id, parent_span_id, source_framework, client_id, payload_jsonb
- indexes on (ts), (session_id), (run_id), (type, ts)
runs: derived/rollup
- status, start/end ts, token totals, cost totals, error counts
sessions: derived/rollup
- start/end ts, host/user, run counts

SQLite (dev/small)

Same logical schema; keep SQL portable.
JSON stored as TEXT; use generated columns where supported.

UI (MVP pages)

Overview

Total tokens/cost today/7d/30d
Latency and error rate trends
Top agents/skills by cost and failures

Sessions

Filter by date, host, agent, framework
Drilldown into a session timeline

Run detail

Waterfall view of spans
Per-span tokens/cost/latency
Error stack and retry chain

Errors

Aggregations by error type/source
Most common failing skills/tools

Agents

Leaderboard: cost, tokens/sec, success rate
Regression detection (compare last 24h vs baseline)

Security model (no app auth)

Expose services only on Tailnet/LAN.
Ingress restricted to known source CIDRs / tailscale ingress.
K8s NetworkPolicies:
- allow UI -> query-api
- allow query-api -> DB
- allow ingest-gateway from tailnet ingress only
- deny all else by default

Operational concerns

Observability

agentmon should expose its own metrics:

ingest rate, queue lag, processor throughput
DB write latency
dropped events, invalid schema counts

Testing

Contract tests for event schema validation.
Ingestion idempotency tests (replay same event).
Processor tests for rollups.
UI smoke tests for main dashboards.

Open questions

Do we want to support partial OTel mapping (trace/span ids) early?
Do we want per-client local disk spool as part of the “official” SDK?
How strict should schema validation be (reject unknown fields vs allow)?

9.1 KiB Raw Permalink Blame History