feat: scaffold agentmon services and k8s deploy

Adds Go microservices (ingest-gateway, event-processor, query-api, web-ui), NATS+Postgres wiring, initial schema/init job, ingress manifests for LAN+tailnet, and a multi-arch image build script.
2026-01-17 01:06:57 -08:00
parent a584d7e274
commit 256b841cbf
28 changed files with 1554 additions and 0 deletions
@@ -0,0 +1,254 @@
+# agentmon — design (2026-01-16)
+
+## Goals
+
+agentmon is a self-hosted (homelab K8s) telemetry/analytics system for local agent runs (OpenCode + Claude Code). It captures structured events/spans and presents them in a custom web UI focused on token usage, cost, latency, errors, and efficiency across sessions, agents, skills/commands, and models.
+
+Primary requirements
+- Collect telemetry from OpenCode and Claude Code on local machines.
+- Ship telemetry to the cluster over Tailnet + LAN.
+- Ingestion supports **WebSocket + HTTP**.
+- UI is a **custom web UI** (not Grafana).
+- Storage: SQLite for dev/small; Postgres for production.
+- Retention: keep forever by default; allow optional cleanup tooling.
+- No app-level auth (trusted network); enforce access with network-layer controls.
+
+Non-goals
+- Distributed tracing compatibility (OTel) in v1; we may map later.
+- “Perfect” cost computation across all providers in v1; we store enough fields to recompute.
+
+## Event Model
+
+### Core concepts
+- **Session**: a human-initiated interactive timeframe (e.g., one terminal session).
+- **Run**: a single user invocation that triggers an agent workflow (a command, task, or skill execution).
+- **Trace**: a logical end-to-end chain of work (often equals a run, but can span multiple runs).
+- **Span**: a timed unit of work (tool call, model call, indexing job, etc.).
+- **Event**: an append-only record (span start/end, run start/end, error, metric snapshot).
+
+### Envelope (common to all events)
+All events are JSON objects. Fields marked **required** are required for every event type.
+
+- `schema` (**required**): `{ name: "agentmon.event", version: 1 }`
+- `event` (**required**):
+  - `id` (**required**): stable event UUID (UUIDv7 recommended)
+  - `type` (**required**): one of:
+    - `session.start`, `session.end`
+    - `run.start`, `run.end`
+    - `span.start`, `span.end`
+    - `error`
+    - `metric.snapshot`
+  - `ts` (**required**): event timestamp (RFC3339 or unix-ms; choose one and standardize in SDK)
+  - `seq` (optional): monotonic sequence per connection (enables gap detection + ACKing)
+  - `source` (**required**):
+    - `framework` (**required**): `opencode` | `claude-code` | `other`
+    - `client_id` (**required**): stable id for the emitter install (per-machine is fine)
+    - `host` (**required**): hostname
+    - `user` (optional): local username
+    - `version` (optional): emitter/SDK version
+- `correlation` (optional but strongly recommended):
+  - `session_id` (recommended)
+  - `run_id` (recommended)
+  - `trace_id` (recommended; required for span events)
+  - `span_id` (required for span events)
+  - `parent_span_id` (optional)
+- `attributes` (optional): freeform map for tags (agent name, skill, tool, model, repo, branch, etc.)
+- `payload` (optional): event-type-specific object (see below)
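A minimal Go sketch of the envelope above, as an emitter SDK or the gateway might model it. The type and function names (`Envelope`, `EventMeta`, `DecodeEnvelope`) are illustrative, not taken from the repository, and the sketch assumes `event.ts` is standardized on RFC3339 (the design leaves RFC3339 vs unix-ms open).

```go
package main

import (
	"encoding/json"
	"time"
)

// Schema identifies the envelope format ("agentmon.event", version 1 in v1).
type Schema struct {
	Name    string `json:"name"`
	Version int    `json:"version"`
}

// Source describes the emitter that produced the event.
type Source struct {
	Framework string `json:"framework"` // "opencode" | "claude-code" | "other"
	ClientID  string `json:"client_id"`
	Host      string `json:"host"`
	User      string `json:"user,omitempty"`
	Version   string `json:"version,omitempty"`
}

// EventMeta is the `event` object of the envelope.
type EventMeta struct {
	ID     string    `json:"id"`
	Type   string    `json:"type"`
	TS     time.Time `json:"ts"`            // assumption: RFC3339 on the wire
	Seq    *uint64   `json:"seq,omitempty"` // nil when the emitter sends no seq
	Source Source    `json:"source"`
}

// Correlation carries the optional ids that tie events together.
type Correlation struct {
	SessionID    string `json:"session_id,omitempty"`
	RunID        string `json:"run_id,omitempty"`
	TraceID      string `json:"trace_id,omitempty"`
	SpanID       string `json:"span_id,omitempty"`
	ParentSpanID string `json:"parent_span_id,omitempty"`
}

// Envelope is the common wrapper for every event type.
type Envelope struct {
	Schema      Schema          `json:"schema"`
	Event       EventMeta       `json:"event"`
	Correlation *Correlation    `json:"correlation,omitempty"`
	Attributes  map[string]any  `json:"attributes,omitempty"`
	Payload     json.RawMessage `json:"payload,omitempty"` // decoded per event type later
}

// DecodeEnvelope parses one JSON event. Unknown fields are ignored, which is
// encoding/json's default behaviour.
func DecodeEnvelope(data []byte) (Envelope, error) {
	var env Envelope
	err := json.Unmarshal(data, &env)
	return env, err
}
```

Keeping `payload` as `json.RawMessage` defers type-specific decoding until `event.type` is known, and `Seq` is a pointer so that an absent sequence number is distinguishable from zero.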
+
+### Event types (minimum v1)
+
+#### `session.start` / `session.end`
+- Required: `correlation.session_id`
+- Recommended `attributes`: `framework_session` (native id if exists), `cwd`, `repo`, `branch`
+
+#### `run.start` / `run.end`
+- Required: `correlation.session_id`, `correlation.run_id`
+- Recommended `attributes`: `command`, `agent`, `workflow`, `prompt_hash?`
+- `run.end` may include aggregate `payload.usage` (token totals, cost totals, status)
+
+#### `span.start` / `span.end`
+- Required: `correlation.trace_id`, `correlation.span_id`
+- Recommended `attributes`: `span_kind` (`llm`|`tool`|`skill`|`internal`), `name`
+- `span.start` `payload`: `{ start_ts }` (or omit if you trust `event.ts`)
+- `span.end` `payload`: `{ end_ts, status, duration_ms?, llm?, error? }`
+
+#### `error`
+- Required: `payload.error`
+- Recommended: attach correlation ids when known (session/run/trace/span)
+
+#### `metric.snapshot`
+- Used for periodic gauges (queue lag, emitter buffer size, etc.)
+- `payload.metrics`: map of numeric values
+
+### LLM usage fields (when applicable)
+Stored on `span.end` (and optionally on `run.end` aggregates):
+- `llm.model`: provider/model string
+- `llm.usage`: `{ input_tokens, output_tokens, cache_write_tokens?, cache_read_tokens? }`
+- `llm.cost`: `{ input_usd?, output_usd?, total_usd? }`
+- `llm.finish_reason?`
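Because v1 stores token counts rather than relying on exact provider costs, cost can be recomputed later from `llm.usage`. The sketch below shows one way to do that. The `Prices` type, its per-million-token unit, and the choice to fold cache tokens into `total_usd` only are assumptions for illustration; the design does not define a price table.

```go
package main

// Usage mirrors `llm.usage`; the cache fields are optional in the design.
type Usage struct {
	InputTokens      int64 `json:"input_tokens"`
	OutputTokens     int64 `json:"output_tokens"`
	CacheWriteTokens int64 `json:"cache_write_tokens,omitempty"`
	CacheReadTokens  int64 `json:"cache_read_tokens,omitempty"`
}

// Cost mirrors `llm.cost`.
type Cost struct {
	InputUSD  float64 `json:"input_usd"`
	OutputUSD float64 `json:"output_usd"`
	TotalUSD  float64 `json:"total_usd"`
}

// Prices is a hypothetical per-model price entry, in USD per million tokens.
type Prices struct {
	InputPerMTok      float64
	OutputPerMTok     float64
	CacheWritePerMTok float64
	CacheReadPerMTok  float64
}

// RecomputeCost derives a Cost from stored token counts and a price entry.
// Cache token charges are included in TotalUSD only (an assumption).
func RecomputeCost(u Usage, p Prices) Cost {
	usd := func(tokens int64, perMTok float64) float64 {
		return float64(tokens) / 1e6 * perMTok
	}
	input := usd(u.InputTokens, p.InputPerMTok)
	output := usd(u.OutputTokens, p.OutputPerMTok)
	cache := usd(u.CacheWriteTokens, p.CacheWritePerMTok) +
		usd(u.CacheReadTokens, p.CacheReadPerMTok)
	return Cost{InputUSD: input, OutputUSD: output, TotalUSD: input + output + cache}
}
```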
+
+### Errors
+- `error`: `{ type, message, code?, retryable?, source? }`
+- Errors can be standalone `error` events or embedded on `span.end`/`run.end`.
+
+## Ingestion
+
+### WebSocket (primary/live)
+- Endpoint: `GET /v1/ws`
+- Client sends one JSON event per WS message (simplest) or small batches.
+- Server acks in-order sequences:
+  - Client includes `event.seq` (monotonic per connection; optional in the envelope, but needed for acks)
+  - Server responds: `{ "ack": { "up_to_seq": N } }`
+- Reconnect behavior:
+  - Client reconnects with `?client_id=...`
+  - Client replays any unacked events (idempotent by `event.id`)
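The in-order ack rule can be sketched as a small per-connection tracker: the server may finish handling events out of order (for example when queue publishes complete asynchronously), but it only reports the highest `N` for which every sequence up to `N` is done. `AckTracker` is a hypothetical name, and the sketch assumes sequences start at 1 on each connection, which the design does not state.

```go
package main

import "encoding/json"

// AckTracker records which per-connection sequence numbers have been handled
// and exposes the highest N such that every seq in 1..N is done.
// It is not safe for concurrent use; guard it with a mutex if needed.
type AckTracker struct {
	upTo    uint64              // highest contiguous seq handled so far
	pending map[uint64]struct{} // handled seqs above upTo, waiting on a gap
}

func NewAckTracker() *AckTracker {
	return &AckTracker{pending: make(map[uint64]struct{})}
}

// Done marks seq as handled. It returns the contiguous high-water mark and
// whether the mark advanced, i.e. whether a new ack is worth sending.
func (t *AckTracker) Done(seq uint64) (upTo uint64, advanced bool) {
	if seq <= t.upTo {
		return t.upTo, false // duplicate or replayed event
	}
	t.pending[seq] = struct{}{}
	before := t.upTo
	for {
		if _, ok := t.pending[t.upTo+1]; !ok {
			break
		}
		delete(t.pending, t.upTo+1)
		t.upTo++
	}
	return t.upTo, t.upTo > before
}

type ackBody struct {
	UpToSeq uint64 `json:"up_to_seq"`
}

type ackMessage struct {
	Ack ackBody `json:"ack"`
}

// Message renders the current mark in the wire shape
// {"ack":{"up_to_seq":N}}.
func (t *AckTracker) Message() ([]byte, error) {
	return json.Marshal(ackMessage{Ack: ackBody{UpToSeq: t.upTo}})
}
```

On reconnect the client would replay everything above the last `up_to_seq` it saw; replays at or below the mark are treated as duplicates here, and `event.id` uniqueness in storage covers the rest.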
+
+### HTTP (batch/backfill)
+- Endpoint: `POST /v1/events`
+- Body: JSON array of events.
+- Response: `{ accepted: <n>, rejected: <n>, errors?: [...] }`
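A hedged sketch of the batch endpoint using only `net/http`. The shape of each `errors` entry (`index`, `message`), the 8 MiB body cap, and returning 200 for partially rejected batches are assumptions; the design only fixes the `accepted`/`rejected` counters. `accept` stands in for whatever validates and publishes a single event.

```go
package main

import (
	"encoding/json"
	"errors"
	"io"
	"net/http"
)

// ErrInvalidEvent is a placeholder error an accept callback can return.
var ErrInvalidEvent = errors.New("invalid event")

// BatchError describes one rejected event (hypothetical shape).
type BatchError struct {
	Index   int    `json:"index"`
	Message string `json:"message"`
}

// BatchResponse is the body returned by POST /v1/events.
type BatchResponse struct {
	Accepted int          `json:"accepted"`
	Rejected int          `json:"rejected"`
	Errors   []BatchError `json:"errors,omitempty"`
}

// ProcessBatch decodes a JSON array and passes each raw event to accept.
// It fails only when the body is not a JSON array.
func ProcessBatch(body []byte, accept func(raw []byte) error) (BatchResponse, error) {
	var events []json.RawMessage
	if err := json.Unmarshal(body, &events); err != nil {
		return BatchResponse{}, err
	}
	var resp BatchResponse
	for i, raw := range events {
		if err := accept(raw); err != nil {
			resp.Rejected++
			resp.Errors = append(resp.Errors, BatchError{Index: i, Message: err.Error()})
			continue
		}
		resp.Accepted++
	}
	return resp, nil
}

// EventsHandler adapts ProcessBatch to an http.Handler for POST /v1/events.
func EventsHandler(accept func(raw []byte) error) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		if r.Method != http.MethodPost {
			w.Header().Set("Allow", http.MethodPost)
			http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
			return
		}
		body, err := io.ReadAll(http.MaxBytesReader(w, r.Body, 8<<20))
		if err != nil {
			http.Error(w, "could not read request body", http.StatusBadRequest)
			return
		}
		resp, err := ProcessBatch(body, accept)
		if err != nil {
			http.Error(w, "body must be a JSON array of events", http.StatusBadRequest)
			return
		}
		w.Header().Set("Content-Type", "application/json")
		_ = json.NewEncoder(w).Encode(resp) // write errors are not recoverable here
	}
}
```

Wiring it up would look like `http.Handle("/v1/events", EventsHandler(accept))`.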
+
+### Validation
+- Gateway validates `schema.name/version`, required envelope fields, and basic typing.
+- Non-strict mode (v1): allow unknown fields, store raw `payload`/`attributes`.
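The validation rules above, together with the per-type required fields from the Event Model section, could be sketched as follows. `ValidateEvent` and the `head` struct are hypothetical names; the sketch decodes only the fields it checks, so unknown fields pass through untouched (non-strict mode), and it accepts `ts` as either a string or a number because the format is not yet standardized. It relies on `errors.Join`, so it needs Go 1.20 or newer.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"strings"
)

// head holds only the fields the gateway checks; everything else is ignored.
type head struct {
	Schema struct {
		Name    string `json:"name"`
		Version int    `json:"version"`
	} `json:"schema"`
	Event struct {
		ID     string `json:"id"`
		Type   string `json:"type"`
		TS     any    `json:"ts"`
		Source struct {
			Framework string `json:"framework"`
			ClientID  string `json:"client_id"`
			Host      string `json:"host"`
		} `json:"source"`
	} `json:"event"`
	Correlation struct {
		SessionID string `json:"session_id"`
		RunID     string `json:"run_id"`
		TraceID   string `json:"trace_id"`
		SpanID    string `json:"span_id"`
	} `json:"correlation"`
	Payload struct {
		Error any `json:"error"`
	} `json:"payload"`
}

var eventTypes = map[string]bool{
	"session.start": true, "session.end": true, "run.start": true, "run.end": true,
	"span.start": true, "span.end": true, "error": true, "metric.snapshot": true,
}

var frameworks = map[string]bool{"opencode": true, "claude-code": true, "other": true}

// ValidateEvent returns nil when raw satisfies the v1 envelope rules, or an
// error that lists every violated rule.
func ValidateEvent(raw []byte) error {
	var h head
	if err := json.Unmarshal(raw, &h); err != nil {
		return fmt.Errorf("malformed JSON or wrong field type: %w", err)
	}
	var errs []error
	check := func(ok bool, msg string) {
		if !ok {
			errs = append(errs, errors.New(msg))
		}
	}
	check(h.Schema.Name == "agentmon.event" && h.Schema.Version == 1, "unsupported schema")
	check(h.Event.ID != "", "event.id is required")
	check(eventTypes[h.Event.Type], "event.type is missing or unknown")
	switch ts := h.Event.TS.(type) {
	case string:
		check(ts != "", "event.ts is required")
	case float64: // unix-ms form; nothing more to check here
	default:
		check(false, "event.ts must be a string or a number")
	}
	check(frameworks[h.Event.Source.Framework], "event.source.framework is missing or unknown")
	check(h.Event.Source.ClientID != "", "event.source.client_id is required")
	check(h.Event.Source.Host != "", "event.source.host is required")

	c := h.Correlation
	switch {
	case strings.HasPrefix(h.Event.Type, "session."):
		check(c.SessionID != "", "session events require correlation.session_id")
	case strings.HasPrefix(h.Event.Type, "run."):
		check(c.SessionID != "" && c.RunID != "", "run events require session_id and run_id")
	case strings.HasPrefix(h.Event.Type, "span."):
		check(c.TraceID != "" && c.SpanID != "", "span events require trace_id and span_id")
	case h.Event.Type == "error":
		check(h.Payload.Error != nil, "error events require payload.error")
	}
	return errors.Join(errs...)
}
```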
+
+### Idempotency
+- Every event has stable `event.id` (UUIDv7 recommended).
+- Storage enforces unique `(event_id)` so retries + replays are safe.
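To illustrate why a unique `event_id` makes retries and replays safe, here is a small in-memory stand-in for the events table. `MemStore` is hypothetical and exists only for the sketch; the `insertSQL` constant shows the statement a SQL-backed store could use instead, with Postgres-style placeholders and a trimmed column list. `ON CONFLICT ... DO NOTHING` is available in Postgres 9.5+ and SQLite 3.24+.

```go
package main

import "sync"

// insertSQL is what a SQL-backed store might execute; a conflicting event_id
// turns the insert into a no-op instead of an error.
const insertSQL = `INSERT INTO events (event_id, ts, type, payload_jsonb)
VALUES ($1, $2, $3, $4)
ON CONFLICT (event_id) DO NOTHING`

// MemStore mimics a table with a unique constraint on event_id.
type MemStore struct {
	mu   sync.Mutex
	rows map[string][]byte
}

func NewMemStore() *MemStore {
	return &MemStore{rows: make(map[string][]byte)}
}

// Insert stores raw under eventID. It reports false, without overwriting,
// when the id was already present, so replaying an event is harmless.
func (s *MemStore) Insert(eventID string, raw []byte) (inserted bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if _, exists := s.rows[eventID]; exists {
		return false
	}
	s.rows[eventID] = raw
	return true
}

// Len returns the number of stored events.
func (s *MemStore) Len() int {
	s.mu.Lock()
	defer s.mu.Unlock()
	return len(s.rows)
}
```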
+
+### Backpressure
+- WS gateway may send `{ "control": { "slow_down_ms": 250 } }`.
+- If overloaded, close WS with a retryable close code; client backs off.
+- Emitters cap in-memory buffers and may optionally spool to disk.
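On the emitter side, the two client-facing pieces of backpressure can be sketched as a parser for the `slow_down_ms` control message and a capped in-memory buffer. The names `SlowDown` and `Buffer` are hypothetical, and dropping the oldest event when the cap is reached is an assumed policy; the design only says buffers are capped.

```go
package main

import (
	"encoding/json"
	"time"
)

type controlMsg struct {
	Control *struct {
		SlowDownMs int64 `json:"slow_down_ms"`
	} `json:"control"`
}

// SlowDown reports the pause requested by a server message, if the message is
// a control message with a positive slow_down_ms.
func SlowDown(msg []byte) (time.Duration, bool) {
	var m controlMsg
	if err := json.Unmarshal(msg, &m); err != nil || m.Control == nil || m.Control.SlowDownMs <= 0 {
		return 0, false
	}
	return time.Duration(m.Control.SlowDownMs) * time.Millisecond, true
}

// Buffer is a FIFO of encoded events with a hard cap. When full, the oldest
// event is discarded and counted in Dropped. Not safe for concurrent use.
type Buffer struct {
	limit   int
	items   [][]byte
	Dropped uint64
}

// NewBuffer returns a buffer holding at most limit events (minimum 1).
func NewBuffer(limit int) *Buffer {
	if limit < 1 {
		limit = 1
	}
	return &Buffer{limit: limit}
}

// Push appends ev, evicting the oldest event if the buffer is full.
func (b *Buffer) Push(ev []byte) {
	if len(b.items) >= b.limit {
		b.items[0] = nil // release the evicted event for GC
		b.items = b.items[1:]
		b.Dropped++
	}
	b.items = append(b.items, ev)
}

// Pop removes and returns the oldest buffered event.
func (b *Buffer) Pop() ([]byte, bool) {
	if len(b.items) == 0 {
		return nil, false
	}
	ev := b.items[0]
	b.items[0] = nil
	b.items = b.items[1:]
	return ev, true
}

// Len returns the number of buffered events.
func (b *Buffer) Len() int { return len(b.items) }
```

`Dropped` maps naturally onto the "dropped events" metric listed under Observability.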
+
+## Service Architecture (microservices)
+
+### v1 recommended services
+1) **ingest-gateway**
+- Exposes WS + HTTP endpoints.
+- Validates schema, assigns arrival timestamps, and publishes to queue.
+- Stateless; horizontal scaling.
+
+2) **event-processor**
+- Consumes from queue.
+- Deduplicates by `event.id` and writes to DB.
+- Builds/updates rollup tables.
+
+3) **query-api**
+- Read-only API used by the UI.
+- Performs filtered queries and serves aggregates.
+
+4) **web-ui**
+- Custom frontend.
+- Talks only to query-api.
+
+5) **retention-job** (CronJob)
+- Optional cleanup/compression policies.
+
+### Queue choice
+- Preferred: **NATS JetStream** (durable stream, durable consumers).
+- Stream: `agentmon_events`
+- Subject pattern: `agentmon.events.v1`
+
+Decision: queue-based to decouple ingest spikes from DB writes and to support reliable WS replay without holding DB transactions open.
+
+## Kubernetes deployment outline
+
+Namespace
+- `agentmon`
+
+Workloads
+- Deployments:
+  - `ingest-gateway` (Service `ingest-gateway`)
+  - `event-processor` (no Service; talks to NATS + DB)
+  - `query-api` (Service `query-api`)
+  - `web-ui` (Service `web-ui`)
+- Stateful:
+  - Postgres (or external managed); PVC via `longhorn`
+  - NATS JetStream (or external); PVC via `longhorn`
+- CronJob:
+  - `retention-job`
+
+Ingress / exposure
+- `web-ui` and `ingest-gateway` exposed via Ingress with Tailnet/LAN allowlisting.
+- `query-api` internal-only (ClusterIP), only reachable from `web-ui`.
+
+NetworkPolicies (default-deny)
+- Allow `web-ui` -> `query-api` TCP 80/443
+- Allow `query-api` -> Postgres TCP 5432
+- Allow `ingest-gateway` -> NATS TCP 4222 (JetStream uses the same client port)
+- Allow `event-processor` -> NATS + Postgres
+- Allow ingress-controller namespace -> `web-ui`/`ingest-gateway` Services
+
+Config/Secrets
+- `DATABASE_URL` for processor/query-api
+- NATS connection URL for ingest/processor
+- (Optional) per-emitter shared secret, if you later decide to add lightweight auth
+
+## Storage
+
+### Postgres (production)
+Suggested tables (minimal):
+- `events`: append-only canonical storage
+  - columns: `event_id` (pk), `ts`, `type`, `session_id`, `run_id`, `trace_id`, `span_id`, `parent_span_id`, `source_framework`, `client_id`, `payload_jsonb`
+  - indexes on `(ts)`, `(session_id)`, `(run_id)`, `(type, ts)`
+- `runs`: derived/rollup
+  - status, start/end ts, token totals, cost totals, error counts
+- `sessions`: derived/rollup
+  - start/end ts, host/user, run counts
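A rough sketch of how the event-processor might derive `runs` rollups from `span.end` events. `SpanEnd` is a flattened, hypothetical view of the fields involved, and treating `status == "error"` as a failure is an assumption, since the design does not enumerate status values. A real processor would update rows incrementally rather than fold a slice.

```go
package main

// SpanEnd is a flattened view of the span.end fields the rollup needs.
type SpanEnd struct {
	RunID        string
	Status       string // assumption: "error" marks a failed span
	InputTokens  int64
	OutputTokens int64
	TotalUSD     float64
}

// RunRollup holds the derived totals for one run.
type RunRollup struct {
	RunID        string
	Spans        int
	InputTokens  int64
	OutputTokens int64
	CostUSD      float64
	Errors       int
}

// RollupRuns folds span.end records into per-run totals keyed by run id.
// Spans without a run id are skipped.
func RollupRuns(spans []SpanEnd) map[string]*RunRollup {
	out := make(map[string]*RunRollup)
	for _, s := range spans {
		if s.RunID == "" {
			continue
		}
		r, ok := out[s.RunID]
		if !ok {
			r = &RunRollup{RunID: s.RunID}
			out[s.RunID] = r
		}
		r.Spans++
		r.InputTokens += s.InputTokens
		r.OutputTokens += s.OutputTokens
		r.CostUSD += s.TotalUSD
		if s.Status == "error" {
			r.Errors++
		}
	}
	return out
}
```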
+
+### SQLite (dev/small)
+- Same logical schema; keep SQL portable.
+- JSON stored as TEXT; use generated columns where supported.
+
+## UI (MVP pages)
+
+1) Overview
+- Total tokens/cost today/7d/30d
+- Latency and error rate trends
+- Top agents/skills by cost and failures
+
+2) Sessions
+- Filter by date, host, agent, framework
+- Drilldown into a session timeline
+
+3) Run detail
+- Waterfall view of spans
+- Per-span tokens/cost/latency
+- Error stack and retry chain
+
+4) Errors
+- Aggregations by error type/source
+- Most common failing skills/tools
+
+5) Agents
+- Leaderboard: cost, tokens/sec, success rate
+- Regression detection (compare last 24h vs baseline)
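One simple reading of "compare last 24h vs baseline" is a relative-change check on a per-agent metric. The sketch below uses cost, where higher is worse. The function names and the threshold parameter are assumptions, and how the baseline window is chosen is left open.

```go
package main

// RelativeChange returns (recent-baseline)/baseline. ok is false when the
// baseline is zero, because the ratio is undefined.
func RelativeChange(baseline, recent float64) (change float64, ok bool) {
	if baseline == 0 {
		return 0, false
	}
	return (recent - baseline) / baseline, true
}

// CostRegressed reports whether cost grew by more than threshold relative to
// the baseline (for example 0.2 for a 20% increase).
func CostRegressed(baseline, recent, threshold float64) bool {
	change, ok := RelativeChange(baseline, recent)
	return ok && change > threshold
}
```

For metrics where lower is worse, such as success rate, the comparison would flip to `change < -threshold`.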
+
+## Security model (no app auth)
+- Expose services only on Tailnet/LAN.
+- Ingress restricted to known source CIDRs / Tailscale ingress.
+- K8s NetworkPolicies:
+  - allow UI -> query-api
+  - allow query-api -> DB
+  - allow ingest-gateway from tailnet ingress only
+  - deny all else by default
+
+## Operational concerns
+
+### Observability
+agentmon should expose its own metrics:
+- ingest rate, queue lag, processor throughput
+- DB write latency
+- dropped events, invalid schema counts
+
+### Testing
+- Contract tests for event schema validation.
+- Ingestion idempotency tests (replay same event).
+- Processor tests for rollups.
+- UI smoke tests for main dashboards.
+
+## Open questions
+- Do we want to support partial OTel mapping (trace/span ids) early?
+- Do we want per-client local disk spool as part of the “official” SDK?
+- How strict should schema validation be (reject unknown fields vs allow)?