feat: scaffold agentmon services and k8s deploy
Adds Go microservices (ingest-gateway, event-processor, query-api, web-ui), NATS+Postgres wiring, initial schema/init job, ingress manifests for LAN+tailnet, and a multi-arch image build script.
@@ -0,0 +1,254 @@
# agentmon — design (2026-01-16)

## Goals

agentmon is a self-hosted (homelab K8s) telemetry/analytics system for local agent runs (OpenCode + Claude Code). It captures structured events/spans and produces a custom web UI focused on: token usage, cost, latency, errors, and efficiency across sessions, agents, skills/commands, and models.

Primary requirements
- Collect telemetry from OpenCode and Claude Code on local machines.
- Ship telemetry to cluster over Tailnet + LAN.
- Ingestion supports **WebSocket + HTTP**.
- UI is a **custom web UI** (not Grafana).
- Storage: SQLite for dev/small; Postgres for production.
- Retention: keep forever by default; allow optional cleanup tooling.
- No app-level auth (trusted network); enforce access with network-layer controls.

Non-goals
- Distributed tracing compatibility (OTel) in v1; we may map later.
- “Perfect” cost computation across all providers in v1; we store enough fields to recompute.

## Event Model

### Core concepts
- **Session**: a human-initiated interactive timeframe (e.g., one terminal session).
- **Run**: a single user invocation that triggers an agent workflow (a command, task, or skill execution).
- **Trace**: a logical end-to-end chain of work (often equal to a run, but can span more than one).
- **Span**: a timed unit of work (tool call, model call, indexing job, etc.).
- **Event**: an append-only record (span start/end, run start/end, error, metric snapshot).

### Envelope (common to all events)
All events are JSON objects. Fields marked **required** are required for every event type.

- `schema` (**required**): `{ name: "agentmon.event", version: 1 }`
- `event` (**required**):
  - `id` (**required**): stable event UUID (UUIDv7 recommended)
  - `type` (**required**): one of:
    - `session.start`, `session.end`
    - `run.start`, `run.end`
    - `span.start`, `span.end`
    - `error`
    - `metric.snapshot`
  - `ts` (**required**): event timestamp (RFC3339 or unix-ms; choose one and standardize in SDK)
  - `seq` (optional): monotonic sequence per connection (enables gap detection + ACKing)
- `source` (**required**):
  - `framework` (**required**): `opencode` | `claude-code` | `other`
  - `client_id` (**required**): stable id for the emitter install (per-machine is fine)
  - `host` (**required**): hostname
  - `user` (optional): local username
  - `version` (optional): emitter/SDK version
- `correlation` (optional but strongly recommended):
  - `session_id` (recommended)
  - `run_id` (recommended)
  - `trace_id` (recommended)
  - `span_id` (required for span events)
  - `parent_span_id` (optional)
- `attributes` (optional): freeform map for tags (agent name, skill, tool, model, repo, branch, etc.)
- `payload` (optional): event-type-specific object (see below)

### Event types (minimum v1)

#### `session.start` / `session.end`
- Required: `correlation.session_id`
- Recommended `attributes`: `framework_session` (native id if exists), `cwd`, `repo`, `branch`

#### `run.start` / `run.end`
- Required: `correlation.session_id`, `correlation.run_id`
- Recommended `attributes`: `command`, `agent`, `workflow`, `prompt_hash?`
- `run.end` may include aggregate `payload.usage` (token totals, cost totals, status)

#### `span.start` / `span.end`
- Required: `correlation.trace_id`, `correlation.span_id`
- Recommended `attributes`: `span_kind` (`llm`|`tool`|`skill`|`internal`), `name`
- `span.start` `payload`: `{ start_ts }` (or omit if you trust `event.ts`)
- `span.end` `payload`: `{ end_ts, status, duration_ms? }` plus optional `llm` and `error` objects

#### `error`
- Required: `payload.error`
- Recommended: attach correlation ids when known (session/run/trace/span)

#### `metric.snapshot`
- Used for periodic gauges (queue lag, emitter buffer size, etc.)
- `payload.metrics`: map of numeric values

### LLM usage fields (when applicable)
Stored on `span.end` (and optionally on `run.end` aggregates):
- `llm.model`: provider/model string
- `llm.usage`: `{ input_tokens, output_tokens, cache_write_tokens?, cache_read_tokens? }`
- `llm.cost`: `{ input_usd?, output_usd?, total_usd? }`
- `llm.finish_reason?`
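
Because the raw token counts are stored alongside any reported cost, cost can be recomputed later from a price table. The following is a minimal sketch of that idea; the `Price` type and the per-million-token rates in `main` are made-up illustrations, not real provider pricing, and cache tokens are ignored for brevity.

```go
package main

import "fmt"

// Usage mirrors the llm.usage fields from the design.
type Usage struct {
	InputTokens      int64 `json:"input_tokens"`
	OutputTokens     int64 `json:"output_tokens"`
	CacheWriteTokens int64 `json:"cache_write_tokens,omitempty"`
	CacheReadTokens  int64 `json:"cache_read_tokens,omitempty"`
}

// Cost mirrors the llm.cost fields.
type Cost struct {
	InputUSD  float64 `json:"input_usd"`
	OutputUSD float64 `json:"output_usd"`
	TotalUSD  float64 `json:"total_usd"`
}

// Price is a hypothetical rate card entry in USD per one million tokens.
type Price struct {
	InputPerMTok  float64
	OutputPerMTok float64
}

// Recompute derives cost from stored token counts. Cache tokens are not
// priced in this sketch.
func Recompute(u Usage, p Price) Cost {
	in := float64(u.InputTokens) / 1e6 * p.InputPerMTok
	out := float64(u.OutputTokens) / 1e6 * p.OutputPerMTok
	return Cost{InputUSD: in, OutputUSD: out, TotalUSD: in + out}
}

func main() {
	u := Usage{InputTokens: 2_000_000, OutputTokens: 500_000}
	c := Recompute(u, Price{InputPerMTok: 3, OutputPerMTok: 15}) // illustrative rates
	fmt.Printf("input=$%.2f output=$%.2f total=$%.2f\n", c.InputUSD, c.OutputUSD, c.TotalUSD)
}
```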

### Errors
- `error`: `{ type, message, code?, retryable?, source? }`
- Errors can be standalone `error` events or embedded on `span.end`/`run.end`.

## Ingestion

### WebSocket (primary/live)
- Endpoint: `GET /v1/ws`
- Client sends one JSON event per WS message (simplest) or small batches.
- Server acks in-order sequences:
  - Client includes `event.seq` (monotonic per connection)
  - Server responds: `{ "ack": { "up_to_seq": N } }`
- Reconnect behavior:
  - Client reconnects with `?client_id=...`
  - Client replays any unacked events (idempotent by `event.id`)
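
The ack and replay rules above can be sketched as a small client-side outbox. This is an illustration only: the `Outbox` type is hypothetical, and renumbering pending events from 1 on reconnect is an assumption drawn from "monotonic per connection". Event ids stay unchanged so the server can deduplicate replays.

```go
package main

import "fmt"

// Pending is an event that has been sent but not yet acknowledged.
type Pending struct {
	Seq     uint64
	EventID string
}

// Outbox tracks unacked events for one emitter (hypothetical helper).
type Outbox struct {
	nextSeq uint64
	pending []Pending
}

func NewOutbox() *Outbox { return &Outbox{nextSeq: 1} }

// Send assigns the next per-connection sequence number and remembers the event.
func (o *Outbox) Send(eventID string) Pending {
	p := Pending{Seq: o.nextSeq, EventID: eventID}
	o.nextSeq++
	o.pending = append(o.pending, p)
	return p
}

// Ack handles { "ack": { "up_to_seq": N } } by dropping everything up to N.
func (o *Outbox) Ack(upToSeq uint64) {
	i := 0
	for i < len(o.pending) && o.pending[i].Seq <= upToSeq {
		i++
	}
	o.pending = o.pending[i:]
}

// Reconnect renumbers the unacked events for a fresh connection and returns
// a copy of them for replay.
func (o *Outbox) Reconnect() []Pending {
	o.nextSeq = 1
	for i := range o.pending {
		o.pending[i].Seq = o.nextSeq
		o.nextSeq++
	}
	return append([]Pending(nil), o.pending...)
}

func main() {
	o := NewOutbox()
	for _, id := range []string{"a", "b", "c"} {
		o.Send(id)
	}
	o.Ack(2) // server confirmed seq 1 and 2
	for _, p := range o.Reconnect() {
		fmt.Printf("replay seq=%d id=%s\n", p.Seq, p.EventID)
	}
}
```

After the ack for sequence 2, only event `c` remains pending, so it is the only one replayed on the new connection.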

### HTTP (batch/backfill)
- Endpoint: `POST /v1/events`
- Body: JSON array of events.
- Response: `{ accepted: <n>, rejected: <n>, errors?: [...] }`
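
A batch handler for this endpoint might look roughly like the sketch below, written against the standard library only. The handler name, the use of plain strings for `errors`, and the single "has `event.id`" check standing in for full validation are all assumptions made for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// BatchResult mirrors the documented response; the element type of Errors
// is an assumption since the design leaves it open.
type BatchResult struct {
	Accepted int      `json:"accepted"`
	Rejected int      `json:"rejected"`
	Errors   []string `json:"errors,omitempty"`
}

// handleEvents accepts a JSON array of events and counts accepted/rejected.
// The only check here is that event.id is present.
func handleEvents(w http.ResponseWriter, r *http.Request) {
	if r.Method != http.MethodPost {
		http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
		return
	}
	var batch []map[string]any
	if err := json.NewDecoder(r.Body).Decode(&batch); err != nil {
		http.Error(w, "body must be a JSON array of events", http.StatusBadRequest)
		return
	}
	var res BatchResult
	for i, ev := range batch {
		meta, _ := ev["event"].(map[string]any)
		if id, _ := meta["id"].(string); id == "" {
			res.Rejected++
			res.Errors = append(res.Errors, fmt.Sprintf("event %d: missing event.id", i))
			continue
		}
		res.Accepted++ // a real gateway would publish to the queue here
	}
	w.Header().Set("Content-Type", "application/json")
	json.NewEncoder(w).Encode(res)
}

// post exercises the handler in-process and decodes its response.
func post(body string) BatchResult {
	req := httptest.NewRequest(http.MethodPost, "/v1/events", strings.NewReader(body))
	rec := httptest.NewRecorder()
	handleEvents(rec, req)
	var res BatchResult
	json.NewDecoder(rec.Body).Decode(&res)
	return res
}

func main() {
	res := post(`[{"event":{"id":"e1"}},{"event":{}}]`)
	fmt.Printf("accepted=%d rejected=%d errors=%v\n", res.Accepted, res.Rejected, res.Errors)
}
```

Rejecting per event rather than failing the whole batch keeps backfills moving when a few records are malformed.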

### Validation
- Gateway validates `schema.name/version`, required envelope fields, and basic typing.
- Non-strict mode (v1): allow unknown fields, store raw `payload`/`attributes`.
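
As a sketch of non-strict validation, the function below decodes only the fields it needs to check and lets `encoding/json` ignore everything else. It assumes `ts` has been standardized on RFC3339 and covers the envelope and correlation rules only; `payload` checks such as `payload.error` are left out. The `Validate` name and error messages are illustrative.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"strings"
	"time"
)

// envelope lists only the fields that are checked; unknown fields are
// ignored by json.Unmarshal, which gives the non-strict behavior.
type envelope struct {
	Schema struct {
		Name    string `json:"name"`
		Version int    `json:"version"`
	} `json:"schema"`
	Event struct {
		ID   string `json:"id"`
		Type string `json:"type"`
		TS   string `json:"ts"`
	} `json:"event"`
	Source struct {
		Framework string `json:"framework"`
		ClientID  string `json:"client_id"`
		Host      string `json:"host"`
	} `json:"source"`
	Correlation struct {
		SessionID string `json:"session_id"`
		RunID     string `json:"run_id"`
		TraceID   string `json:"trace_id"`
		SpanID    string `json:"span_id"`
	} `json:"correlation"`
}

var eventTypes = map[string]bool{
	"session.start": true, "session.end": true, "run.start": true, "run.end": true,
	"span.start": true, "span.end": true, "error": true, "metric.snapshot": true,
}

var frameworks = map[string]bool{"opencode": true, "claude-code": true, "other": true}

// Validate returns nil when the raw event satisfies the envelope rules.
func Validate(raw []byte) error {
	var e envelope
	if err := json.Unmarshal(raw, &e); err != nil {
		return fmt.Errorf("invalid JSON or field type: %w", err)
	}
	c, typ := e.Correlation, e.Event.Type
	checks := []struct {
		ok  bool
		msg string
	}{
		{e.Schema.Name == "agentmon.event" && e.Schema.Version == 1, "unsupported schema"},
		{e.Event.ID != "", "event.id is required"},
		{eventTypes[typ], "unknown event.type"},
		{frameworks[e.Source.Framework], "unknown source.framework"},
		{e.Source.ClientID != "" && e.Source.Host != "", "source.client_id and source.host are required"},
		{!strings.HasPrefix(typ, "session.") || c.SessionID != "", "session events need session_id"},
		{!strings.HasPrefix(typ, "run.") || (c.SessionID != "" && c.RunID != ""), "run events need session_id and run_id"},
		{!strings.HasPrefix(typ, "span.") || (c.TraceID != "" && c.SpanID != ""), "span events need trace_id and span_id"},
	}
	for _, ch := range checks {
		if !ch.ok {
			return errors.New(ch.msg)
		}
	}
	if _, err := time.Parse(time.RFC3339, e.Event.TS); err != nil {
		return fmt.Errorf("event.ts must be RFC3339: %w", err)
	}
	return nil
}

func main() {
	ok := `{"schema":{"name":"agentmon.event","version":1},"event":{"id":"e1","type":"span.start","ts":"2026-01-16T12:00:00Z"},"source":{"framework":"opencode","client_id":"c1","host":"devbox"},"correlation":{"trace_id":"t1","span_id":"sp1"},"extra":"ignored"}`
	fmt.Println(Validate([]byte(ok)))
	fmt.Println(Validate([]byte(`{"schema":{"name":"agentmon.event","version":2}}`)))
}
```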

### Idempotency
- Every event has stable `event.id` (UUIDv7 recommended).
- Storage enforces unique `(event_id)` so retries + replays are safe.
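
The effect of the unique constraint can be illustrated with an in-memory stand-in for the `events` table. The `Store` type here is hypothetical; with Postgres, the same behavior is typically obtained with `INSERT ... ON CONFLICT (event_id) DO NOTHING` against the primary key.

```go
package main

import (
	"fmt"
	"sync"
)

// Store is an in-memory stand-in for a table with a unique event_id.
type Store struct {
	mu   sync.Mutex
	seen map[string]struct{}
}

func NewStore() *Store { return &Store{seen: map[string]struct{}{}} }

// Insert reports true if the event was stored and false if it was a
// duplicate, so callers can treat replays as harmless no-ops.
func (s *Store) Insert(eventID string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if _, dup := s.seen[eventID]; dup {
		return false
	}
	s.seen[eventID] = struct{}{}
	return true
}

// Len returns the number of distinct events stored.
func (s *Store) Len() int {
	s.mu.Lock()
	defer s.mu.Unlock()
	return len(s.seen)
}

func main() {
	s := NewStore()
	for _, id := range []string{"e1", "e2", "e1"} { // e1 is replayed
		fmt.Printf("%s inserted=%v\n", id, s.Insert(id))
	}
	fmt.Println("rows:", s.Len())
}
```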

### Backpressure
- WS gateway may send `{ "control": { "slow_down_ms": 250 } }`.
- If overloaded, close WS with a retryable close code; client backs off.
- Emitters cap in-memory buffers and may optionally spool to disk.

## Service Architecture (microservices)

### v1 recommended services
1) **ingest-gateway**
   - Exposes WS + HTTP endpoints.
   - Validates schema, assigns arrival timestamps, and publishes to queue.
   - Stateless; horizontal scaling.

2) **event-processor**
   - Consumes from queue.
   - Deduplicates by `event.id` and writes to DB.
   - Builds/updates rollup tables.

3) **query-api**
   - Read-only API used by the UI.
   - Performs filtered queries and serves aggregates.

4) **web-ui**
   - Custom frontend.
   - Talks only to query-api.

5) **retention-job** (CronJob)
   - Optional cleanup/compression policies.

### Queue choice
- Preferred: **NATS JetStream** (durable stream, consumer groups).
- Stream: `agentmon_events`
- Subject pattern: `agentmon.events.v1`

Decision: queue-based to decouple ingest spikes from DB writes and to support reliable WS replay without holding DB transactions open.

## Kubernetes deployment outline

Namespace
- `agentmon`

Workloads
- Deployments:
  - `ingest-gateway` (Service `ingest-gateway`)
  - `event-processor` (no Service; talks to NATS + DB)
  - `query-api` (Service `query-api`)
  - `web-ui` (Service `web-ui`)
- Stateful:
  - Postgres (or external managed); PVC via `longhorn`
  - NATS JetStream (or external); PVC via `longhorn`
- CronJob:
  - `retention-job`

Ingress / exposure
- `web-ui` and `ingest-gateway` exposed via Ingress with Tailnet/LAN allowlisting.
- `query-api` internal-only (ClusterIP), only reachable from `web-ui`.

NetworkPolicies (default-deny)
- Allow `web-ui` -> `query-api` TCP 80/443
- Allow `query-api` -> Postgres TCP 5432
- Allow `ingest-gateway` -> NATS TCP 4222 (and JetStream as needed)
- Allow `event-processor` -> NATS + Postgres
- Allow ingress-controller namespace -> `web-ui`/`ingest-gateway` Services

Config/Secrets
- `DATABASE_URL` for processor/query-api
- NATS connection URL for ingest/processor
- (Optional) per-emitter shared secret, if you later decide to add lightweight auth

## Storage

### Postgres (production)
Suggested tables (minimal):
- `events`: append-only canonical storage
  - columns: `event_id` (pk), `ts`, `type`, `session_id`, `run_id`, `trace_id`, `span_id`, `parent_span_id`, `source_framework`, `client_id`, `payload_jsonb`
  - indexes on `(ts)`, `(session_id)`, `(run_id)`, `(type, ts)`
- `runs`: derived/rollup
  - status, start/end ts, token totals, cost totals, error counts
- `sessions`: derived/rollup
  - start/end ts, host/user, run counts

### SQLite (dev/small)
- Same logical schema; keep SQL portable.
- JSON stored as TEXT; use generated columns where supported.

## UI (MVP pages)

1) Overview
   - Total tokens/cost today/7d/30d
   - Latency and error rate trends
   - Top agents/skills by cost and failures

2) Sessions
   - Filter by date, host, agent, framework
   - Drilldown into a session timeline

3) Run detail
   - Waterfall view of spans
   - Per-span tokens/cost/latency
   - Error stack and retry chain

4) Errors
   - Aggregations by error type/source
   - Most common failing skills/tools

5) Agents
   - Leaderboard: cost, tokens/sec, success rate
   - Regression detection (compare last 24h vs baseline)
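
The design does not say how regressions are detected. One simple possibility, sketched here purely as an assumption, is to compare the success rate of the last 24h window against a baseline window and flag an agent when the drop exceeds a threshold. The `Window` type, the `Regressed` function, and the 5-point threshold are all illustrative.

```go
package main

import "fmt"

// Window summarizes runs for one agent over a time range.
type Window struct {
	Runs      int
	Succeeded int
}

// SuccessRate returns the fraction of successful runs, or 0 for an empty window.
func (w Window) SuccessRate() float64 {
	if w.Runs == 0 {
		return 0
	}
	return float64(w.Succeeded) / float64(w.Runs)
}

// Regressed reports whether the recent success rate fell more than maxDrop
// (an absolute fraction, e.g. 0.05) below the baseline. Empty windows are
// never flagged because there is nothing to compare.
func Regressed(baseline, recent Window, maxDrop float64) bool {
	if baseline.Runs == 0 || recent.Runs == 0 {
		return false
	}
	return baseline.SuccessRate()-recent.SuccessRate() > maxDrop
}

func main() {
	baseline := Window{Runs: 200, Succeeded: 190} // 95% over the baseline period
	last24h := Window{Runs: 20, Succeeded: 16}    // 80% in the last 24h
	fmt.Println("regressed:", Regressed(baseline, last24h, 0.05))
}
```

The same comparison could be applied to cost or latency; success rate is used here only because it appears on the leaderboard.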

## Security model (no app auth)
- Expose services only on Tailnet/LAN.
- Ingress restricted to known source CIDRs / tailscale ingress.
- K8s NetworkPolicies:
  - allow UI -> query-api
  - allow query-api -> DB
  - allow ingest-gateway from tailnet ingress only
  - deny all else by default

## Operational concerns

### Observability
agentmon should expose its own metrics:
- ingest rate, queue lag, processor throughput
- DB write latency
- dropped events, invalid schema counts

### Testing
- Contract tests for event schema validation.
- Ingestion idempotency tests (replay same event).
- Processor tests for rollups.
- UI smoke tests for main dashboards.

## Open questions
- Do we want to support partial OTel mapping (trace/span ids) early?
- Do we want per-client local disk spool as part of the “official” SDK?
- How strict should schema validation be (reject unknown fields vs allow)?