diff --git a/docs/plans/2026-03-18-swarm-monitor-design.md b/docs/plans/2026-03-18-swarm-monitor-design.md
new file mode 100644
index 0000000..710d612
--- /dev/null
+++ b/docs/plans/2026-03-18-swarm-monitor-design.md
@@ -0,0 +1,164 @@
+# Swarm Monitor — Design
+
+**Date:** 2026-03-18
+**Goal:** Monitor docker-compose services in `~/lab/swarm` as part of the agents infrastructure. Add a `swarm-monitor` binary, dashboard strip, and replace the `/openclaw` page with a unified `/infrastructure` page showing both VMs and swarm services.
+
+---
+
+## Architecture
+
+Follows the `openclaw-monitor` pattern exactly. A new `swarm-monitor` binary polls every 30s, collects Docker + HTTP data, and publishes events to NATS. The existing event-processor → postgres → query-api pipeline requires no changes.
+
+**New components:**
+- `cmd/swarm-monitor/main.go` — polling loop, event emission
+- `internal/monitor/swarm/types.go` — data model
+- `internal/monitor/swarm/collector.go` — Docker + HTTP collection
+
+**Existing components touched:**
+- `cmd/web-ui/static/app.js` — infrastructure page, swarm strip on dashboard, rename openclaw → infrastructure
+- `cmd/web-ui/static/style.css` — infrastructure page styles
+- `cmd/web-ui/static/index.html` — update nav link
+- `~/lab/swarm/docker-compose.yaml` — add `agentmon.*` labels to services
+
+---
+
+## Docker Labels
+
+Services are tagged via Docker labels so the monitor is opt-in and self-describing:
+
+```yaml
+labels:
+  agentmon.monitor: "true"
+  agentmon.role: "llm-proxy"   # drives collection strategy + UI card
+```
+
+**Defined roles:**
+
+| Role | Services | HTTP probe | Extra data |
+|------|----------|-----------|------------|
+| `llm-proxy` | litellm | `GET /health/liveliness` + `GET /v1/models` | model count, cooldown count |
+| `db` | litellm-db | none | Docker health only |
+| `search` | searxng | `GET /` | response time ms |
+| `mcp` | brave-search | port reachability | — |
+| `voice` | whisper-server, kokoro-tts | Docker healthcheck | — |
+| `automation` | n8n-agent | Docker healthcheck | — |
+
+The collector filters containers by `agentmon.monitor=true` via the Docker API, then dispatches to the role-specific probe strategy.
+
+---
+
+## Data Model
+
+```go
+type ServiceSnapshot struct {
+    Name           string         `json:"name"`
+    Role           string         `json:"role"`
+    ContainerState string         `json:"container_state"` // running/stopped/exited/missing
+    HealthState    string         `json:"health_state"`    // healthy/unhealthy/starting/none
+    Status         string         `json:"status"`          // healthy/degraded/down
+    UptimeSec      int64          `json:"uptime_sec,omitempty"`
+    HTTPStatus     *int           `json:"http_status,omitempty"`
+    Extra          map[string]any `json:"extra,omitempty"`
+}
+
+type SwarmSnapshot struct {
+    Services  []ServiceSnapshot `json:"services"`
+    Issues    Issues            `json:"issues"`
+    Timestamp time.Time         `json:"timestamp"`
+}
+
+type Issues struct {
+    ServiceDown     []string `json:"service_down,omitempty"`
+    ServiceDegraded []string `json:"service_degraded,omitempty"`
+    LLMCooldowns    bool     `json:"llm_cooldowns,omitempty"`
+}
+```
+
+**Status derivation:**
+- `down` — container not running or missing
+- `degraded` — running but HTTP probe failed or Docker healthcheck returns `unhealthy`
+- `healthy` — running + all probes pass
+
+**LiteLLM `extra`:** `{"model_count": 12, "cooldown_count": 0}`
+**Search `extra`:** `{"response_ms": 45}`
+
+---
+
+## Events
+
+Two event types emitted per poll:
+
+**`swarm.snapshot`** — all services bundled, used by dashboard strip and quick status:
+```json
+{
+  "schema": {"name": "agentmon.swarm", "version": 1},
+  "event": {"id": "...", "type": "swarm.snapshot", "ts": "..."},
+  "payload": {
+    "services": [...],
+    "issues": {"service_down": [], "llm_cooldowns": false}
+  }
+}
+```
+
+**`swarm.service.snapshot`** — one per service, used by infrastructure page cards for per-service history:
+```json
+{
+  "schema": {"name": "agentmon.swarm.service", "version": 1},
+  "event": {"id": "...", "type": "swarm.service.snapshot", "ts": "..."},
+  "payload": {
+    "service": { "name": "litellm", "role": "llm-proxy", "status": "healthy", ... }
+  }
+}
+```
+
+---
+
+## Frontend
+
+### Dashboard
+
+- Existing VM strip (zap/orb/sun pills) stays unchanged
+- New **swarm strip** below it, driven by latest `swarm.snapshot` event
+- One pill per service: green=healthy, amber=degraded, red=down
+- Same pill component style as VM strip
+
+### Navigation
+
+`OpenClaw` → `Infra`. Route `/openclaw` → `/infrastructure`.
+
+### Infrastructure Page
+
+Two sections stacked vertically:
+
+```
+VMs       [ zap card ] [ orb card ] [ sun card ]   (existing openclaw cards)
+Services  [ litellm ] [ litellm-db ] [ searxng ] [ brave ] [ whisper ] [ kokoro ] [ n8n ]
+```
+
+### Role-Driven Card Layouts
+
+| Role | Card content |
+|------|-------------|
+| `llm-proxy` | Status badge · model count · cooldown warning banner (if > 0) · HTTP health |
+| `db` | Status badge · uptime · Docker health dot |
+| `search` | Status badge · response time badge |
+| `mcp` | Status badge · port reachability dot |
+| `voice` | Status badge · Docker healthcheck state |
+| `automation` | Status badge · Docker healthcheck state |
+
+Cards update live via WebSocket `swarm.service.snapshot` events.
+
+---
+
+## Environment / Config
+
+```
+NATS_URL          nats://nats:4222
+NATS_TOPIC        agentmon.events.v1
+POLL_INTERVAL     30s
+DOCKER_HOST       unix:///var/run/docker.sock
+LITELLM_BASE_URL  http://localhost:18804
+LITELLM_API_KEY   (from env)
+```
+
+`swarm-monitor` runs on the host (same as openclaw-monitor), with access to the Docker socket.