docs: swarm monitor design — infra page, docker labels, role-driven cards

2026-03-18 09:53:39 -07:00
parent e7be607db4
commit ecabc7fd19
1 changed files with 164 additions and 0 deletions
@@ -0,0 +1,164 @@
 # Swarm Monitor — Design
 **Date:** 2026-03-18
 **Goal:** Monitor docker-compose services in `~/lab/swarm` as part of the agents infrastructure. Add a `swarm-monitor` binary and a dashboard strip, and replace the `/openclaw` page with a unified `/infrastructure` page showing both VMs and swarm services.
 ---
 ## Architecture
 Follows the `openclaw-monitor` pattern exactly. A new `swarm-monitor` binary polls every 30s, collects Docker + HTTP data, and publishes events to NATS. The existing event-processor → postgres → query-api pipeline requires no changes.
 **New components:**
 - `cmd/swarm-monitor/main.go` — polling loop, event emission
 - `internal/monitor/swarm/types.go` — data model
 - `internal/monitor/swarm/collector.go` — Docker + HTTP collection
 **Existing components touched:**
 - `cmd/web-ui/static/app.js` — infrastructure page, swarm strip on dashboard, rename openclaw → infrastructure
 - `cmd/web-ui/static/style.css` — infrastructure page styles
 - `cmd/web-ui/static/index.html` — update nav link
 - `~/lab/swarm/docker-compose.yaml` — add `agentmon.*` labels to services
 ---
 ## Docker Labels
 Services are tagged via Docker labels so the monitor is opt-in and self-describing:
 ```yaml
 labels:
  agentmon.monitor: "true"
  agentmon.role: "llm-proxy"   # drives collection strategy + UI card
 ```
 **Defined roles:**
 | Role | Services | Probe | Extra data |
 |------|----------|-------|------------|
 | `llm-proxy` | litellm | HTTP: `GET /health/liveliness` + `GET /v1/models` | model count, cooldown count |
 | `db` | litellm-db | Docker healthcheck only | none |
 | `search` | searxng | HTTP: `GET /` | response time (ms) |
 | `mcp` | brave-search | port reachability | none |
 | `voice` | whisper-server, kokoro-tts | Docker healthcheck only | none |
 | `automation` | n8n-agent | Docker healthcheck only | none |
 The collector filters containers by `agentmon.monitor=true` via the Docker API, then dispatches to the role-specific probe strategy.
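The filter-then-dispatch step can be sketched as follows. This is an illustration under assumptions: `Container`, `ProbeFunc`, `selectMonitored`, and `probeFor` are hypothetical names, and the filter here runs client-side over an already-fetched list, whereas the design has the Docker API apply the label filter.

```go
package main

import "fmt"

// Container is a hypothetical, trimmed-down view of one container as
// returned by the Docker API: its name and its label set.
type Container struct {
	Name   string
	Labels map[string]string
}

// ProbeFunc is a hypothetical per-role probe. It returns the role-specific
// extra data, for example model_count for llm-proxy.
type ProbeFunc func(c Container) (map[string]any, error)

// selectMonitored keeps only containers that opted in via agentmon.monitor.
func selectMonitored(all []Container) []Container {
	var out []Container
	for _, c := range all {
		if c.Labels["agentmon.monitor"] == "true" {
			out = append(out, c)
		}
	}
	return out
}

// probeFor looks up the probe registered for the container's agentmon.role.
// An unknown or missing role is reported as an error instead of being
// silently ignored.
func probeFor(c Container, probes map[string]ProbeFunc) (ProbeFunc, error) {
	role := c.Labels["agentmon.role"]
	p, ok := probes[role]
	if !ok {
		return nil, fmt.Errorf("container %s: no probe for role %q", c.Name, role)
	}
	return p, nil
}
```

Keeping the role-to-probe mapping in a map means a new role only needs a new entry, not a change to the collection loop.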
 ---
 ## Data Model
 ```go
 type ServiceSnapshot struct {
    Name           string         `json:"name"`
    Role           string         `json:"role"`
    ContainerState string         `json:"container_state"` // running/stopped/exited/missing
    HealthState    string         `json:"health_state"`    // healthy/unhealthy/starting/none
    Status         string         `json:"status"`          // healthy/degraded/down
    UptimeSec      int64          `json:"uptime_sec,omitempty"`
    HTTPStatus     *int           `json:"http_status,omitempty"`
    Extra          map[string]any `json:"extra,omitempty"`
 }
 type SwarmSnapshot struct {
    Services  []ServiceSnapshot `json:"services"`
    Issues    Issues            `json:"issues"`
    Timestamp time.Time         `json:"timestamp"`
 }
 type Issues struct {
    ServiceDown     []string `json:"service_down,omitempty"`
    ServiceDegraded []string `json:"service_degraded,omitempty"`
    LLMCooldowns    bool     `json:"llm_cooldowns,omitempty"`
 }
 ```
 **Status derivation:**
 - `down` — container not running or missing
 - `degraded` — running but HTTP probe failed or Docker healthcheck returns `unhealthy`
 - `healthy` — running + all probes pass
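The derivation rules above can be sketched as one function. The name `deriveStatus` and its signature are hypothetical. The rules do not say how a `starting` health state should be treated, so this sketch assumes it does not degrade a service.

```go
package main

// deriveStatus maps the container state, the Docker health state, and the
// outcome of the role's probe to healthy/degraded/down.
// probeOK should be true for roles that have no probe of their own.
func deriveStatus(containerState, healthState string, probeOK bool) string {
	// Anything other than a running container (stopped, exited, missing) is down.
	if containerState != "running" {
		return "down"
	}
	// Running, but a probe failed or Docker reports the container unhealthy.
	if healthState == "unhealthy" || !probeOK {
		return "degraded"
	}
	// Assumption: "starting" and "none" health states count as healthy here.
	return "healthy"
}
```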
 **LiteLLM `extra`:** `{"model_count": 12, "cooldown_count": 0}`
 **Search `extra`:** `{"response_ms": 45}`
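One way the `Issues` summary could be built from the per-service snapshots is sketched below. The types are trimmed copies of the data model (JSON tags omitted), `summarize` is a hypothetical name, and the rule that `llm_cooldowns` is set whenever an `llm-proxy` service reports `cooldown_count > 0` is an assumption; the design does not state it explicitly.

```go
package main

// ServiceSnapshot is a trimmed copy of the data model, keeping only the
// fields the summary needs.
type ServiceSnapshot struct {
	Name   string
	Role   string
	Status string
	Extra  map[string]any
}

// Issues mirrors the data model's Issues struct.
type Issues struct {
	ServiceDown     []string
	ServiceDegraded []string
	LLMCooldowns    bool
}

// summarize collects the names of down and degraded services and flags
// LLM cooldowns.
func summarize(services []ServiceSnapshot) Issues {
	var iss Issues
	for _, s := range services {
		switch s.Status {
		case "down":
			iss.ServiceDown = append(iss.ServiceDown, s.Name)
		case "degraded":
			iss.ServiceDegraded = append(iss.ServiceDegraded, s.Name)
		}
		// Assumption: cooldown_count is stored as an int by the collector.
		// A value decoded from JSON would arrive as float64 instead.
		if s.Role == "llm-proxy" {
			if n, ok := s.Extra["cooldown_count"].(int); ok && n > 0 {
				iss.LLMCooldowns = true
			}
		}
	}
	return iss
}
```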
 ---
 ## Events
 Two event types are emitted on each poll:
 **`swarm.snapshot`** — all services bundled, used by dashboard strip and quick status:
 ```json
 {
  "schema": {"name": "agentmon.swarm", "version": 1},
  "event": {"id": "...", "type": "swarm.snapshot", "ts": "..."},
  "payload": {
    "services": [...],
    "issues": {"service_down": [], "llm_cooldowns": false}
  }
 }
 ```
 **`swarm.service.snapshot`** — one per service, used by infrastructure page cards for per-service history:
 ```json
 {
  "schema": {"name": "agentmon.swarm.service", "version": 1},
  "event": {"id": "...", "type": "swarm.service.snapshot", "ts": "..."},
  "payload": {
    "service": { "name": "litellm", "role": "llm-proxy", "status": "healthy", ... }
  }
 }
 ```
 ---
 ## Frontend
 ### Dashboard
 - Existing VM strip (zap/orb/sun pills) stays unchanged
 - New **swarm strip** below it, driven by latest `swarm.snapshot` event
 - One pill per service: green=healthy, amber=degraded, red=down
 - Same pill component style as VM strip
 ### Navigation
 The nav link label changes from `OpenClaw` to `Infra`, and the route changes from `/openclaw` to `/infrastructure`.
 ### Infrastructure Page
 Two sections stacked vertically:
 ```
 VMs        [ zap card ]  [ orb card ]  [ sun card ]   (existing openclaw cards)
 Services   [ litellm ]  [ litellm-db ]  [ searxng ]  [ brave ]  [ whisper ]  [ kokoro ]  [ n8n ]
 ```
 ### Role-Driven Card Layouts
 | Role | Card content |
 |------|-------------|
 | `llm-proxy` | Status badge · model count · cooldown warning banner (if cooldown count > 0) · HTTP health |
 | `db` | Status badge · uptime · Docker health dot |
 | `search` | Status badge · response time badge |
 | `mcp` | Status badge · port reachability dot |
 | `voice` | Status badge · Docker healthcheck state |
 | `automation` | Status badge · Docker healthcheck state |
 Cards update live via WebSocket `swarm.service.snapshot` events.
 ---
 ## Environment / Config
 ```
 NATS_URL          nats://nats:4222
 NATS_TOPIC        agentmon.events.v1
 POLL_INTERVAL     30s
 DOCKER_HOST       unix:///var/run/docker.sock
 LITELLM_BASE_URL  http://localhost:18804
 LITELLM_API_KEY   (from env)
 ```
 `swarm-monitor` runs on the host (same as `openclaw-monitor`), with access to the Docker socket.