Swarm Monitor — Design

Date: 2026-03-18

Goal: Monitor docker-compose services in ~/lab/swarm as part of the agents infrastructure. Add a swarm-monitor binary and a dashboard strip, and replace the /openclaw page with a unified /infrastructure page showing both VMs and swarm services.

Architecture

Follows the openclaw-monitor pattern exactly. A new swarm-monitor binary polls every 30s, collects Docker + HTTP data, and publishes events to NATS. The existing event-processor → postgres → query-api pipeline requires no changes.

New components:

cmd/swarm-monitor/main.go — polling loop, event emission
internal/monitor/swarm/types.go — data model
internal/monitor/swarm/collector.go — Docker + HTTP collection

Existing components touched:

cmd/web-ui/static/app.js — infrastructure page, swarm strip on dashboard, rename openclaw → infrastructure
cmd/web-ui/static/style.css — infrastructure page styles
cmd/web-ui/static/index.html — update nav link
~/lab/swarm/docker-compose.yaml — add agentmon.* labels to services

Docker Labels

Services are tagged via Docker labels so the monitor is opt-in and self-describing:

labels:
  agentmon.monitor: "true"
  agentmon.role: "llm-proxy"   # drives collection strategy + UI card

Defined roles:

Role	Services	HTTP probe	Extra data
`llm-proxy`	litellm	`GET /health/liveliness` + `GET /v1/models`	model count, cooldown count
`db`	litellm-db	none	Docker health only
`search`	searxng	`GET /`	response time ms
`mcp`	brave-search	port reachability	—
`voice`	whisper-server, kokoro-tts	Docker healthcheck	—
`automation`	n8n-agent	Docker healthcheck	—

The collector filters containers by agentmon.monitor=true via the Docker API, then dispatches to the role-specific probe strategy.
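As a rough illustration of the filter-then-dispatch step, the sketch below builds the query for the Docker Engine API container list endpoint (GET /containers/json with a label filter) and maps each role to a probe kind. The function names and the probe-kind strings ("http", "port", "docker-health") are assumptions made for this example, as is the fallback for unknown roles.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/url"
)

// listQuery builds the path and query for the Docker Engine API endpoint
// GET /containers/json, restricted to containers carrying the opt-in label.
func listQuery() string {
	// Marshalling a map of string slices cannot fail, so the error is ignored.
	filters, _ := json.Marshal(map[string][]string{
		"label": {"agentmon.monitor=true"},
	})
	q := url.Values{}
	q.Set("all", "true") // include stopped containers so they can be reported as down
	q.Set("filters", string(filters))
	return "/containers/json?" + q.Encode()
}

// probeFor maps the agentmon.role label to a probe kind, following the roles
// table. The returned strings are hypothetical identifiers for this sketch.
func probeFor(role string) string {
	switch role {
	case "llm-proxy", "search":
		return "http"
	case "mcp":
		return "port"
	case "db", "voice", "automation":
		return "docker-health"
	default:
		// Assumption: unknown roles fall back to Docker state only.
		return "docker-health"
	}
}

func main() {
	fmt.Println(listQuery())
	for _, role := range []string{"llm-proxy", "db", "search", "mcp", "voice", "automation"} {
		fmt.Printf("%-12s %s\n", role, probeFor(role))
	}
}
```

Filtering on the Docker side keeps unlabeled containers out of the monitor entirely, which matches the opt-in intent of the labels.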

Data Model

type ServiceSnapshot struct {
    Name           string         `json:"name"`
    Role           string         `json:"role"`
    ContainerState string         `json:"container_state"` // running/stopped/exited/missing
    HealthState    string         `json:"health_state"`    // healthy/unhealthy/starting/none
    Status         string         `json:"status"`          // healthy/degraded/down
    UptimeSec      int64          `json:"uptime_sec,omitempty"`
    HTTPStatus     *int           `json:"http_status,omitempty"`
    Extra          map[string]any `json:"extra,omitempty"`
}

type SwarmSnapshot struct {
    Services  []ServiceSnapshot `json:"services"`
    Issues    Issues            `json:"issues"`
    Timestamp time.Time         `json:"timestamp"`
}

type Issues struct {
    ServiceDown     []string `json:"service_down,omitempty"`
    ServiceDegraded []string `json:"service_degraded,omitempty"`
    LLMCooldowns    bool     `json:"llm_cooldowns,omitempty"`
}

Status derivation:

down — container not running or missing
degraded — running but HTTP probe failed or Docker healthcheck returns unhealthy
healthy — running + all probes pass
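The three rules above can be written as one small function. This is a sketch, not the final collector code: deriveStatus and its parameters are hypothetical, and it assumes that a Docker health state of "starting" or "none" does not count as a failed probe, since the rules only name "unhealthy".

```go
package main

import "fmt"

// deriveStatus applies the status rules in order: down, then degraded, then healthy.
// probeOK should be true when the role has no HTTP or port probe.
func deriveStatus(containerState, healthState string, probeOK bool) string {
	if containerState != "running" {
		return "down" // stopped, exited or missing
	}
	if !probeOK || healthState == "unhealthy" {
		return "degraded"
	}
	return "healthy"
}

func main() {
	fmt.Println(deriveStatus("running", "healthy", true))
	fmt.Println(deriveStatus("running", "none", false))
	fmt.Println(deriveStatus("exited", "none", true))
}
```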

LiteLLM extra: {"model_count": 12, "cooldown_count": 0}

Search extra: {"response_ms": 45}

Events

Two event types emitted per poll:

swarm.snapshot — all services bundled, used by dashboard strip and quick status:

{
  "schema": {"name": "agentmon.swarm", "version": 1},
  "event": {"id": "...", "type": "swarm.snapshot", "ts": "..."},
  "payload": {
    "services": [...],
    "issues": {"service_down": [], "llm_cooldowns": false}
  }
}
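One way the issues object could be derived from the services list is sketched below. The types are trimmed copies of the data model, buildIssues and cooldownCount are hypothetical helpers, and the rule that llm_cooldowns is true whenever any service reports cooldown_count > 0 is an assumption; the design only gives the field names.

```go
package main

import "fmt"

// Trimmed copies of the data model types, enough for this sketch.
type ServiceSnapshot struct {
	Name   string         `json:"name"`
	Status string         `json:"status"` // healthy/degraded/down
	Extra  map[string]any `json:"extra,omitempty"`
}

type Issues struct {
	ServiceDown     []string `json:"service_down,omitempty"`
	ServiceDegraded []string `json:"service_degraded,omitempty"`
	LLMCooldowns    bool     `json:"llm_cooldowns,omitempty"`
}

// cooldownCount reads extra["cooldown_count"], tolerating the numeric types
// that can appear depending on whether the map was built in Go or decoded
// from JSON. A missing or non-numeric value counts as zero.
func cooldownCount(extra map[string]any) int {
	switch v := extra["cooldown_count"].(type) {
	case int:
		return v
	case int64:
		return int(v)
	case float64: // encoding/json decodes numbers into float64
		return int(v)
	}
	return 0
}

// buildIssues summarises per-service status into the Issues struct.
func buildIssues(services []ServiceSnapshot) Issues {
	var out Issues
	for _, s := range services {
		switch s.Status {
		case "down":
			out.ServiceDown = append(out.ServiceDown, s.Name)
		case "degraded":
			out.ServiceDegraded = append(out.ServiceDegraded, s.Name)
		}
		if cooldownCount(s.Extra) > 0 {
			out.LLMCooldowns = true // assumption: any cooldown sets the flag
		}
	}
	return out
}

func main() {
	issues := buildIssues([]ServiceSnapshot{
		{Name: "litellm", Status: "healthy", Extra: map[string]any{"model_count": 12, "cooldown_count": 1}},
		{Name: "searxng", Status: "degraded"},
		{Name: "n8n-agent", Status: "down"},
	})
	fmt.Printf("%+v\n", issues)
}
```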

swarm.service.snapshot — one per service, used by infrastructure page cards for per-service history:

{
  "schema": {"name": "agentmon.swarm.service", "version": 1},
  "event": {"id": "...", "type": "swarm.service.snapshot", "ts": "..."},
  "payload": {
    "service": { "name": "litellm", "role": "llm-proxy", "status": "healthy", ... }
  }
}

Frontend

Dashboard

Existing VM strip (zap/orb/sun pills) stays unchanged
New swarm strip below it, driven by latest swarm.snapshot event
One pill per service: green=healthy, amber=degraded, red=down
Same pill component style as VM strip

Nav rename: the OpenClaw nav link becomes Infra, and the /openclaw route becomes /infrastructure.

Infrastructure Page

Two sections stacked vertically:

VMs        [ zap card ]  [ orb card ]  [ sun card ]   (existing openclaw cards)
Services   [ litellm ]  [ litellm-db ]  [ searxng ]  [ brave ]  [ whisper ]  [ kokoro ]  [ n8n ]

Role-Driven Card Layouts

Role	Card content
`llm-proxy`	Status badge · model count · cooldown warning banner (if > 0) · HTTP health
`db`	Status badge · uptime · Docker health dot
`search`	Status badge · response time badge
`mcp`	Status badge · port reachability dot
`voice`	Status badge · Docker healthcheck state
`automation`	Status badge · Docker healthcheck state

Cards update live via WebSocket swarm.service.snapshot events.

Environment / Config

NATS_URL          nats://nats:4222
NATS_TOPIC        agentmon.events.v1
POLL_INTERVAL     30s
DOCKER_HOST       unix:///var/run/docker.sock
LITELLM_BASE_URL  http://localhost:18804
LITELLM_API_KEY   (from env)

swarm-monitor runs on the host (same as openclaw-monitor), with access to the Docker socket.
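Loading these settings with the defaults listed above could be sketched as follows. The config struct, envOr, and loadConfig are hypothetical names; the lookup function is injected so the loader can be exercised without touching the real environment. LITELLM_API_KEY has no default and is read as-is.

```go
package main

import (
	"fmt"
	"os"
	"time"
)

type config struct {
	NATSURL        string
	NATSTopic      string
	PollInterval   time.Duration
	DockerHost     string
	LiteLLMBaseURL string
	LiteLLMAPIKey  string
}

// envOr returns the variable's value, or def when it is unset or empty.
func envOr(getenv func(string) string, key, def string) string {
	if v := getenv(key); v != "" {
		return v
	}
	return def
}

// loadConfig reads settings through getenv (os.Getenv in production) and
// falls back to the defaults from the design.
func loadConfig(getenv func(string) string) (config, error) {
	interval, err := time.ParseDuration(envOr(getenv, "POLL_INTERVAL", "30s"))
	if err != nil {
		return config{}, fmt.Errorf("POLL_INTERVAL: %w", err)
	}
	if interval <= 0 {
		return config{}, fmt.Errorf("POLL_INTERVAL must be positive, got %s", interval)
	}
	return config{
		NATSURL:        envOr(getenv, "NATS_URL", "nats://nats:4222"),
		NATSTopic:      envOr(getenv, "NATS_TOPIC", "agentmon.events.v1"),
		PollInterval:   interval,
		DockerHost:     envOr(getenv, "DOCKER_HOST", "unix:///var/run/docker.sock"),
		LiteLLMBaseURL: envOr(getenv, "LITELLM_BASE_URL", "http://localhost:18804"),
		LiteLLMAPIKey:  getenv("LITELLM_API_KEY"), // no default; must come from env
	}, nil
}

func main() {
	cfg, err := loadConfig(os.Getenv)
	if err != nil {
		fmt.Fprintln(os.Stderr, "config error:", err)
		os.Exit(1)
	}
	// Avoid printing the API key.
	fmt.Println(cfg.NATSURL, cfg.NATSTopic, cfg.PollInterval, cfg.DockerHost, cfg.LiteLLMBaseURL)
}
```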
