Swarm Monitor — Design
Date: 2026-03-18
Goal: Monitor docker-compose services in ~/lab/swarm as part of the agents infrastructure. Add a swarm-monitor binary, dashboard strip, and replace the /openclaw page with a unified /infrastructure page showing both VMs and swarm services.
Architecture
Follows the openclaw-monitor pattern exactly. A new swarm-monitor binary polls every 30s, collects Docker + HTTP data, and publishes events to NATS. The existing event-processor → postgres → query-api pipeline requires no changes.
New components:
- cmd/swarm-monitor/main.go: polling loop, event emission
- internal/monitor/swarm/types.go: data model
- internal/monitor/swarm/collector.go: Docker + HTTP collection
Existing components touched:
- cmd/web-ui/static/app.js: infrastructure page, swarm strip on dashboard, rename openclaw → infrastructure
- cmd/web-ui/static/style.css: infrastructure page styles
- cmd/web-ui/static/index.html: update nav link
- ~/lab/swarm/docker-compose.yaml: add agentmon.* labels to services
Docker Labels
Services are tagged via Docker labels so the monitor is opt-in and self-describing:
labels:
  agentmon.monitor: "true"
  agentmon.role: "llm-proxy"   # drives collection strategy + UI card
Defined roles:
| Role | Services | HTTP probe | Extra data |
|---|---|---|---|
| llm-proxy | litellm | GET /health/liveliness + GET /v1/models | model count, cooldown count |
| db | litellm-db | none | Docker health only |
| search | searxng | GET / | response time ms |
| mcp | brave-search | port reachability | - |
| voice | whisper-server, kokoro-tts | Docker healthcheck | - |
| automation | n8n-agent | Docker healthcheck | - |
The collector filters containers by agentmon.monitor=true via the Docker API, then dispatches to the role-specific probe strategy.
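A minimal sketch of the filter-then-dispatch step follows. It makes several assumptions: Container is a simplified stand-in for whatever the Docker API returns, the filtering is done in memory rather than by the Docker API's own label filter, and ProbeKind, probeByRole, monitored and strategyFor are hypothetical names. The fallback for unknown roles is also an assumption, since the design does not specify one.

```go
package main

import "fmt"

// Container is a simplified stand-in for a container listing entry.
type Container struct {
	Name   string
	Labels map[string]string
}

// ProbeKind names a collection strategy.
type ProbeKind string

const (
	ProbeHTTP         ProbeKind = "http"
	ProbePort         ProbeKind = "port"
	ProbeDockerHealth ProbeKind = "docker-health"
)

// probeByRole mirrors the role table: which probe each role gets.
var probeByRole = map[string]ProbeKind{
	"llm-proxy":  ProbeHTTP,
	"db":         ProbeDockerHealth,
	"search":     ProbeHTTP,
	"mcp":        ProbePort,
	"voice":      ProbeDockerHealth,
	"automation": ProbeDockerHealth,
}

// monitored keeps only containers that opted in with agentmon.monitor=true.
func monitored(all []Container) []Container {
	var out []Container
	for _, c := range all {
		if c.Labels["agentmon.monitor"] == "true" {
			out = append(out, c)
		}
	}
	return out
}

// strategyFor picks the probe for a container's agentmon.role label.
// Unknown or missing roles fall back to the Docker healthcheck (an assumption).
func strategyFor(c Container) ProbeKind {
	if kind, ok := probeByRole[c.Labels["agentmon.role"]]; ok {
		return kind
	}
	return ProbeDockerHealth
}

func main() {
	containers := []Container{
		{Name: "litellm", Labels: map[string]string{"agentmon.monitor": "true", "agentmon.role": "llm-proxy"}},
		{Name: "brave-search", Labels: map[string]string{"agentmon.monitor": "true", "agentmon.role": "mcp"}},
		{Name: "unlabelled", Labels: map[string]string{}},
	}
	for _, c := range monitored(containers) {
		fmt.Printf("%s -> %s\n", c.Name, strategyFor(c))
	}
}
```

A lookup table keeps adding a role to a one-line change, which fits the opt-in, self-describing intent of the labels.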
Data Model
type ServiceSnapshot struct {
	Name           string         `json:"name"`
	Role           string         `json:"role"`
	ContainerState string         `json:"container_state"` // running/stopped/exited/missing
	HealthState    string         `json:"health_state"`    // healthy/unhealthy/starting/none
	Status         string         `json:"status"`          // healthy/degraded/down
	UptimeSec      int64          `json:"uptime_sec,omitempty"`
	HTTPStatus     *int           `json:"http_status,omitempty"`
	Extra          map[string]any `json:"extra,omitempty"`
}

type SwarmSnapshot struct {
	Services  []ServiceSnapshot `json:"services"`
	Issues    Issues            `json:"issues"`
	Timestamp time.Time         `json:"timestamp"`
}

type Issues struct {
	ServiceDown     []string `json:"service_down,omitempty"`
	ServiceDegraded []string `json:"service_degraded,omitempty"`
	LLMCooldowns    bool     `json:"llm_cooldowns,omitempty"`
}
Status derivation:
- down: container not running or missing
- degraded: running but HTTP probe failed or Docker healthcheck returns unhealthy
- healthy: running + all probes pass
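The three rules above can be sketched as a single function. The name deriveStatus and the probeOK parameter are hypothetical; roles with no HTTP or port probe would pass probeOK as true. A health state of starting or none is treated as passing here, which is an assumption the design does not spell out.

```go
package main

import "fmt"

// deriveStatus maps raw container and probe results onto healthy/degraded/down.
// Rules are checked in order of severity: down first, then degraded.
func deriveStatus(containerState, healthState string, probeOK bool) string {
	if containerState != "running" {
		return "down" // stopped, exited or missing
	}
	if !probeOK || healthState == "unhealthy" {
		return "degraded"
	}
	return "healthy"
}

func main() {
	fmt.Println(deriveStatus("running", "healthy", true))
	fmt.Println(deriveStatus("running", "unhealthy", true))
	fmt.Println(deriveStatus("running", "none", false))
	fmt.Println(deriveStatus("exited", "none", true))
}
```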
LiteLLM extra: {"model_count": 12, "cooldown_count": 0}
Search extra: {"response_ms": 45}
Events
Two event types emitted per poll:
swarm.snapshot — all services bundled, used by dashboard strip and quick status:
{
  "schema": {"name": "agentmon.swarm", "version": 1},
  "event": {"id": "...", "type": "swarm.snapshot", "ts": "..."},
  "payload": {
    "services": [...],
    "issues": {"service_down": [], "llm_cooldowns": false}
  }
}
swarm.service.snapshot — one per service, used by infrastructure page cards for per-service history:
{
  "schema": {"name": "agentmon.swarm.service", "version": 1},
  "event": {"id": "...", "type": "swarm.service.snapshot", "ts": "..."},
  "payload": {
    "service": { "name": "litellm", "role": "llm-proxy", "status": "healthy", ... }
  }
}
Frontend
Dashboard
- Existing VM strip (zap/orb/sun pills) stays unchanged
- New swarm strip below it, driven by the latest swarm.snapshot event
- One pill per service: green=healthy, amber=degraded, red=down
- Same pill component style as VM strip
Navigation
Rename the nav link OpenClaw → Infra and move the route /openclaw → /infrastructure.
Infrastructure Page
Two sections stacked vertically:
VMs [ zap card ] [ orb card ] [ sun card ] (existing openclaw cards)
Services [ litellm ] [ litellm-db ] [ searxng ] [ brave ] [ whisper ] [ kokoro ] [ n8n ]
Role-Driven Card Layouts
| Role | Card content |
|---|---|
| llm-proxy | Status badge · model count · cooldown warning banner (if > 0) · HTTP health |
| db | Status badge · uptime · Docker health dot |
| search | Status badge · response time badge |
| mcp | Status badge · port reachability dot |
| voice | Status badge · Docker healthcheck state |
| automation | Status badge · Docker healthcheck state |
Cards update live via WebSocket swarm.service.snapshot events.
Environment / Config
NATS_URL          nats://nats:4222
NATS_TOPIC        agentmon.events.v1
POLL_INTERVAL     30s
DOCKER_HOST       unix:///var/run/docker.sock
LITELLM_BASE_URL  http://localhost:18804
LITELLM_API_KEY   (from env)
swarm-monitor runs on the host (same as openclaw-monitor), with access to the Docker socket.