
Swarm Monitor — Design

Date: 2026-03-18
Goal: Monitor the docker-compose services in ~/lab/swarm as part of the agents infrastructure. Add a swarm-monitor binary and a dashboard strip, and replace the /openclaw page with a unified /infrastructure page showing both VMs and swarm services.


Architecture

Follows the openclaw-monitor pattern exactly. A new swarm-monitor binary polls every 30s, collects Docker + HTTP data, and publishes events to NATS. The existing event-processor → postgres → query-api pipeline requires no changes.

New components:

  • cmd/swarm-monitor/main.go — polling loop, event emission
  • internal/monitor/swarm/types.go — data model
  • internal/monitor/swarm/collector.go — Docker + HTTP collection

Existing components touched:

  • cmd/web-ui/static/app.js — infrastructure page, swarm strip on dashboard, rename openclaw → infrastructure
  • cmd/web-ui/static/style.css — infrastructure page styles
  • cmd/web-ui/static/index.html — update nav link
  • ~/lab/swarm/docker-compose.yaml — add agentmon.* labels to services

Docker Labels

Services are tagged via Docker labels so the monitor is opt-in and self-describing:

labels:
  agentmon.monitor: "true"
  agentmon.role: "llm-proxy"   # drives collection strategy + UI card

Defined roles:

Role        Services                    HTTP probe                               Extra data
llm-proxy   litellm                     GET /health/liveliness + GET /v1/models  model count, cooldown count
db          litellm-db                  none                                     Docker health only
search      searxng                     GET /                                    response time (ms)
mcp         brave-search                port reachability                        -
voice       whisper-server, kokoro-tts  Docker healthcheck                       -
automation  n8n-agent                   Docker healthcheck                       -

The collector filters containers by agentmon.monitor=true via the Docker API, then dispatches to the role-specific probe strategy.
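As a rough illustration of the filter-then-dispatch step, the sketch below works on an in-memory list instead of the Docker API (the CLI equivalent of the server-side filter is `docker ps --filter label=agentmon.monitor=true`). Container, ProbeFunc, monitored and probeFor are hypothetical names, and the fallback for unknown roles is an assumption the design does not specify.

```go
package main

import "fmt"

// Container is a hypothetical, trimmed-down view of a Docker container.
type Container struct {
	Name   string
	Labels map[string]string
}

// ProbeFunc runs a role-specific check and returns extra data for the snapshot.
type ProbeFunc func(c Container) (map[string]any, error)

// monitored keeps only containers that opted in via agentmon.monitor=true.
func monitored(all []Container) []Container {
	var out []Container
	for _, c := range all {
		if c.Labels["agentmon.monitor"] == "true" {
			out = append(out, c)
		}
	}
	return out
}

// probeFor picks the strategy from the agentmon.role label. Unknown roles
// use the supplied fallback (assumed here: Docker state only).
func probeFor(c Container, byRole map[string]ProbeFunc, fallback ProbeFunc) ProbeFunc {
	if p, ok := byRole[c.Labels["agentmon.role"]]; ok {
		return p
	}
	return fallback
}

func main() {
	containers := []Container{
		{Name: "litellm", Labels: map[string]string{"agentmon.monitor": "true", "agentmon.role": "llm-proxy"}},
		{Name: "unrelated", Labels: map[string]string{}},
	}
	byRole := map[string]ProbeFunc{
		"llm-proxy": func(c Container) (map[string]any, error) {
			// Placeholder values; a real probe would call the LiteLLM endpoints.
			return map[string]any{"model_count": 0, "cooldown_count": 0}, nil
		},
	}
	dockerOnly := func(Container) (map[string]any, error) { return nil, nil }

	for _, c := range monitored(containers) {
		extra, err := probeFor(c, byRole, dockerOnly)(c)
		fmt.Println(c.Name, extra, err)
	}
}
```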


Data Model

type ServiceSnapshot struct {
    Name           string         `json:"name"`
    Role           string         `json:"role"`
    ContainerState string         `json:"container_state"` // running/stopped/exited/missing
    HealthState    string         `json:"health_state"`    // healthy/unhealthy/starting/none
    Status         string         `json:"status"`          // healthy/degraded/down
    UptimeSec      int64          `json:"uptime_sec,omitempty"`
    HTTPStatus     *int           `json:"http_status,omitempty"`
    Extra          map[string]any `json:"extra,omitempty"`
}

type SwarmSnapshot struct {
    Services  []ServiceSnapshot `json:"services"`
    Issues    Issues            `json:"issues"`
    Timestamp time.Time         `json:"timestamp"`
}

type Issues struct {
    ServiceDown     []string `json:"service_down,omitempty"`
    ServiceDegraded []string `json:"service_degraded,omitempty"`
    LLMCooldowns    bool     `json:"llm_cooldowns,omitempty"`
}

Status derivation:

  • down — container not running or missing
  • degraded — running but HTTP probe failed or Docker healthcheck returns unhealthy
  • healthy — running + all probes pass
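The derivation rules above, together with the Issues roll-up, can be sketched as follows. The function names are hypothetical, the structs are trimmed copies of the data model, and two points are assumptions of this sketch: a health state of "starting" or "none" does not degrade a service, and cooldown_count is held as an int in Extra.

```go
package main

import "fmt"

// Trimmed copies of the data-model types, enough for this sketch.
type ServiceSnapshot struct {
	Name   string
	Role   string
	Status string
	Extra  map[string]any
}

type Issues struct {
	ServiceDown     []string
	ServiceDegraded []string
	LLMCooldowns    bool
}

// deriveStatus maps raw observations onto healthy/degraded/down.
// probeOK should be true when the role has no probe or every probe passed.
func deriveStatus(containerState, healthState string, probeOK bool) string {
	if containerState != "running" {
		return "down" // stopped, exited, or missing
	}
	if healthState == "unhealthy" || !probeOK {
		return "degraded"
	}
	return "healthy"
}

// buildIssues rolls per-service statuses up into the Issues summary.
func buildIssues(services []ServiceSnapshot) Issues {
	var is Issues
	for _, s := range services {
		switch s.Status {
		case "down":
			is.ServiceDown = append(is.ServiceDown, s.Name)
		case "degraded":
			is.ServiceDegraded = append(is.ServiceDegraded, s.Name)
		}
		// Assumption: the collector stores cooldown_count as an int.
		if n, ok := s.Extra["cooldown_count"].(int); ok && n > 0 {
			is.LLMCooldowns = true
		}
	}
	return is
}

func main() {
	services := []ServiceSnapshot{
		{Name: "litellm", Role: "llm-proxy", Status: deriveStatus("running", "healthy", true),
			Extra: map[string]any{"model_count": 12, "cooldown_count": 2}},
		{Name: "searxng", Role: "search", Status: deriveStatus("running", "none", false)},
		{Name: "n8n-agent", Role: "automation", Status: deriveStatus("exited", "none", true)},
	}
	fmt.Printf("%+v\n", buildIssues(services))
}
```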

LiteLLM extra: {"model_count": 12, "cooldown_count": 0}
Search extra: {"response_ms": 45}


Events

Two event types emitted per poll:

swarm.snapshot — all services bundled, used by dashboard strip and quick status:

{
  "schema": {"name": "agentmon.swarm", "version": 1},
  "event": {"id": "...", "type": "swarm.snapshot", "ts": "..."},
  "payload": {
    "services": [...],
    "issues": {"service_down": [], "service_degraded": [], "llm_cooldowns": false}
  }
}

swarm.service.snapshot — one per service, used by infrastructure page cards for per-service history:

{
  "schema": {"name": "agentmon.swarm.service", "version": 1},
  "event": {"id": "...", "type": "swarm.service.snapshot", "ts": "..."},
  "payload": {
    "service": { "name": "litellm", "role": "llm-proxy", "status": "healthy", ... }
  }
}

Frontend

Dashboard

  • Existing VM strip (zap/orb/sun pills) stays unchanged
  • New swarm strip below it, driven by latest swarm.snapshot event
  • One pill per service: green=healthy, amber=degraded, red=down
  • Same pill component style as VM strip

Navigation

OpenClaw → Infra. Route /openclaw → /infrastructure.

Infrastructure Page

Two sections stacked vertically:

VMs        [ zap card ]  [ orb card ]  [ sun card ]   (existing openclaw cards)
Services   [ litellm ]  [ litellm-db ]  [ searxng ]  [ brave ]  [ whisper ]  [ kokoro ]  [ n8n ]

Role-Driven Card Layouts

Role        Card content
llm-proxy   Status badge · model count · cooldown warning banner (if > 0) · HTTP health
db          Status badge · uptime · Docker health dot
search      Status badge · response time badge
mcp         Status badge · port reachability dot
voice       Status badge · Docker healthcheck state
automation  Status badge · Docker healthcheck state

Cards update live via WebSocket swarm.service.snapshot events.


Environment / Config

NATS_URL          nats://nats:4222
NATS_TOPIC        agentmon.events.v1
POLL_INTERVAL     30s
DOCKER_HOST       unix:///var/run/docker.sock
LITELLM_BASE_URL  http://localhost:18804
LITELLM_API_KEY   (from env)

swarm-monitor runs on the host (same as openclaw-monitor), with access to the Docker socket.