diff --git a/docs/plans/2026-03-18-swarm-monitor-design.md b/docs/plans/2026-03-18-swarm-monitor-design.md
new file mode 100644
index 0000000..710d612
--- /dev/null
+++ b/docs/plans/2026-03-18-swarm-monitor-design.md
@@ -0,0 +1,164 @@
+# Swarm Monitor — Design
+
+**Date:** 2026-03-18
+**Goal:** Monitor the docker-compose services in `~/lab/swarm` as part of the agents infrastructure. Add a `swarm-monitor` binary and a dashboard strip, and replace the `/openclaw` page with a unified `/infrastructure` page showing both VMs and swarm services.
+
+---
+
+## Architecture
+
+Follows the `openclaw-monitor` pattern exactly. A new `swarm-monitor` binary polls every 30s, collects Docker + HTTP data, and publishes events to NATS. The existing event-processor → postgres → query-api pipeline requires no changes.
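+
+The polling loop could be sketched roughly as below. This is not taken from `openclaw-monitor`; the `Poll` helper and the choice to collect once immediately at startup are assumptions made for illustration.
+
+```go
+package main
+
+import (
+    "context"
+    "fmt"
+    "time"
+)
+
+// Poll is a hypothetical helper: it runs collect once immediately, then once
+// per interval, and returns when ctx is cancelled.
+func Poll(ctx context.Context, interval time.Duration, collect func(context.Context)) {
+    ticker := time.NewTicker(interval)
+    defer ticker.Stop()
+    for {
+        collect(ctx)
+        select {
+        case <-ctx.Done():
+            return
+        case <-ticker.C:
+            // Both channels may be ready at once; re-check so a cancelled
+            // context never triggers another collection.
+            if ctx.Err() != nil {
+                return
+            }
+        }
+    }
+}
+
+func main() {
+    ctx, cancel := context.WithCancel(context.Background())
+    defer cancel()
+    n := 0
+    // A short interval keeps the demo fast; the design uses 30s.
+    Poll(ctx, 10*time.Millisecond, func(context.Context) {
+        n++
+        fmt.Println("poll", n)
+        if n == 3 {
+            cancel()
+        }
+    })
+}
+```
+
+In the real binary, `collect` would gather Docker and HTTP data and publish to NATS, and `ctx` would be cancelled on SIGINT/SIGTERM.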
+
+**New components:**
+- `cmd/swarm-monitor/main.go` — polling loop, event emission
+- `internal/monitor/swarm/types.go` — data model
+- `internal/monitor/swarm/collector.go` — Docker + HTTP collection
+
+**Existing components touched:**
+- `cmd/web-ui/static/app.js` — infrastructure page, swarm strip on dashboard, rename openclaw → infrastructure
+- `cmd/web-ui/static/style.css` — infrastructure page styles
+- `cmd/web-ui/static/index.html` — update nav link
+- `~/lab/swarm/docker-compose.yaml` — add `agentmon.*` labels to services
+
+---
+
+## Docker Labels
+
+Services are tagged via Docker labels so the monitor is opt-in and self-describing:
+
+```yaml
+labels:
+  agentmon.monitor: "true"
+  agentmon.role: "llm-proxy"   # drives collection strategy + UI card
+```
+
+**Defined roles:**
+
+| Role | Services | HTTP probe | Extra data |
+|------|----------|-----------|------------|
+| `llm-proxy` | litellm | `GET /health/liveliness` + `GET /v1/models` | model count, cooldown count |
+| `db` | litellm-db | none | Docker health only |
+| `search` | searxng | `GET /` | response time ms |
+| `mcp` | brave-search | port reachability | — |
+| `voice` | whisper-server, kokoro-tts | Docker healthcheck | — |
+| `automation` | n8n-agent | Docker healthcheck | — |
+
+The collector filters containers by `agentmon.monitor=true` via the Docker API, then dispatches to the role-specific probe strategy.
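+
+A minimal sketch of the dispatch step, assuming the container labels have already been read from the Docker API into a `map[string]string`. The `ProbeKind`, `Strategy`, and `StrategyFor` names are hypothetical, and the fallback to Docker health for unknown roles is an assumption rather than part of this design.
+
+```go
+package main
+
+import "fmt"
+
+// ProbeKind names how a service is checked (illustrative constants).
+type ProbeKind string
+
+const (
+    ProbeHTTP         ProbeKind = "http"
+    ProbePort         ProbeKind = "port"
+    ProbeDockerHealth ProbeKind = "docker-health"
+)
+
+// Strategy describes the probe for one role; Paths mirror the role table.
+type Strategy struct {
+    Kind  ProbeKind
+    Paths []string
+}
+
+var strategies = map[string]Strategy{
+    "llm-proxy":  {Kind: ProbeHTTP, Paths: []string{"/health/liveliness", "/v1/models"}},
+    "db":         {Kind: ProbeDockerHealth},
+    "search":     {Kind: ProbeHTTP, Paths: []string{"/"}},
+    "mcp":        {Kind: ProbePort},
+    "voice":      {Kind: ProbeDockerHealth},
+    "automation": {Kind: ProbeDockerHealth},
+}
+
+// StrategyFor picks the probe strategy from a container's labels.
+// The bool is false when the container has not opted in via agentmon.monitor.
+func StrategyFor(labels map[string]string) (Strategy, bool) {
+    if labels["agentmon.monitor"] != "true" {
+        return Strategy{}, false
+    }
+    if s, found := strategies[labels["agentmon.role"]]; found {
+        return s, true
+    }
+    // Assumption: unknown roles are still monitored, using Docker health only.
+    return Strategy{Kind: ProbeDockerHealth}, true
+}
+
+func main() {
+    labels := map[string]string{"agentmon.monitor": "true", "agentmon.role": "search"}
+    s, ok := StrategyFor(labels)
+    fmt.Println(s.Kind, s.Paths, ok)
+}
+```
+
+Keeping the role table in one map means a new role needs only a label on the service and one map entry.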
+
+---
+
+## Data Model
+
+```go
+type ServiceSnapshot struct {
+    Name           string         `json:"name"`
+    Role           string         `json:"role"`
+    ContainerState string         `json:"container_state"` // running/stopped/exited/missing
+    HealthState    string         `json:"health_state"`    // healthy/unhealthy/starting/none
+    Status         string         `json:"status"`          // healthy/degraded/down
+    UptimeSec      int64          `json:"uptime_sec,omitempty"`
+    HTTPStatus     *int           `json:"http_status,omitempty"`
+    Extra          map[string]any `json:"extra,omitempty"`
+}
+
+type SwarmSnapshot struct {
+    Services  []ServiceSnapshot `json:"services"`
+    Issues    Issues            `json:"issues"`
+    Timestamp time.Time         `json:"timestamp"`
+}
+
+type Issues struct {
+    ServiceDown     []string `json:"service_down,omitempty"`
+    ServiceDegraded []string `json:"service_degraded,omitempty"`
+    LLMCooldowns    bool     `json:"llm_cooldowns,omitempty"`
+}
+```
+
+**Status derivation:**
+- `down` — container not running or missing
+- `degraded` — running but HTTP probe failed or Docker healthcheck returns `unhealthy`
+- `healthy` — running + all probes pass
+
+**LiteLLM `extra`:** `{"model_count": 12, "cooldown_count": 0}`
+**Search `extra`:** `{"response_ms": 45}`
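+
+The status derivation rules above can be expressed as a small pure function. This is a sketch: the name `DeriveStatus` is hypothetical, and a `starting` health state is treated as passing because the rules only name `unhealthy` as a degrading state.
+
+```go
+package main
+
+import "fmt"
+
+// DeriveStatus maps raw observations to healthy/degraded/down.
+// probeOK should be true for roles that have no HTTP or port probe.
+func DeriveStatus(containerState, healthState string, probeOK bool) string {
+    // Anything other than a running container (stopped, exited, missing) is down.
+    if containerState != "running" {
+        return "down"
+    }
+    // Running, but a probe failed or Docker reports the healthcheck as unhealthy.
+    if !probeOK || healthState == "unhealthy" {
+        return "degraded"
+    }
+    return "healthy"
+}
+
+func main() {
+    fmt.Println(DeriveStatus("running", "healthy", true))
+    fmt.Println(DeriveStatus("running", "unhealthy", true))
+    fmt.Println(DeriveStatus("exited", "none", false))
+}
+```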
+
+---
+
+## Events
+
+Two event types are emitted per poll:
+
+**`swarm.snapshot`** — all services bundled, used by dashboard strip and quick status:
+```json
+{
+  "schema": {"name": "agentmon.swarm", "version": 1},
+  "event": {"id": "...", "type": "swarm.snapshot", "ts": "..."},
+  "payload": {
+    "services": [...],
+    "issues": {"service_down": [], "llm_cooldowns": false}
+  }
+}
+```
+
+**`swarm.service.snapshot`** — one per service, used by infrastructure page cards for per-service history:
+```json
+{
+  "schema": {"name": "agentmon.swarm.service", "version": 1},
+  "event": {"id": "...", "type": "swarm.service.snapshot", "ts": "..."},
+  "payload": {
+    "service": { "name": "litellm", "role": "llm-proxy", "status": "healthy", ... }
+  }
+}
+```
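+
+One way to build both envelopes in Go is sketched below. The `id` and `ts` formats are elided in the examples above, so the random hex ID and the RFC 3339 timestamp used here are assumptions, as are the `Envelope` and `NewEnvelope` names.
+
+```go
+package main
+
+import (
+    "crypto/rand"
+    "encoding/hex"
+    "encoding/json"
+    "fmt"
+    "time"
+)
+
+type Schema struct {
+    Name    string `json:"name"`
+    Version int    `json:"version"`
+}
+
+type EventMeta struct {
+    ID   string `json:"id"`
+    Type string `json:"type"`
+    TS   string `json:"ts"`
+}
+
+// Envelope wraps any payload in the schema/event/payload shape shown above.
+type Envelope struct {
+    Schema  Schema    `json:"schema"`
+    Event   EventMeta `json:"event"`
+    Payload any       `json:"payload"`
+}
+
+// newID returns 16 random bytes as hex (assumed ID format).
+func newID() string {
+    b := make([]byte, 16)
+    if _, err := rand.Read(b); err != nil {
+        panic(err)
+    }
+    return hex.EncodeToString(b)
+}
+
+// NewEnvelope builds a version-1 envelope for the given schema and event type.
+func NewEnvelope(schemaName, eventType string, payload any, now time.Time) Envelope {
+    return Envelope{
+        Schema:  Schema{Name: schemaName, Version: 1},
+        Event:   EventMeta{ID: newID(), Type: eventType, TS: now.UTC().Format(time.RFC3339)},
+        Payload: payload,
+    }
+}
+
+func main() {
+    payload := map[string]any{"service": map[string]string{"name": "litellm", "status": "healthy"}}
+    env := NewEnvelope("agentmon.swarm.service", "swarm.service.snapshot", payload, time.Now())
+    out, err := json.MarshalIndent(env, "", "  ")
+    if err != nil {
+        panic(err)
+    }
+    fmt.Println(string(out))
+}
+```
+
+Passing `now` in as a parameter lets the per-service events and the bundled `swarm.snapshot` event from the same poll share one timestamp.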
+
+---
+
+## Frontend
+
+### Dashboard
+
+- Existing VM strip (zap/orb/sun pills) stays unchanged
+- New **swarm strip** below it, driven by latest `swarm.snapshot` event
+- One pill per service: green=healthy, amber=degraded, red=down
+- Same pill component style as VM strip
+
+### Navigation
+
+Rename the nav link `OpenClaw` → `Infra` and the route `/openclaw` → `/infrastructure`.
+
+### Infrastructure Page
+
+Two sections stacked vertically:
+
+```
+VMs        [ zap card ]  [ orb card ]  [ sun card ]   (existing openclaw cards)
+Services   [ litellm ]  [ litellm-db ]  [ searxng ]  [ brave ]  [ whisper ]  [ kokoro ]  [ n8n ]
+```
+
+### Role-Driven Card Layouts
+
+| Role | Card content |
+|------|-------------|
+| `llm-proxy` | Status badge · model count · cooldown warning banner (if > 0) · HTTP health |
+| `db` | Status badge · uptime · Docker health dot |
+| `search` | Status badge · response time badge |
+| `mcp` | Status badge · port reachability dot |
+| `voice` | Status badge · Docker healthcheck state |
+| `automation` | Status badge · Docker healthcheck state |
+
+Cards update live via WebSocket `swarm.service.snapshot` events.
+
+---
+
+## Environment / Config
+
+```
+NATS_URL          nats://nats:4222
+NATS_TOPIC        agentmon.events.v1
+POLL_INTERVAL     30s
+DOCKER_HOST       unix:///var/run/docker.sock
+LITELLM_BASE_URL  http://localhost:18804
+LITELLM_API_KEY   (from env)
+```
+
+`swarm-monitor` runs on the host (same as openclaw-monitor), with access to the Docker socket.
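+
+Loading these settings might look like the sketch below, which falls back to the defaults listed above when a variable is unset. The `Config` struct, `getenv`, and `LoadConfig` are hypothetical names, and rejecting a non-positive interval is an added assumption (a `time.Ticker` panics on one).
+
+```go
+package main
+
+import (
+    "fmt"
+    "os"
+    "time"
+)
+
+// Config mirrors the environment variables listed above.
+type Config struct {
+    NATSURL        string
+    NATSTopic      string
+    PollInterval   time.Duration
+    DockerHost     string
+    LiteLLMBaseURL string
+    LiteLLMAPIKey  string
+}
+
+// getenv returns the variable's value, or def when it is unset or empty.
+func getenv(key, def string) string {
+    if v, ok := os.LookupEnv(key); ok && v != "" {
+        return v
+    }
+    return def
+}
+
+// LoadConfig reads the environment and validates POLL_INTERVAL.
+func LoadConfig() (Config, error) {
+    interval, err := time.ParseDuration(getenv("POLL_INTERVAL", "30s"))
+    if err != nil {
+        return Config{}, fmt.Errorf("POLL_INTERVAL: %w", err)
+    }
+    if interval <= 0 {
+        return Config{}, fmt.Errorf("POLL_INTERVAL must be positive, got %s", interval)
+    }
+    return Config{
+        NATSURL:        getenv("NATS_URL", "nats://nats:4222"),
+        NATSTopic:      getenv("NATS_TOPIC", "agentmon.events.v1"),
+        PollInterval:   interval,
+        DockerHost:     getenv("DOCKER_HOST", "unix:///var/run/docker.sock"),
+        LiteLLMBaseURL: getenv("LITELLM_BASE_URL", "http://localhost:18804"),
+        LiteLLMAPIKey:  os.Getenv("LITELLM_API_KEY"), // no default; supplied by the environment
+    }, nil
+}
+
+func main() {
+    cfg, err := LoadConfig()
+    if err != nil {
+        fmt.Fprintln(os.Stderr, err)
+        os.Exit(1)
+    }
+    // Print non-secret settings only.
+    fmt.Println(cfg.NATSURL, cfg.NATSTopic, cfg.PollInterval, cfg.DockerHost, cfg.LiteLLMBaseURL)
+}
+```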