docs: swarm monitor design — infra page, docker labels, role-driven cards
# Swarm Monitor — Design

**Date:** 2026-03-18

**Goal:** Monitor docker-compose services in `~/lab/swarm` as part of the agents infrastructure. Add a `swarm-monitor` binary and a dashboard strip, and replace the `/openclaw` page with a unified `/infrastructure` page showing both VMs and swarm services.

---

## Architecture

The design follows the `openclaw-monitor` pattern exactly: a new `swarm-monitor` binary polls every 30s, collects Docker and HTTP data, and publishes events to NATS. The existing event-processor → postgres → query-api pipeline requires no changes.

**New components:**
- `cmd/swarm-monitor/main.go` — polling loop, event emission
- `internal/monitor/swarm/types.go` — data model
- `internal/monitor/swarm/collector.go` — Docker + HTTP collection

**Existing components touched:**
- `cmd/web-ui/static/app.js` — infrastructure page, swarm strip on dashboard, rename openclaw → infrastructure
- `cmd/web-ui/static/style.css` — infrastructure page styles
- `cmd/web-ui/static/index.html` — update nav link
- `~/lab/swarm/docker-compose.yaml` — add `agentmon.*` labels to services

---

## Docker Labels

Services are tagged via Docker labels so the monitor is opt-in and self-describing:

```yaml
labels:
  agentmon.monitor: "true"
  agentmon.role: "llm-proxy"   # drives collection strategy + UI card
```

**Defined roles:**

| Role | Services | HTTP probe | Extra data |
|------|----------|-----------|------------|
| `llm-proxy` | litellm | `GET /health/liveliness` + `GET /v1/models` | model count, cooldown count |
| `db` | litellm-db | none | Docker health only |
| `search` | searxng | `GET /` | response time ms |
| `mcp` | brave-search | port reachability | — |
| `voice` | whisper-server, kokoro-tts | Docker healthcheck | — |
| `automation` | n8n-agent | Docker healthcheck | — |

The collector filters containers by `agentmon.monitor=true` via the Docker API, then dispatches to the role-specific probe strategy.
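
As a rough sketch of the filter-then-dispatch step, the role table can be expressed as a lookup map. Everything here is illustrative: `Container`, `ProbeKind`, `probeByRole`, and `selectMonitored` are hypothetical names, the label filter is applied in memory rather than through the Docker API, and falling back to a Docker-only probe for unknown roles is an assumption the design does not state.

```go
package main

import (
	"fmt"
	"sort"
)

// Container is a hypothetical, trimmed-down view of a Docker container.
type Container struct {
	Name   string
	Labels map[string]string
}

// ProbeKind names a probe strategy; the values mirror the role table.
type ProbeKind string

const (
	ProbeHTTP       ProbeKind = "http"
	ProbePort       ProbeKind = "port"
	ProbeDockerOnly ProbeKind = "docker"
)

// probeByRole maps each defined role to its probe strategy.
var probeByRole = map[string]ProbeKind{
	"llm-proxy":  ProbeHTTP,
	"db":         ProbeDockerOnly,
	"search":     ProbeHTTP,
	"mcp":        ProbePort,
	"voice":      ProbeDockerOnly,
	"automation": ProbeDockerOnly,
}

// selectMonitored keeps only opted-in containers and pairs each one with the
// probe strategy for its role. Unknown roles fall back to Docker-only
// (an assumption, not part of the design).
func selectMonitored(all []Container) map[string]ProbeKind {
	out := make(map[string]ProbeKind)
	for _, c := range all {
		if c.Labels["agentmon.monitor"] != "true" {
			continue
		}
		kind, ok := probeByRole[c.Labels["agentmon.role"]]
		if !ok {
			kind = ProbeDockerOnly
		}
		out[c.Name] = kind
	}
	return out
}

func main() {
	selected := selectMonitored([]Container{
		{Name: "litellm", Labels: map[string]string{"agentmon.monitor": "true", "agentmon.role": "llm-proxy"}},
		{Name: "brave-search", Labels: map[string]string{"agentmon.monitor": "true", "agentmon.role": "mcp"}},
		{Name: "unrelated", Labels: map[string]string{}},
	})

	names := make([]string, 0, len(selected))
	for name := range selected {
		names = append(names, name)
	}
	sort.Strings(names)
	for _, name := range names {
		fmt.Printf("%s -> %s\n", name, selected[name])
	}
}
```

In the real collector the opt-in check would more likely be pushed down to the Docker API as a label filter, so that unlabelled containers are never returned in the first place.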

---

## Data Model

```go
type ServiceSnapshot struct {
	Name           string         `json:"name"`
	Role           string         `json:"role"`
	ContainerState string         `json:"container_state"` // running/stopped/exited/missing
	HealthState    string         `json:"health_state"`    // healthy/unhealthy/starting/none
	Status         string         `json:"status"`          // healthy/degraded/down
	UptimeSec      int64          `json:"uptime_sec,omitempty"`
	HTTPStatus     *int           `json:"http_status,omitempty"`
	Extra          map[string]any `json:"extra,omitempty"`
}

type SwarmSnapshot struct {
	Services  []ServiceSnapshot `json:"services"`
	Issues    Issues            `json:"issues"`
	Timestamp time.Time         `json:"timestamp"`
}

type Issues struct {
	ServiceDown     []string `json:"service_down,omitempty"`
	ServiceDegraded []string `json:"service_degraded,omitempty"`
	LLMCooldowns    bool     `json:"llm_cooldowns,omitempty"`
}
```

**Status derivation:**
- `down` — container not running or missing
- `degraded` — running but HTTP probe failed or Docker healthcheck returns `unhealthy`
- `healthy` — running + all probes pass
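
The three rules above translate into a small pure function. This is a sketch under stated assumptions: `deriveStatus` is a hypothetical name, `probeOK` is expected to be `true` for roles that have no HTTP or port probe, and a `starting` health state is treated as healthy because the rules only name `unhealthy` as degrading.

```go
package main

import "fmt"

// deriveStatus applies the rules in order: down, then degraded, then healthy.
// probeOK should be true when the role defines no probe at all.
func deriveStatus(containerState, healthState string, probeOK bool) string {
	if containerState != "running" {
		return "down" // stopped, exited, or missing
	}
	if !probeOK || healthState == "unhealthy" {
		return "degraded"
	}
	// Assumption: "starting" and "none" count as healthy here.
	return "healthy"
}

func main() {
	fmt.Println(deriveStatus("exited", "none", true))       // down
	fmt.Println(deriveStatus("running", "unhealthy", true)) // degraded
	fmt.Println(deriveStatus("running", "healthy", false))  // degraded
	fmt.Println(deriveStatus("running", "healthy", true))   // healthy
}
```

Checking the container state first means a stopped container is always reported as `down`, regardless of any stale probe or healthcheck result.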

**LiteLLM `extra`:** `{"model_count": 12, "cooldown_count": 0}`

**Search `extra`:** `{"response_ms": 45}`
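
One way the `Issues` summary could be derived from the per-service snapshots is sketched below. The types are trimmed copies of the data model, `buildIssues` and `asInt` are hypothetical helpers, and setting `LLMCooldowns` whenever an `llm-proxy` service reports `cooldown_count > 0` is an assumption based on the card layout, not something the design spells out.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Trimmed copies of the data model types, enough for this sketch.
type ServiceSnapshot struct {
	Name   string         `json:"name"`
	Role   string         `json:"role"`
	Status string         `json:"status"`
	Extra  map[string]any `json:"extra,omitempty"`
}

type Issues struct {
	ServiceDown     []string `json:"service_down,omitempty"`
	ServiceDegraded []string `json:"service_degraded,omitempty"`
	LLMCooldowns    bool     `json:"llm_cooldowns,omitempty"`
}

// asInt reads a numeric value from a map[string]any. Values set in-process
// are usually int; values decoded from JSON arrive as float64.
func asInt(v any) int {
	switch n := v.(type) {
	case int:
		return n
	case int64:
		return int(n)
	case float64:
		return int(n)
	}
	return 0
}

// buildIssues groups service names by status and flags LLM cooldowns.
func buildIssues(services []ServiceSnapshot) Issues {
	var issues Issues
	for _, s := range services {
		switch s.Status {
		case "down":
			issues.ServiceDown = append(issues.ServiceDown, s.Name)
		case "degraded":
			issues.ServiceDegraded = append(issues.ServiceDegraded, s.Name)
		}
		if s.Role == "llm-proxy" && asInt(s.Extra["cooldown_count"]) > 0 {
			issues.LLMCooldowns = true
		}
	}
	return issues
}

func main() {
	issues := buildIssues([]ServiceSnapshot{
		{Name: "litellm", Role: "llm-proxy", Status: "healthy",
			Extra: map[string]any{"model_count": 12, "cooldown_count": 2}},
		{Name: "searxng", Role: "search", Status: "down"},
	})
	out, err := json.Marshal(issues)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

Because every `Issues` field carries `omitempty`, empty lists and a `false` cooldown flag are dropped from the encoded JSON.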

---

## Events

Two event types are emitted per poll:

**`swarm.snapshot`** — all services bundled, used by dashboard strip and quick status:
```json
{
  "schema": {"name": "agentmon.swarm", "version": 1},
  "event": {"id": "...", "type": "swarm.snapshot", "ts": "..."},
  "payload": {
    "services": [...],
    "issues": {"service_down": [], "llm_cooldowns": false}
  }
}
```

**`swarm.service.snapshot`** — one per service, used by infrastructure page cards for per-service history:
```json
{
  "schema": {"name": "agentmon.swarm.service", "version": 1},
  "event": {"id": "...", "type": "swarm.service.snapshot", "ts": "..."},
  "payload": {
    "service": { "name": "litellm", "role": "llm-proxy", "status": "healthy", ... }
  }
}
```

---

## Frontend

### Dashboard

- Existing VM strip (zap/orb/sun pills) stays unchanged
- New **swarm strip** below it, driven by latest `swarm.snapshot` event
- One pill per service: green=healthy, amber=degraded, red=down
- Same pill component style as VM strip

### Navigation

The nav link `OpenClaw` is renamed to `Infra`, and the route `/openclaw` becomes `/infrastructure`.

### Infrastructure Page

Two sections stacked vertically:

```
VMs       [ zap card ] [ orb card ] [ sun card ]   (existing openclaw cards)
Services  [ litellm ] [ litellm-db ] [ searxng ] [ brave ] [ whisper ] [ kokoro ] [ n8n ]
```

### Role-Driven Card Layouts

| Role | Card content |
|------|-------------|
| `llm-proxy` | Status badge · model count · cooldown warning banner (if > 0) · HTTP health |
| `db` | Status badge · uptime · Docker health dot |
| `search` | Status badge · response time badge |
| `mcp` | Status badge · port reachability dot |
| `voice` | Status badge · Docker healthcheck state |
| `automation` | Status badge · Docker healthcheck state |

Cards update live via WebSocket `swarm.service.snapshot` events.

---

## Environment / Config

```
NATS_URL          nats://nats:4222
NATS_TOPIC        agentmon.events.v1
POLL_INTERVAL     30s
DOCKER_HOST       unix:///var/run/docker.sock
LITELLM_BASE_URL  http://localhost:18804
LITELLM_API_KEY   (from env)
```

`swarm-monitor` runs on the host (same as `openclaw-monitor`), with access to the Docker socket.