diff --git a/docs/plans/2026-03-18-swarm-monitor-plan.md b/docs/plans/2026-03-18-swarm-monitor-plan.md new file mode 100644 index 0000000..8665f1c --- /dev/null +++ b/docs/plans/2026-03-18-swarm-monitor-plan.md @@ -0,0 +1,1282 @@ +# Swarm Monitor Implementation Plan + +> **For Claude:** REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task. + +**Goal:** Add a `swarm-monitor` binary that polls docker-compose services in `~/lab/swarm`, emits `swarm.snapshot` and `swarm.service.snapshot` events to NATS, and surfaces service status on the dashboard strip and a new unified `/infrastructure` page (replacing `/openclaw`). + +**Architecture:** New `cmd/swarm-monitor/main.go` polls via `docker inspect` exec commands and HTTP probes, emitting two event types per poll. The existing NATS → event-processor → postgres → query-api pipeline requires zero changes. Frontend adds a swarm strip to the dashboard and merges VM cards + service cards on a renamed `/infrastructure` page. + +**Tech Stack:** Go (exec/docker CLI, net/http), vanilla JS, existing NATS publisher pattern + +--- + +### Task 1: Add agentmon labels to docker-compose.yaml + +**Files:** +- Modify: `/home/will/lab/swarm/docker-compose.yaml` + +**Step 1: Add labels to each service** + +Add a `labels:` block to each monitored service. `litellm-init` is a one-shot container — do NOT label it. 
+ +For `whisper-server` (after its `healthcheck:` block): +```yaml + labels: + agentmon.monitor: "true" + agentmon.role: "voice" + agentmon.port: "18801" +``` + +For `kokoro-tts` (after `restart: unless-stopped`): +```yaml + labels: + agentmon.monitor: "true" + agentmon.role: "voice" + agentmon.port: "18805" +``` + +For `brave-search` (after its `environment:` block): +```yaml + labels: + agentmon.monitor: "true" + agentmon.role: "mcp" + agentmon.port: "18802" +``` + +For `searxng` (after its `volumes:` block): +```yaml + labels: + agentmon.monitor: "true" + agentmon.role: "search" + agentmon.port: "18803" +``` + +For `litellm` (after its `healthcheck:` block): +```yaml + labels: + agentmon.monitor: "true" + agentmon.role: "llm-proxy" + agentmon.port: "18804" +``` + +For `litellm-db` (after its `healthcheck:` block): +```yaml + labels: + agentmon.monitor: "true" + agentmon.role: "db" +``` + +For `n8n-agent` (after its `healthcheck:` block): +```yaml + labels: + agentmon.monitor: "true" + agentmon.role: "automation" + agentmon.port: "18808" +``` + +**Step 2: Verify labels appear in running containers** + +Labels are fixed when a container is created, so recreate the affected services first (for example `docker compose up -d` with the active profiles), then run: `docker ps --filter label=agentmon.monitor=true --format "table {{.Names}}\t{{.Status}}"` + +Expected: lists currently-running swarm containers (whichever profiles are active). + +**Step 3: Commit** + +```bash +cd /home/will/lab/swarm +git add docker-compose.yaml +git commit -m "feat: add agentmon monitor labels to swarm services" +``` + +--- + +### Task 2: Create swarm types + +**Files:** +- Create: `internal/monitor/swarm/types.go` + +**Step 1: Create the types file** + +```go +package swarm + +import "time" + +// ServiceSnapshot holds the collected state for one docker-compose service. 
+type ServiceSnapshot struct { + Name string `json:"name"` + Role string `json:"role"` + ContainerState string `json:"container_state"` // running/stopped/exited/missing + HealthState string `json:"health_state"` // healthy/unhealthy/starting/none + Status string `json:"status"` // healthy/degraded/down + UptimeSec int64 `json:"uptime_sec,omitempty"` + HTTPStatus *int `json:"http_status,omitempty"` + Extra map[string]any `json:"extra,omitempty"` +} + +// SwarmSnapshot holds a rolled-up snapshot of all labeled services. +type SwarmSnapshot struct { + Services []ServiceSnapshot `json:"services"` + Issues Issues `json:"issues"` + Timestamp time.Time `json:"timestamp"` +} + +// Issues flags notable problems detected during a poll. +type Issues struct { + ServiceDown []string `json:"service_down,omitempty"` + ServiceDegraded []string `json:"service_degraded,omitempty"` + LLMCooldowns bool `json:"llm_cooldowns,omitempty"` +} +``` + +**Step 2: Verify it compiles** + +Run: `cd /home/will/lab/agentmon && go build ./internal/monitor/swarm/` +Expected: no errors + +**Step 3: Commit** + +```bash +git add internal/monitor/swarm/types.go +git commit -m "feat: add swarm monitor types" +``` + +--- + +### Task 3: Create swarm collector + +**Files:** +- Create: `internal/monitor/swarm/collector.go` + +**Step 1: Create the collector** + +```go +package swarm + +import ( + "context" + "encoding/json" + "fmt" + "net/http" + "os/exec" + "strconv" + "strings" + "time" +) + +// Config holds collector configuration. +type Config struct { + LiteLLMBaseURL string + LiteLLMAPIKey string + HTTPTimeout time.Duration +} + +// dockerPsEntry is the JSON shape from `docker ps --format '{{json .}}'`. +type dockerPsEntry struct { + ID string `json:"ID"` + Names string `json:"Names"` + Status string `json:"Status"` + State string `json:"State"` +} + +// dockerInspectEntry is the minimal shape we need from `docker inspect`. 
+type dockerInspectEntry struct { + Name string `json:"Name"` + State struct { + Status string `json:"Status"` + Running bool `json:"Running"` + StartedAt string `json:"StartedAt"` + Health *struct { + Status string `json:"Status"` + } `json:"Health"` + } `json:"State"` + Config struct { + Labels map[string]string `json:"Labels"` + } `json:"Config"` +} + +// CollectAll lists all containers labeled agentmon.monitor=true and collects +// a ServiceSnapshot for each. +func CollectAll(ctx context.Context, cfg Config) ([]ServiceSnapshot, error) { + // List labeled containers (running + stopped). + out, err := exec.CommandContext(ctx, "docker", "ps", "-a", + "--filter", "label=agentmon.monitor=true", + "--format", "{{json .}}", + ).Output() + if err != nil { + return nil, fmt.Errorf("docker ps failed: %w", err) + } + + var entries []dockerPsEntry + for _, line := range strings.Split(strings.TrimSpace(string(out)), "\n") { + if line == "" { + continue + } + var e dockerPsEntry + if err := json.Unmarshal([]byte(line), &e); err != nil { + continue + } + entries = append(entries, e) + } + + client := &http.Client{Timeout: cfg.HTTPTimeout} + var snapshots []ServiceSnapshot + for _, e := range entries { + snap := collectOne(ctx, e.Names, client, cfg) + snapshots = append(snapshots, snap) + } + + return snapshots, nil +} + +func collectOne(ctx context.Context, name string, client *http.Client, cfg Config) ServiceSnapshot { + snap := ServiceSnapshot{ + Name: name, + ContainerState: "missing", + HealthState: "none", + Status: "down", + } + + // Inspect for detailed state. 
+ out, err := exec.CommandContext(ctx, "docker", "inspect", "--format", "{{json .}}", name).Output() + if err != nil { + return snap + } + + var detail dockerInspectEntry + if err := json.Unmarshal(out, &detail); err != nil { + return snap + } + + snap.Role = detail.Config.Labels["agentmon.role"] + snap.ContainerState = detail.State.Status + + if detail.State.Health != nil { + snap.HealthState = detail.State.Health.Status + } + + // Calculate uptime if running. + if detail.State.Running && detail.State.StartedAt != "" { + if t, err := time.Parse(time.RFC3339Nano, detail.State.StartedAt); err == nil { + snap.UptimeSec = int64(time.Since(t).Seconds()) + } + } + + // Role-specific probes. + switch snap.Role { + case "llm-proxy": + collectLLMProxy(ctx, &snap, client, cfg) + case "search": + collectHTTPProbe(ctx, &snap, client, "http://localhost:"+detail.Config.Labels["agentmon.port"]+"/") + case "mcp": + collectPortProbe(ctx, &snap, detail.Config.Labels["agentmon.port"]) + case "db", "voice", "automation": + // Docker healthcheck state is sufficient; no HTTP probe. + } + + snap.Status = deriveStatus(snap) + return snap +} + +func collectLLMProxy(ctx context.Context, snap *ServiceSnapshot, client *http.Client, cfg Config) { + if snap.Extra == nil { + snap.Extra = make(map[string]any) + } + + // Health probe. + req, _ := http.NewRequestWithContext(ctx, http.MethodGet, cfg.LiteLLMBaseURL+"/health/liveliness", nil) + resp, err := client.Do(req) + if err == nil { + code := resp.StatusCode + snap.HTTPStatus = &code + resp.Body.Close() + } + + // Model count. 
+ if cfg.LiteLLMAPIKey != "" { + req, _ := http.NewRequestWithContext(ctx, http.MethodGet, cfg.LiteLLMBaseURL+"/v2/model/info", nil) + req.Header.Set("Authorization", "Bearer "+cfg.LiteLLMAPIKey) + resp, err := client.Do(req) + if err == nil { + defer resp.Body.Close() + var result struct { + Data []struct { + ModelName string `json:"model_name"` + } `json:"data"` + } + if json.NewDecoder(resp.Body).Decode(&result) == nil { + snap.Extra["model_count"] = len(result.Data) + } + } + } +} + +func collectHTTPProbe(ctx context.Context, snap *ServiceSnapshot, client *http.Client, url string) { + start := time.Now() + req, _ := http.NewRequestWithContext(ctx, http.MethodGet, url, nil) + resp, err := client.Do(req) + if err == nil { + code := resp.StatusCode + snap.HTTPStatus = &code + resp.Body.Close() + ms := time.Since(start).Milliseconds() + if snap.Extra == nil { + snap.Extra = make(map[string]any) + } + snap.Extra["response_ms"] = ms + } +} + +func collectPortProbe(ctx context.Context, snap *ServiceSnapshot, port string) { + if port == "" { + return + } + // Use nc to check TCP reachability. + err := exec.CommandContext(ctx, "nc", "-z", "-w1", "localhost", port).Run() + reachable := err == nil + if snap.Extra == nil { + snap.Extra = make(map[string]any) + } + snap.Extra["port_reachable"] = reachable +} + +// deriveStatus computes the overall status from container state + health + probes. +func deriveStatus(snap ServiceSnapshot) string { + if snap.ContainerState != "running" { + return "down" + } + if snap.HealthState == "unhealthy" { + return "degraded" + } + if snap.HTTPStatus != nil && (*snap.HTTPStatus < 200 || *snap.HTTPStatus >= 400) { + return "degraded" + } + if reachable, ok := snap.Extra["port_reachable"].(bool); ok && !reachable { + return "degraded" + } + return "healthy" +} + +// DetectIssues scans a set of snapshots for notable problems. 
+func DetectIssues(services []ServiceSnapshot) Issues { + issues := Issues{} + for _, s := range services { + switch s.Status { + case "down": + issues.ServiceDown = append(issues.ServiceDown, s.Name) + case "degraded": + issues.ServiceDegraded = append(issues.ServiceDegraded, s.Name) + } + if s.Role == "llm-proxy" { + if extra := s.Extra; extra != nil { + if count, ok := extra["cooldown_count"].(int); ok && count > 0 { + issues.LLMCooldowns = true + } + } + } + } + return issues +} + +func intPtr(v int) *int { return &v } + +// Blank-identifier declarations must use var, not func; these keep the helper +// and the strconv import referenced so the package compiles. +var _ = intPtr // suppress unused warning +var _ = strconv.Itoa // imported for potential future use +``` + +**Step 2: Verify it compiles** + +Run: `cd /home/will/lab/agentmon && go build ./internal/monitor/swarm/` +Expected: no errors + +**Step 3: Commit** + +```bash +git add internal/monitor/swarm/collector.go +git commit -m "feat: add swarm collector with docker inspect + HTTP probes" +``` + +--- + +### Task 4: Create swarm-monitor binary + +**Files:** +- Create: `cmd/swarm-monitor/main.go` + +**Step 1: Create the binary** + +```go +package main + +import ( + "context" + "encoding/json" + "log" + "os" + "time" + + "agentmon/internal/monitor/swarm" + qnats "agentmon/internal/queue/nats" +) + +func main() { + natsURL := envDefault("NATS_URL", "nats://nats:4222") + natsTopic := envDefault("NATS_TOPIC", "agentmon.events.v1") + interval := envDefault("POLL_INTERVAL", "30s") + litellmBase := envDefault("LITELLM_BASE_URL", "http://localhost:18804") + litellmKey := os.Getenv("LITELLM_MASTER_KEY") + + pub, err := qnats.NewPublisher(natsURL, natsTopic) + if err != nil { + log.Fatalf("failed to connect to NATS: %v", err) + } + defer pub.Close() + + pollDuration, err := time.ParseDuration(interval) + if err != nil { + log.Fatalf("invalid poll interval: %v", err) + } + + cfg := swarm.Config{ + LiteLLMBaseURL: litellmBase, + LiteLLMAPIKey: litellmKey, + HTTPTimeout: 5 * time.Second, + } + + ticker := time.NewTicker(pollDuration) + defer ticker.Stop() + + 
ctx := context.Background() + log.Printf("swarm-monitor started, polling every %s", pollDuration) + + // Poll immediately on start. + if err := poll(ctx, pub, cfg); err != nil { + log.Printf("initial poll error: %v", err) + } + + for range ticker.C { + if err := poll(ctx, pub, cfg); err != nil { + log.Printf("poll error: %v", err) + } + } +} + +func poll(ctx context.Context, pub *qnats.Publisher, cfg swarm.Config) error { + services, err := swarm.CollectAll(ctx, cfg) + if err != nil { + return err + } + + issues := swarm.DetectIssues(services) + now := time.Now().UTC() + + // Emit rolled-up swarm.snapshot. + if err := emit(ctx, pub, "swarm.snapshot", "agentmon.swarm", map[string]any{ + "services": services, + "issues": issues, + }, now); err != nil { + log.Printf("failed to emit swarm.snapshot: %v", err) + } + + // Emit one swarm.service.snapshot per service. + for _, svc := range services { + if err := emit(ctx, pub, "swarm.service.snapshot", "agentmon.swarm.service", map[string]any{ + "service": svc, + }, now); err != nil { + log.Printf("failed to emit swarm.service.snapshot for %s: %v", svc.Name, err) + } + } + + return nil +} + +func emit(ctx context.Context, pub *qnats.Publisher, eventType, schemaName string, payload map[string]any, ts time.Time) error { + event := map[string]any{ + "schema": map[string]any{ + "name": schemaName, + "version": 1, + }, + "event": map[string]any{ + "id": generateID(), + "type": eventType, + "ts": ts.Format(time.RFC3339Nano), + }, + "payload": payload, + } + + data, err := json.Marshal(event) + if err != nil { + return err + } + + return pub.Publish(ctx, data) +} + +func generateID() string { + return time.Now().Format("20060102150405") + "-" + randomString(8) +} + +func randomString(n int) string { + const chars = "abcdefghijklmnopqrstuvwxyz0123456789" + b := make([]byte, n) + for i := range b { + b[i] = chars[time.Now().Nanosecond()%len(chars)] + time.Sleep(time.Nanosecond) + } + return string(b) +} + +func envDefault(key, def 
string) string { + if v := os.Getenv(key); v != "" { + return v + } + return def +} +``` + +**Step 2: Verify it compiles** + +Run: `cd /home/will/lab/agentmon && go build ./cmd/swarm-monitor/` +Expected: no errors + +**Step 3: Verify all binaries still build** + +Run: `cd /home/will/lab/agentmon && go build ./...` +Expected: no errors + +**Step 4: Commit** + +```bash +git add cmd/swarm-monitor/main.go +git commit -m "feat: add swarm-monitor binary" +``` + +--- + +### Task 5: Dashboard swarm strip + +**Files:** +- Modify: `cmd/web-ui/static/app.js` +- Modify: `cmd/web-ui/static/style.css` + +**Step 1: Add swarmState and merge function to app.js** + +Near the top of the IIFE, alongside the existing `let openclawState = ...` declaration (line ~49), add: + +```js +let swarmState = { services: {} }; // keyed by service name +``` + +After the existing `mergeOpenClawEvents` function (~line 716), add: + +```js +function mergeSwarmSnapshot(evt) { + const payload = getEnvelopePayload(evt); + const services = payload.services || []; + for (const svc of services) { + if (svc.name) swarmState.services[svc.name] = svc; + } +} + +function mergeSwarmServiceSnapshot(evt) { + const payload = getEnvelopePayload(evt); + const svc = payload.service; + if (svc && svc.name) swarmState.services[svc.name] = svc; +} +``` + +**Step 2: Add swarm strip to renderDashboard** + +In `renderDashboard()`, the HTML template already has: +```html +
+``` + +Right after that line, add a swarm strip div: +```html +
+``` + +**Step 3: Add renderSwarmStrip function** + +After the `renderAgentVMStrip_dash` function (~line 1351), add: + +```js +function renderSwarmStrip_dash() { + const strip = document.getElementById('dash-swarm-strip'); + if (!strip) return; + const services = Object.values(swarmState.services); + if (services.length === 0) return; + strip.innerHTML = services.map(svc => { + const statusClass = svc.status === 'healthy' ? 'active' + : svc.status === 'degraded' ? 'degraded' : 'inactive'; + const label = svc.status || 'unknown'; + return ` +
+ + ${escapeHTML(svc.name)} + ${escapeHTML(label)} +
+ `; + }).join(''); +} +``` + +**Step 4: Wire swarm strip into dashboard data load** + +In `renderDashboard()`, the `Promise.all` block loads initial data. After `mergeOpenClawEvents(snapshots.events || [])` and `renderAgentVMStrip_dash()`, add: + +```js +const swarmSnaps = await api('/v1/events?event_type=swarm.snapshot&limit=10').catch(() => ({ events: [] })); +for (const evt of swarmSnaps.events || []) mergeSwarmSnapshot(evt); +renderSwarmStrip_dash(); +``` + +Note: this needs to be inside the try block, before the `if (!isCurrentPath('/')) return;` guard. The simplest placement is to add it to the `Promise.all` array: + +Replace the `Promise.all` call in `renderDashboard` to add swarm snapshots: +```js +const [summaryData, tsData, recentData, snapshots, swarmSnaps] = await Promise.all([ + api('/v1/stats/summary'), + api('/v1/stats/timeseries?window=1h'), + api('/v1/events?limit=20'), + api('/v1/events?event_type=openclaw.snapshot&limit=100').catch(() => ({ events: [] })), + api('/v1/events?event_type=swarm.snapshot&limit=10').catch(() => ({ events: [] })), +]); +``` + +Then after `renderAgentVMStrip_dash()`: +```js +for (const evt of swarmSnaps.events || []) mergeSwarmSnapshot(evt); +renderSwarmStrip_dash(); +``` + +**Step 5: Handle swarm events in handleDashboardWS** + +In `handleDashboardWS`, after the `openclaw.snapshot` handler block, add: + +```js +if (eventType === 'swarm.snapshot') { + mergeSwarmSnapshot(msg.data); + renderSwarmStrip_dash(); + return; +} +if (eventType === 'swarm.service.snapshot') { + mergeSwarmServiceSnapshot(msg.data); + renderSwarmStrip_dash(); + return; +} +``` + +**Step 6: Add swarm strip CSS** + +In `style.css`, after the `.vm-pill-label` block (~line 750), add: + +```css +/* ── Swarm strip ──────────────────────────────────────────── */ +.swarm-strip { + display: flex; + flex-wrap: wrap; + gap: 0.75rem; + margin-bottom: 1.5rem; +} + +.vm-pill.degraded { + border-color: rgba(251, 191, 36, 0.3); +} + +.vm-pill.degraded .vm-pill-dot 
{ + background: var(--warning); +} +``` + +**Step 7: Verify no JS errors** + +Build check: `cd /home/will/lab/agentmon && go build ./...` +Expected: no errors + +**Step 8: Commit** + +```bash +git add cmd/web-ui/static/app.js cmd/web-ui/static/style.css +git commit -m "feat: add swarm strip to dashboard" +``` + +--- + +### Task 6: Infrastructure page CSS + +**Files:** +- Modify: `cmd/web-ui/static/style.css` + +**Step 1: Add infrastructure page styles** + +Append to the end of `style.css`: + +```css +/* ── Infrastructure page ──────────────────────────────────── */ +.infra-section-title { + font-family: var(--font-display); + font-size: 0.75rem; + font-weight: 700; + color: var(--text-dim); + text-transform: uppercase; + letter-spacing: 0.12em; + margin: 0 0 1rem 0; +} + +.infra-section { + margin-bottom: 2rem; +} + +/* Service card grid */ +.service-grid { + display: grid; + grid-template-columns: repeat(auto-fill, minmax(260px, 1fr)); + gap: 1.25rem; +} + +.service-card { + background: var(--surface); + border: 1px solid var(--border); + border-radius: var(--radius-lg); + padding: 1.125rem 1.25rem; + display: flex; + flex-direction: column; + gap: 0.75rem; + transition: border-color 0.2s; +} + +.service-card:hover { + border-color: rgba(34, 211, 238, 0.15); +} + +.service-card-header { + display: flex; + align-items: center; + justify-content: space-between; +} + +.service-card-name { + font-family: var(--font-mono); + font-size: 0.88rem; + font-weight: 600; + color: var(--text-bright); +} + +.service-badge { + font-size: 0.65rem; + font-weight: 700; + text-transform: uppercase; + letter-spacing: 0.08em; + padding: 0.2rem 0.55rem; + border-radius: 999px; +} + +.service-badge.healthy { + background: rgba(52, 211, 153, 0.12); + color: var(--success); + border: 1px solid rgba(52, 211, 153, 0.2); +} + +.service-badge.degraded { + background: rgba(251, 191, 36, 0.12); + color: var(--warning); + border: 1px solid rgba(251, 191, 36, 0.2); +} + +.service-badge.down { + 
background: rgba(248, 113, 113, 0.12); + color: var(--error); + border: 1px solid rgba(248, 113, 113, 0.2); +} + +.service-role-tag { + font-size: 0.65rem; + font-family: var(--font-mono); + color: var(--text-dim); + margin-top: -0.25rem; +} + +.service-stats { + display: flex; + flex-direction: column; + gap: 0.3rem; + font-size: 0.78rem; +} + +.service-stat-row { + display: flex; + justify-content: space-between; + align-items: center; +} + +.service-stat-label { + color: var(--text-dim); + font-family: var(--font-mono); + font-size: 0.72rem; +} + +.service-stat-value { + color: var(--text); + font-family: var(--font-mono); + font-size: 0.75rem; +} + +.service-stat-value.ok { color: var(--success); } +.service-stat-value.warn { color: var(--warning); } +.service-stat-value.bad { color: var(--error); } + +/* LiteLLM cooldown warning */ +.llm-cooldown-banner { + background: rgba(251, 191, 36, 0.08); + border: 1px solid rgba(251, 191, 36, 0.2); + border-radius: var(--radius); + padding: 0.4rem 0.625rem; + font-size: 0.72rem; + color: var(--warning); + font-family: var(--font-mono); +} + +/* LiteLLM model count highlight */ +.llm-model-count { + font-family: var(--font-display); + font-size: 1.5rem; + font-weight: 800; + color: var(--text-bright); + letter-spacing: -0.02em; + line-height: 1; +} + +.llm-model-label { + font-size: 0.68rem; + color: var(--text-dim); + text-transform: uppercase; + letter-spacing: 0.08em; +} +``` + +**Step 2: Commit** + +```bash +git add cmd/web-ui/static/style.css +git commit -m "feat: add infrastructure page CSS" +``` + +--- + +### Task 7: Infrastructure page JS + nav rename + +**Files:** +- Modify: `cmd/web-ui/static/app.js` +- Modify: `cmd/web-ui/static/index.html` + +**Step 1: Update nav in index.html** + +Change the nav link from `OpenClaw` to `Infra` and update the href: + +Old: +```html + +``` + +New: +```html + +``` + +**Step 2: Update the router in app.js** + +Change line ~153: +```js +} else if (path.startsWith('/openclaw')) { 
+ renderOpenClaw(); +``` +to: +```js +} else if (path.startsWith('/infrastructure')) { + renderInfrastructure(); +``` + +**Step 3: Add infraUnsubscribe state variable** + +Near the existing `let openclawUnsubscribe = null;` declaration (~line 50), add: +```js +let infraUnsubscribe = null; +``` + +**Step 4: Update cleanupLiveViews to clean up infra subscription** + +Find the `cleanupLiveViews` function (~line 107). Replace: +```js +if (openclawUnsubscribe) { + openclawUnsubscribe(); + openclawUnsubscribe = null; +} +``` +with: +```js +if (openclawUnsubscribe) { + openclawUnsubscribe(); + openclawUnsubscribe = null; +} +if (infraUnsubscribe) { + infraUnsubscribe(); + infraUnsubscribe = null; +} +``` + +**Step 5: Replace renderOpenClaw with renderInfrastructure** + +Replace the existing `renderOpenClaw` function (lines ~664-680) entirely with: + +```js +async function renderInfrastructure() { + app.innerHTML = '

Loading...

'; + + infraUnsubscribe = subscribeWS(handleInfraWS); + + try { + const [ocData, swarmData] = await Promise.all([ + api('/v1/events?event_type=openclaw.snapshot&limit=100'), + api('/v1/events?event_type=swarm.snapshot&limit=10').catch(() => ({ events: [] })), + ]); + + mergeOpenClawEvents(ocData.events || []); + for (const evt of swarmData.events || []) mergeSwarmSnapshot(evt); + + if (isCurrentPath('/infrastructure')) { + renderInfraGrid(); + } + } catch (e) { + if (isCurrentPath('/infrastructure')) { + app.innerHTML = `

Error: ${escapeHTML(e.message)}

`; + } + } +} +``` + +**Step 6: Replace handleOpenClawWS with handleInfraWS** + +Replace the existing `handleOpenClawWS` function (lines ~682-699) with: + +```js +function handleInfraWS(msg) { + if (msg.type !== 'message') return; + + const eventType = getEnvelopeType(msg.data); + + if (eventType === 'openclaw.snapshot') { + mergeOpenClawEvents([msg.data]); + if (isCurrentPath('/infrastructure')) renderInfraGrid(); + if (isCurrentPath('/agents')) renderAgentVMStrip(); + return; + } + + if (eventType === 'swarm.snapshot') { + mergeSwarmSnapshot(msg.data); + if (isCurrentPath('/infrastructure')) renderInfraGrid(); + renderSwarmStrip_dash(); + return; + } + + if (eventType === 'swarm.service.snapshot') { + mergeSwarmServiceSnapshot(msg.data); + if (isCurrentPath('/infrastructure')) renderInfraGrid(); + renderSwarmStrip_dash(); + return; + } +} +``` + +**Step 7: Add renderInfraGrid function** + +Replace the existing `renderOpenClawGrid` function (lines ~718-785) with a new `renderInfraGrid` that shows both VMs and service cards. Add it right after the new `handleInfraWS` function: + +```js +function renderInfraGrid() { + const vmNames = Object.keys(openclawState.instances).sort(); + const services = Object.values(swarmState.services); + + app.innerHTML = ` + + +
+

VMs

+ ${vmNames.length === 0 + ? '

No VM data

' + : `
${vmNames.map(name => renderVMCard(name)).join('')}
` + } +
+ +
+

Services

+ ${services.length === 0 + ? '

No swarm service data

' + : `
${services.map(svc => renderServiceCard(svc)).join('')}
` + } +
+ `; +} + +function renderVMCard(name) { + const evt = openclawState.instances[name]; + const payload = getEnvelopePayload(evt); + const inst = payload.instance || {}; + const host = payload.host || {}; + const guest = payload.guest; + const issues = payload.issues; + + return ` +
+
+

${escapeHTML(inst.name || name)}

+
+ ${host.state === 'running' ? 'Running' : 'Stopped'} +
+
+
Updated ${escapeHTML(relativeTime(getEnvelopeTS(evt)))}
+ + + + + + + +
Host${escapeHTML(inst.host || '-')}
Domain${escapeHTML(inst.domain || '-')}
vCPUs${host.vcpus || '-'}
Memory${escapeHTML(formatBytes(host.memory_kib ? host.memory_kib * 1024 : 0) || '-')}
Disk${escapeHTML(formatBytes(host.disk_actual_bytes) || '-')}
Autostart${host.autostart ? 'Yes' : 'No'}
+ ${guest ? ` +
+ + + + + + + + +
Gateway${guest.service_active ? 'Active' : 'Inactive'}
HTTP${guest.http_status || 'N/A'}
Version${escapeHTML(guest.version || '-')}
Guest Mem${guest.memory_percent !== undefined ? guest.memory_percent.toFixed(1) : '-'}%
Guest Disk${guest.disk_percent !== undefined ? guest.disk_percent.toFixed(1) : '-'}%
Load${guest.load_average !== undefined ? guest.load_average.toFixed(2) : '-'}
Uptime${escapeHTML(guest.service_uptime || '-')}
+ ` : ''} + ${issues && Object.values(issues).some(Boolean) ? ` +
+
Issues
+
+ ${Object.entries(issues).filter(([, value]) => value).map(([key]) => ` + ${escapeHTML(key.replace(/_/g, ' '))} + `).join('')} +
+ ` : ''} +
+ `; +} + +function renderServiceCard(svc) { + const role = svc.role || 'unknown'; + switch (role) { + case 'llm-proxy': return renderLLMProxyCard(svc); + case 'db': return renderDBCard(svc); + case 'search': return renderSearchCard(svc); + case 'mcp': return renderMCPCard(svc); + case 'voice': return renderVoiceCard(svc); + case 'automation':return renderAutomationCard(svc); + default: return renderGenericServiceCard(svc); + } +} + +function serviceCardHeader(svc) { + return ` +
+
+
${escapeHTML(svc.name)}
+
${escapeHTML(svc.role || '')}
+
+ ${escapeHTML(svc.status || 'down')} +
+ `; +} + +function serviceStatRow(label, value, valueClass) { + return ` +
+ ${escapeHTML(label)} + ${value} +
+ `; +} + +function formatUptime(sec) { + if (!sec) return '-'; + if (sec < 60) return sec + 's'; + if (sec < 3600) return Math.floor(sec / 60) + 'm'; + if (sec < 86400) return Math.floor(sec / 3600) + 'h ' + Math.floor((sec % 3600) / 60) + 'm'; + return Math.floor(sec / 86400) + 'd ' + Math.floor((sec % 86400) / 3600) + 'h'; +} + +function renderLLMProxyCard(svc) { + const extra = svc.extra || {}; + const modelCount = extra.model_count; + const cooldowns = extra.cooldown_count || 0; + const httpStatus = svc.http_status; + const httpClass = httpStatus === 200 ? 'ok' : httpStatus ? 'bad' : ''; + + return ` +
+ ${serviceCardHeader(svc)} +
+ ${modelCount !== undefined ? modelCount : '-'} + models +
+ ${cooldowns > 0 ? `
⚠ ${cooldowns} model${cooldowns > 1 ? 's' : ''} in cooldown
` : ''} +
+ ${serviceStatRow('HTTP', httpStatus ? String(httpStatus) : '-', httpClass)} + ${serviceStatRow('Uptime', formatUptime(svc.uptime_sec), '')} + ${serviceStatRow('Container', escapeHTML(svc.container_state || '-'), svc.container_state === 'running' ? 'ok' : 'bad')} +
+
+ `; +} + +function renderDBCard(svc) { + const healthClass = svc.health_state === 'healthy' ? 'ok' : svc.health_state === 'unhealthy' ? 'bad' : ''; + return ` +
+ ${serviceCardHeader(svc)} +
+ ${serviceStatRow('Health', escapeHTML(svc.health_state || 'none'), healthClass)} + ${serviceStatRow('Uptime', formatUptime(svc.uptime_sec), '')} + ${serviceStatRow('Container', escapeHTML(svc.container_state || '-'), svc.container_state === 'running' ? 'ok' : 'bad')} +
+
+ `; +} + +function renderSearchCard(svc) { + const extra = svc.extra || {}; + const ms = extra.response_ms; + const httpStatus = svc.http_status; + const httpClass = httpStatus === 200 ? 'ok' : httpStatus ? 'bad' : ''; + return ` +
+ ${serviceCardHeader(svc)} +
+ ${serviceStatRow('HTTP', httpStatus ? String(httpStatus) : '-', httpClass)} + ${ms !== undefined ? serviceStatRow('Response', ms + 'ms', ms < 500 ? 'ok' : 'warn') : ''} + ${serviceStatRow('Uptime', formatUptime(svc.uptime_sec), '')} +
+
+ `; +} + +function renderMCPCard(svc) { + const extra = svc.extra || {}; + const reachable = extra.port_reachable; + return ` +
+ ${serviceCardHeader(svc)} +
+ ${reachable !== undefined ? serviceStatRow('Port', reachable ? 'reachable' : 'unreachable', reachable ? 'ok' : 'bad') : ''} + ${serviceStatRow('Container', escapeHTML(svc.container_state || '-'), svc.container_state === 'running' ? 'ok' : 'bad')} + ${serviceStatRow('Uptime', formatUptime(svc.uptime_sec), '')} +
+
+ `; +} + +function renderVoiceCard(svc) { + const healthClass = svc.health_state === 'healthy' ? 'ok' : svc.health_state === 'unhealthy' ? 'bad' : ''; + return ` +
+ ${serviceCardHeader(svc)} +
+ ${serviceStatRow('Health', escapeHTML(svc.health_state || 'none'), healthClass)} + ${serviceStatRow('Container', escapeHTML(svc.container_state || '-'), svc.container_state === 'running' ? 'ok' : 'bad')} + ${serviceStatRow('Uptime', formatUptime(svc.uptime_sec), '')} +
+
+ `; +} + +function renderAutomationCard(svc) { + const healthClass = svc.health_state === 'healthy' ? 'ok' : svc.health_state === 'unhealthy' ? 'bad' : ''; + return ` +
+ ${serviceCardHeader(svc)} +
+ ${serviceStatRow('Health', escapeHTML(svc.health_state || 'none'), healthClass)} + ${serviceStatRow('Container', escapeHTML(svc.container_state || '-'), svc.container_state === 'running' ? 'ok' : 'bad')} + ${serviceStatRow('Uptime', formatUptime(svc.uptime_sec), '')} +
+
+ `; +} + +function renderGenericServiceCard(svc) { + return ` +
+ ${serviceCardHeader(svc)} +
+ ${serviceStatRow('Container', escapeHTML(svc.container_state || '-'), svc.container_state === 'running' ? 'ok' : 'bad')} + ${serviceStatRow('Uptime', formatUptime(svc.uptime_sec), '')} +
+
+ `; +} +``` + +**Step 8: Verify build** + +Run: `cd /home/will/lab/agentmon && go build ./...` +Expected: no errors + +**Step 9: Commit** + +```bash +git add cmd/web-ui/static/app.js cmd/web-ui/static/index.html +git commit -m "feat: rename OpenClaw to Infrastructure page, add service cards" +``` + +--- + +### Task 8: End-to-end verification + +**Step 1: Build all binaries** + +Run: `cd /home/will/lab/agentmon && go build ./...` +Expected: no errors + +**Step 2: Test docker label filtering manually** + +Run: `docker ps -a --filter label=agentmon.monitor=true --format "table {{.Names}}\t{{.Labels}}\t{{.Status}}"` +Expected: lists every labeled swarm container, running or stopped (because of `-a`), with its labels and status + +**Step 3: Test swarm-monitor dry run** + +Run: +```bash +cd /home/will/lab/agentmon +NATS_URL=nats://localhost:4222 LITELLM_MASTER_KEY=$(source /home/will/lab/swarm/.env && echo $LITELLM_MASTER_KEY) \ + go run ./cmd/swarm-monitor/ 2>&1 | head -20 +``` +Expected: logs "swarm-monitor started" and then publishes events. Because `main` connects to NATS before the first poll, an unreachable NATS may instead stop the binary at startup with "failed to connect to NATS" before any collection runs; start NATS (or point `NATS_URL` at a reachable instance) to exercise the collection phase. + +**Step 4: Navigate to /infrastructure in browser** + +Open the web UI and navigate to `/infrastructure`. +Verify: +- Nav shows "Infra" link, active when on `/infrastructure` +- VMs section shows existing openclaw cards +- Services section shows either cards (if swarm events exist in DB) or "No swarm service data" + +**Step 5: Verify swarm strip on dashboard** + +Navigate to `/`. +Verify: +- VM strip still shows (zap/orb/sun) +- Swarm strip renders below it (may be empty if no `swarm.snapshot` events in DB yet) + +**Step 6: Final commit if any fixes needed** + +```bash +git add -A +git commit -m "fix: infrastructure page and swarm strip polish" +```
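As an optional complement to the manual verification steps above, the status rules from the Task 3 collector can be exercised without Docker or NATS. The sketch below is not part of the plan: it assumes a trimmed local copy of `ServiceSnapshot` (only the fields `deriveStatus` reads), copies the `deriveStatus` rules as written in Task 3, and uses illustrative case names.

```go
package main

import "fmt"

// ServiceSnapshot is a trimmed, local copy of the plan's type (assumption:
// only the fields read by deriveStatus are kept so this file stands alone).
type ServiceSnapshot struct {
	ContainerState string
	HealthState    string
	HTTPStatus     *int
	Extra          map[string]any
}

// deriveStatus mirrors the Task 3 rules: a container that is not running is
// "down"; an unhealthy healthcheck, a non-2xx/3xx HTTP probe, or an
// unreachable port is "degraded"; anything else is "healthy".
func deriveStatus(snap ServiceSnapshot) string {
	if snap.ContainerState != "running" {
		return "down"
	}
	if snap.HealthState == "unhealthy" {
		return "degraded"
	}
	if snap.HTTPStatus != nil && (*snap.HTTPStatus < 200 || *snap.HTTPStatus >= 400) {
		return "degraded"
	}
	// Reading a missing key from a nil map is safe; the assertion just fails.
	if reachable, ok := snap.Extra["port_reachable"].(bool); ok && !reachable {
		return "degraded"
	}
	return "healthy"
}

func intPtr(v int) *int { return &v }

func main() {
	// Hypothetical cases covering each branch of deriveStatus.
	cases := []struct {
		name string
		snap ServiceSnapshot
		want string
	}{
		{"exited container", ServiceSnapshot{ContainerState: "exited"}, "down"},
		{"unhealthy check", ServiceSnapshot{ContainerState: "running", HealthState: "unhealthy"}, "degraded"},
		{"http 503", ServiceSnapshot{ContainerState: "running", HTTPStatus: intPtr(503)}, "degraded"},
		{"port closed", ServiceSnapshot{ContainerState: "running", Extra: map[string]any{"port_reachable": false}}, "degraded"},
		{"all good", ServiceSnapshot{ContainerState: "running", HealthState: "healthy", HTTPStatus: intPtr(200)}, "healthy"},
	}
	for _, c := range cases {
		got := deriveStatus(c.snap)
		fmt.Printf("%s: want=%s got=%s match=%v\n", c.name, c.want, got, got == c.want)
	}
}
```

The same table could be moved into a `_test.go` file next to `collector.go` if the project wants it as a regular unit test; that placement is a suggestion, not something the plan specifies.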