# Swarm Monitor Implementation Plan

> **For Claude:** REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.

**Goal:** Add a `swarm-monitor` binary that polls docker-compose services in `~/lab/swarm`, emits `swarm.snapshot` and `swarm.service.snapshot` events to NATS, and surfaces service status on the dashboard strip and a new unified `/infrastructure` page (replacing `/openclaw`).

**Architecture:** New `cmd/swarm-monitor/main.go` polls via `docker inspect` exec commands and HTTP probes, emitting two event types per poll. The existing NATS → event-processor → postgres → query-api pipeline requires zero changes. Frontend adds a swarm strip to the dashboard and merges VM cards + service cards on a renamed `/infrastructure` page.

**Tech Stack:** Go (exec/docker CLI, net/http), vanilla JS, existing NATS publisher pattern

---

### Task 1: Add agentmon labels to docker-compose.yaml

**Files:**
- Modify: `/home/will/lab/swarm/docker-compose.yaml`

**Step 1: Add labels to each service**

Add a `labels:` block to each monitored service. `litellm-init` is a one-shot container, so do NOT label it.

For `whisper-server` (after its `healthcheck:` block):
```yaml
labels:
  agentmon.monitor: "true"
  agentmon.role: "voice"
  agentmon.port: "18801"
```

For `kokoro-tts` (after `restart: unless-stopped`):
```yaml
labels:
  agentmon.monitor: "true"
  agentmon.role: "voice"
  agentmon.port: "18805"
```

For `brave-search` (after its `environment:` block):
```yaml
labels:
  agentmon.monitor: "true"
  agentmon.role: "mcp"
  agentmon.port: "18802"
```

For `searxng` (after its `volumes:` block):
```yaml
labels:
  agentmon.monitor: "true"
  agentmon.role: "search"
  agentmon.port: "18803"
```

For `litellm` (after its `healthcheck:` block):
```yaml
labels:
  agentmon.monitor: "true"
  agentmon.role: "llm-proxy"
  agentmon.port: "18804"
```

For `litellm-db` (after its `healthcheck:` block; no port label, since it is not probed over HTTP or TCP):
```yaml
labels:
  agentmon.monitor: "true"
  agentmon.role: "db"
```

For `n8n-agent` (after its `healthcheck:` block):
```yaml
labels:
  agentmon.monitor: "true"
  agentmon.role: "automation"
  agentmon.port: "18808"
```

**Step 2: Verify labels appear in running containers**

Labels are fixed when a container is created, so recreate the affected services first (for example `docker compose up -d` with the usual profiles) before checking.

Run: `docker ps --filter label=agentmon.monitor=true --format "table {{.Names}}\t{{.Status}}"`
Expected: lists currently-running swarm containers (whichever profiles are active).

**Step 3: Commit**

```bash
cd /home/will/lab/swarm
git add docker-compose.yaml
git commit -m "feat: add agentmon monitor labels to swarm services"
```

---

### Task 2: Create swarm types

**Files:**
- Create: `internal/monitor/swarm/types.go`

**Step 1: Create the types file**

```go
package swarm

import "time"

// ServiceSnapshot holds the collected state for one docker-compose service.
type ServiceSnapshot struct {
	Name           string         `json:"name"`
	Role           string         `json:"role"`
	ContainerState string         `json:"container_state"` // running/stopped/exited/missing
	HealthState    string         `json:"health_state"`    // healthy/unhealthy/starting/none
	Status         string         `json:"status"`          // healthy/degraded/down
	UptimeSec      int64          `json:"uptime_sec,omitempty"`
	HTTPStatus     *int           `json:"http_status,omitempty"`
	Extra          map[string]any `json:"extra,omitempty"`
}

// SwarmSnapshot holds a rolled-up snapshot of all labeled services.
type SwarmSnapshot struct {
	Services  []ServiceSnapshot `json:"services"`
	Issues    Issues            `json:"issues"`
	Timestamp time.Time         `json:"timestamp"`
}

// Issues flags notable problems detected during a poll.
type Issues struct {
	ServiceDown     []string `json:"service_down,omitempty"`
	ServiceDegraded []string `json:"service_degraded,omitempty"`
	LLMCooldowns    bool     `json:"llm_cooldowns,omitempty"`
}
```

**Step 2: Verify it compiles**

Run: `cd /home/will/lab/agentmon && go build ./internal/monitor/swarm/`
Expected: no errors

**Step 3: Commit**

```bash
git add internal/monitor/swarm/types.go
git commit -m "feat: add swarm monitor types"
```

---

### Task 3: Create swarm collector

**Files:**
- Create: `internal/monitor/swarm/collector.go`

**Step 1: Create the collector**

```go
package swarm

import (
	"context"
	"encoding/json"
	"fmt"
	"net/http"
	"os/exec"
	"strconv"
	"strings"
	"time"
)

// Config holds collector configuration.
type Config struct {
	LiteLLMBaseURL string
	LiteLLMAPIKey  string
	HTTPTimeout    time.Duration
}

// dockerPsEntry is the JSON shape from `docker ps --format '{{json .}}'`.
type dockerPsEntry struct {
	ID     string `json:"ID"`
	Names  string `json:"Names"`
	Status string `json:"Status"`
	State  string `json:"State"`
}

// dockerInspectEntry is the minimal shape we need from `docker inspect`.
type dockerInspectEntry struct {
	Name  string `json:"Name"`
	State struct {
		Status    string `json:"Status"`
		Running   bool   `json:"Running"`
		StartedAt string `json:"StartedAt"`
		Health    *struct {
			Status string `json:"Status"`
		} `json:"Health"`
	} `json:"State"`
	Config struct {
		Labels map[string]string `json:"Labels"`
	} `json:"Config"`
}

// CollectAll lists all containers labeled agentmon.monitor=true and collects
// a ServiceSnapshot for each.
func CollectAll(ctx context.Context, cfg Config) ([]ServiceSnapshot, error) {
	// List labeled containers (running + stopped).
	out, err := exec.CommandContext(ctx, "docker", "ps", "-a",
		"--filter", "label=agentmon.monitor=true",
		"--format", "{{json .}}",
	).Output()
	if err != nil {
		return nil, fmt.Errorf("docker ps failed: %w", err)
	}

	// docker ps emits one JSON object per line; skip lines that fail to parse.
	var entries []dockerPsEntry
	for _, line := range strings.Split(strings.TrimSpace(string(out)), "\n") {
		if line == "" {
			continue
		}
		var e dockerPsEntry
		if err := json.Unmarshal([]byte(line), &e); err != nil {
			continue
		}
		entries = append(entries, e)
	}

	client := &http.Client{Timeout: cfg.HTTPTimeout}
	var snapshots []ServiceSnapshot
	for _, e := range entries {
		snap := collectOne(ctx, e.Names, client, cfg)
		snapshots = append(snapshots, snap)
	}
	return snapshots, nil
}

func collectOne(ctx context.Context, name string, client *http.Client, cfg Config) ServiceSnapshot {
	// Defaults describe a container that could not be inspected.
	snap := ServiceSnapshot{
		Name:           name,
		ContainerState: "missing",
		HealthState:    "none",
		Status:         "down",
	}

	// Inspect for detailed state.
	out, err := exec.CommandContext(ctx, "docker", "inspect", "--format", "{{json .}}", name).Output()
	if err != nil {
		return snap
	}
	var detail dockerInspectEntry
	if err := json.Unmarshal(out, &detail); err != nil {
		return snap
	}

	snap.Role = detail.Config.Labels["agentmon.role"]
	snap.ContainerState = detail.State.Status
	if detail.State.Health != nil {
		snap.HealthState = detail.State.Health.Status
	}

	// Calculate uptime if running.
	if detail.State.Running && detail.State.StartedAt != "" {
		if t, err := time.Parse(time.RFC3339Nano, detail.State.StartedAt); err == nil {
			snap.UptimeSec = int64(time.Since(t).Seconds())
		}
	}

	// Role-specific probes.
	switch snap.Role {
	case "llm-proxy":
		collectLLMProxy(ctx, &snap, client, cfg)
	case "search":
		collectHTTPProbe(ctx, &snap, client, "http://localhost:"+detail.Config.Labels["agentmon.port"]+"/")
	case "mcp":
		collectPortProbe(ctx, &snap, detail.Config.Labels["agentmon.port"])
	case "db", "voice", "automation":
		// Docker healthcheck state is sufficient; no HTTP probe.
	}

	snap.Status = deriveStatus(snap)
	return snap
}

func collectLLMProxy(ctx context.Context, snap *ServiceSnapshot, client *http.Client, cfg Config) {
	if snap.Extra == nil {
		snap.Extra = make(map[string]any)
	}

	// Health probe.
	req, _ := http.NewRequestWithContext(ctx, http.MethodGet, cfg.LiteLLMBaseURL+"/health/liveliness", nil)
	resp, err := client.Do(req)
	if err == nil {
		code := resp.StatusCode
		snap.HTTPStatus = &code
		resp.Body.Close()
	}

	// Model count.
	if cfg.LiteLLMAPIKey != "" {
		req, _ := http.NewRequestWithContext(ctx, http.MethodGet, cfg.LiteLLMBaseURL+"/v2/model/info", nil)
		req.Header.Set("Authorization", "Bearer "+cfg.LiteLLMAPIKey)
		resp, err := client.Do(req)
		if err == nil {
			defer resp.Body.Close()
			var result struct {
				Data []struct {
					ModelName string `json:"model_name"`
				} `json:"data"`
			}
			if json.NewDecoder(resp.Body).Decode(&result) == nil {
				snap.Extra["model_count"] = len(result.Data)
			}
		}
	}
}

func collectHTTPProbe(ctx context.Context, snap *ServiceSnapshot, client *http.Client, url string) {
	start := time.Now()
	req, _ := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	resp, err := client.Do(req)
	if err == nil {
		code := resp.StatusCode
		snap.HTTPStatus = &code
		resp.Body.Close()
		ms := time.Since(start).Milliseconds()
		if snap.Extra == nil {
			snap.Extra = make(map[string]any)
		}
		snap.Extra["response_ms"] = ms
	}
}

func collectPortProbe(ctx context.Context, snap *ServiceSnapshot, port string) {
	if port == "" {
		return
	}
	// Use nc to check TCP reachability.
	err := exec.CommandContext(ctx, "nc", "-z", "-w1", "localhost", port).Run()
	reachable := err == nil
	if snap.Extra == nil {
		snap.Extra = make(map[string]any)
	}
	snap.Extra["port_reachable"] = reachable
}

// deriveStatus computes the overall status from container state + health + probes.
func deriveStatus(snap ServiceSnapshot) string {
	if snap.ContainerState != "running" {
		return "down"
	}
	if snap.HealthState == "unhealthy" {
		return "degraded"
	}
	if snap.HTTPStatus != nil && (*snap.HTTPStatus < 200 || *snap.HTTPStatus >= 400) {
		return "degraded"
	}
	if reachable, ok := snap.Extra["port_reachable"].(bool); ok && !reachable {
		return "degraded"
	}
	return "healthy"
}

// DetectIssues scans a set of snapshots for notable problems.
func DetectIssues(services []ServiceSnapshot) Issues {
	issues := Issues{}
	for _, s := range services {
		switch s.Status {
		case "down":
			issues.ServiceDown = append(issues.ServiceDown, s.Name)
		case "degraded":
			issues.ServiceDegraded = append(issues.ServiceDegraded, s.Name)
		}
		if s.Role == "llm-proxy" {
			// NOTE: the collector in this file does not set cooldown_count,
			// so LLMCooldowns stays false until a cooldown probe populates it.
			if extra := s.Extra; extra != nil {
				if count, ok := extra["cooldown_count"].(int); ok && count > 0 {
					issues.LLMCooldowns = true
				}
			}
		}
	}
	return issues
}

func intPtr(v int) *int { return &v }

// Blank-identifier references must be `var` declarations; `func _ = ...` does not compile.
var _ = intPtr       // keep helper referenced
var _ = strconv.Itoa // keeps the strconv import used (reserved for future use)
```

**Step 2: Verify it compiles**

Run: `cd /home/will/lab/agentmon && go build ./internal/monitor/swarm/`
Expected: no errors

**Step 3: Commit**

```bash
git add internal/monitor/swarm/collector.go
git commit -m "feat: add swarm collector with docker inspect + HTTP probes"
```

---

### Task 4: Create swarm-monitor binary

**Files:**
- Create: `cmd/swarm-monitor/main.go`

**Step 1: Create the binary**

```go
package main

import (
	"context"
	"encoding/json"
	"log"
	"os"
	"time"

	"agentmon/internal/monitor/swarm"
	qnats "agentmon/internal/queue/nats"
)

func main() {
	natsURL := envDefault("NATS_URL", "nats://nats:4222")
	natsTopic := envDefault("NATS_TOPIC", "agentmon.events.v1")
	interval := envDefault("POLL_INTERVAL", "30s")
	litellmBase := envDefault("LITELLM_BASE_URL", "http://localhost:18804")
	litellmKey := os.Getenv("LITELLM_MASTER_KEY")

	pub, err := qnats.NewPublisher(natsURL, natsTopic)
	if err != nil {
		log.Fatalf("failed to connect to NATS: %v", err)
	}
	defer pub.Close()

	pollDuration, err := time.ParseDuration(interval)
	if err != nil {
		log.Fatalf("invalid poll interval: %v", err)
	}

	cfg := swarm.Config{
		LiteLLMBaseURL: litellmBase,
		LiteLLMAPIKey:  litellmKey,
		HTTPTimeout:    5 * time.Second,
	}

	ticker := time.NewTicker(pollDuration)
	defer ticker.Stop()

	ctx := context.Background()
	log.Printf("swarm-monitor started, polling every %s", pollDuration)

	// Poll immediately on start.
	if err := poll(ctx, pub, cfg); err != nil {
		log.Printf("initial poll error: %v", err)
	}

	for range ticker.C {
		if err := poll(ctx, pub, cfg); err != nil {
			log.Printf("poll error: %v", err)
		}
	}
}

func poll(ctx context.Context, pub *qnats.Publisher, cfg swarm.Config) error {
	services, err := swarm.CollectAll(ctx, cfg)
	if err != nil {
		return err
	}

	issues := swarm.DetectIssues(services)
	now := time.Now().UTC()

	// Emit rolled-up swarm.snapshot.
	if err := emit(ctx, pub, "swarm.snapshot", "agentmon.swarm", map[string]any{
		"services": services,
		"issues":   issues,
	}, now); err != nil {
		log.Printf("failed to emit swarm.snapshot: %v", err)
	}

	// Emit one swarm.service.snapshot per service.
	for _, svc := range services {
		if err := emit(ctx, pub, "swarm.service.snapshot", "agentmon.swarm.service", map[string]any{
			"service": svc,
		}, now); err != nil {
			log.Printf("failed to emit swarm.service.snapshot for %s: %v", svc.Name, err)
		}
	}
	return nil
}

func emit(ctx context.Context, pub *qnats.Publisher, eventType, schemaName string, payload map[string]any, ts time.Time) error {
	event := map[string]any{
		"schema": map[string]any{
			"name":    schemaName,
			"version": 1,
		},
		"event": map[string]any{
			"id":   generateID(),
			"type": eventType,
			"ts":   ts.Format(time.RFC3339Nano),
		},
		"payload": payload,
	}
	data, err := json.Marshal(event)
	if err != nil {
		return err
	}
	return pub.Publish(ctx, data)
}

func generateID() string {
	return time.Now().Format("20060102150405") + "-" + randomString(8)
}

// randomString derives characters from the clock; it is not cryptographically
// random and is only meant to disambiguate low-rate event IDs.
func randomString(n int) string {
	const chars = "abcdefghijklmnopqrstuvwxyz0123456789"
	b := make([]byte, n)
	for i := range b {
		b[i] = chars[time.Now().Nanosecond()%len(chars)]
		time.Sleep(time.Nanosecond)
	}
	return string(b)
}

func envDefault(key, def string) string {
	if v := os.Getenv(key); v != "" {
		return v
	}
	return def
}
```

**Step 2: Verify it compiles**

Run: `cd /home/will/lab/agentmon && go build ./cmd/swarm-monitor/`
Expected: no errors

**Step 3: Verify all binaries still build**

Run: `cd /home/will/lab/agentmon && go build ./...`
Expected: no errors

**Step 4: Commit**

```bash
git add cmd/swarm-monitor/main.go
git commit -m "feat: add swarm-monitor binary"
```

---

### Task 5: Dashboard swarm strip

**Files:**
- Modify: `cmd/web-ui/static/app.js`
- Modify: `cmd/web-ui/static/style.css`

**Step 1: Add swarmState and merge function to app.js**

Near the top of the IIFE, alongside the existing `let openclawState = ...` declaration (line ~49), add:

```js
let swarmState = { services: {} }; // keyed by service name
```

After the existing `mergeOpenClawEvents` function (~line 716), add:

```js
function mergeSwarmSnapshot(evt) {
  const payload = getEnvelopePayload(evt);
  const services = payload.services || [];
  for (const svc of services) {
    if (svc.name) swarmState.services[svc.name] = svc;
  }
}

function mergeSwarmServiceSnapshot(evt) {
  const payload = getEnvelopePayload(evt);
  const svc = payload.service;
  if (svc && svc.name) swarmState.services[svc.name] = svc;
}
```

**Step 2: Add swarm strip to renderDashboard**

In `renderDashboard()`, the HTML template already has:

```html
``` Right after that line, add a swarm strip div: ```html
``` **Step 3: Add renderSwarmStrip function** After the `renderAgentVMStrip_dash` function (~line 1351), add: ```js function renderSwarmStrip_dash() { const strip = document.getElementById('dash-swarm-strip'); if (!strip) return; const services = Object.values(swarmState.services); if (services.length === 0) return; strip.innerHTML = services.map(svc => { const statusClass = svc.status === 'healthy' ? 'active' : svc.status === 'degraded' ? 'degraded' : 'inactive'; const label = svc.status || 'unknown'; return `
${escapeHTML(svc.name)} ${escapeHTML(label)}
`;
  }).join('');
}
```

**Step 4: Wire swarm strip into dashboard data load**

In `renderDashboard()`, the `Promise.all` block loads initial data. The swarm snapshots must be merged and rendered after `mergeOpenClawEvents(snapshots.events || [])` and `renderAgentVMStrip_dash()`. As a standalone request that would be:

```js
const swarmSnaps = await api('/v1/events?event_type=swarm.snapshot&limit=10').catch(() => ({ events: [] }));
for (const evt of swarmSnaps.events || []) mergeSwarmSnapshot(evt);
renderSwarmStrip_dash();
```

Note: this needs to be inside the try block, before the `if (!isCurrentPath('/')) return;` guard. The simplest placement is to fold the request into the `Promise.all` array instead of awaiting it separately.

Replace the `Promise.all` call in `renderDashboard` to add swarm snapshots:

```js
const [summaryData, tsData, recentData, snapshots, swarmSnaps] = await Promise.all([
  api('/v1/stats/summary'),
  api('/v1/stats/timeseries?window=1h'),
  api('/v1/events?limit=20'),
  api('/v1/events?event_type=openclaw.snapshot&limit=100').catch(() => ({ events: [] })),
  api('/v1/events?event_type=swarm.snapshot&limit=10').catch(() => ({ events: [] })),
]);
```

Then after `renderAgentVMStrip_dash()`:

```js
for (const evt of swarmSnaps.events || []) mergeSwarmSnapshot(evt);
renderSwarmStrip_dash();
```

**Step 5: Handle swarm events in handleDashboardWS**

In `handleDashboardWS`, after the `openclaw.snapshot` handler block, add:

```js
if (eventType === 'swarm.snapshot') {
  mergeSwarmSnapshot(msg.data);
  renderSwarmStrip_dash();
  return;
}
if (eventType === 'swarm.service.snapshot') {
  mergeSwarmServiceSnapshot(msg.data);
  renderSwarmStrip_dash();
  return;
}
```

**Step 6: Add swarm strip CSS**

In `style.css`, after the `.vm-pill-label` block (~line 750), add:

```css
/* ── Swarm strip ──────────────────────────────────────────── */
.swarm-strip {
  display: flex;
  flex-wrap: wrap;
  gap: 0.75rem;
  margin-bottom: 1.5rem;
}

.vm-pill.degraded {
  border-color: rgba(251, 191, 36, 0.3);
}

.vm-pill.degraded .vm-pill-dot {
  background: var(--warning);
}
```

**Step 7: Verify no JS errors**

Build check: `cd /home/will/lab/agentmon && go build ./...`
Expected: no errors

`go build` does not parse JavaScript, so also load the dashboard and confirm the browser console is clean.

**Step 8: Commit**

```bash
git add cmd/web-ui/static/app.js cmd/web-ui/static/style.css
git commit -m "feat: add swarm strip to dashboard"
```

---

### Task 6: Infrastructure page CSS

**Files:**
- Modify: `cmd/web-ui/static/style.css`

**Step 1: Add infrastructure page styles**

Append to the end of `style.css`:

```css
/* ── Infrastructure page ──────────────────────────────────── */
.infra-section-title {
  font-family: var(--font-display);
  font-size: 0.75rem;
  font-weight: 700;
  color: var(--text-dim);
  text-transform: uppercase;
  letter-spacing: 0.12em;
  margin: 0 0 1rem 0;
}

.infra-section {
  margin-bottom: 2rem;
}

/* Service card grid */
.service-grid {
  display: grid;
  grid-template-columns: repeat(auto-fill, minmax(260px, 1fr));
  gap: 1.25rem;
}

.service-card {
  background: var(--surface);
  border: 1px solid var(--border);
  border-radius: var(--radius-lg);
  padding: 1.125rem 1.25rem;
  display: flex;
  flex-direction: column;
  gap: 0.75rem;
  transition: border-color 0.2s;
}

.service-card:hover {
  border-color: rgba(34, 211, 238, 0.15);
}

.service-card-header {
  display: flex;
  align-items: center;
  justify-content: space-between;
}

.service-card-name {
  font-family: var(--font-mono);
  font-size: 0.88rem;
  font-weight: 600;
  color: var(--text-bright);
}

.service-badge {
  font-size: 0.65rem;
  font-weight: 700;
  text-transform: uppercase;
  letter-spacing: 0.08em;
  padding: 0.2rem 0.55rem;
  border-radius: 999px;
}

.service-badge.healthy {
  background: rgba(52, 211, 153, 0.12);
  color: var(--success);
  border: 1px solid rgba(52, 211, 153, 0.2);
}

.service-badge.degraded {
  background: rgba(251, 191, 36, 0.12);
  color: var(--warning);
  border: 1px solid rgba(251, 191, 36, 0.2);
}

.service-badge.down {
  background: rgba(248, 113, 113, 0.12);
  color: var(--error);
  border: 1px solid rgba(248, 113, 113, 0.2);
}

.service-role-tag {
  font-size: 0.65rem;
  font-family: var(--font-mono);
  color: var(--text-dim);
  margin-top: -0.25rem;
}

.service-stats {
display: flex; flex-direction: column; gap: 0.3rem; font-size: 0.78rem; } .service-stat-row { display: flex; justify-content: space-between; align-items: center; } .service-stat-label { color: var(--text-dim); font-family: var(--font-mono); font-size: 0.72rem; } .service-stat-value { color: var(--text); font-family: var(--font-mono); font-size: 0.75rem; } .service-stat-value.ok { color: var(--success); } .service-stat-value.warn { color: var(--warning); } .service-stat-value.bad { color: var(--error); } /* LiteLLM cooldown warning */ .llm-cooldown-banner { background: rgba(251, 191, 36, 0.08); border: 1px solid rgba(251, 191, 36, 0.2); border-radius: var(--radius); padding: 0.4rem 0.625rem; font-size: 0.72rem; color: var(--warning); font-family: var(--font-mono); } /* LiteLLM model count highlight */ .llm-model-count { font-family: var(--font-display); font-size: 1.5rem; font-weight: 800; color: var(--text-bright); letter-spacing: -0.02em; line-height: 1; } .llm-model-label { font-size: 0.68rem; color: var(--text-dim); text-transform: uppercase; letter-spacing: 0.08em; } ``` **Step 2: Commit** ```bash git add cmd/web-ui/static/style.css git commit -m "feat: add infrastructure page CSS" ``` --- ### Task 7: Infrastructure page JS + nav rename **Files:** - Modify: `cmd/web-ui/static/app.js` - Modify: `cmd/web-ui/static/index.html` **Step 1: Update nav in index.html** Change the nav link from `OpenClaw` to `Infra` and update the href: Old: ```html ``` New: ```html ``` **Step 2: Update the router in app.js** Change line ~153: ```js } else if (path.startsWith('/openclaw')) { renderOpenClaw(); ``` to: ```js } else if (path.startsWith('/infrastructure')) { renderInfrastructure(); ``` **Step 3: Add infraUnsubscribe state variable** Near the existing `let openclawUnsubscribe = null;` declaration (~line 50), add: ```js let infraUnsubscribe = null; ``` **Step 4: Update cleanupLiveViews to clean up infra subscription** Find the `cleanupLiveViews` function (~line 107). 
Replace: ```js if (openclawUnsubscribe) { openclawUnsubscribe(); openclawUnsubscribe = null; } ``` with: ```js if (openclawUnsubscribe) { openclawUnsubscribe(); openclawUnsubscribe = null; } if (infraUnsubscribe) { infraUnsubscribe(); infraUnsubscribe = null; } ``` **Step 5: Replace renderOpenClaw with renderInfrastructure** Replace the existing `renderOpenClaw` function (lines ~664-680) entirely with: ```js async function renderInfrastructure() { app.innerHTML = '

Loading...

'; infraUnsubscribe = subscribeWS(handleInfraWS); try { const [ocData, swarmData] = await Promise.all([ api('/v1/events?event_type=openclaw.snapshot&limit=100'), api('/v1/events?event_type=swarm.snapshot&limit=10').catch(() => ({ events: [] })), ]); mergeOpenClawEvents(ocData.events || []); for (const evt of swarmData.events || []) mergeSwarmSnapshot(evt); if (isCurrentPath('/infrastructure')) { renderInfraGrid(); } } catch (e) { if (isCurrentPath('/infrastructure')) { app.innerHTML = `

Error: ${escapeHTML(e.message)}

`; } } } ``` **Step 6: Replace handleOpenClawWS with handleInfraWS** Replace the existing `handleOpenClawWS` function (lines ~682-699) with: ```js function handleInfraWS(msg) { if (msg.type !== 'message') return; const eventType = getEnvelopeType(msg.data); if (eventType === 'openclaw.snapshot') { mergeOpenClawEvents([msg.data]); if (isCurrentPath('/infrastructure')) renderInfraGrid(); if (isCurrentPath('/agents')) renderAgentVMStrip(); return; } if (eventType === 'swarm.snapshot') { mergeSwarmSnapshot(msg.data); if (isCurrentPath('/infrastructure')) renderInfraGrid(); renderSwarmStrip_dash(); return; } if (eventType === 'swarm.service.snapshot') { mergeSwarmServiceSnapshot(msg.data); if (isCurrentPath('/infrastructure')) renderInfraGrid(); renderSwarmStrip_dash(); return; } } ``` **Step 7: Add renderInfraGrid function** Replace the existing `renderOpenClawGrid` function (lines ~718-785) with a new `renderInfraGrid` that shows both VMs and service cards. Add it right after the new `handleInfraWS` function: ```js function renderInfraGrid() { const vmNames = Object.keys(openclawState.instances).sort(); const services = Object.values(swarmState.services); app.innerHTML = `

VMs

${vmNames.length === 0 ? '

No VM data

' : `
${vmNames.map(name => renderVMCard(name)).join('')}
` }

Services

${services.length === 0 ? '

No swarm service data

' : `
${services.map(svc => renderServiceCard(svc)).join('')}
` }
`; } function renderVMCard(name) { const evt = openclawState.instances[name]; const payload = getEnvelopePayload(evt); const inst = payload.instance || {}; const host = payload.host || {}; const guest = payload.guest; const issues = payload.issues; return `

${escapeHTML(inst.name || name)}

${host.state === 'running' ? 'Running' : 'Stopped'}
Updated ${escapeHTML(relativeTime(getEnvelopeTS(evt)))}
Host${escapeHTML(inst.host || '-')}
Domain${escapeHTML(inst.domain || '-')}
vCPUs${host.vcpus || '-'}
Memory${escapeHTML(formatBytes(host.memory_kib ? host.memory_kib * 1024 : 0) || '-')}
Disk${escapeHTML(formatBytes(host.disk_actual_bytes) || '-')}
Autostart${host.autostart ? 'Yes' : 'No'}
${guest ? `
Gateway${guest.service_active ? 'Active' : 'Inactive'}
HTTP${guest.http_status || 'N/A'}
Version${escapeHTML(guest.version || '-')}
Guest Mem${guest.memory_percent !== undefined ? guest.memory_percent.toFixed(1) : '-'}%
Guest Disk${guest.disk_percent !== undefined ? guest.disk_percent.toFixed(1) : '-'}%
Load${guest.load_average !== undefined ? guest.load_average.toFixed(2) : '-'}
Uptime${escapeHTML(guest.service_uptime || '-')}
` : ''} ${issues && Object.values(issues).some(Boolean) ? `
Issues
${Object.entries(issues).filter(([, value]) => value).map(([key]) => ` ${escapeHTML(key.replace(/_/g, ' '))} `).join('')}
` : ''}
`; } function renderServiceCard(svc) { const role = svc.role || 'unknown'; switch (role) { case 'llm-proxy': return renderLLMProxyCard(svc); case 'db': return renderDBCard(svc); case 'search': return renderSearchCard(svc); case 'mcp': return renderMCPCard(svc); case 'voice': return renderVoiceCard(svc); case 'automation':return renderAutomationCard(svc); default: return renderGenericServiceCard(svc); } } function serviceCardHeader(svc) { return `
${escapeHTML(svc.name)}
${escapeHTML(svc.role || '')}
${escapeHTML(svc.status || 'down')}
`; } function serviceStatRow(label, value, valueClass) { return `
${escapeHTML(label)} ${value}
`; } function formatUptime(sec) { if (!sec) return '-'; if (sec < 60) return sec + 's'; if (sec < 3600) return Math.floor(sec / 60) + 'm'; if (sec < 86400) return Math.floor(sec / 3600) + 'h ' + Math.floor((sec % 3600) / 60) + 'm'; return Math.floor(sec / 86400) + 'd ' + Math.floor((sec % 86400) / 3600) + 'h'; } function renderLLMProxyCard(svc) { const extra = svc.extra || {}; const modelCount = extra.model_count; const cooldowns = extra.cooldown_count || 0; const httpStatus = svc.http_status; const httpClass = httpStatus === 200 ? 'ok' : httpStatus ? 'bad' : ''; return `
${serviceCardHeader(svc)}
${modelCount !== undefined ? modelCount : '-'} models
${cooldowns > 0 ? `
⚠ ${cooldowns} model${cooldowns > 1 ? 's' : ''} in cooldown
` : ''}
${serviceStatRow('HTTP', httpStatus ? String(httpStatus) : '-', httpClass)} ${serviceStatRow('Uptime', formatUptime(svc.uptime_sec), '')} ${serviceStatRow('Container', escapeHTML(svc.container_state || '-'), svc.container_state === 'running' ? 'ok' : 'bad')}
`; } function renderDBCard(svc) { const healthClass = svc.health_state === 'healthy' ? 'ok' : svc.health_state === 'unhealthy' ? 'bad' : ''; return `
${serviceCardHeader(svc)}
${serviceStatRow('Health', escapeHTML(svc.health_state || 'none'), healthClass)} ${serviceStatRow('Uptime', formatUptime(svc.uptime_sec), '')} ${serviceStatRow('Container', escapeHTML(svc.container_state || '-'), svc.container_state === 'running' ? 'ok' : 'bad')}
`; } function renderSearchCard(svc) { const extra = svc.extra || {}; const ms = extra.response_ms; const httpStatus = svc.http_status; const httpClass = httpStatus === 200 ? 'ok' : httpStatus ? 'bad' : ''; return `
${serviceCardHeader(svc)}
${serviceStatRow('HTTP', httpStatus ? String(httpStatus) : '-', httpClass)} ${ms !== undefined ? serviceStatRow('Response', ms + 'ms', ms < 500 ? 'ok' : 'warn') : ''} ${serviceStatRow('Uptime', formatUptime(svc.uptime_sec), '')}
`; } function renderMCPCard(svc) { const extra = svc.extra || {}; const reachable = extra.port_reachable; return `
${serviceCardHeader(svc)}
${reachable !== undefined ? serviceStatRow('Port', reachable ? 'reachable' : 'unreachable', reachable ? 'ok' : 'bad') : ''} ${serviceStatRow('Container', escapeHTML(svc.container_state || '-'), svc.container_state === 'running' ? 'ok' : 'bad')} ${serviceStatRow('Uptime', formatUptime(svc.uptime_sec), '')}
`; } function renderVoiceCard(svc) { const healthClass = svc.health_state === 'healthy' ? 'ok' : svc.health_state === 'unhealthy' ? 'bad' : ''; return `
${serviceCardHeader(svc)}
${serviceStatRow('Health', escapeHTML(svc.health_state || 'none'), healthClass)} ${serviceStatRow('Container', escapeHTML(svc.container_state || '-'), svc.container_state === 'running' ? 'ok' : 'bad')} ${serviceStatRow('Uptime', formatUptime(svc.uptime_sec), '')}
`; } function renderAutomationCard(svc) { const healthClass = svc.health_state === 'healthy' ? 'ok' : svc.health_state === 'unhealthy' ? 'bad' : ''; return `
${serviceCardHeader(svc)}
${serviceStatRow('Health', escapeHTML(svc.health_state || 'none'), healthClass)} ${serviceStatRow('Container', escapeHTML(svc.container_state || '-'), svc.container_state === 'running' ? 'ok' : 'bad')} ${serviceStatRow('Uptime', formatUptime(svc.uptime_sec), '')}
`; } function renderGenericServiceCard(svc) { return `
${serviceCardHeader(svc)}
${serviceStatRow('Container', escapeHTML(svc.container_state || '-'), svc.container_state === 'running' ? 'ok' : 'bad')} ${serviceStatRow('Uptime', formatUptime(svc.uptime_sec), '')}
`;
}
```

**Step 8: Verify build**

Run: `cd /home/will/lab/agentmon && go build ./...`
Expected: no errors

**Step 9: Commit**

```bash
git add cmd/web-ui/static/app.js cmd/web-ui/static/index.html
git commit -m "feat: rename OpenClaw to Infrastructure page, add service cards"
```

---

### Task 8: End-to-end verification

**Step 1: Build all binaries**

Run: `cd /home/will/lab/agentmon && go build ./...`
Expected: no errors

**Step 2: Test docker label filtering manually**

Run: `docker ps -a --filter label=agentmon.monitor=true --format "table {{.Names}}\t{{.Labels}}\t{{.Status}}"`
Expected: lists the labeled swarm containers with their labels. Because of `-a`, stopped containers are included as well as running ones.

**Step 3: Test swarm-monitor dry run**

Run:
```bash
cd /home/will/lab/agentmon
NATS_URL=nats://localhost:4222 LITELLM_MASTER_KEY="$(source /home/will/lab/swarm/.env && echo "$LITELLM_MASTER_KEY")" \
  go run ./cmd/swarm-monitor/ 2>&1 | head -20
```
Expected: logs "swarm-monitor started", then either publishes events or logs publish errors. Note that `main` calls `log.Fatalf` when `qnats.NewPublisher` returns an error, so if NATS is unreachable at startup the process may exit before the first poll; start NATS locally to exercise the collection phase. The monitor runs until interrupted, so stop it with Ctrl-C once the output looks right.

**Step 4: Navigate to /infrastructure in browser**

Open the web UI and navigate to `/infrastructure`. Verify:
- Nav shows "Infra" link, active when on `/infrastructure`
- VMs section shows existing openclaw cards
- Services section shows either cards (if swarm events exist in DB) or "No swarm service data"

**Step 5: Verify swarm strip on dashboard**

Navigate to `/`. Verify:
- VM strip still shows (zap/orb/sun)
- Swarm strip renders below it (may be empty if no `swarm.snapshot` events in DB yet)

**Step 6: Final commit if any fixes needed**

```bash
git add -A
git commit -m "fix: infrastructure page and swarm strip polish"
```
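
**Optional: sanity-check the status rules in isolation**

The badges verified above depend on the status rules from Task 3 (`deriveStatus`): a container that is not running is `down`; an unhealthy healthcheck, an HTTP status outside 200-399, or an unreachable port makes it `degraded`; otherwise it is `healthy`. The following is a minimal standalone sketch of those rules, useful for checking expectations without Docker or NATS. The `snapshot` type and the `derive` and `intPtr` helpers are local stand-ins written for this example (hypothetical names), not the plan's `swarm` package.

```go
package main

import "fmt"

// snapshot is a hypothetical, trimmed stand-in for ServiceSnapshot that keeps
// only the fields the status rules read.
type snapshot struct {
	ContainerState string
	HealthState    string
	HTTPStatus     *int
	Extra          map[string]any
}

// derive mirrors the rule order described in Task 3: container state first,
// then healthcheck, then HTTP probe, then port probe.
func derive(s snapshot) string {
	if s.ContainerState != "running" {
		return "down"
	}
	if s.HealthState == "unhealthy" {
		return "degraded"
	}
	if s.HTTPStatus != nil && (*s.HTTPStatus < 200 || *s.HTTPStatus >= 400) {
		return "degraded"
	}
	// Reading from a nil map is safe in Go; the comma-ok assertion yields false.
	if reachable, ok := s.Extra["port_reachable"].(bool); ok && !reachable {
		return "degraded"
	}
	return "healthy"
}

// intPtr returns a pointer to v, for building optional HTTP status values.
func intPtr(v int) *int { return &v }

func main() {
	cases := []struct {
		name string
		in   snapshot
		want string
	}{
		{"exited container", snapshot{ContainerState: "exited"}, "down"},
		{"unhealthy check", snapshot{ContainerState: "running", HealthState: "unhealthy"}, "degraded"},
		{"http 503", snapshot{ContainerState: "running", HTTPStatus: intPtr(503)}, "degraded"},
		{"port unreachable", snapshot{ContainerState: "running", Extra: map[string]any{"port_reachable": false}}, "degraded"},
		{"running, no probes", snapshot{ContainerState: "running", HealthState: "none"}, "healthy"},
	}
	for _, c := range cases {
		got := derive(c.in)
		fmt.Printf("%-20s got=%-9s want=%-9s match=%v\n", c.name, got, c.want, got == c.want)
	}
}
```

One consequence worth noting: a running container with no healthcheck and no probe (for example a `voice` service without a `healthcheck:` block) reports `healthy`, because the absence of evidence is treated as healthy rather than unknown.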