Swarm Monitor Implementation Plan

For Claude: REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.

Goal: Add a swarm-monitor binary that polls docker-compose services in ~/lab/swarm, emits swarm.snapshot and swarm.service.snapshot events to NATS, and surfaces service status on the dashboard strip and a new unified /infrastructure page (replacing /openclaw).

Architecture: New cmd/swarm-monitor/main.go shells out to the docker CLI (docker ps, docker inspect) and runs HTTP and TCP probes, emitting two event types per poll: one rolled-up swarm.snapshot plus one swarm.service.snapshot per service. The existing NATS → event-processor → postgres → query-api pipeline requires zero changes. Frontend adds a swarm strip to the dashboard and merges VM cards + service cards on a renamed /infrastructure page.

Tech Stack: Go (os/exec with the docker CLI, net/http), vanilla JS, existing NATS publisher pattern


Task 1: Add agentmon labels to docker-compose.yaml

Files:

  • Modify: /home/will/lab/swarm/docker-compose.yaml

Step 1: Add labels to each service

Add a labels: block to each monitored service. litellm-init is a one-shot container — do NOT label it.

For whisper-server (after its healthcheck: block):

    labels:
      agentmon.monitor: "true"
      agentmon.role: "voice"
      agentmon.port: "18801"

For kokoro-tts (after restart: unless-stopped):

    labels:
      agentmon.monitor: "true"
      agentmon.role: "voice"
      agentmon.port: "18805"

For brave-search (after its environment: block):

    labels:
      agentmon.monitor: "true"
      agentmon.role: "mcp"
      agentmon.port: "18802"

For searxng (after its volumes: block):

    labels:
      agentmon.monitor: "true"
      agentmon.role: "search"
      agentmon.port: "18803"

For litellm (after its healthcheck: block):

    labels:
      agentmon.monitor: "true"
      agentmon.role: "llm-proxy"
      agentmon.port: "18804"

For litellm-db (after its healthcheck: block):

    labels:
      agentmon.monitor: "true"
      agentmon.role: "db"

For n8n-agent (after its healthcheck: block):

    labels:
      agentmon.monitor: "true"
      agentmon.role: "automation"
      agentmon.port: "18808"

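Step 2: Verify labels appear in running containers

Labels are applied when a container is created, so recreate the affected services first (for example with docker compose up -d from /home/will/lab/swarm, adding the --profile flags you normally use).

Run: docker ps --filter label=agentmon.monitor=true --format "table {{.Names}}\t{{.Status}}"

Expected: lists currently-running swarm containers (whichever profiles are active).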
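
The same check can be done programmatically. The sketch below parses one line of docker ps --format '{{json .}}' output and splits its label string. It assumes the Labels field is a comma-separated key=value string, which is how the Docker CLI currently prints it; psLine, parseLabels, and the sample line are illustrative and not part of the plan.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// psLine mirrors the fields of interest in one line of
// `docker ps --format '{{json .}}'` output.
type psLine struct {
	Names  string `json:"Names"`
	State  string `json:"State"`
	Labels string `json:"Labels"` // assumed: comma-separated key=value pairs
}

// parseLabels splits a "k1=v1,k2=v2" string into a map.
// Label values that themselves contain commas are not handled by this sketch.
func parseLabels(s string) map[string]string {
	labels := make(map[string]string)
	for _, pair := range strings.Split(s, ",") {
		if k, v, ok := strings.Cut(pair, "="); ok {
			labels[k] = v
		}
	}
	return labels
}

func main() {
	// Hand-written sample line, not captured from a real host.
	sample := `{"Names":"litellm","State":"running","Labels":"agentmon.monitor=true,agentmon.role=llm-proxy,agentmon.port=18804"}`

	var line psLine
	if err := json.Unmarshal([]byte(sample), &line); err != nil {
		panic(err)
	}
	labels := parseLabels(line.Labels)
	fmt.Println(line.Names, labels["agentmon.role"], labels["agentmon.port"])
}
```

The collector in Task 3 reads labels from docker inspect instead, where they arrive as a proper JSON object, so this parsing is only needed when working from docker ps output alone.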

Step 3: Commit

cd /home/will/lab/swarm
git add docker-compose.yaml
git commit -m "feat: add agentmon monitor labels to swarm services"

Task 2: Create swarm types

Files:

  • Create: internal/monitor/swarm/types.go

Step 1: Create the types file

package swarm

import "time"

// ServiceSnapshot holds the collected state for one docker-compose service.
type ServiceSnapshot struct {
	Name           string         `json:"name"`
	Role           string         `json:"role"`
	ContainerState string         `json:"container_state"` // Docker state (running, exited, paused, ...) or "missing"
	HealthState    string         `json:"health_state"`    // healthy/unhealthy/starting/none
	Status         string         `json:"status"`          // healthy/degraded/down
	UptimeSec      int64          `json:"uptime_sec,omitempty"`
	HTTPStatus     *int           `json:"http_status,omitempty"`
	Extra          map[string]any `json:"extra,omitempty"`
}

// SwarmSnapshot holds a rolled-up snapshot of all labeled services.
type SwarmSnapshot struct {
	Services  []ServiceSnapshot `json:"services"`
	Issues    Issues            `json:"issues"`
	Timestamp time.Time         `json:"timestamp"`
}

// Issues flags notable problems detected during a poll.
type Issues struct {
	ServiceDown     []string `json:"service_down,omitempty"`
	ServiceDegraded []string `json:"service_degraded,omitempty"`
	LLMCooldowns    bool     `json:"llm_cooldowns,omitempty"`
}
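
To make the wire format concrete, here is a small sketch that marshals two snapshots. ServiceSnapshot is copied from types.go so the example is self-contained; the encode helper and the sample values are illustrative only. It shows that the omitempty fields (uptime_sec, http_status, extra) disappear from the JSON when unset, which is why the frontend has to treat them as optional.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ServiceSnapshot is copied from types.go.
type ServiceSnapshot struct {
	Name           string         `json:"name"`
	Role           string         `json:"role"`
	ContainerState string         `json:"container_state"`
	HealthState    string         `json:"health_state"`
	Status         string         `json:"status"`
	UptimeSec      int64          `json:"uptime_sec,omitempty"`
	HTTPStatus     *int           `json:"http_status,omitempty"`
	Extra          map[string]any `json:"extra,omitempty"`
}

// encode is a hypothetical helper that returns the JSON form of a snapshot.
func encode(s ServiceSnapshot) string {
	b, err := json.Marshal(s)
	if err != nil {
		panic(err)
	}
	return string(b)
}

func main() {
	code := 200
	probed := ServiceSnapshot{
		Name: "litellm", Role: "llm-proxy",
		ContainerState: "running", HealthState: "healthy", Status: "healthy",
		UptimeSec: 90, HTTPStatus: &code,
		Extra: map[string]any{"model_count": 3},
	}
	down := ServiceSnapshot{
		Name: "litellm-db", Role: "db",
		ContainerState: "exited", HealthState: "none", Status: "down",
	}

	fmt.Println(encode(probed))
	// The second snapshot carries no uptime, HTTP status, or extra data:
	// {"name":"litellm-db","role":"db","container_state":"exited","health_state":"none","status":"down"}
	fmt.Println(encode(down))
}
```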

Step 2: Verify it compiles

Run: cd /home/will/lab/agentmon && go build ./internal/monitor/swarm/

Expected: no errors

Step 3: Commit

git add internal/monitor/swarm/types.go
git commit -m "feat: add swarm monitor types"

Task 3: Create swarm collector

Files:

  • Create: internal/monitor/swarm/collector.go

Step 1: Create the collector

package swarm

import (
	"context"
	"encoding/json"
	"fmt"
	"net/http"
	"os/exec"
	"strconv"
	"strings"
	"time"
)

// Config holds collector configuration.
type Config struct {
	LiteLLMBaseURL string
	LiteLLMAPIKey  string
	HTTPTimeout    time.Duration
}

// dockerPsEntry is the JSON shape from `docker ps --format '{{json .}}'`.
type dockerPsEntry struct {
	ID     string `json:"ID"`
	Names  string `json:"Names"`
	Status string `json:"Status"`
	State  string `json:"State"`
}

// dockerInspectEntry is the minimal shape we need from `docker inspect`.
type dockerInspectEntry struct {
	Name  string `json:"Name"`
	State struct {
		Status    string `json:"Status"`
		Running   bool   `json:"Running"`
		StartedAt string `json:"StartedAt"`
		Health    *struct {
			Status string `json:"Status"`
		} `json:"Health"`
	} `json:"State"`
	Config struct {
		Labels map[string]string `json:"Labels"`
	} `json:"Config"`
}

// CollectAll lists all containers labeled agentmon.monitor=true and collects
// a ServiceSnapshot for each.
func CollectAll(ctx context.Context, cfg Config) ([]ServiceSnapshot, error) {
	// List labeled containers (running + stopped).
	out, err := exec.CommandContext(ctx, "docker", "ps", "-a",
		"--filter", "label=agentmon.monitor=true",
		"--format", "{{json .}}",
	).Output()
	if err != nil {
		return nil, fmt.Errorf("docker ps failed: %w", err)
	}

	var entries []dockerPsEntry
	for _, line := range strings.Split(strings.TrimSpace(string(out)), "\n") {
		if line == "" {
			continue
		}
		var e dockerPsEntry
		if err := json.Unmarshal([]byte(line), &e); err != nil {
			continue
		}
		entries = append(entries, e)
	}

	client := &http.Client{Timeout: cfg.HTTPTimeout}
	var snapshots []ServiceSnapshot
	for _, e := range entries {
		snap := collectOne(ctx, e.Names, client, cfg)
		snapshots = append(snapshots, snap)
	}

	return snapshots, nil
}

func collectOne(ctx context.Context, name string, client *http.Client, cfg Config) ServiceSnapshot {
	snap := ServiceSnapshot{
		Name:           name,
		ContainerState: "missing",
		HealthState:    "none",
		Status:         "down",
	}

	// Inspect for detailed state.
	out, err := exec.CommandContext(ctx, "docker", "inspect", "--format", "{{json .}}", name).Output()
	if err != nil {
		return snap
	}

	var detail dockerInspectEntry
	if err := json.Unmarshal(out, &detail); err != nil {
		return snap
	}

	snap.Role = detail.Config.Labels["agentmon.role"]
	snap.ContainerState = detail.State.Status

	if detail.State.Health != nil {
		snap.HealthState = detail.State.Health.Status
	}

	// Calculate uptime if running.
	if detail.State.Running && detail.State.StartedAt != "" {
		if t, err := time.Parse(time.RFC3339Nano, detail.State.StartedAt); err == nil {
			snap.UptimeSec = int64(time.Since(t).Seconds())
		}
	}

	// Role-specific probes.
	switch snap.Role {
	case "llm-proxy":
		collectLLMProxy(ctx, &snap, client, cfg)
	case "search":
		collectHTTPProbe(ctx, &snap, client, "http://localhost:"+detail.Config.Labels["agentmon.port"]+"/")
	case "mcp":
		collectPortProbe(ctx, &snap, detail.Config.Labels["agentmon.port"])
	case "db", "voice", "automation":
		// Docker healthcheck state is sufficient; no HTTP probe.
	}

	snap.Status = deriveStatus(snap)
	return snap
}

func collectLLMProxy(ctx context.Context, snap *ServiceSnapshot, client *http.Client, cfg Config) {
	if snap.Extra == nil {
		snap.Extra = make(map[string]any)
	}

	// Health probe.
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, cfg.LiteLLMBaseURL+"/health/liveliness", nil)
	if err != nil {
		return // malformed base URL; passing a nil request to client.Do would panic
	}
	resp, err := client.Do(req)
	if err == nil {
		code := resp.StatusCode
		snap.HTTPStatus = &code
		resp.Body.Close()
	}

	// Model count.
	if cfg.LiteLLMAPIKey != "" {
		req, _ := http.NewRequestWithContext(ctx, http.MethodGet, cfg.LiteLLMBaseURL+"/v2/model/info", nil)
		req.Header.Set("Authorization", "Bearer "+cfg.LiteLLMAPIKey)
		resp, err := client.Do(req)
		if err == nil {
			defer resp.Body.Close()
			var result struct {
				Data []struct {
					ModelName string `json:"model_name"`
				} `json:"data"`
			}
			if json.NewDecoder(resp.Body).Decode(&result) == nil {
				snap.Extra["model_count"] = len(result.Data)
			}
		}
	}
}

func collectHTTPProbe(ctx context.Context, snap *ServiceSnapshot, client *http.Client, url string) {
	start := time.Now()
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return // malformed URL (for example, a bad agentmon.port label)
	}
	resp, err := client.Do(req)
	if err == nil {
		code := resp.StatusCode
		snap.HTTPStatus = &code
		resp.Body.Close()
		ms := time.Since(start).Milliseconds()
		if snap.Extra == nil {
			snap.Extra = make(map[string]any)
		}
		snap.Extra["response_ms"] = ms
	}
}

func collectPortProbe(ctx context.Context, snap *ServiceSnapshot, port string) {
	if port == "" {
		return
	}
	// Use nc to check TCP reachability.
	err := exec.CommandContext(ctx, "nc", "-z", "-w1", "localhost", port).Run()
	reachable := err == nil
	if snap.Extra == nil {
		snap.Extra = make(map[string]any)
	}
	snap.Extra["port_reachable"] = reachable
}

// deriveStatus computes the overall status from container state + health + probes.
func deriveStatus(snap ServiceSnapshot) string {
	if snap.ContainerState != "running" {
		return "down"
	}
	if snap.HealthState == "unhealthy" {
		return "degraded"
	}
	if snap.HTTPStatus != nil && (*snap.HTTPStatus < 200 || *snap.HTTPStatus >= 400) {
		return "degraded"
	}
	if reachable, ok := snap.Extra["port_reachable"].(bool); ok && !reachable {
		return "degraded"
	}
	return "healthy"
}

// DetectIssues scans a set of snapshots for notable problems.
func DetectIssues(services []ServiceSnapshot) Issues {
	issues := Issues{}
	for _, s := range services {
		switch s.Status {
		case "down":
			issues.ServiceDown = append(issues.ServiceDown, s.Name)
		case "degraded":
			issues.ServiceDegraded = append(issues.ServiceDegraded, s.Name)
		}
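		if s.Role == "llm-proxy" {
			// Note: collectLLMProxy does not set cooldown_count yet, so this
			// flag stays false until a cooldown probe populates it.
			if extra := s.Extra; extra != nil {
				if count, ok := extra["cooldown_count"].(int); ok && count > 0 {
					issues.LLMCooldowns = true
				}
			}
		}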
	}
	return issues
}

func intPtr(v int) *int { return &v }

var (
	_ = intPtr       // suppress unused warning
	_ = strconv.Itoa // keeps the strconv import in use until it is needed
)

Step 2: Verify it compiles

Run: cd /home/will/lab/agentmon && go build ./internal/monitor/swarm/

Expected: no errors
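
Compiling says nothing about the status rules themselves. The sketch below repeats deriveStatus from collector.go against a few hand-picked inputs so the precedence (container state, then healthcheck, then HTTP status, then port probe) is easy to see. The trimmed snapshot type and the case list are illustrative.

```go
package main

import "fmt"

// snapshot carries only the fields deriveStatus reads.
type snapshot struct {
	ContainerState string
	HealthState    string
	HTTPStatus     *int
	Extra          map[string]any
}

// deriveStatus repeats the rules from collector.go.
func deriveStatus(s snapshot) string {
	if s.ContainerState != "running" {
		return "down"
	}
	if s.HealthState == "unhealthy" {
		return "degraded"
	}
	if s.HTTPStatus != nil && (*s.HTTPStatus < 200 || *s.HTTPStatus >= 400) {
		return "degraded"
	}
	// Reading from a nil map is safe in Go; the assertion simply fails.
	if reachable, ok := s.Extra["port_reachable"].(bool); ok && !reachable {
		return "degraded"
	}
	return "healthy"
}

func intPtr(v int) *int { return &v }

func main() {
	cases := []struct {
		name string
		snap snapshot
	}{
		{"exited container", snapshot{ContainerState: "exited"}},
		{"unhealthy healthcheck", snapshot{ContainerState: "running", HealthState: "unhealthy"}},
		{"HTTP 503", snapshot{ContainerState: "running", HTTPStatus: intPtr(503)}},
		{"HTTP 302", snapshot{ContainerState: "running", HTTPStatus: intPtr(302)}},
		{"port unreachable", snapshot{ContainerState: "running", Extra: map[string]any{"port_reachable": false}}},
		{"healthcheck starting", snapshot{ContainerState: "running", HealthState: "starting"}},
	}
	for _, c := range cases {
		fmt.Printf("%-24s -> %s\n", c.name, deriveStatus(c.snap))
	}
}
```

Two consequences worth knowing: a container whose healthcheck is still starting is reported as healthy, and any 3xx response counts as healthy even though the service cards in Task 7 color only 200 as ok.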

Step 3: Commit

git add internal/monitor/swarm/collector.go
git commit -m "feat: add swarm collector with docker inspect + HTTP probes"

Task 4: Create swarm-monitor binary

Files:

  • Create: cmd/swarm-monitor/main.go

Step 1: Create the binary

package main

import (
	"context"
	"encoding/json"
	"log"
	"os"
	"time"

	"agentmon/internal/monitor/swarm"
	qnats "agentmon/internal/queue/nats"
)

func main() {
	natsURL := envDefault("NATS_URL", "nats://nats:4222")
	natsTopic := envDefault("NATS_TOPIC", "agentmon.events.v1")
	interval := envDefault("POLL_INTERVAL", "30s")
	litellmBase := envDefault("LITELLM_BASE_URL", "http://localhost:18804")
	litellmKey := os.Getenv("LITELLM_MASTER_KEY")

	pub, err := qnats.NewPublisher(natsURL, natsTopic)
	if err != nil {
		log.Fatalf("failed to connect to NATS: %v", err)
	}
	defer pub.Close()

	pollDuration, err := time.ParseDuration(interval)
	if err != nil {
		log.Fatalf("invalid poll interval: %v", err)
	}

	cfg := swarm.Config{
		LiteLLMBaseURL: litellmBase,
		LiteLLMAPIKey:  litellmKey,
		HTTPTimeout:    5 * time.Second,
	}

	ticker := time.NewTicker(pollDuration)
	defer ticker.Stop()

	ctx := context.Background()
	log.Printf("swarm-monitor started, polling every %s", pollDuration)

	// Poll immediately on start.
	if err := poll(ctx, pub, cfg); err != nil {
		log.Printf("initial poll error: %v", err)
	}

	for range ticker.C {
		if err := poll(ctx, pub, cfg); err != nil {
			log.Printf("poll error: %v", err)
		}
	}
}

func poll(ctx context.Context, pub *qnats.Publisher, cfg swarm.Config) error {
	services, err := swarm.CollectAll(ctx, cfg)
	if err != nil {
		return err
	}

	issues := swarm.DetectIssues(services)
	now := time.Now().UTC()

	// Emit rolled-up swarm.snapshot.
	if err := emit(ctx, pub, "swarm.snapshot", "agentmon.swarm", map[string]any{
		"services": services,
		"issues":   issues,
	}, now); err != nil {
		log.Printf("failed to emit swarm.snapshot: %v", err)
	}

	// Emit one swarm.service.snapshot per service.
	for _, svc := range services {
		if err := emit(ctx, pub, "swarm.service.snapshot", "agentmon.swarm.service", map[string]any{
			"service": svc,
		}, now); err != nil {
			log.Printf("failed to emit swarm.service.snapshot for %s: %v", svc.Name, err)
		}
	}

	return nil
}

func emit(ctx context.Context, pub *qnats.Publisher, eventType, schemaName string, payload map[string]any, ts time.Time) error {
	event := map[string]any{
		"schema": map[string]any{
			"name":    schemaName,
			"version": 1,
		},
		"event": map[string]any{
			"id":   generateID(),
			"type": eventType,
			"ts":   ts.Format(time.RFC3339Nano),
		},
		"payload": payload,
	}

	data, err := json.Marshal(event)
	if err != nil {
		return err
	}

	return pub.Publish(ctx, data)
}

func generateID() string {
	return time.Now().Format("20060102150405") + "-" + randomString(8)
}

func randomString(n int) string {
	const chars = "abcdefghijklmnopqrstuvwxyz0123456789"
	b := make([]byte, n)
	for i := range b {
		b[i] = chars[time.Now().Nanosecond()%len(chars)]
		time.Sleep(time.Nanosecond)
	}
	return string(b)
}

func envDefault(key, def string) string {
	if v := os.Getenv(key); v != "" {
		return v
	}
	return def
}

Step 2: Verify it compiles

Run: cd /home/will/lab/agentmon && go build ./cmd/swarm-monitor/

Expected: no errors

Step 3: Verify all binaries still build

Run: cd /home/will/lab/agentmon && go build ./...

Expected: no errors

Step 4: Commit

git add cmd/swarm-monitor/main.go
git commit -m "feat: add swarm-monitor binary"

Task 5: Dashboard swarm strip

Files:

  • Modify: cmd/web-ui/static/app.js
  • Modify: cmd/web-ui/static/style.css

Step 1: Add swarmState and merge function to app.js

Near the top of the IIFE, alongside the existing let openclawState = ... declaration (line ~49), add:

let swarmState = { services: {} }; // keyed by service name

After the existing mergeOpenClawEvents function (~line 716), add:

function mergeSwarmSnapshot(evt) {
  const payload = getEnvelopePayload(evt);
  const services = payload.services || [];
  for (const svc of services) {
    if (svc.name) swarmState.services[svc.name] = svc;
  }
}

function mergeSwarmServiceSnapshot(evt) {
  const payload = getEnvelopePayload(evt);
  const svc = payload.service;
  if (svc && svc.name) swarmState.services[svc.name] = svc;
}

Step 2: Add swarm strip to renderDashboard

In renderDashboard(), the HTML template already has:

<div class="vm-strip" id="dash-vm-strip" style="margin-bottom:1.5rem"></div>

Right after that line, add a swarm strip div:

<div class="swarm-strip" id="dash-swarm-strip"></div>

Step 3: Add renderSwarmStrip_dash function

After the renderAgentVMStrip_dash function (~line 1351), add:

function renderSwarmStrip_dash() {
  const strip = document.getElementById('dash-swarm-strip');
  if (!strip) return;
  const services = Object.values(swarmState.services);
  if (services.length === 0) return;
  strip.innerHTML = services.map(svc => {
    const statusClass = svc.status === 'healthy' ? 'active'
      : svc.status === 'degraded' ? 'degraded' : 'inactive';
    const label = svc.status || 'unknown';
    return `
      <div class="vm-pill ${statusClass}">
        <span class="vm-pill-dot"></span>
        <span class="vm-pill-name">${escapeHTML(svc.name)}</span>
        <span class="vm-pill-label">${escapeHTML(label)}</span>
      </div>
    `;
  }).join('');
}

Step 4: Wire swarm strip into dashboard data load

In renderDashboard(), the Promise.all block loads the initial data. Fetch the swarm snapshots in that same block so the merge runs inside the try block, before the if (!isCurrentPath('/')) return; guard.

Replace the Promise.all call in renderDashboard to add swarm snapshots:

const [summaryData, tsData, recentData, snapshots, swarmSnaps] = await Promise.all([
  api('/v1/stats/summary'),
  api('/v1/stats/timeseries?window=1h'),
  api('/v1/events?limit=20'),
  api('/v1/events?event_type=openclaw.snapshot&limit=100').catch(() => ({ events: [] })),
  api('/v1/events?event_type=swarm.snapshot&limit=10').catch(() => ({ events: [] })),
]);

Then after renderAgentVMStrip_dash():

for (const evt of swarmSnaps.events || []) mergeSwarmSnapshot(evt);
renderSwarmStrip_dash();

Step 5: Handle swarm events in handleDashboardWS

In handleDashboardWS, after the openclaw.snapshot handler block, add:

if (eventType === 'swarm.snapshot') {
  mergeSwarmSnapshot(msg.data);
  renderSwarmStrip_dash();
  return;
}
if (eventType === 'swarm.service.snapshot') {
  mergeSwarmServiceSnapshot(msg.data);
  renderSwarmStrip_dash();
  return;
}

Step 6: Add swarm strip CSS

In style.css, after the .vm-pill-label block (~line 750), add:

/* ── Swarm strip ──────────────────────────────────────────── */
.swarm-strip {
  display: flex;
  flex-wrap: wrap;
  gap: 0.75rem;
  margin-bottom: 1.5rem;
}

.vm-pill.degraded {
  border-color: rgba(251, 191, 36, 0.3);
}

.vm-pill.degraded .vm-pill-dot {
  background: var(--warning);
}

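Step 7: Verify no JS errors

Build check: cd /home/will/lab/agentmon && go build ./...

Expected: no errors

Note: go build only confirms that the Go packages still compile; it does not parse app.js. If Node.js is available, node --check cmd/web-ui/static/app.js gives a quick syntax check; otherwise load the dashboard and watch the browser console.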

Step 8: Commit

git add cmd/web-ui/static/app.js cmd/web-ui/static/style.css
git commit -m "feat: add swarm strip to dashboard"

Task 6: Infrastructure page CSS

Files:

  • Modify: cmd/web-ui/static/style.css

Step 1: Add infrastructure page styles

Append to the end of style.css:

/* ── Infrastructure page ──────────────────────────────────── */
.infra-section-title {
  font-family: var(--font-display);
  font-size: 0.75rem;
  font-weight: 700;
  color: var(--text-dim);
  text-transform: uppercase;
  letter-spacing: 0.12em;
  margin: 0 0 1rem 0;
}

.infra-section {
  margin-bottom: 2rem;
}

/* Service card grid */
.service-grid {
  display: grid;
  grid-template-columns: repeat(auto-fill, minmax(260px, 1fr));
  gap: 1.25rem;
}

.service-card {
  background: var(--surface);
  border: 1px solid var(--border);
  border-radius: var(--radius-lg);
  padding: 1.125rem 1.25rem;
  display: flex;
  flex-direction: column;
  gap: 0.75rem;
  transition: border-color 0.2s;
}

.service-card:hover {
  border-color: rgba(34, 211, 238, 0.15);
}

.service-card-header {
  display: flex;
  align-items: center;
  justify-content: space-between;
}

.service-card-name {
  font-family: var(--font-mono);
  font-size: 0.88rem;
  font-weight: 600;
  color: var(--text-bright);
}

.service-badge {
  font-size: 0.65rem;
  font-weight: 700;
  text-transform: uppercase;
  letter-spacing: 0.08em;
  padding: 0.2rem 0.55rem;
  border-radius: 999px;
}

.service-badge.healthy {
  background: rgba(52, 211, 153, 0.12);
  color: var(--success);
  border: 1px solid rgba(52, 211, 153, 0.2);
}

.service-badge.degraded {
  background: rgba(251, 191, 36, 0.12);
  color: var(--warning);
  border: 1px solid rgba(251, 191, 36, 0.2);
}

.service-badge.down {
  background: rgba(248, 113, 113, 0.12);
  color: var(--error);
  border: 1px solid rgba(248, 113, 113, 0.2);
}

.service-role-tag {
  font-size: 0.65rem;
  font-family: var(--font-mono);
  color: var(--text-dim);
  margin-top: -0.25rem;
}

.service-stats {
  display: flex;
  flex-direction: column;
  gap: 0.3rem;
  font-size: 0.78rem;
}

.service-stat-row {
  display: flex;
  justify-content: space-between;
  align-items: center;
}

.service-stat-label {
  color: var(--text-dim);
  font-family: var(--font-mono);
  font-size: 0.72rem;
}

.service-stat-value {
  color: var(--text);
  font-family: var(--font-mono);
  font-size: 0.75rem;
}

.service-stat-value.ok   { color: var(--success); }
.service-stat-value.warn { color: var(--warning); }
.service-stat-value.bad  { color: var(--error); }

/* LiteLLM cooldown warning */
.llm-cooldown-banner {
  background: rgba(251, 191, 36, 0.08);
  border: 1px solid rgba(251, 191, 36, 0.2);
  border-radius: var(--radius);
  padding: 0.4rem 0.625rem;
  font-size: 0.72rem;
  color: var(--warning);
  font-family: var(--font-mono);
}

/* LiteLLM model count highlight */
.llm-model-count {
  font-family: var(--font-display);
  font-size: 1.5rem;
  font-weight: 800;
  color: var(--text-bright);
  letter-spacing: -0.02em;
  line-height: 1;
}

.llm-model-label {
  font-size: 0.68rem;
  color: var(--text-dim);
  text-transform: uppercase;
  letter-spacing: 0.08em;
}

Step 2: Commit

git add cmd/web-ui/static/style.css
git commit -m "feat: add infrastructure page CSS"

Task 7: Infrastructure page JS + nav rename

Files:

  • Modify: cmd/web-ui/static/app.js
  • Modify: cmd/web-ui/static/index.html

Step 1: Update nav in index.html

Change the nav link from OpenClaw to Infra and update the href:

Old:

<nav><a href="/">Dashboard</a><a href="/sessions">Sessions</a><a href="/agents">Agents</a><a href="/openclaw">OpenClaw</a></nav>

New:

<nav><a href="/">Dashboard</a><a href="/sessions">Sessions</a><a href="/agents">Agents</a><a href="/infrastructure">Infra</a></nav>

Step 2: Update the router in app.js

Change line ~153:

} else if (path.startsWith('/openclaw')) {
  renderOpenClaw();

to:

} else if (path.startsWith('/infrastructure')) {
  renderInfrastructure();

Step 3: Add infraUnsubscribe state variable

Near the existing let openclawUnsubscribe = null; declaration (~line 50), add:

let infraUnsubscribe = null;

Step 4: Update cleanupLiveViews to clean up infra subscription

Find the cleanupLiveViews function (~line 107). Replace:

if (openclawUnsubscribe) {
  openclawUnsubscribe();
  openclawUnsubscribe = null;
}

with:

if (openclawUnsubscribe) {
  openclawUnsubscribe();
  openclawUnsubscribe = null;
}
if (infraUnsubscribe) {
  infraUnsubscribe();
  infraUnsubscribe = null;
}

Step 5: Replace renderOpenClaw with renderInfrastructure

Replace the existing renderOpenClaw function (lines ~664-680) entirely with:

async function renderInfrastructure() {
  app.innerHTML = '<div class="page-header"><h2>Infrastructure</h2></div><p class="empty-state">Loading...</p>';

  infraUnsubscribe = subscribeWS(handleInfraWS);

  try {
    const [ocData, swarmData] = await Promise.all([
      api('/v1/events?event_type=openclaw.snapshot&limit=100'),
      api('/v1/events?event_type=swarm.snapshot&limit=10').catch(() => ({ events: [] })),
    ]);

    mergeOpenClawEvents(ocData.events || []);
    for (const evt of swarmData.events || []) mergeSwarmSnapshot(evt);

    if (isCurrentPath('/infrastructure')) {
      renderInfraGrid();
    }
  } catch (e) {
    if (isCurrentPath('/infrastructure')) {
      app.innerHTML = `<div class="page-header"><h2>Infrastructure</h2></div><p class="empty-state">Error: ${escapeHTML(e.message)}</p>`;
    }
  }
}

Step 6: Replace handleOpenClawWS with handleInfraWS

Replace the existing handleOpenClawWS function (lines ~682-699) with:

function handleInfraWS(msg) {
  if (msg.type !== 'message') return;

  const eventType = getEnvelopeType(msg.data);

  if (eventType === 'openclaw.snapshot') {
    mergeOpenClawEvents([msg.data]);
    if (isCurrentPath('/infrastructure')) renderInfraGrid();
    if (isCurrentPath('/agents')) renderAgentVMStrip();
    return;
  }

  if (eventType === 'swarm.snapshot') {
    mergeSwarmSnapshot(msg.data);
    if (isCurrentPath('/infrastructure')) renderInfraGrid();
    renderSwarmStrip_dash();
    return;
  }

  if (eventType === 'swarm.service.snapshot') {
    mergeSwarmServiceSnapshot(msg.data);
    if (isCurrentPath('/infrastructure')) renderInfraGrid();
    renderSwarmStrip_dash();
    return;
  }
}

Step 7: Add renderInfraGrid function

Replace the existing renderOpenClawGrid function (lines ~718-785) with a new renderInfraGrid that shows both VMs and service cards. Add it right after the new handleInfraWS function:

function renderInfraGrid() {
  const vmNames = Object.keys(openclawState.instances).sort();
  const services = Object.values(swarmState.services);

  app.innerHTML = `
    <div class="page-header">
      <h2>Infrastructure <span class="live-indicator"><span class="live-dot"></span>Live</span></h2>
    </div>

    <div class="infra-section">
      <p class="infra-section-title">VMs</p>
      ${vmNames.length === 0
        ? '<p class="empty-state">No VM data</p>'
        : `<div class="vm-grid">${vmNames.map(name => renderVMCard(name)).join('')}</div>`
      }
    </div>

    <div class="infra-section">
      <p class="infra-section-title">Services</p>
      ${services.length === 0
        ? '<p class="empty-state">No swarm service data</p>'
        : `<div class="service-grid">${services.map(svc => renderServiceCard(svc)).join('')}</div>`
      }
    </div>
  `;
}

function renderVMCard(name) {
  const evt = openclawState.instances[name];
  const payload = getEnvelopePayload(evt);
  const inst = payload.instance || {};
  const host = payload.host || {};
  const guest = payload.guest;
  const issues = payload.issues;

  return `
    <div class="vm-card">
      <div class="vm-card-header">
        <h3>${escapeHTML(inst.name || name)}</h3>
        <div class="vm-status ${host.state === 'running' ? 'running' : 'stopped'}">
          ${host.state === 'running' ? 'Running' : 'Stopped'}
        </div>
      </div>
      <div class="vm-updated">Updated ${escapeHTML(relativeTime(getEnvelopeTS(evt)))}</div>
      <table class="vm-stats">
        <tr><td>Host</td><td>${escapeHTML(inst.host || '-')}</td></tr>
        <tr><td>Domain</td><td>${escapeHTML(inst.domain || '-')}</td></tr>
        <tr><td>vCPUs</td><td>${host.vcpus || '-'}</td></tr>
        <tr><td>Memory</td><td>${escapeHTML(formatBytes(host.memory_kib ? host.memory_kib * 1024 : 0) || '-')}</td></tr>
        <tr><td>Disk</td><td>${escapeHTML(formatBytes(host.disk_actual_bytes) || '-')}</td></tr>
        <tr><td>Autostart</td><td>${host.autostart ? 'Yes' : 'No'}</td></tr>
      </table>
      ${guest ? `
        <div class="vm-card-divider"></div>
        <table class="vm-stats">
          <tr><td>Gateway</td><td style="${guest.service_active ? 'color:var(--success)' : 'color:var(--error)'}">${guest.service_active ? 'Active' : 'Inactive'}</td></tr>
          <tr><td>HTTP</td><td style="${guest.http_status === 200 ? 'color:var(--success)' : 'color:var(--error)'}">${guest.http_status || 'N/A'}</td></tr>
          <tr><td>Version</td><td>${escapeHTML(guest.version || '-')}</td></tr>
          <tr><td>Guest Mem</td><td>${guest.memory_percent !== undefined ? guest.memory_percent.toFixed(1) : '-'}%</td></tr>
          <tr><td>Guest Disk</td><td>${guest.disk_percent !== undefined ? guest.disk_percent.toFixed(1) : '-'}%</td></tr>
          <tr><td>Load</td><td>${guest.load_average !== undefined ? guest.load_average.toFixed(2) : '-'}</td></tr>
          <tr><td>Uptime</td><td>${escapeHTML(guest.service_uptime || '-')}</td></tr>
        </table>
      ` : ''}
      ${issues && Object.values(issues).some(Boolean) ? `
        <div class="vm-card-divider"></div>
        <div class="vm-issues-label">Issues</div>
        <div class="vm-issues">
          ${Object.entries(issues).filter(([, value]) => value).map(([key]) => `
            <span class="issue ${escapeHTML(key)}">${escapeHTML(key.replace(/_/g, ' '))}</span>
          `).join('')}
        </div>
      ` : ''}
    </div>
  `;
}

function renderServiceCard(svc) {
  const role = svc.role || 'unknown';
  switch (role) {
    case 'llm-proxy': return renderLLMProxyCard(svc);
    case 'db':        return renderDBCard(svc);
    case 'search':    return renderSearchCard(svc);
    case 'mcp':       return renderMCPCard(svc);
    case 'voice':     return renderVoiceCard(svc);
    case 'automation':return renderAutomationCard(svc);
    default:          return renderGenericServiceCard(svc);
  }
}

function serviceCardHeader(svc) {
  return `
    <div class="service-card-header">
      <div>
        <div class="service-card-name">${escapeHTML(svc.name)}</div>
        <div class="service-role-tag">${escapeHTML(svc.role || '')}</div>
      </div>
      <span class="service-badge ${escapeHTML(svc.status || 'down')}">${escapeHTML(svc.status || 'down')}</span>
    </div>
  `;
}

function serviceStatRow(label, value, valueClass) {
  return `
    <div class="service-stat-row">
      <span class="service-stat-label">${escapeHTML(label)}</span>
      <span class="service-stat-value${valueClass ? ' ' + valueClass : ''}">${value}</span>
    </div>
  `;
}

function formatUptime(sec) {
  if (!sec) return '-';
  if (sec < 60) return sec + 's';
  if (sec < 3600) return Math.floor(sec / 60) + 'm';
  if (sec < 86400) return Math.floor(sec / 3600) + 'h ' + Math.floor((sec % 3600) / 60) + 'm';
  return Math.floor(sec / 86400) + 'd ' + Math.floor((sec % 86400) / 3600) + 'h';
}

function renderLLMProxyCard(svc) {
  const extra = svc.extra || {};
  const modelCount = extra.model_count;
  const cooldowns = extra.cooldown_count || 0;
  const httpStatus = svc.http_status;
  const httpClass = httpStatus === 200 ? 'ok' : httpStatus ? 'bad' : '';

  return `
    <div class="service-card">
      ${serviceCardHeader(svc)}
      <div style="display:flex;align-items:baseline;gap:0.5rem">
        <span class="llm-model-count">${modelCount !== undefined ? modelCount : '-'}</span>
        <span class="llm-model-label">models</span>
      </div>
      ${cooldowns > 0 ? `<div class="llm-cooldown-banner">⚠ ${cooldowns} model${cooldowns > 1 ? 's' : ''} in cooldown</div>` : ''}
      <div class="service-stats">
        ${serviceStatRow('HTTP', httpStatus ? String(httpStatus) : '-', httpClass)}
        ${serviceStatRow('Uptime', formatUptime(svc.uptime_sec), '')}
        ${serviceStatRow('Container', escapeHTML(svc.container_state || '-'), svc.container_state === 'running' ? 'ok' : 'bad')}
      </div>
    </div>
  `;
}

function renderDBCard(svc) {
  const healthClass = svc.health_state === 'healthy' ? 'ok' : svc.health_state === 'unhealthy' ? 'bad' : '';
  return `
    <div class="service-card">
      ${serviceCardHeader(svc)}
      <div class="service-stats">
        ${serviceStatRow('Health', escapeHTML(svc.health_state || 'none'), healthClass)}
        ${serviceStatRow('Uptime', formatUptime(svc.uptime_sec), '')}
        ${serviceStatRow('Container', escapeHTML(svc.container_state || '-'), svc.container_state === 'running' ? 'ok' : 'bad')}
      </div>
    </div>
  `;
}

function renderSearchCard(svc) {
  const extra = svc.extra || {};
  const ms = extra.response_ms;
  const httpStatus = svc.http_status;
  const httpClass = httpStatus === 200 ? 'ok' : httpStatus ? 'bad' : '';
  return `
    <div class="service-card">
      ${serviceCardHeader(svc)}
      <div class="service-stats">
        ${serviceStatRow('HTTP', httpStatus ? String(httpStatus) : '-', httpClass)}
        ${ms !== undefined ? serviceStatRow('Response', ms + 'ms', ms < 500 ? 'ok' : 'warn') : ''}
        ${serviceStatRow('Uptime', formatUptime(svc.uptime_sec), '')}
      </div>
    </div>
  `;
}

function renderMCPCard(svc) {
  const extra = svc.extra || {};
  const reachable = extra.port_reachable;
  return `
    <div class="service-card">
      ${serviceCardHeader(svc)}
      <div class="service-stats">
        ${reachable !== undefined ? serviceStatRow('Port', reachable ? 'reachable' : 'unreachable', reachable ? 'ok' : 'bad') : ''}
        ${serviceStatRow('Container', escapeHTML(svc.container_state || '-'), svc.container_state === 'running' ? 'ok' : 'bad')}
        ${serviceStatRow('Uptime', formatUptime(svc.uptime_sec), '')}
      </div>
    </div>
  `;
}

function renderVoiceCard(svc) {
  const healthClass = svc.health_state === 'healthy' ? 'ok' : svc.health_state === 'unhealthy' ? 'bad' : '';
  return `
    <div class="service-card">
      ${serviceCardHeader(svc)}
      <div class="service-stats">
        ${serviceStatRow('Health', escapeHTML(svc.health_state || 'none'), healthClass)}
        ${serviceStatRow('Container', escapeHTML(svc.container_state || '-'), svc.container_state === 'running' ? 'ok' : 'bad')}
        ${serviceStatRow('Uptime', formatUptime(svc.uptime_sec), '')}
      </div>
    </div>
  `;
}

function renderAutomationCard(svc) {
  const healthClass = svc.health_state === 'healthy' ? 'ok' : svc.health_state === 'unhealthy' ? 'bad' : '';
  return `
    <div class="service-card">
      ${serviceCardHeader(svc)}
      <div class="service-stats">
        ${serviceStatRow('Health', escapeHTML(svc.health_state || 'none'), healthClass)}
        ${serviceStatRow('Container', escapeHTML(svc.container_state || '-'), svc.container_state === 'running' ? 'ok' : 'bad')}
        ${serviceStatRow('Uptime', formatUptime(svc.uptime_sec), '')}
      </div>
    </div>
  `;
}

function renderGenericServiceCard(svc) {
  return `
    <div class="service-card">
      ${serviceCardHeader(svc)}
      <div class="service-stats">
        ${serviceStatRow('Container', escapeHTML(svc.container_state || '-'), svc.container_state === 'running' ? 'ok' : 'bad')}
        ${serviceStatRow('Uptime', formatUptime(svc.uptime_sec), '')}
      </div>
    </div>
  `;
}

Step 8: Verify build

Run: cd /home/will/lab/agentmon && go build ./...

Expected: no errors

Step 9: Commit

git add cmd/web-ui/static/app.js cmd/web-ui/static/index.html
git commit -m "feat: rename OpenClaw to Infrastructure page, add service cards"

Task 8: End-to-end verification

Step 1: Build all binaries

Run: cd /home/will/lab/agentmon && go build ./...

Expected: no errors

Step 2: Test docker label filtering manually

Run: docker ps -a --filter label=agentmon.monitor=true --format "table {{.Names}}\t{{.Labels}}\t{{.Status}}"

Expected: lists the labeled swarm containers, running or stopped (the -a flag includes stopped ones), with their labels

Step 3: Test swarm-monitor dry run

Run:

cd /home/will/lab/agentmon
NATS_URL=nats://localhost:4222 LITELLM_MASTER_KEY=$(source /home/will/lab/swarm/.env && echo $LITELLM_MASTER_KEY) \
  go run ./cmd/swarm-monitor/ 2>&1 | head -20

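Expected: logs "swarm-monitor started" and runs an initial poll without errors. Note that main calls log.Fatalf when qnats.NewPublisher returns an error, so if NATS is not reachable at nats://localhost:4222 the binary may exit with "failed to connect to NATS" before any collection runs. In that case start NATS first, or check the collection side on its own with the docker ps command from Step 2.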
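
When inspecting a published event during the dry run, it helps to know the envelope shape. The sketch below decodes a swarm.snapshot event in the shape that emit builds in Task 4 and lists the service names. The envelope struct, serviceNames, and the sample JSON are illustrative; the sample is hand-written, not captured output.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// envelope mirrors the map built by emit in cmd/swarm-monitor/main.go.
type envelope struct {
	Schema struct {
		Name    string `json:"name"`
		Version int    `json:"version"`
	} `json:"schema"`
	Event struct {
		ID   string `json:"id"`
		Type string `json:"type"`
		TS   string `json:"ts"`
	} `json:"event"`
	Payload map[string]any `json:"payload"`
}

// serviceNames extracts payload.services[].name from a swarm.snapshot event.
func serviceNames(raw []byte) ([]string, error) {
	var env envelope
	if err := json.Unmarshal(raw, &env); err != nil {
		return nil, err
	}
	if env.Event.Type != "swarm.snapshot" {
		return nil, fmt.Errorf("unexpected event type %q", env.Event.Type)
	}
	services, _ := env.Payload["services"].([]any)
	var names []string
	for _, s := range services {
		if m, ok := s.(map[string]any); ok {
			if name, ok := m["name"].(string); ok {
				names = append(names, name)
			}
		}
	}
	return names, nil
}

func main() {
	raw := []byte(`{
	  "schema": {"name": "agentmon.swarm", "version": 1},
	  "event": {"id": "20260318095751-ab12cd34", "type": "swarm.snapshot", "ts": "2026-03-18T16:57:51Z"},
	  "payload": {
	    "services": [
	      {"name": "litellm", "status": "healthy"},
	      {"name": "searxng", "status": "degraded"}
	    ],
	    "issues": {"service_degraded": ["searxng"]}
	  }
	}`)

	names, err := serviceNames(raw)
	if err != nil {
		panic(err)
	}
	fmt.Println(names) // [litellm searxng]
}
```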

Step 4: Navigate to /infrastructure in browser

Open the web UI and navigate to /infrastructure. Verify:

  • Nav shows "Infra" link, active when on /infrastructure
  • VMs section shows existing openclaw cards
  • Services section shows either cards (if swarm events exist in DB) or "No swarm service data"

Step 5: Verify swarm strip on dashboard

Navigate to /. Verify:

  • VM strip still shows (zap/orb/sun)
  • Swarm strip renders below it (may be empty if no swarm.snapshot events in DB yet)

Step 6: Final commit if any fixes needed

git add -A
git commit -m "fix: infrastructure page and swarm strip polish"