agentmon/docs/plans/2026-03-18-swarm-monitor-plan.md
2026-03-18 09:57:51 -07:00
# Swarm Monitor Implementation Plan
> **For Claude:** REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.
**Goal:** Add a `swarm-monitor` binary that polls docker-compose services in `~/lab/swarm`, emits `swarm.snapshot` and `swarm.service.snapshot` events to NATS, and surfaces service status on the dashboard strip and a new unified `/infrastructure` page (replacing `/openclaw`).
**Architecture:** New `cmd/swarm-monitor/main.go` polls via `docker inspect` exec commands and HTTP probes, emitting two event types per poll. The existing NATS → event-processor → postgres → query-api pipeline requires zero changes. Frontend adds a swarm strip to the dashboard and merges VM cards + service cards on a renamed `/infrastructure` page.
**Tech Stack:** Go (exec/docker CLI, net/http), vanilla JS, existing NATS publisher pattern
---
### Task 1: Add agentmon labels to docker-compose.yaml
**Files:**
- Modify: `/home/will/lab/swarm/docker-compose.yaml`
**Step 1: Add labels to each service**
Add a `labels:` block to each monitored service. `litellm-init` is a one-shot container — do NOT label it.
For `whisper-server` (after its `healthcheck:` block):
```yaml
    labels:
      agentmon.monitor: "true"
      agentmon.role: "voice"
      agentmon.port: "18801"
```
For `kokoro-tts` (after `restart: unless-stopped`):
```yaml
    labels:
      agentmon.monitor: "true"
      agentmon.role: "voice"
      agentmon.port: "18805"
```
For `brave-search` (after its `environment:` block):
```yaml
    labels:
      agentmon.monitor: "true"
      agentmon.role: "mcp"
      agentmon.port: "18802"
```
For `searxng` (after its `volumes:` block):
```yaml
    labels:
      agentmon.monitor: "true"
      agentmon.role: "search"
      agentmon.port: "18803"
```
For `litellm` (after its `healthcheck:` block):
```yaml
    labels:
      agentmon.monitor: "true"
      agentmon.role: "llm-proxy"
      agentmon.port: "18804"
```
For `litellm-db` (after its `healthcheck:` block):
```yaml
    labels:
      agentmon.monitor: "true"
      agentmon.role: "db"
```
For `n8n-agent` (after its `healthcheck:` block):
```yaml
    labels:
      agentmon.monitor: "true"
      agentmon.role: "automation"
      agentmon.port: "18808"
```
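**Step 2: Verify labels appear in running containers**
Labels are applied when a container is created, so recreate the affected services first (for example with `docker compose up -d` in `/home/will/lab/swarm`, using whichever profiles are active).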
Run: `docker ps --filter label=agentmon.monitor=true --format "table {{.Names}}\t{{.Status}}"`
Expected: lists currently-running swarm containers (whichever profiles are active).
**Step 3: Commit**
```bash
cd /home/will/lab/swarm
git add docker-compose.yaml
git commit -m "feat: add agentmon monitor labels to swarm services"
```
---
### Task 2: Create swarm types
**Files:**
- Create: `internal/monitor/swarm/types.go`
**Step 1: Create the types file**
```go
package swarm

import "time"

// ServiceSnapshot holds the collected state for one docker-compose service.
type ServiceSnapshot struct {
	Name           string         `json:"name"`
	Role           string         `json:"role"`
	ContainerState string         `json:"container_state"` // running/stopped/exited/missing
	HealthState    string         `json:"health_state"`    // healthy/unhealthy/starting/none
	Status         string         `json:"status"`          // healthy/degraded/down
	UptimeSec      int64          `json:"uptime_sec,omitempty"`
	HTTPStatus     *int           `json:"http_status,omitempty"`
	Extra          map[string]any `json:"extra,omitempty"`
}

// SwarmSnapshot holds a rolled-up snapshot of all labeled services.
type SwarmSnapshot struct {
	Services  []ServiceSnapshot `json:"services"`
	Issues    Issues            `json:"issues"`
	Timestamp time.Time         `json:"timestamp"`
}

// Issues flags notable problems detected during a poll.
type Issues struct {
	ServiceDown     []string `json:"service_down,omitempty"`
	ServiceDegraded []string `json:"service_degraded,omitempty"`
	LLMCooldowns    bool     `json:"llm_cooldowns,omitempty"`
}
```
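To make the wire format concrete, the following is a minimal sketch of how a `ServiceSnapshot` serializes and what happens to `Extra` values after a JSON round trip. The struct is copied from the types above; `encodeSnapshot` and `extraTypeAfterRoundTrip` are hypothetical helper names used only for illustration and are not part of the plan.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ServiceSnapshot mirrors the struct defined in types.go above.
type ServiceSnapshot struct {
	Name           string         `json:"name"`
	Role           string         `json:"role"`
	ContainerState string         `json:"container_state"`
	HealthState    string         `json:"health_state"`
	Status         string         `json:"status"`
	UptimeSec      int64          `json:"uptime_sec,omitempty"`
	HTTPStatus     *int           `json:"http_status,omitempty"`
	Extra          map[string]any `json:"extra,omitempty"`
}

// encodeSnapshot (hypothetical helper) returns the JSON form of a snapshot.
// Zero-valued omitempty fields, such as UptimeSec == 0, are dropped.
func encodeSnapshot(s ServiceSnapshot) string {
	b, err := json.Marshal(s)
	if err != nil {
		return ""
	}
	return string(b)
}

// extraTypeAfterRoundTrip (hypothetical helper) reports the dynamic Go type of
// Extra[key] after the snapshot has been marshalled and unmarshalled again.
func extraTypeAfterRoundTrip(s ServiceSnapshot, key string) string {
	b, err := json.Marshal(s)
	if err != nil {
		return ""
	}
	var back ServiceSnapshot
	if err := json.Unmarshal(b, &back); err != nil {
		return ""
	}
	return fmt.Sprintf("%T", back.Extra[key])
}
```

Two things follow from this sketch. A nil `HTTPStatus` pointer is omitted entirely, which is why the frontend can treat a missing `http_status` as "not probed". And any number stored in `Extra` as an `int` comes back as `float64` once decoded into `map[string]any`, so a type assertion such as `.(int)` only succeeds on values that never left the process.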
**Step 2: Verify it compiles**
Run: `cd /home/will/lab/agentmon && go build ./internal/monitor/swarm/`
Expected: no errors
**Step 3: Commit**
```bash
git add internal/monitor/swarm/types.go
git commit -m "feat: add swarm monitor types"
```
---
### Task 3: Create swarm collector
**Files:**
- Create: `internal/monitor/swarm/collector.go`
**Step 1: Create the collector**
```go
package swarm
import (
"context"
"encoding/json"
"fmt"
"net/http"
"os/exec"
"strconv"
"strings"
"time"
)
// Config holds collector configuration.
type Config struct {
LiteLLMBaseURL string
LiteLLMAPIKey string
HTTPTimeout time.Duration
}
// dockerPsEntry is the JSON shape from `docker ps --format '{{json .}}'`.
type dockerPsEntry struct {
ID string `json:"ID"`
Names string `json:"Names"`
Status string `json:"Status"`
State string `json:"State"`
}
// dockerInspectEntry is the minimal shape we need from `docker inspect`.
type dockerInspectEntry struct {
Name string `json:"Name"`
State struct {
Status string `json:"Status"`
Running bool `json:"Running"`
StartedAt string `json:"StartedAt"`
Health *struct {
Status string `json:"Status"`
} `json:"Health"`
} `json:"State"`
Config struct {
Labels map[string]string `json:"Labels"`
} `json:"Config"`
}
// CollectAll lists all containers labeled agentmon.monitor=true and collects
// a ServiceSnapshot for each.
func CollectAll(ctx context.Context, cfg Config) ([]ServiceSnapshot, error) {
// List labeled containers (running + stopped).
out, err := exec.CommandContext(ctx, "docker", "ps", "-a",
"--filter", "label=agentmon.monitor=true",
"--format", "{{json .}}",
).Output()
if err != nil {
return nil, fmt.Errorf("docker ps failed: %w", err)
}
var entries []dockerPsEntry
for _, line := range strings.Split(strings.TrimSpace(string(out)), "\n") {
if line == "" {
continue
}
var e dockerPsEntry
if err := json.Unmarshal([]byte(line), &e); err != nil {
continue
}
entries = append(entries, e)
}
client := &http.Client{Timeout: cfg.HTTPTimeout}
var snapshots []ServiceSnapshot
for _, e := range entries {
snap := collectOne(ctx, e.Names, client, cfg)
snapshots = append(snapshots, snap)
}
return snapshots, nil
}
func collectOne(ctx context.Context, name string, client *http.Client, cfg Config) ServiceSnapshot {
snap := ServiceSnapshot{
Name: name,
ContainerState: "missing",
HealthState: "none",
Status: "down",
}
// Inspect for detailed state.
out, err := exec.CommandContext(ctx, "docker", "inspect", "--format", "{{json .}}", name).Output()
if err != nil {
return snap
}
var detail dockerInspectEntry
if err := json.Unmarshal(out, &detail); err != nil {
return snap
}
snap.Role = detail.Config.Labels["agentmon.role"]
snap.ContainerState = detail.State.Status
if detail.State.Health != nil {
snap.HealthState = detail.State.Health.Status
}
// Calculate uptime if running.
if detail.State.Running && detail.State.StartedAt != "" {
if t, err := time.Parse(time.RFC3339Nano, detail.State.StartedAt); err == nil {
snap.UptimeSec = int64(time.Since(t).Seconds())
}
}
// Role-specific probes.
switch snap.Role {
case "llm-proxy":
collectLLMProxy(ctx, &snap, client, cfg)
case "search":
collectHTTPProbe(ctx, &snap, client, "http://localhost:"+detail.Config.Labels["agentmon.port"]+"/")
case "mcp":
collectPortProbe(ctx, &snap, detail.Config.Labels["agentmon.port"])
case "db", "voice", "automation":
// Docker healthcheck state is sufficient; no HTTP probe.
}
snap.Status = deriveStatus(snap)
return snap
}
func collectLLMProxy(ctx context.Context, snap *ServiceSnapshot, client *http.Client, cfg Config) {
	if snap.Extra == nil {
		snap.Extra = make(map[string]any)
	}

	// Health probe. Check the request error: a malformed base URL yields a nil
	// request, and passing nil to client.Do would panic.
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, cfg.LiteLLMBaseURL+"/health/liveliness", nil)
	if err != nil {
		return
	}
	if resp, err := client.Do(req); err == nil {
		code := resp.StatusCode
		snap.HTTPStatus = &code
		resp.Body.Close()
	}

	// Model count (only attempted when an API key is configured).
	if cfg.LiteLLMAPIKey == "" {
		return
	}
	req, err = http.NewRequestWithContext(ctx, http.MethodGet, cfg.LiteLLMBaseURL+"/v2/model/info", nil)
	if err != nil {
		return
	}
	req.Header.Set("Authorization", "Bearer "+cfg.LiteLLMAPIKey)
	resp, err := client.Do(req)
	if err != nil {
		return
	}
	defer resp.Body.Close()
	// Skip error responses so an auth failure is not reported as zero models.
	if resp.StatusCode != http.StatusOK {
		return
	}
	var result struct {
		Data []struct {
			ModelName string `json:"model_name"`
		} `json:"data"`
	}
	if json.NewDecoder(resp.Body).Decode(&result) == nil {
		snap.Extra["model_count"] = len(result.Data)
	}
}
func collectHTTPProbe(ctx context.Context, snap *ServiceSnapshot, client *http.Client, url string) {
	// Check the request error: a malformed URL yields a nil request, and
	// passing nil to client.Do would panic.
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return
	}
	start := time.Now()
	resp, err := client.Do(req)
	if err != nil {
		return
	}
	resp.Body.Close()
	code := resp.StatusCode
	snap.HTTPStatus = &code
	if snap.Extra == nil {
		snap.Extra = make(map[string]any)
	}
	snap.Extra["response_ms"] = time.Since(start).Milliseconds()
}
func collectPortProbe(ctx context.Context, snap *ServiceSnapshot, port string) {
if port == "" {
return
}
// Use nc to check TCP reachability.
err := exec.CommandContext(ctx, "nc", "-z", "-w1", "localhost", port).Run()
reachable := err == nil
if snap.Extra == nil {
snap.Extra = make(map[string]any)
}
snap.Extra["port_reachable"] = reachable
}
// deriveStatus computes the overall status from container state + health + probes.
func deriveStatus(snap ServiceSnapshot) string {
if snap.ContainerState != "running" {
return "down"
}
if snap.HealthState == "unhealthy" {
return "degraded"
}
if snap.HTTPStatus != nil && (*snap.HTTPStatus < 200 || *snap.HTTPStatus >= 400) {
return "degraded"
}
if reachable, ok := snap.Extra["port_reachable"].(bool); ok && !reachable {
return "degraded"
}
return "healthy"
}
// DetectIssues scans a set of snapshots for notable problems.
func DetectIssues(services []ServiceSnapshot) Issues {
issues := Issues{}
for _, s := range services {
switch s.Status {
case "down":
issues.ServiceDown = append(issues.ServiceDown, s.Name)
case "degraded":
issues.ServiceDegraded = append(issues.ServiceDegraded, s.Name)
}
if s.Role == "llm-proxy" {
if extra := s.Extra; extra != nil {
if count, ok := extra["cooldown_count"].(int); ok && count > 0 {
issues.LLMCooldowns = true
}
}
}
}
return issues
}
func intPtr(v int) *int { return &v }

var _ = intPtr       // keep the helper referenced
var _ = strconv.Itoa // imported for potential future use
```
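`collectPortProbe` shells out to `nc`, which assumes that binary is installed wherever the monitor runs. As a hedged alternative, the same TCP reachability check can be sketched with the standard library's `net.Dialer`. The names `portReachable` and `demoPortProbe` are hypothetical and do not appear in the plan; the one-second timeout mirrors the `-w1` flag.

```go
package main

import (
	"context"
	"net"
	"time"
)

// portReachable (hypothetical stand-in for the `nc -z -w1` call) reports
// whether a TCP connection to host:port can be opened before the timeout
// elapses or ctx is cancelled.
func portReachable(ctx context.Context, host, port string, timeout time.Duration) bool {
	d := net.Dialer{Timeout: timeout}
	conn, err := d.DialContext(ctx, "tcp", net.JoinHostPort(host, port))
	if err != nil {
		return false
	}
	conn.Close()
	return true
}

// demoPortProbe opens a throwaway listener on a free loopback port, probes it,
// closes it, and probes again. It returns the two probe results in that order.
func demoPortProbe() (whileOpen, afterClose bool) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return false, false
	}
	_, port, _ := net.SplitHostPort(ln.Addr().String())
	ctx := context.Background()
	whileOpen = portReachable(ctx, "127.0.0.1", port, time.Second)
	ln.Close()
	afterClose = portReachable(ctx, "127.0.0.1", port, time.Second)
	return whileOpen, afterClose
}
```

Inside `collectPortProbe`, the result of `portReachable(ctx, "localhost", port, time.Second)` could be stored in `snap.Extra["port_reachable"]` exactly as the `nc` result is today. Whether to drop the `nc` dependency is a design choice this plan does not make.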
**Step 2: Verify it compiles**
Run: `cd /home/will/lab/agentmon && go build ./internal/monitor/swarm/`
Expected: no errors
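Compiling does not exercise the status rules, so a small table of cases can help confirm the precedence in `deriveStatus` (container state first, then healthcheck, then HTTP status, then port reachability). The sketch below copies `deriveStatus` from the collector and trims `ServiceSnapshot` to the fields it reads; `statusMismatches` is a hypothetical helper name, and the cases are illustrative rather than exhaustive.

```go
package main

import "fmt"

// ServiceSnapshot is trimmed to the fields deriveStatus reads.
type ServiceSnapshot struct {
	ContainerState string
	HealthState    string
	HTTPStatus     *int
	Extra          map[string]any
}

// deriveStatus is copied from collector.go above.
func deriveStatus(snap ServiceSnapshot) string {
	if snap.ContainerState != "running" {
		return "down"
	}
	if snap.HealthState == "unhealthy" {
		return "degraded"
	}
	if snap.HTTPStatus != nil && (*snap.HTTPStatus < 200 || *snap.HTTPStatus >= 400) {
		return "degraded"
	}
	if reachable, ok := snap.Extra["port_reachable"].(bool); ok && !reachable {
		return "degraded"
	}
	return "healthy"
}

func intPtr(v int) *int { return &v }

// statusMismatches (hypothetical helper) runs a table of cases and returns a
// description of every case whose derived status differs from the expectation.
func statusMismatches() []string {
	cases := []struct {
		name string
		snap ServiceSnapshot
		want string
	}{
		{"exited container", ServiceSnapshot{ContainerState: "exited"}, "down"},
		{"unhealthy healthcheck", ServiceSnapshot{ContainerState: "running", HealthState: "unhealthy"}, "degraded"},
		{"HTTP 503", ServiceSnapshot{ContainerState: "running", HTTPStatus: intPtr(503)}, "degraded"},
		{"HTTP 302", ServiceSnapshot{ContainerState: "running", HTTPStatus: intPtr(302)}, "healthy"},
		{"port closed", ServiceSnapshot{ContainerState: "running", Extra: map[string]any{"port_reachable": false}}, "degraded"},
		{"healthcheck starting", ServiceSnapshot{ContainerState: "running", HealthState: "starting"}, "healthy"},
		{"no healthcheck, no probe", ServiceSnapshot{ContainerState: "running", HealthState: "none"}, "healthy"},
	}
	var bad []string
	for _, c := range cases {
		if got := deriveStatus(c.snap); got != c.want {
			bad = append(bad, fmt.Sprintf("%s: got %q, want %q", c.name, got, c.want))
		}
	}
	return bad
}
```

The same table could be moved into a `collector_test.go` file as a regular `go test` case. Note two behaviours the table makes visible: a redirect (3xx) counts as healthy, and a container whose healthcheck is still `starting` is reported as healthy.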
**Step 3: Commit**
```bash
git add internal/monitor/swarm/collector.go
git commit -m "feat: add swarm collector with docker inspect + HTTP probes"
```
---
### Task 4: Create swarm-monitor binary
**Files:**
- Create: `cmd/swarm-monitor/main.go`
**Step 1: Create the binary**
```go
package main
import (
"context"
"encoding/json"
"log"
"os"
"time"
"agentmon/internal/monitor/swarm"
qnats "agentmon/internal/queue/nats"
)
func main() {
natsURL := envDefault("NATS_URL", "nats://nats:4222")
natsTopic := envDefault("NATS_TOPIC", "agentmon.events.v1")
interval := envDefault("POLL_INTERVAL", "30s")
litellmBase := envDefault("LITELLM_BASE_URL", "http://localhost:18804")
litellmKey := os.Getenv("LITELLM_MASTER_KEY")
pub, err := qnats.NewPublisher(natsURL, natsTopic)
if err != nil {
log.Fatalf("failed to connect to NATS: %v", err)
}
defer pub.Close()
pollDuration, err := time.ParseDuration(interval)
if err != nil {
log.Fatalf("invalid poll interval: %v", err)
}
cfg := swarm.Config{
LiteLLMBaseURL: litellmBase,
LiteLLMAPIKey: litellmKey,
HTTPTimeout: 5 * time.Second,
}
ticker := time.NewTicker(pollDuration)
defer ticker.Stop()
ctx := context.Background()
log.Printf("swarm-monitor started, polling every %s", pollDuration)
// Poll immediately on start.
if err := poll(ctx, pub, cfg); err != nil {
log.Printf("initial poll error: %v", err)
}
for range ticker.C {
if err := poll(ctx, pub, cfg); err != nil {
log.Printf("poll error: %v", err)
}
}
}
func poll(ctx context.Context, pub *qnats.Publisher, cfg swarm.Config) error {
services, err := swarm.CollectAll(ctx, cfg)
if err != nil {
return err
}
issues := swarm.DetectIssues(services)
now := time.Now().UTC()
// Emit rolled-up swarm.snapshot.
if err := emit(ctx, pub, "swarm.snapshot", "agentmon.swarm", map[string]any{
"services": services,
"issues": issues,
}, now); err != nil {
log.Printf("failed to emit swarm.snapshot: %v", err)
}
// Emit one swarm.service.snapshot per service.
for _, svc := range services {
if err := emit(ctx, pub, "swarm.service.snapshot", "agentmon.swarm.service", map[string]any{
"service": svc,
}, now); err != nil {
log.Printf("failed to emit swarm.service.snapshot for %s: %v", svc.Name, err)
}
}
return nil
}
func emit(ctx context.Context, pub *qnats.Publisher, eventType, schemaName string, payload map[string]any, ts time.Time) error {
event := map[string]any{
"schema": map[string]any{
"name": schemaName,
"version": 1,
},
"event": map[string]any{
"id": generateID(),
"type": eventType,
"ts": ts.Format(time.RFC3339Nano),
},
"payload": payload,
}
data, err := json.Marshal(event)
if err != nil {
return err
}
return pub.Publish(ctx, data)
}
func generateID() string {
return time.Now().Format("20060102150405") + "-" + randomString(8)
}
func randomString(n int) string {
const chars = "abcdefghijklmnopqrstuvwxyz0123456789"
b := make([]byte, n)
for i := range b {
b[i] = chars[time.Now().Nanosecond()%len(chars)]
time.Sleep(time.Nanosecond)
}
return string(b)
}
func envDefault(key, def string) string {
if v := os.Getenv(key); v != "" {
return v
}
return def
}
```
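`randomString` above derives each character from the clock's nanosecond field and sleeps between characters, which is slow and only weakly random. As a hedged alternative, an ID with the same overall shape (timestamp, dash, 8 characters) can be sketched with `crypto/rand`. The name `newEventID` is hypothetical, and this variant makes two assumptions that differ from the original: the suffix is hexadecimal rather than `a-z0-9`, and the timestamp is rendered in UTC.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"time"
)

// newEventID (hypothetical replacement for generateID) returns a UTC timestamp
// in the layout 20060102150405, a dash, and 8 random hex characters.
func newEventID() (string, error) {
	buf := make([]byte, 4) // 4 random bytes encode to 8 hex characters
	if _, err := rand.Read(buf); err != nil {
		return "", err
	}
	return time.Now().UTC().Format("20060102150405") + "-" + hex.EncodeToString(buf), nil
}
```

If this were adopted, `emit` would need to handle the returned error before building the envelope. Whether event IDs need stronger uniqueness guarantees than the original provides depends on how the event-processor deduplicates, which this plan does not specify.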
**Step 2: Verify it compiles**
Run: `cd /home/will/lab/agentmon && go build ./cmd/swarm-monitor/`
Expected: no errors
**Step 3: Verify all binaries still build**
Run: `cd /home/will/lab/agentmon && go build ./...`
Expected: no errors
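The loop in `main` runs on `context.Background()` and never exits, so in-flight `docker` commands are not cancelled on shutdown. One possible shape for a cancellable loop is sketched below. `runPollLoop` and `demoPollLoop` are hypothetical names, and the poll callback stands in for the `poll(ctx, pub, cfg)` call with its error logging.

```go
package main

import (
	"context"
	"time"
)

// runPollLoop (hypothetical) polls once immediately and then on every tick
// until ctx is cancelled. It returns the number of polls that ran.
func runPollLoop(ctx context.Context, interval time.Duration, poll func(context.Context)) int {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	polls := 0
	for {
		poll(ctx)
		polls++
		// Stop right away if the context was cancelled during the poll.
		if ctx.Err() != nil {
			return polls
		}
		select {
		case <-ctx.Done():
			return polls
		case <-ticker.C:
		}
	}
}

// demoPollLoop cancels the context from inside the third poll and returns how
// many polls ran in total.
func demoPollLoop() int {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	seen := 0
	return runPollLoop(ctx, time.Millisecond, func(context.Context) {
		seen++
		if seen == 3 {
			cancel()
		}
	})
}
```

In the binary, the context could come from `signal.NotifyContext(context.Background(), os.Interrupt, syscall.SIGTERM)`, so that Ctrl-C or a container stop cancels the `exec.CommandContext` and HTTP calls already in progress.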
**Step 4: Commit**
```bash
git add cmd/swarm-monitor/main.go
git commit -m "feat: add swarm-monitor binary"
```
---
### Task 5: Dashboard swarm strip
**Files:**
- Modify: `cmd/web-ui/static/app.js`
- Modify: `cmd/web-ui/static/style.css`
**Step 1: Add swarmState and merge function to app.js**
Near the top of the IIFE, alongside the existing `let openclawState = ...` declaration (line ~49), add:
```js
let swarmState = { services: {} }; // keyed by service name
```
After the existing `mergeOpenClawEvents` function (~line 716), add:
```js
function mergeSwarmSnapshot(evt) {
const payload = getEnvelopePayload(evt);
const services = payload.services || [];
for (const svc of services) {
if (svc.name) swarmState.services[svc.name] = svc;
}
}
function mergeSwarmServiceSnapshot(evt) {
const payload = getEnvelopePayload(evt);
const svc = payload.service;
if (svc && svc.name) swarmState.services[svc.name] = svc;
}
```
**Step 2: Add swarm strip to renderDashboard**
In `renderDashboard()`, the HTML template already has:
```html
<div class="vm-strip" id="dash-vm-strip" style="margin-bottom:1.5rem"></div>
```
Right after that line, add a swarm strip div:
```html
<div class="swarm-strip" id="dash-swarm-strip"></div>
```
**Step 3: Add renderSwarmStrip function**
After the `renderAgentVMStrip_dash` function (~line 1351), add:
```js
function renderSwarmStrip_dash() {
const strip = document.getElementById('dash-swarm-strip');
if (!strip) return;
const services = Object.values(swarmState.services);
if (services.length === 0) return;
strip.innerHTML = services.map(svc => {
const statusClass = svc.status === 'healthy' ? 'active'
: svc.status === 'degraded' ? 'degraded' : 'inactive';
const label = svc.status || 'unknown';
return `
<div class="vm-pill ${statusClass}">
<span class="vm-pill-dot"></span>
<span class="vm-pill-name">${escapeHTML(svc.name)}</span>
<span class="vm-pill-label">${escapeHTML(label)}</span>
</div>
`;
}).join('');
}
```
**Step 4: Wire swarm strip into dashboard data load**
In `renderDashboard()`, the `Promise.all` block loads initial data. After `mergeOpenClawEvents(snapshots.events || [])` and `renderAgentVMStrip_dash()`, add:
```js
const swarmSnaps = await api('/v1/events?event_type=swarm.snapshot&limit=10').catch(() => ({ events: [] }));
for (const evt of swarmSnaps.events || []) mergeSwarmSnapshot(evt);
renderSwarmStrip_dash();
```
Note: the sequential fetch above must sit inside the `try` block, before the `if (!isCurrentPath('/')) return;` guard. A simpler placement that also avoids the extra round trip is to drop that separate `await` and instead add the request to the existing `Promise.all` array in `renderDashboard`:
```js
const [summaryData, tsData, recentData, snapshots, swarmSnaps] = await Promise.all([
api('/v1/stats/summary'),
api('/v1/stats/timeseries?window=1h'),
api('/v1/events?limit=20'),
api('/v1/events?event_type=openclaw.snapshot&limit=100').catch(() => ({ events: [] })),
api('/v1/events?event_type=swarm.snapshot&limit=10').catch(() => ({ events: [] })),
]);
```
Then after `renderAgentVMStrip_dash()`:
```js
for (const evt of swarmSnaps.events || []) mergeSwarmSnapshot(evt);
renderSwarmStrip_dash();
```
**Step 5: Handle swarm events in handleDashboardWS**
In `handleDashboardWS`, after the `openclaw.snapshot` handler block, add:
```js
if (eventType === 'swarm.snapshot') {
mergeSwarmSnapshot(msg.data);
renderSwarmStrip_dash();
return;
}
if (eventType === 'swarm.service.snapshot') {
mergeSwarmServiceSnapshot(msg.data);
renderSwarmStrip_dash();
return;
}
```
**Step 6: Add swarm strip CSS**
In `style.css`, after the `.vm-pill-label` block (~line 750), add:
```css
/* ── Swarm strip ──────────────────────────────────────────── */
.swarm-strip {
display: flex;
flex-wrap: wrap;
gap: 0.75rem;
margin-bottom: 1.5rem;
}
.vm-pill.degraded {
border-color: rgba(251, 191, 36, 0.3);
}
.vm-pill.degraded .vm-pill-dot {
background: var(--warning);
}
```
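**Step 7: Verify the build and check for JS errors**
Build check: `cd /home/will/lab/agentmon && go build ./...`
Expected: no errors. `go build` does not parse JavaScript, so also load the dashboard in a browser and confirm the console shows no errors.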
**Step 8: Commit**
```bash
git add cmd/web-ui/static/app.js cmd/web-ui/static/style.css
git commit -m "feat: add swarm strip to dashboard"
```
---
### Task 6: Infrastructure page CSS
**Files:**
- Modify: `cmd/web-ui/static/style.css`
**Step 1: Add infrastructure page styles**
Append to the end of `style.css`:
```css
/* ── Infrastructure page ──────────────────────────────────── */
.infra-section-title {
font-family: var(--font-display);
font-size: 0.75rem;
font-weight: 700;
color: var(--text-dim);
text-transform: uppercase;
letter-spacing: 0.12em;
margin: 0 0 1rem 0;
}
.infra-section {
margin-bottom: 2rem;
}
/* Service card grid */
.service-grid {
display: grid;
grid-template-columns: repeat(auto-fill, minmax(260px, 1fr));
gap: 1.25rem;
}
.service-card {
background: var(--surface);
border: 1px solid var(--border);
border-radius: var(--radius-lg);
padding: 1.125rem 1.25rem;
display: flex;
flex-direction: column;
gap: 0.75rem;
transition: border-color 0.2s;
}
.service-card:hover {
border-color: rgba(34, 211, 238, 0.15);
}
.service-card-header {
display: flex;
align-items: center;
justify-content: space-between;
}
.service-card-name {
font-family: var(--font-mono);
font-size: 0.88rem;
font-weight: 600;
color: var(--text-bright);
}
.service-badge {
font-size: 0.65rem;
font-weight: 700;
text-transform: uppercase;
letter-spacing: 0.08em;
padding: 0.2rem 0.55rem;
border-radius: 999px;
}
.service-badge.healthy {
background: rgba(52, 211, 153, 0.12);
color: var(--success);
border: 1px solid rgba(52, 211, 153, 0.2);
}
.service-badge.degraded {
background: rgba(251, 191, 36, 0.12);
color: var(--warning);
border: 1px solid rgba(251, 191, 36, 0.2);
}
.service-badge.down {
background: rgba(248, 113, 113, 0.12);
color: var(--error);
border: 1px solid rgba(248, 113, 113, 0.2);
}
.service-role-tag {
font-size: 0.65rem;
font-family: var(--font-mono);
color: var(--text-dim);
margin-top: -0.25rem;
}
.service-stats {
display: flex;
flex-direction: column;
gap: 0.3rem;
font-size: 0.78rem;
}
.service-stat-row {
display: flex;
justify-content: space-between;
align-items: center;
}
.service-stat-label {
color: var(--text-dim);
font-family: var(--font-mono);
font-size: 0.72rem;
}
.service-stat-value {
color: var(--text);
font-family: var(--font-mono);
font-size: 0.75rem;
}
.service-stat-value.ok { color: var(--success); }
.service-stat-value.warn { color: var(--warning); }
.service-stat-value.bad { color: var(--error); }
/* LiteLLM cooldown warning */
.llm-cooldown-banner {
background: rgba(251, 191, 36, 0.08);
border: 1px solid rgba(251, 191, 36, 0.2);
border-radius: var(--radius);
padding: 0.4rem 0.625rem;
font-size: 0.72rem;
color: var(--warning);
font-family: var(--font-mono);
}
/* LiteLLM model count highlight */
.llm-model-count {
font-family: var(--font-display);
font-size: 1.5rem;
font-weight: 800;
color: var(--text-bright);
letter-spacing: -0.02em;
line-height: 1;
}
.llm-model-label {
font-size: 0.68rem;
color: var(--text-dim);
text-transform: uppercase;
letter-spacing: 0.08em;
}
```
**Step 2: Commit**
```bash
git add cmd/web-ui/static/style.css
git commit -m "feat: add infrastructure page CSS"
```
---
### Task 7: Infrastructure page JS + nav rename
**Files:**
- Modify: `cmd/web-ui/static/app.js`
- Modify: `cmd/web-ui/static/index.html`
**Step 1: Update nav in index.html**
Change the nav link from `OpenClaw` to `Infra` and update the href:
Old:
```html
<nav><a href="/">Dashboard</a><a href="/sessions">Sessions</a><a href="/agents">Agents</a><a href="/openclaw">OpenClaw</a></nav>
```
New:
```html
<nav><a href="/">Dashboard</a><a href="/sessions">Sessions</a><a href="/agents">Agents</a><a href="/infrastructure">Infra</a></nav>
```
**Step 2: Update the router in app.js**
Change line ~153:
```js
} else if (path.startsWith('/openclaw')) {
renderOpenClaw();
```
to:
```js
} else if (path.startsWith('/infrastructure')) {
renderInfrastructure();
```
**Step 3: Add infraUnsubscribe state variable**
Near the existing `let openclawUnsubscribe = null;` declaration (~line 50), add:
```js
let infraUnsubscribe = null;
```
**Step 4: Update cleanupLiveViews to clean up infra subscription**
Find the `cleanupLiveViews` function (~line 107). Replace:
```js
if (openclawUnsubscribe) {
openclawUnsubscribe();
openclawUnsubscribe = null;
}
```
with:
```js
if (openclawUnsubscribe) {
openclawUnsubscribe();
openclawUnsubscribe = null;
}
if (infraUnsubscribe) {
infraUnsubscribe();
infraUnsubscribe = null;
}
```
**Step 5: Replace renderOpenClaw with renderInfrastructure**
Replace the existing `renderOpenClaw` function (lines ~664-680) entirely with:
```js
async function renderInfrastructure() {
app.innerHTML = '<div class="page-header"><h2>Infrastructure</h2></div><p class="empty-state">Loading...</p>';
infraUnsubscribe = subscribeWS(handleInfraWS);
try {
const [ocData, swarmData] = await Promise.all([
api('/v1/events?event_type=openclaw.snapshot&limit=100'),
api('/v1/events?event_type=swarm.snapshot&limit=10').catch(() => ({ events: [] })),
]);
mergeOpenClawEvents(ocData.events || []);
for (const evt of swarmData.events || []) mergeSwarmSnapshot(evt);
if (isCurrentPath('/infrastructure')) {
renderInfraGrid();
}
} catch (e) {
if (isCurrentPath('/infrastructure')) {
app.innerHTML = `<div class="page-header"><h2>Infrastructure</h2></div><p class="empty-state">Error: ${escapeHTML(e.message)}</p>`;
}
}
}
```
**Step 6: Replace handleOpenClawWS with handleInfraWS**
Replace the existing `handleOpenClawWS` function (lines ~682-699) with:
```js
function handleInfraWS(msg) {
if (msg.type !== 'message') return;
const eventType = getEnvelopeType(msg.data);
if (eventType === 'openclaw.snapshot') {
mergeOpenClawEvents([msg.data]);
if (isCurrentPath('/infrastructure')) renderInfraGrid();
if (isCurrentPath('/agents')) renderAgentVMStrip();
return;
}
if (eventType === 'swarm.snapshot') {
mergeSwarmSnapshot(msg.data);
if (isCurrentPath('/infrastructure')) renderInfraGrid();
renderSwarmStrip_dash();
return;
}
if (eventType === 'swarm.service.snapshot') {
mergeSwarmServiceSnapshot(msg.data);
if (isCurrentPath('/infrastructure')) renderInfraGrid();
renderSwarmStrip_dash();
return;
}
}
```
**Step 7: Add renderInfraGrid function**
Delete the existing `renderOpenClawGrid` function (lines ~718-785) and add a new `renderInfraGrid`, which renders both VM cards and service cards, right after the new `handleInfraWS` function:
```js
function renderInfraGrid() {
const vmNames = Object.keys(openclawState.instances).sort();
const services = Object.values(swarmState.services);
app.innerHTML = `
<div class="page-header">
<h2>Infrastructure <span class="live-indicator"><span class="live-dot"></span>Live</span></h2>
</div>
<div class="infra-section">
<p class="infra-section-title">VMs</p>
${vmNames.length === 0
? '<p class="empty-state">No VM data</p>'
: `<div class="vm-grid">${vmNames.map(name => renderVMCard(name)).join('')}</div>`
}
</div>
<div class="infra-section">
<p class="infra-section-title">Services</p>
${services.length === 0
? '<p class="empty-state">No swarm service data</p>'
: `<div class="service-grid">${services.map(svc => renderServiceCard(svc)).join('')}</div>`
}
</div>
`;
}
function renderVMCard(name) {
const evt = openclawState.instances[name];
const payload = getEnvelopePayload(evt);
const inst = payload.instance || {};
const host = payload.host || {};
const guest = payload.guest;
const issues = payload.issues;
return `
<div class="vm-card">
<div class="vm-card-header">
<h3>${escapeHTML(inst.name || name)}</h3>
<div class="vm-status ${host.state === 'running' ? 'running' : 'stopped'}">
${host.state === 'running' ? 'Running' : 'Stopped'}
</div>
</div>
<div class="vm-updated">Updated ${escapeHTML(relativeTime(getEnvelopeTS(evt)))}</div>
<table class="vm-stats">
<tr><td>Host</td><td>${escapeHTML(inst.host || '-')}</td></tr>
<tr><td>Domain</td><td>${escapeHTML(inst.domain || '-')}</td></tr>
<tr><td>vCPUs</td><td>${host.vcpus || '-'}</td></tr>
<tr><td>Memory</td><td>${escapeHTML(formatBytes(host.memory_kib ? host.memory_kib * 1024 : 0) || '-')}</td></tr>
<tr><td>Disk</td><td>${escapeHTML(formatBytes(host.disk_actual_bytes) || '-')}</td></tr>
<tr><td>Autostart</td><td>${host.autostart ? 'Yes' : 'No'}</td></tr>
</table>
${guest ? `
<div class="vm-card-divider"></div>
<table class="vm-stats">
<tr><td>Gateway</td><td style="${guest.service_active ? 'color:var(--success)' : 'color:var(--error)'}">${guest.service_active ? 'Active' : 'Inactive'}</td></tr>
<tr><td>HTTP</td><td style="${guest.http_status === 200 ? 'color:var(--success)' : 'color:var(--error)'}">${guest.http_status || 'N/A'}</td></tr>
<tr><td>Version</td><td>${escapeHTML(guest.version || '-')}</td></tr>
<tr><td>Guest Mem</td><td>${guest.memory_percent !== undefined ? guest.memory_percent.toFixed(1) : '-'}%</td></tr>
<tr><td>Guest Disk</td><td>${guest.disk_percent !== undefined ? guest.disk_percent.toFixed(1) : '-'}%</td></tr>
<tr><td>Load</td><td>${guest.load_average !== undefined ? guest.load_average.toFixed(2) : '-'}</td></tr>
<tr><td>Uptime</td><td>${escapeHTML(guest.service_uptime || '-')}</td></tr>
</table>
` : ''}
${issues && Object.values(issues).some(Boolean) ? `
<div class="vm-card-divider"></div>
<div class="vm-issues-label">Issues</div>
<div class="vm-issues">
${Object.entries(issues).filter(([, value]) => value).map(([key]) => `
<span class="issue ${escapeHTML(key)}">${escapeHTML(key.replace(/_/g, ' '))}</span>
`).join('')}
</div>
` : ''}
</div>
`;
}
function renderServiceCard(svc) {
const role = svc.role || 'unknown';
switch (role) {
case 'llm-proxy': return renderLLMProxyCard(svc);
case 'db': return renderDBCard(svc);
case 'search': return renderSearchCard(svc);
case 'mcp': return renderMCPCard(svc);
case 'voice': return renderVoiceCard(svc);
case 'automation':return renderAutomationCard(svc);
default: return renderGenericServiceCard(svc);
}
}
function serviceCardHeader(svc) {
return `
<div class="service-card-header">
<div>
<div class="service-card-name">${escapeHTML(svc.name)}</div>
<div class="service-role-tag">${escapeHTML(svc.role || '')}</div>
</div>
<span class="service-badge ${escapeHTML(svc.status || 'down')}">${escapeHTML(svc.status || 'down')}</span>
</div>
`;
}
function serviceStatRow(label, value, valueClass) {
return `
<div class="service-stat-row">
<span class="service-stat-label">${escapeHTML(label)}</span>
<span class="service-stat-value${valueClass ? ' ' + valueClass : ''}">${value}</span>
</div>
`;
}
function formatUptime(sec) {
if (!sec) return '-';
if (sec < 60) return sec + 's';
if (sec < 3600) return Math.floor(sec / 60) + 'm';
if (sec < 86400) return Math.floor(sec / 3600) + 'h ' + Math.floor((sec % 3600) / 60) + 'm';
return Math.floor(sec / 86400) + 'd ' + Math.floor((sec % 86400) / 3600) + 'h';
}
function renderLLMProxyCard(svc) {
const extra = svc.extra || {};
const modelCount = extra.model_count;
const cooldowns = extra.cooldown_count || 0;
const httpStatus = svc.http_status;
const httpClass = httpStatus === 200 ? 'ok' : httpStatus ? 'bad' : '';
return `
<div class="service-card">
${serviceCardHeader(svc)}
<div style="display:flex;align-items:baseline;gap:0.5rem">
<span class="llm-model-count">${modelCount !== undefined ? modelCount : '-'}</span>
<span class="llm-model-label">models</span>
</div>
${cooldowns > 0 ? `<div class="llm-cooldown-banner">⚠ ${cooldowns} model${cooldowns > 1 ? 's' : ''} in cooldown</div>` : ''}
<div class="service-stats">
${serviceStatRow('HTTP', httpStatus ? String(httpStatus) : '-', httpClass)}
${serviceStatRow('Uptime', formatUptime(svc.uptime_sec), '')}
${serviceStatRow('Container', escapeHTML(svc.container_state || '-'), svc.container_state === 'running' ? 'ok' : 'bad')}
</div>
</div>
`;
}
function renderDBCard(svc) {
const healthClass = svc.health_state === 'healthy' ? 'ok' : svc.health_state === 'unhealthy' ? 'bad' : '';
return `
<div class="service-card">
${serviceCardHeader(svc)}
<div class="service-stats">
${serviceStatRow('Health', escapeHTML(svc.health_state || 'none'), healthClass)}
${serviceStatRow('Uptime', formatUptime(svc.uptime_sec), '')}
${serviceStatRow('Container', escapeHTML(svc.container_state || '-'), svc.container_state === 'running' ? 'ok' : 'bad')}
</div>
</div>
`;
}
function renderSearchCard(svc) {
const extra = svc.extra || {};
const ms = extra.response_ms;
const httpStatus = svc.http_status;
const httpClass = httpStatus === 200 ? 'ok' : httpStatus ? 'bad' : '';
return `
<div class="service-card">
${serviceCardHeader(svc)}
<div class="service-stats">
${serviceStatRow('HTTP', httpStatus ? String(httpStatus) : '-', httpClass)}
${ms !== undefined ? serviceStatRow('Response', ms + 'ms', ms < 500 ? 'ok' : 'warn') : ''}
${serviceStatRow('Uptime', formatUptime(svc.uptime_sec), '')}
</div>
</div>
`;
}
function renderMCPCard(svc) {
const extra = svc.extra || {};
const reachable = extra.port_reachable;
return `
<div class="service-card">
${serviceCardHeader(svc)}
<div class="service-stats">
${reachable !== undefined ? serviceStatRow('Port', reachable ? 'reachable' : 'unreachable', reachable ? 'ok' : 'bad') : ''}
${serviceStatRow('Container', escapeHTML(svc.container_state || '-'), svc.container_state === 'running' ? 'ok' : 'bad')}
${serviceStatRow('Uptime', formatUptime(svc.uptime_sec), '')}
</div>
</div>
`;
}
function renderVoiceCard(svc) {
const healthClass = svc.health_state === 'healthy' ? 'ok' : svc.health_state === 'unhealthy' ? 'bad' : '';
return `
<div class="service-card">
${serviceCardHeader(svc)}
<div class="service-stats">
${serviceStatRow('Health', escapeHTML(svc.health_state || 'none'), healthClass)}
${serviceStatRow('Container', escapeHTML(svc.container_state || '-'), svc.container_state === 'running' ? 'ok' : 'bad')}
${serviceStatRow('Uptime', formatUptime(svc.uptime_sec), '')}
</div>
</div>
`;
}
function renderAutomationCard(svc) {
const healthClass = svc.health_state === 'healthy' ? 'ok' : svc.health_state === 'unhealthy' ? 'bad' : '';
return `
<div class="service-card">
${serviceCardHeader(svc)}
<div class="service-stats">
${serviceStatRow('Health', escapeHTML(svc.health_state || 'none'), healthClass)}
${serviceStatRow('Container', escapeHTML(svc.container_state || '-'), svc.container_state === 'running' ? 'ok' : 'bad')}
${serviceStatRow('Uptime', formatUptime(svc.uptime_sec), '')}
</div>
</div>
`;
}
function renderGenericServiceCard(svc) {
return `
<div class="service-card">
${serviceCardHeader(svc)}
<div class="service-stats">
${serviceStatRow('Container', escapeHTML(svc.container_state || '-'), svc.container_state === 'running' ? 'ok' : 'bad')}
${serviceStatRow('Uptime', formatUptime(svc.uptime_sec), '')}
</div>
</div>
`;
}
```
**Step 8: Verify build**
Run: `cd /home/will/lab/agentmon && go build ./...`
Expected: no errors
**Step 9: Commit**
```bash
git add cmd/web-ui/static/app.js cmd/web-ui/static/index.html
git commit -m "feat: rename OpenClaw to Infrastructure page, add service cards"
```
---
### Task 8: End-to-end verification
**Step 1: Build all binaries**
Run: `cd /home/will/lab/agentmon && go build ./...`
Expected: no errors
**Step 2: Test docker label filtering manually**
Run: `docker ps -a --filter label=agentmon.monitor=true --format "table {{.Names}}\t{{.Labels}}\t{{.Status}}"`
Expected: lists every labeled swarm container, running or stopped (because of `-a`), with its labels and status
**Step 3: Test swarm-monitor dry run**
Run:
```bash
cd /home/will/lab/agentmon
NATS_URL=nats://localhost:4222 LITELLM_MASTER_KEY=$(source /home/will/lab/swarm/.env && echo $LITELLM_MASTER_KEY) \
go run ./cmd/swarm-monitor/ 2>&1 | head -20
```
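Expected: logs "swarm-monitor started", then either publishes events or logs emit errors. Note that `main` calls `qnats.NewPublisher` before the first poll and exits via `log.Fatalf` if that call returns an error. If `NewPublisher` fails when NATS is unreachable, no collection runs at all, so start NATS locally (or point `NATS_URL` at a reachable server) to exercise the collection phase.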
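When inspecting what the monitor publishes, it can help to check a captured message against the envelope that `emit` builds. The sketch below assumes the raw message bytes have already been obtained somehow (for example from a NATS subscriber, which this plan does not set up). `envelope` and `checkEnvelope` are hypothetical names; the field layout is taken from the `emit` function in Task 4.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"time"
)

// envelope mirrors the map built by emit() in cmd/swarm-monitor/main.go.
type envelope struct {
	Schema struct {
		Name    string `json:"name"`
		Version int    `json:"version"`
	} `json:"schema"`
	Event struct {
		ID   string `json:"id"`
		Type string `json:"type"`
		TS   string `json:"ts"`
	} `json:"event"`
	Payload json.RawMessage `json:"payload"`
}

// checkEnvelope (hypothetical) returns nil when raw looks like one of the two
// swarm events described in this plan, and a descriptive error otherwise.
func checkEnvelope(raw []byte) error {
	var e envelope
	if err := json.Unmarshal(raw, &e); err != nil {
		return fmt.Errorf("not valid JSON: %w", err)
	}
	switch e.Event.Type {
	case "swarm.snapshot", "swarm.service.snapshot":
	default:
		return fmt.Errorf("unexpected event type %q", e.Event.Type)
	}
	if e.Event.ID == "" {
		return errors.New("missing event id")
	}
	// emit() formats the timestamp with time.RFC3339Nano.
	if _, err := time.Parse(time.RFC3339Nano, e.Event.TS); err != nil {
		return fmt.Errorf("bad event ts: %w", err)
	}
	if len(e.Payload) == 0 {
		return errors.New("missing payload")
	}
	return nil
}
```

This only validates the outer envelope. The payload is left as raw JSON because its shape differs between `swarm.snapshot` (`services`, `issues`) and `swarm.service.snapshot` (`service`).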
**Step 4: Navigate to /infrastructure in browser**
Open the web UI and navigate to `/infrastructure`.
Verify:
- Nav shows "Infra" link, active when on `/infrastructure`
- VMs section shows existing openclaw cards
- Services section shows either cards (if swarm events exist in DB) or "No swarm service data"
**Step 5: Verify swarm strip on dashboard**
Navigate to `/`.
Verify:
- VM strip still shows (zap/orb/sun)
- Swarm strip renders below it (may be empty if no `swarm.snapshot` events in DB yet)
**Step 6: Final commit if any fixes needed**
```bash
git add -A
git commit -m "fix: infrastructure page and swarm strip polish"
```