2026-05-16 12:45:02 -07:00

Swarm Infrastructure

This document is the source-of-truth overview for Will's local swarm/agent infrastructure on the zap workstation. It focuses on the runtime services that support Atlas/Hermes, n8n automation, local model/search/voice tooling, Obsidian/RAG automation, and the new agentmon monitoring layer.

High-level topology

Telegram / Discord / Email
        |
        v
Hermes / Atlas gateway (default profile)
        |
        +--> local tools and specialist profiles
        +--> n8n automation workflows on :18808

n8n automation
        |
        +--> direct watchdog probes for key service ports
        +--> Agentmon Health Watchdog -> agentmon-query :8081
        +--> Obsidian, RAG, voice memo, URL capture, digest workflows

agentmon
        |
        +--> agentmon-swarm-monitor -> Docker labels agentmon.monitor=true
        +--> agentmon-openclaw-monitor -> OpenClaw VM snapshots
        +--> NATS JetStream -> event processor -> Postgres
        +--> query API / UI on :8081 / :8082

local AI/search/voice services
        |
        +--> LiteLLM :18804
        +--> SearXNG :18803
        +--> Brave MCP :18802
        +--> llama.cpp :18806
        +--> Ollama embeddings :18807
        +--> Kokoro TTS :18805
        +--> Whisper :18811

See also: swarm-infrastructure.html for a visual architecture diagram.

Runtime layers

1. Messaging and agent gateway

  • Hermes / Atlas default profile is the production messaging gateway.
  • Connected platforms include Telegram, Discord, and email.
  • Atlas uses local swarm services where suitable, especially search, local LLMs, embeddings, STT/TTS, n8n, and agentmon.
  • Specialist Hermes profiles are available for delegated work, but the default profile remains the stable production gateway.

2. n8n automation

Container/service:

  • n8n-agent
  • Host URL: http://127.0.0.1:18808
  • Container URL: http://127.0.0.1:5678
  • Compose file: /home/will/lab/swarm/docker-compose.yaml

Important workflow source exports live under:

  • swarm-common/n8n-workflows/

Current health/automation patterns:

  • Swarm Health Watchdog: direct endpoint checks for search, LLM, voice, n8n, Docker health, etc.
  • Agentmon Health Watchdog: polls agentmon aggregate snapshots and alerts on stale/degraded monitoring state.
  • RAG and Embedding Health Watchdog: checks RAG/search/embedding path.
  • Obsidian workflows: health/reindex, inbox triage, daily review, URL-to-note, chat summary capture, weekly decision/runbook extraction.

3. Agentmon monitoring layer

Repo:

  • /home/will/lab/agentmon

Compose services:

  • agentmon-ingest on :8080 — ingestion gateway, /healthz
  • agentmon-query on :8081 — query API, /healthz, /v1/events, /v1/stats/summary
  • agentmon-ui on :8082 — web UI, /healthz
  • agentmon-processor — NATS to Postgres event processor
  • agentmon-swarm-monitor — monitors Docker containers labeled agentmon.monitor=true
  • agentmon-openclaw-monitor — emits OpenClaw VM snapshots
  • agentmon-db — Postgres
  • agentmon-nats — NATS JetStream

Key health and query endpoints:

http://127.0.0.1:8080/healthz
http://127.0.0.1:8081/healthz
http://127.0.0.1:8082/healthz
http://127.0.0.1:8081/v1/stats/summary
http://127.0.0.1:8081/v1/events?event_type=swarm.snapshot&limit=1
http://127.0.0.1:8081/v1/events?event_type=swarm.service.snapshot&limit=20
http://127.0.0.1:8081/v1/events?event_type=openclaw.snapshot&limit=3

From inside n8n-agent, use the Docker bridge gateway:

http://172.19.0.1:8081/v1/events?event_type=swarm.snapshot&limit=1
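The host/container address split above can be captured in a small helper. This is only a sketch: the function names and the "host" / "container" context labels are hypothetical and not part of the swarm repo; the addresses and the query string are the ones listed in this section.

```shell
# Hypothetical helper: pick the agentmon-query base URL for the caller's context.
# "host" and "container" are labels invented for this sketch.
agentmon_base_url() {
  case "$1" in
    host)      echo "http://127.0.0.1:8081" ;;
    container) echo "http://172.19.0.1:8081" ;;  # Docker bridge gateway, as seen from n8n-agent
    *)         echo "unknown context: $1" >&2; return 1 ;;
  esac
}

# Build the latest-swarm-snapshot URL for the given context.
latest_snapshot_url() {
  _base=$(agentmon_base_url "$1") || return 1
  echo "${_base}/v1/events?event_type=swarm.snapshot&limit=1"
}
```

For example, `curl -fsS "$(latest_snapshot_url host)"` on the workstation and `wget -qO- "$(latest_snapshot_url container)"` inside n8n-agent would hit the same query API.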

4. Local AI, search, and voice services

Docker services:

  • litellm:18804, OpenAI-compatible LLM router
  • litellm-db — Postgres backing LiteLLM
  • searxng:18803, local metasearch
  • brave-search:18802, Brave Search MCP server
  • kokoro-tts:18805, local TTS
  • whisper-server:18811, local transcription
  • n8n-agent:18808, automation

Host/user services:

  • llama-server.service:18806, local llama.cpp OpenAI-compatible LLM
  • ollama.service:18807, embeddings API
  • docker-health-endpoint.service:18809, read-only container health for n8n
  • obsidian-reindex-endpoint.service:18810, Obsidian/RAG reindex trigger
  • url-content-extractor.service:18812, YouTube/PDF/web extraction
  • voice-memo-processor.service:18813, voice memo processing
  • rag-embedding-health.service:18814, RAG/embedding health wrapper

5. Obsidian and RAG

Vault:

  • /home/will/lab/swarm/swarm-common/obsidian-vault/will/will-shared-zap

Local REST API:

  • HTTP: 127.0.0.1:27123
  • HTTPS: 127.0.0.1:27124

RAG/vector store:

  • ChromaDB path: ~/.hermes/data/rag-search/chroma/
  • Embeddings backend: Ollama on :18807, normally nomic-embed-text

Monitoring model

The monitoring design is intentionally layered:

  1. n8n direct probes check critical service endpoints and send deduped alerts.
  2. agentmon continuously observes labeled Docker services and OpenClaw state, then writes snapshots through NATS/Postgres.
  3. n8n Agentmon Health Watchdog polls agentmon's aggregate state and alerts if the monitoring pipeline itself becomes stale/degraded.
  4. Hermes/Atlas can inspect both n8n and agentmon when troubleshooting, and can use the same endpoints as part of operational checks.

A single live process is therefore not a sufficient health signal. What matters is whether collection, ingestion, processing, storage, query, and alerting are all functioning end to end.

Agentmon Health Watchdog

Workflow source:

  • swarm-common/n8n-workflows/agentmon-health-watchdog.json

Installed n8n workflow:

  • Name: Agentmon Health Watchdog
  • ID: AgentmonHealthWatchdog
  • Schedule: every 5 minutes

Alert conditions:

  • agentmon-ingest, agentmon-query, or agentmon-ui /healthz fails.
  • Latest swarm.snapshot is missing.
  • Latest swarm.snapshot is older than 3 minutes.
  • Snapshot issues are non-empty.
  • Required agentmon services are missing or not healthy/running:
    • agentmon-ingest
    • agentmon-query
    • agentmon-ui
    • agentmon-processor
    • agentmon-swarm-monitor
    • agentmon-db
    • agentmon-nats
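Two of the conditions above (snapshot age and required services) reduce to simple checks. The sketch below is illustrative and is not the logic inside the n8n workflow: the function names are hypothetical, and it assumes the caller has already extracted the snapshot time as epoch seconds and the healthy service names as a space-separated list, since the snapshot JSON schema is not described in this document.

```shell
MAX_AGE_SECONDS=180   # "older than 3 minutes" from the alert conditions above

REQUIRED_SERVICES="agentmon-ingest agentmon-query agentmon-ui agentmon-processor agentmon-swarm-monitor agentmon-db agentmon-nats"

# snapshot_is_stale SNAPSHOT_EPOCH NOW_EPOCH
# Succeeds (exit 0) when the snapshot is strictly older than MAX_AGE_SECONDS.
snapshot_is_stale() {
  [ $(( $2 - $1 )) -gt "$MAX_AGE_SECONDS" ]
}

# missing_services "svc1 svc2 ..."
# Prints each required service that is absent from the given healthy list.
missing_services() {
  for _svc in $REQUIRED_SERVICES; do
    case " $1 " in
      *" $_svc "*) ;;          # present in the healthy list
      *) echo "$_svc" ;;       # required but not reported healthy
    esac
  done
}
```

A caller would alert if `snapshot_is_stale "$snapshot_epoch" "$(date +%s)"` succeeds or if `missing_services "$healthy"` prints anything.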

Deduplication:

  • Alert after 2 failed checks.
  • Reminder every 6 failed runs.
  • Recovery message when state returns healthy.
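The deduplication rules can be read as a small decision function. The following is one possible interpretation, not the workflow's actual code: it assumes the failure count is consecutive, that reminders are counted from the first alert (so they fire at failures 8, 14, 20, and so on), and that the caller tracks whether an alert has already been sent. `decide_action` is a hypothetical name.

```shell
# decide_action FAILURES ALERTED
#   FAILURES: consecutive failed checks including the current one (0 = healthy)
#   ALERTED:  1 if an alert was already sent for this incident, else 0
# Prints one of: none, alert, reminder, recovery
decide_action() {
  _failures=$1
  _alerted=$2
  if [ "$_failures" -eq 0 ]; then
    # Healthy again: announce recovery only if an alert went out earlier.
    if [ "$_alerted" -eq 1 ]; then echo recovery; else echo none; fi
  elif [ "$_failures" -eq 2 ]; then
    echo alert                                   # second failed check
  elif [ "$_failures" -gt 2 ] && [ $(( (_failures - 2) % 6 )) -eq 0 ]; then
    echo reminder                                # every 6 failed runs after the alert (assumed)
  else
    echo none
  fi
}
```

Under these assumptions a single transient failure produces no message, and a recovery message is only sent when an alert preceded it.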

Operational quick checks

From the host:

cd /home/will/lab/swarm
make status
make local-ai-health
curl -fsS http://127.0.0.1:18808/healthz
curl -fsS http://127.0.0.1:8081/healthz
curl -fsS 'http://127.0.0.1:8081/v1/events?event_type=swarm.snapshot&limit=1' | jq .
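To probe all three agentmon /healthz endpoints in one pass, a loop along these lines may be convenient. It is a sketch rather than an existing make target: the function names are hypothetical, while the ports and paths come from the agentmon service list in this document.

```shell
# Print one "name url" pair per line for the three agentmon /healthz endpoints.
agentmon_health_targets() {
  echo "agentmon-ingest http://127.0.0.1:8080/healthz"
  echo "agentmon-query http://127.0.0.1:8081/healthz"
  echo "agentmon-ui http://127.0.0.1:8082/healthz"
}

# Probe every target; print "ok NAME" or "FAIL NAME".
# Returns non-zero if any probe failed.
probe_agentmon_health() {
  _rc=0
  # A here-document (not a pipe) keeps the loop in the current shell,
  # so _rc is still set after the loop ends.
  while read -r _name _url; do
    if curl -fsS -m 5 -o /dev/null "$_url" </dev/null 2>/dev/null; then
      echo "ok $_name"
    else
      echo "FAIL $_name"
      _rc=1
    fi
  done <<EOF
$(agentmon_health_targets)
EOF
  return "$_rc"
}
```

Running `probe_agentmon_health` on the host prints one line per service. As noted elsewhere in this document, a passing /healthz only shows the process is up, so pair it with the snapshot query.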

From inside n8n-agent:

docker exec n8n-agent /bin/sh -lc '
  wget -qO- -T 5 http://172.19.0.1:8081/healthz
  wget -qO- -T 5 "http://172.19.0.1:8081/v1/events?event_type=swarm.snapshot&limit=1" | head -c 500
'

Verify n8n workflow activation:

docker exec -u node n8n-agent n8n export:workflow \
  --id=AgentmonHealthWatchdog \
  --output=/tmp/agentmon-export.json

docker cp n8n-agent:/tmp/agentmon-export.json /tmp/agentmon-export.json
jq '.[0] | {id,name,active,nodes:(.nodes|length)}' /tmp/agentmon-export.json

Notes and pitfalls

  • Do not commit .env, decrypted credentials, raw credential exports, or runtime DB files.
  • n8n workflow backups can contain sensitive operational data; keep timestamped raw backups untracked unless intentionally sanitized.
  • From host, use 127.0.0.1:<host-port>.
  • From n8n-agent, use 127.0.0.1:5678 for n8n itself and 172.19.0.1:<host-port> for host-published swarm services.
  • Agentmon /healthz only proves the web/API process is alive; pair it with snapshot freshness to prove the monitoring pipeline is flowing.
  • OpenClaw is intentionally dormant unless explicitly re-enabled, so by default do not alert when its VMs are shut off.