William Valentin 6a79e0e336 docs: document swarm infrastructure topology

2026-05-16 12:45:02 -07:00

Swarm Infrastructure

This document is the source-of-truth overview for Will's local swarm/agent infrastructure on the zap workstation. It focuses on the runtime services that support Atlas/Hermes, n8n automation, local model/search/voice tooling, Obsidian/RAG automation, and the new agentmon monitoring layer.

High-level topology

Telegram / Discord / Email
        |
        v
Hermes / Atlas gateway (default profile)
        |
        +--> local tools and specialist profiles
        +--> n8n automation workflows on :18808

n8n automation
        |
        +--> direct watchdog probes for key service ports
        +--> Agentmon Health Watchdog -> agentmon-query :8081
        +--> Obsidian, RAG, voice memo, URL capture, digest workflows

agentmon
        |
        +--> agentmon-swarm-monitor -> Docker labels agentmon.monitor=true
        +--> agentmon-openclaw-monitor -> OpenClaw VM snapshots
        +--> NATS JetStream -> event processor -> Postgres
        +--> query API / UI on :8081 / :8082

local AI/search/voice services
        |
        +--> LiteLLM :18804
        +--> SearXNG :18803
        +--> Brave MCP :18802
        +--> llama.cpp :18806
        +--> Ollama embeddings :18807
        +--> Kokoro TTS :18805
        +--> Whisper :18811

See also: swarm-infrastructure.html for a visual architecture diagram.

Runtime layers

1. Messaging and agent gateway

The Hermes / Atlas default profile is the production messaging gateway.
Connected platforms include Telegram, Discord, and email.
Atlas uses local swarm services where suitable, especially search, local LLMs, embeddings, STT/TTS, n8n, and agentmon.
Specialist Hermes profiles are available for delegated work, but the default profile remains the stable production gateway.

2. n8n automation

Container/service:

n8n-agent
Host URL: http://127.0.0.1:18808
Container URL: http://127.0.0.1:5678
Compose file: /home/will/lab/swarm/docker-compose.yaml

Important workflow source exports live under:

swarm-common/n8n-workflows/

Current health/automation patterns:

Swarm Health Watchdog: direct endpoint checks for search, LLM, voice, n8n, Docker health, etc.
Agentmon Health Watchdog: polls agentmon aggregate snapshots and alerts on stale/degraded monitoring state.
RAG and Embedding Health Watchdog: checks RAG/search/embedding path.
Obsidian workflows: health/reindex, inbox triage, daily review, URL-to-note, chat summary capture, weekly decision/runbook extraction.

3. Agentmon monitoring layer

Repo:

/home/will/lab/agentmon

Compose services:

agentmon-ingest on :8080 — ingestion gateway, /healthz
agentmon-query on :8081 — query API, /healthz, /v1/events, /v1/stats/summary
agentmon-ui on :8082 — web UI, /healthz
agentmon-processor — NATS to Postgres event processor
agentmon-swarm-monitor — monitors Docker containers labeled agentmon.monitor=true
agentmon-openclaw-monitor — emits OpenClaw VM snapshots
agentmon-db — Postgres
agentmon-nats — NATS JetStream

Key query endpoints:

http://127.0.0.1:8080/healthz
http://127.0.0.1:8081/healthz
http://127.0.0.1:8082/healthz
http://127.0.0.1:8081/v1/stats/summary
http://127.0.0.1:8081/v1/events?event_type=swarm.snapshot&limit=1
http://127.0.0.1:8081/v1/events?event_type=swarm.service.snapshot&limit=20
http://127.0.0.1:8081/v1/events?event_type=openclaw.snapshot&limit=3

From inside n8n-agent, 127.0.0.1 is the container itself, so reach host-published services through the Docker bridge gateway:

http://172.19.0.1:8081/v1/events?event_type=swarm.snapshot&limit=1
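Here is a minimal sketch of the host-versus-container addressing rule. The Python helpers below build the same URLs listed above. The function names (service_url, n8n_url) are hypothetical; only the addresses and ports come from this document. The 172.19.0.1 address also holds only as long as the swarm's Docker network keeps that gateway.

```python
from urllib.parse import urlencode

# Addresses from this document; the helper names are hypothetical.
HOST_BASE = "http://127.0.0.1"        # host view of published ports
N8N_GATEWAY = "http://172.19.0.1"     # the host as seen from inside n8n-agent


def service_url(port: int, path: str = "/", *, from_n8n: bool = False, **query) -> str:
    """Build a URL for a host-published swarm service such as agentmon-query.

    Inside n8n-agent, 127.0.0.1 is the container itself, so host-published
    ports are reached through the Docker bridge gateway instead.
    """
    base = N8N_GATEWAY if from_n8n else HOST_BASE
    url = f"{base}:{port}{path}"
    if query:
        # urlencode keeps keyword order, e.g. event_type=...&limit=...
        url += "?" + urlencode(query)
    return url


def n8n_url(path: str = "/healthz", *, from_n8n: bool = False) -> str:
    """n8n itself is 127.0.0.1:5678 inside its container and :18808 on the host."""
    port = 5678 if from_n8n else 18808
    return f"{HOST_BASE}:{port}{path}"
```

For example, service_url(8081, "/v1/events", from_n8n=True, event_type="swarm.snapshot", limit=1) yields the in-container URL shown above.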

4. Local AI, search, and voice services

Docker services:

litellm — :18804, OpenAI-compatible LLM router
litellm-db — Postgres backing LiteLLM
searxng — :18803, local metasearch
brave-search — :18802, Brave Search MCP server
kokoro-tts — :18805, local TTS
whisper-server — :18811, local transcription
n8n-agent — :18808, automation

Host/user services:

llama-server.service — :18806, local llama.cpp OpenAI-compatible LLM
ollama.service — :18807, embeddings API
docker-health-endpoint.service — :18809, read-only container health for n8n
obsidian-reindex-endpoint.service — :18810, Obsidian/RAG reindex trigger
url-content-extractor.service — :18812, YouTube/PDF/web extraction
voice-memo-processor.service — :18813, voice memo processing
rag-embedding-health.service — :18814, RAG/embedding health wrapper
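A plain TCP connect probe gives a quick view of which of these ports are listening. This is a sketch, not one of the installed watchdogs. The LOCAL_SERVICES map and the port_open/probe_all names are illustrative. An open port only shows that something is listening, not that the service answers correctly.

```python
import socket

# Host ports as listed above; this mapping is an illustrative convenience.
LOCAL_SERVICES = {
    "brave-search": 18802,
    "searxng": 18803,
    "litellm": 18804,
    "kokoro-tts": 18805,
    "llama-server": 18806,
    "ollama": 18807,
    "n8n-agent": 18808,
    "docker-health-endpoint": 18809,
    "obsidian-reindex-endpoint": 18810,
    "whisper-server": 18811,
    "url-content-extractor": 18812,
    "voice-memo-processor": 18813,
    "rag-embedding-health": 18814,
}


def port_open(port: int, host: str = "127.0.0.1", timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers connection refused and timeouts
        return False


def probe_all(services=None, host: str = "127.0.0.1", timeout: float = 2.0) -> dict:
    """Map each service name to whether its port accepts TCP connections."""
    services = LOCAL_SERVICES if services is None else services
    return {name: port_open(port, host, timeout) for name, port in services.items()}
```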

5. Obsidian and RAG

Vault:

/home/will/lab/swarm/swarm-common/obsidian-vault/will/will-shared-zap

Local REST API:

HTTP: 127.0.0.1:27123
HTTPS: 127.0.0.1:27124

RAG/vector store:

ChromaDB path: ~/.hermes/data/rag-search/chroma/
Embeddings backend: Ollama on :18807, normally with the nomic-embed-text model
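A request like the following exercises the embedding path by hand. This sketch assumes a recent Ollama release that exposes the /api/embed endpoint; older releases only have /api/embeddings with a prompt field. It is not necessarily how the RAG tooling itself calls Ollama, and build_embed_request/embed are hypothetical names.

```python
import json
import urllib.request

# Assumption: Ollama's /api/embed endpoint on this host's Ollama port.
OLLAMA_URL = "http://127.0.0.1:18807"


def build_embed_request(texts, model="nomic-embed-text", base_url=OLLAMA_URL):
    """Build a POST request asking Ollama to embed one or more strings."""
    if isinstance(texts, str):
        texts = [texts]  # avoid splitting a single string into characters
    payload = {"model": model, "input": list(texts)}
    return urllib.request.Request(
        f"{base_url}/api/embed",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def embed(texts, timeout=30, **kwargs):
    """Send the request and return one embedding vector per input string."""
    req = build_embed_request(texts, **kwargs)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read())["embeddings"]
```

When Ollama is up and the model has been pulled, embed(["test sentence"]) should return a list containing one vector.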

Monitoring model

The monitoring design is intentionally layered:

n8n direct probes check critical service endpoints and send deduped alerts.
agentmon continuously observes labeled Docker services and OpenClaw state, then writes snapshots through NATS/Postgres.
n8n Agentmon Health Watchdog polls agentmon's aggregate state and alerts if the monitoring pipeline itself becomes stale/degraded.
Hermes/Atlas can inspect both n8n and agentmon when troubleshooting, and can use the same endpoints as part of operational checks.

This means a single process being alive is not enough: the important signal is whether collection, ingestion, processing, storage, query, and alerting are all functioning.

Agentmon Health Watchdog

Workflow source:

swarm-common/n8n-workflows/agentmon-health-watchdog.json

Installed n8n workflow:

Name: Agentmon Health Watchdog
ID: AgentmonHealthWatchdog
Schedule: every 5 minutes

Alert conditions:

The /healthz check for agentmon-ingest, agentmon-query, or agentmon-ui fails.
Latest swarm.snapshot is missing.
Latest swarm.snapshot is older than 3 minutes.
The latest snapshot reports a non-empty issues list.
Required agentmon services are missing or not healthy/running:
- agentmon-ingest
- agentmon-query
- agentmon-ui
- agentmon-processor
- agentmon-swarm-monitor
- agentmon-db
- agentmon-nats
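The alert conditions above can be expressed as one evaluation function. This is a minimal sketch, assuming the latest swarm.snapshot has already been fetched and decoded into a dict. The field names timestamp, issues, services, name, state, and health are hypothetical placeholders for agentmon's real schema. "Healthy/running" is read here as state == "running", plus health == "healthy" when a health value is present.

```python
from datetime import datetime, timedelta, timezone

REQUIRED_SERVICES = [
    "agentmon-ingest", "agentmon-query", "agentmon-ui", "agentmon-processor",
    "agentmon-swarm-monitor", "agentmon-db", "agentmon-nats",
]
MAX_SNAPSHOT_AGE = timedelta(minutes=3)


def evaluate(healthz_ok, snapshot, now=None):
    """Return a list of alert reasons; an empty list means healthy.

    healthz_ok: mapping of service name -> bool for the /healthz probes.
    snapshot: latest swarm.snapshot as a dict, or None if missing. The field
    names used below are hypothetical, and timestamp is assumed to be ISO 8601
    with a timezone (a trailing "Z" is accepted).
    """
    if now is None:
        now = datetime.now(timezone.utc)
    reasons = [f"{name} /healthz failed" for name, ok in healthz_ok.items() if not ok]
    if snapshot is None:
        reasons.append("latest swarm.snapshot is missing")
        return reasons
    ts = datetime.fromisoformat(snapshot["timestamp"].replace("Z", "+00:00"))
    age = now - ts
    if age > MAX_SNAPSHOT_AGE:
        reasons.append(f"latest swarm.snapshot is stale ({int(age.total_seconds())}s old)")
    if snapshot.get("issues"):
        reasons.append(f"snapshot reports {len(snapshot['issues'])} issue(s)")
    by_name = {s.get("name"): s for s in snapshot.get("services", [])}
    for name in REQUIRED_SERVICES:
        svc = by_name.get(name)
        if svc is None:
            reasons.append(f"{name} missing from snapshot")
        elif svc.get("state") != "running" or svc.get("health") not in (None, "healthy"):
            reasons.append(f"{name} not healthy/running")
    return reasons
```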

Deduplication:

Alert after 2 failed checks.
Reminder every 6 failed runs.
Recovery message when state returns healthy.
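The deduplication rules can be modeled as a small per-check state machine. The sketch below assumes failures are counted consecutively and that reminders are counted from the first alert (failed runs 8, 14, 20, ...). The installed workflow may count differently. In n8n the counter would have to persist between runs, for example in workflow static data. DedupState and step are illustrative names.

```python
from dataclasses import dataclass

ALERT_AFTER = 2    # consecutive failed checks before the first alert
REMIND_EVERY = 6   # failed runs between reminders (assumed: counted from the first alert)


@dataclass
class DedupState:
    failures: int = 0      # consecutive failed runs so far
    alerted: bool = False  # whether an alert has been sent for this outage


def step(state: DedupState, healthy: bool):
    """Advance the state by one watchdog run; return "alert", "reminder", "recovery", or None."""
    if healthy:
        was_alerted = state.alerted
        state.failures, state.alerted = 0, False
        # Only announce recovery if an alert actually went out.
        return "recovery" if was_alerted else None
    state.failures += 1
    if state.failures == ALERT_AFTER:
        state.alerted = True
        return "alert"
    if state.alerted and (state.failures - ALERT_AFTER) % REMIND_EVERY == 0:
        return "reminder"
    return None
```

With this design, a single transient failure never produces a message, and a recovery message is sent only for outages that were actually announced.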

Operational quick checks

From the host:

cd /home/will/lab/swarm
make status
make local-ai-health
curl -fsS http://127.0.0.1:18808/healthz
curl -fsS http://127.0.0.1:8081/healthz
curl -fsS 'http://127.0.0.1:8081/v1/events?event_type=swarm.snapshot&limit=1' | jq .

From inside n8n-agent:

docker exec n8n-agent /bin/sh -lc '
  wget -qO- -T 5 http://172.19.0.1:8081/healthz
  wget -qO- -T 5 "http://172.19.0.1:8081/v1/events?event_type=swarm.snapshot&limit=1" | head -c 500
'

Verify n8n workflow activation:

docker exec -u node n8n-agent n8n export:workflow \
  --id=AgentmonHealthWatchdog \
  --output=/tmp/agentmon-export.json

docker cp n8n-agent:/tmp/agentmon-export.json /tmp/agentmon-export.json
jq '.[0] | {id,name,active,nodes:(.nodes|length)}' /tmp/agentmon-export.json
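If jq is not available, Python can produce the same summary from the exported file. This sketch mirrors the jq filter above: take the first array element, then report id, name, active flag, and node count. The names summarize and load_and_summarize are hypothetical. The sketch also accepts a bare object in case an export is not wrapped in an array.

```python
import json


def summarize(data):
    """Return id, name, active flag, and node count, like the jq filter above."""
    # The jq filter indexes .[0], i.e. it expects a JSON array of workflows;
    # a bare workflow object is accepted too.
    wf = data[0] if isinstance(data, list) else data
    return {
        "id": wf.get("id"),
        "name": wf.get("name"),
        "active": wf.get("active"),
        "nodes": len(wf.get("nodes", [])),
    }


def load_and_summarize(path="/tmp/agentmon-export.json"):
    """Read an n8n export file and summarize its first workflow."""
    with open(path, encoding="utf-8") as fh:
        return summarize(json.load(fh))
```

For an activated workflow, load_and_summarize()["active"] should be True.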

Notes and pitfalls

Do not commit .env, decrypted credentials, raw credential exports, or runtime DB files.
n8n workflow backups can contain sensitive operational data; keep timestamped raw backups untracked unless intentionally sanitized.
From the host, use 127.0.0.1:<host-port>.
From n8n-agent, use 127.0.0.1:5678 for n8n itself and 172.19.0.1:<host-port> for host-published swarm services.
Agentmon /healthz only proves the web/API process is alive; pair it with snapshot freshness to prove the monitoring pipeline is flowing.
OpenClaw is intentionally dormant unless explicitly re-enabled, so by default do not alert on its VMs being shut off.
