# Swarm Infrastructure

This document is the source-of-truth overview for Will's local swarm/agent infrastructure on the `zap` workstation. It focuses on the runtime services that support Atlas/Hermes, n8n automation, local model/search/voice tooling, Obsidian/RAG automation, and the new agentmon monitoring layer.

## High-level topology

```text
Telegram / Discord / Email
        |
        v
Hermes / Atlas gateway (default profile)
        |
        +--> local tools and specialist profiles
        +--> n8n automation workflows on :18808

n8n automation
        |
        +--> direct watchdog probes for key service ports
        +--> Agentmon Health Watchdog -> agentmon-query :8081
        +--> Obsidian, RAG, voice memo, URL capture, digest workflows

agentmon
        |
        +--> agentmon-swarm-monitor -> Docker labels agentmon.monitor=true
        +--> agentmon-openclaw-monitor -> OpenClaw VM snapshots
        +--> NATS JetStream -> event processor -> Postgres
        +--> query API / UI on :8081 / :8082

local AI/search/voice services
        |
        +--> LiteLLM :18804
        +--> SearXNG :18803
        +--> Brave MCP :18802
        +--> llama.cpp :18806
        +--> Ollama embeddings :18807
        +--> Kokoro TTS :18805
        +--> Whisper :18811
```

See also:

- [`swarm-infrastructure.html`](./swarm-infrastructure.html) — visual architecture diagram
- [`diagram-maintenance.md`](./diagram-maintenance.md) — how to keep diagrams updated and when to create new ones

## Runtime layers

### 1. Messaging and agent gateway

- **Hermes / Atlas default profile** is the production messaging gateway.
- Connected platforms include Telegram, Discord, and email.
- Atlas uses local swarm services where suitable, especially search, local LLMs, embeddings, STT/TTS, n8n, and agentmon.
- Specialist Hermes profiles are available for delegated work, but the default profile remains the stable production gateway.

### 2. n8n automation

Container/service:

- `n8n-agent`
- Host URL: `http://127.0.0.1:18808`
- Container URL: `http://127.0.0.1:5678`
- Compose project: `/home/will/lab/swarm/docker-compose.yaml`

Important workflow source exports live under:

- `swarm-common/n8n-workflows/`

Current health/automation patterns:

- **Swarm Health Watchdog**: direct endpoint checks for search, LLM, voice, n8n, Docker health, etc.
- **Agentmon Health Watchdog**: polls agentmon aggregate snapshots and alerts on stale/degraded monitoring state.
- **RAG and Embedding Health Watchdog**: checks RAG/search/embedding path.
- Obsidian workflows: health/reindex, inbox triage, daily review, URL-to-note, chat summary capture, weekly decision/runbook extraction.

### 3. Agentmon monitoring layer

Repo:

- `/home/will/lab/agentmon`

Compose services:

- `agentmon-ingest` on `:8080` — ingestion gateway, `/healthz`
- `agentmon-query` on `:8081` — query API, `/healthz`, `/v1/events`, `/v1/stats/summary`
- `agentmon-ui` on `:8082` — web UI, `/healthz`
- `agentmon-processor` — NATS to Postgres event processor
- `agentmon-swarm-monitor` — monitors Docker containers labeled `agentmon.monitor=true`
- `agentmon-openclaw-monitor` — emits OpenClaw VM snapshots
- `agentmon-db` — Postgres
- `agentmon-nats` — NATS JetStream

Key query endpoints:

```text
http://127.0.0.1:8080/healthz
http://127.0.0.1:8081/healthz
http://127.0.0.1:8082/healthz
http://127.0.0.1:8081/v1/stats/summary
http://127.0.0.1:8081/v1/events?event_type=swarm.snapshot&limit=1
http://127.0.0.1:8081/v1/events?event_type=swarm.service.snapshot&limit=20
http://127.0.0.1:8081/v1/events?event_type=openclaw.snapshot&limit=3
```

From inside `n8n-agent`, use the Docker bridge gateway:

```text
http://172.19.0.1:8081/v1/events?event_type=swarm.snapshot&limit=1
```

### 4. Local AI, search, and voice services

Docker services:

- `litellm` — `:18804`, OpenAI-compatible LLM router
- `litellm-db` — Postgres backing LiteLLM
- `searxng` — `:18803`, local metasearch
- `brave-search` — `:18802`, Brave Search MCP server
- `kokoro-tts` — `:18805`, local TTS
- `whisper-server` — `:18811`, local transcription
- `n8n-agent` — `:18808`, automation

Host/user services:

- `llama-server.service` — `:18806`, local llama.cpp OpenAI-compatible LLM
- `ollama.service` — `:18807`, embeddings API
- `docker-health-endpoint.service` — `:18809`, read-only container health for n8n
- `obsidian-reindex-endpoint.service` — `:18810`, Obsidian/RAG reindex trigger
- `url-content-extractor.service` — `:18812`, YouTube/PDF/web extraction
- `voice-memo-processor.service` — `:18813`, voice memo processing
- `rag-embedding-health.service` — `:18814`, RAG/embedding health wrapper

### 5. Obsidian and RAG

Vault:

- `/home/will/lab/swarm/swarm-common/obsidian-vault/will/will-shared-zap`

Local REST API:

- HTTP: `127.0.0.1:27123`
- HTTPS: `127.0.0.1:27124`

RAG/vector store:

- ChromaDB path: `~/.hermes/data/rag-search/chroma/`
- Embeddings backend: Ollama on `:18807`, normally `nomic-embed-text`

## Monitoring model

The monitoring design is intentionally layered:

1. **n8n direct probes** check critical service endpoints and send deduped alerts.
2. **agentmon** continuously observes labeled Docker services and OpenClaw state, then writes snapshots through NATS/Postgres.
3. **n8n Agentmon Health Watchdog** polls agentmon's aggregate state and alerts if the monitoring pipeline itself becomes stale/degraded.
4. **Hermes/Atlas** can inspect both n8n and agentmon when troubleshooting, and can use the same endpoints as part of operational checks.

This means a single process being alive is not enough: the important signal is whether collection, ingestion, processing, storage, query, and alerting are all functioning.

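
To make the "alive is not enough" point concrete, here is a minimal Python sketch of how liveness, snapshot freshness, and per-service state could be combined into one verdict. It is an illustration only, not the watchdog's actual implementation. The snapshot field names (`observed_at`, `issues`, `services`) and the function name `evaluate_pipeline` are hypothetical, since this document does not describe the agentmon event schema. The 3-minute threshold and the required service list are taken from the Agentmon Health Watchdog alert conditions in this document.

```python
from datetime import datetime, timedelta, timezone

# Values taken from the Agentmon Health Watchdog alert conditions.
REQUIRED_SERVICES = (
    "agentmon-ingest",
    "agentmon-query",
    "agentmon-ui",
    "agentmon-processor",
    "agentmon-swarm-monitor",
    "agentmon-db",
    "agentmon-nats",
)
MAX_SNAPSHOT_AGE = timedelta(minutes=3)
HEALTHY_STATES = {"healthy", "running"}


def evaluate_pipeline(healthz_ok, snapshot, now):
    """Return a list of problems; an empty list means the pipeline looks healthy.

    healthz_ok: dict of service name -> bool (did /healthz respond successfully?)
    snapshot:   dict or None. The keys used here ("observed_at", "issues",
                "services") are ASSUMED for illustration, not the real schema.
    now:        timezone-aware datetime used as the reference time.
    """
    problems = []

    # Liveness: each web/API process answers /healthz.
    for name, ok in sorted(healthz_ok.items()):
        if not ok:
            problems.append(f"{name} /healthz failed")

    # Flow: a snapshot must exist at all.
    if snapshot is None:
        problems.append("latest swarm.snapshot is missing")
        return problems

    # Freshness: a live API serving an old snapshot means the pipeline stalled.
    age = now - snapshot["observed_at"]
    if age > MAX_SNAPSHOT_AGE:
        problems.append(f"swarm.snapshot is stale ({int(age.total_seconds())}s old)")

    # Content: issues reported inside the snapshot itself.
    for issue in snapshot.get("issues", []):
        problems.append(f"snapshot issue: {issue}")

    # Coverage: every required service is present and in a good state.
    services = snapshot.get("services", {})
    for name in REQUIRED_SERVICES:
        state = services.get(name)
        if state is None:
            problems.append(f"{name} missing from snapshot")
        elif state not in HEALTHY_STATES:
            problems.append(f"{name} is {state}")

    return problems
```

A caller would fetch the three `/healthz` results and the latest `swarm.snapshot` event, map them into these arguments, and treat any non-empty result as a failed check. Keeping the evaluation a pure function makes it easy to test without a running stack.
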
## Agentmon Health Watchdog

Workflow source:

- `swarm-common/n8n-workflows/agentmon-health-watchdog.json`

Installed n8n workflow:

- Name: `Agentmon Health Watchdog`
- ID: `AgentmonHealthWatchdog`
- Schedule: every 5 minutes

Alert conditions:

- `agentmon-ingest`, `agentmon-query`, or `agentmon-ui` `/healthz` fails.
- Latest `swarm.snapshot` is missing.
- Latest `swarm.snapshot` is older than 3 minutes.
- Snapshot issues are non-empty.
- Required agentmon services are missing or not healthy/running:
  - `agentmon-ingest`
  - `agentmon-query`
  - `agentmon-ui`
  - `agentmon-processor`
  - `agentmon-swarm-monitor`
  - `agentmon-db`
  - `agentmon-nats`

Deduplication:

- Alert after 2 failed checks.
- Reminder every 6 failed runs.
- Recovery message when state returns healthy.

## Operational quick checks

From the host:

```bash
cd /home/will/lab/swarm
make status
make local-ai-health
curl -fsS http://127.0.0.1:18808/healthz
curl -fsS http://127.0.0.1:8081/healthz
curl -fsS 'http://127.0.0.1:8081/v1/events?event_type=swarm.snapshot&limit=1' | jq .
```

From inside `n8n-agent`:

```bash
docker exec n8n-agent /bin/sh -lc '
wget -qO- -T 5 http://172.19.0.1:8081/healthz
wget -qO- -T 5 "http://172.19.0.1:8081/v1/events?event_type=swarm.snapshot&limit=1" | head -c 500
'
```

Verify n8n workflow activation:

```bash
docker exec -u node n8n-agent n8n export:workflow \
  --id=AgentmonHealthWatchdog \
  --output=/tmp/agentmon-export.json
docker cp n8n-agent:/tmp/agentmon-export.json /tmp/agentmon-export.json
jq '.[0] | {id,name,active,nodes:(.nodes|length)}' /tmp/agentmon-export.json
```

## Notes and pitfalls

- Do not commit `.env`, decrypted credentials, raw credential exports, or runtime DB files.
- n8n workflow backups can contain sensitive operational data; keep timestamped raw backups untracked unless intentionally sanitized.
- From host, use `127.0.0.1:`.
- From `n8n-agent`, use `127.0.0.1:5678` for n8n itself and `172.19.0.1:` for host-published swarm services.
- Agentmon `/healthz` only proves the web/API process is alive; pair it with snapshot freshness to prove the monitoring pipeline is flowing.
- OpenClaw is intentionally dormant unless explicitly re-enabled; do not alert on VMs being shut off by default.
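
The deduplication rules listed under Agentmon Health Watchdog (alert after 2 failed checks, reminder every 6 failed runs, recovery message on return to healthy) amount to a small state machine. The Python sketch below is one possible reading of those rules, not the workflow's actual code. The names `WatchdogState` and `next_action` are hypothetical, and it assumes that reminders are counted from the first alert and that "failed checks" means consecutive failures.

```python
from dataclasses import dataclass

ALERT_AFTER = 2   # consecutive failed checks before the first alert
REMIND_EVERY = 6  # failed runs between reminders (ASSUMED: counted from the alert)


@dataclass
class WatchdogState:
    """State persisted between scheduled runs (hypothetical shape)."""
    consecutive_failures: int = 0
    alerted: bool = False


def next_action(state, healthy):
    """Update state in place and return "alert", "reminder", "recovery", or None."""
    if healthy:
        was_alerted = state.alerted
        state.consecutive_failures = 0
        state.alerted = False
        # Only announce recovery if an alert was actually sent earlier.
        return "recovery" if was_alerted else None

    state.consecutive_failures += 1

    if not state.alerted:
        # A single failed run stays silent; the second one raises the alert.
        if state.consecutive_failures >= ALERT_AFTER:
            state.alerted = True
            return "alert"
        return None

    # Already alerted: repeat only every REMIND_EVERY further failed runs.
    if (state.consecutive_failures - ALERT_AFTER) % REMIND_EVERY == 0:
        return "reminder"
    return None
```

With the 5-minute schedule, this reading sends the first alert about 10 minutes into an outage and a reminder every 30 minutes after that. A single transient failure followed by a healthy run produces no message at all.
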