From 6a79e0e3367914cda3c7cd0818fc80f10e3dd524 Mon Sep 17 00:00:00 2001
From: William Valentin
Date: Sat, 16 May 2026 12:45:02 -0700
Subject: [PATCH] docs: document swarm infrastructure topology

---
 README.md                      |   8 ++
 docs/swarm-infrastructure.html | 113 ++++++++++++++++
 docs/swarm-infrastructure.md   | 228 +++++++++++++++++++++++++++++++++
 3 files changed, 349 insertions(+)
 create mode 100644 docs/swarm-infrastructure.html
 create mode 100644 docs/swarm-infrastructure.md

diff --git a/README.md b/README.md
index 48b9fa6..86cf32a 100644
--- a/README.md
+++ b/README.md
@@ -19,6 +19,7 @@ swarm/
 │   └── vm/ # VM provisioning role (local)
 ├── openclaw/ # Live mirror of guest ~/.openclaw/
 ├── docker-compose.yaml # LiteLLM + supporting services
+├── docs/ # Swarm/agentmon/n8n infrastructure docs + diagrams
 ├── litellm-config.yaml # LiteLLM static config
 ├── litellm-init-credentials.sh # Register API keys into LiteLLM DB
 ├── litellm-init-models.sh # Register models into LiteLLM DB (idempotent)
@@ -29,6 +30,13 @@ swarm/
 └── README.md # This file
 ```
 
+## Current swarm/service architecture
+
+For the current host-side AI/search/voice automation stack, n8n watchdogs, and agentmon monitoring layer, see:
+
+- [`docs/swarm-infrastructure.md`](docs/swarm-infrastructure.md) — operational overview and quick checks
+- [`docs/swarm-infrastructure.html`](docs/swarm-infrastructure.html) — dark SVG architecture diagram
+
 ## VM: zap
 
 | Property | Value |
diff --git a/docs/swarm-infrastructure.html b/docs/swarm-infrastructure.html
new file mode 100644
index 0000000..a5ca953
--- /dev/null
+++ b/docs/swarm-infrastructure.html
@@ -0,0 +1,113 @@
+
+
+
+
+
+ Will's Swarm Infrastructure
+
+
+
+
+

Will's Swarm Infrastructure

Atlas/Hermes gateway + n8n automation + agentmon monitoring + local AI/search/voice services
+
+ [Figure: SVG architecture diagram; text labels listed below]
+ Layers: Hermes gateway layer; n8n + agentmon observability; local swarm services
+ Telegram (DM/groups); Discord (#ops-alerts); Email (Gmail IMAP)
+ Atlas / Hermes: default profile gateway; tools • memory • specialists
+ n8n-agent: automation workflows; :18808 host / :5678 container
+ agentmon-query: aggregate snapshots/API; :8081 /v1/events
+ agentmon pipeline: ingest :8080; NATS JetStream; event processor; Postgres DB; web UI :8082; swarm.snapshot + openclaw.snapshot
+ LiteLLM: LLM router + DB; :18804
+ Search: SearXNG + Brave MCP; :18803 / :18802
+ Voice: Kokoro + Whisper; :18805 / :18811
+ Docker services: agentmon.monitor=true; swarm/service snapshots
+ OpenClaw VMs: currently dormant; openclaw.snapshot
+ Obsidian / RAG: :27123/:27124 + ChromaDB
+ host local AI: llama.cpp :18806; Ollama embed :18807
+ Legend: Gateway/Search/Voice; Automation/API; Data/AI stores; Event bus/pipeline; Monitoring flows
+
+

Monitoring model

  • n8n direct probes critical ports
  • agentmon aggregates Docker/OpenClaw snapshots
  • n8n polls agentmon for stale/degraded state
+

Operational endpoints

  • n8n: 127.0.0.1:18808
  • agentmon query/UI: 8081 / 8082
  • local LLM/embed: 18806 / 18807
+

Source paths

  • Swarm repo: ~/lab/swarm
  • Agentmon repo: ~/lab/agentmon
  • Workflows: swarm-common/n8n-workflows
+
+ +
+
+
diff --git a/docs/swarm-infrastructure.md b/docs/swarm-infrastructure.md
new file mode 100644
index 0000000..24826f7
--- /dev/null
+++ b/docs/swarm-infrastructure.md
@@ -0,0 +1,228 @@
+# Swarm Infrastructure
+
+This document is the source-of-truth overview for Will's local swarm/agent infrastructure on the `zap` workstation. It focuses on the runtime services that support Atlas/Hermes, n8n automation, local model/search/voice tooling, Obsidian/RAG automation, and the new agentmon monitoring layer.
+
+## High-level topology
+
+```text
+Telegram / Discord / Email
+        |
+        v
+Hermes / Atlas gateway (default profile)
+        |
+        +--> local tools and specialist profiles
+        +--> n8n automation workflows on :18808
+
+n8n automation
+  |
+  +--> direct watchdog probes for key service ports
+  +--> Agentmon Health Watchdog -> agentmon-query :8081
+  +--> Obsidian, RAG, voice memo, URL capture, digest workflows
+
+agentmon
+  |
+  +--> agentmon-swarm-monitor -> Docker labels agentmon.monitor=true
+  +--> agentmon-openclaw-monitor -> OpenClaw VM snapshots
+  +--> NATS JetStream -> event processor -> Postgres
+  +--> query API / UI on :8081 / :8082
+
+local AI/search/voice services
+  |
+  +--> LiteLLM :18804
+  +--> SearXNG :18803
+  +--> Brave MCP :18802
+  +--> llama.cpp :18806
+  +--> Ollama embeddings :18807
+  +--> Kokoro TTS :18805
+  +--> Whisper :18811
+```
+
+See also: [`swarm-infrastructure.html`](./swarm-infrastructure.html) for a visual architecture diagram.
+
+## Runtime layers
+
+### 1. Messaging and agent gateway
+
+- **Hermes / Atlas default profile** is the production messaging gateway.
+- Connected platforms include Telegram, Discord, and email.
+- Atlas uses local swarm services where suitable, especially search, local LLMs, embeddings, STT/TTS, n8n, and agentmon.
+- Specialist Hermes profiles are available for delegated work, but the default profile remains the stable production gateway.
+
+### 2. n8n automation
+
+Container/service:
+
+- `n8n-agent`
+- Host URL: `http://127.0.0.1:18808`
+- Container URL: `http://127.0.0.1:5678`
+- Compose project: `/home/will/lab/swarm/docker-compose.yaml`
+
+Important workflow source exports live under:
+
+- `swarm-common/n8n-workflows/`
+
+Current health/automation patterns:
+
+- **Swarm Health Watchdog**: direct endpoint checks for search, LLM, voice, n8n, Docker health, etc.
+- **Agentmon Health Watchdog**: polls agentmon aggregate snapshots and alerts on stale/degraded monitoring state.
+- **RAG and Embedding Health Watchdog**: checks RAG/search/embedding path.
+- Obsidian workflows: health/reindex, inbox triage, daily review, URL-to-note, chat summary capture, weekly decision/runbook extraction.
+
+### 3. Agentmon monitoring layer
+
+Repo:
+
+- `/home/will/lab/agentmon`
+
+Compose services:
+
+- `agentmon-ingest` on `:8080` — ingestion gateway, `/healthz`
+- `agentmon-query` on `:8081` — query API, `/healthz`, `/v1/events`, `/v1/stats/summary`
+- `agentmon-ui` on `:8082` — web UI, `/healthz`
+- `agentmon-processor` — NATS to Postgres event processor
+- `agentmon-swarm-monitor` — monitors Docker containers labeled `agentmon.monitor=true`
+- `agentmon-openclaw-monitor` — emits OpenClaw VM snapshots
+- `agentmon-db` — Postgres
+- `agentmon-nats` — NATS JetStream
+
+Key query endpoints:
+
+```text
+http://127.0.0.1:8080/healthz
+http://127.0.0.1:8081/healthz
+http://127.0.0.1:8082/healthz
+http://127.0.0.1:8081/v1/stats/summary
+http://127.0.0.1:8081/v1/events?event_type=swarm.snapshot&limit=1
+http://127.0.0.1:8081/v1/events?event_type=swarm.service.snapshot&limit=20
+http://127.0.0.1:8081/v1/events?event_type=openclaw.snapshot&limit=3
+```
+
+From inside `n8n-agent`, use the Docker bridge gateway:
+
+```text
+http://172.19.0.1:8081/v1/events?event_type=swarm.snapshot&limit=1
+```
+
+### 4. Local AI, search, and voice services
+
+Docker services:
+
+- `litellm` — `:18804`, OpenAI-compatible LLM router
+- `litellm-db` — Postgres backing LiteLLM
+- `searxng` — `:18803`, local metasearch
+- `brave-search` — `:18802`, Brave Search MCP server
+- `kokoro-tts` — `:18805`, local TTS
+- `whisper-server` — `:18811`, local transcription
+- `n8n-agent` — `:18808`, automation
+
+Host/user services:
+
+- `llama-server.service` — `:18806`, local llama.cpp OpenAI-compatible LLM
+- `ollama.service` — `:18807`, embeddings API
+- `docker-health-endpoint.service` — `:18809`, read-only container health for n8n
+- `obsidian-reindex-endpoint.service` — `:18810`, Obsidian/RAG reindex trigger
+- `url-content-extractor.service` — `:18812`, YouTube/PDF/web extraction
+- `voice-memo-processor.service` — `:18813`, voice memo processing
+- `rag-embedding-health.service` — `:18814`, RAG/embedding health wrapper
+
+### 5. Obsidian and RAG
+
+Vault:
+
+- `/home/will/lab/swarm/swarm-common/obsidian-vault/will/will-shared-zap`
+
+Local REST API:
+
+- HTTP: `127.0.0.1:27123`
+- HTTPS: `127.0.0.1:27124`
+
+RAG/vector store:
+
+- ChromaDB path: `~/.hermes/data/rag-search/chroma/`
+- Embeddings backend: Ollama on `:18807`, normally `nomic-embed-text`
+
+## Monitoring model
+
+The monitoring design is intentionally layered:
+
+1. **n8n direct probes** check critical service endpoints and send deduped alerts.
+2. **agentmon** continuously observes labeled Docker services and OpenClaw state, then writes snapshots through NATS/Postgres.
+3. **n8n Agentmon Health Watchdog** polls agentmon's aggregate state and alerts if the monitoring pipeline itself becomes stale/degraded.
+4. **Hermes/Atlas** can inspect both n8n and agentmon when troubleshooting, and can use the same endpoints as part of operational checks.
+
+This means a single process being alive is not enough: the important signal is whether collection, ingestion, processing, storage, query, and alerting are all functioning.
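The layered check described above can be sketched as a small function that treats `/healthz` results and snapshot freshness as separate signals. This is a minimal Python illustration, not the watchdog's actual logic: the function name `evaluate_pipeline`, the `timestamp` and `issues` field names, and the input shapes are hypothetical, and the real agentmon event schema may differ. The three-minute threshold mirrors the alert condition documented for the Agentmon Health Watchdog.

```python
from datetime import datetime, timedelta, timezone

# Threshold taken from the documented alert condition (snapshot older than 3 minutes).
MAX_SNAPSHOT_AGE = timedelta(minutes=3)


def evaluate_pipeline(healthz_ok, latest_snapshot, now=None):
    """Return a list of problems; an empty list means the pipeline looks healthy.

    healthz_ok: dict mapping service name -> bool (did /healthz respond OK?).
    latest_snapshot: dict with hypothetical keys "timestamp" (timezone-aware
        ISO 8601 string) and "issues" (list), or None if no snapshot exists.
    """
    now = now or datetime.now(timezone.utc)
    # A live process is necessary but not sufficient, so record failures and keep checking.
    problems = [
        f"{name} /healthz failed"
        for name, ok in sorted(healthz_ok.items())
        if not ok
    ]

    if latest_snapshot is None:
        problems.append("no swarm.snapshot found")
        return problems

    # Freshness shows that collection, ingestion, processing, and storage are flowing.
    taken_at = datetime.fromisoformat(latest_snapshot["timestamp"])
    age = now - taken_at
    if age > MAX_SNAPSHOT_AGE:
        problems.append(f"swarm.snapshot is stale ({int(age.total_seconds())}s old)")
    if latest_snapshot.get("issues"):
        problems.append("snapshot reports issues")
    return problems
```

Returning a list of problems rather than a single boolean keeps the alert text specific, so a stale snapshot can be told apart from a dead API process.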
+
+## Agentmon Health Watchdog
+
+Workflow source:
+
+- `swarm-common/n8n-workflows/agentmon-health-watchdog.json`
+
+Installed n8n workflow:
+
+- Name: `Agentmon Health Watchdog`
+- ID: `AgentmonHealthWatchdog`
+- Schedule: every 5 minutes
+
+Alert conditions:
+
+- `agentmon-ingest`, `agentmon-query`, or `agentmon-ui` `/healthz` fails.
+- Latest `swarm.snapshot` is missing.
+- Latest `swarm.snapshot` is older than 3 minutes.
+- Snapshot issues are non-empty.
+- Required agentmon services are missing or not healthy/running:
+  - `agentmon-ingest`
+  - `agentmon-query`
+  - `agentmon-ui`
+  - `agentmon-processor`
+  - `agentmon-swarm-monitor`
+  - `agentmon-db`
+  - `agentmon-nats`
+
+Deduplication:
+
+- Alert after 2 failed checks.
+- Reminder every 6 failed runs.
+- Recovery message when state returns healthy.
+
+## Operational quick checks
+
+From the host:
+
+```bash
+cd /home/will/lab/swarm
+make status
+make local-ai-health
+curl -fsS http://127.0.0.1:18808/healthz
+curl -fsS http://127.0.0.1:8081/healthz
+curl -fsS 'http://127.0.0.1:8081/v1/events?event_type=swarm.snapshot&limit=1' | jq .
+```
+
+From inside `n8n-agent`:
+
+```bash
+docker exec n8n-agent /bin/sh -lc '
+  wget -qO- -T 5 http://172.19.0.1:8081/healthz
+  wget -qO- -T 5 "http://172.19.0.1:8081/v1/events?event_type=swarm.snapshot&limit=1" | head -c 500
+'
+```
+
+Verify n8n workflow activation:
+
+```bash
+docker exec -u node n8n-agent n8n export:workflow \
+  --id=AgentmonHealthWatchdog \
+  --output=/tmp/agentmon-export.json
+
+docker cp n8n-agent:/tmp/agentmon-export.json /tmp/agentmon-export.json
+jq '.[0] | {id,name,active,nodes:(.nodes|length)}' /tmp/agentmon-export.json
+```
+
+## Notes and pitfalls
+
+- Do not commit `.env`, decrypted credentials, raw credential exports, or runtime DB files.
+- n8n workflow backups can contain sensitive operational data; keep timestamped raw backups untracked unless intentionally sanitized.
+- From host, use `127.0.0.1:`.
+- From `n8n-agent`, use `127.0.0.1:5678` for n8n itself and `172.19.0.1:` for host-published swarm services.
+- Agentmon `/healthz` only proves the web/API process is alive; pair it with snapshot freshness to prove the monitoring pipeline is flowing.
+- OpenClaw is intentionally dormant unless explicitly re-enabled; do not alert on VMs being shut off by default.
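The deduplication rules listed for the Agentmon Health Watchdog (alert after 2 failed checks, reminder every 6 failed runs, recovery message on return to healthy) could be modelled as a small state function. This is a hedged Python sketch under one reading of those rules: the name `next_action`, the constants, and the choice to count reminders from the first alert are all assumptions, and the n8n workflow itself may count differently.

```python
ALERT_AFTER = 2   # consecutive failed checks before the first alert
REMIND_EVERY = 6  # failed runs between reminders, counted from the first alert (assumed)


def next_action(consecutive_failures, healthy):
    """Decide what one watchdog run should send.

    consecutive_failures: failure count carried over from previous runs.
    healthy: result of the current run.
    Returns (action, new_failure_count), where action is one of
    "none", "alert", "reminder", or "recovery".
    """
    if healthy:
        # Announce recovery only if an alert had already gone out.
        action = "recovery" if consecutive_failures >= ALERT_AFTER else "none"
        return action, 0

    count = consecutive_failures + 1
    if count == ALERT_AFTER:
        return "alert", count
    if count > ALERT_AFTER and (count - ALERT_AFTER) % REMIND_EVERY == 0:
        return "reminder", count
    return "none", count
```

Carrying only a failure counter between runs keeps the persisted state small, and a single transient failure never produces an alert or a recovery message.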