
# Swarm Infrastructure

This document is the source-of-truth overview for Will's local swarm/agent infrastructure on the `zap` workstation. It focuses on the runtime services that support Atlas/Hermes, n8n automation, local model/search/voice tooling, Obsidian/RAG automation, and the new agentmon monitoring layer.

## High-level topology

```text
Telegram / Discord / Email
        |
        v
Hermes / Atlas gateway (default profile)
        |
        +--> local tools and specialist profiles
        +--> n8n automation workflows on :18808

n8n automation
        |
        +--> direct watchdog probes for key service ports
        +--> Agentmon Health Watchdog -> agentmon-query :8081
        +--> Obsidian, RAG, voice memo, URL capture, digest workflows

agentmon
        |
        +--> agentmon-swarm-monitor -> Docker labels agentmon.monitor=true
        +--> agentmon-openclaw-monitor -> OpenClaw VM snapshots
        +--> NATS JetStream -> event processor -> Postgres
        +--> query API / UI on :8081 / :8082

local AI/search/voice services
        |
        +--> LiteLLM :18804
        +--> SearXNG :18803
        +--> Brave MCP :18802
        +--> llama.cpp :18806
        +--> Ollama embeddings :18807 (legacy/CPU fallback)
        +--> OpenVINO NPU embeddings :18817
        +--> Kokoro TTS :18805
        +--> Whisper NPU :18816
```

See also:

- [`swarm-infrastructure.html`](./swarm-infrastructure.html) — visual architecture diagram
- [`diagram-maintenance.md`](./diagram-maintenance.md) — how to keep diagrams updated and when to create new ones

## Runtime layers

### 1. Messaging and agent gateway

- **Hermes / Atlas default profile** is the production messaging gateway.
- Connected platforms include Telegram, Discord, and email.
- Atlas uses local swarm services where suitable, especially search, local LLMs, embeddings, STT/TTS, n8n, and agentmon.
- Specialist Hermes profiles are available for delegated work, but the default profile remains the stable production gateway.

### 2. n8n automation

Container/service:

- `n8n-agent`
- Host URL: `http://127.0.0.1:18808`
- Container URL: `http://127.0.0.1:5678`
- Compose file: `/home/will/lab/swarm/docker-compose.yaml`

Important workflow source exports live under:

- `swarm-common/n8n-workflows/`

Current health/automation patterns:

- **Swarm Health Watchdog**: direct endpoint checks for search, LLM, voice, n8n, Docker health, etc.
- **Agentmon Health Watchdog**: polls agentmon aggregate snapshots and alerts on stale/degraded monitoring state.
- **RAG and Embedding Health Watchdog**: checks the RAG/search/embedding path.
- Obsidian workflows: health/reindex, inbox triage, daily review, URL-to-note, chat summary capture, weekly decision/runbook extraction.

### 3. Agentmon monitoring layer

Repo:

- `/home/will/lab/agentmon`

Compose services:

- `agentmon-ingest` on `:8080` — ingestion gateway, `/healthz`
- `agentmon-query` on `:8081` — query API, `/healthz`, `/v1/events`, `/v1/stats/summary`
- `agentmon-ui` on `:8082` — web UI, `/healthz`
- `agentmon-processor` — NATS to Postgres event processor
- `agentmon-swarm-monitor` — monitors Docker containers labeled `agentmon.monitor=true`
- `agentmon-openclaw-monitor` — emits OpenClaw VM snapshots
- `agentmon-db` — Postgres
- `agentmon-nats` — NATS JetStream

Key health and query endpoints:

```text
http://127.0.0.1:8080/healthz
http://127.0.0.1:8081/healthz
http://127.0.0.1:8082/healthz
http://127.0.0.1:8081/v1/stats/summary
http://127.0.0.1:8081/v1/events?event_type=swarm.snapshot&limit=1
http://127.0.0.1:8081/v1/events?event_type=swarm.service.snapshot&limit=20
http://127.0.0.1:8081/v1/events?event_type=openclaw.snapshot&limit=3
```

From inside `n8n-agent`, use the Docker bridge gateway:

```text
http://172.19.0.1:8081/v1/events?event_type=swarm.snapshot&limit=1
```
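
The host-versus-container addressing above can be wrapped in a small helper so scripts do not hard-code the wrong address. This is only a sketch: `swarm_url` is a hypothetical function name, and the two addresses are the ones given in this document (`127.0.0.1` on the host, `172.19.0.1` from inside `n8n-agent`).

```bash
#!/usr/bin/env bash
# swarm_url is a hypothetical helper, not part of the swarm repo.
# Usage: swarm_url <host|n8n-agent> <port> [path]
swarm_url() {
  local context="$1" port="$2" path="${3:-/healthz}"
  local host
  case "$context" in
    host)      host="127.0.0.1" ;;
    n8n-agent) host="172.19.0.1" ;;  # Docker bridge gateway as seen from the container
    *) echo "unknown context: $context" >&2; return 1 ;;
  esac
  printf 'http://%s:%s%s\n' "$host" "$port" "$path"
}

swarm_url host 8081 /healthz
swarm_url n8n-agent 8081 '/v1/events?event_type=swarm.snapshot&limit=1'
```

If the bridge gateway address ever changes, only the `case` branch needs updating.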

### 4. Local AI, search, and voice services

Docker services:

- `litellm` — `:18804`, OpenAI-compatible LLM router
- `litellm-db` — Postgres backing LiteLLM
- `searxng` — `:18803`, local metasearch
- `brave-search` — `:18802`, Brave Search MCP server
- `kokoro-tts` — `:18805`, local TTS
- `whisper-server-npu` — `:18816`, OpenVINO NPU local transcription
- `n8n-agent` — `:18808`, automation

Host/user services:

- `llama-server.service` — `:18806`, local llama.cpp OpenAI-compatible LLM
- `ollama.service` — `:18807`, legacy/CPU embeddings API fallback
- `openvino-embeddings.service` — `:18817`, OpenVINO NPU embeddings API (`/v1/embeddings`, `/api/embed`, `/api/embeddings`)
- `docker-health-endpoint.service` — `:18809`, read-only container health for n8n
- `obsidian-reindex-endpoint.service` — `:18810`, Obsidian/RAG reindex trigger
- `url-content-extractor.service` — `:18812`, YouTube/PDF/web extraction
- `voice-memo-processor.service` — `:18813`, voice memo processing
- `rag-embedding-health.service` — `:18814`, RAG/embedding health wrapper

### 5. Obsidian and RAG

Vault:

- `/home/will/lab/swarm/swarm-common/obsidian-vault/will/will-shared-zap`

Local REST API:

- HTTP: `127.0.0.1:27123`
- HTTPS: `127.0.0.1:27124`

RAG/vector store:

- ChromaDB path: `~/.hermes/data/rag-search/chroma/`
- Reindex state/progress: `~/.hermes/data/rag-search/obsidian_index_state.json` and `obsidian_reindex_progress.json`
- RAG query/reindex embedding backend: still Ollama on `:18807` with `nomic-embed-text` until a deliberate full Chroma rebuild/migration is run.
- RAG/embedding health probe backend: OpenVINO NPU embeddings service on `:18817`, currently `bge-base-en-v1.5-int8-ov`.
- Reindex endpoint on `:18810`:
  - `POST /reindex` for incremental updates
  - `POST /reindex?full=true` for full semantic rebuilds
  - `GET /semantic-health` to verify vectors and run a search smoke test

## Monitoring model

The monitoring design is intentionally layered:

1. **n8n direct probes** check critical service endpoints and send deduped alerts.
2. **agentmon** continuously observes labeled Docker services and OpenClaw state, then writes snapshots through NATS/Postgres.
3. **n8n Agentmon Health Watchdog** polls agentmon's aggregate state and alerts if the monitoring pipeline itself becomes stale/degraded.
4. **Hermes/Atlas** can inspect both n8n and agentmon when troubleshooting, and can use the same endpoints as part of operational checks.

This means a single process being alive is not enough: the important signal is whether collection, ingestion, processing, storage, query, and alerting are all functioning.
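
The difference between "the process is alive" and "the pipeline is flowing" can be sketched as a small classifier. This is an illustration, not the workflow's actual logic: `classify_pipeline` is a hypothetical name, its inputs are assumed to have been gathered already (a `/healthz` result and the age of the latest `swarm.snapshot` in seconds), and the 180-second default mirrors the 3-minute staleness threshold used by the Agentmon Health Watchdog.

```bash
#!/usr/bin/env bash
# classify_pipeline is a hypothetical helper, not part of agentmon or n8n.
# Usage: classify_pipeline <healthz_ok: 1|0> <snapshot_age_seconds or ""> [max_age_seconds]
classify_pipeline() {
  local healthz_ok="$1" snapshot_age="$2" max_age="${3:-180}"
  if [ "$healthz_ok" != "1" ]; then
    echo "down"          # the API process does not answer /healthz
  elif [ -z "$snapshot_age" ]; then
    echo "no-snapshot"   # process alive, but no snapshot has been stored
  elif [ "$snapshot_age" -gt "$max_age" ]; then
    echo "stale"         # process alive, but collection or processing has stalled
  else
    echo "flowing"
  fi
}

classify_pipeline 1 45    # prints "flowing"
classify_pipeline 1 400   # prints "stale"
```

Only the last state indicates that collection, ingestion, processing, storage, and query are all working together.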

## Agentmon Health Watchdog

Workflow source:

- `swarm-common/n8n-workflows/agentmon-health-watchdog.json`

Installed n8n workflow:

- Name: `Agentmon Health Watchdog`
- ID: `AgentmonHealthWatchdog`
- Schedule: every 5 minutes

Alert conditions:

- `agentmon-ingest`, `agentmon-query`, or `agentmon-ui` `/healthz` fails.
- Latest `swarm.snapshot` is missing.
- Latest `swarm.snapshot` is older than 3 minutes.
- Snapshot issues are non-empty.
- A required agentmon service is missing from the snapshot, or is present but neither healthy nor running:
  - `agentmon-ingest`
  - `agentmon-query`
  - `agentmon-ui`
  - `agentmon-processor`
  - `agentmon-swarm-monitor`
  - `agentmon-db`
  - `agentmon-nats`
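
The required-services condition can be sketched as follows. The snapshot's real JSON shape is not described here, so this example assumes the snapshot has already been flattened into `name status` lines; `check_required_services` and that line format are hypothetical. A service counts as fine if its status is `healthy` or `running`.

```bash
#!/usr/bin/env bash
# Required services as listed in this document.
REQUIRED_SERVICES="agentmon-ingest agentmon-query agentmon-ui agentmon-processor agentmon-swarm-monitor agentmon-db agentmon-nats"

# check_required_services is a hypothetical helper.
# Reads "name status" lines on stdin (an assumed, pre-flattened format)
# and prints one line per problem; prints nothing when all services are fine.
check_required_services() {
  local observed svc status
  observed="$(cat)"
  for svc in $REQUIRED_SERVICES; do
    # Look up the status for this service; emit a sentinel if it is absent.
    status="$(printf '%s\n' "$observed" |
      awk -v want="$svc" '$1 == want { print $2; seen = 1; exit }
                          END { if (!seen) print "__missing__" }')"
    case "$status" in
      __missing__)     echo "missing: $svc" ;;
      healthy|running) ;;
      *)               echo "unhealthy: $svc ($status)" ;;
    esac
  done
}

# Example: every service except agentmon-db is reported, and NATS has exited.
printf '%s\n' \
  'agentmon-ingest healthy' 'agentmon-query healthy' 'agentmon-ui healthy' \
  'agentmon-processor running' 'agentmon-swarm-monitor running' \
  'agentmon-nats exited' | check_required_services
```

An empty result means this particular alert condition is not met; the other conditions still have to be evaluated separately.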

Deduplication:

- Alert after 2 failed checks.
- Reminder every 6 failed runs.
- Recovery message when state returns healthy.
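
The deduplication rules can be sketched as a small state function. The real logic lives in the workflow's n8n nodes, so this is only an illustration under two labeled assumptions: failures are counted consecutively, and reminders are counted from the first alert (so they fire on failures 8, 14, and so on). `next_alert_action` is a hypothetical name.

```bash
#!/usr/bin/env bash
# next_alert_action is a hypothetical helper illustrating the dedup rules.
# Usage: next_alert_action <previous_consecutive_failures> <ok|fail>
# Prints "<action> <new_failure_count>", where action is one of
# alert, reminder, recovery, none.
next_alert_action() {
  local prev="$1" result="$2"
  local count action="none"
  if [ "$result" = "ok" ]; then
    count=0
    # Announce recovery only if an alert had actually been sent.
    if [ "$prev" -ge 2 ]; then
      action="recovery"
    fi
  else
    count=$((prev + 1))
    if [ "$count" -eq 2 ]; then
      action="alert"      # second failed check in a row
    elif [ "$count" -gt 2 ] && [ $(( (count - 2) % 6 )) -eq 0 ]; then
      action="reminder"   # assumed: every 6 failed runs after the alert
    fi
  fi
  echo "$action $count"
}

next_alert_action 1 fail   # prints "alert 2"
next_alert_action 5 ok     # prints "recovery 0"
```

Keeping the failure count as the only persisted state means a single transient failure never produces a message, and a recovery is only announced after an alert.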

## Operational quick checks

From the host:

```bash
cd /home/will/lab/swarm
make status
make local-ai-health
curl -fsS http://127.0.0.1:18808/healthz
curl -fsS http://127.0.0.1:8081/healthz
curl -fsS 'http://127.0.0.1:8081/v1/events?event_type=swarm.snapshot&limit=1' | jq .
```

From inside `n8n-agent`:

```bash
docker exec n8n-agent /bin/sh -lc '
  wget -qO- -T 5 http://172.19.0.1:8081/healthz
  wget -qO- -T 5 "http://172.19.0.1:8081/v1/events?event_type=swarm.snapshot&limit=1" | head -c 500
'
```

Verify n8n workflow activation:

```bash
docker exec -u node n8n-agent n8n export:workflow \
  --id=AgentmonHealthWatchdog \
  --output=/tmp/agentmon-export.json

docker cp n8n-agent:/tmp/agentmon-export.json /tmp/agentmon-export.json
jq '.[0] | {id,name,active,nodes:(.nodes|length)}' /tmp/agentmon-export.json
```

## Notes and pitfalls

- Do not commit `.env`, decrypted credentials, raw credential exports, or runtime DB files.
- n8n workflow backups can contain sensitive operational data; keep timestamped raw backups untracked unless intentionally sanitized.
- From the host, use `127.0.0.1:<host-port>`.
- From `n8n-agent`, use `127.0.0.1:5678` for n8n itself and `172.19.0.1:<host-port>` for host-published swarm services.
- Agentmon `/healthz` only proves the web/API process is alive; pair it with snapshot freshness to prove the monitoring pipeline is flowing.
- OpenClaw is intentionally dormant unless explicitly re-enabled, so by default do not alert when its VMs are shut off.