docs(npu): update integrated health runbooks

William Valentin
2026-06-05 15:52:51 -07:00
parent 9e5ffa0fd0
commit 08fb9ca686
2 changed files with 323 additions and 134 deletions
@@ -0,0 +1,201 @@
# NPU integrated health checks — operator runbook notes
Compact, read-only operator workflow that combines the existing
`scripts/npu-service-health.sh` listener/systemd/embedding-proof probe with the
reviewer-approved `scripts/npu-utilization-digest.py` per-service utilization
and fallback report. Together they form a single safe daily / on-demand NPU
health pass.
Scope:
- Read-only against live services. No restarts, route changes, vector mutation,
advisory POSTs, outbound sends, or memory writes.
- No new persistent services, timers, sockets, compose services, or Dockerfiles
are introduced by this integration. Both scripts are foreground / on-demand.
- Binds verified local-only or on the approved Docker bridge (`172.19.0.1:18830`).
Pre-existing broader binds on the live baseline ports (`18810`, `18814`,
`18816`, `18817`) are noted in the runbook and unchanged here.
- NPU proof requires real inference plus a positive
`/sys/class/accel/accel0/device/npu_busy_time_us` delta. HTTP 200 alone is
not sufficient.
## When to run
- Daily / on-demand ops check.
- After upgrades that touch the NPU stack, OpenVINO, or any of the live
specialists.
- Before any approval-gated change that depends on the NPU reflex layer.
- As the read-only verification step of a deploy or recovery runbook.
## Required artifacts on the branch
| Path | Role |
| --- | --- |
| `scripts/npu-service-health.sh` | Listener / systemd / Docker / health endpoint / single embedding proof. Existing baseline script. |
| `scripts/npu-utilization-digest.py` | Per-service utilization digest with NPU proof per probe, compact text or JSONL output, optional JSONL artifact. |
| `docs/npu-utilization-digest.md` | Per-service digest reference. |
| `tests/test_npu_utilization_digest.py` | Offline unit tests for the digest (no live services required). |
## Integrated workflow
### Step 1 — Listener and service-state snapshot
```bash
cd ~/lab/swarm
./scripts/npu-service-health.sh
```
What it verifies, in order:
1. `npu_busy_time_us` counter is readable.
2. Required listeners are present on `18810 / 18814 / 18816 / 18817 / 18818 /
18819 / 18820 / 18829 / 18830`.
3. User systemd services are active/enabled for embeddings, RAG health,
reranker, router/classifier, and the small GenAI worker.
4. Docker Compose `whisper-server-npu` is up.
5. Health endpoints return JSON for the live baseline and local specialists.
6. A single non-private embeddings request to `:18817` produces a positive
sysfs `npu_busy_time_us` delta; the script exits nonzero if there is no
positive delta.
Read the last block (`== Embeddings NPU busy-time proof ==`) first. If
`result=ok` and `sysfs_delta_us > 0`, the central NPU path is healthy. If not,
do not run the digest; triage the embeddings service first.
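The busy-time proof described above can be sketched in a few lines of Python. This is a minimal illustration, not the logic of `npu-service-health.sh` itself: the helper names (`read_busy_us`, `busy_delta_us`, `delta_is_proof`) are hypothetical, and the caller supplies the inference request as a callable.

```python
from pathlib import Path
from typing import Callable

# Counter path as documented in this runbook.
NPU_BUSY_COUNTER = Path("/sys/class/accel/accel0/device/npu_busy_time_us")


def read_busy_us(counter: Path = NPU_BUSY_COUNTER) -> int:
    """Read the cumulative NPU busy time in microseconds."""
    return int(counter.read_text().strip())


def busy_delta_us(run_inference: Callable[[], object],
                  counter: Path = NPU_BUSY_COUNTER) -> int:
    """Return the counter increase observed across one inference call.

    `run_inference` is a hypothetical caller-supplied function that performs
    a single real, non-private request (for example against :18817).
    """
    before = read_busy_us(counter)
    run_inference()
    after = read_busy_us(counter)
    return after - before


def delta_is_proof(delta_us: int) -> bool:
    """HTTP 200 alone is not sufficient; the delta must be strictly positive."""
    return delta_us > 0
```

Passing the counter path as a parameter keeps the sketch testable against a plain file, without touching sysfs.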
### Step 2 — Per-service utilization digest
```bash
scripts/npu-utilization-digest.py --no-write --include-genai-smoke false --format text
```
Compact output shape:
```text
NPU utilization digest <timestamp>
counter=/sys/class/accel/accel0/device/npu_busy_time_us delta_us=<total>
services_ok=<ok>/<total> proof_ok=<ok>/<proof-capable> fallbacks=<n> gates_closed=<n>
- embeddings: ok=true calls=1 avg_ms=... npu_delta_us=... proof=true mode=NPU
- rerank: ok=true calls=1 docs=2 avg_ms=... npu_delta_us=... proof=true mode=NPU
- whisper: ok=true calls=1 jobs=1 avg_ms=... npu_delta_us=... proof=true mode=NPU
- classifier: ok=true calls=1 events=1 avg_ms=... npu_delta_us=... proof=true dry_run=true ...
- genai: ok=true jobs=0 loaded=false mode=loaded=false reason=skipped_cold_load
- doc_triage: ok=true calls=1 files=1 avg_ms=... npu_delta_us=... proof=true gate=closed:private-root
- rag_endpoint: ok=true mode=health_only gate=closed:vector-mutation
- rag_health: ok=true mode=health_only
- advisory_gateway: ok=true mode=health_only gate=closed:advisory-post
fallbacks: skipped_cold_load=1
```
Read order for ops:
1. `services_ok` field: anything below `9/9` means a service is down or unhealthy.
2. `proof_ok` field: `proof_ok=5/5` means every probe that ran a real
inference request produced a positive sysfs NPU delta.
3. `fallbacks:` line — `skipped_cold_load=1` is expected (GenAI worker is
intentionally not cold-loaded). Any other fallback label is a triage signal.
4. `gate=` labels — closed gates that remain closed by design.
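The read order above can be automated against the compact text output. The sketch below assumes the summary line and the `fallbacks:` line look exactly like the sample shown earlier, and that multiple fallback labels are space-separated (an assumption; the sample shows only one). The function names are hypothetical and are not part of `npu-utilization-digest.py`.

```python
import re

_RATIO = re.compile(r"\b(services_ok|proof_ok)=(\d+)/(\d+)")
_COUNT = re.compile(r"\b(fallbacks|gates_closed)=(\d+)\b")

# The runbook treats this fallback label as expected.
EXPECTED_FALLBACKS = {"skipped_cold_load"}


def parse_summary(line: str) -> dict:
    """Parse 'services_ok=9/9 proof_ok=5/5 fallbacks=1 gates_closed=3'."""
    out = {}
    for key, ok, total in _RATIO.findall(line):
        out[key] = (int(ok), int(total))
    for key, count in _COUNT.findall(line):
        out[key] = int(count)
    return out


def parse_fallbacks(line: str) -> dict:
    """Parse 'fallbacks: skipped_cold_load=1' into {'skipped_cold_load': 1}."""
    _, _, rest = line.partition(":")
    return {k: int(v) for k, v in (item.split("=", 1) for item in rest.split())}


def triage_signals(summary: dict, fallbacks: dict) -> list:
    """Return human-readable signals in the runbook's read order."""
    signals = []
    for key in ("services_ok", "proof_ok"):
        ok, total = summary[key]
        if ok < total:
            signals.append(f"{key}={ok}/{total}")
    for label in sorted(set(fallbacks) - EXPECTED_FALLBACKS):
        signals.append(f"unexpected fallback: {label}")
    return signals
```

An empty list from `triage_signals` corresponds to the healthy case described above; `gate=` labels are left for manual reading because closed gates are expected.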
### Step 3 — Optional artifact for trend tracking
```bash
scripts/npu-utilization-digest.py --format jsonl
```
Writes one JSONL file per digest under
`/home/will/.local/state/npu-utilization/digests/<timestamp>.jsonl`. The first
line is the summary; subsequent lines are per-service rows. With `--no-write`,
no JSONL file is written.
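For trend tracking, an artifact can be split back into its summary and per-service rows. The sketch below relies only on the layout stated above (first line summary, remaining lines per-service); it makes no assumption about field names inside each JSON object, and `load_digest` is a hypothetical helper.

```python
import json
from pathlib import Path


def load_digest(path):
    """Return (summary, rows) from one digest JSONL artifact.

    The first non-empty line is treated as the summary record and every
    later line as a per-service row.
    """
    lines = [ln for ln in Path(path).read_text().splitlines() if ln.strip()]
    if not lines:
        raise ValueError(f"empty digest artifact: {path}")
    records = [json.loads(ln) for ln in lines]
    return records[0], records[1:]
```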
### Step 4 — Offline unit tests
```bash
python -m pytest tests/test_npu_utilization_digest.py -q
```
Does not require live services. Use to validate digest logic after edits or
before merging.
## Compact proof interpretation
For each proof-capable service, both the response-level `npu_busy_delta_us`
(when the service reports it) and the script's own sysfs before/after delta
must agree and be `> 0`. The proof is only valid when an actual inference
request ran. If a probe was skipped (`reason=skipped_cold_load` or
`reason=smoke_disabled`), `proof_ok` for that row is `None` and the row
contributes a labeled fallback instead of a proof failure.
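The tri-state proof rule above can be written as a small pure function. This is a sketch under two stated assumptions: "agree" is read as "both deltas are positive" rather than exact equality, and the function name and parameters are hypothetical rather than taken from the digest script.

```python
from typing import Optional

# Skip reasons named in this runbook.
SKIP_REASONS = {"skipped_cold_load", "smoke_disabled"}


def evaluate_proof(sysfs_delta_us: Optional[int],
                   response_delta_us: Optional[int] = None,
                   reason: Optional[str] = None) -> Optional[bool]:
    """Return True/False for a probe that ran, or None for a skipped probe."""
    if reason in SKIP_REASONS:
        # Skipped probes count as labeled fallbacks, not proof failures.
        return None
    if sysfs_delta_us is None or sysfs_delta_us <= 0:
        return False
    # The response-level delta is only checked when the service reports it.
    if response_delta_us is not None and response_delta_us <= 0:
        return False
    return True
```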
Proof currently runs on:
- `embeddings` (`:18817`)
- `rerank` (`:18818`)
- `whisper` (`:18816`) when `--include-whisper-smoke=true` (default)
- `classifier` (`:18819`)
- `doc_triage` (`:18829`) when `--include-doc-triage-smoke=true` (default);
proof is via the embeddings service, not directly on the NPU device, so the
row reports `mode=NPU-via-embedding-service`.
Intentionally health-only (no proof row):
- `rag_endpoint` (`:18810`) — closed:vector-mutation
- `rag_health` (`:18814`)
- `advisory_gateway` (`172.19.0.1:18830`) — closed:advisory-post
Intentionally skipped by default:
- `genai` (`:18820`): `loaded=false` until first use. Cold-loading the model
only to prove the NPU path is costly, so the skip is treated as a labeled
fallback rather than a proof failure. Opt in with `--include-genai-smoke=true`
only when the task actually needs a generation smoke.
## Exit codes and triage gates
`scripts/npu-service-health.sh`:
| Exit | Meaning | Next |
| ---: | --- | --- |
| 0 | All checks passed including embeddings proof. | Continue to digest. |
| 2 | `npu_busy_time_us` not readable. | Check kernel/driver; do not run digest. |
| 3 | Embedding request failed. | Triage `openvino-embeddings.service` and port `:18817`. |
| 4 | Embedding request succeeded but sysfs delta `<= 0`. | Service reachable but not on the NPU; check service logs and device bind. |
`scripts/npu-utilization-digest.py`:
| Exit | Meaning | Next |
| ---: | --- | --- |
| 0 | All reachable services handled; proof/fallback accounting completed. | Inspect `proof_ok` and `fallbacks:` for any unexpected labels. |
| 2 | `--strict-proof` was set and at least one proof-required probe ran without a positive sysfs delta. | Triage the named service's NPU path. |
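The two exit-code tables imply a simple gate: run the digest only when the health script exits 0. A minimal wrapper sketch follows; `run_gated` is a hypothetical helper, and the commands are passed in by the caller so nothing here assumes flags beyond those documented in this runbook.

```python
import subprocess
import sys

# "Next" actions copied from the tables above.
HEALTH_NEXT = {
    0: "continue to digest",
    2: "npu_busy_time_us not readable: check kernel/driver; do not run digest",
    3: "embedding request failed: triage openvino-embeddings.service and :18817",
    4: "no positive sysfs delta: check service logs and device bind",
}
DIGEST_NEXT = {
    0: "inspect proof_ok and fallbacks: for unexpected labels",
    2: "strict proof failed: triage the named service's NPU path",
}
UNKNOWN = "unexpected exit code; inspect the script output"


def run_gated(health_cmd, digest_cmd):
    """Run the health check, then the digest only if the health check passed.

    Returns (health_exit, digest_exit_or_None, next_action).
    """
    health = subprocess.run(health_cmd).returncode
    if health != 0:
        return health, None, HEALTH_NEXT.get(health, UNKNOWN)
    digest = subprocess.run(digest_cmd).returncode
    return health, digest, DIGEST_NEXT.get(digest, UNKNOWN)


if __name__ == "__main__":
    result = run_gated(
        ["./scripts/npu-service-health.sh"],
        ["scripts/npu-utilization-digest.py", "--no-write", "--strict-proof"],
    )
    print(result[2])
    sys.exit(result[1] if result[1] is not None else result[0])
```

Both commands are read-only according to this runbook, so the wrapper stays within the same safety posture.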
## Approval gates left closed
The integrated workflow intentionally does not:
- start, stop, restart, enable, or disable any user systemd unit or Docker
Compose service;
- write to or mutate the Chroma collection `obsidian_bge_npu` or any other
vector store;
- change Atlas/Hermes routing or model defaults;
- post classification/generation/triage events to the advisory gateway;
- broaden private document, image, or audio roots;
- bind any new listener, including on `0.0.0.0`;
- write memory, send messages, execute tools, or mutate Kanban state.
These remain approval-gated and are tracked on the `npu-maximization` board.
## Quick reference
```bash
# Single-pass NPU health check (listener + systemd + embeddings proof).
cd ~/lab/swarm && ./scripts/npu-service-health.sh
# Compact digest with per-service proof and fallback accounting.
scripts/npu-utilization-digest.py --no-write --include-genai-smoke false --format text
# Same, with a JSONL artifact for trend tracking.
scripts/npu-utilization-digest.py --format jsonl
# Strict mode for CI / pre-merge.
scripts/npu-utilization-digest.py --no-write --strict-proof
# Offline digest logic tests.
python -m pytest tests/test_npu_utilization_digest.py -q
```
@@ -3,7 +3,7 @@ type: runbook
system: openvino-npu-services
status: draft
created: 2026-06-04
updated: 2026-06-04
updated: 2026-06-05
tags:
- runbook
- openvino
@@ -18,33 +18,92 @@ related:
# OpenVINO NPU Services Runbook
This runbook is the integrated operations view for Will's local Intel NPU/OpenVINO services from the `npu-capability-expansion` board.
This runbook is the integrated operations view for Will's local Intel NPU/OpenVINO services after the first approved `npu-maximization` lanes. It treats the NPU as a local reflex layer: classify, embed, rerank, transcribe, triage, and draft compact advisory output while Atlas/Hermes keeps final authority unless a separate approval changes that.
Safety posture:
- Do not restart the live Atlas/Hermes gateway from this runbook.
- Do not change primary Atlas/Hermes routing without explicit Will approval.
- Do not delete, overwrite, or in-place reindex existing Chroma/vector collections.
- Treat HTTP 200 as necessary but not sufficient for NPU-backed services; verify `/sys/class/accel/accel0/device/npu_busy_time_us` before/after an inference.
- Keep endpoints local-only unless Will explicitly approves broader exposure.
- Keep raw prompts, private documents, OCR text, and secrets out of logs and durable handoffs.
- Treat HTTP 200 as necessary but not sufficient for NPU-backed services; verify `/sys/class/accel/accel0/device/npu_busy_time_us` before/after a real inference.
- Keep endpoints local-only or on the approved Docker bridge only; do not add wildcard binds.
- Keep raw prompts, private documents, OCR text, transcripts, and secrets out of logs and durable handoffs.
- Keep operational outputs compact: booleans, counts, paths, deltas, and gates rather than raw JSON dumps.
## Current service map
## Reflex-layer topology
| Capability | Port | Runtime / service | Path | State | Health endpoint | NPU proof |
| --- | ---: | --- | --- | --- | --- | --- |
| Obsidian/RAG endpoint | 18810 | `obsidian-reindex-endpoint.service` / local Python endpoint | `~/lab/swarm/scripts/` | live baseline; uses collection `obsidian_bge_npu` | `http://127.0.0.1:18810/healthz` | indirect via embeddings `:18817`; do not mutate existing collection |
| RAG/embedding health wrapper | 18814 | `rag-embedding-health.service` | `~/lab/swarm/swarm-common/rag-embedding-health.service` | live baseline | `http://127.0.0.1:18814/healthz` | should exercise embeddings path when configured |
| Whisper transcription, OpenVINO NPU | 18816 | Docker Compose service/container `whisper-server-npu` | `~/lab/swarm/whisper-openvino-npu/` | live baseline | `http://127.0.0.1:18816/health` | transcription response includes `npu_busy_delta_us`; sysfs delta must increase |
| OpenVINO embeddings | 18817 | user systemd `openvino-embeddings.service` | `~/lab/swarm/scripts/openvino-embeddings-server.py`; unit in `~/lab/swarm/swarm-common/openvino-embeddings.service` | live baseline, enabled | `http://127.0.0.1:18817/healthz` | embedding response and sysfs delta must be positive |
| NPU reranker prototype | 18818 | optional user systemd `openvino-reranker.service` | `~/lab/swarm/openvino-reranker-npu/` | approved prototype; not installed/enabled | `http://127.0.0.1:18818/readyz` | `/readyz` reports `device=NPU`; `/v1/rerank` response and sysfs delta must be positive |
| NPU router/classifier prototype | 18819 | optional user systemd `openvino-router-classifier.service` | `~/lab/swarm/openvino-classifier-npu/` | approved prototype; not installed/enabled | `http://127.0.0.1:18819/healthz` | `/v1/classify` response has positive `npu_busy_delta_us` and `sysfs_npu_busy_delta_us` |
| Small OpenVINO GenAI NPU worker | 18820 | optional user systemd `openvino-genai-npu-worker.service` | `~/lab/swarm/openvino-genai-npu-worker/` | approved prototype; not installed/enabled | `http://127.0.0.1:18820/healthz`; `GET /models` | generation response includes positive `npu_busy_delta_us` |
| Document/image triage prototype | optional 18829 for review only; 18828 was an earlier smoke alternate | CLI-first; foreground local-only server if needed; no persistent unit yet | `~/lab/swarm/openvino-doc-image-triage-npu/` | approved prototype; not installed/enabled | `http://127.0.0.1:18829/healthz`; `GET /models` | v1 NPU stage is semantic embedding through `:18817`; image classification/OCR remain CPU/local |
```text
event / audio / doc / query / task
-> local OpenVINO/NPU specialists
embeddings :18817, rerank :18818, whisper :18816,
classifier :18819, genai worker :18820, doc/image triage :18829,
advisory gateway 172.19.0.1:18830
-> explicit policy and authority gates
-> Atlas/Hermes or human only when approved/useful
```
Authority split:
- NPU services may advise, label, score, transcribe, embed, rerank, triage explicit roots/files, and draft bounded summaries.
- NPU services must not route Atlas/Hermes, write memory, send outbound messages, restart services, execute tools, mutate Kanban, or mutate vector DBs without separate approval.
## Live baseline services
These are part of the current live local baseline. Use read-only checks unless Will explicitly asks for remediation.
| Capability | Port / bind | Runtime / service | State | Health / proof | Notes |
| --- | ---: | --- | --- | --- | --- |
| Obsidian/RAG endpoint | `18810` | `obsidian-reindex-endpoint.service` / local Python endpoint | live baseline | `http://127.0.0.1:18810/healthz`; NPU proof is indirect through embeddings/rerank | Uses collection `obsidian_bge_npu`; do not mutate/reindex in place. Discovery observed `RAG_RERANK_ENABLED=true` and `RAG_RERANK_REQUIRE_NPU_PROOF=true`; do not change from this runbook. |
| RAG/embedding health wrapper | `18814` | `rag-embedding-health.service` | live baseline | `http://127.0.0.1:18814/healthz` | Health wrapper only; use compact summaries. |
| Whisper transcription | `18816` | Docker Compose service/container `whisper-server-npu` | live baseline | `http://127.0.0.1:18816/health`; transcription response plus sysfs busy delta must increase | Use small non-private WAV fixtures for proof. Do not restart from docs. |
| OpenVINO embeddings | `18817` | user systemd `openvino-embeddings.service` | live baseline, enabled | `http://127.0.0.1:18817/healthz`; embedding response and sysfs delta must be positive | Model `bge-base-en-v1.5-int8-ov`, dim 768. Existing bind is broader than new-service guidance; do not broaden anything else. |
## Live local-only advisory specialists
These services are available locally for advisory/reflex work, not for authority. Some were originally prototypes but discovery/review found them active/enabled; do not reinstall or enable again blindly.
| Capability | Port / bind | Runtime / service | State | Health / proof | Authority boundary |
| --- | ---: | --- | --- | --- | --- |
| NPU reranker | `18818` localhost | `openvino-reranker.service` / `openvino-reranker-npu/` | live local specialist | `/readyz`; `/rerank` response and positive sysfs delta | Rerank only; no vector mutation. |
| NPU router/classifier | `18819` localhost | `openvino-router-classifier.service` / `openvino-classifier-npu/` | live local specialist, dry-run/advisory | `/healthz`; `/v1/classify` response and positive sysfs delta | Labels/recommendations only; no routing, sends, memory writes, restarts, or tool execution. |
| Small OpenVINO GenAI worker | `18820` localhost | `openvino-genai-npu-worker.service` / `openvino-genai-npu-worker/` | live local specialist; may report `loaded=false` until used | `/healthz`, `/models`; generation proof requires positive sysfs delta | Bounded draft/title/summary jobs only; not primary Atlas chat. Avoid cold-load generation unless the task requires it. |
| Document/image triage | `18829` localhost | `openvino-doc-image-triage-npu/` | live local specialist with explicit roots | `/healthz`, `/models`; v1 NPU proof is semantic embedding through `:18817` | Request roots may narrow configured roots, never broaden. OCR/image classification are CPU/local fallbacks. |
| Advisory gateway | `172.19.0.1:18830` approved bridge | `openvino-advisory-gateway.service` / `openvino-advisory-gateway/` | live bridge-facing advisory wrapper | `/healthz`; classify/generate/triage responses include NPU proof | For `n8n-agent` and host cron. POSTs can write metadata events, so use health-only unless classification/draft is in scope. No wildcard bind. |
Port notes:
- `18818`, `18819`, and `18820` are reserved prototype ports from the program plan; check listeners before binding.
- `18820` is reserved for the GenAI worker prototype. Use optional `18829` for document/image triage foreground review until Will approves a final persistent port. `18828` was used in earlier review smoke only and should not be treated as the preferred documented port.
- Existing `:18817` is currently bound on `0.0.0.0` by the user service; prototype services should still default to `127.0.0.1`.
- Prefer localhost for host-only sidecars. The advisory gateway bridge bind is intentionally for Docker bridge consumers such as `n8n-agent`.
- `18828` was an earlier review alternate for doc/image triage and should not be treated as the preferred documented port.
- Check listeners before foreground smokes: `ss -ltnp | grep -E ':(18810|18814|18816|18817|18818|18819|18820|18829|18830)\b'`.
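The listener check can also be evaluated programmatically, for example to flag a missing port or an unexpected wildcard bind. The sketch below parses text captured from `ss -ltn`; it assumes the usual column layout (state first, local `address:port` in the fourth column), and all function names are hypothetical. The set of ports with known broader binds comes from the integrated health notes.

```python
from typing import Dict, Set

EXPECTED_PORTS = {18810, 18814, 18816, 18817, 18818, 18819, 18820, 18829, 18830}
# Live baseline ports whose broader binds are pre-existing and noted.
KNOWN_BROAD = {18810, 18814, 18816, 18817}
WILDCARDS = {"0.0.0.0", "*", "::"}


def listeners(ss_output: str) -> Dict[int, Set[str]]:
    """Map each expected listening port to the set of local bind addresses."""
    found: Dict[int, Set[str]] = {}
    for line in ss_output.splitlines():
        fields = line.split()
        if len(fields) < 5 or fields[0] != "LISTEN":
            continue  # header or non-listening socket
        host, _, port = fields[3].rpartition(":")
        if port.isdigit() and int(port) in EXPECTED_PORTS:
            found.setdefault(int(port), set()).add(host.strip("[]"))
    return found


def missing_ports(found: Dict[int, Set[str]]) -> list:
    """Expected ports with no listener."""
    return sorted(EXPECTED_PORTS - set(found))


def unexpected_wildcards(found: Dict[int, Set[str]]) -> list:
    """Ports bound to a wildcard address outside the known baseline set."""
    return sorted(
        port for port, hosts in found.items()
        if hosts & WILDCARDS and port not in KNOWN_BROAD
    )
```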
## Dry-run examples and approved lane artifacts
The first-slice lanes below are approved as dry-run/local advisory examples. They may be merged into the repo by the integration lane, but they do not grant authority to mutate live Atlas/Hermes behavior.
| Lane | Approved branch / commit | Artifact paths | Safe use |
| --- | --- | --- | --- |
| Observability/utilization digest | `feature/npu-max-observability` @ `d661dc299` | `docs/npu-utilization-digest.md`, `scripts/npu-utilization-digest.py` | Read-only compact digest; can write JSONL under `~/.local/state/npu-utilization/digests` unless `--no-write`. Reviewer verified services_ok=9/9, proof_ok=5/5 on live smoke. |
| Context-gate advisory CLI | `feature/npu-max-context-gate` @ `b4ef90aff` | `openvino_context_gate/`, `scripts/context-gate-advisory.py` | Plans typed context bundle sources; no retrieval, routing, memory write, or private content. Classifier URL is loopback-only and redirects fail closed. |
| Cron/n8n advisory classifier | `feature/npu-max-cron-n8n` @ `54d3bcb7` | `openvino-advisory-gateway/docs/cron-n8n-advisory-classifier.md`, `examples/cron-advisory-dry-run.sh`, `examples/n8n-advisory-dry-run-fragment.json` | Dry-run event classification: duplicate/stale/no-op/action-required -> suppress/log/summarize/escalate recommendation, then human/Atlas gate before side effects. |
| Explicit-root batch doc/image/audio triage | `feature/npu-max-doc-audio-triage` @ `bfa62cddb` | `docs/npu-batch-triage-dry-run.md`, `scripts/npu-batch-triage-dry-run.py`, `config/triage-roots*.yaml` | Reads only approved/narrow staging roots; reports compact counts/proof; no file moves, Obsidian/RAG writes, sends, or vector mutation. Whisper endpoint override is loopback `:18816` only. |
| Voice/audio local-file pipeline | `feature/npu-max-voice` @ `534816249` | `docs/npu-voice-audio-pipeline.md`, `scripts/npu_voice_audio_pipeline.py` | Local audio file -> Whisper NPU -> classifier NPU -> advisory gate. No platform fetching, sends, writes, memory writes, or routing changes. |
| Kanban/task hygiene advisory | `feature/npu-max-kanban-hygiene` @ `575a3cef6` | `scripts/kanban-hygiene-advisory.py` | Reads compact board summaries and suggests labels/next gates only. Does not call Kanban tools or mutate the board. NPU proof failures dominate generic review-required gates. |
Dry-run command patterns:
```bash
# Compact service/proof digest; no artifact write during review.
scripts/npu-utilization-digest.py --no-write --include-genai-smoke false
# Local-only context-gate planning; does not retrieve private content.
python scripts/context-gate-advisory.py --query "How do I check NPU reranker proof?" --format compact
# Cron/n8n event advisory wrapper; dry-run only, one compact decision line.
openvino-advisory-gateway/examples/cron-advisory-dry-run.sh npu-service-health warning health_check "openvino-reranker timeout twice" "service:openvino-reranker:timeout"
# Explicit-root triage; manifest root may be narrowed by --root, never broadened.
python scripts/npu-batch-triage-dry-run.py --manifest config/triage-roots.test.yaml --lane receipts --root openvino-doc-image-triage-npu/samples --limit 5 --dry-run --json
# Local-file audio advisory; transcript omitted unless explicitly requested.
/home/will/.venvs/npu/bin/python scripts/npu_voice_audio_pipeline.py --audio /tmp/npu-voice-smoke.wav --title "synthetic smoke" --source manual_smoke --json
```
## Read-only unified health check
@@ -55,15 +114,15 @@ cd ~/lab/swarm
./scripts/npu-service-health.sh
```
The script is read-only. It checks listeners for `18810`, `18816`, `18817`, `18818`, `18819`, `18820`, `18829` plus the existing `18814` wrapper and `18828` review alternate, user service state, Docker Compose state for `whisper-server-npu`, JSON health endpoints, and performs a non-private embeddings request while measuring `/sys/class/accel/accel0/device/npu_busy_time_us` before and after. A positive sysfs delta is required for the embeddings proof.
The script is read-only. It checks listeners for the live baseline and local specialists, user service state, Docker Compose state for `whisper-server-npu`, JSON health endpoints, and a non-private embeddings request while measuring `/sys/class/accel/accel0/device/npu_busy_time_us` before and after. A positive sysfs delta is required for the embeddings proof.
Manual minimal checks:
```bash
BUSY=/sys/class/accel/accel0/device/npu_busy_time_us
cat "$BUSY"
ss -ltnp | grep -E ':(18810|18816|18817|18818|18819|18820|18829)\b' || true
systemctl --user is-active openvino-embeddings.service rag-embedding-health.service
ss -ltnp | grep -E ':(18810|18814|18816|18817|18818|18819|18820|18829|18830)\b' || true
systemctl --user is-active openvino-embeddings.service rag-embedding-health.service openvino-reranker.service openvino-router-classifier.service openvino-genai-npu-worker.service openvino-doc-image-triage.service openvino-advisory-gateway.service
cd ~/lab/swarm && docker compose ps whisper-server-npu
curl -fsS http://127.0.0.1:18817/healthz | jq .
```
@@ -87,23 +146,7 @@ A healthy NPU path has:
## Service-specific smoke checks
For any foreground prototype server below, run it in a terminal you control or capture its PID and stop it at the end of the smoke. Do not use `systemctl --user enable`, Docker Compose `up -d`, `nohup`, or shell disowning for these review smokes unless Will explicitly approved persistent service enablement.
Safe foreground-server pattern:
```bash
server_pid=""
cleanup() {
if [[ -n "$server_pid" ]] && kill -0 "$server_pid" 2>/dev/null; then
kill "$server_pid"
wait "$server_pid" 2>/dev/null || true
fi
}
trap cleanup EXIT
# start prototype server with --host 127.0.0.1 --port <port> &
# server_pid=$!
# run curl/smoke commands, then let trap stop it
```
For any foreground prototype/server smoke, run it in a terminal you control or capture its PID and stop it at the end. Do not use `systemctl --user enable`, Docker Compose `up -d`, `nohup`, or shell disowning unless Will explicitly approved persistent service enablement. Several specialists are already live; do not start duplicate listeners.
### Whisper NPU (`:18816`)
@@ -115,7 +158,6 @@ curl -fsS http://127.0.0.1:18816/health | jq .
Operational notes:
- Managed as Docker Compose service/container `whisper-server-npu` in `~/lab/swarm`.
- Consistent with existing swarm service patterns because it is a containerized service with Compose health.
- Do not restart it from this runbook unless Will asked for remediation.
### OpenVINO embeddings (`:18817`)
@@ -127,26 +169,10 @@ curl -fsS http://127.0.0.1:18817/healthz | jq .
Operational notes:
- User systemd unit: `openvino-embeddings.service`.
- Model: `bge-base-en-v1.5-int8-ov`.
- Model directory: `/home/will/.cache/openvino-models/bge-base-en-v1.5-int8-ov`.
- Live RAG `:18810` uses Chroma collection `obsidian_bge_npu` through this service. Do not reindex or replace this collection in place.
### Reranker prototype (`:18818`)
Foreground review start only, after confirming port is free:
```bash
ss -ltnp | grep ':18818\b' || true
cd ~/lab/swarm/openvino-reranker-npu
source /home/will/.venvs/openvino-reranker/bin/activate
OPENVINO_RERANKER_HOST=127.0.0.1 \
OPENVINO_RERANKER_PORT=18818 \
OPENVINO_RERANKER_DEVICE=NPU \
OPENVINO_RERANKER_MODEL_DIR=/home/will/.cache/openvino-models/rerankers/ms-marco-MiniLM-L6-v2-int8-ov \
python server.py
```
From another shell:
### Reranker (`:18818`)
```bash
curl -fsS http://127.0.0.1:18818/readyz | jq .
@@ -154,107 +180,78 @@ python ~/lab/swarm/openvino-reranker-npu/smoke.py --url http://127.0.0.1:18818
```
Approval gate:
- May be installed as `openvino-reranker.service` only after foreground smoke and Will approval.
- May be integrated into RAG only behind disabled-by-default knobs such as `RAG_RERANK_ENABLED=false`; request-time reranking must not mutate Chroma.
- Rerank may score candidate passages only. Any change to RAG answer selection, rerank policy, or vector DB behavior requires separate approval and rollback notes.
### Router/classifier prototype (`:18819`)
Foreground review start only, after confirming port is free:
```bash
ss -ltnp | grep ':18819\b' || true
cd ~/lab/swarm/openvino-classifier-npu
/home/will/.venvs/npu/bin/python router_classifier.py --host 127.0.0.1 --port 18819
```
Smoke:
### Router/classifier (`:18819`)
```bash
curl -fsS http://127.0.0.1:18819/healthz | jq .
curl -fsS http://127.0.0.1:18819/v1/classify \
-H 'Content-Type: application/json' \
-d '{"id":"smoke","text":"Urgent: check whether port 18817 is listening and inspect systemd logs.","options":{"include_evidence":true,"dry_run":true}}' | jq .
-d '{"id":"smoke","text":"Urgent: check whether port 18817 is listening and inspect systemd logs.","options":{"include_evidence":false,"dry_run":true}}' | jq '{id, labels, npu_busy_delta_us, sysfs_npu_busy_delta_us}'
```
Approval gate:
- May be installed as `openvino-router-classifier.service` only after Will approves live service enablement.
- Must remain dry-run and must not alter Hermes/Atlas routing, memory writes, safety confirmation flow, or outbound messages without a separate explicit approval.
- Must remain dry-run/advisory and must not alter Hermes/Atlas routing, memory writes, safety confirmation flow, or outbound messages without a separate explicit approval.
### Small GenAI NPU worker (`:18820`)
Foreground review start only, after confirming port is free:
```bash
ss -ltnp | grep ':18820\b' || true
cd ~/lab/swarm/openvino-genai-npu-worker
/home/will/.venvs/npu/bin/python worker.py --host 127.0.0.1 --port 18820
```
Smoke:
```bash
curl -fsS http://127.0.0.1:18820/healthz | jq .
curl -fsS http://127.0.0.1:18820/models | jq .
curl -fsS http://127.0.0.1:18820/v1/worker/condense-notification \
-H 'Content-Type: application/json' \
-d '{"input":"Non-private smoke notification for local NPU worker.","max_new_tokens":64}' | jq .
```
Approval gate:
- May be installed as `openvino-genai-npu-worker.service` only after Will approves persistent service enablement.
- Must not become primary Atlas/Hermes model routing. Use only for bounded background jobs such as title, summary, notification condensation, and memory-candidate drafting.
- Must not become primary Atlas/Hermes model routing. Use only for bounded local jobs such as title, summary, notification condensation, and memory-candidate drafting after the relevant job is approved.
- Avoid generation smokes that cold-load the model unless the task explicitly calls for it.
### Document/image triage prototype (`:18829` optional review port)
Foreground review start only, after confirming the port is free:
```bash
ss -ltnp | grep ':18829\b' || true
cd ~/lab/swarm/openvino-doc-image-triage-npu
/home/will/.venvs/npu/bin/python server.py --host 127.0.0.1 --port 18829 --allowed-root "$PWD"
```
Smoke:
### Document/image triage (`:18829`)
```bash
curl -fsS http://127.0.0.1:18829/healthz | jq .
curl -fsS http://127.0.0.1:18829/models | jq .
/home/will/.venvs/npu/bin/python tests/smoke_test.py
```
Approval gate:
- Do not point it at arbitrary directories; allowed roots must be equal to or under configured roots.
- Do not include raw OCR text or full source paths unless Will explicitly asks for a one-off response.
- Do not include raw OCR text or full source paths unless Will explicitly asks for one-off debugging.
- v1 only uses the NPU through `:18817` embeddings for needs-attention; image category classification and OCR are CPU/local fallbacks.
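The "narrow, never broaden" rule for roots can be illustrated with a path containment check. This is a hedged sketch of the idea, not the triage server's actual validation: `narrow_root` is a hypothetical helper, and it resolves symlinks and `..` segments before comparing.

```python
from pathlib import Path


def narrow_root(requested, configured_roots):
    """Return the resolved requested root if it is equal to or under a
    configured root; otherwise raise PermissionError."""
    req = Path(requested).resolve()
    for root in configured_roots:
        base = Path(root).resolve()
        if req == base or base in req.parents:
            return req
    raise PermissionError("requested root is outside the configured roots")
```

Resolving before comparing means a request such as `<root>/../elsewhere` is rejected even though its text starts with a configured root.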
## Systemd and Compose recommendations
### Advisory gateway (`172.19.0.1:18830`)
Recommended management split:
- Keep containerized services in Docker Compose when they already have Docker build/runtime shape and Compose health (`whisper-server-npu`).
- Keep host-side OpenVINO Python prototypes as user systemd services when they depend on local venvs, sysfs NPU access, model caches, and localhost-only APIs (`openvino-embeddings`, optional reranker/classifier/GenAI worker).
- Do not add the prototypes to the live gateway or primary routing during installation. Installation and routing are separate approval gates.
```bash
curl -fsS http://172.19.0.1:18830/healthz | jq .
docker exec n8n-agent wget -qO- -T 8 http://172.19.0.1:18830/healthz
```
User-systemd unit expectations for optional prototypes:
- `WorkingDirectory` points at the service directory under `~/lab/swarm/`.
- `ExecStart` uses the existing venv path documented by the prototype.
- `Environment` pins host to `127.0.0.1`, port, model path, device `NPU`, and any upstream endpoint.
- `Restart=on-failure`, not aggressive restart loops.
- Logs go to user journal; do not log raw request bodies.
- Start manually for smoke; enable on boot only after Will approval.
Approval gate:
- Classification/generation/triage POSTs are advisory only and may write metadata counters. Do not wire outputs to sends, restarts, memory writes, tool execution, or Atlas/Hermes routing without a separate reviewed approval.
Compose expectations for existing swarm services:
- Prefer `cd ~/lab/swarm && make ps`, `make status`, and targeted `docker compose ps <service>` for read-only checks.
- Do not run `docker compose up -d`, restart containers, pull images, or prune volumes from this runbook without approval.
## Approval-gated / not-live integrations
The following remain closed even though dry-run examples and local specialists exist:
| Integration | Current gate |
| --- | --- |
| Primary Atlas/Hermes routing changes | closed; no live routing authority changes from this program slice |
| Memory writes from NPU classifier/GenAI/advisory gateway | closed |
| Telegram/Discord/email/outbound sends from cron/n8n/voice/advisory output | closed |
| Service restarts or tool execution triggered by classifier/gateway output | closed |
| Automatic Kanban task mutation, assignment, block/unblock, completion, or task creation | closed |
| Broad private document/image/audio root processing | closed; only explicit approved/narrow roots |
| Vector DB mutation/reindex or Chroma collection replacement | closed |
| Wildcard binds or broader exposure for new services | closed |
| GenAI worker as primary chat model | closed; bounded local drafts only |
| Diffusion/image generation on the NPU | rejected/parked for this program |
## Monitoring and logging notes
Minimum recurring monitoring should include:
- Listener presence for `18816`, `18817`, and any approved optional prototype ports.
- User service state for `openvino-embeddings.service` and any approved optional prototype unit.
- Docker Compose health for `whisper-server-npu`.
- Listener presence for live baseline and any approved specialist ports.
- User service state for OpenVINO services and Docker Compose health for `whisper-server-npu`.
- HTTP health endpoint success.
- Positive sysfs NPU busy-time delta on at least one non-private inference probe, preferably embeddings `:18817` because it is already live and central.
- Journal/container logs only at summary level. Avoid raw prompts, raw OCR text, private document names, credentials, and API keys.
- Compact counts/deltas/gates only. Avoid raw prompts, transcripts, OCR text, private document names, credentials, and API keys.
Useful log commands:
@@ -264,23 +261,14 @@ journalctl --user -u rag-embedding-health.service -n 100 --no-pager
journalctl --user -u openvino-reranker.service -n 100 --no-pager
journalctl --user -u openvino-router-classifier.service -n 100 --no-pager
journalctl --user -u openvino-genai-npu-worker.service -n 100 --no-pager
journalctl --user -u openvino-advisory-gateway.service -n 100 --no-pager
cd ~/lab/swarm && docker compose logs --tail 100 whisper-server-npu
```
## Approval gates
## Approved/parked outcomes
Requires explicit Will approval before proceeding:
- Installing, enabling, or autostarting `openvino-reranker.service`, `openvino-router-classifier.service`, or `openvino-genai-npu-worker.service`.
- Assigning a final persistent port to document/image triage or enabling it as a persistent service.
- Enabling live RAG reranking or any request path that changes Atlas/RAG answers.
- Changing primary Atlas/Hermes routing or connecting router/classifier outputs to live decisions.
- Connecting the GenAI worker to primary Atlas chat, gateway routing, memory writes, or outbound notifications.
- Restarting the live Atlas/Hermes gateway.
- Deleting, overwriting, or in-place reindexing existing vector collections.
- Broadening bind addresses or exposure beyond local-only defaults.
Approved/parked outcomes:
- Built/approved prototypes: reranker (`:18818`), router/classifier (`:18819`), small GenAI worker (`:18820`), document/image triage (review ports `:18828`/`:18829`).
- Live baseline retained: Whisper NPU (`:18816`), OpenVINO embeddings (`:18817`), RAG endpoint (`:18810`) using `obsidian_bge_npu`.
- Live baseline retained: RAG endpoint (`:18810`), RAG health wrapper (`:18814`), Whisper NPU (`:18816`), OpenVINO embeddings (`:18817`).
- Live local-only advisory/reflex specialists: reranker (`:18818`), router/classifier (`:18819`), GenAI worker (`:18820`), doc/image triage (`:18829`), advisory gateway bridge (`172.19.0.1:18830`).
- Approved dry-run examples: utilization digest, context gate plan, cron/n8n advisory classifier, explicit-root batch triage, local-file voice/audio pipeline, Kanban hygiene advisory.
- Parked: always-on wake-word/audio and conventional vision detection until Will wants a concrete use case.
- Rejected for this NPU program: diffusion/image generation.