| type | system | status | created | updated | tags | related |
|---|---|---|---|---|---|---|
| runbook | openvino-npu-services | draft | 2026-06-04 | 2026-06-04 | | |
OpenVINO NPU Services Runbook
This runbook is the integrated operations view for Will's local Intel NPU/OpenVINO services from the npu-capability-expansion board.
Safety posture:
- Do not restart the live Atlas/Hermes gateway from this runbook.
- Do not change primary Atlas/Hermes routing without explicit Will approval.
- Do not delete, overwrite, or in-place reindex existing Chroma/vector collections.
- Treat HTTP 200 as necessary but not sufficient for NPU-backed services; verify /sys/class/accel/accel0/device/npu_busy_time_us before/after an inference.
- Keep endpoints local-only unless Will explicitly approves broader exposure.
- Keep raw prompts, private documents, OCR text, and secrets out of logs and durable handoffs.
Current service map
| Capability | Port | Runtime / service | Path | State | Health endpoint | NPU proof |
|---|---|---|---|---|---|---|
| Obsidian/RAG endpoint | 18810 | obsidian-reindex-endpoint.service / local Python endpoint | ~/lab/swarm/scripts/ | live baseline; uses collection obsidian_bge_npu | http://127.0.0.1:18810/healthz | indirect via embeddings :18817; do not mutate existing collection |
| RAG/embedding health wrapper | 18814 | rag-embedding-health.service | ~/lab/swarm/swarm-common/rag-embedding-health.service | live baseline | http://127.0.0.1:18814/healthz | should exercise embeddings path when configured |
| Whisper transcription, OpenVINO NPU | 18816 | Docker Compose service/container whisper-server-npu | ~/lab/swarm/whisper-openvino-npu/ | live baseline | http://127.0.0.1:18816/health | transcription response includes npu_busy_delta_us; sysfs delta must increase |
| OpenVINO embeddings | 18817 | user systemd openvino-embeddings.service | ~/lab/swarm/scripts/openvino-embeddings-server.py; unit in ~/lab/swarm/swarm-common/openvino-embeddings.service | live baseline, enabled | http://127.0.0.1:18817/health | embedding response and sysfs delta must be positive |
| NPU reranker prototype | 18818 | optional user systemd openvino-reranker.service | ~/lab/swarm/openvino-reranker-npu/ | approved prototype; not installed/enabled | http://127.0.0.1:18818/readyz | /readyz reports device=NPU; /v1/rerank response and sysfs delta must be positive |
| NPU router/classifier prototype | 18819 | optional user systemd openvino-router-classifier.service | ~/lab/swarm/openvino-classifier-npu/ | approved prototype; not installed/enabled | http://127.0.0.1:18819/healthz | /v1/classify response has positive npu_busy_delta_us and sysfs_npu_busy_delta_us |
| Small OpenVINO GenAI NPU worker | 18820 | optional user systemd openvino-genai-npu-worker.service | ~/lab/swarm/openvino-genai-npu-worker/ | approved prototype; not installed/enabled | http://127.0.0.1:18820/healthz; GET /models | generation response includes positive npu_busy_delta_us |
| Document/image triage prototype | optional 18829 for review only | CLI-first; foreground local-only server if needed; no persistent unit yet | ~/lab/swarm/openvino-doc-image-triage-npu/ | approved prototype; not installed/enabled | http://127.0.0.1:18829/healthz; GET /models | v1 NPU stage is semantic embedding through :18817; image classification/OCR remain CPU/local |
Port notes:
- 18818, 18819, and 18820 are reserved prototype ports from the program plan; check listeners before binding.
- 18820 is reserved for the GenAI worker prototype. Use optional 18829 for document/image triage foreground review until Will approves a final persistent port.
- 18828 was used in earlier review smoke only and should not be treated as the preferred documented port.
- Existing :18817 is currently bound on 0.0.0.0 by the user service; prototype services should still default to 127.0.0.1.
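The listener check and the loopback-only default can be combined in one read-only helper. The following is a minimal sketch, not something shipped in the swarm repo: the function name classify_listeners is hypothetical, and it assumes the default `ss -ltn` column layout, where the fourth column is the local address and port.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads `ss -ltn` output on stdin and reports, for each
# requested port, whether it is free, loopback-only, or bound more widely.
classify_listeners() {
  awk -v ports="$*" '
    BEGIN { n = split(ports, want, " ") }
    $1 == "LISTEN" {
      addr = $4
      port = addr; sub(/.*:/, "", port)      # text after the last colon
      host = addr; sub(/:[^:]*$/, "", host)  # text before the last colon
      loop = (host ~ /^127\./ || host == "[::1]")
      # A single non-loopback listener marks the whole port as exposed.
      if (!(port in state) || !loop) state[port] = loop ? "loopback" : "exposed"
    }
    END {
      for (i = 1; i <= n; i++) {
        p = want[i]
        print p, ((p in state) ? state[p] : "free")
      }
    }'
}

# Usage (read-only):
#   ss -ltn | classify_listeners 18817 18818 18819 18820 18829
```

A port reported as free can be bound by a foreground prototype; a port reported as exposed is listening on something other than loopback, as the notes above describe for :18817.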
Read-only unified health check
From the swarm repo:
cd ~/lab/swarm
./scripts/npu-service-health.sh
The script is read-only. It checks listeners, user service state, Docker Compose state for whisper-server-npu, JSON health endpoints, and performs a non-private embeddings request while measuring /sys/class/accel/accel0/device/npu_busy_time_us before and after. A positive sysfs delta is required for the embeddings proof.
Manual minimal checks:
BUSY=/sys/class/accel/accel0/device/npu_busy_time_us
cat "$BUSY"
ss -ltnp | grep -E ':(18810|18814|18816|18817|18818|18819|18820|18828|18829)\b' || true
systemctl --user is-active openvino-embeddings.service rag-embedding-health.service
cd ~/lab/swarm && docker compose ps whisper-server-npu
curl -fsS http://127.0.0.1:18817/health | jq .
Embedding NPU proof:
BUSY=/sys/class/accel/accel0/device/npu_busy_time_us
before=$(cat "$BUSY")
curl -fsS http://127.0.0.1:18817/v1/embeddings \
-H 'Content-Type: application/json' \
-d '{"input":"non-private npu health probe","model":"bge-base-en-v1.5-int8-ov"}' | jq '{model, object, npu_busy_delta_us, embedding_count:(.data|length)}'
after=$(cat "$BUSY")
echo "sysfs_npu_busy_delta_us=$((after-before))"
A healthy NPU path has:
- HTTP success from the endpoint.
- Response-level npu_busy_delta_us > 0 when the service reports it.
- Sysfs after - before > 0.
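These conditions can be folded into a single pass/fail verdict. This is a sketch under stated assumptions: npu_path_verdict is a hypothetical name, all values are assumed to be integers, and the caller passes the HTTP status, the response-level delta (or an empty string when the service does not report one), and the two sysfs readings.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints "healthy" and returns 0 only when every
# condition holds; otherwise prints the first failed condition and returns 1.
# Args: <http_status> <response_delta_us or ""> <sysfs_before> <sysfs_after>
npu_path_verdict() {
  local status=$1 resp_delta=$2 before=$3 after=$4
  if [ "$status" -lt 200 ] || [ "$status" -ge 300 ]; then
    echo "fail: http status $status"
    return 1
  fi
  # The response-level delta is checked only when the service reports it.
  if [ -n "$resp_delta" ] && [ "$resp_delta" -le 0 ]; then
    echo "fail: response npu_busy_delta_us=$resp_delta"
    return 1
  fi
  if [ $((after - before)) -le 0 ]; then
    echo "fail: sysfs delta $((after - before))"
    return 1
  fi
  echo "healthy"
}

# Usage: npu_path_verdict 200 "$resp_delta" "$before" "$after"
```

An HTTP 200 with a zero sysfs delta is reported as a failure, which matches the posture that HTTP success alone is not proof of NPU execution.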
Service-specific smoke checks
Whisper NPU (:18816)
curl -fsS http://127.0.0.1:18816/health | jq .
# For a real transcription smoke, use a small non-private WAV fixture only.
# Verify both response npu_busy_delta_us and sysfs busy-time delta.
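To capture the sysfs busy-time delta around any smoke request, a small wrapper may help. This is a sketch only: with_npu_delta and the NPU_BUSY_FILE override are hypothetical names, and the transcription request itself is left as a parameter because this runbook does not document the transcription route.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: runs the given command and reports how much the NPU
# busy-time counter moved while it ran. The delta goes to stderr so the
# command's stdout can still be piped to jq.
with_npu_delta() {
  # NPU_BUSY_FILE is an assumed override for testing; default is the sysfs path.
  local busy=${NPU_BUSY_FILE:-/sys/class/accel/accel0/device/npu_busy_time_us}
  local before after rc=0
  before=$(cat "$busy") || return 2
  "$@" || rc=$?
  after=$(cat "$busy") || return 2
  echo "sysfs_npu_busy_delta_us=$((after - before))" >&2
  return "$rc"
}

# Usage: with_npu_delta <your non-private smoke request command>
```

The wrapper returns the wrapped command's exit status, so a failed request is still visible even when the counter moved.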
Operational notes:
- Managed as Docker Compose service/container whisper-server-npu in ~/lab/swarm.
- Consistent with existing swarm service patterns because it is a containerized service with Compose health.
- Do not restart it from this runbook unless Will asked for remediation.
OpenVINO embeddings (:18817)
systemctl --user status openvino-embeddings.service --no-pager
curl -fsS http://127.0.0.1:18817/health | jq .
Operational notes:
- User systemd unit: openvino-embeddings.service.
- Model: bge-base-en-v1.5-int8-ov.
- Model directory: /home/will/.cache/openvino-models/bge-base-en-v1.5-int8-ov.
- Live RAG :18810 uses Chroma collection obsidian_bge_npu through this service. Do not reindex or replace this collection in place.
Reranker prototype (:18818)
Foreground review start only, after confirming the port is free:
ss -ltnp | grep ':18818\b' || true
cd ~/lab/swarm/openvino-reranker-npu
source /home/will/.venvs/openvino-reranker/bin/activate
OPENVINO_RERANKER_HOST=127.0.0.1 \
OPENVINO_RERANKER_PORT=18818 \
OPENVINO_RERANKER_DEVICE=NPU \
OPENVINO_RERANKER_MODEL_DIR=/home/will/.cache/openvino-models/rerankers/ms-marco-MiniLM-L6-v2-int8-ov \
python server.py
From another shell:
curl -fsS http://127.0.0.1:18818/readyz | jq .
python ~/lab/swarm/openvino-reranker-npu/smoke.py --url http://127.0.0.1:18818
Approval gate:
- May be installed as openvino-reranker.service only after foreground smoke and Will approval.
- May be integrated into RAG only behind disabled-by-default knobs such as RAG_RERANK_ENABLED=false; request-time reranking must not mutate Chroma.
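A disabled-by-default knob can be read so that anything other than an explicit opt-in leaves reranking off. The following is an illustrative sketch, not the RAG endpoint's actual configuration code: rerank_enabled is a hypothetical name and the set of accepted truthy spellings is an assumption.

```shell
#!/usr/bin/env bash
# Hypothetical gate: reranking is on only when RAG_RERANK_ENABLED is set to an
# explicit truthy value. Unset, empty, or unrecognised values all mean "off".
rerank_enabled() {
  case "${RAG_RERANK_ENABLED:-false}" in
    true|TRUE|1|yes|on) return 0 ;;
    *) return 1 ;;
  esac
}

if rerank_enabled; then
  echo "rerank: enabled"
else
  echo "rerank: disabled (default)"
fi
```

Treating unrecognised values as off means a typo in the environment cannot silently change live RAG answers.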
Router/classifier prototype (:18819)
Foreground review start only, after confirming the port is free:
ss -ltnp | grep ':18819\b' || true
cd ~/lab/swarm/openvino-classifier-npu
/home/will/.venvs/npu/bin/python router_classifier.py --host 127.0.0.1 --port 18819
Smoke:
curl -fsS http://127.0.0.1:18819/healthz | jq .
curl -fsS http://127.0.0.1:18819/v1/classify \
-H 'Content-Type: application/json' \
-d '{"id":"smoke","text":"Urgent: check whether port 18817 is listening and inspect systemd logs.","options":{"include_evidence":true,"dry_run":true}}' | jq .
Approval gate:
- May be installed as openvino-router-classifier.service only after Will approves live service enablement.
- Must remain dry-run and must not alter Hermes/Atlas routing, memory writes, safety confirmation flow, or outbound messages without a separate explicit approval.
Small GenAI NPU worker (:18820)
Foreground review start only, after confirming the port is free:
ss -ltnp | grep ':18820\b' || true
cd ~/lab/swarm/openvino-genai-npu-worker
/home/will/.venvs/npu/bin/python worker.py --host 127.0.0.1 --port 18820
Smoke:
curl -fsS http://127.0.0.1:18820/healthz | jq .
curl -fsS http://127.0.0.1:18820/models | jq .
curl -fsS http://127.0.0.1:18820/v1/worker/condense-notification \
-H 'Content-Type: application/json' \
-d '{"input":"Non-private smoke notification for local NPU worker.","max_new_tokens":64}' | jq .
Approval gate:
- May be installed as openvino-genai-npu-worker.service only after Will approves persistent service enablement.
- Must not become primary Atlas/Hermes model routing. Use only for bounded background jobs such as title, summary, notification condensation, and memory-candidate drafting.
Document/image triage prototype (:18829 optional review port)
Foreground review start only, after confirming the port is free:
ss -ltnp | grep ':18829\b' || true
cd ~/lab/swarm/openvino-doc-image-triage-npu
/home/will/.venvs/npu/bin/python server.py --host 127.0.0.1 --port 18829 --allowed-root "$PWD"
Smoke:
curl -fsS http://127.0.0.1:18829/healthz | jq .
curl -fsS http://127.0.0.1:18829/models | jq .
/home/will/.venvs/npu/bin/python tests/smoke_test.py
Approval gate:
- Do not point it at arbitrary directories; allowed roots must be equal to or under configured roots.
- Do not include raw OCR text or full source paths unless Will explicitly asks for a one-off response.
- v1 only uses the NPU through :18817 embeddings for needs-attention; image category classification and OCR are CPU/local fallbacks.
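The "equal to or under configured roots" rule can be checked before a path is ever passed as --allowed-root. This is a minimal sketch, assuming GNU coreutils realpath is available; path_under_root is a hypothetical helper, not part of the triage prototype.

```shell
#!/usr/bin/env bash
# Hypothetical check: succeeds only when <candidate> resolves to <root> itself
# or to a path beneath it. Both paths must exist; symlinks and ".." are
# resolved first so they cannot be used to escape the root.
# Returns 0 = inside, 1 = outside (or root is "/"), 2 = path does not resolve.
path_under_root() {
  local root candidate
  root=$(realpath -e -- "$1") || return 2
  candidate=$(realpath -e -- "$2") || return 2
  # Refuse "/" as a configured root; it would allow every directory.
  if [ "$root" = "/" ]; then
    return 1
  fi
  # The trailing slash stops /data/rootx from matching root /data/root.
  case "$candidate/" in
    "$root"/*) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage: path_under_root "$CONFIGURED_ROOT" "$PWD" && echo "ok to use as --allowed-root"
```

CONFIGURED_ROOT in the usage line is a placeholder; the runbook does not say where the prototype stores its configured roots.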
Systemd and Compose recommendations
Recommended management split:
- Keep containerized services in Docker Compose when they already have Docker build/runtime shape and Compose health (whisper-server-npu).
- Keep host-side OpenVINO Python prototypes as user systemd services when they depend on local venvs, sysfs NPU access, model caches, and localhost-only APIs (openvino-embeddings, optional reranker/classifier/GenAI worker).
- Do not add the prototypes to the live gateway or primary routing during installation. Installation and routing are separate approval gates.
User-systemd unit expectations for optional prototypes:
- WorkingDirectory points at the service directory under ~/lab/swarm/.
- ExecStart uses the existing venv path documented by the prototype.
- Environment pins host to 127.0.0.1, port, model path, device NPU, and any upstream endpoint.
- Restart=on-failure, not aggressive restart loops.
- Logs go to user journal; do not log raw request bodies.
- Start manually for smoke; enable on boot only after Will approval.
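A draft unit that follows these expectations can be rendered to stdout for review without installing anything. The sketch below is illustrative, not the approved unit: render_user_unit is a hypothetical helper, the reranker values are copied from the foreground command in this runbook, and the Description text, RestartSec=5, and the [Install] section are assumptions. The [Install] section only takes effect if the unit is later enabled, which remains behind Will's approval.

```shell
#!/usr/bin/env bash
# Hypothetical renderer: prints a draft user unit to stdout for review.
# It writes nothing to disk and does not call systemctl.
# Args: <description> <service_dir_under_lab_swarm> <exec_start> [KEY=VALUE ...]
render_user_unit() {
  local description=$1 service_dir=$2 exec_start=$3
  shift 3
  printf '[Unit]\nDescription=%s\n\n' "$description"
  # %h is the systemd specifier for the user's home directory.
  printf '[Service]\nWorkingDirectory=%%h/lab/swarm/%s\n' "$service_dir"
  local kv
  for kv in "$@"; do
    printf 'Environment=%s\n' "$kv"
  done
  printf 'ExecStart=%s\nRestart=on-failure\nRestartSec=5\n\n' "$exec_start"
  printf '[Install]\nWantedBy=default.target\n'
}

render_user_unit "OpenVINO NPU reranker (draft)" openvino-reranker-npu \
  "/home/will/.venvs/openvino-reranker/bin/python server.py" \
  OPENVINO_RERANKER_HOST=127.0.0.1 \
  OPENVINO_RERANKER_PORT=18818 \
  OPENVINO_RERANKER_DEVICE=NPU \
  OPENVINO_RERANKER_MODEL_DIR=/home/will/.cache/openvino-models/rerankers/ms-marco-MiniLM-L6-v2-int8-ov
```

Rendering to stdout keeps drafting separate from installation, so the unit text can be reviewed before any approval gate is crossed.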
Compose expectations for existing swarm services:
- Prefer cd ~/lab/swarm && make ps, make status, and targeted docker compose ps <service> for read-only checks.
- Do not run docker compose up -d, restart containers, pull images, or prune volumes from this runbook without approval.
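One way to make the read-only rule hard to violate by accident is a thin guard in front of docker compose. This is a sketch, not an existing swarm script: compose_ro is a hypothetical name and the allow-list is an assumption (config is deliberately left out because its output can include environment values).

```shell
#!/usr/bin/env bash
# Hypothetical guard: forwards only read-only Compose subcommands and refuses
# everything else (up, restart, pull, down, rm, ...).
compose_ro() {
  case "${1:-}" in
    ps|logs|images|top|version)
      docker compose "$@" ;;
    *)
      echo "refused: 'docker compose ${1:-}' is not on the read-only list" >&2
      return 64 ;;
  esac
}

# Usage: cd ~/lab/swarm && compose_ro ps whisper-server-npu
```

Anything refused by the guard still needs to go through the approval gates below before it is run by hand.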
Monitoring and logging notes
Minimum recurring monitoring should include:
- Listener presence for 18816, 18817, and any approved optional prototype ports.
- User service state for openvino-embeddings.service and any approved optional prototype unit.
- Docker Compose health for whisper-server-npu.
- HTTP health endpoint success.
- Positive sysfs NPU busy-time delta on at least one non-private inference probe, preferably embeddings :18817 because it is already live and central.
- Journal/container logs only at summary level. Avoid raw prompts, raw OCR text, private document names, credentials, and API keys.
Useful log commands:
journalctl --user -u openvino-embeddings.service -n 100 --no-pager
journalctl --user -u rag-embedding-health.service -n 100 --no-pager
journalctl --user -u openvino-reranker.service -n 100 --no-pager
journalctl --user -u openvino-router-classifier.service -n 100 --no-pager
journalctl --user -u openvino-genai-npu-worker.service -n 100 --no-pager
cd ~/lab/swarm && docker compose logs --tail 100 whisper-server-npu
Approval gates
Requires explicit Will approval before proceeding:
- Installing, enabling, or autostarting openvino-reranker.service, openvino-router-classifier.service, or openvino-genai-npu-worker.service.
- Assigning a final persistent port to document/image triage or enabling it as a persistent service.
- Enabling live RAG reranking or any request path that changes Atlas/RAG answers.
- Changing primary Atlas/Hermes routing or connecting router/classifier outputs to live decisions.
- Connecting the GenAI worker to primary Atlas chat, gateway routing, memory writes, or outbound notifications.
- Restarting the live Atlas/Hermes gateway.
- Deleting, overwriting, or in-place reindexing existing vector collections.
- Broadening bind addresses or exposure beyond local-only defaults.
Approved/parked outcomes:
- Built/approved prototypes: reranker (:18818), router/classifier (:18819), small GenAI worker (:18820), document/image triage (review ports :18828/:18829).
- Live baseline retained: Whisper NPU (:18816), OpenVINO embeddings (:18817), RAG endpoint (:18810) using obsidian_bge_npu.
- Parked: always-on wake-word/audio and conventional vision detection until Will wants a concrete use case.
- Rejected for this NPU program: diffusion/image generation.