# NPU integrated health checks — operator runbook notes

Compact, read-only operator workflow that combines the existing `scripts/npu-service-health.sh` listener/systemd/embedding-proof probe with the reviewer-approved `scripts/npu-utilization-digest.py` per-service utilization and fallback report. Together they form a single safe daily or on-demand NPU health pass.

Scope:

- Read-only against live services. No restarts, route changes, vector mutation, advisory POSTs, outbound sends, or memory writes.
- This integration introduces no new persistent services, timers, sockets, Compose services, or Dockerfiles. Both scripts run in the foreground, on demand.
- Binds are verified as local-only or on the approved Docker bridge (`172.19.0.1:18830`). Pre-existing broader binds on the live baseline ports (`18810`, `18814`, `18816`, `18817`) are noted in the runbook and unchanged here.
- NPU proof requires real inference plus a positive `/sys/class/accel/accel0/device/npu_busy_time_us` delta. HTTP 200 alone is not sufficient.

## When to run

- Daily or on-demand ops check.
- After upgrades that touch the NPU stack, OpenVINO, or any of the live specialists.
- Before any approval-gated change that depends on the NPU reflex layer.
- As the read-only verification step of a deploy or recovery runbook.

## Required artifacts on the branch

| Path | Role |
| --- | --- |
| `scripts/npu-service-health.sh` | Listener, systemd, Docker, and health-endpoint checks plus a single embedding proof. Existing baseline script. |
| `scripts/npu-utilization-digest.py` | Per-service utilization digest with an NPU proof per probe; compact text or JSONL output; optional JSONL artifact. |
| `docs/npu-utilization-digest.md` | Per-service digest reference. |
| `docs/npu-advisory-observability-runbook.md` | Dry-run comparison and later promotion criteria for advisory lanes. |
| `tests/test_npu_utilization_digest.py` | Offline unit tests for the digest (no live services required). |

## Integrated workflow

### Step 1 — Listener and service-state snapshot

```bash
cd ~/lab/swarm
./scripts/npu-service-health.sh
```

What it verifies, in order:

1. The `npu_busy_time_us` counter is readable.
2. Required listeners are present on `18810 / 18814 / 18816 / 18817 / 18818 / 18819 / 18820 / 18829 / 18830`.
3. User systemd services are active and enabled for embeddings, RAG health, reranker, router/classifier, and the small GenAI worker.
4. The Docker Compose service `whisper-server-npu` is up.
5. Health endpoints return JSON for the live baseline and local specialists.
6. A single non-private embeddings request to `:18817` produces a positive sysfs `npu_busy_time_us` delta; the script exits nonzero if the delta is not positive.

Read the last block (`== Embeddings NPU busy-time proof ==`) first. If it shows `result=ok` and `sysfs_delta_us > 0`, the central NPU path is healthy. If not, do not run the digest; triage the embeddings service first.

### Step 2 — Per-service utilization digest

```bash
scripts/npu-utilization-digest.py --no-write --include-genai-smoke false --format text
```

Compact output shape:

```text
NPU utilization digest
counter=/sys/class/accel/accel0/device/npu_busy_time_us
delta_us= services_ok=/ proof_ok=/ fallbacks= gates_closed=
- embeddings: ok=true calls=1 avg_ms=... npu_delta_us=... proof=true mode=NPU
- rerank: ok=true calls=1 docs=2 avg_ms=... npu_delta_us=... proof=true mode=NPU
- whisper: ok=true calls=1 jobs=1 avg_ms=... npu_delta_us=... proof=true mode=NPU
- classifier: ok=true calls=1 events=1 avg_ms=... npu_delta_us=... proof=true dry_run=true ...
- genai: ok=true jobs=0 loaded=false mode=loaded=false reason=skipped_cold_load
- doc_triage: ok=true calls=1 files=1 avg_ms=... npu_delta_us=... proof=true gate=closed:private-root
- rag_endpoint: ok=true mode=health_only gate=closed:vector-mutation
- rag_health: ok=true mode=health_only
- advisory_gateway: ok=true mode=health_only gate=closed:advisory-post
fallbacks: skipped_cold_load=1
```

Read order for ops:

1. `services_ok`: anything below `9/9` means a service is down or unhealthy.
2. `proof_ok`: `5/5` means every probe that ran a real inference request produced a positive sysfs NPU delta.
3. `fallbacks:` line: `skipped_cold_load=1` is expected (the GenAI worker is intentionally not cold-loaded). Any other fallback label is a triage signal.
4. `gate=` labels: gates that are closed by design and stay closed.

### Step 3 — Optional artifact for trend tracking

```bash
scripts/npu-utilization-digest.py --format jsonl
```

Writes one JSONL file per digest under `/home/will/.local/state/npu-utilization/digests/.jsonl`: the first line is the summary and each subsequent line is a per-service row. Nothing is written when `--no-write` is passed.

### Step 4 — Offline unit tests

```bash
python -m pytest tests/test_npu_utilization_digest.py -q
```

Does not require live services. Use it to validate digest logic after edits or before merging.

## Compact proof interpretation

For each proof-capable service, the response-level `npu_busy_delta_us` (when the service reports it) and the script's own sysfs before/after delta must agree and both be `> 0`. A proof is only valid when an actual inference request ran. If a probe was skipped (`reason=skipped_cold_load` or `reason=smoke_disabled`), `proof_ok` for that row is `None` and the row contributes a labeled fallback instead of a proof failure.
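The per-row rule above can be sketched as a small Python function. This is a minimal sketch, not the digest's implementation: `ProbeResult`, `evaluate_proof`, and `read_busy_time_us` are hypothetical names, and "agree" is read here as "both deltas positive"; the real script may compare the two values more strictly. In a live probe, `read_busy_time_us()` would be called immediately before and after the inference request.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Optional

# Counter path documented in this runbook.
NPU_BUSY_COUNTER = Path("/sys/class/accel/accel0/device/npu_busy_time_us")


def read_busy_time_us(path: Path = NPU_BUSY_COUNTER) -> int:
    """Read the cumulative NPU busy-time counter, in microseconds."""
    return int(path.read_text().strip())


@dataclass
class ProbeResult:
    """One digest probe (hypothetical shape, for illustration only)."""
    service: str
    ran_inference: bool
    sysfs_before_us: Optional[int] = None
    sysfs_after_us: Optional[int] = None
    response_delta_us: Optional[int] = None  # set only if the service reports it
    skip_reason: Optional[str] = None


def evaluate_proof(probe: ProbeResult) -> tuple[Optional[bool], Optional[str]]:
    """Return (proof_ok, fallback_label) for one probe row."""
    if not probe.ran_inference:
        # No real inference ran: no proof verdict, only a labeled fallback.
        return None, probe.skip_reason or "skipped"
    if probe.sysfs_before_us is None or probe.sysfs_after_us is None:
        return False, None  # cannot prove anything without both counter reads
    if probe.sysfs_after_us - probe.sysfs_before_us <= 0:
        return False, None  # a successful HTTP response alone is not proof
    if probe.response_delta_us is not None and probe.response_delta_us <= 0:
        return False, None  # service-reported delta disagrees with sysfs
    return True, None


# Usage with made-up counter values:
print(evaluate_proof(ProbeResult("embeddings", True, 1000, 1850, 850)))
# → (True, None)
print(evaluate_proof(ProbeResult("genai", False, skip_reason="skipped_cold_load")))
# → (None, 'skipped_cold_load')
```

Returning `None` rather than `False` for skipped probes keeps skips out of the proof-failure count, which is why the GenAI row shows up under `fallbacks:` instead of lowering `proof_ok`.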
Proof currently runs on:

- `embeddings` (`:18817`)
- `rerank` (`:18818`)
- `whisper` (`:18816`) when `--include-whisper-smoke=true` (the default)
- `classifier` (`:18819`)
- `doc_triage` (`:18829`) when `--include-doc-triage-smoke=true` (the default). Its proof goes through the embeddings service rather than directly to the NPU device, so the row reports `mode=NPU-via-embedding-service`.

Intentionally health-only (no NPU proof):

- `rag_endpoint` (`:18810`): `closed:vector-mutation`
- `rag_health` (`:18814`)
- `advisory_gateway` (`172.19.0.1:18830`): `closed:advisory-post`

Intentionally skipped by default:

- `genai` (`:18820`): reports `loaded=false` until first use. Cold-loading the model solely to produce an NPU proof is not free, so the skip is treated as a labeled fallback rather than a proof failure. Opt in with `--include-genai-smoke=true` only when the task actually needs a generation smoke test.

## Exit codes and triage gates

`scripts/npu-service-health.sh`:

| Exit | Meaning | Next |
| ---: | --- | --- |
| 0 | All checks passed, including the embeddings proof. | Continue to the digest. |
| 2 | `npu_busy_time_us` not readable. | Check kernel/driver; do not run the digest. |
| 3 | Embedding request failed. | Triage `openvino-embeddings.service` and port `:18817`. |
| 4 | Embedding request succeeded but sysfs delta `<= 0`. | Service is reachable but not executing on the NPU; check service logs and device bind. |

`scripts/npu-utilization-digest.py`:

| Exit | Meaning | Next |
| ---: | --- | --- |
| 0 | All reachable services handled; proof/fallback accounting completed. | Inspect `proof_ok` and `fallbacks:` for any unexpected labels. |
| 2 | `--strict-proof` was set and at least one proof-required probe ran without a positive sysfs delta. | Triage the named service's NPU path. |

## Approval gates left closed

The integrated workflow intentionally does not:

- start, stop, restart, enable, or disable any user systemd unit or Docker Compose service;
- write to or mutate the Chroma collection `obsidian_bge_npu` or any other vector store;
- change Atlas/Hermes routing or model defaults;
- post classification, generation, or triage events to the advisory gateway;
- broaden private document, image, or audio roots;
- bind any new listener, including on `0.0.0.0`;
- write memory, send messages, execute tools, or mutate Kanban state.

These remain approval-gated and are tracked on the `npu-maximization` board.

For advisory-lane promotion decisions, pair this live utilization pass with the fixture-only dry-run comparison in `docs/npu-advisory-observability-runbook.md`. The digest can show whether live NPU services are healthy enough to collect evidence; it does not grant advisory outputs any authority. Promotion remains a separate, lane-specific approval with explicit scope and rollback.

## Quick reference

```bash
# Single-pass NPU health check (listener + systemd + embeddings proof).
cd ~/lab/swarm && ./scripts/npu-service-health.sh

# Compact digest with per-service proof and fallback accounting.
scripts/npu-utilization-digest.py --no-write --include-genai-smoke false --format text

# JSONL output, written as an artifact for trend tracking.
scripts/npu-utilization-digest.py --format jsonl

# Strict mode for CI / pre-merge.
scripts/npu-utilization-digest.py --no-write --strict-proof

# Offline digest logic tests.
python -m pytest tests/test_npu_utilization_digest.py -q
```
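The Step 2 read order and the health-script exit table can be folded into a small check for a wrapper or CI log. This is a minimal sketch, assuming the compact text output carries its counters as `key=N/M` tokens and a `fallbacks:` line, as in the sample shape above; `triage_digest`, `EXPECTED_FALLBACKS`, and `HEALTH_EXIT_NEXT_STEP` are hypothetical names, not part of either script.

```python
import re

# Only fallback label this runbook calls expected on a default run.
EXPECTED_FALLBACKS = {"skipped_cold_load"}

# Next steps for scripts/npu-service-health.sh exit codes, per the table above.
HEALTH_EXIT_NEXT_STEP = {
    0: "Continue to the digest.",
    2: "Counter not readable: check kernel/driver; do not run the digest.",
    3: "Embedding request failed: triage openvino-embeddings.service and :18817.",
    4: "Sysfs delta <= 0: check service logs and device bind.",
}


def _ratio(text: str, key: str) -> tuple[int, int]:
    """Extract an N/M counter such as services_ok=9/9 anywhere in the text."""
    # Assumption: counters appear as key=N/M tokens, as in the Step 2 sample.
    match = re.search(rf"\b{key}=(\d+)/(\d+)", text)
    if match is None:
        raise ValueError(f"{key}=N/M not found in digest output")
    return int(match.group(1)), int(match.group(2))


def triage_digest(text: str, expected_services: int = 9) -> list[str]:
    """Apply the Step 2 read order; an empty list means nothing to triage."""
    signals = []
    ok, total = _ratio(text, "services_ok")
    if ok < expected_services:
        signals.append(f"services_ok={ok}/{total}: a service is down or unhealthy")
    ok, total = _ratio(text, "proof_ok")
    if ok < total:
        signals.append(f"proof_ok={ok}/{total}: a probe ran without a positive NPU delta")
    # The trailing "fallbacks:" line lists label=count pairs.
    line = re.search(r"^fallbacks:(.*)$", text, re.MULTILINE)
    for label, _count in re.findall(r"([\w:-]+)=(\d+)", line.group(1) if line else ""):
        if label not in EXPECTED_FALLBACKS:
            signals.append(f"unexpected fallback: {label}")
    return signals
```

Feed `triage_digest()` the stdout of the Step 2 command (for example, `subprocess.run(cmd, capture_output=True, text=True).stdout`), and only after the health script has exited 0. Matching `key=value` tokens anywhere in the text, rather than fixed line positions, keeps the check independent of the exact summary layout.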