docs(npu): update integrated health runbooks

William Valentin
2026-06-05 15:52:51 -07:00
parent 9e5ffa0fd0
commit 08fb9ca686
2 changed files with 323 additions and 134 deletions
+201
@@ -0,0 +1,201 @@
# NPU integrated health checks — operator runbook notes
Compact, read-only operator workflow that combines the existing
`scripts/npu-service-health.sh` listener/systemd/embedding-proof probe with the
reviewer-approved `scripts/npu-utilization-digest.py` per-service utilization
and fallback report. Together they form a single safe daily / on-demand NPU
health pass.
Scope:
- Read-only against live services. No restarts, route changes, vector mutation,
advisory POSTs, outbound sends, or memory writes.
- No new persistent services, timers, sockets, compose services, or Dockerfiles
are introduced by this integration. Both scripts are foreground / on-demand.
- Binds are verified to be local-only or on the approved Docker bridge
  (`172.19.0.1:18830`). Pre-existing broader binds on the live baseline ports
  (`18810`, `18814`, `18816`, `18817`) are noted in the runbook and unchanged here.
- NPU proof requires real inference plus a positive
`/sys/class/accel/accel0/device/npu_busy_time_us` delta. HTTP 200 alone is
not sufficient.
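
The proof rule above (real inference plus a positive counter delta) can be sketched as follows. This is a minimal illustration, not the code in either script: `read_busy_us`, `npu_proof`, and the `run_inference` callable are hypothetical names, and only the counter path comes from this runbook.

```python
from pathlib import Path
from typing import Callable

# Counter path taken from the runbook; everything else here is illustrative.
COUNTER = Path("/sys/class/accel/accel0/device/npu_busy_time_us")


def read_busy_us(path: Path = COUNTER) -> int:
    """Return the cumulative NPU busy time in microseconds."""
    return int(path.read_text().strip())


def npu_proof(run_inference: Callable[[], bool], path: Path = COUNTER) -> dict:
    """Run one inference call and report whether the counter advanced.

    `run_inference` is a hypothetical callable that returns True when the
    request itself succeeded (for example, HTTP 200 with a valid body).
    """
    before = read_busy_us(path)
    request_ok = run_inference()
    after = read_busy_us(path)
    delta = after - before
    # A successful request alone is not proof; the counter must also advance.
    return {
        "request_ok": request_ok,
        "delta_us": delta,
        "proof": request_ok and delta > 0,
    }
```

Because the counter is shared by every NPU client, a delta measured this way is only attributable to the probe when no other inference traffic runs in the same window.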
## When to run
- Daily / on-demand ops check.
- After upgrades that touch the NPU stack, OpenVINO, or any of the live
specialists.
- Before any approval-gated change that depends on the NPU reflex layer.
- As the read-only verification step of a deploy or recovery runbook.
## Required artifacts on the branch
| Path | Role |
| --- | --- |
| `scripts/npu-service-health.sh` | Listener / systemd / Docker / health endpoint / single embedding proof. Existing baseline script. |
| `scripts/npu-utilization-digest.py` | Per-service utilization digest with NPU proof per probe, compact text or JSONL output, optional JSONL artifact. |
| `docs/npu-utilization-digest.md` | Per-service digest reference. |
| `tests/test_npu_utilization_digest.py` | Offline unit tests for the digest (no live services required). |
## Integrated workflow
### Step 1 — Listener and service-state snapshot
```bash
cd ~/lab/swarm
./scripts/npu-service-health.sh
```
What it verifies, in order:
1. `npu_busy_time_us` counter is readable.
2. Required listeners are present on `18810 / 18814 / 18816 / 18817 / 18818 /
18819 / 18820 / 18829 / 18830`.
3. User systemd services are active/enabled for embeddings, RAG health,
reranker, router/classifier, and the small GenAI worker.
4. Docker Compose `whisper-server-npu` is up.
5. Health endpoints return JSON for the live baseline and local specialists.
6. A single non-private embeddings request to `:18817` produces a positive
sysfs `npu_busy_time_us` delta; the script exits nonzero if there is no
positive delta.
Read the last block (`== Embeddings NPU busy-time proof ==`) first. If
`result=ok` and `sysfs_delta_us > 0`, the central NPU path is healthy. If not,
do not run the digest; triage the embeddings service first.
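
The go/no-go check on the last block can be automated along these lines. The sketch assumes the block prints whitespace-separated `key=value` tokens after the header; only the header text, `result=ok`, and `sysfs_delta_us` appear in this runbook, so the exact layout is an assumption, and `embeddings_proof_ok` is a hypothetical helper.

```python
import re

PROOF_HEADER = "== Embeddings NPU busy-time proof =="


def embeddings_proof_ok(health_output: str) -> bool:
    """Return True when the final proof block shows result=ok and a positive delta."""
    # Keep only the text after the last occurrence of the proof header.
    _, found, block = health_output.rpartition(PROOF_HEADER)
    if not found:
        return False
    # Assumed layout: whitespace-separated key=value tokens.
    fields = dict(re.findall(r"(\w+)=(\S+)", block))
    try:
        delta = int(fields.get("sysfs_delta_us", "0"))
    except ValueError:
        return False
    return fields.get("result") == "ok" and delta > 0
```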
### Step 2 — Per-service utilization digest
```bash
scripts/npu-utilization-digest.py --no-write --include-genai-smoke false --format text
```
Compact output shape:
```text
NPU utilization digest <timestamp>
counter=/sys/class/accel/accel0/device/npu_busy_time_us delta_us=<total>
services_ok=<ok>/<total> proof_ok=<ok>/<proof-capable> fallbacks=<n> gates_closed=<n>
- embeddings: ok=true calls=1 avg_ms=... npu_delta_us=... proof=true mode=NPU
- rerank: ok=true calls=1 docs=2 avg_ms=... npu_delta_us=... proof=true mode=NPU
- whisper: ok=true calls=1 jobs=1 avg_ms=... npu_delta_us=... proof=true mode=NPU
- classifier: ok=true calls=1 events=1 avg_ms=... npu_delta_us=... proof=true dry_run=true ...
- genai: ok=true jobs=0 loaded=false mode=loaded=false reason=skipped_cold_load
- doc_triage: ok=true calls=1 files=1 avg_ms=... npu_delta_us=... proof=true gate=closed:private-root
- rag_endpoint: ok=true mode=health_only gate=closed:vector-mutation
- rag_health: ok=true mode=health_only
- advisory_gateway: ok=true mode=health_only gate=closed:advisory-post
fallbacks: skipped_cold_load=1
```
Read order for ops:
1. `services_ok` field — anything below `9/9` means a service is down or unhealthy.
2. `proof_ok` field — `proof_ok=5/5` means every probe that ran a real
   inference request produced a positive sysfs NPU delta.
3. `fallbacks:` line — `skipped_cold_load=1` is expected (GenAI worker is
intentionally not cold-loaded). Any other fallback label is a triage signal.
4. `gate=` labels — approval gates that stay closed by design.
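
The read order above can be applied mechanically to the compact text output. The sketch below assumes the summary and `fallbacks:` lines look exactly like the sample shown earlier and that fallback labels contain only word characters; `review_digest` and `EXPECTED_FALLBACKS` are hypothetical names, not part of the digest script.

```python
import re

# Per the runbook, only this fallback label is expected by default.
EXPECTED_FALLBACKS = {"skipped_cold_load"}


def review_digest(text: str) -> list[str]:
    """Return triage findings for a compact text digest; empty means clean."""
    findings = []
    # Steps 1 and 2: both ratios should read ok == total.
    for key in ("services_ok", "proof_ok"):
        match = re.search(rf"\b{key}=(\d+)/(\d+)", text)
        if match is None:
            findings.append(f"{key}: missing")
        elif int(match.group(1)) < int(match.group(2)):
            findings.append(f"{key}: {match.group(1)}/{match.group(2)}")
    # Step 3: any fallback label other than the expected one is a signal.
    line = re.search(r"^fallbacks:\s*(.*)$", text, re.MULTILINE)
    labels = dict(re.findall(r"(\w+)=(\d+)", line.group(1))) if line else {}
    for label in sorted(set(labels) - EXPECTED_FALLBACKS):
        findings.append(f"fallback: {label}={labels[label]}")
    return findings
```

An empty list means the first three read-order checks passed; the `gate=` labels still need a visual check against the list of gates that are meant to stay closed.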
### Step 3 — Optional artifact for trend tracking
```bash
scripts/npu-utilization-digest.py --format jsonl
```
Writes one JSONL file per digest to
`/home/will/.local/state/npu-utilization/digests/<timestamp>.jsonl`. The first
line is the summary; subsequent lines are per-service rows. Nothing is written
when `--no-write` is set.
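
For trend tracking, an artifact with the layout described above (summary first, then per-service rows) can be loaded as in this sketch. `load_digest` is a hypothetical helper, and no field names are assumed because the runbook does not list the JSONL schema.

```python
import json
from pathlib import Path


def load_digest(path: Path) -> tuple[dict, list[dict]]:
    """Split a digest artifact into its summary and per-service rows."""
    records = [
        json.loads(line)
        for line in path.read_text().splitlines()
        if line.strip()  # tolerate a trailing blank line
    ]
    if not records:
        raise ValueError(f"{path} contains no JSONL records")
    # First record is the summary; the rest are per-service rows.
    return records[0], records[1:]
```

Calling `load_digest` on each file in the digests directory, sorted by name, gives a time series of summaries, assuming the `<timestamp>` in the file name sorts chronologically.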
### Step 4 — Offline unit tests
```bash
python -m pytest tests/test_npu_utilization_digest.py -q
```
Does not require live services. Use to validate digest logic after edits or
before merging.
## Compact proof interpretation
For each proof-capable service, both the response-level `npu_busy_delta_us`
(when the service reports it) and the script's own sysfs before/after delta
must agree and be `> 0`. The proof is only valid when an actual inference
request ran. If a probe was skipped (`reason=skipped_cold_load` or
`reason=smoke_disabled`), `proof_ok` for that row is `None` and the row
contributes a labeled fallback instead of a proof failure.
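
The three possible outcomes for a row can be written as a small decision function. This is one reading of the rule above, assuming "agree" means both deltas are positive and that a service which reports no delta is judged on the sysfs delta alone; `proof_state` is a hypothetical name and the digest script may implement this differently.

```python
from typing import Optional

# Skip reasons named in the runbook.
SKIP_REASONS = {"skipped_cold_load", "smoke_disabled"}


def proof_state(
    sysfs_delta_us: int,
    response_delta_us: Optional[int] = None,
    reason: Optional[str] = None,
) -> Optional[bool]:
    """Return True/False for a probe that ran, or None for a skipped probe."""
    if reason in SKIP_REASONS:
        return None  # labeled fallback, not a proof failure
    if sysfs_delta_us <= 0:
        return False
    # Services that do not report their own delta are judged on sysfs alone.
    if response_delta_us is None:
        return True
    return response_delta_us > 0
```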
Proof currently runs on:
- `embeddings` (`:18817`)
- `rerank` (`:18818`)
- `whisper` (`:18816`) when `--include-whisper-smoke=true` (default)
- `classifier` (`:18819`)
- `doc_triage` (`:18829`) when `--include-doc-triage-smoke=true` (default);
proof is via the embeddings service, not directly on the NPU device, so the
row reports `mode=NPU-via-embedding-service`.
Intentionally health-only (no proof row):
- `rag_endpoint` (`:18810`) — closed:vector-mutation
- `rag_health` (`:18814`)
- `advisory_gateway` (`172.19.0.1:18830`) — closed:advisory-post
Intentionally skipped by default:
- `genai` (`:18820`) — `loaded=false` until first use. Cold-loading the model
  only to prove NPU use has a real cost, so the skip is treated as a labeled
  fallback rather than a proof failure. Opt in with
  `--include-genai-smoke=true` only when the task actually needs a generation
  smoke.
## Exit codes and triage gates
`scripts/npu-service-health.sh`:
| Exit | Meaning | Next |
| ---: | --- | --- |
| 0 | All checks passed including embeddings proof. | Continue to digest. |
| 2 | `npu_busy_time_us` not readable. | Check kernel/driver; do not run digest. |
| 3 | Embedding request failed. | Triage `openvino-embeddings.service` and port `:18817`. |
| 4 | Embedding request succeeded but sysfs delta `<= 0`. | Service reachable but not on the NPU; check service logs and device bind. |
`scripts/npu-utilization-digest.py`:
| Exit | Meaning | Next |
| ---: | --- | --- |
| 0 | All reachable services handled; proof/fallback accounting completed. | Inspect `proof_ok` and `fallbacks:` for any unexpected labels. |
| 2 | `--strict-proof` was set and at least one proof-required probe ran without a positive sysfs delta. | Triage the named service's NPU path. |
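
The two tables can be combined into a gated wrapper: run the health script, and run the digest only on exit 0. The sketch below is illustrative; `next_step` and `run_health_then_digest` are hypothetical names, the commands and exit codes are the ones listed in this runbook, and any other exit code is treated as undocumented.

```python
import subprocess

# Next actions copied from the health-script table above.
HEALTH_NEXT = {
    0: "continue to digest",
    2: "check kernel/driver; do not run digest",
    3: "triage openvino-embeddings.service and port :18817",
    4: "check service logs and device bind",
}


def next_step(exit_code: int) -> str:
    """Map a health-script exit code to the runbook's next action."""
    return HEALTH_NEXT.get(
        exit_code, f"undocumented exit code {exit_code}; stop and investigate"
    )


def run_health_then_digest() -> int:
    """Run the digest only when the health script exits 0."""
    health = subprocess.run(["./scripts/npu-service-health.sh"])
    print(f"health exit={health.returncode}: {next_step(health.returncode)}")
    if health.returncode != 0:
        return health.returncode
    digest = subprocess.run(
        ["scripts/npu-utilization-digest.py", "--no-write",
         "--include-genai-smoke", "false", "--format", "text"]
    )
    return digest.returncode
```

Call `run_health_then_digest()` from the repository root (`~/lab/swarm`), since both commands use relative paths.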
## Approval gates left closed
The integrated workflow intentionally does not:
- start, stop, restart, enable, or disable any user systemd unit or Docker
Compose service;
- write to or mutate the Chroma collection `obsidian_bge_npu` or any other
vector store;
- change Atlas/Hermes routing or model defaults;
- post classification/generation/triage events to the advisory gateway;
- broaden private document, image, or audio roots;
- bind any new listener, including on `0.0.0.0`;
- write memory, send messages, execute tools, or mutate Kanban state.
These remain approval-gated and are tracked on the `npu-maximization` board.
## Quick reference
```bash
# Single-pass NPU health check (listener + systemd + embeddings proof).
cd ~/lab/swarm && ./scripts/npu-service-health.sh
# Compact digest with per-service proof and fallback accounting.
scripts/npu-utilization-digest.py --no-write --include-genai-smoke false --format text
# Same, with a JSONL artifact for trend tracking.
scripts/npu-utilization-digest.py --format jsonl
# Strict mode for CI / pre-merge.
scripts/npu-utilization-digest.py --no-write --strict-proof
# Offline digest logic tests.
python -m pytest tests/test_npu_utilization_digest.py -q
```