swarm-master/docs/npu-integrated-health-ops.md
William Valentin 22e6ee90d2 docs(npu): document advisory observability gates
Add operator runbook and link integrated health docs for advisory-only observability, dry-run metrics, and future promotion criteria.
2026-06-06 15:30:31 -07:00

NPU integrated health checks — operator runbook notes

Compact, read-only operator workflow that combines the existing scripts/npu-service-health.sh listener/systemd/embedding-proof probe with the reviewer-approved scripts/npu-utilization-digest.py per-service utilization and fallback report. Together they form a single safe daily / on-demand NPU health pass.

Scope:

  • Read-only against live services. No restarts, route changes, vector mutation, advisory POSTs, outbound sends, or memory writes.
  • No new persistent services, timers, sockets, compose services, or Dockerfiles are introduced by this integration. Both scripts are foreground / on-demand.
  • Binds are verified to be local-only or on the approved Docker bridge (172.19.0.1:18830). Pre-existing broader binds on the live baseline ports (18810, 18814, 18816, 18817) are noted in the runbook and left unchanged here.
  • NPU proof requires real inference plus a positive /sys/class/accel/accel0/device/npu_busy_time_us delta. HTTP 200 alone is not sufficient.
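
The proof rule in the last bullet can be sketched as a before/after read of the counter around one inference call. This is a minimal illustration, not the logic of either script: `read_busy_us`, `npu_proof`, and the `run_inference` callable are hypothetical names, and only the sysfs path comes from this document.

```python
from pathlib import Path
from typing import Callable

COUNTER = Path("/sys/class/accel/accel0/device/npu_busy_time_us")


def read_busy_us(counter: Path = COUNTER) -> int:
    """Read the cumulative NPU busy time in microseconds."""
    return int(counter.read_text().strip())


def npu_proof(run_inference: Callable[[], bool], counter: Path = COUNTER) -> dict:
    """Run one inference call and report whether the counter advanced.

    run_inference is a hypothetical callable that returns True when the
    request itself succeeded (for example, HTTP 200 with a valid body).
    """
    before = read_busy_us(counter)
    request_ok = run_inference()
    delta = read_busy_us(counter) - before
    # A successful request alone is not proof; the counter must also advance.
    return {
        "request_ok": request_ok,
        "sysfs_delta_us": delta,
        "proof": request_ok and delta > 0,
    }
```

Passing the counter path as a parameter keeps the check testable against an ordinary file on a machine without an NPU.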

When to run

  • Daily / on-demand ops check.
  • After upgrades that touch the NPU stack, OpenVINO, or any of the live specialists.
  • Before any approval-gated change that depends on the NPU reflex layer.
  • As the read-only verification step of a deploy or recovery runbook.

Required artifacts on the branch

| Path | Role |
|---|---|
| scripts/npu-service-health.sh | Listener / systemd / Docker / health endpoint / single embedding proof. Existing baseline script. |
| scripts/npu-utilization-digest.py | Per-service utilization digest with NPU proof per probe, compact text or JSONL output, optional JSONL artifact. |
| docs/npu-utilization-digest.md | Per-service digest reference. |
| docs/npu-advisory-observability-runbook.md | Dry-run comparison and later promotion criteria for advisory lanes. |
| tests/test_npu_utilization_digest.py | Offline unit tests for the digest (no live services required). |

Integrated workflow

Step 1 — Listener and service-state snapshot

cd ~/lab/swarm
./scripts/npu-service-health.sh

What it verifies, in order:

  1. npu_busy_time_us counter is readable.
  2. Required listeners are present on 18810 / 18814 / 18816 / 18817 / 18818 / 18819 / 18820 / 18829 / 18830.
  3. User systemd services are active/enabled for embeddings, RAG health, reranker, router/classifier, and the small GenAI worker.
  4. Docker Compose whisper-server-npu is up.
  5. Health endpoints return JSON for the live baseline and local specialists.
  6. A single non-private embeddings request to :18817 produces a positive sysfs npu_busy_time_us delta; the script exits nonzero if there is no positive delta.

Read the last block (== Embeddings NPU busy-time proof ==) first. If result=ok and sysfs_delta_us > 0, the central NPU path is healthy. If not, do not run the digest; triage the embeddings service first.
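
The gate described above can be automated against captured script output. The sketch below assumes the proof block prints whitespace-separated key=value tokens after its header; that layout is an assumption, since only the header text and the `result` and `sysfs_delta_us` names appear in this document. `embeddings_proof_ok` is a hypothetical helper.

```python
import re

PROOF_HEADER = "== Embeddings NPU busy-time proof =="


def embeddings_proof_ok(health_output: str) -> bool:
    """Return True only if the final proof block reports result=ok and a positive delta."""
    # Keep only the text after the last occurrence of the proof header.
    _, found, block = health_output.rpartition(PROOF_HEADER)
    if not found:
        return False
    # Assumed layout: whitespace-separated key=value tokens.
    fields = dict(re.findall(r"(\w+)=(\S+)", block))
    try:
        delta = int(fields.get("sysfs_delta_us", "0"))
    except ValueError:
        return False
    return fields.get("result") == "ok" and delta > 0
```

The script's own exit code remains the authoritative signal; a parser like this is only useful when reviewing saved output after the fact.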

Step 2 — Per-service utilization digest

scripts/npu-utilization-digest.py --no-write --include-genai-smoke false --format text

Compact output shape:

NPU utilization digest <timestamp>
counter=/sys/class/accel/accel0/device/npu_busy_time_us delta_us=<total>
services_ok=<ok>/<total> proof_ok=<ok>/<proof-capable> fallbacks=<n> gates_closed=<n>
- embeddings: ok=true calls=1 avg_ms=... npu_delta_us=... proof=true mode=NPU
- rerank:     ok=true calls=1 docs=2   avg_ms=... npu_delta_us=... proof=true mode=NPU
- whisper:    ok=true calls=1 jobs=1   avg_ms=... npu_delta_us=... proof=true mode=NPU
- classifier: ok=true calls=1 events=1 avg_ms=... npu_delta_us=... proof=true dry_run=true ...
- genai:      ok=true jobs=0 loaded=false mode=loaded=false reason=skipped_cold_load
- doc_triage: ok=true calls=1 files=1  avg_ms=... npu_delta_us=... proof=true gate=closed:private-root
- rag_endpoint:   ok=true mode=health_only gate=closed:vector-mutation
- rag_health:     ok=true mode=health_only
- advisory_gateway: ok=true mode=health_only gate=closed:advisory-post
fallbacks: skipped_cold_load=1

Read order for ops:

  1. services_ok row — anything below 9/9 means a service is down or unhealthy.
  2. proof_ok row — proof_ok=5/5 means every probe that ran with a real inference request produced a positive sysfs NPU delta.
  3. fallbacks: line — skipped_cold_load=1 is expected (GenAI worker is intentionally not cold-loaded). Any other fallback label is a triage signal.
  4. gate= labels — closed gates that remain closed by design.
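
The first three read-order steps can be applied mechanically to the compact text output. This is a hedged sketch that relies only on the output shape shown in Step 2; `triage_digest` and its `expected_fallbacks` parameter are hypothetical names, not part of the digest script.

```python
import re


def triage_digest(text: str, expected_fallbacks=frozenset({"skipped_cold_load"})) -> list[str]:
    """Return human-readable findings from a compact text digest, in read order."""
    findings = []
    # Steps 1 and 2: the services_ok and proof_ok ratios on the summary row.
    for key in ("services_ok", "proof_ok"):
        m = re.search(rf"\b{key}=(\d+)/(\d+)", text)
        if m is None:
            findings.append(f"{key} missing from digest")
        elif int(m.group(1)) < int(m.group(2)):
            findings.append(f"{key} below full: {m.group(1)}/{m.group(2)}")
    # Step 3: any fallback label outside the expected set is a triage signal.
    for line in text.splitlines():
        if line.startswith("fallbacks:"):
            for label, count in re.findall(r"(\w+)=(\d+)", line):
                if int(count) > 0 and label not in expected_fallbacks:
                    findings.append(f"unexpected fallback: {label}")
    return findings
```

An empty list means the digest matches the expected healthy shape; gate= labels are left for a human to read, since closed gates are the intended state.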

Step 3 — Optional artifact for trend tracking

scripts/npu-utilization-digest.py --format jsonl

Writes one JSONL file per digest to /home/will/.local/state/npu-utilization/digests/<timestamp>.jsonl. The first line is the summary; subsequent lines are per-service rows. With --no-write, no file is written.
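
For trend tracking, an artifact with the layout described above (summary first, then per-service rows) can be read back with a few lines of Python. `load_digest` is a hypothetical helper, and no field names inside the JSON objects are assumed here.

```python
import json
from pathlib import Path


def load_digest(path: Path) -> tuple[dict, list[dict]]:
    """Split one digest artifact into its summary and per-service rows."""
    rows = [json.loads(line) for line in path.read_text().splitlines() if line.strip()]
    if not rows:
        raise ValueError(f"empty digest artifact: {path}")
    # First line is the summary; everything after it is one row per service.
    return rows[0], rows[1:]
```

Inspect a real artifact before relying on specific keys, since the row schema is defined by the digest script and documented in docs/npu-utilization-digest.md.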

Step 4 — Offline unit tests

python -m pytest tests/test_npu_utilization_digest.py -q

Does not require live services. Use to validate digest logic after edits or before merging.

Compact proof interpretation

For each proof-capable service, both the response-level npu_busy_delta_us (when the service reports it) and the script's own sysfs before/after delta must agree and be > 0. The proof is only valid when an actual inference request ran. If a probe was skipped (reason=skipped_cold_load or reason=smoke_disabled), proof_ok for that row is None and the row contributes a labeled fallback instead of a proof failure.
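
The tri-state rule above can be written as a small function. This sketch interprets "agree" as "every available delta is positive", which is an assumption; `row_proof_ok` and its parameter names are hypothetical and do not mirror the digest script's internals.

```python
from typing import Optional


def row_proof_ok(ran: bool, sysfs_delta_us: int,
                 response_delta_us: Optional[int] = None) -> Optional[bool]:
    """Tri-state proof result for one service row.

    None  -> probe was skipped; the row counts as a labeled fallback.
    True  -> inference ran and every available delta is positive.
    False -> inference ran but at least one available delta is <= 0.
    """
    if not ran:
        return None
    if sysfs_delta_us <= 0:
        return False
    # The response-level delta is optional; when present it must also be > 0.
    if response_delta_us is not None and response_delta_us <= 0:
        return False
    return True
```

Returning None for skipped probes keeps them out of both the numerator and the denominator of proof_ok, which matches the labeled-fallback accounting described above.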

Proof currently runs on:

  • embeddings (:18817)
  • rerank (:18818)
  • whisper (:18816) when --include-whisper-smoke=true (default)
  • classifier (:18819)
  • doc_triage (:18829) when --include-doc-triage-smoke=true (default); proof is via the embeddings service, not directly on the NPU device, so the row reports mode=NPU-via-embedding-service.

Intentionally health-only (no proof row):

  • rag_endpoint (:18810) — closed:vector-mutation
  • rag_health (:18814)
  • advisory_gateway (172.19.0.1:18830) — closed:advisory-post

Intentionally skipped by default:

  • genai (:18820) — loaded=false until first use. Cold-loading the model only to prove NPU use is costly, so the skip is treated as a labeled fallback rather than a proof failure. Opt in with --include-genai-smoke=true only when the task actually needs a generation smoke test.

Exit codes and triage gates

scripts/npu-service-health.sh:

| Exit | Meaning | Next |
|---|---|---|
| 0 | All checks passed including embeddings proof. | Continue to digest. |
| 2 | npu_busy_time_us not readable. | Check kernel/driver; do not run digest. |
| 3 | Embedding request failed. | Triage openvino-embeddings.service and port :18817. |
| 4 | Embedding request succeeded but sysfs delta <= 0. | Service reachable but not on the NPU; check service logs and device bind. |
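
A wrapper can turn the health script's exit codes listed above into the documented next step. This is a sketch only: `HEALTH_NEXT`, `next_step`, and `run_health` are hypothetical names, and any exit code outside the documented set is treated as a failure by assumption.

```python
import subprocess

# Exit codes and next steps as documented for scripts/npu-service-health.sh.
HEALTH_NEXT = {
    0: "continue to digest",
    2: "check kernel/driver; do not run digest",
    3: "triage openvino-embeddings.service and port :18817",
    4: "check service logs and device bind",
}


def next_step(returncode: int) -> str:
    """Map a health-script exit code to the documented next step."""
    return HEALTH_NEXT.get(
        returncode, f"undocumented exit code {returncode}; treat as failure"
    )


def run_health(script: str = "./scripts/npu-service-health.sh") -> str:
    """Run the health script in the foreground and return the next step."""
    # check=False: a nonzero exit is an expected, meaningful result here.
    proc = subprocess.run([script], check=False)
    return next_step(proc.returncode)
```

Run it from the repository root (for example ~/lab/swarm) so the relative script path resolves.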

scripts/npu-utilization-digest.py:

| Exit | Meaning | Next |
|---|---|---|
| 0 | All reachable services handled; proof/fallback accounting completed. | Inspect proof_ok and fallbacks: for any unexpected labels. |
| 2 | --strict-proof was set and at least one proof-required probe ran without a positive sysfs delta. | Triage the named service's NPU path. |

Approval gates left closed

The integrated workflow intentionally does not:

  • start, stop, restart, enable, or disable any user systemd unit or Docker Compose service;
  • write to or mutate the Chroma collection obsidian_bge_npu or any other vector store;
  • change Atlas/Hermes routing or model defaults;
  • post classification/generation/triage events to the advisory gateway;
  • broaden private document, image, or audio roots;
  • bind any new listener, including on 0.0.0.0;
  • write memory, send messages, execute tools, or mutate Kanban state.

These remain approval-gated and are tracked on the npu-maximization board.

For advisory-lane promotion decisions, pair this live utilization pass with the fixture-only dry-run comparison in docs/npu-advisory-observability-runbook.md. The digest can show whether live NPU services are healthy enough to collect evidence; it does not promote advisory outputs into authority. Promotion remains a separate lane-specific approval with explicit scope and rollback.

Quick reference

# Single-pass NPU health check (listener + systemd + embeddings proof).
cd ~/lab/swarm && ./scripts/npu-service-health.sh

# Compact digest with per-service proof and fallback accounting.
scripts/npu-utilization-digest.py --no-write --include-genai-smoke false --format text

# Same, with a JSONL artifact for trend tracking.
scripts/npu-utilization-digest.py --format jsonl

# Strict mode for CI / pre-merge.
scripts/npu-utilization-digest.py --no-write --strict-proof

# Offline digest logic tests.
python -m pytest tests/test_npu_utilization_digest.py -q