William Valentin 22e6ee90d2 docs(npu): document advisory observability gates

Add operator runbook and link integrated health docs for advisory-only observability, dry-run metrics, and future promotion criteria.

2026-06-06 15:30:31 -07:00


NPU integrated health checks — operator runbook notes

Compact, read-only operator workflow that combines the existing scripts/npu-service-health.sh listener/systemd/embedding-proof probe with the reviewer-approved scripts/npu-utilization-digest.py per-service utilization and fallback report. Together they form a single safe daily / on-demand NPU health pass.

Scope:

Read-only against live services. No restarts, route changes, vector mutation, advisory POSTs, outbound sends, or memory writes.
No new persistent services, timers, sockets, compose services, or Dockerfiles are introduced by this integration. Both scripts are foreground / on-demand.
Binds are verified to be local-only or on the approved Docker bridge (172.19.0.1:18830). Pre-existing broader binds on the live baseline ports (18810, 18814, 18816, 18817) are noted in the runbook and left unchanged here.
NPU proof requires real inference plus a positive /sys/class/accel/accel0/device/npu_busy_time_us delta. HTTP 200 alone is not sufficient.
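
The proof rule in the last scope item can be sketched in a few lines of Python. This is a minimal illustration, not the code used by either script: the helper names (`read_busy_us`, `npu_delta_us`, `proves_npu`) are hypothetical, and only the sysfs path comes from this document.

```python
from pathlib import Path
from typing import Callable

# Counter path taken from this runbook; everything else here is illustrative.
COUNTER = Path("/sys/class/accel/accel0/device/npu_busy_time_us")


def read_busy_us(path: Path = COUNTER) -> int:
    """Return the cumulative NPU busy time in microseconds."""
    return int(path.read_text().strip())


def npu_delta_us(run_inference: Callable[[], object], path: Path = COUNTER) -> int:
    """Read the counter, run one real inference call, and return the delta."""
    before = read_busy_us(path)
    run_inference()
    after = read_busy_us(path)
    return after - before


def proves_npu(delta_us: int) -> bool:
    """Only a positive busy-time delta counts; an HTTP 200 alone does not."""
    return delta_us > 0
```

The `run_inference` callable stands in for whatever request the probe sends. Reading the counter is itself read-only, so a sketch like this stays within the scope listed above.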

When to run

Daily / on-demand ops check.
After upgrades that touch the NPU stack, OpenVINO, or any of the live specialists.
Before any approval-gated change that depends on the NPU reflex layer.
As the read-only verification step of a deploy or recovery runbook.

Required artifacts on the branch

| Path | Role |
| --- | --- |
| `scripts/npu-service-health.sh` | Listener / systemd / Docker / health endpoint / single embedding proof. Existing baseline script. |
| `scripts/npu-utilization-digest.py` | Per-service utilization digest with NPU proof per probe, compact text or JSONL output, optional JSONL artifact. |
| `docs/npu-utilization-digest.md` | Per-service digest reference. |
| `docs/npu-advisory-observability-runbook.md` | Dry-run comparison and later promotion criteria for advisory lanes. |
| `tests/test_npu_utilization_digest.py` | Offline unit tests for the digest (no live services required). |

Integrated workflow

Step 1 — Listener and service-state snapshot

cd ~/lab/swarm
./scripts/npu-service-health.sh

What it verifies, in order:

npu_busy_time_us counter is readable.
Required listeners are present on 18810 / 18814 / 18816 / 18817 / 18818 / 18819 / 18820 / 18829 / 18830.
User systemd services are active/enabled for embeddings, RAG health, reranker, router/classifier, and the small GenAI worker.
Docker Compose whisper-server-npu is up.
Health endpoints return JSON for the live baseline and local specialists.
A single non-private embeddings request to :18817 produces a positive sysfs npu_busy_time_us delta; the script exits nonzero if there is no positive delta.

Read the last block (== Embeddings NPU busy-time proof ==) first. If result=ok and sysfs_delta_us > 0, the central NPU path is healthy. If not, do not run the digest; triage the embeddings service first.
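
For a quick manual cross-check of the listener step, a TCP connect test along the following lines may be enough. It is a sketch under assumptions: the runbook does not say how `npu-service-health.sh` detects listeners, the function names are hypothetical, and probing the local ports on 127.0.0.1 is an assumption (only the bridge address 172.19.0.1:18830 is stated in this document).

```python
import socket

# Ports from the listener step above; the loopback host is an assumption.
LOCAL_PORTS = [18810, 18814, 18816, 18817, 18818, 18819, 18820, 18829]
BRIDGE = ("172.19.0.1", 18830)


def is_listening(host: str, port: int, timeout: float = 0.5) -> bool:
    """Return True if a TCP connect to host:port succeeds."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an errno value otherwise.
        return sock.connect_ex((host, port)) == 0


def missing_listeners() -> list[str]:
    """List every expected host:port that does not accept a connection."""
    targets = [("127.0.0.1", port) for port in LOCAL_PORTS] + [BRIDGE]
    return [f"{host}:{port}" for host, port in targets if not is_listening(host, port)]
```

A connect test only shows that something accepts connections on the port. It says nothing about which address the service is bound to, so it does not replace the bind verification described under Scope.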

Step 2 — Per-service utilization digest

scripts/npu-utilization-digest.py --no-write --include-genai-smoke false --format text

Compact output shape:

NPU utilization digest <timestamp>
counter=/sys/class/accel/accel0/device/npu_busy_time_us delta_us=<total>
services_ok=<ok>/<total> proof_ok=<ok>/<proof-capable> fallbacks=<n> gates_closed=<n>
- embeddings: ok=true calls=1 avg_ms=... npu_delta_us=... proof=true mode=NPU
- rerank:     ok=true calls=1 docs=2   avg_ms=... npu_delta_us=... proof=true mode=NPU
- whisper:    ok=true calls=1 jobs=1   avg_ms=... npu_delta_us=... proof=true mode=NPU
- classifier: ok=true calls=1 events=1 avg_ms=... npu_delta_us=... proof=true dry_run=true ...
- genai:      ok=true jobs=0 loaded=false mode=loaded=false reason=skipped_cold_load
- doc_triage: ok=true calls=1 files=1  avg_ms=... npu_delta_us=... proof=true gate=closed:private-root
- rag_endpoint:   ok=true mode=health_only gate=closed:vector-mutation
- rag_health:     ok=true mode=health_only
- advisory_gateway: ok=true mode=health_only gate=closed:advisory-post
fallbacks: skipped_cold_load=1

Read order for ops:

services_ok row — anything below 9/9 means a service is down or unhealthy.
proof_ok row — proof_ok=5/5 means every probe that ran with a real inference request produced a positive sysfs NPU delta.
fallbacks: line — skipped_cold_load=1 is expected (GenAI worker is intentionally not cold-loaded). Any other fallback label is a triage signal.
gate= labels — closed gates that remain closed by design.
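
The read order above can be automated against the compact text output. The sketch below assumes the summary and `fallbacks:` lines look exactly like the output shape shown earlier, and that multiple fallback labels are separated by spaces (an assumption, since the sample shows only one). The function names are hypothetical.

```python
import re

SUMMARY_RE = re.compile(
    r"services_ok=(\d+)/(\d+)\s+proof_ok=(\d+)/(\d+)"
    r"\s+fallbacks=(\d+)\s+gates_closed=(\d+)"
)
SUMMARY_KEYS = (
    "services_ok", "services_total", "proof_ok", "proof_total",
    "fallbacks", "gates_closed",
)


def parse_summary(text: str) -> dict[str, int]:
    """Extract the counts from the services_ok/proof_ok summary row."""
    match = SUMMARY_RE.search(text)
    if match is None:
        raise ValueError("no summary row found")
    return dict(zip(SUMMARY_KEYS, map(int, match.groups())))


def parse_fallbacks(text: str) -> dict[str, int]:
    """Parse 'fallbacks: label=count ...' into a dict; empty if absent."""
    for line in text.splitlines():
        if line.startswith("fallbacks:"):
            pairs = line.split(":", 1)[1].split()
            return {k: int(v) for k, v in (p.split("=", 1) for p in pairs)}
    return {}


def triage_signals(text: str, expected=frozenset({"skipped_cold_load"})) -> list[str]:
    """Return the reasons an operator should look closer; empty means clean."""
    summary = parse_summary(text)
    signals = []
    if summary["services_ok"] < summary["services_total"]:
        signals.append("service down or unhealthy")
    if summary["proof_ok"] < summary["proof_total"]:
        signals.append("proof missing for a probe that ran")
    unexpected = sorted(set(parse_fallbacks(text)) - expected)
    signals += [f"unexpected fallback: {label}" for label in unexpected]
    return signals
```

An empty list from `triage_signals` corresponds to the healthy case described above: all services ok, every proof that ran succeeded, and `skipped_cold_load` as the only fallback label.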

Step 3 — Optional artifact for trend tracking

scripts/npu-utilization-digest.py --format jsonl

Writes one JSONL file per digest to /home/will/.local/state/npu-utilization/digests/<timestamp>.jsonl. The first line of the file is the summary; subsequent lines are per-service rows. Nothing is written when --no-write is set.
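
For trend tracking, a digest artifact can be read back with the standard library. The sketch below relies only on the layout stated above (first line summary, remaining lines per-service rows) and makes no assumption about field names inside the records. The helper names are hypothetical, and picking the newest file by sorted filename assumes the timestamp in the name sorts lexicographically.

```python
import json
from pathlib import Path
from typing import Optional

# Directory from this runbook.
DIGEST_DIR = Path("/home/will/.local/state/npu-utilization/digests")


def load_digest(path: Path) -> tuple[dict, list[dict]]:
    """Split one digest file into (summary, per-service rows)."""
    records = [
        json.loads(line)
        for line in path.read_text().splitlines()
        if line.strip()
    ]
    if not records:
        raise ValueError(f"{path} is empty")
    return records[0], records[1:]


def latest_digest(directory: Path = DIGEST_DIR) -> Optional[Path]:
    """Return the newest digest by filename, or None if there is none."""
    if not directory.is_dir():
        return None
    files = sorted(directory.glob("*.jsonl"))
    return files[-1] if files else None
```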

Step 4 — Offline unit tests

python -m pytest tests/test_npu_utilization_digest.py -q

Does not require live services. Use to validate digest logic after edits or before merging.

Compact proof interpretation

For each proof-capable service, both the response-level npu_busy_delta_us (when the service reports it) and the script's own sysfs before/after delta must agree and be > 0. The proof is only valid when an actual inference request ran. If a probe was skipped (reason=skipped_cold_load or reason=smoke_disabled), proof_ok for that row is None and the row contributes a labeled fallback instead of a proof failure.
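
The three-state result described above (True, False, or None for a skipped probe) can be written as a small function. This is an illustrative sketch, not the digest's actual logic. In particular, it reads "agree" as "both deltas are positive" and does not require the two values to be equal, which is an assumption; the function and parameter names are hypothetical.

```python
from typing import Optional

# Skip reasons named in this runbook.
SKIP_REASONS = {"skipped_cold_load", "smoke_disabled"}


def proof_ok(
    sysfs_delta_us: int,
    response_delta_us: Optional[int] = None,
    reason: Optional[str] = None,
) -> Optional[bool]:
    """Return True/False for a probe that ran, None for a skipped probe."""
    if reason in SKIP_REASONS:
        # No inference ran, so this is a labeled fallback, not a failure.
        return None
    if sysfs_delta_us <= 0:
        return False
    # The response-level delta is only checked when the service reports it.
    if response_delta_us is not None and response_delta_us <= 0:
        return False
    return True
```

Returning None instead of False for skipped probes keeps an intentional skip out of the proof failure count while still letting it show up as a labeled fallback.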

Proof currently runs on:

embeddings (:18817)
rerank (:18818)
whisper (:18816) when --include-whisper-smoke=true (default)
classifier (:18819)
doc_triage (:18829) when --include-doc-triage-smoke=true (default); proof is via the embeddings service, not directly on the NPU device, so the row reports mode=NPU-via-embedding-service.

Intentionally health-only (no proof row):

rag_endpoint (:18810) — closed:vector-mutation
rag_health (:18814)
advisory_gateway (172.19.0.1:18830) — closed:advisory-post

Intentionally skipped by default:

genai (:18820): reports loaded=false until first use. Cold-loading the model only to prove NPU use has a real cost, so the skipped probe is treated as a labeled fallback rather than a proof failure. Opt in with --include-genai-smoke=true only when the task actually needs a generation smoke.

Exit codes and triage gates

scripts/npu-service-health.sh:

| Exit code | Meaning | Next step |
| --- | --- | --- |
| 0 | All checks passed including embeddings proof. | Continue to digest. |
| 2 | `npu_busy_time_us` not readable. | Check kernel/driver; do not run digest. |
| 3 | Embedding request failed. | Triage `openvino-embeddings.service` and port `:18817`. |
| 4 | Embedding request succeeded but sysfs delta `<= 0`. | Service reachable but not on the NPU; check service logs and device bind. |

scripts/npu-utilization-digest.py:

| Exit code | Meaning | Next step |
| --- | --- | --- |
| 0 | All reachable services handled; proof/fallback accounting completed. | Inspect `proof_ok` and `fallbacks:` for any unexpected labels. |
| 2 | `--strict-proof` was set and at least one proof-required probe ran without a positive sysfs delta. | Triage the named service's NPU path. |
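
The exit codes above lend themselves to a small gating wrapper: run the health script first and run the digest only when it exits 0. The following is a sketch of that idea, assuming it is started from the repository root. The function names are hypothetical, and the handling of undocumented exit codes is an illustrative choice, not documented behavior of either script.

```python
import subprocess

# Next steps copied from the exit code tables in this runbook.
HEALTH_NEXT = {
    0: "continue to digest",
    2: "counter not readable: check kernel/driver; do not run digest",
    3: "embedding request failed: triage openvino-embeddings.service and :18817",
    4: "no positive sysfs delta: check service logs and device bind",
}
DIGEST_NEXT = {
    0: "inspect proof_ok and fallbacks: for unexpected labels",
    2: "strict proof failed: triage the named service's NPU path",
}


def next_step(table: dict[int, str], code: int) -> str:
    """Map an exit code to its next step; flag codes the tables omit."""
    return table.get(code, f"undocumented exit code {code}: inspect script output")


def run_pass(run=subprocess.run) -> list[str]:
    """Run health, then the digest only if health exited 0."""
    steps = []
    health = run(["./scripts/npu-service-health.sh"])
    steps.append(next_step(HEALTH_NEXT, health.returncode))
    if health.returncode != 0:
        return steps  # gate: do not run the digest after a failed health pass
    digest = run(["scripts/npu-utilization-digest.py", "--no-write", "--strict-proof"])
    steps.append(next_step(DIGEST_NEXT, digest.returncode))
    return steps
```

The `run` parameter exists only so the wrapper can be exercised offline with a stub in place of `subprocess.run`.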

Approval gates left closed

The integrated workflow intentionally does not:

start, stop, restart, enable, or disable any user systemd unit or Docker Compose service;
write to or mutate the Chroma collection obsidian_bge_npu or any other vector store;
change Atlas/Hermes routing or model defaults;
post classification/generation/triage events to the advisory gateway;
broaden private document, image, or audio roots;
bind any new listener, including on 0.0.0.0;
write memory, send messages, execute tools, or mutate Kanban state.

These remain approval-gated and are tracked on the npu-maximization board.

For advisory-lane promotion decisions, pair this live utilization pass with the fixture-only dry-run comparison in docs/npu-advisory-observability-runbook.md. The digest can show whether live NPU services are healthy enough to collect evidence; it does not promote advisory outputs into authority. Promotion remains a separate lane-specific approval with explicit scope and rollback.

Quick reference

# Single-pass NPU health check (listener + systemd + embeddings proof).
cd ~/lab/swarm && ./scripts/npu-service-health.sh

# Compact digest with per-service proof and fallback accounting.
scripts/npu-utilization-digest.py --no-write --include-genai-smoke false --format text

# Same, with a JSONL artifact for trend tracking.
scripts/npu-utilization-digest.py --format jsonl

# Strict mode for CI / pre-merge.
scripts/npu-utilization-digest.py --no-write --strict-proof

# Offline digest logic tests.
python -m pytest tests/test_npu_utilization_digest.py -q
