docs(npu): document VLM audio wake-word feasibility

2026-06-04 13:07:51 -07:00
parent 2ef9e3dfd2
commit 703c1df860
1 changed files with 388 additions and 0 deletions
@@ -0,0 +1,388 @@
+# OpenVINO/NPU VLM, audio, and wake-word feasibility
+
+Date: 2026-06-04
+Scope: feasibility/spec only for lower-priority assistant sidecars. This document does not enable services, alter Atlas/Hermes/gateway routing, mutate RAG/Chroma/vector collections, or process private document/image directories.
+
+## Existing baseline and constraints
+
+Live baseline discovered by parent task:
+
+- RAG endpoint: `127.0.0.1:18810`
+- RAG health wrapper: `127.0.0.1:18814`
+- Whisper OpenVINO NPU: `127.0.0.1:18816`
+- OpenVINO embeddings: `127.0.0.1:18817`
+- Prototype ports currently reserved/not live: reranker `:18818`, classifier/router `:18819`, GenAI worker `:18820`, optional doc/image triage `:18829`
+
+Local NPU runtime snapshot from the feasibility run:
+
+- `/home/will/.venvs/npu` has `openvino==2026.2.0` and `openvino-genai==2026.2.0.0`.
+- `openvino.Core().available_devices` reports `CPU`, `GPU.0`, `GPU.1`, and `NPU`.
+- NPU device name: `Intel(R) AI Boost`.
+- NPU claims must be verified by positive `/sys/class/accel/accel0/device/npu_busy_time_us` deltas around inference.
+
+External release/project signals checked:
+
+- OpenVINO 2026.2.0 release notes mention broader GenAI coverage and VLM samples, but the VLM acceleration notes are CPU/GPU-oriented; they do not provide a clear low-risk NPU VLM path.
+- Prior OpenVINO release notes/search results mention OpenVINO Model Server VLM support for Qwen2-VL, Phi-3.5-Vision, and InternVL2.
+- `openWakeWord` is an active Apache-2.0 local wake-word framework with ONNX Runtime/TFLite support, pre-trained wake-word models, optional VAD, and 16 kHz PCM streaming examples. It is not installed in the current NPU venv.
+
+## Recommendation summary
+
+| Lane | Recommendation | Priority | Why |
+| --- | --- | --- | --- |
+| VLM / image captioning | Defer NPU-first VLM. If pursued, prototype CPU/GPU VLM CLI first, then attempt NPU only after model/runtime compatibility is proven. | Low | NPU support for VLMs is not clearly mature in the current OpenVINO public notes; VLMs are memory/op-shape heavy; failures could be slow and noisy. Existing doc/image triage already covers practical local image metadata without a full VLM. |
+| Lightweight image classification / caption fallback | Extend the existing `openvino-doc-image-triage-npu` lane before adding a new service. | Medium-low | It already has privacy boundaries, synthetic fixtures, CLI/server split, and NPU proof through embeddings. Add static-shape classifier only if a later task needs image labels beyond rule fallback. |
+| Audio classification | Defer until a concrete assistant workflow needs it. Consider an OpenVINO Runtime prototype on CPU/GPU using a Speech Commands or ESC-style classifier before any daemon. | Low | Whisper NPU already covers transcription. Generic audio tags are less useful without a routing/product requirement and need dataset-specific threshold tuning. |
+| Wake word | Worth a small CPU-only local smoke prototype; do not spend NPU time first. | Medium | Wake-word detection must be always-on, tiny, and reliable. CPU openWakeWord/ONNX/TFLite is the lowest-risk path and avoids starving existing NPU Whisper/embedding services. NPU use is only worth testing after CPU false-positive/latency behavior is acceptable. |
+
+## VLM / image-captioning path
+
+### Recommended model/runtime
+
+Initial runtime: CLI-first OpenVINO GenAI or OpenVINO Model Server on CPU/GPU, not NPU-first.
+
+Candidate models to evaluate, in order:
+
+1. `Qwen2-VL-2B-Instruct` OpenVINO/OVMS-compatible export if a small converted artifact is already available.
+2. `Phi-3.5-Vision-Instruct` only if memory/startup is acceptable.
+3. `InternVL2` only as a compatibility reference; likely too heavy for a low-priority local assistant sidecar.
+
+Why this order:
+
+- Qwen2-VL is named as supported in the OpenVINO Model Server release notes/search results and has smaller variants.
+- Phi-3.5-Vision is also named in OpenVINO Model Server VLM support, but may be heavier.
+- NPU is not the first target because public OpenVINO 2026.2 release notes emphasize VLM improvements for CPU/GPU, not NPU. Treat NPU VLM as experimental until a smoke test proves compilation and positive busy-time deltas.
+
+### Endpoint/CLI contract
+
+CLI-first contract:
+
+```bash
+python vlm_caption.py \
+  --image /path/to/synthetic_or_explicitly_allowed_image.png \
+  --prompt "Describe this image in one sentence." \
+  --device CPU \
+  --max-new-tokens 96 \
+  --json
+```
+
+Response shape:
+
+```json
+{
+  "ok": true,
+  "media_type": "image",
+  "source_path_basename": "synthetic_scene.png",
+  "source_sha256": "sha256:...",
+  "model": "qwen2-vl-small-openvino",
+  "runtime": "openvino-genai-or-ovms",
+  "device_requested": "CPU",
+  "device_observed": "CPU",
+  "caption": "A synthetic chart with three colored bars.",
+  "safety": {
+    "external_uploads": false,
+    "raw_image_logged": false,
+    "private_paths_allowed": false
+  },
+  "timing_ms": {
+    "load": 0,
+    "inference": 0,
+    "total": 0
+  },
+  "npu_busy_delta_us": null
+}
+```
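
The `source_path_basename` and `source_sha256` fields in the response shape above could be produced along these lines. This is a minimal sketch: the helper name `describe_source` is hypothetical, and the `sha256:` prefix is inferred from the example value rather than from a confirmed spec.

```python
import hashlib
from pathlib import Path


def describe_source(path: str) -> dict:
    """Return only the basename and a SHA-256 digest, never the full path."""
    p = Path(path)
    digest = hashlib.sha256()
    with p.open("rb") as fh:
        # Hash in chunks so large images are not loaded into memory at once.
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return {
        "source_path_basename": p.name,
        # Assumption: the "sha256:" prefix mirrors the example response.
        "source_sha256": "sha256:" + digest.hexdigest(),
    }
```

Returning only these two fields keeps private directory names out of logs while still letting a caller confirm which file was processed.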
+
+Optional localhost HTTP contract, only after CLI is stable:
+
+- Bind: `127.0.0.1:18829` or another explicitly approved unused prototype port.
+- `GET /healthz`
+- `GET /models`
+- `POST /v1/vision/caption`
+
+Request body:
+
+```json
+{
+  "path": "/allowed/root/synthetic_scene.png",
+  "prompt": "Describe this image in one sentence.",
+  "max_new_tokens": 96,
+  "device": "CPU"
+}
+```
+
+### Smoke-test plan using non-private data
+
+Use only generated fixtures under the repo, similar to `openvino-doc-image-triage-npu/samples/`:
+
+1. Create synthetic PNGs: simple chart, receipt-like image, screenshot-like text panel, and blank/noisy image.
+2. Run the CLI with `--allowed-root "$PWD/samples"` and assert:
+   - JSON parses.
+   - `external_uploads=false`.
+   - only basename and SHA-256 are returned by default.
+   - captions are non-empty and under a configured token/character limit.
+   - unsupported/private paths are rejected.
+3. If an HTTP server is added, start it in foreground on `127.0.0.1`, call `/healthz` and `/v1/vision/caption`, then stop it.
+4. No private image/document folders and no Obsidian vault content should be used for smoke tests.
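
The path-rejection and response assertions in step 2 might look like the following sketch. The function names and the `max_chars` default of 400 are hypothetical choices for illustration; the field names come from the response shape in this document.

```python
import json
from pathlib import Path


def is_under_allowed_root(candidate: str, allowed_root: str) -> bool:
    """True only if the resolved path sits inside the resolved allowed root."""
    root = Path(allowed_root).resolve()
    target = Path(candidate).resolve()  # collapses symlinks and ".." segments
    return root in target.parents


def check_caption_response(raw: str, max_chars: int = 400) -> dict:
    """Apply the step 2 assertions to one CLI JSON response."""
    resp = json.loads(raw)  # raises ValueError if the output is not JSON
    assert resp["safety"]["external_uploads"] is False
    # Only a basename should be returned, so no path separator may appear.
    assert "/" not in resp["source_path_basename"]
    assert resp["source_sha256"].startswith("sha256:")
    # Assumption: a character limit stands in for the configured token limit.
    assert 0 < len(resp["caption"].strip()) <= max_chars
    return resp
```

Resolving both paths before comparing them is what rejects inputs such as `samples/../private.png`, which would pass a naive string-prefix check.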
+
+### NPU busy-time verification plan
+
+Only claim NPU VLM if all of these pass:
+
+1. Verify the counter is readable:
+
+```bash
+BUSY=/sys/class/accel/accel0/device/npu_busy_time_us
+test -r "$BUSY" && before=$(cat "$BUSY")
+```
+
+2. Run exactly one synthetic-image inference with `device=NPU`.
+3. Read `after=$(cat "$BUSY")`.
+4. Require `after - before > 0` and a response-level `npu_busy_delta_us > 0` if the server reports it.
+5. Repeat with a second synthetic image so the positive delta cannot be attributed only to unrelated startup activity.
+6. If HTTP returns 200 but the sysfs delta is zero, document the result as `NPU not verified` and do not call it an NPU service.
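
The steps above can be sketched in Python as a small wrapper around any inference callable. The function names are hypothetical; only the sysfs path comes from this document, and the sketch assumes the counter file holds a single integer.

```python
from pathlib import Path
from typing import Callable, List, Tuple, TypeVar

T = TypeVar("T")
BUSY_PATH = "/sys/class/accel/accel0/device/npu_busy_time_us"


def read_busy_us(path: str = BUSY_PATH) -> int:
    """Read the cumulative NPU busy time in microseconds."""
    return int(Path(path).read_text().strip())


def run_with_busy_delta(infer: Callable[[], T], path: str = BUSY_PATH) -> Tuple[T, int]:
    """Run one inference and return (result, busy-time delta in microseconds)."""
    before = read_busy_us(path)
    result = infer()
    after = read_busy_us(path)
    return result, after - before


def npu_verified(deltas: List[int]) -> bool:
    """Require a positive delta on every run, including the repeat in step 5."""
    return len(deltas) >= 2 and all(d > 0 for d in deltas)
```

Because the counter appears to be device-wide, concurrent Whisper or embedding traffic could also move it. Running the check while those services are idle would make a positive delta easier to attribute to the VLM.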
+
+### No-go / defer criteria
+
+Defer VLM NPU work if any apply:
+
+- Model export/compile to NPU fails or requires unsupported ops/custom patches.
+- First successful inference needs more than 60 seconds cold or more than 10 seconds warm for a small synthetic image.
+- NPU busy-time delta is zero or inconsistent.
+- Memory pressure disrupts Whisper `:18816`, embeddings `:18817`, or RAG `:18810`.
+- The only useful path requires processing private images/docs before synthetic smoke tests are stable.
+- Captions are too hallucination-prone for automation decisions without a human-review gate.
+
+## Lightweight image triage/classification path
+
+### Recommended model/runtime
+
+Recommended near-term path: keep `openvino-doc-image-triage-npu` as the primary image/document lane and add only a static-shape classifier if rule fallback becomes inadequate.
+
+Candidate classifier families for a later task:
+
+- MobileNetV3/EfficientNet-Lite/ResNet-18 style image classifier exported to OpenVINO IR.
+- Use NPU only if the IR compiles with static shapes and produces positive busy-time deltas.
+- Keep OCR/PDF rendering CPU-local; do not try to force OCR onto NPU in this phase.
+
+Why:
+
+- The current triage prototype already has the right privacy contract and reports CPU vs NPU stages.
+- A small classifier is much lower risk than a VLM and can be used for labels like `screenshot`, `receipt`, `document`, `photo`, `chart`.
+
+### Endpoint/CLI contract
+
+Extend existing CLI shape rather than introduce a new daemon:
+
+```bash
+/home/will/.venvs/npu/bin/python triage.py \
+  --allowed-root "$PWD" \
+  --image-classifier-model /home/will/models/openvino-image-classifier/model.xml \
+  --image-classifier-device NPU \
+  --pretty \
+  samples/synthetic_invoice.png
+```
+
+Response addition:
+
+```json
+{
+  "classification": {
+    "label": "receipt_or_invoice",
+    "confidence": 0.82,
+    "device": "NPU",
+    "method": "openvino_image_classifier",
+    "npu_busy_delta_us": 12345
+  }
+}
+```
+
+### Smoke-test plan
+
+Reuse `openvino-doc-image-triage-npu/make_samples.py` and `tests/smoke_test.py`; add synthetic image-label assertions only after a classifier model exists. Keep `--no-embeddings` mode available so the smoke suite can separate classifier NPU proof from embeddings `:18817` proof.
+
+### No-go / defer criteria
+
+- Static-shape classifier cannot compile on NPU.
+- Labels are not useful enough to drive an assistant workflow.
+- Classifier output duplicates the existing rule-based fallback.
+
+## Audio classification path
+
+### Recommended model/runtime
+
+Defer implementation. If a concrete workflow appears, start with a CLI-only OpenVINO Runtime classifier on CPU/GPU using synthetic/public audio fixtures, not a persistent service.
+
+Potential model classes:
+
+- Speech Commands keyword classifier for short command categories.
+- ESC-50/AudioSet-like environmental sound classifier only if the task requires non-speech detection.
+- Whisper transcript + lightweight text classifier may be enough for most assistant routing, using existing Whisper NPU `:18816`.
+
+Why:
+
+- The system already has local Whisper NPU transcription.
+- Generic audio classification needs careful threshold tuning and false-positive analysis.
+- Always-on audio processing has privacy and resource implications; keep it explicit and local.
+
+### CLI contract
+
+```bash
+python audio_classify.py \
+  --input samples/synthetic_chime.wav \
+  --model /home/will/models/openvino-audio-classifier/model.xml \
+  --device CPU \
+  --json
+```
+
+Response shape:
+
+```json
+{
+  "ok": true,
+  "source_path_basename": "synthetic_chime.wav",
+  "source_sha256": "sha256:...",
+  "sample_rate": 16000,
+  "duration_seconds": 1.2,
+  "labels": [
+    {"label": "chime", "confidence": 0.76}
+  ],
+  "device_requested": "CPU",
+  "device_observed": "CPU",
+  "npu_busy_delta_us": null,
+  "privacy": {"external_uploads": false, "raw_audio_logged": false}
+}
+```
+
+An optional HTTP endpoint should wait until a workflow exists. If one is added later, bind it to `127.0.0.1` and avoid the live and reserved ports listed in the baseline.
+
+### Smoke-test plan using non-private data
+
+1. Generate synthetic WAV files in repo-local `samples/`: sine tone, silence, white noise, simple chime, and a short synthetic spoken phrase if a local TTS fixture is available.
+2. Run the CLI on each file with `--allowed-root "$PWD/samples"`.
+3. Assert JSON parses, durations are bounded, and confidence values are numeric.
+4. Do not stream microphone input or scan private audio directories in smoke tests.
+5. If NPU mode is attempted, wrap each inference in sysfs busy-time reads.
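
The synthetic fixtures in step 1 can be generated with the Python standard library alone. This is a minimal sketch: the helper names and the amplitude defaults are illustrative assumptions, and the 16 kHz mono 16-bit format is chosen to match the `sample_rate` in the response shape.

```python
import math
import random
import struct
import wave

SAMPLE_RATE = 16000  # matches the 16 kHz rate in the response shape


def write_wav(path: str, samples: list) -> None:
    """Write mono 16-bit PCM from floats in [-1.0, 1.0]."""
    clipped = (max(-1.0, min(1.0, s)) for s in samples)
    frames = b"".join(struct.pack("<h", int(s * 32767)) for s in clipped)
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(SAMPLE_RATE)
        wav.writeframes(frames)


def sine(seconds: float, freq_hz: float = 440.0, amp: float = 0.5) -> list:
    """Pure tone at freq_hz."""
    n = int(seconds * SAMPLE_RATE)
    return [amp * math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE) for i in range(n)]


def silence(seconds: float) -> list:
    """All-zero samples."""
    return [0.0] * int(seconds * SAMPLE_RATE)


def white_noise(seconds: float, amp: float = 0.3, seed: int = 0) -> list:
    """Uniform noise; the fixed seed keeps fixtures reproducible."""
    rng = random.Random(seed)
    return [amp * rng.uniform(-1.0, 1.0) for _ in range(int(seconds * SAMPLE_RATE))]
```

For example, `write_wav("samples/synthetic_tone.wav", sine(1.0))` would produce a one-second tone, assuming the `samples/` directory already exists.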
+
+### No-go / defer criteria
+
+- No concrete downstream automation consumes the labels.
+- False positives cannot be characterized on synthetic/public fixtures.
+- It competes with Whisper NPU or requires a persistent microphone daemon without explicit approval.
+
+## Wake-word path
+
+### Recommended model/runtime
+
+Recommended first runtime: CPU-only `openWakeWord` CLI/foreground process with ONNX Runtime or TFLite backend.
+
+NPU recommendation: defer. Try NPU/OpenVINO conversion only after CPU openWakeWord passes false-positive and latency checks.
+
+Why:
+
+- Wake-word detection is always-on and latency-sensitive; reliability matters more than accelerator novelty.
+- The model is small enough that CPU is likely acceptable and simpler.
+- Keeping wake-word off NPU reduces contention with Whisper NPU and embeddings.
+- openWakeWord has pre-trained models, optional VAD, and straightforward 16 kHz PCM frame APIs.
+
+### Endpoint/CLI contract
+
+CLI smoke contract:
+
+```bash
+python wake_word_smoke.py \
+  --model hey_jarvis \
+  --positive samples/synthetic_wake_positive.wav \
+  --negative samples/synthetic_noise.wav \
+  --threshold 0.5 \
+  --json
+```
+
+Foreground local stream contract, only for manual experiments:
+
+```bash
+python wake_word_listen.py \
+  --model hey_jarvis \
+  --threshold 0.5 \
+  --vad-threshold 0.3 \
+  --oneshot \
+  --json
+```
+
+Response/event shape:
+
+```json
+{
+  "ok": true,
+  "model": "hey_jarvis",
+  "runtime": "openwakeword-onnxruntime-or-tflite",
+  "device": "CPU",
+  "threshold": 0.5,
+  "events": [
+    {"offset_ms": 1280, "score": 0.83, "detected": true}
+  ],
+  "false_positive_count": 0,
+  "npu_busy_delta_us": null,
+  "privacy": {"external_uploads": false, "raw_audio_logged": false}
+}
+```
+
+If a localhost HTTP endpoint is ever needed, do not expose raw microphone streaming by default. Prefer events only:
+
+- `GET /healthz`
+- `POST /v1/wakeword/evaluate-file` for explicit files under allowed roots
+- `GET /v1/wakeword/events` for a manually started foreground listener
+
+### Smoke-test plan using non-private data
+
+1. Install in a disposable or dedicated venv, not the existing NPU venv unless explicitly approved:
+
+```bash
+python -m venv /tmp/openwakeword-smoke-venv
+/tmp/openwakeword-smoke-venv/bin/python -m pip install openwakeword
+```
+
+2. Use public/generated WAVs only:
+   - Negative: silence, white noise, generic non-wake speech/TTS if locally generated.
+   - Positive: only if a public/pretrained wake phrase fixture is available or generated explicitly for the selected model. If no positive fixture exists, run a negative-only false-positive smoke test and mark recall as untested.
+3. Assert no false positives over a bounded negative fixture set.
+4. Measure per-frame CPU latency and max RSS.
+5. Do not start a persistent microphone listener; manual foreground `--oneshot` only if explicitly approved.
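
Steps 3 and 4 can be exercised with a backend-agnostic harness like the sketch below. It takes any `score_fn` callable, so it does not depend on a specific openWakeWord call; wrapping the real model in such a callable is left as an assumption to verify against that library's documentation. The 1280-sample frame (80 ms at 16 kHz) is likewise an assumed frame size.

```python
import time
from typing import Callable, Sequence

FRAME_SAMPLES = 1280  # 80 ms at 16 kHz; an assumed frame size for this sketch


def evaluate_frames(
    samples: Sequence[int],
    score_fn: Callable[[Sequence[int]], float],
    threshold: float = 0.5,
    sample_rate: int = 16000,
) -> dict:
    """Score consecutive frames; report detections and worst-case frame latency."""
    events = []
    max_latency_ms = 0.0
    for start in range(0, len(samples) - FRAME_SAMPLES + 1, FRAME_SAMPLES):
        frame = samples[start:start + FRAME_SAMPLES]
        t0 = time.perf_counter()
        score = score_fn(frame)
        max_latency_ms = max(max_latency_ms, (time.perf_counter() - t0) * 1000.0)
        if score >= threshold:
            offset_ms = start * 1000 // sample_rate
            events.append({"offset_ms": offset_ms, "score": score, "detected": True})
    return {"events": events, "max_frame_latency_ms": max_latency_ms}
```

On a negative fixture, `len(result["events"])` is the false-positive count, and the smoke test passes only if it is zero.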
+
+### NPU busy-time verification plan
+
+Wake-word should not claim NPU in the initial path. If a later task converts a model to OpenVINO IR and targets NPU:
+
+1. Read `/sys/class/accel/accel0/device/npu_busy_time_us` before a bounded file evaluation.
+2. Run NPU inference on a fixed set of WAV frames.
+3. Read the counter after inference.
+4. Require positive delta and stable predictions matching CPU baseline.
+5. Also verify that keeping the wake-word loop active does not starve Whisper `:18816` or embeddings `:18817`.
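
The "stable predictions matching CPU baseline" requirement in step 4 could be checked per frame, as in this sketch. The function name and the `tol` default of 0.05 are hypothetical; an acceptable tolerance would need to be chosen from measured CPU score variance.

```python
from typing import Sequence


def scores_match(
    cpu: Sequence[float],
    npu: Sequence[float],
    threshold: float = 0.5,
    tol: float = 0.05,
) -> bool:
    """True if per-frame scores stay within tol and detection decisions agree."""
    if len(cpu) != len(npu):
        return False
    for c, n in zip(cpu, npu):
        # A small numeric drift is tolerated, but a flipped decision is not.
        if abs(c - n) > tol or (c >= threshold) != (n >= threshold):
            return False
    return True
```

Checking the thresholded decision separately catches the case where a small score drift near the threshold turns a non-detection into a detection.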
+
+### No-go / defer criteria
+
+- CPU openWakeWord has unacceptable false positives on local negative fixtures.
+- A usable positive fixture cannot be created without recording private audio.
+- The prototype would require always-on microphone capture before explicit approval has been given.
+- NPU conversion changes scores materially from CPU baseline.
+- NPU loop increases contention with Whisper/embedding services.
+
+## Docs and diagram implications
+
+If these lanes advance beyond feasibility:
+
+1. Update `docs/swarm-infrastructure.md` and `docs/swarm-infrastructure.html` to keep live vs prototype labels clear.
+2. Update the OpenVINO NPU runbook with smoke commands and the sysfs busy-time proof steps.
+3. Update the Service Catalog only after a service is actually approved/live; until then, list it as `prototype/not live` or omit it.
+4. Architecture diagrams may show:
+   - live: RAG `:18810`, Whisper NPU `:18816`, embeddings `:18817`;
+   - prototypes: reranker `:18818`, classifier/router `:18819`, GenAI worker `:18820`, doc/image triage optional `:18829`;
+   - VLM/audio/wake-word as `CLI feasibility / not live` unless a later implementation task creates a service.
+5. Do not imply Atlas/Hermes routing integration for any of these lanes without explicit approval.
+
+## Overall go/no-go decision
+
+- Go later: wake-word CPU-only CLI smoke, because it is useful and low risk if kept foreground/local.
+- Maybe later: lightweight image classifier inside existing doc/image triage, if rule fallback is not enough.
+- Defer: NPU-first VLM captioning until OpenVINO VLM-on-NPU compatibility is proven by a minimal synthetic-image smoke test.
+- Defer: generic audio classification until there is a concrete assistant workflow that consumes the output.