From 703c1df8604f39df0366104fb80ade71477e33ee Mon Sep 17 00:00:00 2001
From: William Valentin
Date: Thu, 4 Jun 2026 13:07:51 -0700
Subject: [PATCH] docs(npu): document VLM audio wake-word feasibility

---
 ...openvino-vlm-audio-wakeword-feasibility.md | 388 ++++++++++++++++++
 1 file changed, 388 insertions(+)
 create mode 100644 docs/openvino-vlm-audio-wakeword-feasibility.md

diff --git a/docs/openvino-vlm-audio-wakeword-feasibility.md b/docs/openvino-vlm-audio-wakeword-feasibility.md
new file mode 100644
index 0000000..68aaf1a
--- /dev/null
+++ b/docs/openvino-vlm-audio-wakeword-feasibility.md
@@ -0,0 +1,388 @@
+# OpenVINO/NPU VLM, audio, and wake-word feasibility
+
+Date: 2026-06-04
+Scope: feasibility/spec only for lower-priority assistant sidecars. This document does not enable services, alter Atlas/Hermes/gateway routing, mutate RAG/Chroma/vector collections, or process private document/image directories.
+
+## Existing baseline and constraints
+
+Live baseline discovered by parent task:
+
+- RAG endpoint: `127.0.0.1:18810`
+- RAG health wrapper: `127.0.0.1:18814`
+- Whisper OpenVINO NPU: `127.0.0.1:18816`
+- OpenVINO embeddings: `127.0.0.1:18817`
+- Prototype ports currently reserved/not live: reranker `:18818`, classifier/router `:18819`, GenAI worker `:18820`, optional doc/image triage `:18829`
+
+Local NPU runtime snapshot from the feasibility run:
+
+- `/home/will/.venvs/npu` has `openvino==2026.2.0` and `openvino-genai==2026.2.0.0`.
+- `openvino.Core().available_devices` reports `CPU`, `GPU.0`, `GPU.1`, and `NPU`.
+- NPU device name: `Intel(R) AI Boost`.
+- NPU claims must be verified by positive `/sys/class/accel/accel0/device/npu_busy_time_us` deltas around inference.
+
+External release/project signals checked:
+
+- OpenVINO 2026.2.0 release notes mention broader GenAI coverage and VLM samples, but the VLM acceleration notes are CPU/GPU-oriented; they do not provide a clear low-risk NPU VLM path.
+- Prior OpenVINO release notes/search results mention OpenVINO Model Server VLM support for Qwen2-VL, Phi-3.5-Vision, and InternVL2.
+- `openWakeWord` is an active Apache-2.0 local wake-word framework with ONNX Runtime/TFLite support, pre-trained wake-word models, optional VAD, and 16 kHz PCM streaming examples. It is not installed in the current NPU venv.
+
+## Recommendation summary
+
+| Lane | Recommendation | Priority | Why |
+| --- | --- | --- | --- |
+| VLM / image captioning | Defer NPU-first VLM. If pursued, prototype CPU/GPU VLM CLI first, then attempt NPU only after model/runtime compatibility is proven. | Low | NPU support for VLMs is not clearly mature in the current OpenVINO public notes; VLMs are memory/op-shape heavy; failures could be slow and noisy. Existing doc/image triage already covers practical local image metadata without a full VLM. |
+| Lightweight image classification / caption fallback | Extend the existing `openvino-doc-image-triage-npu` lane before adding a new service. | Medium-low | It already has privacy boundaries, synthetic fixtures, CLI/server split, and NPU proof through embeddings. Add static-shape classifier only if a later task needs image labels beyond rule fallback. |
+| Audio classification | Defer until a concrete assistant workflow needs it. Consider CPU/GPU/OpenVINO Runtime prototype using Speech Commands/ESC-style classifier before any daemon. | Low | Whisper NPU already covers transcription. Generic audio tags are less useful without a routing/product requirement and need dataset-specific threshold tuning. |
+| Wake word | Worth a small CPU-only local smoke prototype; do not spend NPU time first. | Medium | Wake-word detection must be always-on, tiny, and reliable. CPU openWakeWord/ONNX/TFLite is the lowest-risk path and avoids starving existing NPU Whisper/embedding services. NPU use is only worth testing after CPU false-positive/latency behavior is acceptable. |
+
+## VLM / image-captioning path
+
+### Recommended model/runtime
+
+Initial runtime: CLI-first OpenVINO GenAI or OpenVINO Model Server on CPU/GPU, not NPU-first.
+
+Candidate models to evaluate, in order:
+
+1. `Qwen2-VL-2B-Instruct` OpenVINO/OVMS-compatible export if a small converted artifact is already available.
+2. `Phi-3.5-Vision-Instruct` only if memory/startup is acceptable.
+3. `InternVL2` only as a compatibility reference; likely too heavy for a low-priority local assistant sidecar.
+
+Why this order:
+
+- Qwen2-VL is broadly supported by OpenVINO Model Server release notes/search results and has smaller variants.
+- Phi-3.5-Vision is also named in OpenVINO Model Server VLM support, but may be heavier.
+- NPU is not the first target because public OpenVINO 2026.2 release notes emphasize VLM improvements for CPU/GPU, not NPU. Treat NPU VLM as experimental until a smoke test proves compilation and positive busy-time deltas.
+
+### Endpoint/CLI contract
+
+CLI-first contract:
+
+```bash
+python vlm_caption.py \
+  --image /path/to/synthetic_or_explicitly_allowed_image.png \
+  --prompt "Describe this image in one sentence." \
+  --device CPU \
+  --max-new-tokens 96 \
+  --json
+```
+
+Response shape:
+
+```json
+{
+  "ok": true,
+  "media_type": "image",
+  "source_path_basename": "synthetic_scene.png",
+  "source_sha256": "sha256:...",
+  "model": "qwen2-vl-small-openvino",
+  "runtime": "openvino-genai-or-ovms",
+  "device_requested": "CPU",
+  "device_observed": "CPU",
+  "caption": "A synthetic chart with three colored bars.",
+  "safety": {
+    "external_uploads": false,
+    "raw_image_logged": false,
+    "private_paths_allowed": false
+  },
+  "timing_ms": {
+    "load": 0,
+    "inference": 0,
+    "total": 0
+  },
+  "npu_busy_delta_us": null
+}
+```
+
+Optional localhost HTTP contract, only after CLI is stable:
+
+- Bind: `127.0.0.1:18829` or another explicitly approved unused prototype port.
+- `GET /healthz`
+- `GET /models`
+- `POST /v1/vision/caption`
+
+Request body:
+
+```json
+{
+  "path": "/allowed/root/synthetic_scene.png",
+  "prompt": "Describe this image in one sentence.",
+  "max_new_tokens": 96,
+  "device": "CPU"
+}
+```
+
+### Smoke-test plan using non-private data
+
+Use only generated fixtures under the repo, similar to `openvino-doc-image-triage-npu/samples/`:
+
+1. Create synthetic PNGs: simple chart, receipt-like image, screenshot-like text panel, and blank/noisy image.
+2. Run CLI with `--allowed-root "$PWD/samples"` and assert:
+   - JSON parses.
+   - `external_uploads=false`.
+   - only basename and SHA-256 are returned by default.
+   - captions are non-empty and under a configured token/character limit.
+   - unsupported/private paths are rejected.
+3. If an HTTP server is added, start it in foreground on `127.0.0.1`, call `/healthz` and `/v1/vision/caption`, then stop it.
+4. No private image/document folders and no Obsidian vault content should be used for smoke tests.
+
+### NPU busy-time verification plan
+
+Only claim NPU VLM if all of these pass:
+
+1. Verify the counter is readable:
+
+```bash
+BUSY=/sys/class/accel/accel0/device/npu_busy_time_us
+test -r "$BUSY" && before=$(cat "$BUSY")
+```
+
+2. Run exactly one synthetic-image inference with `device=NPU`.
+3. Read `after=$(cat "$BUSY")`.
+4. Require `after - before > 0` and a response-level `npu_busy_delta_us > 0` if the server reports it.
+5. Repeat with a second synthetic image so the positive delta cannot be attributed to unrelated startup activity alone.
+6. If HTTP returns 200 but the sysfs delta is zero, document as `NPU not verified` and do not call it an NPU service.
+
+### No-go / defer criteria
+
+Defer VLM NPU work if any apply:
+
+- Model export/compile to NPU fails or requires unsupported ops/custom patches.
+- First successful inference needs more than 60 seconds cold or more than 10 seconds warm for a small synthetic image.
+- NPU busy-time delta is zero or inconsistent.
+- Memory pressure disrupts Whisper `:18816`, embeddings `:18817`, or RAG `:18810`.
+- The only useful path requires processing private images/docs before synthetic smoke tests are stable.
+- Captions are too hallucination-prone for automation decisions without a human-review gate.
+
+## Lightweight image triage/classification path
+
+### Recommended model/runtime
+
+Recommended near-term path: keep `openvino-doc-image-triage-npu` as the primary image/document lane and add only a static-shape classifier if rule fallback becomes inadequate.
+
+Candidate classifier families for a later task:
+
+- MobileNetV3/EfficientNet-Lite/ResNet-18 style image classifier exported to OpenVINO IR.
+- Use NPU only if the IR compiles with static shapes and produces positive busy-time deltas.
+- Keep OCR/PDF rendering CPU-local; do not try to force OCR onto NPU in this phase.
+
+Why:
+
+- The current triage prototype already has the right privacy contract and reports CPU vs NPU stages.
+- A small classifier is much lower risk than a VLM and can be used for labels like `screenshot`, `receipt`, `document`, `photo`, `chart`.
+
+### Endpoint/CLI contract
+
+Extend existing CLI shape rather than introduce a new daemon:
+
+```bash
+/home/will/.venvs/npu/bin/python triage.py \
+  --allowed-root "$PWD" \
+  --image-classifier-model /home/will/models/openvino-image-classifier/model.xml \
+  --image-classifier-device NPU \
+  --pretty \
+  samples/synthetic_invoice.png
+```
+
+Response addition:
+
+```json
+{
+  "classification": {
+    "label": "receipt_or_invoice",
+    "confidence": 0.82,
+    "device": "NPU",
+    "method": "openvino_image_classifier",
+    "npu_busy_delta_us": 12345
+  }
+}
+```
+
+### Smoke-test plan
+
+Reuse `openvino-doc-image-triage-npu/make_samples.py` and `tests/smoke_test.py`; add synthetic image-label assertions only after a classifier model exists. Keep `--no-embeddings` mode available so the smoke suite can separate classifier NPU proof from embeddings `:18817` proof.
+
+### No-go / defer criteria
+
+- Static-shape classifier cannot compile on NPU.
+- Labels are not useful enough to drive an assistant workflow.
+- Classifier output duplicates the existing rule-based fallback.
+
+## Audio classification path
+
+### Recommended model/runtime
+
+Defer implementation. If a concrete workflow appears, start with a CLI-only OpenVINO Runtime classifier on CPU/GPU using synthetic/public audio fixtures, not a persistent service.
+
+Potential model classes:
+
+- Speech Commands keyword classifier for short command categories.
+- ESC-50/AudioSet-like environmental sound classifier only if the task requires non-speech detection.
+- Whisper transcript + lightweight text classifier may be enough for most assistant routing, using existing Whisper NPU `:18816`.
+
+Why:
+
+- The system already has local Whisper NPU transcription.
+- Generic audio classification needs careful threshold tuning and false-positive analysis.
+- Always-on audio processing has privacy and resource implications; keep it explicit and local.
+
+### CLI contract
+
+```bash
+python audio_classify.py \
+  --input samples/synthetic_chime.wav \
+  --model /home/will/models/openvino-audio-classifier/model.xml \
+  --device CPU \
+  --json
+```
+
+Response shape:
+
+```json
+{
+  "ok": true,
+  "source_path_basename": "synthetic_chime.wav",
+  "source_sha256": "sha256:...",
+  "sample_rate": 16000,
+  "duration_seconds": 1.2,
+  "labels": [
+    {"label": "chime", "confidence": 0.76}
+  ],
+  "device_requested": "CPU",
+  "device_observed": "CPU",
+  "npu_busy_delta_us": null,
+  "privacy": {"external_uploads": false, "raw_audio_logged": false}
+}
+```
+
+Optional HTTP should wait until a workflow exists. If it exists later, bind localhost and avoid overlap with current ports.
+
+### Smoke-test plan using non-private data
+
+1. Generate synthetic WAV files in repo-local `samples/`: sine tone, silence, white noise, simple chime, and a short synthetic spoken phrase if a local TTS fixture is available.
+2. Run CLI on each file with `--allowed-root "$PWD/samples"`.
+3. Assert JSON parses, durations are bounded, and confidence values are numeric.
+4. Do not stream microphone input or scan private audio directories in smoke tests.
+5. If NPU mode is attempted, wrap each inference in sysfs busy-time reads.
+
+### No-go / defer criteria
+
+- No concrete downstream automation consumes the labels.
+- False positives cannot be characterized on synthetic/public fixtures.
+- It competes with Whisper NPU or requires a persistent microphone daemon without explicit approval.
+
+## Wake-word path
+
+### Recommended model/runtime
+
+Recommended first runtime: CPU-only `openWakeWord` CLI/foreground process with ONNX Runtime or TFLite backend.
+
+NPU recommendation: defer. Try NPU/OpenVINO conversion only after CPU openWakeWord passes false-positive and latency checks.
+
+Why:
+
+- Wake-word detection is always-on and latency-sensitive; reliability matters more than accelerator novelty.
+- The model is small enough that CPU is likely acceptable and simpler.
+- Keeping wake-word off NPU reduces contention with Whisper NPU and embeddings.
+- openWakeWord has pre-trained models, optional VAD, and straightforward 16 kHz PCM frame APIs.
+
+### Endpoint/CLI contract
+
+CLI smoke contract:
+
+```bash
+python wake_word_smoke.py \
+  --model hey_jarvis \
+  --positive samples/synthetic_wake_positive.wav \
+  --negative samples/synthetic_noise.wav \
+  --threshold 0.5 \
+  --json
+```
+
+Foreground local stream contract, only for manual experiments:
+
+```bash
+python wake_word_listen.py \
+  --model hey_jarvis \
+  --threshold 0.5 \
+  --vad-threshold 0.3 \
+  --oneshot \
+  --json
+```
+
+Response/event shape:
+
+```json
+{
+  "ok": true,
+  "model": "hey_jarvis",
+  "runtime": "openwakeword-onnxruntime-or-tflite",
+  "device": "CPU",
+  "threshold": 0.5,
+  "events": [
+    {"offset_ms": 1280, "score": 0.83, "detected": true}
+  ],
+  "false_positive_count": 0,
+  "npu_busy_delta_us": null,
+  "privacy": {"external_uploads": false, "raw_audio_logged": false}
+}
+```
+
+If a localhost HTTP endpoint is ever needed, do not expose raw microphone streaming by default. Prefer events only:
+
+- `GET /healthz`
+- `POST /v1/wakeword/evaluate-file` for explicit files under allowed roots
+- `GET /v1/wakeword/events` for a manually started foreground listener
+
+### Smoke-test plan using non-private data
+
+1. Install in a disposable or dedicated venv, not the existing NPU venv unless explicitly approved:
+
+```bash
+python -m venv /tmp/openwakeword-smoke-venv
+/tmp/openwakeword-smoke-venv/bin/python -m pip install openwakeword
+```
+
+2. Use public/generated WAVs only:
+   - Negative: silence, white noise, generic non-wake speech/TTS if locally generated.
+   - Positive: only if a public/pretrained wake phrase fixture is available or generated explicitly for the selected model. If no positive fixture exists, run negative-only false-positive smoke and mark recall untested.
+3. Assert no false positives over a bounded negative fixture set.
+4. Measure per-frame CPU latency and max RSS.
+5. Do not start a persistent microphone listener; manual foreground `--oneshot` only if explicitly approved.
+
+### NPU busy-time verification plan
+
+Wake-word should not claim NPU in the initial path. If a later task converts a model to OpenVINO IR and targets NPU:
+
+1. Read `/sys/class/accel/accel0/device/npu_busy_time_us` before a bounded file evaluation.
+2. Run NPU inference on a fixed set of WAV frames.
+3. Read the counter after inference.
+4. Require positive delta and stable predictions matching CPU baseline.
+5. Also verify that keeping the wake-word loop active does not starve Whisper `:18816` or embeddings `:18817`.
+
+### No-go / defer criteria
+
+- CPU openWakeWord has unacceptable false positives on local negative fixtures.
+- A usable positive fixture cannot be created without recording private audio.
+- Always-on microphone capture is required before explicit approval.
+- NPU conversion changes scores materially from CPU baseline.
+- NPU loop increases contention with Whisper/embedding services.
+
+## Docs and diagram implications
+
+If these lanes advance beyond feasibility:
+
+1. Update `docs/swarm-infrastructure.md` and `docs/swarm-infrastructure.html` to keep live vs prototype labels clear.
+2. Update the OpenVINO NPU runbook with smoke commands and the sysfs busy-time proof steps.
+3. Update the Service Catalog only after a service is actually approved/live; until then list as `prototype/not live` or omit.
+4. Architecture diagrams may show:
+   - live: RAG `:18810`, Whisper NPU `:18816`, embeddings `:18817`;
+   - prototypes: reranker `:18818`, classifier/router `:18819`, GenAI worker `:18820`, doc/image triage optional `:18829`;
+   - VLM/audio/wake-word as `CLI feasibility / not live` unless a later implementation task creates a service.
+5. Do not imply Atlas/Hermes routing integration for any of these lanes without explicit approval.
+
+## Overall go/no-go decision
+
+- Go later: wake-word CPU-only CLI smoke, because it is useful and low risk if kept foreground/local.
+- Maybe later: lightweight image classifier inside existing doc/image triage, if rule fallback is not enough.
+- Defer: NPU-first VLM captioning until OpenVINO VLM-on-NPU compatibility is proven by a minimal synthetic-image smoke.
+- Defer: generic audio classification until there is a concrete assistant workflow that consumes the output.
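
Note on the patch (not part of the commit): the document asks every lane to prove NPU use the same way, by reading the sysfs busy-time counter, running one inference, reading the counter again, and requiring a positive delta. A minimal Python sketch of that check follows. The counter path comes from the document; `read_busy_us`, `measure_npu_delta`, and the `npu_verified` field are hypothetical names, and `run_inference` stands in for whatever callable performs the single synthetic-input inference.

```python
from pathlib import Path
from typing import Callable, Optional

# Counter path as given in the document; all helper names below are hypothetical.
BUSY_COUNTER = Path("/sys/class/accel/accel0/device/npu_busy_time_us")


def read_busy_us(counter: Path = BUSY_COUNTER) -> Optional[int]:
    """Return the busy-time counter in microseconds, or None if unreadable."""
    try:
        return int(counter.read_text().strip())
    except (OSError, ValueError):
        return None


def measure_npu_delta(
    run_inference: Callable[[], object],
    counter: Path = BUSY_COUNTER,
) -> dict:
    """Run one inference and report the busy-time delta observed around it."""
    before = read_busy_us(counter)
    result = run_inference()
    after = read_busy_us(counter)
    # A missing reading on either side means the delta is unknown, not zero.
    delta = None if before is None or after is None else after - before
    return {
        "result": result,
        "npu_busy_delta_us": delta,
        # Only a strictly positive delta counts as NPU evidence.
        "npu_verified": delta is not None and delta > 0,
    }
```

With this shape, an unreadable counter or a zero delta both yield `npu_verified: False`, which corresponds to the document's `NPU not verified` outcome even when the inference call itself succeeded. The document's "repeat with a second synthetic image" step would be a second call to `measure_npu_delta` with a different input.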