docs(npu): document VLM audio wake-word feasibility

This commit is contained in:
William Valentin
2026-06-04 13:07:51 -07:00
parent 2ef9e3dfd2
commit 703c1df860
@@ -0,0 +1,388 @@
# OpenVINO/NPU VLM, audio, and wake-word feasibility
Date: 2026-06-04
Scope: feasibility/spec only for lower-priority assistant sidecars. This document does not enable services, alter Atlas/Hermes/gateway routing, mutate RAG/Chroma/vector collections, or process private document/image directories.
## Existing baseline and constraints
Live baseline discovered by parent task:
- RAG endpoint: `127.0.0.1:18810`
- RAG health wrapper: `127.0.0.1:18814`
- Whisper OpenVINO NPU: `127.0.0.1:18816`
- OpenVINO embeddings: `127.0.0.1:18817`
- Prototype ports currently reserved/not live: reranker `:18818`, classifier/router `:18819`, GenAI worker `:18820`, optional doc/image triage `:18829`
Local NPU runtime snapshot from the feasibility run:
- `/home/will/.venvs/npu` has `openvino==2026.2.0` and `openvino-genai==2026.2.0.0`.
- `openvino.Core().available_devices` reports `CPU`, `GPU.0`, `GPU.1`, and `NPU`.
- NPU device name: `Intel(R) AI Boost`.
- NPU claims must be verified by positive `/sys/class/accel/accel0/device/npu_busy_time_us` deltas around inference.
External release/project signals checked:
- OpenVINO 2026.2.0 release notes mention broader GenAI coverage and VLM samples, but the VLM acceleration notes are CPU/GPU-oriented; they do not provide a clear low-risk NPU VLM path.
- Prior OpenVINO release notes/search results mention OpenVINO Model Server VLM support for Qwen2-VL, Phi-3.5-Vision, and InternVL2.
- `openWakeWord` is an active Apache-2.0 local wake-word framework with ONNX Runtime/TFLite support, pre-trained wake-word models, optional VAD, and 16 kHz PCM streaming examples. It is not installed in the current NPU venv.
## Recommendation summary
| Lane | Recommendation | Priority | Why |
| --- | --- | --- | --- |
| VLM / image captioning | Defer NPU-first VLM. If pursued, prototype CPU/GPU VLM CLI first, then attempt NPU only after model/runtime compatibility is proven. | Low | NPU support for VLMs is not clearly mature in the current OpenVINO public notes; VLMs are memory/op-shape heavy; failures could be slow and noisy. Existing doc/image triage already covers practical local image metadata without a full VLM. |
| Lightweight image classification / caption fallback | Extend the existing `openvino-doc-image-triage-npu` lane before adding a new service. | Medium-low | It already has privacy boundaries, synthetic fixtures, CLI/server split, and NPU proof through embeddings. Add static-shape classifier only if a later task needs image labels beyond rule fallback. |
| Audio classification | Defer until a concrete assistant workflow needs it. Consider CPU/GPU/OpenVINO Runtime prototype using Speech Commands/ESC-style classifier before any daemon. | Low | Whisper NPU already covers transcription. Generic audio tags are less useful without a routing/product requirement and need dataset-specific threshold tuning. |
| Wake word | Worth a small CPU-only local smoke prototype; do not spend NPU time first. | Medium | Wake-word detection must be always-on, tiny, and reliable. CPU openWakeWord/ONNX/TFLite is the lowest-risk path and avoids starving existing NPU Whisper/embedding services. NPU use is only worth testing after CPU false-positive/latency behavior is acceptable. |
## VLM / image-captioning path
### Recommended model/runtime
Initial runtime: CLI-first OpenVINO GenAI or OpenVINO Model Server on CPU/GPU, not NPU-first.
Candidate models to evaluate, in order:
1. `Qwen2-VL-2B-Instruct` OpenVINO/OVMS-compatible export if a small converted artifact is already available.
2. `Phi-3.5-Vision-Instruct` only if memory/startup is acceptable.
3. `InternVL2` only as a compatibility reference; likely too heavy for a low-priority local assistant sidecar.
Why this order:
- Qwen2-VL is broadly supported by OpenVINO Model Server release notes/search results and has smaller variants.
- Phi-3.5-Vision is also named in OpenVINO Model Server VLM support, but may be heavier.
- NPU is not the first target because public OpenVINO 2026.2 release notes emphasize VLM improvements for CPU/GPU, not NPU. Treat NPU VLM as experimental until a smoke test proves compilation and positive busy-time deltas.
### Endpoint/CLI contract
CLI-first contract:
```bash
python vlm_caption.py \
--image /path/to/synthetic_or_explicitly_allowed_image.png \
--prompt "Describe this image in one sentence." \
--device CPU \
--max-new-tokens 96 \
--json
```
Response shape:
```json
{
"ok": true,
"media_type": "image",
"source_path_basename": "synthetic_scene.png",
"source_sha256": "sha256:...",
"model": "qwen2-vl-small-openvino",
"runtime": "openvino-genai-or-ovms",
"device_requested": "CPU",
"device_observed": "CPU",
"caption": "A synthetic chart with three colored bars.",
"safety": {
"external_uploads": false,
"raw_image_logged": false,
"private_paths_allowed": false
},
"timing_ms": {
"load": 0,
"inference": 0,
"total": 0
},
"npu_busy_delta_us": null
}
```
Optional localhost HTTP contract, only after CLI is stable:
- Bind: `127.0.0.1:18829` or another explicitly approved unused prototype port.
- `GET /healthz`
- `GET /models`
- `POST /v1/vision/caption`
Request body:
```json
{
"path": "/allowed/root/synthetic_scene.png",
"prompt": "Describe this image in one sentence.",
"max_new_tokens": 96,
"device": "CPU"
}
```
### Smoke-test plan using non-private data
Use only generated fixtures under the repo, similar to `openvino-doc-image-triage-npu/samples/`:
1. Create synthetic PNGs: simple chart, receipt-like image, screenshot-like text panel, and blank/noisy image.
2. Run CLI with `--allowed-root "$PWD/samples"` and assert:
- JSON parses.
- `external_uploads=false`.
- only basename and SHA-256 are returned by default.
- captions are non-empty and under a configured token/character limit.
- unsupported/private paths are rejected.
3. If an HTTP server is added, start it in foreground on `127.0.0.1`, call `/healthz` and `/v1/vision/caption`, then stop it.
4. No private image/document folders and no Obsidian vault content should be used for smoke tests.
### NPU busy-time verification plan
Only claim NPU VLM if all of these pass:
1. Verify the counter is readable:
```bash
BUSY=/sys/class/accel/accel0/device/npu_busy_time_us
test -r "$BUSY" && before=$(cat "$BUSY")
```
2. Run exactly one synthetic-image inference with `device=NPU`.
3. Read `after=$(cat "$BUSY")`.
4. Require `after - before > 0` and a response-level `npu_busy_delta_us > 0` if the server reports it.
5. Repeat with a second synthetic image to avoid counting unrelated startup activity only.
6. If HTTP returns 200 but the sysfs delta is zero, document as `NPU not verified` and do not call it an NPU service.
### No-go / defer criteria
Defer VLM NPU work if any apply:
- Model export/compile to NPU fails or requires unsupported ops/custom patches.
- First successful inference needs more than 60 seconds cold or more than 10 seconds warm for a small synthetic image.
- NPU busy-time delta is zero or inconsistent.
- Memory pressure disrupts Whisper `:18816`, embeddings `:18817`, or RAG `:18810`.
- The only useful path requires processing private images/docs before synthetic smoke tests are stable.
- Captions are too hallucination-prone for automation decisions without a human-review gate.
## Lightweight image triage/classification path
### Recommended model/runtime
Recommended near-term path: keep `openvino-doc-image-triage-npu` as the primary image/document lane and add only a static-shape classifier if rule fallback becomes inadequate.
Candidate classifier families for a later task:
- MobileNetV3/EfficientNet-Lite/ResNet-18 style image classifier exported to OpenVINO IR.
- Use NPU only if the IR compiles with static shapes and produces positive busy-time deltas.
- Keep OCR/PDF rendering CPU-local; do not try to force OCR onto NPU in this phase.
Why:
- The current triage prototype already has the right privacy contract and reports CPU vs NPU stages.
- A small classifier is much lower risk than a VLM and can be used for labels like `screenshot`, `receipt`, `document`, `photo`, `chart`.
### Endpoint/CLI contract
Extend existing CLI shape rather than introduce a new daemon:
```bash
/home/will/.venvs/npu/bin/python triage.py \
--allowed-root "$PWD" \
--image-classifier-model /home/will/models/openvino-image-classifier/model.xml \
--image-classifier-device NPU \
--pretty \
samples/synthetic_invoice.png
```
Response addition:
```json
{
"classification": {
"label": "receipt_or_invoice",
"confidence": 0.82,
"device": "NPU",
"method": "openvino_image_classifier",
"npu_busy_delta_us": 12345
}
}
```
### Smoke-test plan
Reuse `openvino-doc-image-triage-npu/make_samples.py` and `tests/smoke_test.py`; add synthetic image-label assertions only after a classifier model exists. Keep `--no-embeddings` mode available so the smoke suite can separate classifier NPU proof from embeddings `:18817` proof.
### No-go / defer criteria
- Static-shape classifier cannot compile on NPU.
- Labels are not useful enough to drive an assistant workflow.
- Classifier output duplicates the existing rule-based fallback.
## Audio classification path
### Recommended model/runtime
Defer implementation. If a concrete workflow appears, start with a CLI-only OpenVINO Runtime classifier on CPU/GPU using synthetic/public audio fixtures, not a persistent service.
Potential model classes:
- Speech Commands keyword classifier for short command categories.
- ESC-50/AudioSet-like environmental sound classifier only if the task requires non-speech detection.
- Whisper transcript + lightweight text classifier may be enough for most assistant routing, using existing Whisper NPU `:18816`.
Why:
- The system already has local Whisper NPU transcription.
- Generic audio classification needs careful threshold tuning and false-positive analysis.
- Always-on audio processing has privacy and resource implications; keep it explicit and local.
### CLI contract
```bash
python audio_classify.py \
--input samples/synthetic_chime.wav \
--model /home/will/models/openvino-audio-classifier/model.xml \
--device CPU \
--json
```
Response shape:
```json
{
"ok": true,
"source_path_basename": "synthetic_chime.wav",
"source_sha256": "sha256:...",
"sample_rate": 16000,
"duration_seconds": 1.2,
"labels": [
{"label": "chime", "confidence": 0.76}
],
"device_requested": "CPU",
"device_observed": "CPU",
"npu_busy_delta_us": null,
"privacy": {"external_uploads": false, "raw_audio_logged": false}
}
```
Optional HTTP should wait until a workflow exists. If it exists later, bind localhost and avoid overlap with current ports.
### Smoke-test plan using non-private data
1. Generate synthetic WAV files in repo-local `samples/`: sine tone, silence, white noise, simple chime, and a short synthetic spoken phrase if a local TTS fixture is available.
2. Run CLI on each file with `--allowed-root "$PWD/samples"`.
3. Assert JSON parses, durations are bounded, and confidence values are numeric.
4. Do not stream microphone input or scan private audio directories in smoke tests.
5. If NPU mode is attempted, wrap each inference in sysfs busy-time reads.
### No-go / defer criteria
- No concrete downstream automation consumes the labels.
- False positives cannot be characterized on synthetic/public fixtures.
- It competes with Whisper NPU or requires a persistent microphone daemon without explicit approval.
## Wake-word path
### Recommended model/runtime
Recommended first runtime: CPU-only `openWakeWord` CLI/foreground process with ONNX Runtime or TFLite backend.
NPU recommendation: defer. Try NPU/OpenVINO conversion only after CPU openWakeWord passes false-positive and latency checks.
Why:
- Wake-word detection is always-on and latency-sensitive; reliability matters more than accelerator novelty.
- The model is small enough that CPU is likely acceptable and simpler.
- Keeping wake-word off NPU reduces contention with Whisper NPU and embeddings.
- openWakeWord has pre-trained models, optional VAD, and straightforward 16 kHz PCM frame APIs.
### Endpoint/CLI contract
CLI smoke contract:
```bash
python wake_word_smoke.py \
--model hey_jarvis \
--positive samples/synthetic_wake_positive.wav \
--negative samples/synthetic_noise.wav \
--threshold 0.5 \
--json
```
Foreground local stream contract, only for manual experiments:
```bash
python wake_word_listen.py \
--model hey_jarvis \
--threshold 0.5 \
--vad-threshold 0.3 \
--oneshot \
--json
```
Response/event shape:
```json
{
"ok": true,
"model": "hey_jarvis",
"runtime": "openwakeword-onnxruntime-or-tflite",
"device": "CPU",
"threshold": 0.5,
"events": [
{"offset_ms": 1280, "score": 0.83, "detected": true}
],
"false_positive_count": 0,
"npu_busy_delta_us": null,
"privacy": {"external_uploads": false, "raw_audio_logged": false}
}
```
If a localhost HTTP endpoint is ever needed, do not expose raw microphone streaming by default. Prefer events only:
- `GET /healthz`
- `POST /v1/wakeword/evaluate-file` for explicit files under allowed roots
- `GET /v1/wakeword/events` for a manually started foreground listener
### Smoke-test plan using non-private data
1. Install in a disposable or dedicated venv, not the existing NPU venv unless explicitly approved:
```bash
python -m venv /tmp/openwakeword-smoke-venv
/tmp/openwakeword-smoke-venv/bin/python -m pip install openwakeword
```
2. Use public/generated WAVs only:
- Negative: silence, white noise, generic non-wake speech/TTS if locally generated.
- Positive: only if a public/pretrained wake phrase fixture is available or generated explicitly for the selected model. If no positive fixture exists, run negative-only false-positive smoke and mark recall untested.
3. Assert no false positives over a bounded negative fixture set.
4. Measure per-frame CPU latency and max RSS.
5. Do not start a persistent microphone listener; manual foreground `--oneshot` only if explicitly approved.
### NPU busy-time verification plan
Wake-word should not claim NPU in the initial path. If a later task converts a model to OpenVINO IR and targets NPU:
1. Read `/sys/class/accel/accel0/device/npu_busy_time_us` before a bounded file evaluation.
2. Run NPU inference on a fixed set of WAV frames.
3. Read the counter after inference.
4. Require positive delta and stable predictions matching CPU baseline.
5. Also verify that keeping the wake-word loop active does not starve Whisper `:18816` or embeddings `:18817`.
### No-go / defer criteria
- CPU openWakeWord has unacceptable false positives on local negative fixtures.
- A usable positive fixture cannot be created without recording private audio.
- Always-on microphone capture is required before explicit approval.
- NPU conversion changes scores materially from CPU baseline.
- NPU loop increases contention with Whisper/embedding services.
## Docs and diagram implications
If these lanes advance beyond feasibility:
1. Update `docs/swarm-infrastructure.md` and `docs/swarm-infrastructure.html` to keep live vs prototype labels clear.
2. Update the OpenVINO NPU runbook with smoke commands and the sysfs busy-time proof steps.
3. Update the Service Catalog only after a service is actually approved/live; until then list as `prototype/not live` or omit.
4. Architecture diagrams may show:
- live: RAG `:18810`, Whisper NPU `:18816`, embeddings `:18817`;
- prototypes: reranker `:18818`, classifier/router `:18819`, GenAI worker `:18820`, doc/image triage optional `:18829`;
- VLM/audio/wake-word as `CLI feasibility / not live` unless a later implementation task creates a service.
5. Do not imply Atlas/Hermes routing integration for any of these lanes without explicit approval.
## Overall go/no-go decision
- Go later: wake-word CPU-only CLI smoke, because it is useful and low risk if kept foreground/local.
- Maybe later: lightweight image classifier inside existing doc/image triage, if rule fallback is not enough.
- Defer: NPU-first VLM captioning until OpenVINO VLM-on-NPU compatibility is proven by a minimal synthetic-image smoke.
- Defer: generic audio classification until there is a concrete assistant workflow that consumes the output.