docs(npu): document VLM audio wake-word feasibility
@@ -0,0 +1,388 @@
# OpenVINO/NPU VLM, audio, and wake-word feasibility

Date: 2026-06-04

Scope: feasibility/spec only for lower-priority assistant sidecars. This document does not enable services, alter Atlas/Hermes/gateway routing, mutate RAG/Chroma/vector collections, or process private document/image directories.

## Existing baseline and constraints

Live baseline discovered by parent task:

- RAG endpoint: `127.0.0.1:18810`
- RAG health wrapper: `127.0.0.1:18814`
- Whisper OpenVINO NPU: `127.0.0.1:18816`
- OpenVINO embeddings: `127.0.0.1:18817`
- Prototype ports currently reserved/not live: reranker `:18818`, classifier/router `:18819`, GenAI worker `:18820`, optional doc/image triage `:18829`
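
The live versus reserved split above can be re-checked with a plain TCP connect. This is a minimal sketch, not part of the feasibility run: the port labels are copied from the list, while `is_listening` and `probe` are hypothetical helper names, and a successful connect only shows that something is listening, not that the expected service is healthy.

```python
import socket

# Labels copied from the baseline list; the mapping itself is illustrative.
BASELINE_PORTS = {
    18810: "RAG endpoint",
    18814: "RAG health wrapper",
    18816: "Whisper OpenVINO NPU",
    18817: "OpenVINO embeddings",
    18818: "reranker (prototype)",
    18819: "classifier/router (prototype)",
    18820: "GenAI worker (prototype)",
    18829: "doc/image triage (prototype, optional)",
}


def is_listening(port: int, host: str = "127.0.0.1", timeout: float = 0.25) -> bool:
    """Return True if a TCP connect to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def probe(ports=BASELINE_PORTS) -> dict:
    """Map each port to 'live' or 'not live' based on a connect attempt only."""
    return {port: ("live" if is_listening(port) else "not live") for port in ports}


if __name__ == "__main__":
    for port, state in probe().items():
        print(f"{port}\t{state}\t{BASELINE_PORTS[port]}")
```

The probe never sends a request body, so it should be safe to run against the live services.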

Local NPU runtime snapshot from the feasibility run:

- `/home/will/.venvs/npu` has `openvino==2026.2.0` and `openvino-genai==2026.2.0.0`.
- `openvino.Core().available_devices` reports `CPU`, `GPU.0`, `GPU.1`, and `NPU`.
- NPU device name: `Intel(R) AI Boost`.
- NPU claims must be verified by positive `/sys/class/accel/accel0/device/npu_busy_time_us` deltas around inference.

External release/project signals checked:

- OpenVINO 2026.2.0 release notes mention broader GenAI coverage and VLM samples, but the VLM acceleration notes are CPU/GPU-oriented; they do not provide a clear low-risk NPU VLM path.
- Prior OpenVINO release notes/search results mention OpenVINO Model Server VLM support for Qwen2-VL, Phi-3.5-Vision, and InternVL2.
- `openWakeWord` is an active Apache-2.0 local wake-word framework with ONNX Runtime/TFLite support, pre-trained wake-word models, optional VAD, and 16 kHz PCM streaming examples. It is not installed in the current NPU venv.

## Recommendation summary

| Lane | Recommendation | Priority | Why |
| --- | --- | --- | --- |
| VLM / image captioning | Defer NPU-first VLM. If pursued, prototype CPU/GPU VLM CLI first, then attempt NPU only after model/runtime compatibility is proven. | Low | NPU support for VLMs is not clearly mature in the current OpenVINO public notes; VLMs are memory/op-shape heavy; failures could be slow and noisy. Existing doc/image triage already covers practical local image metadata without a full VLM. |
| Lightweight image classification / caption fallback | Extend the existing `openvino-doc-image-triage-npu` lane before adding a new service. | Medium-low | It already has privacy boundaries, synthetic fixtures, CLI/server split, and NPU proof through embeddings. Add static-shape classifier only if a later task needs image labels beyond rule fallback. |
| Audio classification | Defer until a concrete assistant workflow needs it. Consider CPU/GPU/OpenVINO Runtime prototype using Speech Commands/ESC-style classifier before any daemon. | Low | Whisper NPU already covers transcription. Generic audio tags are less useful without a routing/product requirement and need dataset-specific threshold tuning. |
| Wake word | Worth a small CPU-only local smoke prototype; do not spend NPU time first. | Medium | Wake-word detection must be always-on, tiny, and reliable. CPU openWakeWord/ONNX/TFLite is the lowest-risk path and avoids starving existing NPU Whisper/embedding services. NPU use is only worth testing after CPU false-positive/latency behavior is acceptable. |

## VLM / image-captioning path

### Recommended model/runtime

Initial runtime: CLI-first OpenVINO GenAI or OpenVINO Model Server on CPU/GPU, not NPU-first.

Candidate models to evaluate, in order:

1. `Qwen2-VL-2B-Instruct` OpenVINO/OVMS-compatible export if a small converted artifact is already available.
2. `Phi-3.5-Vision-Instruct` only if memory/startup is acceptable.
3. `InternVL2` only as a compatibility reference; likely too heavy for a low-priority local assistant sidecar.

Why this order:

- Qwen2-VL is broadly supported by OpenVINO Model Server release notes/search results and has smaller variants.
- Phi-3.5-Vision is also named in OpenVINO Model Server VLM support, but may be heavier.
- NPU is not the first target because public OpenVINO 2026.2 release notes emphasize VLM improvements for CPU/GPU, not NPU. Treat NPU VLM as experimental until a smoke test proves compilation and positive busy-time deltas.

### Endpoint/CLI contract

CLI-first contract:

```bash
python vlm_caption.py \
  --image /path/to/synthetic_or_explicitly_allowed_image.png \
  --prompt "Describe this image in one sentence." \
  --device CPU \
  --max-new-tokens 96 \
  --json
```

Response shape:

```json
{
  "ok": true,
  "media_type": "image",
  "source_path_basename": "synthetic_scene.png",
  "source_sha256": "sha256:...",
  "model": "qwen2-vl-small-openvino",
  "runtime": "openvino-genai-or-ovms",
  "device_requested": "CPU",
  "device_observed": "CPU",
  "caption": "A synthetic chart with three colored bars.",
  "safety": {
    "external_uploads": false,
    "raw_image_logged": false,
    "private_paths_allowed": false
  },
  "timing_ms": {
    "load": 0,
    "inference": 0,
    "total": 0
  },
  "npu_busy_delta_us": null
}
```
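
One way the response shape above might be assembled is sketched below. The field names come from the JSON example; `sha256_of` and `build_caption_response` are hypothetical helpers, and treating `total` as the sum of `load` and `inference` is an assumption of this sketch rather than something the contract states.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash the file in 1 MiB chunks so large images are not loaded at once."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return "sha256:" + digest.hexdigest()


def build_caption_response(image: Path, caption: str, model: str, runtime: str,
                           device_requested: str, device_observed: str,
                           load_ms: int, inference_ms: int,
                           npu_busy_delta_us=None) -> dict:
    """Assemble the response; only the basename and digest identify the source."""
    return {
        "ok": True,
        "media_type": "image",
        "source_path_basename": image.name,  # never the full path
        "source_sha256": sha256_of(image),
        "model": model,
        "runtime": runtime,
        "device_requested": device_requested,
        "device_observed": device_observed,
        "caption": caption,
        "safety": {
            "external_uploads": False,
            "raw_image_logged": False,
            "private_paths_allowed": False,
        },
        # Assumption: total is simply load + inference.
        "timing_ms": {"load": load_ms, "inference": inference_ms,
                      "total": load_ms + inference_ms},
        "npu_busy_delta_us": npu_busy_delta_us,
    }


def to_json(response: dict) -> str:
    """Serialize for the --json flag."""
    return json.dumps(response, indent=2)
```

Keeping `device_requested` and `device_observed` as separate arguments makes a silent CPU fallback visible in the output instead of hiding it.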

Optional localhost HTTP contract, only after CLI is stable:

- Bind: `127.0.0.1:18829` or another explicitly approved unused prototype port.
- `GET /healthz`
- `GET /models`
- `POST /v1/vision/caption`

Request body:

```json
{
  "path": "/allowed/root/synthetic_scene.png",
  "prompt": "Describe this image in one sentence.",
  "max_new_tokens": 96,
  "device": "CPU"
}
```
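
A request body like the one above would need validating before any model is touched. The following is a sketch under stated assumptions: `validate_caption_request` is a hypothetical function, the accepted device set and the `MAX_NEW_TOKENS_LIMIT` cap are illustrative values not taken from the spec, and `Path.is_relative_to` requires Python 3.9 or newer.

```python
from pathlib import Path

ALLOWED_DEVICES = {"CPU", "GPU", "NPU"}  # assumption: devices this sketch accepts
MAX_NEW_TOKENS_LIMIT = 256               # hypothetical cap, not from the spec


def validate_caption_request(body: dict, allowed_root: Path) -> dict:
    """Return a normalized request or raise ValueError with the reason."""
    raw_path = body.get("path")
    if not isinstance(raw_path, str) or not raw_path:
        raise ValueError("path must be a non-empty string")

    # resolve() follows symlinks, so a link pointing outside the root is rejected.
    root = allowed_root.resolve()
    path = Path(raw_path).resolve()
    if not path.is_relative_to(root):
        raise ValueError("path is outside the allowed root")
    if not path.is_file():
        raise ValueError("path is not a regular file")

    device = body.get("device", "CPU")
    if device not in ALLOWED_DEVICES:
        raise ValueError("unsupported device")

    tokens = body.get("max_new_tokens", 96)
    if isinstance(tokens, bool) or not isinstance(tokens, int):
        raise ValueError("max_new_tokens must be an integer")
    if not 1 <= tokens <= MAX_NEW_TOKENS_LIMIT:
        raise ValueError("max_new_tokens out of range")

    prompt = body.get("prompt", "Describe this image in one sentence.")
    if not isinstance(prompt, str) or not prompt.strip():
        raise ValueError("prompt must be a non-empty string")

    return {"path": path, "prompt": prompt, "max_new_tokens": tokens, "device": device}
```

Rejecting on the resolved path rather than the raw string is what keeps `..` segments and symlinks from escaping the allowed root.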

### Smoke-test plan using non-private data

Use only generated fixtures under the repo, similar to `openvino-doc-image-triage-npu/samples/`:

1. Create synthetic PNGs: simple chart, receipt-like image, screenshot-like text panel, and blank/noisy image.
2. Run CLI with `--allowed-root "$PWD/samples"` and assert:
   - JSON parses.
   - `external_uploads=false`.
   - only basename and SHA-256 are returned by default.
   - captions are non-empty and under a configured token/character limit.
   - unsupported/private paths are rejected.
3. If an HTTP server is added, start it in foreground on `127.0.0.1`, call `/healthz` and `/v1/vision/caption`, then stop it.
4. No private image/document folders and no Obsidian vault content should be used for smoke tests.
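
The assertions in step 2 could be collected into one checker that a smoke test runs over the CLI's stdout. This is an illustrative sketch: `check_caption_output` is a hypothetical name, and the 300-character caption limit stands in for the "configured" limit, which the plan does not specify.

```python
import json


def check_caption_output(raw: str, max_caption_chars: int = 300) -> list:
    """Return a list of failure messages; an empty list means the output passes."""
    try:
        doc = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"JSON does not parse: {exc}"]
    if not isinstance(doc, dict):
        return ["top-level JSON value is not an object"]

    failures = []
    safety = doc.get("safety")
    if not isinstance(safety, dict) or safety.get("external_uploads") is not False:
        failures.append("safety.external_uploads is not false")

    # Only a basename and a digest may identify the source by default.
    basename = doc.get("source_path_basename", "")
    if not basename or "/" in basename or "\\" in basename:
        failures.append("source_path_basename is missing or contains a directory")
    if not str(doc.get("source_sha256", "")).startswith("sha256:"):
        failures.append("source_sha256 is missing or malformed")

    caption = doc.get("caption")
    if not isinstance(caption, str) or not caption.strip():
        failures.append("caption is empty")
    elif len(caption) > max_caption_chars:
        failures.append("caption exceeds the configured character limit")
    return failures
```

Returning every failure at once, instead of stopping at the first, tends to make a failing smoke run easier to read.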

### NPU busy-time verification plan

Only claim NPU VLM if all of these pass:

1. Verify the counter is readable:

   ```bash
   BUSY=/sys/class/accel/accel0/device/npu_busy_time_us
   test -r "$BUSY" && before=$(cat "$BUSY")
   ```

2. Run exactly one synthetic-image inference with `device=NPU`.
3. Read `after=$(cat "$BUSY")`.
4. Require `after - before > 0` and a response-level `npu_busy_delta_us > 0` if the server reports it.
5. Repeat with a second synthetic image to avoid counting unrelated startup activity only.
6. If HTTP returns 200 but the sysfs delta is zero, document as `NPU not verified` and do not call it an NPU service.
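
The before/after reads in the steps above can be wrapped around any inference callable. This is a minimal sketch: the sysfs path is the one named in this document, while `measure_busy_delta` and `npu_verified` are hypothetical helper names. Other NPU clients such as Whisper or embeddings also advance the counter, so a positive delta is only meaningful while they are idle.

```python
from pathlib import Path

BUSY_PATH = Path("/sys/class/accel/accel0/device/npu_busy_time_us")


def read_busy_us(path: Path = BUSY_PATH) -> int:
    """Read the cumulative NPU busy time in microseconds."""
    return int(path.read_text().strip())


def measure_busy_delta(run_inference, path: Path = BUSY_PATH):
    """Run one inference and return (result, busy-time delta in microseconds)."""
    before = read_busy_us(path)
    result = run_inference()
    after = read_busy_us(path)
    return result, after - before


def npu_verified(deltas) -> bool:
    """Require at least two runs, each with a strictly positive delta (steps 4 and 5)."""
    return len(deltas) >= 2 and all(delta > 0 for delta in deltas)
```

A caller would run `measure_busy_delta` once per synthetic image and pass the collected deltas to `npu_verified`; anything else is reported as `NPU not verified`.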

### No-go / defer criteria

Defer VLM NPU work if any apply:

- Model export/compile to NPU fails or requires unsupported ops/custom patches.
- First successful inference needs more than 60 seconds cold or more than 10 seconds warm for a small synthetic image.
- NPU busy-time delta is zero or inconsistent.
- Memory pressure disrupts Whisper `:18816`, embeddings `:18817`, or RAG `:18810`.
- The only useful path requires processing private images/docs before synthetic smoke tests are stable.
- Captions are too hallucination-prone for automation decisions without a human-review gate.
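
The measurable criteria in this list (compile success, the 60 s cold and 10 s warm limits, and the busy-time delta) lend themselves to an automated gate. The sketch below covers only those; memory pressure, privacy, and caption quality still need human judgment. `vlm_npu_defer_reasons` is a hypothetical name.

```python
COLD_LIMIT_S = 60.0  # from the criteria above
WARM_LIMIT_S = 10.0  # from the criteria above


def vlm_npu_defer_reasons(compiled: bool, cold_s: float, warm_s: float,
                          busy_deltas_us) -> list:
    """Return the measurable defer reasons that apply; empty means none tripped."""
    reasons = []
    if not compiled:
        reasons.append("model did not compile for NPU")
    if cold_s > COLD_LIMIT_S:
        reasons.append(f"cold inference took {cold_s:.1f}s (limit {COLD_LIMIT_S:.0f}s)")
    if warm_s > WARM_LIMIT_S:
        reasons.append(f"warm inference took {warm_s:.1f}s (limit {WARM_LIMIT_S:.0f}s)")
    # A missing, zero, or negative delta on any run counts as zero or inconsistent.
    if not busy_deltas_us or any(delta <= 0 for delta in busy_deltas_us):
        reasons.append("NPU busy-time delta is zero or inconsistent")
    return reasons
```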

## Lightweight image triage/classification path

### Recommended model/runtime

Recommended near-term path: keep `openvino-doc-image-triage-npu` as the primary image/document lane and add only a static-shape classifier if rule fallback becomes inadequate.

Candidate classifier families for a later task:

- MobileNetV3/EfficientNet-Lite/ResNet-18 style image classifier exported to OpenVINO IR.
- Use NPU only if the IR compiles with static shapes and produces positive busy-time deltas.
- Keep OCR/PDF rendering CPU-local; do not try to force OCR onto NPU in this phase.

Why:

- The current triage prototype already has the right privacy contract and reports CPU vs NPU stages.
- A small classifier is much lower risk than a VLM and can be used for labels like `screenshot`, `receipt`, `document`, `photo`, `chart`.

### Endpoint/CLI contract

Extend existing CLI shape rather than introduce a new daemon:

```bash
/home/will/.venvs/npu/bin/python triage.py \
  --allowed-root "$PWD" \
  --image-classifier-model /home/will/models/openvino-image-classifier/model.xml \
  --image-classifier-device NPU \
  --pretty \
  samples/synthetic_invoice.png
```

Response addition:

```json
{
  "classification": {
    "label": "receipt_or_invoice",
    "confidence": 0.82,
    "device": "NPU",
    "method": "openvino_image_classifier",
    "npu_busy_delta_us": 12345
  }
}
```
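
To illustrate how raw classifier outputs might map onto the `classification` object above, here is a sketch that applies a softmax to a logits vector and keeps the top label. It assumes the classifier emits one logit per label; the label list reuses the example labels from this section and is not a fixed schema, and `classify` is a hypothetical helper.

```python
import math

# Example labels taken from this section; a real model defines its own list.
LABELS = ["screenshot", "receipt", "document", "photo", "chart"]


def classify(logits, labels=LABELS, device="CPU", npu_busy_delta_us=None) -> dict:
    """Softmax the logits and return the top-1 label in the response shape."""
    if len(logits) != len(labels):
        raise ValueError("expected one logit per label")
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    probs = [value / total for value in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return {
        "classification": {
            "label": labels[best],
            "confidence": round(probs[best], 2),
            "device": device,
            "method": "openvino_image_classifier",
            "npu_busy_delta_us": npu_busy_delta_us,
        }
    }
```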

### Smoke-test plan

Reuse `openvino-doc-image-triage-npu/make_samples.py` and `tests/smoke_test.py`; add synthetic image-label assertions only after a classifier model exists. Keep `--no-embeddings` mode available so the smoke suite can separate classifier NPU proof from embeddings `:18817` proof.

### No-go / defer criteria

- Static-shape classifier cannot compile on NPU.
- Labels are not useful enough to drive an assistant workflow.
- Classifier output duplicates the existing rule-based fallback.

## Audio classification path

### Recommended model/runtime

Defer implementation. If a concrete workflow appears, start with a CLI-only OpenVINO Runtime classifier on CPU/GPU using synthetic/public audio fixtures, not a persistent service.

Potential model classes:

- Speech Commands keyword classifier for short command categories.
- ESC-50/AudioSet-like environmental sound classifier only if the task requires non-speech detection.
- Whisper transcript + lightweight text classifier may be enough for most assistant routing, using existing Whisper NPU `:18816`.

Why:

- The system already has local Whisper NPU transcription.
- Generic audio classification needs careful threshold tuning and false-positive analysis.
- Always-on audio processing has privacy and resource implications; keep it explicit and local.

### CLI contract

```bash
python audio_classify.py \
  --input samples/synthetic_chime.wav \
  --model /home/will/models/openvino-audio-classifier/model.xml \
  --device CPU \
  --json
```

Response shape:

```json
{
  "ok": true,
  "source_path_basename": "synthetic_chime.wav",
  "source_sha256": "sha256:...",
  "sample_rate": 16000,
  "duration_seconds": 1.2,
  "labels": [
    {"label": "chime", "confidence": 0.76}
  ],
  "device_requested": "CPU",
  "device_observed": "CPU",
  "npu_busy_delta_us": null,
  "privacy": {"external_uploads": false, "raw_audio_logged": false}
}
```

Optional HTTP should wait until a workflow exists. If it exists later, bind localhost and avoid overlap with current ports.

### Smoke-test plan using non-private data

1. Generate synthetic WAV files in repo-local `samples/`: sine tone, silence, white noise, simple chime, and a short synthetic spoken phrase if a local TTS fixture is available.
2. Run CLI on each file with `--allowed-root "$PWD/samples"`.
3. Assert JSON parses, durations are bounded, and confidence values are numeric.
4. Do not stream microphone input or scan private audio directories in smoke tests.
5. If NPU mode is attempted, wrap each inference in sysfs busy-time reads.
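
Step 1 needs no recorded audio at all. The sketch below generates three of the listed fixtures (sine tone, silence, white noise) as 16 kHz mono 16-bit PCM using only the standard library. The filenames, the 440 Hz tone, and the amplitudes are illustrative choices; the chime and TTS fixtures are left out.

```python
import math
import random
import struct
import wave
from pathlib import Path

SAMPLE_RATE = 16000  # matches the sample_rate in the response shape


def write_wav(path: Path, samples) -> None:
    """Write float samples in [-1, 1] as 16-bit mono PCM."""
    pcm = b"".join(
        struct.pack("<h", max(-32768, min(32767, int(sample * 32767))))
        for sample in samples
    )
    with wave.open(str(path), "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(SAMPLE_RATE)
        wav.writeframes(pcm)


def make_fixtures(out_dir, seconds: float = 1.0, seed: int = 0) -> list:
    """Create the synthetic WAVs and return their filenames, sorted."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    count = int(SAMPLE_RATE * seconds)
    rng = random.Random(seed)  # seeded so the noise fixture is reproducible
    fixtures = {
        "synthetic_sine.wav": [
            0.5 * math.sin(2 * math.pi * 440 * i / SAMPLE_RATE) for i in range(count)
        ],
        "synthetic_silence.wav": [0.0] * count,
        "synthetic_noise.wav": [rng.uniform(-0.3, 0.3) for _ in range(count)],
    }
    for name, samples in fixtures.items():
        write_wav(out_dir / name, samples)
    return sorted(fixtures)


if __name__ == "__main__":
    print(make_fixtures("samples"))
```

Seeding the noise generator keeps the fixture's SHA-256 stable across runs, which matters if the smoke suite compares `source_sha256` values.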

### No-go / defer criteria

- No concrete downstream automation consumes the labels.
- False positives cannot be characterized on synthetic/public fixtures.
- It competes with Whisper NPU or requires a persistent microphone daemon without explicit approval.

## Wake-word path

### Recommended model/runtime

Recommended first runtime: CPU-only `openWakeWord` CLI/foreground process with ONNX Runtime or TFLite backend.

NPU recommendation: defer. Try NPU/OpenVINO conversion only after CPU openWakeWord passes false-positive and latency checks.

Why:

- Wake-word detection is always-on and latency-sensitive; reliability matters more than accelerator novelty.
- The model is small enough that CPU is likely acceptable and simpler.
- Keeping wake-word off NPU reduces contention with Whisper NPU and embeddings.
- openWakeWord has pre-trained models, optional VAD, and straightforward 16 kHz PCM frame APIs.

### Endpoint/CLI contract

CLI smoke contract:

```bash
python wake_word_smoke.py \
  --model hey_jarvis \
  --positive samples/synthetic_wake_positive.wav \
  --negative samples/synthetic_noise.wav \
  --threshold 0.5 \
  --json
```

Foreground local stream contract, only for manual experiments:

```bash
python wake_word_listen.py \
  --model hey_jarvis \
  --threshold 0.5 \
  --vad-threshold 0.3 \
  --oneshot \
  --json
```

Response/event shape:

```json
{
  "ok": true,
  "model": "hey_jarvis",
  "runtime": "openwakeword-onnxruntime-or-tflite",
  "device": "CPU",
  "threshold": 0.5,
  "events": [
    {"offset_ms": 1280, "score": 0.83, "detected": true}
  ],
  "false_positive_count": 0,
  "npu_busy_delta_us": null,
  "privacy": {"external_uploads": false, "raw_audio_logged": false}
}
```
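
The `events` array above could be derived from a list of per-frame scores as sketched here. Two assumptions are baked in and labeled in the code: frames are 80 ms long, and `offset_ms` marks the start of the frame. `events_from_scores` and `build_report` are hypothetical helpers, and counting every detection in a negative fixture as a false positive is this sketch's interpretation of `false_positive_count`.

```python
FRAME_MS = 80  # assumption: one score per 80 ms frame of 16 kHz audio


def events_from_scores(scores, threshold: float = 0.5, frame_ms: int = FRAME_MS) -> list:
    """Emit one event per frame whose score reaches the threshold."""
    return [
        # Assumption: offset_ms is the start of the frame, not its end.
        {"offset_ms": index * frame_ms, "score": score, "detected": True}
        for index, score in enumerate(scores)
        if score >= threshold
    ]


def build_report(model: str, scores, threshold: float = 0.5,
                 is_negative_fixture: bool = False) -> dict:
    """Wrap the events in the response shape; detections on negatives are false positives."""
    events = events_from_scores(scores, threshold)
    return {
        "ok": True,
        "model": model,
        "device": "CPU",
        "threshold": threshold,
        "events": events,
        "false_positive_count": len(events) if is_negative_fixture else 0,
        "npu_busy_delta_us": None,
        "privacy": {"external_uploads": False, "raw_audio_logged": False},
    }
```

With 16 quiet frames followed by a frame scoring 0.83, the single event lands at `offset_ms` 1280, matching the example above.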

If a localhost HTTP endpoint is ever needed, do not expose raw microphone streaming by default. Prefer events only:

- `GET /healthz`
- `POST /v1/wakeword/evaluate-file` for explicit files under allowed roots
- `GET /v1/wakeword/events` for a manually started foreground listener

### Smoke-test plan using non-private data

1. Install in a disposable or dedicated venv, not the existing NPU venv unless explicitly approved:

   ```bash
   python -m venv /tmp/openwakeword-smoke-venv
   /tmp/openwakeword-smoke-venv/bin/python -m pip install openwakeword
   ```

2. Use public/generated WAVs only:
   - Negative: silence, white noise, generic non-wake speech/TTS if locally generated.
   - Positive: only if a public/pretrained wake phrase fixture is available or generated explicitly for the selected model. If no positive fixture exists, run negative-only false-positive smoke and mark recall untested.
3. Assert no false positives over a bounded negative fixture set.
4. Measure per-frame CPU latency and max RSS.
5. Do not start a persistent microphone listener; manual foreground `--oneshot` only if explicitly approved.
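
Steps 3 and 4 can share one loop over a negative fixture. The sketch below is runtime-agnostic: `score_frame` is a hypothetical callable that takes raw 16-bit PCM bytes and returns a score, so the real openWakeWord call would be wrapped by the caller. The 1280-sample frame size is an assumption (80 ms at 16 kHz), and `resource` is Unix-only, with `ru_maxrss` reported in KiB on Linux.

```python
import resource
import time
import wave

FRAME_SAMPLES = 1280  # assumption: 80 ms frames at 16 kHz


def evaluate_negative(wav_path, score_frame, threshold: float = 0.5) -> dict:
    """Score a negative WAV frame by frame; count detections and time each call."""
    latencies_ms = []
    false_positives = 0
    with wave.open(str(wav_path), "rb") as wav:
        if (wav.getframerate(), wav.getnchannels(), wav.getsampwidth()) != (16000, 1, 2):
            raise ValueError("expected 16 kHz mono 16-bit PCM")
        while True:
            frame = wav.readframes(FRAME_SAMPLES)
            if len(frame) < FRAME_SAMPLES * 2:
                break  # drop the trailing partial frame
            start = time.perf_counter()
            score = score_frame(frame)
            latencies_ms.append((time.perf_counter() - start) * 1000.0)
            if score >= threshold:
                false_positives += 1
    return {
        "frames": len(latencies_ms),
        "false_positive_count": false_positives,
        "max_frame_latency_ms": max(latencies_ms, default=0.0),
        # Peak RSS of this process so far, in KiB on Linux.
        "max_rss_kib": resource.getrusage(resource.RUSAGE_SELF).ru_maxrss,
    }
```

Because the file is read in fixed-size frames rather than from a microphone, the same function works for every fixture in the bounded negative set.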

### NPU busy-time verification plan

Wake-word should not claim NPU in the initial path. If a later task converts a model to OpenVINO IR and targets NPU:

1. Read `/sys/class/accel/accel0/device/npu_busy_time_us` before a bounded file evaluation.
2. Run NPU inference on a fixed set of WAV frames.
3. Read the counter after inference.
4. Require positive delta and stable predictions matching CPU baseline.
5. Also verify that keeping the wake-word loop active does not starve Whisper `:18816` or embeddings `:18817`.
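
"Stable predictions matching CPU baseline" in step 4 needs a concrete comparison. One possible reading is sketched below: scores from the same frames must stay within a tolerance and must not flip any detection decision. The 0.05 tolerance is a hypothetical value, not something this document specifies, and `compare_scores` is an illustrative name.

```python
def compare_scores(cpu_scores, npu_scores, threshold: float = 0.5,
                   tolerance: float = 0.05) -> dict:
    """Compare per-frame scores from CPU and NPU runs over the same WAV frames."""
    if len(cpu_scores) != len(npu_scores):
        raise ValueError("score lists must cover the same frames")
    pairs = list(zip(cpu_scores, npu_scores))
    max_abs_diff = max((abs(cpu - npu) for cpu, npu in pairs), default=0.0)
    # A flip is a frame where one device detects and the other does not.
    decision_flips = sum((cpu >= threshold) != (npu >= threshold) for cpu, npu in pairs)
    return {
        "max_abs_diff": max_abs_diff,
        "decision_flips": decision_flips,
        "ok": max_abs_diff <= tolerance and decision_flips == 0,
    }
```

Checking decision flips separately catches the case where a small numeric drift near the threshold still changes what the assistant would do.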

### No-go / defer criteria

- CPU openWakeWord has unacceptable false positives on local negative fixtures.
- A usable positive fixture cannot be created without recording private audio.
- Always-on microphone capture is required before explicit approval.
- NPU conversion changes scores materially from CPU baseline.
- NPU loop increases contention with Whisper/embedding services.

## Docs and diagram implications

If these lanes advance beyond feasibility:

1. Update `docs/swarm-infrastructure.md` and `docs/swarm-infrastructure.html` to keep live vs prototype labels clear.
2. Update the OpenVINO NPU runbook with smoke commands and the sysfs busy-time proof steps.
3. Update the Service Catalog only after a service is actually approved/live; until then list as `prototype/not live` or omit.
4. Architecture diagrams may show:
   - live: RAG `:18810`, Whisper NPU `:18816`, embeddings `:18817`;
   - prototypes: reranker `:18818`, classifier/router `:18819`, GenAI worker `:18820`, doc/image triage optional `:18829`;
   - VLM/audio/wake-word as `CLI feasibility / not live` unless a later implementation task creates a service.
5. Do not imply Atlas/Hermes routing integration for any of these lanes without explicit approval.

## Overall go/no-go decision

- Go later: wake-word CPU-only CLI smoke, because it is useful and low risk if kept foreground/local.
- Maybe later: lightweight image classifier inside existing doc/image triage, if rule fallback is not enough.
- Defer: NPU-first VLM captioning until OpenVINO VLM-on-NPU compatibility is proven by a minimal synthetic-image smoke.
- Defer: generic audio classification until there is a concrete assistant workflow that consumes the output.