OpenVINO/NPU VLM, audio, and wake-word feasibility
Date: 2026-06-04
Scope: feasibility/spec only for lower-priority assistant sidecars. This document does not enable services, alter Atlas/Hermes/gateway routing, mutate RAG/Chroma/vector collections, or process private document/image directories.
Existing baseline and constraints
Live baseline discovered by parent task:
- RAG endpoint: 127.0.0.1:18810
- RAG health wrapper: 127.0.0.1:18814
- Whisper OpenVINO NPU: 127.0.0.1:18816
- OpenVINO embeddings: 127.0.0.1:18817
- Prototype ports currently reserved/not live: reranker :18818, classifier/router :18819, GenAI worker :18820, optional doc/image triage :18829
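The live vs reserved split above can be rechecked with a read-only TCP probe. The following is a minimal sketch, not part of the existing tooling: the helper names `port_is_live` and `port_report` and the dictionary keys are hypothetical, and a successful connect only shows that something is listening, not that it is the expected service.

```python
import socket

# Port numbers come from the baseline list; the dictionary keys are illustrative labels.
LIVE_PORTS = {"rag": 18810, "rag-health": 18814, "whisper-npu": 18816, "embeddings": 18817}
PROTOTYPE_PORTS = {"reranker": 18818, "classifier-router": 18819,
                   "genai-worker": 18820, "doc-image-triage": 18829}


def port_is_live(port, host="127.0.0.1", timeout=0.5):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, timed out, or unreachable: treat as not live.
        return False


def port_report(ports):
    """Map each named port to its current liveness."""
    return {name: port_is_live(port) for name, port in ports.items()}


if __name__ == "__main__":
    print("live:", port_report(LIVE_PORTS))
    print("prototype:", port_report(PROTOTYPE_PORTS))
```

The probe opens and immediately closes a connection, so it sends no request payload to any service.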
Local NPU runtime snapshot from the feasibility run:
- /home/will/.venvs/npu has openvino==2026.2.0 and openvino-genai==2026.2.0.0.
- openvino.Core().available_devices reports CPU, GPU.0, GPU.1, and NPU.
- NPU device name: Intel(R) AI Boost.
- NPU claims must be verified by positive /sys/class/accel/accel0/device/npu_busy_time_us deltas around inference.
External release/project signals checked:
- OpenVINO 2026.2.0 release notes mention broader GenAI coverage and VLM samples, but the VLM acceleration notes are CPU/GPU-oriented; they do not provide a clear low-risk NPU VLM path.
- Prior OpenVINO release notes/search results mention OpenVINO Model Server VLM support for Qwen2-VL, Phi-3.5-Vision, and InternVL2.
- openWakeWord is an active Apache-2.0 local wake-word framework with ONNX Runtime/TFLite support, pre-trained wake-word models, optional VAD, and 16 kHz PCM streaming examples. It is not installed in the current NPU venv.
Recommendation summary
| Lane | Recommendation | Priority | Why |
|---|---|---|---|
| VLM / image captioning | Defer NPU-first VLM. If pursued, prototype CPU/GPU VLM CLI first, then attempt NPU only after model/runtime compatibility is proven. | Low | NPU support for VLMs is not clearly mature in the current OpenVINO public notes; VLMs are memory/op-shape heavy; failures could be slow and noisy. Existing doc/image triage already covers practical local image metadata without a full VLM. |
| Lightweight image classification / caption fallback | Extend the existing openvino-doc-image-triage-npu lane before adding a new service. | Medium-low | It already has privacy boundaries, synthetic fixtures, CLI/server split, and NPU proof through embeddings. Add static-shape classifier only if a later task needs image labels beyond rule fallback. |
| Audio classification | Defer until a concrete assistant workflow needs it. Consider CPU/GPU/OpenVINO Runtime prototype using Speech Commands/ESC-style classifier before any daemon. | Low | Whisper NPU already covers transcription. Generic audio tags are less useful without a routing/product requirement and need dataset-specific threshold tuning. |
| Wake word | Worth a small CPU-only local smoke prototype; do not spend NPU time first. | Medium | Wake-word detection must be always-on, tiny, and reliable. CPU openWakeWord/ONNX/TFLite is the lowest-risk path and avoids starving existing NPU Whisper/embedding services. NPU use is only worth testing after CPU false-positive/latency behavior is acceptable. |
VLM / image-captioning path
Recommended model/runtime
Initial runtime: CLI-first OpenVINO GenAI or OpenVINO Model Server on CPU/GPU, not NPU-first.
Candidate models to evaluate, in order:
1. Qwen2-VL-2B-Instruct OpenVINO/OVMS-compatible export, if a small converted artifact is already available.
2. Phi-3.5-Vision-Instruct, only if memory/startup is acceptable.
3. InternVL2, only as a compatibility reference; likely too heavy for a low-priority local assistant sidecar.
Why this order:
- Qwen2-VL is broadly supported by OpenVINO Model Server release notes/search results and has smaller variants.
- Phi-3.5-Vision is also named in OpenVINO Model Server VLM support, but may be heavier.
- NPU is not the first target because public OpenVINO 2026.2 release notes emphasize VLM improvements for CPU/GPU, not NPU. Treat NPU VLM as experimental until a smoke test proves compilation and positive busy-time deltas.
Endpoint/CLI contract
CLI-first contract:
python vlm_caption.py \
--image /path/to/synthetic_or_explicitly_allowed_image.png \
--prompt "Describe this image in one sentence." \
--device CPU \
--max-new-tokens 96 \
--json
Response shape:
{
"ok": true,
"media_type": "image",
"source_path_basename": "synthetic_scene.png",
"source_sha256": "sha256:...",
"model": "qwen2-vl-small-openvino",
"runtime": "openvino-genai-or-ovms",
"device_requested": "CPU",
"device_observed": "CPU",
"caption": "A synthetic chart with three colored bars.",
"safety": {
"external_uploads": false,
"raw_image_logged": false,
"private_paths_allowed": false
},
"timing_ms": {
"load": 0,
"inference": 0,
"total": 0
},
"npu_busy_delta_us": null
}
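The response shape above can be assembled with the standard library alone. This is a sketch under stated assumptions: `build_caption_response` is a hypothetical helper name, the caption and timings would come from whichever runtime is eventually chosen, and the model/runtime strings are copied from the example rather than from a real deployment.

```python
import hashlib
import json
from pathlib import Path


def build_caption_response(image_path, caption, device_requested="CPU",
                           device_observed="CPU", timing_ms=None,
                           npu_busy_delta_us=None):
    """Assemble the response dict; only basename and SHA-256 identify the source."""
    path = Path(image_path)
    digest = hashlib.sha256(path.read_bytes()).hexdigest()
    return {
        "ok": True,
        "media_type": "image",
        "source_path_basename": path.name,   # never the full path
        "source_sha256": f"sha256:{digest}",
        "model": "qwen2-vl-small-openvino",   # placeholder from the spec example
        "runtime": "openvino-genai-or-ovms",  # placeholder from the spec example
        "device_requested": device_requested,
        "device_observed": device_observed,
        "caption": caption,
        "safety": {
            "external_uploads": False,
            "raw_image_logged": False,
            "private_paths_allowed": False,
        },
        "timing_ms": timing_ms or {"load": 0, "inference": 0, "total": 0},
        "npu_busy_delta_us": npu_busy_delta_us,  # stays None unless NPU was measured
    }


if __name__ == "__main__":
    import sys
    print(json.dumps(build_caption_response(sys.argv[1], "placeholder caption"), indent=2))
```

Hashing the file bytes rather than the path keeps the identifier stable across directories while revealing nothing about where the file lives.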
Optional localhost HTTP contract, only after CLI is stable:
- Bind: 127.0.0.1:18829 or another explicitly approved unused prototype port.
- GET /healthz
- GET /models
- POST /v1/vision/caption
Request body:
{
"path": "/allowed/root/synthetic_scene.png",
"prompt": "Describe this image in one sentence.",
"max_new_tokens": 96,
"device": "CPU"
}
Smoke-test plan using non-private data
Use only generated fixtures under the repo, similar to openvino-doc-image-triage-npu/samples/:
- Create synthetic PNGs: simple chart, receipt-like image, screenshot-like text panel, and blank/noisy image.
- Run CLI with --allowed-root "$PWD/samples" and assert:
  - JSON parses.
  - external_uploads=false.
  - only basename and SHA-256 are returned by default.
  - captions are non-empty and under a configured token/character limit.
  - unsupported/private paths are rejected.
- If an HTTP server is added, start it in foreground on 127.0.0.1, call /healthz and /v1/vision/caption, then stop it.
- No private image/document folders and no Obsidian vault content should be used for smoke tests.
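The "unsupported/private paths are rejected" assertion depends on how the allowed-root check is written. One possible sketch follows; `resolve_under_allowed_root` is a hypothetical name, and the existing triage prototype may implement this differently.

```python
from pathlib import Path


def resolve_under_allowed_root(candidate, allowed_root):
    """Return the resolved path if it sits under allowed_root, else raise ValueError."""
    root = Path(allowed_root).resolve()
    resolved = Path(candidate).resolve()
    # resolve() collapses ".." and follows symlinks, so traversal and symlink
    # escapes are compared against the real on-disk location.
    if root not in resolved.parents:
        raise ValueError("path is outside the allowed root")
    return resolved
```

Comparing resolved paths, rather than string prefixes, avoids accepting a sibling directory such as samples-private when the allowed root is samples.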
NPU busy-time verification plan
Only claim NPU VLM if all of these pass:
- Verify the counter is readable:
BUSY=/sys/class/accel/accel0/device/npu_busy_time_us
test -r "$BUSY" && before=$(cat "$BUSY")
- Run exactly one synthetic-image inference with device=NPU.
- Read after=$(cat "$BUSY").
- Require after - before > 0 and a response-level npu_busy_delta_us > 0 if the server reports it.
- Repeat with a second synthetic image so the proof does not rest on startup activity alone.
- If HTTP returns 200 but the sysfs delta is zero, document as NPU not verified and do not call it an NPU service.
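The same before/after check can be wrapped around any inference call from Python. This is a minimal sketch: `measure_npu_busy_delta` and `npu_verified` are hypothetical helper names, `run_inference` stands in for whatever callable performs one NPU inference, and the counter path is parameterized so the logic can be exercised without NPU hardware.

```python
from pathlib import Path

# Counter path taken from the plan above.
BUSY_COUNTER = "/sys/class/accel/accel0/device/npu_busy_time_us"


def read_busy_us(counter_path=BUSY_COUNTER):
    """Read the cumulative NPU busy time in microseconds."""
    return int(Path(counter_path).read_text().strip())


def measure_npu_busy_delta(run_inference, counter_path=BUSY_COUNTER):
    """Run one inference and return (result, busy-time delta in microseconds)."""
    before = read_busy_us(counter_path)
    result = run_inference()
    after = read_busy_us(counter_path)
    return result, after - before


def npu_verified(deltas_us):
    """Claim NPU only with at least two runs, each showing a positive delta."""
    return len(deltas_us) >= 2 and all(delta > 0 for delta in deltas_us)
```

Because other NPU clients such as Whisper or embeddings may also advance the counter, a positive delta is best read as necessary rather than sufficient evidence; running the check while those services are idle would make it more convincing.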
No-go / defer criteria
Defer VLM NPU work if any apply:
- Model export/compile to NPU fails or requires unsupported ops/custom patches.
- First successful inference needs more than 60 seconds cold or more than 10 seconds warm for a small synthetic image.
- NPU busy-time delta is zero or inconsistent.
- Memory pressure disrupts Whisper :18816, embeddings :18817, or RAG :18810.
- The only useful path requires processing private images/docs before synthetic smoke tests are stable.
- Captions are too hallucination-prone for automation decisions without a human-review gate.
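The measurable criteria in this list (cold/warm latency and busy-time deltas) can be turned into an explicit gate. The sketch below uses the 60-second and 10-second limits stated above; the function name `vlm_npu_no_go_reasons` is hypothetical, and reading "inconsistent" as "any run with a non-positive delta" is an assumption.

```python
COLD_LIMIT_S = 60.0   # from the no-go list: first inference, cold
WARM_LIMIT_S = 10.0   # from the no-go list: subsequent inference, warm


def vlm_npu_no_go_reasons(cold_seconds, warm_seconds, busy_deltas_us):
    """Return the list of triggered no-go reasons; an empty list means none fired."""
    reasons = []
    if cold_seconds > COLD_LIMIT_S:
        reasons.append(f"cold inference took {cold_seconds:.1f}s (limit {COLD_LIMIT_S:.0f}s)")
    if warm_seconds > WARM_LIMIT_S:
        reasons.append(f"warm inference took {warm_seconds:.1f}s (limit {WARM_LIMIT_S:.0f}s)")
    # Assumption: "inconsistent" means at least one run showed no positive delta.
    if not busy_deltas_us or any(delta <= 0 for delta in busy_deltas_us):
        reasons.append("NPU busy-time delta is zero or inconsistent")
    return reasons
```

The remaining criteria (memory pressure on other services, private-data dependence, hallucination risk) need human judgment and are deliberately left out of the gate.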
Lightweight image triage/classification path
Recommended model/runtime
Recommended near-term path: keep openvino-doc-image-triage-npu as the primary image/document lane and add only a static-shape classifier if rule fallback becomes inadequate.
Candidate classifier families for a later task:
- MobileNetV3/EfficientNet-Lite/ResNet-18 style image classifier exported to OpenVINO IR.
- Use NPU only if the IR compiles with static shapes and produces positive busy-time deltas.
- Keep OCR/PDF rendering CPU-local; do not try to force OCR onto NPU in this phase.
Why:
- The current triage prototype already has the right privacy contract and reports CPU vs NPU stages.
- A small classifier is much lower risk than a VLM and can be used for labels like screenshot, receipt, document, photo, chart.
Endpoint/CLI contract
Extend existing CLI shape rather than introduce a new daemon:
/home/will/.venvs/npu/bin/python triage.py \
--allowed-root "$PWD" \
--image-classifier-model /home/will/models/openvino-image-classifier/model.xml \
--image-classifier-device NPU \
--pretty \
samples/synthetic_invoice.png
Response addition:
{
"classification": {
"label": "receipt_or_invoice",
"confidence": 0.82,
"device": "NPU",
"method": "openvino_image_classifier",
"npu_busy_delta_us": 12345
}
}
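If the classifier emits raw logits, the classification entry could be derived as in the sketch below. Several things here are assumptions: the label list is illustrative (taken from the example labels mentioned earlier, not from a real model), the model is assumed to output one logit per label, and `classification_entry` is a hypothetical helper name.

```python
import math

# Illustrative label set; a real model defines its own ordering.
LABELS = ["screenshot", "receipt", "document", "photo", "chart"]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def classification_entry(logits, labels=LABELS, device="CPU", npu_busy_delta_us=None):
    """Build the "classification" object from one logit vector."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {
        "label": labels[best],
        "confidence": round(probs[best], 4),
        "device": device,
        "method": "openvino_image_classifier",
        "npu_busy_delta_us": npu_busy_delta_us,  # None unless NPU was measured
    }
```

Leaving `npu_busy_delta_us` as None on CPU runs keeps the field honest: a number appears only when a sysfs delta was actually measured.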
Smoke-test plan
Reuse openvino-doc-image-triage-npu/make_samples.py and tests/smoke_test.py; add synthetic image-label assertions only after a classifier model exists. Keep --no-embeddings mode available so the smoke suite can separate classifier NPU proof from embeddings :18817 proof.
No-go / defer criteria
- Static-shape classifier cannot compile on NPU.
- Labels are not useful enough to drive an assistant workflow.
- Classifier output duplicates the existing rule-based fallback.
Audio classification path
Recommended model/runtime
Defer implementation. If a concrete workflow appears, start with a CLI-only OpenVINO Runtime classifier on CPU/GPU using synthetic/public audio fixtures, not a persistent service.
Potential model classes:
- Speech Commands keyword classifier for short command categories.
- ESC-50/AudioSet-like environmental sound classifier only if the task requires non-speech detection.
- Whisper transcript + lightweight text classifier may be enough for most assistant routing, using existing Whisper NPU :18816.
Why:
- The system already has local Whisper NPU transcription.
- Generic audio classification needs careful threshold tuning and false-positive analysis.
- Always-on audio processing has privacy and resource implications; keep it explicit and local.
CLI contract
python audio_classify.py \
--input samples/synthetic_chime.wav \
--model /home/will/models/openvino-audio-classifier/model.xml \
--device CPU \
--json
Response shape:
{
"ok": true,
"source_path_basename": "synthetic_chime.wav",
"source_sha256": "sha256:...",
"sample_rate": 16000,
"duration_seconds": 1.2,
"labels": [
{"label": "chime", "confidence": 0.76}
],
"device_requested": "CPU",
"device_observed": "CPU",
"npu_busy_delta_us": null,
"privacy": {"external_uploads": false, "raw_audio_logged": false}
}
Optional HTTP should wait until a workflow exists. If it exists later, bind localhost and avoid overlap with current ports.
Smoke-test plan using non-private data
- Generate synthetic WAV files in repo-local samples/: sine tone, silence, white noise, simple chime, and a short synthetic spoken phrase if a local TTS fixture is available.
- Run CLI on each file with --allowed-root "$PWD/samples".
- Assert JSON parses, durations are bounded, and confidence values are numeric.
- Do not stream microphone input or scan private audio directories in smoke tests.
- If NPU mode is attempted, wrap each inference in sysfs busy-time reads.
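The synthetic fixtures in the first step can be generated with the Python standard library, so no private or downloaded audio is needed. This is a sketch: the function names, amplitudes, and the 440 Hz default are illustrative choices, and 16 kHz mono 16-bit PCM is assumed to match the sample_rate shown in the response shape.

```python
import math
import random
import struct
import wave

SAMPLE_RATE = 16000  # matches "sample_rate": 16000 in the response shape


def write_wav(path, samples, sample_rate=SAMPLE_RATE):
    """Write float samples in [-1, 1] as 16-bit mono PCM."""
    frames = b"".join(
        struct.pack("<h", int(max(-1.0, min(1.0, sample)) * 32767))
        for sample in samples
    )
    with wave.open(str(path), "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(sample_rate)
        wav.writeframes(frames)


def sine_tone(seconds, freq_hz=440.0, amplitude=0.5, sample_rate=SAMPLE_RATE):
    count = int(seconds * sample_rate)
    return [amplitude * math.sin(2 * math.pi * freq_hz * i / sample_rate)
            for i in range(count)]


def silence(seconds, sample_rate=SAMPLE_RATE):
    return [0.0] * int(seconds * sample_rate)


def white_noise(seconds, amplitude=0.3, seed=0, sample_rate=SAMPLE_RATE):
    rng = random.Random(seed)  # fixed seed keeps the fixture reproducible
    return [amplitude * rng.uniform(-1.0, 1.0)
            for _ in range(int(seconds * sample_rate))]


if __name__ == "__main__":
    write_wav("samples/synthetic_tone.wav", sine_tone(1.0))
    write_wav("samples/synthetic_silence.wav", silence(1.0))
    write_wav("samples/synthetic_noise.wav", white_noise(1.0))
```

The `__main__` block assumes a samples/ directory already exists; a chime could be approximated by concatenating two short tones with decaying amplitude.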
No-go / defer criteria
- No concrete downstream automation consumes the labels.
- False positives cannot be characterized on synthetic/public fixtures.
- It competes with Whisper NPU or requires a persistent microphone daemon without explicit approval.
Wake-word path
Recommended model/runtime
Recommended first runtime: CPU-only openWakeWord CLI/foreground process with ONNX Runtime or TFLite backend.
NPU recommendation: defer. Try NPU/OpenVINO conversion only after CPU openWakeWord passes false-positive and latency checks.
Why:
- Wake-word detection is always-on and latency-sensitive; reliability matters more than accelerator novelty.
- The model is small enough that CPU is likely acceptable and simpler.
- Keeping wake-word off NPU reduces contention with Whisper NPU and embeddings.
- openWakeWord has pre-trained models, optional VAD, and straightforward 16 kHz PCM frame APIs.
Endpoint/CLI contract
CLI smoke contract:
python wake_word_smoke.py \
--model hey_jarvis \
--positive samples/synthetic_wake_positive.wav \
--negative samples/synthetic_noise.wav \
--threshold 0.5 \
--json
Foreground local stream contract, only for manual experiments:
python wake_word_listen.py \
--model hey_jarvis \
--threshold 0.5 \
--vad-threshold 0.3 \
--oneshot \
--json
Response/event shape:
{
"ok": true,
"model": "hey_jarvis",
"runtime": "openwakeword-onnxruntime-or-tflite",
"device": "CPU",
"threshold": 0.5,
"events": [
{"offset_ms": 1280, "score": 0.83, "detected": true}
],
"false_positive_count": 0,
"npu_busy_delta_us": null,
"privacy": {"external_uploads": false, "raw_audio_logged": false}
}
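Per-frame scores can be mapped to the event list above with a simple threshold pass. This sketch assumes 80 ms frames (1280 samples at 16 kHz), which matches the offset in the example but should be checked against the installed openWakeWord version; `scores_to_events` and `summarize_negative_run` are hypothetical helper names, and the scores themselves would come from the wake-word model.

```python
def scores_to_events(scores, threshold=0.5, frame_ms=80):
    """Turn a sequence of per-frame scores into detection events."""
    events = []
    for index, score in enumerate(scores):
        if score >= threshold:
            events.append({
                "offset_ms": index * frame_ms,  # start of the frame that fired
                "score": round(float(score), 2),
                "detected": True,
            })
    return events


def summarize_negative_run(scores, threshold=0.5, frame_ms=80):
    """On a negative fixture, every detection counts as a false positive."""
    events = scores_to_events(scores, threshold, frame_ms)
    return {"events": events, "false_positive_count": len(events)}
```

For example, sixteen quiet frames followed by one frame scoring 0.83 yields a single event at offset 1280 ms.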
If a localhost HTTP endpoint is ever needed, do not expose raw microphone streaming by default. Prefer events only:
- GET /healthz
- POST /v1/wakeword/evaluate-file for explicit files under allowed roots
- GET /v1/wakeword/events for a manually started foreground listener
Smoke-test plan using non-private data
- Install in a disposable or dedicated venv, not the existing NPU venv unless explicitly approved:
python -m venv /tmp/openwakeword-smoke-venv
/tmp/openwakeword-smoke-venv/bin/python -m pip install openwakeword
- Use public/generated WAVs only:
  - Negative: silence, white noise, generic non-wake speech/TTS if locally generated.
  - Positive: only if a public/pretrained wake phrase fixture is available or generated explicitly for the selected model. If no positive fixture exists, run negative-only false-positive smoke and mark recall untested.
- Assert no false positives over a bounded negative fixture set.
- Measure per-frame CPU latency and max RSS.
- Do not start a persistent microphone listener; manual foreground --oneshot only if explicitly approved.
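Per-frame latency and peak memory can be captured around any scoring callable. This is a sketch under assumptions: `predict` is a stand-in for the wake-word model call, `measure_frame_latency` is a hypothetical helper name, and the `resource` module is Unix-only. On Linux `ru_maxrss` is reported in kilobytes, so the raw value is returned without conversion.

```python
import resource
import time


def measure_frame_latency(predict, frames):
    """Time predict(frame) for each frame and report latency plus peak RSS."""
    latencies_ms = []
    for frame in frames:
        start = time.perf_counter()
        predict(frame)
        latencies_ms.append((time.perf_counter() - start) * 1000.0)
    count = len(latencies_ms)
    return {
        "frames": count,
        "max_latency_ms": max(latencies_ms) if count else 0.0,
        "mean_latency_ms": sum(latencies_ms) / count if count else 0.0,
        # Peak resident set size of this process; kilobytes on Linux.
        "max_rss_raw": resource.getrusage(resource.RUSAGE_SELF).ru_maxrss,
    }
```

Comparing the maximum latency with the frame duration shows whether the loop could keep up in real time on the target CPU.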
NPU busy-time verification plan
Wake-word should not claim NPU in the initial path. If a later task converts a model to OpenVINO IR and targets NPU:
- Read /sys/class/accel/accel0/device/npu_busy_time_us before a bounded file evaluation.
- Run NPU inference on a fixed set of WAV frames.
- Read the counter after inference.
- Require positive delta and stable predictions matching CPU baseline.
- Also verify that keeping the wake-word loop active does not starve Whisper :18816 or embeddings :18817.
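"Stable predictions matching CPU baseline" needs a concrete comparison. One possible sketch follows; the 0.05 tolerance is an illustrative assumption rather than a measured bound, and `compare_scores` is a hypothetical helper name.

```python
def compare_scores(cpu_scores, npu_scores, threshold=0.5, tolerance=0.05):
    """Compare per-frame scores from CPU and NPU runs over the same frames."""
    if len(cpu_scores) != len(npu_scores):
        raise ValueError("score sequences must cover the same frames")
    pairs = list(zip(cpu_scores, npu_scores))
    max_abs_diff = max((abs(cpu - npu) for cpu, npu in pairs), default=0.0)
    # A frame "flips" when the two devices disagree on detected vs not detected.
    flipped = [index for index, (cpu, npu) in enumerate(pairs)
               if (cpu >= threshold) != (npu >= threshold)]
    return {
        "max_abs_diff": max_abs_diff,
        "flipped_frames": flipped,
        "stable": max_abs_diff <= tolerance and not flipped,
    }
```

Checking for flipped decisions as well as raw score drift matters because a small numeric difference near the threshold can still change whether the wake word fires.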
No-go / defer criteria
- CPU openWakeWord has unacceptable false positives on local negative fixtures.
- A usable positive fixture cannot be created without recording private audio.
- Always-on microphone capture is required before explicit approval.
- NPU conversion changes scores materially from CPU baseline.
- NPU loop increases contention with Whisper/embedding services.
Docs and diagram implications
If these lanes advance beyond feasibility:
- Update docs/swarm-infrastructure.md and docs/swarm-infrastructure.html to keep live vs prototype labels clear.
- Update the OpenVINO NPU runbook with smoke commands and the sysfs busy-time proof steps.
- Update the Service Catalog only after a service is actually approved/live; until then list as prototype/not live or omit.
- Architecture diagrams may show:
  - live: RAG :18810, Whisper NPU :18816, embeddings :18817;
  - prototypes: reranker :18818, classifier/router :18819, GenAI worker :18820, doc/image triage optional :18829;
  - VLM/audio/wake-word as CLI feasibility / not live unless a later implementation task creates a service.
- Do not imply Atlas/Hermes routing integration for any of these lanes without explicit approval.
Overall go/no-go decision
- Go later: wake-word CPU-only CLI smoke, because it is useful and low risk if kept foreground/local.
- Maybe later: lightweight image classifier inside existing doc/image triage, if rule fallback is not enough.
- Defer: NPU-first VLM captioning until OpenVINO VLM-on-NPU compatibility is proven by a minimal synthetic-image smoke.
- Defer: generic audio classification until there is a concrete assistant workflow that consumes the output.