OpenVINO/NPU VLM, audio, and wake-word feasibility

Date: 2026-06-04

Scope: feasibility/spec only for lower-priority assistant sidecars. This document does not enable services, alter Atlas/Hermes/gateway routing, mutate RAG/Chroma/vector collections, or process private document/image directories.

Existing baseline and constraints

Live baseline discovered by parent task:

  • RAG endpoint: 127.0.0.1:18810
  • RAG health wrapper: 127.0.0.1:18814
  • Whisper OpenVINO NPU: 127.0.0.1:18816
  • OpenVINO embeddings: 127.0.0.1:18817
  • Prototype ports currently reserved/not live: reranker :18818, classifier/router :18819, GenAI worker :18820, optional doc/image triage :18829

Local NPU runtime snapshot from the feasibility run:

  • /home/will/.venvs/npu has openvino==2026.2.0 and openvino-genai==2026.2.0.0.
  • openvino.Core().available_devices reports CPU, GPU.0, GPU.1, and NPU.
  • NPU device name: Intel(R) AI Boost.
  • NPU claims must be verified by positive /sys/class/accel/accel0/device/npu_busy_time_us deltas around inference.

External release/project signals checked:

  • OpenVINO 2026.2.0 release notes mention broader GenAI coverage and VLM samples, but the VLM acceleration notes are CPU/GPU-oriented; they do not provide a clear low-risk NPU VLM path.
  • Prior OpenVINO release notes/search results mention OpenVINO Model Server VLM support for Qwen2-VL, Phi-3.5-Vision, and InternVL2.
  • openWakeWord is an active Apache-2.0 local wake-word framework with ONNX Runtime/TFLite support, pre-trained wake-word models, optional VAD, and 16 kHz PCM streaming examples. It is not installed in the current NPU venv.

Recommendation summary

| Lane | Recommendation | Priority | Why |
| --- | --- | --- | --- |
| VLM / image captioning | Defer NPU-first VLM. If pursued, prototype CPU/GPU VLM CLI first, then attempt NPU only after model/runtime compatibility is proven. | Low | NPU support for VLMs is not clearly mature in the current OpenVINO public notes; VLMs are memory/op-shape heavy; failures could be slow and noisy. Existing doc/image triage already covers practical local image metadata without a full VLM. |
| Lightweight image classification / caption fallback | Extend the existing openvino-doc-image-triage-npu lane before adding a new service. | Medium-low | It already has privacy boundaries, synthetic fixtures, CLI/server split, and NPU proof through embeddings. Add static-shape classifier only if a later task needs image labels beyond rule fallback. |
| Audio classification | Defer until a concrete assistant workflow needs it. Consider CPU/GPU/OpenVINO Runtime prototype using Speech Commands/ESC-style classifier before any daemon. | Low | Whisper NPU already covers transcription. Generic audio tags are less useful without a routing/product requirement and need dataset-specific threshold tuning. |
| Wake word | Worth a small CPU-only local smoke prototype; do not spend NPU time first. | Medium | Wake-word detection must be always-on, tiny, and reliable. CPU openWakeWord/ONNX/TFLite is the lowest-risk path and avoids starving existing NPU Whisper/embedding services. NPU use is only worth testing after CPU false-positive/latency behavior is acceptable. |

VLM / image-captioning path

Initial runtime: CLI-first OpenVINO GenAI or OpenVINO Model Server on CPU/GPU, not NPU-first.

Candidate models to evaluate, in order:

  1. Qwen2-VL-2B-Instruct OpenVINO/OVMS-compatible export if a small converted artifact is already available.
  2. Phi-3.5-Vision-Instruct only if memory/startup is acceptable.
  3. InternVL2 only as a compatibility reference; likely too heavy for a low-priority local assistant sidecar.

Why this order:

  • Qwen2-VL is named in OpenVINO Model Server release notes/search results as a supported VLM and has smaller variants.
  • Phi-3.5-Vision is also named in OpenVINO Model Server VLM support, but may be heavier.
  • NPU is not the first target because public OpenVINO 2026.2 release notes emphasize VLM improvements for CPU/GPU, not NPU. Treat NPU VLM as experimental until a smoke test proves compilation and positive busy-time deltas.

Endpoint/CLI contract

CLI-first contract:

python vlm_caption.py \
  --allowed-root /path/to/allowed_root \
  --image /path/to/allowed_root/synthetic_or_explicitly_allowed_image.png \
  --prompt "Describe this image in one sentence." \
  --device CPU \
  --max-new-tokens 96 \
  --json

Response shape:

{
  "ok": true,
  "media_type": "image",
  "source_path_basename": "synthetic_scene.png",
  "source_sha256": "sha256:...",
  "model": "qwen2-vl-small-openvino",
  "runtime": "openvino-genai-or-ovms",
  "device_requested": "CPU",
  "device_observed": "CPU",
  "caption": "A synthetic chart with three colored bars.",
  "safety": {
    "external_uploads": false,
    "raw_image_logged": false,
    "private_paths_allowed": false
  },
  "timing_ms": {
    "load": 0,
    "inference": 0,
    "total": 0
  },
  "npu_busy_delta_us": null
}
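The response shape above identifies the source only by basename and SHA-256. A minimal sketch of how a CLI might assemble it follows; the helper names `sha256_of_file` and `build_caption_response` are hypothetical and not part of any existing script, and the sketch omits the `model` and `runtime` fields for brevity.

```python
import hashlib
import json
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash the file in chunks so a large image is never read in one go."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return "sha256:" + digest.hexdigest()


def build_caption_response(image_path: Path, caption: str,
                           device_requested: str, device_observed: str,
                           timing_ms: dict, npu_busy_delta_us=None) -> dict:
    """Assemble the response; the full source path is deliberately left out."""
    return {
        "ok": True,
        "media_type": "image",
        "source_path_basename": image_path.name,
        "source_sha256": sha256_of_file(image_path),
        "device_requested": device_requested,
        "device_observed": device_observed,
        "caption": caption,
        "safety": {
            "external_uploads": False,
            "raw_image_logged": False,
            "private_paths_allowed": False,
        },
        "timing_ms": timing_ms,
        # Stays None (JSON null) unless a sysfs delta was actually measured.
        "npu_busy_delta_us": npu_busy_delta_us,
    }
```

Passing the result through `json.dumps` yields the `--json` output; keeping `npu_busy_delta_us` as `None` on CPU runs avoids implying NPU use that was never measured.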

Optional localhost HTTP contract, only after CLI is stable:

  • Bind: 127.0.0.1:18829 only if the optional doc/image triage lane, which reserves that port, is not using it; otherwise another explicitly approved unused prototype port.
  • GET /healthz
  • GET /models
  • POST /v1/vision/caption

Request body:

{
  "path": "/allowed/root/synthetic_scene.png",
  "prompt": "Describe this image in one sentence.",
  "max_new_tokens": 96,
  "device": "CPU"
}

Smoke-test plan using non-private data

Use only generated fixtures under the repo, similar to openvino-doc-image-triage-npu/samples/:

  1. Create synthetic PNGs: simple chart, receipt-like image, screenshot-like text panel, and blank/noisy image.
  2. Run CLI with --allowed-root "$PWD/samples" and assert:
    • JSON parses.
    • external_uploads=false.
    • only basename and SHA-256 are returned by default.
    • captions are non-empty and under a configured token/character limit.
    • unsupported/private paths are rejected.
  3. If an HTTP server is added, start it in foreground on 127.0.0.1, call /healthz and /v1/vision/caption, then stop it.
  4. No private image/document folders and no Obsidian vault content should be used for smoke tests.
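The "unsupported/private paths are rejected" assertion depends on an allowed-root check. One way such a check could look is sketched below, assuming a hypothetical helper `resolve_under_root`; it resolves symlinks and `..` segments before comparing, so a path cannot escape the root through either.

```python
from pathlib import Path


def resolve_under_root(candidate: str, allowed_root: str) -> Path:
    """Return the resolved file path, or raise if it escapes the allowed root."""
    root = Path(allowed_root).resolve()
    path = Path(candidate).resolve()
    # After resolution, the root must be one of the path's ancestors.
    if root not in path.parents:
        raise PermissionError("path is outside the allowed root")
    if not path.is_file():
        # Report only the basename so error text does not leak directory names.
        raise FileNotFoundError(path.name)
    return path
```

A smoke test can then assert that `samples/synthetic_scene.png` is accepted while `samples/../anything` raises `PermissionError`.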

NPU busy-time verification plan

Only claim NPU VLM if all of these pass:

  1. Verify the counter is readable and record the starting value:
     BUSY=/sys/class/accel/accel0/device/npu_busy_time_us
     test -r "$BUSY" && before=$(cat "$BUSY")
  2. Run exactly one synthetic-image inference with device=NPU.
  3. Read after=$(cat "$BUSY").
  4. Require after - before > 0 and a response-level npu_busy_delta_us > 0 if the server reports it.
  5. Repeat with a second synthetic image so that a positive delta cannot be explained by unrelated startup activity alone.
  6. If HTTP returns 200 but the sysfs delta is zero, document as NPU not verified and do not call it an NPU service.
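The before/after counter reads can be wrapped around any inference callable. The sketch below assumes the sysfs file holds a single integer in microseconds; `measure_npu_delta` and `npu_verified` are hypothetical names, and the counter path is a parameter so the logic can be exercised against a temporary file on machines without an NPU.

```python
from pathlib import Path
from typing import Any, Callable, Iterable, Tuple

DEFAULT_COUNTER = Path("/sys/class/accel/accel0/device/npu_busy_time_us")


def read_busy_us(counter: Path = DEFAULT_COUNTER) -> int:
    """Read the cumulative busy time (assumed to be one integer, in microseconds)."""
    return int(counter.read_text().strip())


def measure_npu_delta(run_inference: Callable[[], Any],
                      counter: Path = DEFAULT_COUNTER) -> Tuple[Any, int]:
    """Run one inference and return (result, busy-time delta in microseconds)."""
    before = read_busy_us(counter)
    result = run_inference()
    after = read_busy_us(counter)
    return result, after - before


def npu_verified(deltas: Iterable[int]) -> bool:
    """Require at least two runs, each with a strictly positive delta."""
    deltas = list(deltas)
    return len(deltas) >= 2 and all(delta > 0 for delta in deltas)
```

Because the counter is device-wide, a positive delta can also come from Whisper or embeddings traffic on the same NPU; running the check while those services are idle makes the attribution cleaner.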

No-go / defer criteria

Defer VLM NPU work if any apply:

  • Model export/compile to NPU fails or requires unsupported ops/custom patches.
  • First successful inference needs more than 60 seconds cold or more than 10 seconds warm for a small synthetic image.
  • NPU busy-time delta is zero or inconsistent.
  • Memory pressure disrupts Whisper :18816, embeddings :18817, or RAG :18810.
  • The only useful path requires processing private images/docs before synthetic smoke tests are stable.
  • Captions are too hallucination-prone for automation decisions without a human-review gate.

Lightweight image triage/classification path

Recommended near-term path: keep openvino-doc-image-triage-npu as the primary image/document lane and add only a static-shape classifier if rule fallback becomes inadequate.

Candidate classifier families for a later task:

  • MobileNetV3/EfficientNet-Lite/ResNet-18 style image classifier exported to OpenVINO IR.
  • Use NPU only if the IR compiles with static shapes and produces positive busy-time deltas.
  • Keep OCR/PDF rendering CPU-local; do not try to force OCR onto NPU in this phase.

Why:

  • The current triage prototype already has the right privacy contract and reports CPU vs NPU stages.
  • A small classifier is much lower risk than a VLM and can be used for labels like screenshot, receipt, document, photo, chart.

Endpoint/CLI contract

Extend existing CLI shape rather than introduce a new daemon:

/home/will/.venvs/npu/bin/python triage.py \
  --allowed-root "$PWD" \
  --image-classifier-model /home/will/models/openvino-image-classifier/model.xml \
  --image-classifier-device NPU \
  --pretty \
  samples/synthetic_invoice.png

Response addition:

{
  "classification": {
    "label": "receipt_or_invoice",
    "confidence": 0.82,
    "device": "NPU",
    "method": "openvino_image_classifier",
    "npu_busy_delta_us": 12345
  }
}
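As an illustration of how raw classifier output could be turned into this `classification` object, here is a minimal sketch. It assumes the model emits one logit per label; the function names are hypothetical, and the label list would come from whichever classifier is eventually chosen.

```python
import math
from typing import Optional, Sequence


def softmax(logits: Sequence[float]) -> list:
    """Numerically stable softmax: subtract the peak before exponentiating."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def classification_block(logits: Sequence[float], labels: Sequence[str],
                         device: str,
                         npu_busy_delta_us: Optional[int] = None) -> dict:
    """Pick the top-1 label and report it in the response-addition shape."""
    if len(logits) != len(labels):
        raise ValueError("one logit per label is required")
    probabilities = softmax(logits)
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    return {
        "label": labels[best],
        "confidence": round(probabilities[best], 2),
        "device": device,
        "method": "openvino_image_classifier",
        "npu_busy_delta_us": npu_busy_delta_us,
    }
```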

Smoke-test plan

Reuse openvino-doc-image-triage-npu/make_samples.py and tests/smoke_test.py; add synthetic image-label assertions only after a classifier model exists. Keep --no-embeddings mode available so the smoke suite can separate classifier NPU proof from embeddings :18817 proof.

No-go / defer criteria

  • Static-shape classifier cannot compile on NPU.
  • Labels are not useful enough to drive an assistant workflow.
  • Classifier output duplicates the existing rule-based fallback.

Audio classification path

Defer implementation. If a concrete workflow appears, start with a CLI-only OpenVINO Runtime classifier on CPU/GPU using synthetic/public audio fixtures, not a persistent service.

Potential model classes:

  • Speech Commands keyword classifier for short command categories.
  • ESC-50/AudioSet-like environmental sound classifier only if the task requires non-speech detection.
  • Whisper transcript + lightweight text classifier may be enough for most assistant routing, using existing Whisper NPU :18816.

Why:

  • The system already has local Whisper NPU transcription.
  • Generic audio classification needs careful threshold tuning and false-positive analysis.
  • Always-on audio processing has privacy and resource implications; keep it explicit and local.

CLI contract

python audio_classify.py \
  --allowed-root "$PWD/samples" \
  --input samples/synthetic_chime.wav \
  --model /home/will/models/openvino-audio-classifier/model.xml \
  --device CPU \
  --json

Response shape:

{
  "ok": true,
  "source_path_basename": "synthetic_chime.wav",
  "source_sha256": "sha256:...",
  "sample_rate": 16000,
  "duration_seconds": 1.2,
  "labels": [
    {"label": "chime", "confidence": 0.76}
  ],
  "device_requested": "CPU",
  "device_observed": "CPU",
  "npu_busy_delta_us": null,
  "privacy": {"external_uploads": false, "raw_audio_logged": false}
}

Optional HTTP should wait until a workflow exists. If it exists later, bind localhost and avoid overlap with current ports.

Smoke-test plan using non-private data

  1. Generate synthetic WAV files in repo-local samples/: sine tone, silence, white noise, simple chime, and a short synthetic spoken phrase if a local TTS fixture is available.
  2. Run CLI on each file with --allowed-root "$PWD/samples".
  3. Assert JSON parses, durations are bounded, and confidence values are numeric.
  4. Do not stream microphone input or scan private audio directories in smoke tests.
  5. If NPU mode is attempted, wrap each inference in sysfs busy-time reads.
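The synthetic fixtures in step 1 can be generated with the Python standard library alone. The sketch below writes 16 kHz mono 16-bit PCM files for a tone, silence, and seeded white noise; the file names other than `synthetic_noise.wav` and the helper names are assumptions for illustration.

```python
import math
import random
import struct
import wave
from pathlib import Path

SAMPLE_RATE = 16000  # matches the sample_rate in the response shape


def write_wav(path: Path, samples) -> None:
    """Write floats in [-1.0, 1.0] as mono 16-bit little-endian PCM."""
    frames = b"".join(
        struct.pack("<h", int(max(-1.0, min(1.0, sample)) * 32767))
        for sample in samples
    )
    with wave.open(str(path), "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(SAMPLE_RATE)
        wav.writeframes(frames)


def sine(seconds: float, freq_hz: float = 440.0, amplitude: float = 0.5):
    count = round(seconds * SAMPLE_RATE)
    return [amplitude * math.sin(2 * math.pi * freq_hz * n / SAMPLE_RATE)
            for n in range(count)]


def silence(seconds: float):
    return [0.0] * round(seconds * SAMPLE_RATE)


def white_noise(seconds: float, amplitude: float = 0.3, seed: int = 0):
    rng = random.Random(seed)  # seeded so the fixture is reproducible
    return [rng.uniform(-amplitude, amplitude)
            for _ in range(round(seconds * SAMPLE_RATE))]


def make_fixtures(out_dir: Path) -> list:
    """Create the fixture set and return the written paths."""
    out_dir.mkdir(parents=True, exist_ok=True)
    fixtures = {
        "synthetic_tone.wav": sine(1.0),
        "synthetic_silence.wav": silence(1.0),
        "synthetic_noise.wav": white_noise(1.0),
    }
    paths = []
    for name, samples in fixtures.items():
        path = out_dir / name
        write_wav(path, samples)
        paths.append(path)
    return paths
```

Seeding the noise generator keeps the fixture bytes, and therefore the reported `source_sha256`, stable across runs.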

No-go / defer criteria

  • No concrete downstream automation consumes the labels.
  • False positives cannot be characterized on synthetic/public fixtures.
  • It competes with Whisper NPU or requires a persistent microphone daemon without explicit approval.

Wake-word path

Recommended first runtime: CPU-only openWakeWord CLI/foreground process with ONNX Runtime or TFLite backend.

NPU recommendation: defer. Try NPU/OpenVINO conversion only after CPU openWakeWord passes false-positive and latency checks.

Why:

  • Wake-word detection is always-on and latency-sensitive; reliability matters more than accelerator novelty.
  • The model is small enough that CPU is likely acceptable and simpler.
  • Keeping wake-word off NPU reduces contention with Whisper NPU and embeddings.
  • openWakeWord has pre-trained models, optional VAD, and straightforward 16 kHz PCM frame APIs.

Endpoint/CLI contract

CLI smoke contract:

python wake_word_smoke.py \
  --model hey_jarvis \
  --positive samples/synthetic_wake_positive.wav \
  --negative samples/synthetic_noise.wav \
  --threshold 0.5 \
  --json

Foreground local stream contract, only for manual experiments:

python wake_word_listen.py \
  --model hey_jarvis \
  --threshold 0.5 \
  --vad-threshold 0.3 \
  --oneshot \
  --json

Response/event shape:

{
  "ok": true,
  "model": "hey_jarvis",
  "runtime": "openwakeword-onnxruntime-or-tflite",
  "device": "CPU",
  "threshold": 0.5,
  "events": [
    {"offset_ms": 1280, "score": 0.83, "detected": true}
  ],
  "false_positive_count": 0,
  "npu_busy_delta_us": null,
  "privacy": {"external_uploads": false, "raw_audio_logged": false}
}
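To make the event shape concrete, the sketch below converts a list of per-frame scores into `events` and a `false_positive_count`. It assumes 80 ms frames (1280 samples at 16 kHz) and reports `offset_ms` at the end of the triggering frame; both choices, and the function names, are assumptions rather than a confirmed openWakeWord convention.

```python
from typing import Sequence


def scores_to_events(scores: Sequence[float], threshold: float,
                     frame_ms: int = 80) -> list:
    """Emit one event per frame whose score reaches the threshold."""
    events = []
    for index, score in enumerate(scores):
        if score >= threshold:
            events.append({
                "offset_ms": (index + 1) * frame_ms,  # end of the frame
                "score": score,
                "detected": True,
            })
    return events


def build_report(model: str, scores: Sequence[float], threshold: float,
                 is_negative_fixture: bool) -> dict:
    """Assemble the response; detections on a negative fixture are false positives."""
    events = scores_to_events(scores, threshold)
    return {
        "ok": True,
        "model": model,
        "device": "CPU",
        "threshold": threshold,
        "events": events,
        "false_positive_count": len(events) if is_negative_fixture else 0,
        "npu_busy_delta_us": None,
        "privacy": {"external_uploads": False, "raw_audio_logged": False},
    }
```

With fifteen low-scoring frames followed by one frame at 0.83, this produces a single event at `offset_ms` 1280, matching the example above.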

If a localhost HTTP endpoint is ever needed, do not expose raw microphone streaming by default. Prefer events only:

  • GET /healthz
  • POST /v1/wakeword/evaluate-file for explicit files under allowed roots
  • GET /v1/wakeword/events for a manually started foreground listener

Smoke-test plan using non-private data

  1. Install in a disposable or dedicated venv, not the existing NPU venv unless explicitly approved:
     python -m venv /tmp/openwakeword-smoke-venv
     /tmp/openwakeword-smoke-venv/bin/python -m pip install openwakeword
  2. Use public/generated WAVs only:
    • Negative: silence, white noise, generic non-wake speech/TTS if locally generated.
    • Positive: only if a public/pretrained wake phrase fixture is available or generated explicitly for the selected model. If no positive fixture exists, run negative-only false-positive smoke and mark recall untested.
  3. Assert no false positives over a bounded negative fixture set.
  4. Measure per-frame CPU latency and max RSS.
  5. Do not start a persistent microphone listener; manual foreground --oneshot only if explicitly approved.
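The false-positive and latency steps can be prototyped before any wake-word model is installed. The sketch below takes a hypothetical `score_frame` callable standing in for the real scorer (it is not an openWakeWord API) and times each call; `max_rss` uses the Unix-only `resource` module.

```python
import statistics
import time
from typing import Callable, Iterable, Sequence


def measure_frame_latency(score_frame: Callable[[bytes], float],
                          frames: Iterable[bytes]) -> dict:
    """Score each frame and record wall-clock latency per call in milliseconds."""
    scores, latencies_ms = [], []
    for frame in frames:
        start = time.perf_counter()
        scores.append(score_frame(frame))
        latencies_ms.append((time.perf_counter() - start) * 1000.0)
    return {
        "scores": scores,
        "mean_ms": statistics.fmean(latencies_ms),
        "max_ms": max(latencies_ms),
    }


def count_false_positives(negative_scores: Sequence[float],
                          threshold: float) -> int:
    """On negative fixtures, every frame at or above the threshold is a false positive."""
    return sum(1 for score in negative_scores if score >= threshold)


def max_rss() -> int:
    """Peak resident set size of this process (KiB on Linux; units differ on macOS)."""
    import resource  # Unix-only, imported lazily so the rest works elsewhere
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
```

A smoke run would pass 2560-byte frames (1280 16-bit samples) read from the negative WAVs and assert that `count_false_positives` returns 0.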

NPU busy-time verification plan

Wake-word should not claim NPU in the initial path. If a later task converts a model to OpenVINO IR and targets NPU:

  1. Read /sys/class/accel/accel0/device/npu_busy_time_us before a bounded file evaluation.
  2. Run NPU inference on a fixed set of WAV frames.
  3. Read the counter after inference.
  4. Require positive delta and stable predictions matching CPU baseline.
  5. Also verify that keeping the wake-word loop active does not starve Whisper :18816 or embeddings :18817.
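Step 4's "stable predictions matching CPU baseline" can be made checkable with a per-frame comparison. In this sketch, the 0.05 tolerance and the name `scores_match_baseline` are assumptions; the check also fails if any frame lands on a different side of the detection threshold.

```python
from typing import Sequence


def scores_match_baseline(cpu_scores: Sequence[float],
                          npu_scores: Sequence[float],
                          threshold: float,
                          tolerance: float = 0.05) -> bool:
    """True if NPU scores track the CPU baseline frame by frame."""
    if len(cpu_scores) != len(npu_scores):
        return False
    for cpu, npu in zip(cpu_scores, npu_scores):
        if abs(cpu - npu) > tolerance:
            return False  # score drifted materially
        if (cpu >= threshold) != (npu >= threshold):
            return False  # detection decision flipped
    return True
```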

No-go / defer criteria

  • CPU openWakeWord has unacceptable false positives on local negative fixtures.
  • A usable positive fixture cannot be created without recording private audio.
  • Always-on microphone capture is required before explicit approval.
  • NPU conversion changes scores materially from CPU baseline.
  • NPU loop increases contention with Whisper/embedding services.

Docs and diagram implications

If these lanes advance beyond feasibility:

  1. Update docs/swarm-infrastructure.md and docs/swarm-infrastructure.html to keep live vs prototype labels clear.
  2. Update the OpenVINO NPU runbook with smoke commands and the sysfs busy-time proof steps.
  3. Update the Service Catalog only after a service is actually approved/live; until then list as prototype/not live or omit.
  4. Architecture diagrams may show:
    • live: RAG :18810, Whisper NPU :18816, embeddings :18817;
    • prototypes: reranker :18818, classifier/router :18819, GenAI worker :18820, doc/image triage optional :18829;
    • VLM/audio/wake-word as CLI feasibility / not live unless a later implementation task creates a service.
  5. Do not imply Atlas/Hermes routing integration for any of these lanes without explicit approval.

Overall go/no-go decision

  • Go later: wake-word CPU-only CLI smoke, because it is useful and low risk if kept foreground/local.
  • Maybe later: lightweight image classifier inside existing doc/image triage, if rule fallback is not enough.
  • Defer: NPU-first VLM captioning until OpenVINO VLM-on-NPU compatibility is proven by a minimal synthetic-image smoke.
  • Defer: generic audio classification until there is a concrete assistant workflow that consumes the output.