William Valentin 2ef9e3dfd2 feat(npu): add bounded OpenVINO GenAI worker

2026-06-04 13:07:51 -07:00

Bounded OpenVINO GenAI NPU worker contract

Status: prototype contract implemented locally; not a live Atlas/Hermes routing dependency. Default address: http://127.0.0.1:18820.

Purpose and hard boundary

This worker is a local-only sidecar for small, bounded generation jobs that are useful around the assistant stack but are not primary chat: title drafting, short summaries, notification condensation, and memory-candidate extraction. It must not be used as Atlas/Hermes primary model routing, gateway fallback routing, autonomous tool-calling, or an unbounded chat endpoint without a separate approval gate.

Hard boundaries:

Bind to 127.0.0.1 by default; non-local bind is a code/ops review item, not a runtime flag to casually change.
Do not enable a persistent systemd/Docker service as part of smoke testing.
Do not restart or reconfigure Atlas, Hermes, gateway, LiteLLM, RAG, or n8n routing to call this worker without explicit approval from Will.
Do not write memory, mutate Chroma/vector collections, trigger RAG reindexing, or process private document/image directories.
Do not log raw prompts or raw request bodies by default.
Treat HTTP success as insufficient for NPU claims; require a positive delta in /sys/class/accel/accel0/device/npu_busy_time_us across each generation call.

Recommended model/runtime

Recommended first model:

Model id: OpenVINO/Qwen2.5-1.5B-Instruct-int4-ov
Local path: /home/will/models/openvino-genai/Qwen2.5-1.5B-Instruct-int4-ov
Runtime: /home/will/.venvs/npu with openvino-genai==2026.2.0.0
Device: OpenVINO GenAI NPU
Compile cache: /home/will/.cache/openvino/genai-npu/qwen2.5-1.5b-int4

Why this model/runtime:

It is already staged in the repo prototype and has a local smoke observation with positive NPU busy-time delta.
It is an OpenVINO IR model with INT4-compressed weights, which keeps memory/compile pressure low enough for a sidecar on the shared NPU.
Qwen2.5-1.5B-Instruct is large enough for formatting/summarization/notification jobs but small enough to keep latency bounded. It should not be presented as a high-quality general assistant model.
The Hugging Face model card identifies it as Qwen2.5-1.5B-Instruct converted to OpenVINO IR with INT4_SYM NNCF weight compression and states compatibility with OpenVINO 2025.1.0+; the local runtime is newer than that baseline.
OpenVINO GenAI LLMPipeline is the right first runtime because the existing local NPU stack already uses OpenVINO GenAI successfully for Whisper, and it exposes a simple bounded generate call with cache controls.

Deferred alternatives:

Larger 3B/7B local LLMs: defer until the 1.5B contract proves stable; larger models increase compile time, memory pressure, and NPU contention.
CPU/GPU fallback inside this service: defer; fallback would blur the NPU verification contract. If fallback is later approved, return device_actual and keep NPU-only health separate.
Manual EXPORT_BLOB/BLOB_PATH: defer until compile latency is proven to dominate despite CACHE_DIR. If used later, record OpenVINO version, NPU compiler/driver versions, model id, quantization flags, and source model path; invalidate after OpenVINO/NPU driver upgrades.
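If manual blob export is ever taken up, the provenance record described above could be kept next to the blob and compared before reuse. The following is only a sketch: the `BlobProvenance` class, its field names, and `blob_is_stale` are hypothetical and are not part of the prototype.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class BlobProvenance:
    """Hypothetical record of what an exported NPU blob was compiled against."""

    openvino_version: str
    npu_compiler_version: str
    npu_driver_version: str
    model_id: str
    quantization_flags: str
    source_model_path: str


def blob_is_stale(recorded: BlobProvenance, current: BlobProvenance) -> bool:
    """Return True when any recorded field differs from the current environment."""
    return asdict(recorded) != asdict(current)
```

Treating any field mismatch as stale is deliberately conservative: it covers the OpenVINO and NPU driver upgrade cases without trying to decide which version changes are actually compatible.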

Runtime bounds

Pipeline configuration for the first milestone:

CACHE_DIR=/home/will/.cache/openvino/genai-npu/qwen2.5-1.5b-int4
MAX_PROMPT_LEN=1024
MIN_RESPONSE_LEN=64
PREFILL_HINT=DYNAMIC
GENERATE_HINT=FAST_COMPILE
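As a rough illustration, the settings above could be passed to the pipeline as keyword properties. This sketch assumes `openvino_genai.LLMPipeline` accepts the model directory, a device string, and these properties as keyword arguments; `PIPELINE_CONFIG` and `load_pipeline` are illustrative names, not the prototype's actual code.

```python
from pathlib import Path

MODEL_PATH = Path("/home/will/models/openvino-genai/Qwen2.5-1.5B-Instruct-int4-ov")

# Values copied from the first-milestone pipeline configuration.
PIPELINE_CONFIG = {
    "CACHE_DIR": "/home/will/.cache/openvino/genai-npu/qwen2.5-1.5b-int4",
    "MAX_PROMPT_LEN": 1024,
    "MIN_RESPONSE_LEN": 64,
    "PREFILL_HINT": "DYNAMIC",
    "GENERATE_HINT": "FAST_COMPILE",
}


def load_pipeline(model_path: Path = MODEL_PATH, device: str = "NPU"):
    """Build the pipeline on demand (assumed constructor shape, see lead-in)."""
    # Imported lazily so unit tests can import this module without the NPU runtime.
    import openvino_genai as ov_genai

    return ov_genai.LLMPipeline(str(model_path), device, **PIPELINE_CONFIG)
```

Keeping the import inside `load_pipeline` is a design choice that lets `/healthz` report `"loaded": false` without touching the runtime.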

Request bounds:

input: required non-empty string; max 6000 characters before prompt templating.
job: one of title, summary, notification, memory_candidate.
max_new_tokens: optional; default by job; hard max 256.
Concurrency: generation must be serialized inside the process with a lock because the NPU is shared with Whisper/embeddings/prototype sidecars.
Logging: log method/path/status and timing only; never log raw input or generated text by default.
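The request bounds above could be enforced with a small validation step before any prompt templating. This is a sketch rather than the prototype's code: `validate_request` and `RequestError` are hypothetical names, the per-job defaults are taken from the prompt/job contract in this document, and treating whitespace-only input as empty and rejecting (rather than clamping) out-of-range token counts are assumptions.

```python
# Per-job default max_new_tokens, as listed in the prompt/job contract.
JOB_DEFAULT_TOKENS = {
    "title": 32,
    "summary": 160,
    "notification": 96,
    "memory_candidate": 192,
}
MAX_INPUT_CHARS = 6000
HARD_MAX_NEW_TOKENS = 256


class RequestError(ValueError):
    """Raised when a request violates the documented bounds."""


def validate_request(payload: dict) -> tuple[str, str, int]:
    """Return (job, input, max_new_tokens) or raise RequestError."""
    job = payload.get("job")
    if not isinstance(job, str) or job not in JOB_DEFAULT_TOKENS:
        raise RequestError(f"unsupported job: {job!r}")

    text = payload.get("input")
    # Assumption: whitespace-only input counts as empty.
    if not isinstance(text, str) or not text.strip():
        raise RequestError("input must be a non-empty string")
    if len(text) > MAX_INPUT_CHARS:
        raise RequestError(f"input exceeds {MAX_INPUT_CHARS} characters")

    max_new_tokens = payload.get("max_new_tokens")
    if max_new_tokens is None:
        max_new_tokens = JOB_DEFAULT_TOKENS[job]
    # bool is a subclass of int, so it is rejected explicitly.
    if isinstance(max_new_tokens, bool) or not isinstance(max_new_tokens, int):
        raise RequestError("max_new_tokens must be an integer")
    if not 1 <= max_new_tokens <= HARD_MAX_NEW_TOKENS:
        raise RequestError(f"max_new_tokens must be 1..{HARD_MAX_NEW_TOKENS}")

    return job, text, max_new_tokens
```

Note that the error messages never echo the input text, which keeps them safe to log under the logging rule above.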

Expected latency target:

First generation after process start, with the compile cache already populated: acceptable if roughly 15 seconds or less for a short prompt on the staged model.
Warm short jobs: target under 5 seconds for title/notification and under 10 seconds for summary/memory_candidate.
Defer promotion if p95 warm latency exceeds 15 seconds for 24-96 generated tokens, or if cold compile regularly blocks the NPU long enough to degrade live Whisper/embeddings.

These are prototype acceptance targets, not SLOs for live Atlas routing.
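The p95 defer criterion can be checked mechanically from recorded warm timings. The sketch below uses the nearest-rank percentile method, which is one reasonable choice and not something this contract prescribes; `p95_ms` and `should_defer_promotion` are hypothetical helper names.

```python
def p95_ms(latencies_ms: list[float]) -> float:
    """Nearest-rank 95th percentile of the given latencies."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_ms)
    # Integer form of ceil(0.95 * n), avoiding floating-point rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def should_defer_promotion(warm_latencies_ms: list[float], limit_ms: float = 15_000) -> bool:
    """True when warm p95 latency exceeds the 15-second prototype limit."""
    return p95_ms(warm_latencies_ms) > limit_ms
```

With 20 samples, one slow outlier does not trip the check, but two samples above the limit do.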

CLI contract

Command shape:

cd /home/will/lab/swarm/openvino-genai-npu-worker
/home/will/.venvs/npu/bin/python worker.py \
  --job title \
  --input 'Synthetic non-private text to title.' \
  --max-new-tokens 32

CLI stdout is JSON with the same response shape as HTTP generation. Exit code must be:

0 when the job succeeds and npu_busy_delta_us > 0.
non-zero when input validation fails, model load/generation fails, or NPU busy-time delta is not positive.

The CLI must not write memory, change service routing, or start persistent services.
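The exit-code rule could be derived from the same JSON payload that is printed to stdout. This is a minimal sketch: `exit_code_for` and `emit` are hypothetical names, and using `1` as the single non-zero code is an assumption, since the contract only requires non-zero.

```python
import json
import sys


def exit_code_for(result: dict) -> int:
    """Map a response payload to the CLI exit code."""
    if "error" in result:
        return 1
    delta = result.get("npu_busy_delta_us")
    # Success requires a positive NPU busy-time delta, not just generated text.
    return 0 if isinstance(delta, int) and delta > 0 else 1


def emit(result: dict, stream=sys.stdout) -> int:
    """Write the payload as one JSON line and return the exit code."""
    json.dump(result, stream)
    stream.write("\n")
    return exit_code_for(result)
```

A `main()` would then end with `sys.exit(emit(result))`, so stdout stays machine-readable even on failure.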

HTTP contract

Start temporary local server only:

cd /home/will/lab/swarm/openvino-genai-npu-worker
/home/will/.venvs/npu/bin/python worker.py --host 127.0.0.1 --port 18820

Endpoints:

GET  /healthz
GET  /models
POST /v1/worker/generate
POST /v1/worker/extract-memory-candidates
POST /v1/worker/condense-notification

GET /healthz response fields:

{
  "ok": true,
  "model": "OpenVINO/Qwen2.5-1.5B-Instruct-int4-ov",
  "model_path": "/home/will/models/openvino-genai/Qwen2.5-1.5B-Instruct-int4-ov",
  "device": "NPU",
  "cache_dir": "/home/will/.cache/openvino/genai-npu/qwen2.5-1.5b-int4",
  "cache_exists": true,
  "loaded": false,
  "initial_load_ms": null,
  "busy_time_us": 0,
  "max_input_chars": 6000,
  "jobs": ["memory_candidate", "notification", "summary", "title"],
  "bind": "127.0.0.1:18820"
}

POST /v1/worker/generate request:

{
  "job": "summary",
  "input": "Synthetic non-private text to summarize.",
  "max_new_tokens": 80
}

Specialized aliases:

POST /v1/worker/extract-memory-candidates implies job=memory_candidate.
POST /v1/worker/condense-notification implies job=notification.
Backward-compatible request job=memory may map to memory_candidate, but new clients should use memory_candidate.
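The alias rules can be expressed as two small lookup tables. In this sketch, `resolve_job` is a hypothetical helper, and letting the alias path override any `job` value in the body is an assumption that the contract does not state.

```python
from typing import Optional

# Specialized endpoints that imply a job.
PATH_JOBS = {
    "/v1/worker/extract-memory-candidates": "memory_candidate",
    "/v1/worker/condense-notification": "notification",
}
# Backward-compatible job names.
JOB_ALIASES = {"memory": "memory_candidate"}


def resolve_job(path: str, body: dict) -> Optional[str]:
    """Return the effective job name, or None for an unsupported path."""
    if path in PATH_JOBS:
        # Assumption: the path wins over any "job" field in the body.
        return PATH_JOBS[path]
    if path != "/v1/worker/generate":
        return None
    job = body.get("job")
    if isinstance(job, str):
        return JOB_ALIASES.get(job, job)
    return None
```

The returned name still has to pass job validation; this step only normalizes it.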

Successful generation response:

{
  "model": "OpenVINO/Qwen2.5-1.5B-Instruct-int4-ov",
  "device": "NPU",
  "job": "summary",
  "text": "...",
  "json": null,
  "timing_ms": {
    "load": 0.0,
    "initial_load": 10989.08,
    "generate": 3157.94,
    "total": 3157.94
  },
  "npu_busy_delta_us": 2650724,
  "npu_busy_before_us": 123,
  "npu_busy_after_us": 2650847,
  "cache_dir": "/home/will/.cache/openvino/genai-npu/qwen2.5-1.5b-int4"
}

Validation/error behavior:

Unsupported path: 404 JSON {"error":"not found"}.
Unsupported job, empty input, too-long input, invalid token bound, missing model, or generation failure: JSON {"error":"..."}. The current stdlib prototype returns 400 for all of these; future implementations should likewise return a non-2xx status.
If npu_busy_delta_us <= 0, the response should be treated as failed by smoke tests even if an HTTP handler emitted 200; the refreshed prototype returns 503 with the generation payload plus an error field.
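The status rules above fit in one decision function. This is a sketch under stated assumptions: `build_response` is a hypothetical helper, and the wording of the 503 error message is invented for illustration.

```python
from typing import Optional


def build_response(
    path_known: bool,
    error: Optional[str],
    result: Optional[dict],
) -> tuple[int, dict]:
    """Return (HTTP status, JSON body) following the documented behavior."""
    if not path_known:
        return 404, {"error": "not found"}
    if error is not None or result is None:
        # Validation, missing-model, and generation failures all map to 400.
        return 400, {"error": error or "generation failed"}
    if result["npu_busy_delta_us"] <= 0:
        # Keep the generation payload but flag the missing NPU evidence.
        return 503, {**result, "error": "npu_busy_delta_us was not positive"}
    return 200, result
```

Returning 503 here means `curl -f` in a smoke test fails on its own, without the caller having to inspect the body.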

Prompt/job contract

title:

Input: short task/log/message excerpt.
Output: one title, 8 words or fewer, no markdown required.
Default max_new_tokens: 32.

summary:

Input: synthetic/non-private text excerpt.
Output: one short paragraph or up to 4 bullets.
Default max_new_tokens: 160.

notification:

Input: synthetic/non-private alert/log excerpt.
Output target: JSON object with severity, category, summary, action_needed.
Default max_new_tokens: 96.
Client must tolerate json: null and parse/validate before using output.

memory_candidate:

Input: synthetic/non-private conversation excerpt.
Output target: JSON object with candidates and notes; candidates are proposals only.
Default max_new_tokens: 192.
This worker must never call Hermes memory tools or write durable memory directly.
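On the client side, tolerating `json: null` for the notification job might look like the following. It is a sketch only: `parse_notification` is a hypothetical helper, and falling back to parsing the `text` field when `json` is null is an assumption about how a client may choose to recover.

```python
import json
from typing import Optional

NOTIFICATION_KEYS = {"severity", "category", "summary", "action_needed"}


def parse_notification(response: dict) -> Optional[dict]:
    """Return a validated notification object, or None if unusable."""
    candidate = response.get("json")
    if candidate is None:
        # Assumption: try the raw text as JSON before giving up.
        try:
            candidate = json.loads(response.get("text") or "")
        except (json.JSONDecodeError, TypeError):
            return None
    if not isinstance(candidate, dict):
        return None
    # All four documented keys must be present before the output is used.
    if not NOTIFICATION_KEYS.issubset(candidate):
        return None
    return candidate
```

A `None` result should be handled as "no structured output", not as an error in the worker.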

Smoke-test plan using non-private data

Do not use private vault notes, screenshots, email, chat logs, or document/image directories. Use synthetic text like this:

Atlas received a kanban notification that an OpenVINO NPU prototype finished smoke testing. The reviewer needs a concise status and next action. No live gateway routing changed.

Direct NPU smoke:

cd /home/will/lab/swarm/openvino-genai-npu-worker
before=$(cat /sys/class/accel/accel0/device/npu_busy_time_us)
/home/will/.venvs/npu/bin/python smoke_llm_npu.py \
  --prompt 'Write a concise title for: synthetic NPU worker contract smoke.' \
  --max-new-tokens 24
status=$?
after=$(cat /sys/class/accel/accel0/device/npu_busy_time_us)
printf 'external_busy_delta_us=%s\n' "$((after-before))"
test "$status" -eq 0
test "$((after-before))" -gt 0

Temporary HTTP smoke:

cd /home/will/lab/swarm/openvino-genai-npu-worker
/home/will/.venvs/npu/bin/python worker.py --host 127.0.0.1 --port 18820 &
pid=$!
trap 'kill "$pid" 2>/dev/null || true' EXIT

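# Retry until the background server is listening; it was started a moment ago.
curl -fsS --retry 10 --retry-delay 1 --retry-connrefused http://127.0.0.1:18820/healthz | python -m json.tool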
before=$(cat /sys/class/accel/accel0/device/npu_busy_time_us)
curl -fsS http://127.0.0.1:18820/v1/worker/generate \
  -H 'Content-Type: application/json' \
  -d '{"job":"title","input":"Synthetic NPU worker smoke with no routing changes.","max_new_tokens":24}' \
  | tee /tmp/openvino-genai-worker-smoke.json \
  | python -m json.tool
after=$(cat /sys/class/accel/accel0/device/npu_busy_time_us)
python - <<'PY'
import json
p=json.load(open('/tmp/openvino-genai-worker-smoke.json'))
assert p['npu_busy_delta_us'] > 0, p
assert p['device'] == 'NPU', p
PY
test "$((after-before))" -gt 0
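kill "$pid"
# Wait for the server to exit so the listener check that follows is not racy.
wait "$pid" 2>/dev/null || true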
trap - EXIT

Also verify the temporary listener is gone:

ss -ltnp | grep ':18820' && { echo 'temporary smoke server still running'; exit 1; } || true

Unit tests that do not load the model or require private data:

cd /home/will/lab/swarm/openvino-genai-npu-worker
python -m pytest -q

NPU busy-time verification plan

Acceptance for any NPU claim requires all of the following:

Confirm the sysfs counter exists and is readable: test -r /sys/class/accel/accel0/device/npu_busy_time_us.
Read busy_before immediately before the generation call.
Run exactly one bounded generation against the candidate worker.
Read busy_after immediately after generation completes.
Require busy_after > busy_before and response npu_busy_delta_us > 0.
Record model id, runtime version, prompt chars, max tokens, load/generate timings, and busy delta in the review handoff.
If the counter is unchanged, mark the smoke as failed even if HTTP returned 200 and text was generated.

Because the NPU is shared, a positive external delta proves NPU activity during the window but not exclusive attribution. Prefer a quiet window with no concurrent Whisper/embedding jobs for review-grade measurements; otherwise repeat and compare worker-reported internal delta with the external counter.
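Inside the worker, the before/after reads can wrap the serialized generation call so that the reported delta covers exactly one generation. The sketch below is illustrative: `generate_with_busy_delta` and `read_busy_us` are hypothetical names, and the counter reader is injectable so the logic can be tested without an NPU.

```python
import threading
from pathlib import Path
from typing import Callable

BUSY_COUNTER = Path("/sys/class/accel/accel0/device/npu_busy_time_us")

# One lock per process: generation is serialized because the NPU is shared.
_generation_lock = threading.Lock()


def read_busy_us(counter: Path = BUSY_COUNTER) -> int:
    """Read the cumulative NPU busy time in microseconds from sysfs."""
    return int(counter.read_text().strip())


def generate_with_busy_delta(
    generate: Callable[[], str],
    read_counter: Callable[[], int] = read_busy_us,
) -> dict:
    """Run one generation under the lock and report the busy-time delta."""
    with _generation_lock:
        before = read_counter()
        text = generate()
        after = read_counter()
    return {
        "text": text,
        "npu_busy_before_us": before,
        "npu_busy_after_us": after,
        "npu_busy_delta_us": after - before,
    }
```

The lock only serializes this process. Other NPU users such as Whisper can still contribute to the delta, which is why the external counter comparison described above remains necessary.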

Docs/diagram implications

If this worker is kept as a prototype, docs and diagrams should show:

Live baseline remains RAG :18810, Whisper NPU :18816, embeddings :18817.
GenAI worker :18820 is proposed/prototype/not-live unless explicitly approved and enabled.
No arrow from Hermes/Atlas gateway or LiteLLM primary routing to :18820 unless a later approved integration actually exists.
Runbooks should include the CLI/HTTP smoke commands, ss listener checks, and NPU busy-time counter checks.
Service maps should label this as "bounded background generation" rather than "chat" or "assistant model".

Explicit no-go / defer criteria

No-go for implementation or promotion:

Model path missing, OpenVINO GenAI import fails, or NPU device is unavailable.
/sys/class/accel/accel0/device/npu_busy_time_us is unreadable or does not increase during generation.
Warm bounded jobs exceed the prototype latency target or starve live Whisper/embedding services.
The worker needs private documents/images/chat logs for smoke testing.
The worker requires Atlas/Hermes/gateway/LiteLLM/RAG routing changes to demonstrate value.
The API starts accepting arbitrary chat history, tool-call instructions, unbounded prompts, or large outputs.
The service logs raw prompt bodies by default.
Persistent service enablement is requested without an explicit Will approval gate and a reviewer smoke handoff.
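The first two no-go criteria can be partially checked before any model load. This is a sketch with a hypothetical `preflight` helper: it only confirms that the runtime module can be located, not that it imports cleanly, and it does not probe NPU device availability.

```python
import importlib.util
import os
from pathlib import Path


def preflight(model_path: str, counter_path: str, runtime_module: str = "openvino_genai") -> list[str]:
    """Return a list of no-go findings; an empty list means these checks passed."""
    problems = []
    if not Path(model_path).exists():
        problems.append(f"model path missing: {model_path}")
    if not os.access(counter_path, os.R_OK):
        problems.append(f"busy-time counter unreadable: {counter_path}")
    # find_spec locates the module without importing it.
    if importlib.util.find_spec(runtime_module) is None:
        problems.append(f"runtime module not found: {runtime_module}")
    return problems
```

A smoke script could call this with the staged model path and the sysfs counter path, and stop before generation if the list is non-empty.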

Defer, do not solve in this lane:

Primary assistant routing, LiteLLM model registration, gateway fallback, or tool-calling integration.
RAG query rewriting, RAG answer generation, or collection mutation.
Private document/image triage.
Multi-model selection, CPU/GPU fallback policy, batching, streaming, or auth exposure beyond localhost.
