# Bounded OpenVINO GenAI NPU worker contract

Status: prototype contract implemented locally; not a live Atlas/Hermes routing dependency.

Default address: `http://127.0.0.1:18820`.

## Purpose and hard boundary

This worker is a local-only sidecar for small, bounded generation jobs that are useful around the assistant stack but are not primary chat: title drafting, short summaries, notification condensation, and memory-candidate extraction. It must not be used as Atlas/Hermes primary model routing, gateway fallback routing, autonomous tool-calling, or an unbounded chat endpoint without a separate approval gate.

Hard boundaries:

- Bind to `127.0.0.1` by default; non-local bind is a code/ops review item, not a runtime flag to casually change.
- Do not enable a persistent systemd/Docker service as part of smoke testing.
- Do not restart or reconfigure Atlas, Hermes, gateway, LiteLLM, RAG, or n8n routing to call this worker without explicit approval from Will.
- Do not write memory, mutate Chroma/vector collections, trigger RAG reindexing, or process private document/image directories.
- Do not log raw prompts or raw request bodies by default.
- Treat HTTP success as insufficient for NPU claims; require positive `/sys/class/accel/accel0/device/npu_busy_time_us` delta for generation.

## Recommended model/runtime

Recommended first model:

- Model id: `OpenVINO/Qwen2.5-1.5B-Instruct-int4-ov`
- Local path: `/home/will/models/openvino-genai/Qwen2.5-1.5B-Instruct-int4-ov`
- Runtime: `/home/will/.venvs/npu` with `openvino-genai==2026.2.0.0`
- Device: OpenVINO GenAI `NPU`
- Compile cache: `/home/will/.cache/openvino/genai-npu/qwen2.5-1.5b-int4`

Why this model/runtime:

- It is already staged in the repo prototype and has a local smoke observation with positive NPU busy-time delta.
- It is an OpenVINO IR model with INT4-compressed weights, which keeps memory/compile pressure low enough for a sidecar on the shared NPU.
- Qwen2.5-1.5B-Instruct is large enough for formatting/summarization/notification jobs but small enough to keep latency bounded. It should not be marketed as a high-quality general assistant model.
- The Hugging Face model card identifies it as Qwen2.5-1.5B-Instruct converted to OpenVINO IR with INT4_SYM NNCF weight compression and states compatibility with OpenVINO 2025.1.0+; the local runtime is newer than that baseline.
- OpenVINO GenAI `LLMPipeline` is the right first runtime because the existing local NPU stack already uses OpenVINO GenAI successfully for Whisper, and it exposes a simple bounded generate call with cache controls.

Deferred alternatives:

- Larger 3B/7B local LLMs: defer until the 1.5B contract proves stable; larger models increase compile time, memory pressure, and NPU contention.
- CPU/GPU fallback inside this service: defer; fallback would blur the NPU verification contract. If fallback is later approved, return `device_actual` and keep NPU-only health separate.
- Manual `EXPORT_BLOB`/`BLOB_PATH`: defer until compile latency is proven to dominate despite `CACHE_DIR`. If used later, record OpenVINO version, NPU compiler/driver versions, model id, quantization flags, and source model path; invalidate after OpenVINO/NPU driver upgrades.

## Runtime bounds

Pipeline configuration for the first milestone:

```text
CACHE_DIR=/home/will/.cache/openvino/genai-npu/qwen2.5-1.5b-int4
MAX_PROMPT_LEN=1024
MIN_RESPONSE_LEN=64
PREFILL_HINT=DYNAMIC
GENERATE_HINT=FAST_COMPILE
```

Request bounds:

- `input`: required non-empty string; max `6000` characters before prompt templating.
- `job`: one of `title`, `summary`, `notification`, `memory_candidate`.
- `max_new_tokens`: optional; default by job; hard max `256`.
- Concurrency: generation must be serialized inside the process with a lock because the NPU is shared with Whisper/embeddings/prototype sidecars.
- Logging: log method/path/status and timing only; never log raw `input` or generated text by default.
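
The request bounds above could be enforced with a small validation step before any prompt templating. The following is a minimal sketch, not the prototype's actual code: `validate_request` and `ValidatedRequest` are hypothetical names, the per-job defaults are copied from the prompt/job contract in this document, and two details are assumptions (whitespace-only input is treated as empty, and the lower bound for `max_new_tokens` is 1).

```python
from dataclasses import dataclass

MAX_INPUT_CHARS = 6000        # max input length before prompt templating
HARD_MAX_NEW_TOKENS = 256     # hard cap regardless of job

# Per-job defaults, copied from the prompt/job contract in this document.
DEFAULT_MAX_NEW_TOKENS = {
    "title": 32,
    "summary": 160,
    "notification": 96,
    "memory_candidate": 192,
}

# Backward-compatible alias described in the HTTP contract.
JOB_ALIASES = {"memory": "memory_candidate"}


@dataclass(frozen=True)
class ValidatedRequest:
    job: str
    input: str
    max_new_tokens: int


def validate_request(body: dict) -> ValidatedRequest:
    """Hypothetical helper: raise ValueError on any bound violation."""
    job = body.get("job")
    if not isinstance(job, str):
        raise ValueError("job must be a string")
    job = JOB_ALIASES.get(job, job)
    if job not in DEFAULT_MAX_NEW_TOKENS:
        raise ValueError("unsupported job")

    text = body.get("input")
    # Assumption: whitespace-only input counts as empty.
    if not isinstance(text, str) or not text.strip():
        raise ValueError("input must be a non-empty string")
    if len(text) > MAX_INPUT_CHARS:
        raise ValueError("input too long")

    tokens = body.get("max_new_tokens", DEFAULT_MAX_NEW_TOKENS[job])
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(tokens, bool) or not isinstance(tokens, int):
        raise ValueError("max_new_tokens must be an integer")
    # Assumption: at least one token must be requested.
    if not 1 <= tokens <= HARD_MAX_NEW_TOKENS:
        raise ValueError("max_new_tokens out of range")

    return ValidatedRequest(job=job, input=text, max_new_tokens=tokens)
```

Raising a single exception type keeps the mapping to an error response simple: a handler can catch `ValueError` and return the JSON `{"error":"..."}` shape without ever logging the raw `input`.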

Expected latency target:

- Cold-ish first generation with cache available: acceptable if roughly 15 seconds or less for a short prompt on the staged model.
- Warm short jobs: target under 5 seconds for `title`/`notification` and under 10 seconds for `summary`/`memory_candidate`.
- Defer promotion if p95 warm latency exceeds 15 seconds for 24-96 generated tokens, or if cold compile regularly blocks the NPU long enough to degrade live Whisper/embeddings.

These are prototype acceptance targets, not SLOs for live Atlas routing.

## CLI contract

Command shape:

```bash
cd /home/will/lab/swarm/openvino-genai-npu-worker
/home/will/.venvs/npu/bin/python worker.py \
  --job title \
  --input 'Synthetic non-private text to title.' \
  --max-new-tokens 32
```

CLI stdout is JSON with the same response shape as HTTP generation. Exit code must be:

- `0` when the job succeeds and `npu_busy_delta_us > 0`.
- non-zero when input validation fails, model load/generation fails, or NPU busy-time delta is not positive.

The CLI must not write memory, change service routing, or start persistent services.
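
The exit-code rule above can be expressed as a pure function over the response payload, which makes it easy to unit test without loading the model. This is a sketch under assumptions: `exit_code_for` and `emit_and_exit_code` are hypothetical names, and the specific non-zero values (`1` for an error payload, `2` for a non-positive or missing busy delta) are illustrative only, since the contract requires just "non-zero".

```python
import json
import sys
from typing import TextIO


def exit_code_for(payload: dict) -> int:
    """Return 0 only when the job succeeded and the NPU busy counter advanced."""
    # Validation, model-load, or generation failures carry an "error" field.
    if "error" in payload:
        return 1  # assumed value; the contract only requires non-zero
    delta = payload.get("npu_busy_delta_us")
    # A missing, non-integer, or non-positive delta fails the NPU claim.
    if isinstance(delta, bool) or not isinstance(delta, int) or delta <= 0:
        return 2  # assumed value; the contract only requires non-zero
    return 0


def emit_and_exit_code(payload: dict, stream: TextIO = sys.stdout) -> int:
    """Write the payload as one JSON line and return the exit code to use."""
    json.dump(payload, stream)
    stream.write("\n")
    return exit_code_for(payload)
```

A CLI entry point could then end with `sys.exit(emit_and_exit_code(payload))`, so stdout always carries the JSON payload even when the exit code is non-zero.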

## HTTP contract

Start temporary local server only:

```bash
cd /home/will/lab/swarm/openvino-genai-npu-worker
/home/will/.venvs/npu/bin/python worker.py --host 127.0.0.1 --port 18820
```

Endpoints:

```text
GET /healthz
GET /models
POST /v1/worker/generate
POST /v1/worker/extract-memory-candidates
POST /v1/worker/condense-notification
```

`GET /healthz` response fields:

```json
{
  "ok": true,
  "model": "OpenVINO/Qwen2.5-1.5B-Instruct-int4-ov",
  "model_path": "/home/will/models/openvino-genai/Qwen2.5-1.5B-Instruct-int4-ov",
  "device": "NPU",
  "cache_dir": "/home/will/.cache/openvino/genai-npu/qwen2.5-1.5b-int4",
  "cache_exists": true,
  "loaded": false,
  "initial_load_ms": null,
  "busy_time_us": 0,
  "max_input_chars": 6000,
  "jobs": ["memory_candidate", "notification", "summary", "title"],
  "bind": "127.0.0.1:18820"
}
```

`POST /v1/worker/generate` request:

```json
{
  "job": "summary",
  "input": "Synthetic non-private text to summarize.",
  "max_new_tokens": 80
}
```

Specialized aliases:

- `POST /v1/worker/extract-memory-candidates` implies `job=memory_candidate`.
- `POST /v1/worker/condense-notification` implies `job=notification`.
- Backward-compatible request `job=memory` may map to `memory_candidate`, but new clients should use `memory_candidate`.

Successful generation response:

```json
{
  "model": "OpenVINO/Qwen2.5-1.5B-Instruct-int4-ov",
  "device": "NPU",
  "job": "summary",
  "text": "...",
  "json": null,
  "timing_ms": {
    "load": 0.0,
    "initial_load": 10989.08,
    "generate": 3157.94,
    "total": 3157.94
  },
  "npu_busy_delta_us": 2650724,
  "npu_busy_before_us": 123,
  "npu_busy_after_us": 2650847,
  "cache_dir": "/home/will/.cache/openvino/genai-npu/qwen2.5-1.5b-int4"
}
```

Validation/error behavior:

- Unsupported path: `404` JSON `{"error":"not found"}`.
- Unsupported job, empty input, too-long input, invalid token bound, missing model, or generation failure: JSON `{"error":"..."}` with non-2xx preferred for future implementations. The current stdlib prototype returns `400` for these errors.
- If `npu_busy_delta_us <= 0`, the response should be treated as failed by smoke tests even if an HTTP handler emitted `200`; the refreshed prototype returns `503` with the generation payload plus an `error` field.

## Prompt/job contract

`title`:

- Input: short task/log/message excerpt.
- Output: one title, 8 words or fewer, no markdown required.
- Default `max_new_tokens`: 32.

`summary`:

- Input: synthetic/non-private text excerpt.
- Output: one short paragraph or up to 4 bullets.
- Default `max_new_tokens`: 160.

`notification`:

- Input: synthetic/non-private alert/log excerpt.
- Output target: JSON object with `severity`, `category`, `summary`, `action_needed`.
- Default `max_new_tokens`: 96.
- Client must tolerate `json: null` and parse/validate before using output.

`memory_candidate`:

- Input: synthetic/non-private conversation excerpt.
- Output target: JSON object with `candidates` and `notes`; candidates are proposals only.
- Default `max_new_tokens`: 192.
- This worker must never call Hermes memory tools or write durable memory directly.

## Smoke-test plan using non-private data

Do not use private vault notes, screenshots, email, chat logs, or document/image directories. Use synthetic text like this:

```text
Atlas received a kanban notification that an OpenVINO NPU prototype finished smoke testing. The reviewer needs a concise status and next action. No live gateway routing changed.
```

Direct NPU smoke:

```bash
cd /home/will/lab/swarm/openvino-genai-npu-worker
before=$(cat /sys/class/accel/accel0/device/npu_busy_time_us)
/home/will/.venvs/npu/bin/python smoke_llm_npu.py \
  --prompt 'Write a concise title for: synthetic NPU worker contract smoke.' \
  --max-new-tokens 24
status=$?
after=$(cat /sys/class/accel/accel0/device/npu_busy_time_us)
printf 'external_busy_delta_us=%s\n' "$((after-before))"
test "$status" -eq 0
test "$((after-before))" -gt 0
```

Temporary HTTP smoke:

```bash
cd /home/will/lab/swarm/openvino-genai-npu-worker
/home/will/.venvs/npu/bin/python worker.py --host 127.0.0.1 --port 18820 &
pid=$!
trap 'kill "$pid" 2>/dev/null || true' EXIT
# The server was just backgrounded; retry until the listener accepts connections.
curl -fsS --retry 10 --retry-delay 1 --retry-connrefused \
  http://127.0.0.1:18820/healthz | python -m json.tool
before=$(cat /sys/class/accel/accel0/device/npu_busy_time_us)
curl -fsS http://127.0.0.1:18820/v1/worker/generate \
  -H 'Content-Type: application/json' \
  -d '{"job":"title","input":"Synthetic NPU worker smoke with no routing changes.","max_new_tokens":24}' \
  | tee /tmp/openvino-genai-worker-smoke.json \
  | python -m json.tool
after=$(cat /sys/class/accel/accel0/device/npu_busy_time_us)
python - <<'PY'
import json
p=json.load(open('/tmp/openvino-genai-worker-smoke.json'))
assert p['npu_busy_delta_us'] > 0, p
assert p['device'] == 'NPU', p
PY
test "$((after-before))" -gt 0
kill "$pid"
trap - EXIT
```

Also verify the temporary listener is gone:

```bash
ss -ltnp | grep ':18820' && { echo 'temporary smoke server still running'; exit 1; } || true
```

Unit tests that do not load the model or require private data:

```bash
cd /home/will/lab/swarm/openvino-genai-npu-worker
python -m pytest -q
```

## NPU busy-time verification plan

Acceptance for any NPU claim requires all of the following:

1. Confirm the sysfs counter exists and is readable: `test -r /sys/class/accel/accel0/device/npu_busy_time_us`.
2. Read `busy_before` immediately before the generation call.
3. Run exactly one bounded generation against the candidate worker.
4. Read `busy_after` immediately after generation completes.
5. Require `busy_after > busy_before` and response `npu_busy_delta_us > 0`.
6. Record model id, runtime version, prompt chars, max tokens, load/generate timings, and busy delta in the review handoff.
7. If the counter is unchanged, mark the smoke as failed even if HTTP returned `200` and text was generated.

Because the NPU is shared, a positive external delta proves NPU activity during the window but not exclusive attribution. Prefer a quiet window with no concurrent Whisper/embedding jobs for review-grade measurements; otherwise repeat and compare worker-reported internal delta with the external counter.

## Docs/diagram implications

If this worker is kept as a prototype, docs and diagrams should show:

- Live baseline remains RAG `:18810`, Whisper NPU `:18816`, embeddings `:18817`.
- GenAI worker `:18820` is proposed/prototype/not-live unless explicitly approved and enabled.
- No arrow from Hermes/Atlas gateway or LiteLLM primary routing to `:18820` unless a later approved integration actually exists.
- Runbooks should include the CLI/HTTP smoke commands, `ss` listener checks, and NPU busy-time counter checks.
- Service maps should label this as "bounded background generation" rather than "chat" or "assistant model".

## Explicit no-go / defer criteria

No-go for implementation or promotion:

- Model path missing, OpenVINO GenAI import fails, or NPU device is unavailable.
- `/sys/class/accel/accel0/device/npu_busy_time_us` is unreadable or does not increase during generation.
- Warm bounded jobs exceed the prototype latency target or starve live Whisper/embedding services.
- The worker needs private documents/images/chat logs for smoke testing.
- The worker requires Atlas/Hermes/gateway/LiteLLM/RAG routing changes to demonstrate value.
- The API starts accepting arbitrary chat history, tool-call instructions, unbounded prompts, or large outputs.
- The service logs raw prompt bodies by default.
- Persistent service enablement is requested without an explicit Will approval gate and a reviewer smoke handoff.

Defer, do not solve in this lane:

- Primary assistant routing, LiteLLM model registration, gateway fallback, or tool-calling integration.
- RAG query rewriting, RAG answer generation, or collection mutation.
- Private document/image triage.
- Multi-model selection, CPU/GPU fallback policy, batching, streaming, or auth exposure beyond localhost.
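
For reference, the busy-time no-go criterion (and the before/after measurement in the verification plan it rests on) can be sketched as a small wrapper around a single generation call. This is an illustrative sketch, not the prototype's code: `read_busy_us`, `run_with_busy_delta`, and `npu_claim_ok` are hypothetical names, and the `generate` callable stands in for whatever performs exactly one bounded generation. The counter reader is injectable so the logic can be tested without NPU hardware.

```python
from pathlib import Path
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")

# Sysfs counter named in this document.
BUSY_COUNTER = Path("/sys/class/accel/accel0/device/npu_busy_time_us")


def read_busy_us(path: Path = BUSY_COUNTER) -> int:
    """Read the cumulative NPU busy time in microseconds.

    Raises OSError if the counter is missing or unreadable, which the
    no-go criteria treat as a failure in its own right.
    """
    return int(path.read_text().strip())


def run_with_busy_delta(
    generate: Callable[[], T],
    read_counter: Callable[[], int] = read_busy_us,
) -> Tuple[T, int]:
    """Run one generation and return (result, busy-time delta in microseconds)."""
    before = read_counter()   # immediately before the generation call
    result = generate()       # exactly one bounded generation
    after = read_counter()    # immediately after it completes
    return result, after - before


def npu_claim_ok(delta_us: int) -> bool:
    """An NPU claim holds only if the counter strictly increased."""
    return delta_us > 0
```

Because the NPU is shared, a positive delta from this wrapper shows activity during the window rather than exclusive attribution, so review-grade runs would still want a quiet window.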