From cb874f9743848208bed902d8ad36392f3187e6ef Mon Sep 17 00:00:00 2001
From: William Valentin
Date: Thu, 4 Jun 2026 12:06:28 -0700
Subject: [PATCH] docs: define OpenVINO GenAI NPU worker contract

---
 openvino-genai-npu-worker/CONTRACT.md | 299 ++++++++++++++++++++++++++
 openvino-genai-npu-worker/README.md   |   1 +
 2 files changed, 300 insertions(+)
 create mode 100644 openvino-genai-npu-worker/CONTRACT.md

diff --git a/openvino-genai-npu-worker/CONTRACT.md b/openvino-genai-npu-worker/CONTRACT.md
new file mode 100644
index 0000000..babebbb
--- /dev/null
+++ b/openvino-genai-npu-worker/CONTRACT.md
@@ -0,0 +1,299 @@
+# Bounded OpenVINO GenAI NPU worker contract
+
+Status: proposed prototype contract; not a live Atlas/Hermes routing dependency.
+Default address: `http://127.0.0.1:18820`.
+
+## Purpose and hard boundary
+
+This worker is a local-only sidecar for small, bounded generation jobs that are useful around the assistant stack but are not primary chat: title drafting, short summaries, notification condensation, and memory-candidate extraction. It must not be used as Atlas/Hermes primary model routing, gateway fallback routing, autonomous tool-calling, or an unbounded chat endpoint without a separate approval gate.
+
+Hard boundaries:
+
+- Bind to `127.0.0.1` by default; non-local bind is a code/ops review item, not a runtime flag to casually change.
+- Do not enable a persistent systemd/Docker service as part of smoke testing.
+- Do not restart or reconfigure Atlas, Hermes, gateway, LiteLLM, RAG, or n8n routing to call this worker without explicit approval from Will.
+- Do not write memory, mutate Chroma/vector collections, trigger RAG reindexing, or process private document/image directories.
+- Do not log raw prompts or raw request bodies by default.
+- Treat HTTP success as insufficient for NPU claims; require positive `/sys/class/accel/accel0/device/npu_busy_time_us` delta for generation.
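The default-bind boundary above could be backed by a small startup guard. The following is a minimal sketch under stated assumptions, not code taken from `worker.py`: `is_loopback_bind` is a hypothetical helper, and it deliberately accepts only literal loopback IP addresses, rejecting hostnames and wildcard binds.

```python
import ipaddress


def is_loopback_bind(host: str) -> bool:
    """Return True only for literal loopback addresses such as 127.0.0.1 or ::1.

    Hypothetical helper for illustration. Hostnames (including "localhost")
    are rejected because resolving them could yield a non-local address.
    """
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        # Not a literal IP address (for example a hostname): treat as non-local.
        return False


if __name__ == "__main__":
    for candidate in ("127.0.0.1", "::1", "0.0.0.0", "localhost"):
        print(candidate, is_loopback_bind(candidate))
```

A worker using a check like this would refuse to start on a non-loopback host instead of exposing a flag that silently widens the bind.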
+
+## Recommended model/runtime
+
+Recommended first model:
+
+- Model id: `OpenVINO/Qwen2.5-1.5B-Instruct-int4-ov`
+- Local path: `/home/will/models/openvino-genai/Qwen2.5-1.5B-Instruct-int4-ov`
+- Runtime: `/home/will/.venvs/npu` with `openvino-genai==2026.2.0.0`
+- Device: OpenVINO GenAI `NPU`
+- Compile cache: `/home/will/.cache/openvino/genai-npu/qwen2.5-1.5b-int4`
+
+Why this model/runtime:
+
+- It is already staged in the repo prototype and has a local smoke observation with positive NPU busy-time delta.
+- It is an OpenVINO IR model with INT4-compressed weights, which keeps memory/compile pressure low enough for a sidecar on the shared NPU.
+- Qwen2.5-1.5B-Instruct is large enough for formatting/summarization/notification jobs but small enough to keep latency bounded. It should not be marketed as a high-quality general assistant model.
+- The Hugging Face model card identifies it as Qwen2.5-1.5B-Instruct converted to OpenVINO IR with INT4_SYM NNCF weight compression and states compatibility with OpenVINO 2025.1.0+; the local runtime is newer than that baseline.
+- OpenVINO GenAI `LLMPipeline` is the right first runtime because the existing local NPU stack already uses OpenVINO GenAI successfully for Whisper, and it exposes a simple bounded generate call with cache controls.
+
+Deferred alternatives:
+
+- Larger 3B/7B local LLMs: defer until the 1.5B contract proves stable; larger models increase compile time, memory pressure, and NPU contention.
+- CPU/GPU fallback inside this service: defer; fallback would blur the NPU verification contract. If fallback is later approved, return `device_actual` and keep NPU-only health separate.
+- Manual `EXPORT_BLOB`/`BLOB_PATH`: defer until compile latency is proven to dominate despite `CACHE_DIR`. If used later, record OpenVINO version, NPU compiler/driver versions, model id, quantization flags, and source model path; invalidate after OpenVINO/NPU driver upgrades.
+
+## Runtime bounds
+
+Pipeline configuration for the first milestone:
+
+```text
+CACHE_DIR=/home/will/.cache/openvino/genai-npu/qwen2.5-1.5b-int4
+MAX_PROMPT_LEN=1024
+MIN_RESPONSE_LEN=64
+PREFILL_HINT=DYNAMIC
+GENERATE_HINT=FAST_COMPILE
+```
+
+Request bounds:
+
+- `input`: required non-empty string; max `6000` characters before prompt templating.
+- `job`: one of `title`, `summary`, `notification`, `memory_candidate`.
+- `max_new_tokens`: optional; default by job; hard max `256`.
+- Concurrency: generation must be serialized inside the process with a lock because the NPU is shared with Whisper/embeddings/prototype sidecars.
+- Logging: log method/path/status and timing only; never log raw `input` or generated text by default.
+
+Expected latency target:
+
+- Cold-ish first generation with cache available: acceptable if roughly 15 seconds or less for a short prompt on the staged model.
+- Warm short jobs: target under 5 seconds for `title`/`notification` and under 10 seconds for `summary`/`memory_candidate`.
+- Defer promotion if p95 warm latency exceeds 15 seconds for 24-96 generated tokens, or if cold compile regularly blocks the NPU long enough to degrade live Whisper/embeddings.
+
+These are prototype acceptance targets, not SLOs for live Atlas routing.
+
+## CLI contract
+
+Command shape:
+
+```bash
+cd /home/will/lab/swarm/openvino-genai-npu-worker
+/home/will/.venvs/npu/bin/python worker.py \
+  --job title \
+  --input 'Synthetic non-private text to title.' \
+  --max-new-tokens 32
+```
+
+CLI stdout is JSON with the same response shape as HTTP generation. Exit code must be:
+
+- `0` when the job succeeds and `npu_busy_delta_us > 0`.
+- non-zero when input validation fails, model load/generation fails, or NPU busy-time delta is not positive.
+
+The CLI must not write memory, change service routing, or start persistent services.
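The request bounds and the CLI exit rule above can be sketched as two small functions. This is an illustrative sketch, not the prototype's actual `worker.py`: `validate_request`, `cli_exit_code`, and the constant names are hypothetical. The per-job token defaults are the ones listed in the prompt/job contract, and treating whitespace-only input as empty is an assumption.

```python
from typing import Any, Dict, Mapping

MAX_INPUT_CHARS = 6000          # max characters before prompt templating
HARD_MAX_NEW_TOKENS = 256       # hard cap regardless of job
JOB_DEFAULT_TOKENS = {          # defaults from the prompt/job contract
    "title": 32,
    "summary": 160,
    "notification": 96,
    "memory_candidate": 192,
}


def validate_request(body: Mapping[str, Any]) -> Dict[str, Any]:
    """Return a normalized request dict, or raise ValueError on a bound violation."""
    job = body.get("job")
    if not isinstance(job, str) or job not in JOB_DEFAULT_TOKENS:
        raise ValueError("unsupported job")
    text = body.get("input")
    # Assumption: whitespace-only input counts as empty.
    if not isinstance(text, str) or not text.strip():
        raise ValueError("input must be a non-empty string")
    if len(text) > MAX_INPUT_CHARS:
        raise ValueError("input exceeds %d characters" % MAX_INPUT_CHARS)
    max_new_tokens = body.get("max_new_tokens", JOB_DEFAULT_TOKENS[job])
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(max_new_tokens, bool) or not isinstance(max_new_tokens, int):
        raise ValueError("max_new_tokens must be an integer")
    if not 1 <= max_new_tokens <= HARD_MAX_NEW_TOKENS:
        raise ValueError("max_new_tokens out of range")
    return {"job": job, "input": text, "max_new_tokens": max_new_tokens}


def cli_exit_code(response: Mapping[str, Any]) -> int:
    """Exit 0 only when the response reports a positive NPU busy-time delta."""
    return 0 if response.get("npu_busy_delta_us", 0) > 0 else 1
```

Keeping validation separate from generation means a rejected request never reaches the NPU, and the exit-code rule stays a pure function of the response JSON.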
+
+## HTTP contract
+
+Start temporary local server only:
+
+```bash
+cd /home/will/lab/swarm/openvino-genai-npu-worker
+/home/will/.venvs/npu/bin/python worker.py --host 127.0.0.1 --port 18820
+```
+
+Endpoints:
+
+```text
+GET /healthz
+GET /models
+POST /v1/worker/generate
+POST /v1/worker/extract-memory-candidates
+POST /v1/worker/condense-notification
+```
+
+`GET /healthz` response fields:
+
+```json
+{
+  "ok": true,
+  "model": "OpenVINO/Qwen2.5-1.5B-Instruct-int4-ov",
+  "model_path": "/home/will/models/openvino-genai/Qwen2.5-1.5B-Instruct-int4-ov",
+  "device": "NPU",
+  "cache_dir": "/home/will/.cache/openvino/genai-npu/qwen2.5-1.5b-int4",
+  "cache_exists": true,
+  "loaded": false,
+  "initial_load_ms": null,
+  "busy_time_us": 0,
+  "max_input_chars": 6000,
+  "jobs": ["memory_candidate", "notification", "summary", "title"],
+  "bind": "127.0.0.1:18820"
+}
+```
+
+`POST /v1/worker/generate` request:
+
+```json
+{
+  "job": "summary",
+  "input": "Synthetic non-private text to summarize.",
+  "max_new_tokens": 80
+}
+```
+
+Specialized aliases:
+
+- `POST /v1/worker/extract-memory-candidates` implies `job=memory_candidate`.
+- `POST /v1/worker/condense-notification` implies `job=notification`.
+- Backward-compatible request `job=memory` may map to `memory_candidate`, but new clients should use `memory_candidate`.
+
+Successful generation response:
+
+```json
+{
+  "model": "OpenVINO/Qwen2.5-1.5B-Instruct-int4-ov",
+  "device": "NPU",
+  "job": "summary",
+  "text": "...",
+  "json": null,
+  "timing_ms": {
+    "load": 0.0,
+    "initial_load": 10989.08,
+    "generate": 3157.94,
+    "total": 3157.94
+  },
+  "npu_busy_delta_us": 2650724,
+  "npu_busy_before_us": 123,
+  "npu_busy_after_us": 2650847,
+  "cache_dir": "/home/will/.cache/openvino/genai-npu/qwen2.5-1.5b-int4"
+}
+```
+
+Validation/error behavior:
+
+- Unsupported path: `404` JSON `{"error":"not found"}`.
+- Unsupported job, empty input, too-long input, invalid token bound, missing model, or generation failure: JSON `{"error":"..."}` with non-2xx preferred for future implementations. The current stdlib prototype returns `400` for these errors.
+- If `npu_busy_delta_us <= 0`, the response should be treated as failed by smoke tests even if an HTTP handler emitted `200`.
+
+## Prompt/job contract
+
+`title`:
+
+- Input: short task/log/message excerpt.
+- Output: one title, 8 words or fewer, no markdown required.
+- Default `max_new_tokens`: 32.
+
+`summary`:
+
+- Input: synthetic/non-private text excerpt.
+- Output: one short paragraph or up to 4 bullets.
+- Default `max_new_tokens`: 160.
+
+`notification`:
+
+- Input: synthetic/non-private alert/log excerpt.
+- Output target: JSON object with `severity`, `category`, `summary`, `action_needed`.
+- Default `max_new_tokens`: 96.
+- Client must tolerate `json: null` and parse/validate before using output.
+
+`memory_candidate`:
+
+- Input: synthetic/non-private conversation excerpt.
+- Output target: JSON object with `candidates` and `notes`; candidates are proposals only.
+- Default `max_new_tokens`: 192.
+- This worker must never call Hermes memory tools or write durable memory directly.
+
+## Smoke-test plan using non-private data
+
+Do not use private vault notes, screenshots, email, chat logs, or document/image directories. Use synthetic text like this:
+
+```text
+Atlas received a kanban notification that an OpenVINO NPU prototype finished smoke testing. The reviewer needs a concise status and next action. No live gateway routing changed.
+```
+
+Direct NPU smoke:
+
+```bash
+cd /home/will/lab/swarm/openvino-genai-npu-worker
+before=$(cat /sys/class/accel/accel0/device/npu_busy_time_us)
+/home/will/.venvs/npu/bin/python smoke_llm_npu.py \
+  --prompt 'Write a concise title for: synthetic NPU worker contract smoke.' \
+  --max-new-tokens 24
+status=$?
+after=$(cat /sys/class/accel/accel0/device/npu_busy_time_us)
+printf 'external_busy_delta_us=%s\n' "$((after-before))"
+test "$status" -eq 0
+test "$((after-before))" -gt 0
+```
+
+Temporary HTTP smoke:
+
+```bash
+cd /home/will/lab/swarm/openvino-genai-npu-worker
+/home/will/.venvs/npu/bin/python worker.py --host 127.0.0.1 --port 18820 &
+pid=$!
+trap 'kill "$pid" 2>/dev/null || true' EXIT
+
+curl -fsS http://127.0.0.1:18820/healthz | python -m json.tool
+before=$(cat /sys/class/accel/accel0/device/npu_busy_time_us)
+curl -fsS http://127.0.0.1:18820/v1/worker/generate \
+  -H 'Content-Type: application/json' \
+  -d '{"job":"title","input":"Synthetic NPU worker smoke with no routing changes.","max_new_tokens":24}' \
+  | tee /tmp/openvino-genai-worker-smoke.json \
+  | python -m json.tool
+after=$(cat /sys/class/accel/accel0/device/npu_busy_time_us)
+python - <<'PY'
+import json
+p=json.load(open('/tmp/openvino-genai-worker-smoke.json'))
+assert p['npu_busy_delta_us'] > 0, p
+assert p['device'] == 'NPU', p
+PY
+test "$((after-before))" -gt 0
+kill "$pid"
+trap - EXIT
+```
+
+Also verify the temporary listener is gone:
+
+```bash
+ss -ltnp | grep ':18820' && { echo 'temporary smoke server still running'; exit 1; } || true
+```
+
+## NPU busy-time verification plan
+
+Acceptance for any NPU claim requires all of the following:
+
+1. Confirm the sysfs counter exists and is readable:
+   `test -r /sys/class/accel/accel0/device/npu_busy_time_us`.
+2. Read `busy_before` immediately before the generation call.
+3. Run exactly one bounded generation against the candidate worker.
+4. Read `busy_after` immediately after generation completes.
+5. Require `busy_after > busy_before` and response `npu_busy_delta_us > 0`.
+6. Record model id, runtime version, prompt chars, max tokens, load/generate timings, and busy delta in the review handoff.
+7. If the counter is unchanged, mark the smoke as failed even if HTTP returned `200` and text was generated.
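Steps 1 through 5 and step 7 of the list above can be sketched as a wrapper around a single generation call. This is a minimal illustration rather than the logic in `worker.py` or `smoke_llm_npu.py`: `read_busy_us` and `run_with_busy_check` are hypothetical names, and the `generate` callable stands in for whatever performs the one bounded generation.

```python
from pathlib import Path
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")

# Sysfs counter named in the contract.
BUSY_COUNTER = Path("/sys/class/accel/accel0/device/npu_busy_time_us")


def read_busy_us(path: Path = BUSY_COUNTER) -> int:
    """Read the cumulative NPU busy time in microseconds (raises if unreadable)."""
    return int(path.read_text().strip())


def run_with_busy_check(
    generate: Callable[[], T], path: Path = BUSY_COUNTER
) -> Tuple[T, int]:
    """Run one generation and require a positive busy-time delta around it."""
    before = read_busy_us(path)   # step 2: read immediately before
    result = generate()           # step 3: exactly one bounded generation
    after = read_busy_us(path)    # step 4: read immediately after
    delta = after - before
    if delta <= 0:                # steps 5 and 7: unchanged counter is a failure
        raise RuntimeError(
            "NPU busy time did not increase (before=%d, after=%d)" % (before, after)
        )
    return result, delta
```

Taking the counter path as a parameter keeps the check testable against a plain file. As with the shell smoke, a positive delta measured this way shows NPU activity during the window, not exclusive attribution to the worker.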
+
+Because the NPU is shared, a positive external delta proves NPU activity during the window but not exclusive attribution. Prefer a quiet window with no concurrent Whisper/embedding jobs for review-grade measurements; otherwise repeat and compare worker-reported internal delta with the external counter.
+
+## Docs/diagram implications
+
+If this worker is kept as a prototype, docs and diagrams should show:
+
+- Live baseline remains RAG `:18810`, Whisper NPU `:18816`, embeddings `:18817`.
+- GenAI worker `:18820` is proposed/prototype/not-live unless explicitly approved and enabled.
+- No arrow from Hermes/Atlas gateway or LiteLLM primary routing to `:18820` unless a later approved integration actually exists.
+- Runbooks should include the CLI/HTTP smoke commands, `ss` listener checks, and NPU busy-time counter checks.
+- Service maps should label this as "bounded background generation" rather than "chat" or "assistant model".
+
+## Explicit no-go / defer criteria
+
+No-go for implementation or promotion:
+
+- Model path missing, OpenVINO GenAI import fails, or NPU device is unavailable.
+- `/sys/class/accel/accel0/device/npu_busy_time_us` is unreadable or does not increase during generation.
+- Warm bounded jobs exceed the prototype latency target or starve live Whisper/embedding services.
+- The worker needs private documents/images/chat logs for smoke testing.
+- The worker requires Atlas/Hermes/gateway/LiteLLM/RAG routing changes to demonstrate value.
+- The API starts accepting arbitrary chat history, tool-call instructions, unbounded prompts, or large outputs.
+- The service logs raw prompt bodies by default.
+- Persistent service enablement is requested without an explicit Will approval gate and a reviewer smoke handoff.
+
+Defer, do not solve in this lane:
+
+- Primary assistant routing, LiteLLM model registration, gateway fallback, or tool-calling integration.
+- RAG query rewriting, RAG answer generation, or collection mutation.
+- Private document/image triage.
+- Multi-model selection, CPU/GPU fallback policy, batching, streaming, or auth exposure beyond localhost.
diff --git a/openvino-genai-npu-worker/README.md b/openvino-genai-npu-worker/README.md
index c7b241b..1baf519 100644
--- a/openvino-genai-npu-worker/README.md
+++ b/openvino-genai-npu-worker/README.md
@@ -15,6 +15,7 @@ The worker does not write memory, does not restart Atlas/Hermes, does not change
 
 ## Files
 
+- `CONTRACT.md` — bounded-worker service contract, endpoint/CLI API, smoke plan, NPU verification, docs implications, and no-go criteria.
 - `worker.py` — stdlib HTTP API plus CLI wrapper.
 - `smoke_llm_npu.py` — direct GenAI smoke test with NPU busy-time verification.
 - `systemd/openvino-genai-npu-worker.service` — optional user-service template; not installed by this prototype.