docs: define OpenVINO GenAI NPU worker contract
@@ -0,0 +1,299 @@
# Bounded OpenVINO GenAI NPU worker contract

Status: proposed prototype contract; not a live Atlas/Hermes routing dependency.
Default address: `http://127.0.0.1:18820`.

## Purpose and hard boundary

This worker is a local-only sidecar for small, bounded generation jobs that are useful around the assistant stack but are not primary chat: title drafting, short summaries, notification condensation, and memory-candidate extraction. It must not be used as Atlas/Hermes primary model routing, gateway fallback routing, autonomous tool-calling, or an unbounded chat endpoint without a separate approval gate.

Hard boundaries:

- Bind to `127.0.0.1` by default; non-local bind is a code/ops review item, not a runtime flag to casually change.
- Do not enable a persistent systemd/Docker service as part of smoke testing.
- Do not restart or reconfigure Atlas, Hermes, gateway, LiteLLM, RAG, or n8n routing to call this worker without explicit approval from Will.
- Do not write memory, mutate Chroma/vector collections, trigger RAG reindexing, or process private document/image directories.
- Do not log raw prompts or raw request bodies by default.
- Treat HTTP success as insufficient for NPU claims; require positive `/sys/class/accel/accel0/device/npu_busy_time_us` delta for generation.

## Recommended model/runtime

Recommended first model:

- Model id: `OpenVINO/Qwen2.5-1.5B-Instruct-int4-ov`
- Local path: `/home/will/models/openvino-genai/Qwen2.5-1.5B-Instruct-int4-ov`
- Runtime: `/home/will/.venvs/npu` with `openvino-genai==2026.2.0.0`
- Device: OpenVINO GenAI `NPU`
- Compile cache: `/home/will/.cache/openvino/genai-npu/qwen2.5-1.5b-int4`

Why this model/runtime:

- It is already staged in the repo prototype and has a local smoke observation with positive NPU busy-time delta.
- It is an OpenVINO IR model with INT4-compressed weights, which keeps memory/compile pressure low enough for a sidecar on the shared NPU.
- Qwen2.5-1.5B-Instruct is large enough for formatting/summarization/notification jobs but small enough to keep latency bounded. It should not be marketed as a high-quality general assistant model.
- The Hugging Face model card identifies it as Qwen2.5-1.5B-Instruct converted to OpenVINO IR with INT4_SYM NNCF weight compression and states compatibility with OpenVINO 2025.1.0+; the local runtime is newer than that baseline.
- OpenVINO GenAI `LLMPipeline` is the right first runtime because the existing local NPU stack already uses OpenVINO GenAI successfully for Whisper, and it exposes a simple bounded generate call with cache controls.

Deferred alternatives:

- Larger 3B/7B local LLMs: defer until the 1.5B contract proves stable; larger models increase compile time, memory pressure, and NPU contention.
- CPU/GPU fallback inside this service: defer; fallback would blur the NPU verification contract. If fallback is later approved, return `device_actual` and keep NPU-only health separate.
- Manual `EXPORT_BLOB`/`BLOB_PATH`: defer until compile latency is proven to dominate despite `CACHE_DIR`. If used later, record OpenVINO version, NPU compiler/driver versions, model id, quantization flags, and source model path; invalidate after OpenVINO/NPU driver upgrades.

## Runtime bounds

Pipeline configuration for the first milestone:

```text
CACHE_DIR=/home/will/.cache/openvino/genai-npu/qwen2.5-1.5b-int4
MAX_PROMPT_LEN=1024
MIN_RESPONSE_LEN=64
PREFILL_HINT=DYNAMIC
GENERATE_HINT=FAST_COMPILE
```
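
As a rough illustration of how the configuration above might reach the runtime, the sketch below passes it to OpenVINO GenAI as a properties dictionary. The helper name `load_pipeline` and the module-level constants are hypothetical and are not taken from `worker.py`; the import is deferred so the module can be inspected on a machine without the NPU runtime.

```python
from pathlib import Path

MODEL_PATH = Path("/home/will/models/openvino-genai/Qwen2.5-1.5B-Instruct-int4-ov")

# Values copied from the first-milestone configuration in this contract.
PIPELINE_CONFIG = {
    "CACHE_DIR": "/home/will/.cache/openvino/genai-npu/qwen2.5-1.5b-int4",
    "MAX_PROMPT_LEN": 1024,
    "MIN_RESPONSE_LEN": 64,
    "PREFILL_HINT": "DYNAMIC",
    "GENERATE_HINT": "FAST_COMPILE",
}


def load_pipeline(model_path=MODEL_PATH, config=PIPELINE_CONFIG):
    """Hypothetical helper: build an NPU LLMPipeline from the config above."""
    # Imported lazily so importing this module does not require the NPU stack.
    import openvino_genai as ov_genai

    # Assumption: the properties are passed as a dict, as in the OpenVINO
    # GenAI NPU examples; the real worker may wire this differently.
    return ov_genai.LLMPipeline(str(model_path), "NPU", dict(config))
```

A caller would then run something like `load_pipeline().generate(prompt, max_new_tokens=32)`, always with an explicit token bound.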

Request bounds:

- `input`: required non-empty string; max `6000` characters before prompt templating.
- `job`: one of `title`, `summary`, `notification`, `memory_candidate`.
- `max_new_tokens`: optional; default by job; hard max `256`.
- Concurrency: generation must be serialized inside the process with a lock because the NPU is shared with Whisper/embeddings/prototype sidecars.
- Logging: log method/path/status and timing only; never log raw `input` or generated text by default.
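
The request bounds above could be enforced with a small validator along these lines. This is a minimal sketch, not the prototype's code: `validate_request` is a hypothetical name, the per-job defaults are the ones listed in the prompt/job contract, and treating whitespace-only input as empty and `1` as the lower token bound are assumptions.

```python
MAX_INPUT_CHARS = 6000
HARD_MAX_NEW_TOKENS = 256

# Per-job defaults as listed in the prompt/job contract.
DEFAULT_MAX_NEW_TOKENS = {
    "title": 32,
    "summary": 160,
    "notification": 96,
    "memory_candidate": 192,
}


def validate_request(payload):
    """Return (job, text, max_new_tokens) or raise ValueError."""
    job = payload.get("job")
    if not isinstance(job, str) or job not in DEFAULT_MAX_NEW_TOKENS:
        raise ValueError(f"unsupported job: {job!r}")

    text = payload.get("input")
    # Assumption: whitespace-only input counts as empty.
    if not isinstance(text, str) or not text.strip():
        raise ValueError("input must be a non-empty string")
    if len(text) > MAX_INPUT_CHARS:
        raise ValueError(f"input exceeds {MAX_INPUT_CHARS} characters")

    max_new_tokens = payload.get("max_new_tokens", DEFAULT_MAX_NEW_TOKENS[job])
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(max_new_tokens, bool) or not isinstance(max_new_tokens, int):
        raise ValueError("max_new_tokens must be an integer")
    # Assumption: at least one token must be requested.
    if not 1 <= max_new_tokens <= HARD_MAX_NEW_TOKENS:
        raise ValueError(f"max_new_tokens must be between 1 and {HARD_MAX_NEW_TOKENS}")

    return job, text, max_new_tokens
```

The error messages deliberately do not echo the raw `input`, which keeps them safe to log.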

Expected latency targets:

- First generation after process start, with the compile cache already populated: acceptable if roughly 15 seconds or less for a short prompt on the staged model.
- Warm short jobs: target under 5 seconds for `title`/`notification` and under 10 seconds for `summary`/`memory_candidate`.
- Defer promotion if p95 warm latency exceeds 15 seconds for 24-96 generated tokens, or if cold compile regularly blocks the NPU long enough to degrade live Whisper/embeddings.

These are prototype acceptance targets, not SLOs for live Atlas routing.
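
One way a reviewer might turn the latency targets into a repeatable check is sketched below. The function names and verdict strings are hypothetical. The contract does not say which statistic the 5 and 10 second targets apply to, so comparing them against the nearest-rank p95 is an assumption.

```python
# Warm targets in seconds, per job, from the list above.
WARM_TARGET_S = {
    "title": 5.0,
    "notification": 5.0,
    "summary": 10.0,
    "memory_candidate": 10.0,
}
P95_DEFER_THRESHOLD_S = 15.0


def p95(samples):
    """Nearest-rank 95th percentile of a non-empty list of latencies."""
    ordered = sorted(samples)
    # Integer ceiling of 0.95 * n, avoiding floating-point rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def latency_verdict(job, warm_latencies_s):
    """Classify warm latency samples (seconds) for one job type."""
    value = p95(warm_latencies_s)
    if value > P95_DEFER_THRESHOLD_S:
        return "defer"
    # Assumption: the per-job target is checked against p95, not the mean.
    if value < WARM_TARGET_S[job]:
        return "meets_target"
    return "above_target"
```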

## CLI contract

Command shape:

```bash
cd /home/will/lab/swarm/openvino-genai-npu-worker
/home/will/.venvs/npu/bin/python worker.py \
  --job title \
  --input 'Synthetic non-private text to title.' \
  --max-new-tokens 32
```

CLI stdout is JSON with the same response shape as HTTP generation. Exit code must be:

- `0` when the job succeeds and `npu_busy_delta_us > 0`.
- non-zero when input validation fails, model load/generation fails, or NPU busy-time delta is not positive.

The CLI must not write memory, change service routing, or start persistent services.
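
The exit-code rule above can be expressed as a small mapping from the response object to a process status. This is a sketch under assumptions: `exit_code_for` is a hypothetical helper, and the specific non-zero values `1` and `2` are illustrative, since the contract only requires "non-zero".

```python
import json
import sys


def exit_code_for(response):
    """Map a worker response dict to a CLI exit code."""
    # Validation, load, and generation failures carry an "error" field.
    if "error" in response:
        return 1
    delta = response.get("npu_busy_delta_us")
    # A successful job without positive NPU busy time still fails the CLI.
    if not isinstance(delta, int) or delta <= 0:
        return 2
    return 0


def emit(response):
    """Print the response as JSON on stdout and return the exit code."""
    json.dump(response, sys.stdout)
    sys.stdout.write("\n")
    return exit_code_for(response)
```

A `main()` would end with `sys.exit(emit(response))`, so the JSON is always printed even when the exit code is non-zero.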

## HTTP contract

Start temporary local server only:

```bash
cd /home/will/lab/swarm/openvino-genai-npu-worker
/home/will/.venvs/npu/bin/python worker.py --host 127.0.0.1 --port 18820
```

Endpoints:

```text
GET /healthz
GET /models
POST /v1/worker/generate
POST /v1/worker/extract-memory-candidates
POST /v1/worker/condense-notification
```

`GET /healthz` response fields:

```json
{
  "ok": true,
  "model": "OpenVINO/Qwen2.5-1.5B-Instruct-int4-ov",
  "model_path": "/home/will/models/openvino-genai/Qwen2.5-1.5B-Instruct-int4-ov",
  "device": "NPU",
  "cache_dir": "/home/will/.cache/openvino/genai-npu/qwen2.5-1.5b-int4",
  "cache_exists": true,
  "loaded": false,
  "initial_load_ms": null,
  "busy_time_us": 0,
  "max_input_chars": 6000,
  "jobs": ["memory_candidate", "notification", "summary", "title"],
  "bind": "127.0.0.1:18820"
}
```

`POST /v1/worker/generate` request:

```json
{
  "job": "summary",
  "input": "Synthetic non-private text to summarize.",
  "max_new_tokens": 80
}
```
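
For callers written in Python, the request above could be built with the standard library alone. The sketch below is illustrative: `build_generate_request` and `generate` are hypothetical names, and the 30 second timeout is an assumption rather than a contract value.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:18820"


def build_generate_request(job, text, max_new_tokens=None, base_url=BASE_URL):
    """Build the POST request for /v1/worker/generate without sending it."""
    body = {"job": job, "input": text}
    # max_new_tokens is optional; omit it to get the per-job default.
    if max_new_tokens is not None:
        body["max_new_tokens"] = max_new_tokens
    return urllib.request.Request(
        f"{base_url}/v1/worker/generate",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(job, text, max_new_tokens=None, timeout_s=30.0):
    """Send one bounded generation job and return the decoded JSON body."""
    request = build_generate_request(job, text, max_new_tokens)
    # urlopen raises urllib.error.HTTPError for non-2xx responses.
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return json.load(response)
```

Keeping request construction separate from sending makes the payload easy to inspect without a running worker.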

Specialized aliases:

- `POST /v1/worker/extract-memory-candidates` implies `job=memory_candidate`.
- `POST /v1/worker/condense-notification` implies `job=notification`.
- Backward-compatible request `job=memory` may map to `memory_candidate`, but new clients should use `memory_candidate`.

Successful generation response:

```json
{
  "model": "OpenVINO/Qwen2.5-1.5B-Instruct-int4-ov",
  "device": "NPU",
  "job": "summary",
  "text": "...",
  "json": null,
  "timing_ms": {
    "load": 0.0,
    "initial_load": 10989.08,
    "generate": 3157.94,
    "total": 3157.94
  },
  "npu_busy_delta_us": 2650724,
  "npu_busy_before_us": 123,
  "npu_busy_after_us": 2650847,
  "cache_dir": "/home/will/.cache/openvino/genai-npu/qwen2.5-1.5b-int4"
}
```

Validation/error behavior:

- Unsupported path: `404` JSON `{"error":"not found"}`.
- Unsupported job, empty input, too-long input, invalid token bound, missing model, or generation failure: JSON `{"error":"..."}` with a non-2xx status. The current stdlib prototype returns `400` for all of these errors; future implementations should also keep them non-2xx.
- If `npu_busy_delta_us <= 0`, smoke tests should treat the response as failed even if the HTTP handler returned `200`.

## Prompt/job contract

`title`:

- Input: short task/log/message excerpt.
- Output: one title, 8 words or fewer, no markdown required.
- Default `max_new_tokens`: 32.

`summary`:

- Input: synthetic/non-private text excerpt.
- Output: one short paragraph or up to 4 bullets.
- Default `max_new_tokens`: 160.

`notification`:

- Input: synthetic/non-private alert/log excerpt.
- Output target: JSON object with `severity`, `category`, `summary`, `action_needed`.
- Default `max_new_tokens`: 96.
- Client must tolerate `json: null` and parse/validate before using output.
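
A client that follows the `notification` rule above might look roughly like this. The helper name `parse_notification` is hypothetical, and falling back to parsing the `text` field when `json` is `null` is an assumption about how a client would recover, not documented worker behavior.

```python
import json

REQUIRED_KEYS = ("severity", "category", "summary", "action_needed")


def parse_notification(response):
    """Return the validated notification dict, or None if unusable."""
    candidate = response.get("json")
    if candidate is None:
        # Assumption: try the raw text as JSON before giving up.
        try:
            candidate = json.loads(response.get("text") or "")
        except (json.JSONDecodeError, TypeError):
            return None
    if not isinstance(candidate, dict):
        return None
    # Reject objects that are missing any of the contract's target fields.
    if any(key not in candidate for key in REQUIRED_KEYS):
        return None
    return candidate
```

Returning `None` rather than raising keeps the caller's fallback path explicit: a small model will sometimes emit text that is not valid JSON.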

`memory_candidate`:

- Input: synthetic/non-private conversation excerpt.
- Output target: JSON object with `candidates` and `notes`; candidates are proposals only.
- Default `max_new_tokens`: 192.
- This worker must never call Hermes memory tools or write durable memory directly.

## Smoke-test plan using non-private data

Do not use private vault notes, screenshots, email, chat logs, or document/image directories. Use synthetic text like this:

```text
Atlas received a kanban notification that an OpenVINO NPU prototype finished smoke testing. The reviewer needs a concise status and next action. No live gateway routing changed.
```

Direct NPU smoke:

```bash
cd /home/will/lab/swarm/openvino-genai-npu-worker
before=$(cat /sys/class/accel/accel0/device/npu_busy_time_us)
/home/will/.venvs/npu/bin/python smoke_llm_npu.py \
  --prompt 'Write a concise title for: synthetic NPU worker contract smoke.' \
  --max-new-tokens 24
status=$?
after=$(cat /sys/class/accel/accel0/device/npu_busy_time_us)
printf 'external_busy_delta_us=%s\n' "$((after-before))"
test "$status" -eq 0
test "$((after-before))" -gt 0
```

Temporary HTTP smoke:

```bash
cd /home/will/lab/swarm/openvino-genai-npu-worker
/home/will/.venvs/npu/bin/python worker.py --host 127.0.0.1 --port 18820 &
pid=$!
trap 'kill "$pid" 2>/dev/null || true' EXIT

# The server was just started in the background; retry until it accepts connections.
curl -fsS --retry 10 --retry-delay 1 --retry-connrefused \
  http://127.0.0.1:18820/healthz | python -m json.tool
before=$(cat /sys/class/accel/accel0/device/npu_busy_time_us)
curl -fsS http://127.0.0.1:18820/v1/worker/generate \
  -H 'Content-Type: application/json' \
  -d '{"job":"title","input":"Synthetic NPU worker smoke with no routing changes.","max_new_tokens":24}' \
  | tee /tmp/openvino-genai-worker-smoke.json \
  | python -m json.tool
after=$(cat /sys/class/accel/accel0/device/npu_busy_time_us)
python - <<'PY'
import json
p=json.load(open('/tmp/openvino-genai-worker-smoke.json'))
assert p['npu_busy_delta_us'] > 0, p
assert p['device'] == 'NPU', p
PY
test "$((after-before))" -gt 0
kill "$pid"
# Wait for the server to exit so the listener check is not racing shutdown.
wait "$pid" 2>/dev/null || true
trap - EXIT
```

Also verify the temporary listener is gone:

```bash
ss -ltnp | grep ':18820' && { echo 'temporary smoke server still running'; exit 1; } || true
```

## NPU busy-time verification plan

Acceptance for any NPU claim requires all of the following:

1. Confirm the sysfs counter exists and is readable:
   `test -r /sys/class/accel/accel0/device/npu_busy_time_us`.
2. Read `busy_before` immediately before the generation call.
3. Run exactly one bounded generation against the candidate worker.
4. Read `busy_after` immediately after generation completes.
5. Require `busy_after > busy_before` and response `npu_busy_delta_us > 0`.
6. Record model id, runtime version, prompt chars, max tokens, load/generate timings, and busy delta in the review handoff.
7. If the counter is unchanged, mark the smoke as failed even if HTTP returned `200` and text was generated.

Because the NPU is shared, a positive external delta proves NPU activity during the window but not exclusive attribution. Prefer a quiet window with no concurrent Whisper/embedding jobs for review-grade measurements; otherwise repeat and compare worker-reported internal delta with the external counter.
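
Inside the worker, the before/after reads from steps 2 to 5 could be wrapped around the generation call while holding the serialization lock. The sketch below assumes the counter file contains a single integer; `run_with_busy_delta` and the `npu_verified` field are hypothetical names, not part of the response shape defined in this contract.

```python
import threading
from pathlib import Path

BUSY_COUNTER = Path("/sys/class/accel/accel0/device/npu_busy_time_us")

# One lock per process so only one generation touches the NPU at a time.
_generation_lock = threading.Lock()


def read_busy_us(counter_path=BUSY_COUNTER):
    """Read the cumulative NPU busy time in microseconds."""
    return int(Path(counter_path).read_text().strip())


def run_with_busy_delta(generate, counter_path=BUSY_COUNTER):
    """Run generate() under the lock and report the busy-time delta."""
    with _generation_lock:
        before = read_busy_us(counter_path)
        result = generate()
        after = read_busy_us(counter_path)
    delta = after - before
    return {
        "result": result,
        "npu_busy_before_us": before,
        "npu_busy_after_us": after,
        "npu_busy_delta_us": delta,
        # Hypothetical convenience flag: true only for a positive delta.
        "npu_verified": delta > 0,
    }
```

Reading the counter inside the lock narrows the measurement window to this worker's own call, although other NPU users such as Whisper can still contribute to the delta.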

## Docs/diagram implications

If this worker is kept as a prototype, docs and diagrams should show:

- Live baseline remains RAG `:18810`, Whisper NPU `:18816`, embeddings `:18817`.
- GenAI worker `:18820` is proposed/prototype/not-live unless explicitly approved and enabled.
- No arrow from Hermes/Atlas gateway or LiteLLM primary routing to `:18820` unless a later approved integration actually exists.
- Runbooks should include the CLI/HTTP smoke commands, `ss` listener checks, and NPU busy-time counter checks.
- Service maps should label this as "bounded background generation" rather than "chat" or "assistant model".

## Explicit no-go / defer criteria

No-go for implementation or promotion:

- Model path missing, OpenVINO GenAI import fails, or NPU device is unavailable.
- `/sys/class/accel/accel0/device/npu_busy_time_us` is unreadable or does not increase during generation.
- Warm bounded jobs exceed the prototype latency target or starve live Whisper/embedding services.
- The worker needs private documents/images/chat logs for smoke testing.
- The worker requires Atlas/Hermes/gateway/LiteLLM/RAG routing changes to demonstrate value.
- The API starts accepting arbitrary chat history, tool-call instructions, unbounded prompts, or large outputs.
- The service logs raw prompt bodies by default.
- Persistent service enablement is requested without an explicit Will approval gate and a reviewer smoke handoff.

Defer, do not solve in this lane:

- Primary assistant routing, LiteLLM model registration, gateway fallback, or tool-calling integration.
- RAG query rewriting, RAG answer generation, or collection mutation.
- Private document/image triage.
- Multi-model selection, CPU/GPU fallback policy, batching, streaming, or auth exposure beyond localhost.
@@ -15,6 +15,7 @@ The worker does not write memory, does not restart Atlas/Hermes, does not change

## Files

- `CONTRACT.md` — bounded-worker service contract, endpoint/CLI API, smoke plan, NPU verification, docs implications, and no-go criteria.
- `worker.py` — stdlib HTTP API plus CLI wrapper.
- `smoke_llm_npu.py` — direct GenAI smoke test with NPU busy-time verification.
- `systemd/openvino-genai-npu-worker.service` — optional user-service template; not installed by this prototype.