feat(npu): add bounded OpenVINO GenAI worker

William Valentin
2026-06-04 13:07:51 -07:00
parent d3373e7234
commit 2ef9e3dfd2
7 changed files with 972 additions and 0 deletions
@@ -0,0 +1,306 @@
# Bounded OpenVINO GenAI NPU worker contract
Status: prototype contract implemented locally; not a live Atlas/Hermes routing dependency.
Default address: `http://127.0.0.1:18820`.
## Purpose and hard boundary
This worker is a local-only sidecar for small, bounded generation jobs that are useful around the assistant stack but are not primary chat: title drafting, short summaries, notification condensation, and memory-candidate extraction. It must not be used as Atlas/Hermes primary model routing, gateway fallback routing, autonomous tool-calling, or an unbounded chat endpoint without a separate approval gate.
Hard boundaries:
- Bind to `127.0.0.1` by default; non-local bind is a code/ops review item, not a runtime flag to casually change.
- Do not enable a persistent systemd/Docker service as part of smoke testing.
- Do not restart or reconfigure Atlas, Hermes, gateway, LiteLLM, RAG, or n8n routing to call this worker without explicit approval from Will.
- Do not write memory, mutate Chroma/vector collections, trigger RAG reindexing, or process private document/image directories.
- Do not log raw prompts or raw request bodies by default.
- Treat HTTP success as insufficient for NPU claims; require positive `/sys/class/accel/accel0/device/npu_busy_time_us` delta for generation.
## Recommended model/runtime
Recommended first model:
- Model id: `OpenVINO/Qwen2.5-1.5B-Instruct-int4-ov`
- Local path: `/home/will/models/openvino-genai/Qwen2.5-1.5B-Instruct-int4-ov`
- Runtime: `/home/will/.venvs/npu` with `openvino-genai==2026.2.0.0`
- Device: OpenVINO GenAI `NPU`
- Compile cache: `/home/will/.cache/openvino/genai-npu/qwen2.5-1.5b-int4`
Why this model/runtime:
- It is already staged in the repo prototype and has a local smoke observation with positive NPU busy-time delta.
- It is an OpenVINO IR model with INT4-compressed weights, which keeps memory/compile pressure low enough for a sidecar on the shared NPU.
- Qwen2.5-1.5B-Instruct is large enough for formatting/summarization/notification jobs but small enough to keep latency bounded. It should not be marketed as a high-quality general assistant model.
- The Hugging Face model card identifies it as Qwen2.5-1.5B-Instruct converted to OpenVINO IR with INT4_SYM NNCF weight compression and states compatibility with OpenVINO 2025.1.0+; the local runtime is newer than that baseline.
- OpenVINO GenAI `LLMPipeline` is the right first runtime because the existing local NPU stack already uses OpenVINO GenAI successfully for Whisper, and it exposes a simple bounded generate call with cache controls.
Deferred alternatives:
- Larger 3B/7B local LLMs: defer until the 1.5B contract proves stable; larger models increase compile time, memory pressure, and NPU contention.
- CPU/GPU fallback inside this service: defer; fallback would blur the NPU verification contract. If fallback is later approved, return `device_actual` and keep NPU-only health separate.
- Manual `EXPORT_BLOB`/`BLOB_PATH`: defer until compile latency is proven to dominate despite `CACHE_DIR`. If used later, record OpenVINO version, NPU compiler/driver versions, model id, quantization flags, and source model path; invalidate after OpenVINO/NPU driver upgrades.
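The invalidation rule for manual blobs could be sketched as a fingerprint over the recorded metadata. This is only an illustration: the helper names `blob_fingerprint` and `blob_is_stale`, the metadata field names, and the placeholder compiler/driver values are hypothetical and not part of the prototype.

```python
import hashlib
import json


def blob_fingerprint(meta: dict[str, str]) -> str:
    """Hash the recorded build metadata into a stable fingerprint."""
    canonical = json.dumps(meta, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def blob_is_stale(recorded: dict[str, str], current: dict[str, str]) -> bool:
    """A blob is stale when any recorded field differs from the current environment."""
    return blob_fingerprint(recorded) != blob_fingerprint(current)


# Metadata written next to the blob at export time (field names are assumptions).
recorded = {
    "openvino_genai_version": "2026.2.0.0",
    "npu_compiler_version": "placeholder-compiler",  # placeholder, not a real version
    "npu_driver_version": "placeholder-driver-1",  # placeholder, not a real version
    "model_id": "OpenVINO/Qwen2.5-1.5B-Instruct-int4-ov",
    "quantization": "INT4_SYM",
    "source_model_path": "/home/will/models/openvino-genai/Qwen2.5-1.5B-Instruct-int4-ov",
}
# Same environment after a hypothetical driver upgrade.
current = dict(recorded, npu_driver_version="placeholder-driver-2")

print(blob_is_stale(recorded, recorded))
print(blob_is_stale(recorded, current))
```

Hashing the whole record means any single changed field invalidates the blob, which matches the "invalidate after upgrades" intent without needing per-field comparison logic.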
## Runtime bounds
Pipeline configuration for the first milestone:
```text
CACHE_DIR=/home/will/.cache/openvino/genai-npu/qwen2.5-1.5b-int4
MAX_PROMPT_LEN=1024
MIN_RESPONSE_LEN=64
PREFILL_HINT=DYNAMIC
GENERATE_HINT=FAST_COMPILE
```
Request bounds:
- `input`: required non-empty string; max `6000` characters before prompt templating.
- `job`: one of `title`, `summary`, `notification`, `memory_candidate`.
- `max_new_tokens`: optional; default by job; hard max `256`.
- Concurrency: generation must be serialized inside the process with a lock because the NPU is shared with Whisper/embeddings/prototype sidecars.
- Logging: log method/path/status and timing only; never log raw `input` or generated text by default.
Expected latency target:
- Cold-ish first generation with cache available: acceptable if roughly 15 seconds or less for a short prompt on the staged model.
- Warm short jobs: target under 5 seconds for `title`/`notification` and under 10 seconds for `summary`/`memory_candidate`.
- Defer promotion if p95 warm latency exceeds 15 seconds for 24-96 generated tokens, or if cold compile regularly blocks the NPU long enough to degrade live Whisper/embeddings.
These are prototype acceptance targets, not SLOs for live Atlas routing.
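One way to evaluate the p95 deferral criterion from a batch of warm `timing_ms.generate` samples is sketched below. It assumes the nearest-rank percentile definition, which is only one of several common conventions; the function names are hypothetical.

```python
def p95(samples_ms: list[float]) -> float:
    """Nearest-rank 95th percentile of latency samples in milliseconds."""
    if not samples_ms:
        raise ValueError("need at least one sample")
    ordered = sorted(samples_ms)
    # 1-based nearest rank, computed as ceil(0.95 * n) with integer arithmetic.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def should_defer(warm_generate_ms: list[float], limit_ms: float = 15_000.0) -> bool:
    """True when warm p95 latency exceeds the prototype deferral limit."""
    return p95(warm_generate_ms) > limit_ms
```

Samples should come from warm runs only, since the first generation includes model load and compile time.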
## CLI contract
Command shape:
```bash
cd /home/will/lab/swarm/openvino-genai-npu-worker
/home/will/.venvs/npu/bin/python worker.py \
--job title \
--input 'Synthetic non-private text to title.' \
--max-new-tokens 32
```
CLI stdout is JSON with the same response shape as HTTP generation. Exit code must be:
- `0` when the job succeeds and `npu_busy_delta_us > 0`.
- non-zero when input validation fails, model load/generation fails, or NPU busy-time delta is not positive.
The CLI must not write memory, change service routing, or start persistent services.
## HTTP contract
Start temporary local server only:
```bash
cd /home/will/lab/swarm/openvino-genai-npu-worker
/home/will/.venvs/npu/bin/python worker.py --host 127.0.0.1 --port 18820
```
Endpoints:
```text
GET /healthz
GET /models
POST /v1/worker/generate
POST /v1/worker/extract-memory-candidates
POST /v1/worker/condense-notification
```
`GET /healthz` response fields:
```json
{
"ok": true,
"model": "OpenVINO/Qwen2.5-1.5B-Instruct-int4-ov",
"model_path": "/home/will/models/openvino-genai/Qwen2.5-1.5B-Instruct-int4-ov",
"device": "NPU",
"cache_dir": "/home/will/.cache/openvino/genai-npu/qwen2.5-1.5b-int4",
"cache_exists": true,
"loaded": false,
"initial_load_ms": null,
"loaded_at": null,
"busy_time_us": 0,
"max_input_chars": 6000,
"max_new_tokens": 256,
"jobs": ["memory_candidate", "notification", "summary", "title"],
"bind": "127.0.0.1:18820"
}
```
`POST /v1/worker/generate` request:
```json
{
"job": "summary",
"input": "Synthetic non-private text to summarize.",
"max_new_tokens": 80
}
```
Specialized aliases:
- `POST /v1/worker/extract-memory-candidates` implies `job=memory_candidate`.
- `POST /v1/worker/condense-notification` implies `job=notification`.
- Backward-compatible request `job=memory` may map to `memory_candidate`, but new clients should use `memory_candidate`.
Successful generation response:
```json
{
"model": "OpenVINO/Qwen2.5-1.5B-Instruct-int4-ov",
"device": "NPU",
"job": "summary",
"text": "...",
"json": null,
"timing_ms": {
"load": 0.0,
"initial_load": 10989.08,
"generate": 3157.94,
"total": 3157.94
},
"npu_busy_delta_us": 2650724,
"npu_busy_before_us": 123,
"npu_busy_after_us": 2650847,
"cache_dir": "/home/will/.cache/openvino/genai-npu/qwen2.5-1.5b-int4"
}
```
Validation/error behavior:
- Unsupported path: `404` JSON `{"error":"not found"}`.
- Unsupported job, empty input, too-long input, invalid token bound, missing model, or generation failure: JSON `{"error":"..."}` with non-2xx preferred for future implementations. The current stdlib prototype returns `400` for these errors.
- If `npu_busy_delta_us <= 0`, the response should be treated as failed by smoke tests even if an HTTP handler emitted `200`; the refreshed prototype returns `503` with the generation payload plus an `error` field.
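A smoke-test client might fold these rules into a single classifier, as in the sketch below. The function name `classify_response` and the outcome labels are hypothetical; only the status codes and field names come from the contract.

```python
from typing import Any


def classify_response(status: int, body: dict[str, Any]) -> str:
    """Map one worker HTTP response to a smoke-test outcome."""
    if status == 404:
        return "not_found"
    delta = body.get("npu_busy_delta_us")
    has_delta = isinstance(delta, int) and not isinstance(delta, bool)
    if has_delta and delta <= 0:
        # Generation ran but the counter did not move: failed, whatever the status code.
        return "npu_unverified"
    if not 200 <= status < 300 or "error" in body:
        return "error"
    if not has_delta:
        # A success payload must carry the busy-time delta.
        return "error"
    return "ok"
```

Checking the delta before the status code keeps the result the same whether the handler emitted `200` or `503` for an unverified generation.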
## Prompt/job contract
`title`:
- Input: short task/log/message excerpt.
- Output: one title, 8 words or fewer, no markdown required.
- Default `max_new_tokens`: 32.
`summary`:
- Input: synthetic/non-private text excerpt.
- Output: one short paragraph or up to 4 bullets.
- Default `max_new_tokens`: 160.
`notification`:
- Input: synthetic/non-private alert/log excerpt.
- Output target: JSON object with `severity`, `category`, `summary`, `action_needed`.
- Default `max_new_tokens`: 96.
- Client must tolerate `json: null` and parse/validate before using output.
`memory_candidate`:
- Input: synthetic/non-private conversation excerpt.
- Output target: JSON object with `candidates` and `notes`; candidates are proposals only.
- Default `max_new_tokens`: 192.
- This worker must never call Hermes memory tools or write durable memory directly.
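Because candidates are proposals only, a downstream reviewer step might first drop malformed entries before showing anything to a human. The sketch below assumes the `fact` and `confidence` keys requested by the worker's prompt; the helper name and the `0.5` threshold are hypothetical.

```python
from typing import Any


def usable_candidates(payload: dict[str, Any], min_confidence: float = 0.5) -> list[dict[str, Any]]:
    """Keep well-formed candidate proposals; never writes memory itself."""
    parsed = payload.get("json")
    if not isinstance(parsed, dict):
        return []
    raw = parsed.get("candidates")
    if not isinstance(raw, list):
        return []
    kept: list[dict[str, Any]] = []
    for item in raw:
        if not isinstance(item, dict):
            continue
        fact = item.get("fact")
        confidence = item.get("confidence")
        if not isinstance(fact, str) or not fact.strip():
            continue
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
            continue
        if min_confidence <= confidence <= 1.0:
            kept.append(item)
    return kept
```

The function only filters; deciding whether a surviving candidate becomes durable memory stays outside this worker and outside this helper.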
## Smoke-test plan using non-private data
Do not use private vault notes, screenshots, email, chat logs, or document/image directories. Use synthetic text like this:
```text
Atlas received a kanban notification that an OpenVINO NPU prototype finished smoke testing. The reviewer needs a concise status and next action. No live gateway routing changed.
```
Direct NPU smoke:
```bash
cd /home/will/lab/swarm/openvino-genai-npu-worker
before=$(cat /sys/class/accel/accel0/device/npu_busy_time_us)
/home/will/.venvs/npu/bin/python smoke_llm_npu.py \
--prompt 'Write a concise title for: synthetic NPU worker contract smoke.' \
--max-new-tokens 24
status=$?
after=$(cat /sys/class/accel/accel0/device/npu_busy_time_us)
printf 'external_busy_delta_us=%s\n' "$((after-before))"
test "$status" -eq 0
test "$((after-before))" -gt 0
```
Temporary HTTP smoke:
```bash
cd /home/will/lab/swarm/openvino-genai-npu-worker
/home/will/.venvs/npu/bin/python worker.py --host 127.0.0.1 --port 18820 &
pid=$!
trap 'kill "$pid" 2>/dev/null || true' EXIT
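# Retry briefly so the health check does not race the backgrounded server's startup.
curl -fsS --retry 10 --retry-delay 1 --retry-connrefused http://127.0.0.1:18820/healthz | python -m json.tool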
before=$(cat /sys/class/accel/accel0/device/npu_busy_time_us)
curl -fsS http://127.0.0.1:18820/v1/worker/generate \
-H 'Content-Type: application/json' \
-d '{"job":"title","input":"Synthetic NPU worker smoke with no routing changes.","max_new_tokens":24}' \
| tee /tmp/openvino-genai-worker-smoke.json \
| python -m json.tool
after=$(cat /sys/class/accel/accel0/device/npu_busy_time_us)
python - <<'PY'
import json
p=json.load(open('/tmp/openvino-genai-worker-smoke.json'))
assert p['npu_busy_delta_us'] > 0, p
assert p['device'] == 'NPU', p
PY
test "$((after-before))" -gt 0
kill "$pid"
trap - EXIT
```
Also verify the temporary listener is gone:
```bash
ss -ltnp | grep ':18820' && { echo 'temporary smoke server still running'; exit 1; } || true
```
Unit tests that do not load the model or require private data:
```bash
cd /home/will/lab/swarm/openvino-genai-npu-worker
python -m pytest -q
```
## NPU busy-time verification plan
Acceptance for any NPU claim requires all of the following:
1. Confirm the sysfs counter exists and is readable:
`test -r /sys/class/accel/accel0/device/npu_busy_time_us`.
2. Read `busy_before` immediately before the generation call.
3. Run exactly one bounded generation against the candidate worker.
4. Read `busy_after` immediately after generation completes.
5. Require `busy_after > busy_before` and response `npu_busy_delta_us > 0`.
6. Record model id, runtime version, prompt chars, max tokens, load/generate timings, and busy delta in the review handoff.
7. If the counter is unchanged, mark the smoke as failed even if HTTP returned `200` and text was generated.
Because the NPU is shared, a positive external delta proves NPU activity during the window but not exclusive attribution. Prefer a quiet window with no concurrent Whisper/embedding jobs for review-grade measurements; otherwise repeat and compare worker-reported internal delta with the external counter.
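The before/after measurement in steps 2 to 5 can be sketched as a small context manager. In the example a temporary file stands in for the sysfs counter so the snippet runs without NPU hardware; `busy_window` and `smoke_passes` are hypothetical helper names.

```python
import tempfile
from contextlib import contextmanager
from pathlib import Path
from typing import Iterator


def read_busy_us(path: Path) -> int:
    """Read the cumulative busy-time counter in microseconds."""
    return int(path.read_text().strip())


@contextmanager
def busy_window(path: Path) -> Iterator[dict[str, int]]:
    """Record the counter immediately before and after the wrapped block."""
    window = {"before": read_busy_us(path)}
    yield window
    window["after"] = read_busy_us(path)
    window["delta"] = window["after"] - window["before"]


def smoke_passes(external_delta_us: int, reported_delta_us: int) -> bool:
    """Both the external counter and the worker-reported delta must be positive."""
    return external_delta_us > 0 and reported_delta_us > 0


# Demo with a fake counter file; on real hardware the path would be
# /sys/class/accel/accel0/device/npu_busy_time_us.
with tempfile.TemporaryDirectory() as tmp:
    counter = Path(tmp) / "npu_busy_time_us"
    counter.write_text("100\n")
    with busy_window(counter) as window:
        counter.write_text("1334\n")  # stands in for exactly one bounded generation

print(window, smoke_passes(window["delta"], 1234))
```

The external window will usually be slightly wider than the worker-reported one, so on a shared NPU the two deltas are expected to be close but not necessarily identical.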
## Docs/diagram implications
If this worker is kept as a prototype, docs and diagrams should show:
- Live baseline remains RAG `:18810`, Whisper NPU `:18816`, embeddings `:18817`.
- GenAI worker `:18820` is proposed/prototype/not-live unless explicitly approved and enabled.
- No arrow from Hermes/Atlas gateway or LiteLLM primary routing to `:18820` unless a later approved integration actually exists.
- Runbooks should include the CLI/HTTP smoke commands, `ss` listener checks, and NPU busy-time counter checks.
- Service maps should label this as "bounded background generation" rather than "chat" or "assistant model".
## Explicit no-go / defer criteria
No-go for implementation or promotion:
- Model path missing, OpenVINO GenAI import fails, or NPU device is unavailable.
- `/sys/class/accel/accel0/device/npu_busy_time_us` is unreadable or does not increase during generation.
- Warm bounded jobs exceed the prototype latency target or starve live Whisper/embedding services.
- The worker needs private documents/images/chat logs for smoke testing.
- The worker requires Atlas/Hermes/gateway/LiteLLM/RAG routing changes to demonstrate value.
- The API starts accepting arbitrary chat history, tool-call instructions, unbounded prompts, or large outputs.
- The service logs raw prompt bodies by default.
- Persistent service enablement is requested without an explicit Will approval gate and a reviewer smoke handoff.
Defer, do not solve in this lane:
- Primary assistant routing, LiteLLM model registration, gateway fallback, or tool-calling integration.
- RAG query rewriting, RAG answer generation, or collection mutation.
- Private document/image triage.
- Multi-model selection, CPU/GPU fallback policy, batching, streaming, or auth exposure beyond localhost.
@@ -0,0 +1,142 @@
# OpenVINO GenAI NPU worker prototype
Local-only prototype for cheap bounded background generation on Will's Intel NPU. It is intentionally isolated from primary Atlas/Hermes routing.
## What it does
- Model: `OpenVINO/Qwen2.5-1.5B-Instruct-int4-ov`.
- Runtime: `/home/will/.venvs/npu` with `openvino-genai==2026.2.0.0`.
- Device: OpenVINO GenAI `NPU`.
- Default bind: `127.0.0.1:18820`.
- Jobs: `title`, `summary`, `notification`, `memory_candidate`.
- Prompt/input limits: 6000 chars, `MAX_PROMPT_LEN=1024`, max 256 generated tokens.
The worker does not write memory, does not restart Atlas/Hermes, does not change primary routing, and does not log raw prompt bodies by default.
## Files
- `CONTRACT.md` — bounded-worker service contract, endpoint/CLI API, smoke plan, NPU verification, docs implications, and no-go criteria.
- `worker.py` — stdlib HTTP API plus CLI wrapper.
- `smoke_llm_npu.py` — direct GenAI smoke test with NPU busy-time verification.
- `tests/test_worker.py` — unit tests with a fake GenAI pipeline and synthetic busy-time counter.
- `systemd/openvino-genai-npu-worker.service` — optional user-service template; not installed by this prototype.
## Model/cache
Downloaded model path:
```text
/home/will/models/openvino-genai/Qwen2.5-1.5B-Instruct-int4-ov
```
OpenVINO compile cache path:
```text
/home/will/.cache/openvino/genai-npu/qwen2.5-1.5b-int4
```
NPU pipeline config used by the prototype:
```text
CACHE_DIR=/home/will/.cache/openvino/genai-npu/qwen2.5-1.5b-int4
MAX_PROMPT_LEN=1024
MIN_RESPONSE_LEN=64
PREFILL_HINT=DYNAMIC
GENERATE_HINT=FAST_COMPILE
```
AOT/blob note: first milestone uses `CACHE_DIR` only. Do not switch to manual `EXPORT_BLOB`/`BLOB_PATH` until compile latency is proven to be the bottleneck. If explicit blobs are used later, record OpenVINO version, NPU compiler version, driver version, model id, quantization flags, and source weights path; invalidate blobs after OpenVINO/NPU driver upgrades.
## Direct smoke test
```bash
cd /home/will/lab/swarm/openvino-genai-npu-worker
/home/will/.venvs/npu/bin/python smoke_llm_npu.py
```
Acceptance requires `npu_busy_delta_us > 0`.
Observed cold-ish smoke after download/cache setup:
```json
{
"text": "\"Atlas Summarizes NPU Worker Options Requested by User\"",
"timing_ms": {"load": 10989.08, "generate": 3157.94, "total": 14147.02},
"npu_busy_delta_us": 2650724
}
```
## CLI usage
```bash
/home/will/.venvs/npu/bin/python worker.py \
--job title \
--input 'Kanban task asks for a small OpenVINO GenAI NPU worker prototype.'
```
Exit code is non-zero if validation fails, generation fails, or the worker-reported `npu_busy_delta_us` is not positive.
## HTTP usage
Start locally only:
```bash
cd /home/will/lab/swarm/openvino-genai-npu-worker
ss -ltnp | grep ':18820' && { echo 'port 18820 already in use'; exit 1; } || true
/home/will/.venvs/npu/bin/python worker.py --host 127.0.0.1 --port 18820
```
The server also refuses startup if a listener is already accepting connections on `127.0.0.1:18820`.
Endpoints:
```text
GET /healthz
GET /models
POST /v1/worker/generate
POST /v1/worker/extract-memory-candidates
POST /v1/worker/condense-notification
```
Example:
```bash
curl -s http://127.0.0.1:18820/v1/worker/generate \
-H 'Content-Type: application/json' \
-d '{"job":"summary","input":"Build a bounded local NPU worker for small generation tasks, no primary routing changes.","max_new_tokens":80}' \
| python -m json.tool
```
Response includes `npu_busy_delta_us`; treat a zero or negative value as failure regardless of HTTP status. The current prototype returns `503` with an `error` field in that case.
## Unit tests
These tests use only synthetic strings and a fake GenAI pipeline, so they do not load the model or touch private data:
```bash
cd /home/will/lab/swarm/openvino-genai-npu-worker
python -m pytest -q
```
## Environment variables
```text
OV_GENAI_NPU_MODEL=/home/will/models/openvino-genai/Qwen2.5-1.5B-Instruct-int4-ov
OV_GENAI_NPU_CACHE=/home/will/.cache/openvino/genai-npu/qwen2.5-1.5b-int4
OV_GENAI_NPU_HOST=127.0.0.1
OV_GENAI_NPU_PORT=18820
```
Only `127.0.0.1` is accepted by the current prototype; wider binds require an explicit code change and approval.
## Optional systemd user service
A draft unit exists at `systemd/openvino-genai-npu-worker.service` for later review. Do not copy, enable, or autostart it unless Will explicitly approves persistent service enablement. Foreground smoke on `127.0.0.1:18820` plus positive sysfs NPU busy-time delta is required before any installation discussion.
## Safety boundaries
- Binds only to `127.0.0.1` by default; non-local bind is refused in code.
- No raw request-body logging.
- No uploads of private data to external services.
- No Atlas/Hermes gateway restarts or primary model routing changes.
- NPU access is serialized with a process lock because the NPU is a shared resource with existing services.
@@ -0,0 +1,2 @@
[pytest]
testpaths = tests
@@ -0,0 +1,85 @@
#!/usr/bin/env python3
"""Smoke-test OpenVINO GenAI LLMPipeline on Intel NPU.

This verifies NPU execution by reading /sys/class/accel/accel0/device/npu_busy_time_us
before and after generation. HTTP 200/service success is not considered proof.
"""
from __future__ import annotations

import argparse
import json
import time
from pathlib import Path
from typing import Any

DEFAULT_MODEL = "/home/will/models/openvino-genai/Qwen2.5-1.5B-Instruct-int4-ov"
DEFAULT_CACHE = "/home/will/.cache/openvino/genai-npu/qwen2.5-1.5b-int4"
BUSY_PATH = Path("/sys/class/accel/accel0/device/npu_busy_time_us")


def import_openvino_genai() -> Any:
    import openvino_genai as ov_genai  # type: ignore[import-not-found]

    return ov_genai


def read_busy(path: Path = BUSY_PATH) -> int:
    return int(path.read_text().strip())


def main() -> int:
    parser = argparse.ArgumentParser()
    parser.add_argument("--model", default=DEFAULT_MODEL)
    parser.add_argument("--cache-dir", default=DEFAULT_CACHE)
    parser.add_argument("--busy-path", default=str(BUSY_PATH))
    parser.add_argument("--prompt", default="Write a concise title for: Synthetic NPU worker contract smoke with no routing changes.")
    parser.add_argument("--max-new-tokens", type=int, default=24)
    args = parser.parse_args()

    model_path = Path(args.model)
    cache_dir = Path(args.cache_dir)
    busy_path = Path(args.busy_path)
    cache_dir.mkdir(parents=True, exist_ok=True)

    if not model_path.exists():
        raise SystemExit(f"model path does not exist: {model_path}")
    if not busy_path.exists():
        raise SystemExit(f"NPU busy-time counter does not exist: {busy_path}")
    if args.max_new_tokens < 1 or args.max_new_tokens > 256:
        raise SystemExit("max-new-tokens must be between 1 and 256")

    config = {
        "CACHE_DIR": str(cache_dir),
        "MAX_PROMPT_LEN": 1024,
        "MIN_RESPONSE_LEN": 64,
        "PREFILL_HINT": "DYNAMIC",
        "GENERATE_HINT": "FAST_COMPILE",
    }

    ov_genai = import_openvino_genai()
    before = read_busy(busy_path)
    load_start = time.monotonic()
    pipe = ov_genai.LLMPipeline(str(model_path), "NPU", **config)
    load_ms = round((time.monotonic() - load_start) * 1000, 2)

    gen_start = time.monotonic()
    output = pipe.generate(args.prompt, max_new_tokens=args.max_new_tokens)
    gen_ms = round((time.monotonic() - gen_start) * 1000, 2)
    after = read_busy(busy_path)

    result = {
        "model": str(model_path),
        "device": "NPU",
        "cache_dir": str(cache_dir),
        "prompt_chars": len(args.prompt),
        "max_new_tokens": args.max_new_tokens,
        "text": str(output).strip(),
        "timing_ms": {"load": load_ms, "generate": gen_ms, "total": round(load_ms + gen_ms, 2)},
        "npu_busy_before_us": before,
        "npu_busy_after_us": after,
        "npu_busy_delta_us": after - before,
    }
    print(json.dumps(result, indent=2))
    return 0 if after > before else 2


if __name__ == "__main__":
    raise SystemExit(main())
@@ -0,0 +1,17 @@
[Unit]
Description=OpenVINO GenAI NPU worker prototype
After=network-online.target
[Service]
Type=simple
WorkingDirectory=/home/will/lab/swarm/openvino-genai-npu-worker
Environment=OV_GENAI_NPU_MODEL=/home/will/models/openvino-genai/Qwen2.5-1.5B-Instruct-int4-ov
Environment=OV_GENAI_NPU_CACHE=/home/will/.cache/openvino/genai-npu/qwen2.5-1.5b-int4
Environment=OV_GENAI_NPU_HOST=127.0.0.1
Environment=OV_GENAI_NPU_PORT=18820
ExecStart=/home/will/.venvs/npu/bin/python /home/will/lab/swarm/openvino-genai-npu-worker/worker.py --host 127.0.0.1 --port 18820
Restart=on-failure
RestartSec=5
[Install]
WantedBy=default.target
@@ -0,0 +1,131 @@
from __future__ import annotations

import json
from pathlib import Path

import pytest

import worker


class FakePipeline:
    def __init__(self, model_path: str, device: str, config: dict[str, object], busy_path: Path, output: str = "Synthetic title"):
        self.model_path = model_path
        self.device = device
        self.config = config
        self.busy_path = busy_path
        self.output = output
        self.calls: list[tuple[str, int]] = []

    def generate(self, prompt: str, *, max_new_tokens: int):
        self.calls.append((prompt, max_new_tokens))
        before = int(self.busy_path.read_text().strip())
        self.busy_path.write_text(str(before + 1234))
        return self.output


class FakeGenAI:
    def __init__(self, busy_path: Path, output: str = "Synthetic title"):
        self.busy_path = busy_path
        self.output = output
        self.pipeline: FakePipeline | None = None

    def LLMPipeline(self, model_path: str, device: str, *args: object, **kwargs: object):  # noqa: N802 - mirrors OpenVINO API
        if args and isinstance(args[0], dict):
            config: dict[str, object] = {str(k): v for k, v in args[0].items()}
        else:
            config = dict(kwargs)
        self.pipeline = FakePipeline(model_path, device, config, self.busy_path, self.output)
        return self.pipeline


@pytest.fixture()
def worker_paths(tmp_path: Path):
    model_path = tmp_path / "model"
    cache_dir = tmp_path / "cache"
    busy_path = tmp_path / "npu_busy_time_us"
    model_path.mkdir()
    busy_path.write_text("100")
    return model_path, cache_dir, busy_path


def test_generate_uses_npu_config_and_reports_busy_delta(monkeypatch: pytest.MonkeyPatch, worker_paths):
    model_path, cache_dir, busy_path = worker_paths
    fake_genai = FakeGenAI(busy_path)
    monkeypatch.setattr(worker, "import_openvino_genai", lambda: fake_genai)
    npu_worker = worker.NpuWorker(str(model_path), str(cache_dir), busy_path=busy_path, bind_port=18820)

    result = npu_worker.generate("title", "Synthetic non-private kanban notification.", max_new_tokens=24)

    assert result.npu_busy_before_us == 100
    assert result.npu_busy_after_us == 1334
    assert result.npu_busy_delta_us == 1234
    assert result.text == "Synthetic title"
    assert fake_genai.pipeline is not None
    assert fake_genai.pipeline.device == "NPU"
    assert fake_genai.pipeline.config["CACHE_DIR"] == str(cache_dir)
    assert fake_genai.pipeline.config["MAX_PROMPT_LEN"] == 1024
    assert fake_genai.pipeline.calls[0][1] == 24


def test_memory_alias_json_wrapping(monkeypatch: pytest.MonkeyPatch, worker_paths):
    model_path, cache_dir, busy_path = worker_paths
    fake_genai = FakeGenAI(busy_path, output='[{"fact":"synthetic stable preference","confidence":0.8}]')
    monkeypatch.setattr(worker, "import_openvino_genai", lambda: fake_genai)
    npu_worker = worker.NpuWorker(str(model_path), str(cache_dir), busy_path=busy_path)

    result = npu_worker.generate("memory_candidate", "Synthetic user says they prefer concise answers.")

    assert result.parsed_json is not None
    assert result.parsed_json["candidates"][0]["fact"] == "synthetic stable preference"
    assert "wrapped" in result.parsed_json["notes"]


@pytest.mark.parametrize(
    ("job", "user_input", "max_new_tokens", "message"),
    [
        ("bad", "hello", 1, "unsupported job"),
        ("title", "", 1, "non-empty"),
        ("title", "x" * (worker.MAX_INPUT_CHARS + 1), 1, "input too long"),
        ("title", "hello", worker.MAX_NEW_TOKENS + 1, "max_new_tokens"),
    ],
)
def test_validation_errors(monkeypatch: pytest.MonkeyPatch, worker_paths, job: str, user_input: str, max_new_tokens: int, message: str):
    model_path, cache_dir, busy_path = worker_paths
    monkeypatch.setattr(worker, "import_openvino_genai", lambda: FakeGenAI(busy_path))
    npu_worker = worker.NpuWorker(str(model_path), str(cache_dir), busy_path=busy_path)

    with pytest.raises(ValueError, match=message):
        npu_worker.generate(job, user_input, max_new_tokens=max_new_tokens)


def test_health_reports_actual_bind_and_limits(worker_paths):
    model_path, cache_dir, busy_path = worker_paths
    npu_worker = worker.NpuWorker(str(model_path), str(cache_dir), busy_path=busy_path, bind_host="127.0.0.1", bind_port=18821)

    health = npu_worker.health()

    assert health["bind"] == "127.0.0.1:18821"
    assert health["max_input_chars"] == 6000
    assert health["max_new_tokens"] == 256
    assert health["busy_time_us"] == 100


def test_response_payload_shape(worker_paths):
    model_path, cache_dir, busy_path = worker_paths
    npu_worker = worker.NpuWorker(str(model_path), str(cache_dir), busy_path=busy_path)
    result = worker.GenerationResult(
        text="ok",
        parsed_json={"severity": "info"},
        timing_ms={"load": 1.0, "initial_load": 1.0, "generate": 2.0, "total": 3.0},
        npu_busy_delta_us=5,
        npu_busy_before_us=10,
        npu_busy_after_us=15,
    )

    payload = worker.response_payload(npu_worker, "notification", result)

    assert json.dumps(payload)
    assert payload["device"] == "NPU"
    assert payload["job"] == "notification"
    assert payload["json"] == {"severity": "info"}
@@ -0,0 +1,289 @@
#!/usr/bin/env python3
"""Local-only OpenVINO GenAI NPU worker.

Small bounded LLM worker for cheap background tasks. It intentionally does not
wire into Atlas/Hermes routing and does not log raw prompts by default.
"""
from __future__ import annotations

import argparse
import json
import os
import re
import socket
import threading
import time
from dataclasses import dataclass
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
from pathlib import Path
from typing import Any, cast
from urllib.parse import urlparse

MODEL_ID = "OpenVINO/Qwen2.5-1.5B-Instruct-int4-ov"
DEFAULT_MODEL_PATH = "/home/will/models/openvino-genai/Qwen2.5-1.5B-Instruct-int4-ov"
DEFAULT_CACHE_DIR = "/home/will/.cache/openvino/genai-npu/qwen2.5-1.5b-int4"
BUSY_PATH = Path("/sys/class/accel/accel0/device/npu_busy_time_us")
HOST = "127.0.0.1"
PORT = 18820
MAX_INPUT_CHARS = 6000
MAX_NEW_TOKENS = 256

GENAI_CONFIG = {
    "CACHE_DIR": DEFAULT_CACHE_DIR,
    "MAX_PROMPT_LEN": 1024,
    "MIN_RESPONSE_LEN": 64,
    "PREFILL_HINT": "DYNAMIC",
    "GENERATE_HINT": "FAST_COMPILE",
}

DEFAULTS = {
    "title": 32,
    "summary": 160,
    "memory_candidate": 192,
    "notification": 96,
}

PROMPTS = {
    "title": "Write one concise title, 8 words or fewer. Return only the title.\n\nInput:\n{input}",
    "summary": "Summarize the input in one short paragraph or up to 4 bullets. Be factual and concise.\n\nInput:\n{input}",
    "memory_candidate": (
        "Extract durable memory candidates from the conversation excerpt. "
        "Return strict JSON with keys: candidates (array of objects with fact, confidence, reason), notes. "
        "Do not write memory; only propose candidates.\n\nInput:\n{input}"
    ),
    "notification": (
        "Condense this notification or log excerpt for a human. "
        "Return JSON with keys: severity (info|warning|error), category, summary, action_needed.\n\nInput:\n{input}"
    ),
}

def import_openvino_genai() -> Any:
    """Import OpenVINO GenAI lazily so unit tests do not require the NPU venv."""
    import openvino_genai as ov_genai  # type: ignore[import-not-found]

    return ov_genai


def listener_exists(host: str, port: int) -> bool:
    """Return True when a TCP listener already accepts connections."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(0.2)
        return sock.connect_ex((host, port)) == 0


def coerce_json(text: str) -> Any | None:
    text = text.strip()
    if not text:
        return None
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        match = re.search(r"(\{.*\}|\[.*\])", text, re.S)
        if match:
            try:
                return json.loads(match.group(1))
            except json.JSONDecodeError:
                return None
    return None

@dataclass
class GenerationResult:
    text: str
    parsed_json: Any | None
    timing_ms: dict[str, float]
    npu_busy_delta_us: int
    npu_busy_before_us: int
    npu_busy_after_us: int


class NpuWorker:
    def __init__(
        self,
        model_path: str,
        cache_dir: str,
        *,
        busy_path: Path = BUSY_PATH,
        bind_host: str = HOST,
        bind_port: int = PORT,
    ):
        self.model_path = Path(model_path)
        self.cache_dir = Path(cache_dir)
        self.busy_path = Path(busy_path)
        self.bind_host = bind_host
        self.bind_port = bind_port
        self.cache_dir.mkdir(parents=True, exist_ok=True)
        self._pipe = None
        self._load_ms: float | None = None
        self._lock = threading.Lock()
        self._loaded_at: float | None = None
        if not self.model_path.exists():
            raise FileNotFoundError(f"model path does not exist: {self.model_path}")
        if not self.busy_path.exists():
            raise FileNotFoundError(f"NPU busy-time counter does not exist: {self.busy_path}")

    def read_busy(self) -> int:
        return int(self.busy_path.read_text().strip())

    def load(self) -> None:
        if self._pipe is not None:
            return
        start = time.monotonic()
        # NPU GenAI requires bounded prompt/response shapes; CACHE_DIR enables compiled blob caching.
        ov_genai = import_openvino_genai()
        config = GENAI_CONFIG | {"CACHE_DIR": str(self.cache_dir)}
        self._pipe = ov_genai.LLMPipeline(str(self.model_path), "NPU", **config)
        self._load_ms = round((time.monotonic() - start) * 1000, 2)
        self._loaded_at = time.time()

    def generate(self, job: str, user_input: str, max_new_tokens: int | None = None) -> GenerationResult:
        if job not in PROMPTS:
            raise ValueError(f"unsupported job: {job}")
        if not isinstance(user_input, str) or not user_input.strip():
            raise ValueError("input must be a non-empty string")
        if len(user_input) > MAX_INPUT_CHARS:
            raise ValueError(f"input too long: {len(user_input)} chars > {MAX_INPUT_CHARS}")
        # Only a missing value selects the job default; an explicit 0 must fail validation
        # instead of silently falling back to the default.
        max_new_tokens = DEFAULTS[job] if max_new_tokens is None else int(max_new_tokens)
        if max_new_tokens < 1 or max_new_tokens > MAX_NEW_TOKENS:
            raise ValueError(f"max_new_tokens must be between 1 and {MAX_NEW_TOKENS}")

        prompt = PROMPTS[job].format(input=user_input.strip())
        # Serialize load and generation: the NPU is shared with other services.
        with self._lock:
            load_start = time.monotonic()
            self.load()
            load_ms = round((time.monotonic() - load_start) * 1000, 2)
            before = self.read_busy()
            gen_start = time.monotonic()
            pipe = cast(Any, self._pipe)
            text = str(pipe.generate(prompt, max_new_tokens=max_new_tokens)).strip()
            generate_ms = round((time.monotonic() - gen_start) * 1000, 2)
            after = self.read_busy()

        parsed = coerce_json(text) if job in {"memory_candidate", "notification"} else None
        if job == "memory_candidate" and isinstance(parsed, list):
            parsed = {"candidates": parsed, "notes": "model returned a top-level array; worker wrapped it to preserve the API contract"}
        return GenerationResult(
            text=text,
            parsed_json=parsed,
            timing_ms={"load": load_ms, "initial_load": self._load_ms or 0.0, "generate": generate_ms, "total": round(load_ms + generate_ms, 2)},
            npu_busy_delta_us=after - before,
            npu_busy_before_us=before,
            npu_busy_after_us=after,
        )

    def health(self) -> dict[str, Any]:
        return {
            "ok": True,
            "model": MODEL_ID,
            "model_path": str(self.model_path),
            "device": "NPU",
            "cache_dir": str(self.cache_dir),
            "cache_exists": self.cache_dir.exists(),
            "loaded": self._pipe is not None,
            "initial_load_ms": self._load_ms,
            "loaded_at": self._loaded_at,
            "busy_time_us": self.read_busy(),
            "max_input_chars": MAX_INPUT_CHARS,
            "max_new_tokens": MAX_NEW_TOKENS,
            "jobs": sorted(PROMPTS),
            "bind": f"{self.bind_host}:{self.bind_port}",
        }


def response_payload(worker: NpuWorker, job: str, result: GenerationResult) -> dict[str, Any]:
    return {
        "model": MODEL_ID,
        "device": "NPU",
        "job": job,
        "text": result.text,
        "json": result.parsed_json,
        "timing_ms": result.timing_ms,
        "npu_busy_delta_us": result.npu_busy_delta_us,
        "npu_busy_before_us": result.npu_busy_before_us,
        "npu_busy_after_us": result.npu_busy_after_us,
        "cache_dir": str(worker.cache_dir),
    }

def make_handler(worker: NpuWorker):
    class Handler(BaseHTTPRequestHandler):
        server_version = "openvino-genai-npu-worker/0.2"

        def log_message(self, format: str, *args: Any) -> None:
            # Log only method/path/status metadata, not raw request bodies.
            print(f"{self.client_address[0]} {format % args}")

        def send_json(self, status: int, payload: Any) -> None:
            body = json.dumps(payload, indent=2).encode("utf-8")
            self.send_response(status)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def do_GET(self) -> None:  # noqa: N802
            path = urlparse(self.path).path
            if path == "/healthz":
                self.send_json(200, worker.health())
            elif path == "/models":
                self.send_json(200, {"models": [{"id": MODEL_ID, "path": str(worker.model_path), "device": "NPU"}]})
            else:
                self.send_json(404, {"error": "not found"})

        def do_POST(self) -> None:  # noqa: N802
            path = urlparse(self.path).path
            route_job = {
                "/v1/worker/generate": None,
                "/v1/worker/extract-memory-candidates": "memory_candidate",
                "/v1/worker/condense-notification": "notification",
            }.get(path, "__missing__")
            if route_job == "__missing__":
                self.send_json(404, {"error": "not found"})
                return
            try:
                length = int(self.headers.get("Content-Length", "0"))
                payload = json.loads(self.rfile.read(length) or b"{}")
                job = route_job or str(payload.get("job", "summary"))
                if job == "memory":
                    job = "memory_candidate"
                # Pass the raw value through so a missing or non-string input (for example
                # JSON null) is rejected by validation instead of being coerced to text.
                result = worker.generate(job, payload.get("input"), payload.get("max_new_tokens"))
                body = response_payload(worker, job, result)
                if result.npu_busy_delta_us <= 0:
                    body["error"] = "NPU busy-time counter did not increase during generation"
                    self.send_json(503, body)
                    return
                self.send_json(200, body)
            except Exception as exc:
                self.send_json(400, {"error": str(exc)})

    return Handler

def cli(argv: list[str] | None = None) -> int:
    parser = argparse.ArgumentParser(description="OpenVINO GenAI NPU worker")
    parser.add_argument("--model-path", default=os.environ.get("OV_GENAI_NPU_MODEL", DEFAULT_MODEL_PATH))
    parser.add_argument("--cache-dir", default=os.environ.get("OV_GENAI_NPU_CACHE", DEFAULT_CACHE_DIR))
    parser.add_argument("--host", default=os.environ.get("OV_GENAI_NPU_HOST", HOST))
    parser.add_argument("--port", type=int, default=int(os.environ.get("OV_GENAI_NPU_PORT", PORT)))
    parser.add_argument("--job", choices=sorted(PROMPTS), help="Run one CLI job instead of serving HTTP")
    parser.add_argument("--input", help="Input text for --job")
    parser.add_argument("--max-new-tokens", type=int)
    args = parser.parse_args(argv)

    if args.host != "127.0.0.1":
        raise SystemExit("Refusing non-local bind without code change/explicit approval")

    worker = NpuWorker(args.model_path, args.cache_dir, bind_host=args.host, bind_port=args.port)
    if args.job:
        result = worker.generate(args.job, args.input or "", args.max_new_tokens)
        print(json.dumps(response_payload(worker, args.job, result), indent=2))
        return 0 if result.npu_busy_delta_us > 0 else 2

    if listener_exists(args.host, args.port):
        raise SystemExit(f"Refusing to start: listener already exists on {args.host}:{args.port}")
    server = ThreadingHTTPServer((args.host, args.port), make_handler(worker))
    print(f"serving {MODEL_ID} on http://{args.host}:{args.port}; raw prompts are not logged")
    server.serve_forever()
    return 0


if __name__ == "__main__":
    raise SystemExit(cli())