# OpenVINO GenAI NPU worker prototype

Local-only prototype for cheap, bounded background generation on Will's Intel NPU. It is intentionally isolated from primary Atlas/Hermes routing.

## What it does

- Model: `OpenVINO/Qwen2.5-1.5B-Instruct-int4-ov`.
- Runtime: `/home/will/.venvs/npu` with `openvino-genai==2026.2.0.0`.
- Device: OpenVINO GenAI `NPU`.
- Default bind: `127.0.0.1:18820`.
- Jobs: `title`, `summary`, `notification`, `memory_candidate`.
- Limits: input of at most 6000 characters, `MAX_PROMPT_LEN=1024` prompt tokens, and at most 256 generated tokens.

The worker does not write memory, does not restart Atlas/Hermes, does not change primary routing, and does not log raw prompt bodies by default.

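The job list and limits above amount to a small validation step before any generation. The sketch below shows one way to express it; `validate_request`, `WorkerRequest`, and the choice to reject (rather than clamp) an oversized `max_new_tokens` are assumptions for illustration, not the actual `worker.py` code.

```python
from dataclasses import dataclass

ALLOWED_JOBS = {"title", "summary", "notification", "memory_candidate"}
MAX_INPUT_CHARS = 6000      # input limit from the list above
MAX_NEW_TOKENS_CAP = 256    # generation limit from the list above


@dataclass(frozen=True)
class WorkerRequest:
    job: str
    input: str
    max_new_tokens: int


def validate_request(payload: dict) -> WorkerRequest:
    """Return a validated request or raise ValueError (hypothetical helper)."""
    job = payload.get("job")
    if not isinstance(job, str) or job not in ALLOWED_JOBS:
        raise ValueError(f"unsupported job: {job!r}")
    text = payload.get("input")
    if not isinstance(text, str) or not text.strip():
        raise ValueError("input must be a non-empty string")
    if len(text) > MAX_INPUT_CHARS:
        raise ValueError(f"input exceeds {MAX_INPUT_CHARS} characters")
    max_new = payload.get("max_new_tokens", MAX_NEW_TOKENS_CAP)
    # bool is a subclass of int, so exclude it explicitly.
    if isinstance(max_new, bool) or not isinstance(max_new, int):
        raise ValueError("max_new_tokens must be an integer")
    if not 1 <= max_new <= MAX_NEW_TOKENS_CAP:
        raise ValueError(f"max_new_tokens must be in 1..{MAX_NEW_TOKENS_CAP}")
    return WorkerRequest(job=job, input=text, max_new_tokens=max_new)
```

Rejecting out-of-range values keeps the bound explicit to callers; a real worker could equally clamp them.
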
## Files

- `CONTRACT.md`: bounded-worker service contract, endpoint/CLI API, smoke plan, NPU verification, docs implications, and no-go criteria.
- `worker.py`: stdlib HTTP API plus CLI wrapper.
- `smoke_llm_npu.py`: direct GenAI smoke test with NPU busy-time verification.
- `tests/test_worker.py`: unit tests with a fake GenAI pipeline and synthetic busy-time counter.
- `systemd/openvino-genai-npu-worker.service`: reviewed local-only user-service template for `127.0.0.1:18820`.

## Model/cache

Downloaded model path:

```text
/home/will/models/openvino-genai/Qwen2.5-1.5B-Instruct-int4-ov
```

OpenVINO compile cache path:

```text
/home/will/.cache/openvino/genai-npu/qwen2.5-1.5b-int4
```

NPU pipeline config used by the prototype:

```python
CACHE_DIR = "/home/will/.cache/openvino/genai-npu/qwen2.5-1.5b-int4"
MAX_PROMPT_LEN = 1024
MIN_RESPONSE_LEN = 64
PREFILL_HINT = "DYNAMIC"
GENERATE_HINT = "FAST_COMPILE"
```

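As a rough illustration of how these properties might reach the runtime, the sketch below collects them into keyword arguments for `openvino_genai.LLMPipeline` on the `NPU` device. The helper names `build_npu_config` and `load_pipeline` are hypothetical, and how `worker.py` actually wires the config is not shown in this README. A loaded pipeline would then be called along the lines of `pipe.generate(prompt, max_new_tokens=80)`.

```python
def build_npu_config(cache_dir: str) -> dict:
    """Collect the NPU properties listed above into keyword arguments."""
    return {
        "CACHE_DIR": cache_dir,
        "MAX_PROMPT_LEN": 1024,
        "MIN_RESPONSE_LEN": 64,
        "PREFILL_HINT": "DYNAMIC",
        "GENERATE_HINT": "FAST_COMPILE",
    }


def load_pipeline(model_dir: str, cache_dir: str):
    """Compile the model for the NPU; expect this to be slow on a cold cache."""
    # Imported lazily so the config helper stays usable without the package.
    import openvino_genai as ov_genai

    return ov_genai.LLMPipeline(model_dir, "NPU", **build_npu_config(cache_dir))
```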
AOT/blob note: first milestone uses `CACHE_DIR` only. Do not switch to manual `EXPORT_BLOB`/`BLOB_PATH` until compile latency is proven to be the bottleneck. If explicit blobs are used later, record OpenVINO version, NPU compiler version, driver version, model id, quantization flags, and source weights path; invalidate blobs after OpenVINO/NPU driver upgrades.

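If explicit blobs are ever adopted, the record-and-invalidate rule could be kept as a small manifest stored next to the blob. The following is only a sketch: `BlobManifest`, `write_manifest`, `blob_is_valid`, and the JSON layout are all hypothetical, and how the version strings are obtained is left to the caller.

```python
import json
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass(frozen=True)
class BlobManifest:
    openvino_version: str
    npu_compiler_version: str
    driver_version: str
    model_id: str
    quantization_flags: str
    source_weights_path: str


def write_manifest(path: Path, manifest: BlobManifest) -> None:
    """Store the build environment alongside the exported blob."""
    path.write_text(json.dumps(asdict(manifest), indent=2, sort_keys=True))


def blob_is_valid(path: Path, current: BlobManifest) -> bool:
    """Reuse a blob only if every recorded field matches the current environment."""
    try:
        recorded = json.loads(path.read_text())
    except (OSError, json.JSONDecodeError):
        return False  # missing or unreadable manifest: treat the blob as stale
    return recorded == asdict(current)
```

Comparing every field means any OpenVINO or driver upgrade invalidates the blob automatically, which matches the rule stated above.
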
## Direct smoke test

```bash
cd /home/will/lab/swarm/openvino-genai-npu-worker
/home/will/.venvs/npu/bin/python smoke_llm_npu.py
```

Acceptance requires `npu_busy_delta_us > 0`.

Observed smoke result from a mostly cold start, after model download and cache setup:

```json
{
  "text": "\"Atlas Summarizes NPU Worker Options Requested by User\"",
  "timing_ms": {"load": 10989.08, "generate": 3157.94, "total": 14147.02},
  "npu_busy_delta_us": 2650724
}
```

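The busy-time check boils down to reading a cumulative counter before and after generation and requiring a positive difference. The sketch below assumes the counter is the `npu_busy_time_us` sysfs attribute of the `intel_vpu` kernel driver; the glob path and the helper names are assumptions and may not match what `smoke_llm_npu.py` does.

```python
import glob
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")

# Assumed sysfs location of the busy-time counter; verify on the target machine.
BUSY_TIME_GLOB = "/sys/bus/pci/drivers/intel_vpu/*/npu_busy_time_us"


def read_npu_busy_us() -> int:
    """Read the cumulative NPU busy time in microseconds from sysfs."""
    paths = glob.glob(BUSY_TIME_GLOB)
    if not paths:
        raise RuntimeError("no npu_busy_time_us counter found")
    with open(paths[0]) as fh:
        return int(fh.read().strip())


def run_with_busy_delta(
    fn: Callable[[], T],
    read_counter: Callable[[], int] = read_npu_busy_us,
) -> Tuple[T, int]:
    """Run fn and return (result, busy-time delta in microseconds)."""
    before = read_counter()
    result = fn()
    return result, read_counter() - before


def accept(delta_us: int) -> bool:
    """Acceptance rule from above: the NPU must have done measurable work."""
    return delta_us > 0
```

Passing `read_counter` as a parameter lets tests substitute a synthetic counter. Because the NPU is shared with other services, a positive delta shows that the NPU was busy during the call, not necessarily that this call alone caused it.
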
## CLI usage

```bash
/home/will/.venvs/npu/bin/python worker.py \
  --job title \
  --input 'Kanban task asks for a small OpenVINO GenAI NPU worker prototype.'
```

Exit code is non-zero if validation fails, generation fails, or the worker-reported `npu_busy_delta_us` is not positive.

## HTTP usage

Start locally only:

```bash
cd /home/will/lab/swarm/openvino-genai-npu-worker
if ss -ltnp | grep ':18820 '; then
  echo 'port 18820 already in use' >&2
else
  /home/will/.venvs/npu/bin/python worker.py --host 127.0.0.1 --port 18820
fi
```

The server also refuses startup if a listener is already accepting connections on `127.0.0.1:18820`.

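A startup guard of this kind can be approximated with a plain TCP connect probe, for example `listener_present("127.0.0.1", 18820)` before binding. This is a minimal sketch; the function name and the 0.5 s timeout are assumptions, and the actual check in `worker.py` may differ.

```python
import socket


def listener_present(host: str, port: int, timeout: float = 0.5) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        probe.settimeout(timeout)
        # connect_ex returns 0 on success and an errno value otherwise.
        return probe.connect_ex((host, port)) == 0
```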
Endpoints:

```text
GET /healthz
GET /models
POST /v1/worker/generate
POST /v1/worker/extract-memory-candidates
POST /v1/worker/condense-notification
```

Example:

```bash
curl -s http://127.0.0.1:18820/v1/worker/generate \
  -H 'Content-Type: application/json' \
  -d '{"job":"summary","input":"Build a bounded local NPU worker for small generation tasks, no primary routing changes.","max_new_tokens":80}' \
  | python -m json.tool
```

Response includes `npu_busy_delta_us`; treat zero as failure even if HTTP status is 200.

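A caller written in Python could enforce the zero-delta rule on the client side as well. The sketch below uses only the standard library and the request fields shown in the `curl` example; `check_response`, `generate`, and the 60-second timeout are assumptions, and any response fields other than `npu_busy_delta_us` are passed through untouched.

```python
import json
import urllib.request

WORKER_URL = "http://127.0.0.1:18820/v1/worker/generate"


def check_response(body: dict) -> dict:
    """Raise if the worker reports no NPU activity, even when HTTP status was 200."""
    delta = body.get("npu_busy_delta_us", 0)
    if not isinstance(delta, (int, float)) or delta <= 0:
        raise RuntimeError(f"NPU busy delta not positive: {delta!r}")
    return body


def generate(job: str, text: str, max_new_tokens: int = 80, timeout: float = 60.0) -> dict:
    """POST one job to the local worker and return the checked JSON response."""
    payload = json.dumps(
        {"job": job, "input": text, "max_new_tokens": max_new_tokens}
    ).encode("utf-8")
    request = urllib.request.Request(
        WORKER_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return check_response(json.load(response))
```

With the worker running, `generate("summary", "Build a bounded local NPU worker.")` would return the parsed response or raise on a missing or zero delta.
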
## Unit tests

These tests use only synthetic strings and a fake GenAI pipeline, so they do not load the model or touch private data:

```bash
cd /home/will/lab/swarm/openvino-genai-npu-worker
python -m pytest -q
```

## Environment variables

```text
OV_GENAI_NPU_MODEL=/home/will/models/openvino-genai/Qwen2.5-1.5B-Instruct-int4-ov
OV_GENAI_NPU_CACHE=/home/will/.cache/openvino/genai-npu/qwen2.5-1.5b-int4
OV_GENAI_NPU_HOST=127.0.0.1
OV_GENAI_NPU_PORT=18820
```

Only `127.0.0.1` is accepted by the current prototype; wider binds require an explicit code change and approval.

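One way to read these variables while enforcing the loopback-only rule is sketched below. It assumes the values listed above are the defaults; `Settings` and `load_settings` are hypothetical names, not taken from `worker.py`.

```python
import os
from dataclasses import dataclass
from typing import Mapping

# Assumed defaults, copied from the variable list above.
DEFAULTS = {
    "OV_GENAI_NPU_MODEL": "/home/will/models/openvino-genai/Qwen2.5-1.5B-Instruct-int4-ov",
    "OV_GENAI_NPU_CACHE": "/home/will/.cache/openvino/genai-npu/qwen2.5-1.5b-int4",
    "OV_GENAI_NPU_HOST": "127.0.0.1",
    "OV_GENAI_NPU_PORT": "18820",
}


@dataclass(frozen=True)
class Settings:
    model: str
    cache: str
    host: str
    port: int


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Resolve settings from the environment, refusing any non-loopback bind."""
    values = {key: env.get(key, default) for key, default in DEFAULTS.items()}
    if values["OV_GENAI_NPU_HOST"] != "127.0.0.1":
        raise ValueError("only 127.0.0.1 is accepted as bind host")
    port = int(values["OV_GENAI_NPU_PORT"])
    if not 1 <= port <= 65535:
        raise ValueError(f"port out of range: {port}")
    return Settings(
        model=values["OV_GENAI_NPU_MODEL"],
        cache=values["OV_GENAI_NPU_CACHE"],
        host=values["OV_GENAI_NPU_HOST"],
        port=port,
    )
```

Taking the environment as a parameter keeps the function testable without mutating `os.environ`.
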
## Systemd user service

A reviewed local-only unit exists at `systemd/openvino-genai-npu-worker.service` for persistent background use after foreground smoke succeeds with a positive NPU busy-time delta:

```bash
install -m 0644 systemd/openvino-genai-npu-worker.service ~/.config/systemd/user/openvino-genai-npu-worker.service
systemctl --user daemon-reload
systemctl --user enable --now openvino-genai-npu-worker.service
systemctl --user status openvino-genai-npu-worker.service --no-pager
```

The service remains isolated: do not route primary Atlas/Hermes chat, gateway output, or automatic memory writes to it without a separate approved integration.

## Safety boundaries

- Binds only to `127.0.0.1` by default; non-local bind is refused in code.
- No raw request-body logging.
- No uploads of private data to external services.
- No Atlas/Hermes gateway restarts or primary model routing changes.
- NPU access is serialized with a process lock because the NPU is a shared resource with existing services.

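Serializing NPU access can be done in several ways. The sketch below uses an advisory file lock via `fcntl.flock`, which also excludes other processes that honor the same lock file. The README does not say whether the prototype uses a file lock or an in-process lock, so the mechanism, the `npu_lock` name, and the lock-file path are all assumptions (Linux/Unix only).

```python
import fcntl
import os
from contextlib import contextmanager

# Hypothetical lock-file location; the prototype's real lock may differ.
DEFAULT_LOCK_PATH = "/tmp/openvino-genai-npu.lock"


@contextmanager
def npu_lock(path: str = DEFAULT_LOCK_PATH, blocking: bool = True):
    """Hold an exclusive advisory lock for the duration of one NPU call."""
    fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o600)
    try:
        flags = fcntl.LOCK_EX if blocking else fcntl.LOCK_EX | fcntl.LOCK_NB
        # Raises BlockingIOError when non-blocking and the lock is already held.
        fcntl.flock(fd, flags)
        try:
            yield
        finally:
            fcntl.flock(fd, fcntl.LOCK_UN)
    finally:
        os.close(fd)
```

A generation call would then run inside `with npu_lock(): ...`; the non-blocking mode lets a request fail fast instead of queueing behind another holder.
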