Model: OpenVINO/Qwen2.5-1.5B-Instruct-int4-ov.
Runtime: /home/will/.venvs/npu with openvino-genai==2026.2.0.0.
Device: OpenVINO GenAI NPU.
Default bind: 127.0.0.1:18820.
Jobs: title, summary, notification, memory_candidate.
Prompt/input limits: 6000 input characters, MAX_PROMPT_LEN=1024 (tokens), at most 256 generated tokens.
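
The job list and limits above could be enforced by a small validation step before any prompt reaches the pipeline. This is a minimal sketch, not the actual worker.py code: the function name validate_request and the error messages are hypothetical, and clamping (rather than rejecting) an oversized token budget is an assumption. Only the job names and numeric limits come from this README.

```python
ALLOWED_JOBS = {"title", "summary", "notification", "memory_candidate"}
MAX_INPUT_CHARS = 6000   # input limit stated in this README
MAX_NEW_TOKENS = 256     # generation cap stated in this README


def validate_request(job: str, text: str, max_new_tokens: int = MAX_NEW_TOKENS) -> int:
    """Return the effective token budget or raise ValueError (hypothetical helper)."""
    if job not in ALLOWED_JOBS:
        raise ValueError(f"unknown job: {job!r}")
    if not text or not text.strip():
        raise ValueError("input is empty")
    if len(text) > MAX_INPUT_CHARS:
        raise ValueError(f"input exceeds {MAX_INPUT_CHARS} chars")
    if max_new_tokens < 1:
        raise ValueError("max_new_tokens must be positive")
    # Assumption: oversized budgets are clamped; rejecting them would be equally valid.
    return min(max_new_tokens, MAX_NEW_TOKENS)
```

The character limit is checked before tokenization, so a 6000-character input can still exceed MAX_PROMPT_LEN tokens; the pipeline limit remains the final guard.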

The worker does not write memory, does not restart Atlas/Hermes, does not change primary routing, and does not log raw prompt bodies by default.
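
One way to honor the "no raw prompt bodies" rule while still keeping logs useful is to log only metadata about each request. The sketch below is illustrative: log_record and its field names are hypothetical and are not taken from worker.py.

```python
import hashlib
import json


def log_record(job: str, text: str, busy_delta_us: int) -> str:
    """Build a JSON log line that describes a request without containing its body."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return json.dumps(
        {
            "job": job,
            "input_chars": len(text),
            # Short digest prefix, useful for correlating repeated requests.
            "input_sha256_prefix": digest[:12],
            "npu_busy_delta_us": busy_delta_us,
        },
        sort_keys=True,
    )
```

A digest of a very short or guessable input can still be matched by brute force, so the prefix is best treated as a correlation aid rather than an anonymization guarantee.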

Files

CONTRACT.md — bounded-worker service contract, endpoint/CLI API, smoke plan, NPU verification, docs implications, and no-go criteria.
worker.py — stdlib HTTP API plus CLI wrapper.
smoke_llm_npu.py — direct GenAI smoke test with NPU busy-time verification.
tests/test_worker.py — unit tests with a fake GenAI pipeline and synthetic busy-time counter.
systemd/openvino-genai-npu-worker.service — optional user-service template; not installed by this prototype.

Model/cache

Downloaded model path:

/home/will/models/openvino-genai/Qwen2.5-1.5B-Instruct-int4-ov

OpenVINO compile cache path:

/home/will/.cache/openvino/genai-npu/qwen2.5-1.5b-int4

NPU pipeline config used by the prototype:

CACHE_DIR=/home/will/.cache/openvino/genai-npu/qwen2.5-1.5b-int4
MAX_PROMPT_LEN=1024
MIN_RESPONSE_LEN=64
PREFILL_HINT=DYNAMIC
GENERATE_HINT=FAST_COMPILE
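
For orientation, the properties above might be passed to OpenVINO GenAI roughly as follows. This is a sketch under assumptions: build_npu_config and load_pipeline are hypothetical names, and the way the prototype actually constructs its pipeline is not shown in this README. The import is deferred so the config helper can be used without the NPU runtime installed.

```python
def build_npu_config(cache_dir: str) -> dict:
    """Pipeline properties listed in this README, as a plain dict."""
    return {
        "CACHE_DIR": cache_dir,
        "MAX_PROMPT_LEN": 1024,
        "MIN_RESPONSE_LEN": 64,
        "PREFILL_HINT": "DYNAMIC",
        "GENERATE_HINT": "FAST_COMPILE",
    }


def load_pipeline(model_dir: str, cache_dir: str):
    """Create an LLMPipeline on the NPU (requires openvino-genai and an NPU driver)."""
    # Imported lazily so build_npu_config works in environments without the runtime.
    import openvino_genai as ov_genai

    return ov_genai.LLMPipeline(model_dir, "NPU", build_npu_config(cache_dir))
```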

AOT/blob note: the first milestone uses CACHE_DIR only. Do not switch to manual EXPORT_BLOB/BLOB_PATH until compile latency is proven to be the bottleneck. If explicit blobs are used later, record the OpenVINO version, NPU compiler version, driver version, model id, quantization flags, and source weights path, and invalidate the blobs after any OpenVINO or NPU driver upgrade.
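
If explicit blobs are adopted later, the provenance fields listed in the note could be stored next to each blob and compared at startup. The following is only a sketch of that idea: make_manifest, blob_fingerprint, and blob_is_stale are hypothetical helpers, and nothing like them exists in the prototype today.

```python
import hashlib
import json


def make_manifest(*, openvino_version: str, npu_compiler_version: str,
                  driver_version: str, model_id: str,
                  quantization_flags: str, source_weights_path: str) -> dict:
    """Collect the provenance fields named in the AOT/blob note."""
    return {
        "openvino_version": openvino_version,
        "npu_compiler_version": npu_compiler_version,
        "driver_version": driver_version,
        "model_id": model_id,
        "quantization_flags": quantization_flags,
        "source_weights_path": source_weights_path,
    }


def blob_fingerprint(manifest: dict) -> str:
    """Stable hash over the manifest; any field change yields a new fingerprint."""
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def blob_is_stale(recorded: dict, current: dict) -> bool:
    """True when the blob was built under different versions or inputs."""
    return blob_fingerprint(recorded) != blob_fingerprint(current)
```

A stale result would mean discarding the blob and recompiling, which matches the invalidation rule in the note.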

Direct smoke test

cd /home/will/lab/swarm/openvino-genai-npu-worker
/home/will/.venvs/npu/bin/python smoke_llm_npu.py

Acceptance requires npu_busy_delta_us > 0.
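
The busy-time check can be pictured as reading a cumulative counter before and after the work and subtracting. The sketch below assumes the counter is a sysfs text file holding microseconds; the default path is an assumption that depends on the kernel driver and device numbering, and the helper names are hypothetical rather than taken from smoke_llm_npu.py.

```python
from pathlib import Path
from typing import Callable, Tuple

# Assumed location; verify the actual attribute path on the target machine.
DEFAULT_BUSY_PATH = Path("/sys/class/accel/accel0/device/npu_busy_time_us")


def read_busy_us(path: Path = DEFAULT_BUSY_PATH) -> int:
    """Read the cumulative NPU busy time in microseconds."""
    return int(path.read_text().strip())


def measure_busy_delta(work: Callable[[], object],
                       path: Path = DEFAULT_BUSY_PATH) -> Tuple[object, int]:
    """Run work() and return (result, busy-time delta in microseconds)."""
    before = read_busy_us(path)
    result = work()
    return result, read_busy_us(path) - before
```

A counter like this is device-wide, so other NPU users running at the same time could inflate the delta; a positive value shows the NPU was active during the window, not that only this job used it.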

Observed result from a mostly cold smoke run after download and cache setup:

{
  "text": "\"Atlas Summarizes NPU Worker Options Requested by User\"",
  "timing_ms": {"load": 10989.08, "generate": 3157.94, "total": 14147.02},
  "npu_busy_delta_us": 2650724
}
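
As a quick sanity check on these numbers, the timings can be compared with the busy-time delta. The sketch below only does arithmetic on the values shown above. Which part of the run the busy-time window covers is not stated here, so the ratios are rough indicators only.

```python
import json

observed = json.loads("""
{
  "timing_ms": {"load": 10989.08, "generate": 3157.94, "total": 14147.02},
  "npu_busy_delta_us": 2650724
}
""")

timing = observed["timing_ms"]
busy_ms = observed["npu_busy_delta_us"] / 1000.0  # microseconds to milliseconds

# load + generate should account for the reported total, up to rounding.
assert abs(timing["load"] + timing["generate"] - timing["total"]) < 0.01

share_of_generate = busy_ms / timing["generate"]
share_of_total = busy_ms / timing["total"]
print(f"busy={busy_ms:.1f} ms, "
      f"{share_of_generate:.0%} of generate, {share_of_total:.0%} of total")
```

The busy time is about 2650.7 ms, roughly 84% of the generate time and 19% of the total, which is consistent with generation running on the NPU rather than falling back to another device.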

CLI usage

/home/will/.venvs/npu/bin/python worker.py \
  --job title \
  --input 'Kanban task asks for a small OpenVINO GenAI NPU worker prototype.'

Exit code is non-zero if validation fails, generation fails, or the worker-reported npu_busy_delta_us is not positive.
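
The exit-code rule can be read as three failure checks in sequence. The following sketch shows one possible shape. run_cli_job is a hypothetical name, the generate argument stands in for the real worker call, and the use of exit code 1 for every failure is an assumption (this README only promises "non-zero").

```python
import sys


def run_cli_job(generate, job: str, text: str) -> int:
    """Return a process exit code; `generate` stands in for the real worker call."""
    try:
        result = generate(job, text)
    except ValueError as exc:   # validation failure
        print(f"invalid request: {exc}", file=sys.stderr)
        return 1
    except Exception as exc:    # generation failure
        print(f"generation failed: {exc}", file=sys.stderr)
        return 1
    if result.get("npu_busy_delta_us", 0) <= 0:
        # Text was produced, but there is no evidence the NPU did the work.
        print("npu_busy_delta_us is not positive", file=sys.stderr)
        return 1
    print(result["text"])
    return 0
```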

HTTP usage

Start locally only:

cd /home/will/lab/swarm/openvino-genai-npu-worker
if ss -ltn | grep -q ':18820 '; then
  echo 'port 18820 already in use' >&2
else
  /home/will/.venvs/npu/bin/python worker.py --host 127.0.0.1 --port 18820
fi

The server also refuses startup if a listener is already accepting connections on 127.0.0.1:18820.
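
A startup guard of this kind can be as simple as attempting a TCP connection to the target address. This is a minimal sketch of the idea, with listener_present as a hypothetical name; the actual check in worker.py may differ.

```python
import socket


def listener_present(host: str = "127.0.0.1", port: int = 18820,
                     timeout: float = 0.5) -> bool:
    """Return True if something accepts a TCP connection on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        probe.settimeout(timeout)
        # connect_ex returns 0 on success and an errno value otherwise.
        return probe.connect_ex((host, port)) == 0
```

A probe like this is inherently racy: another process could bind the port between the check and the server's own bind, so the bind error still needs to be handled.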

Endpoints:

GET  /healthz
GET  /models
POST /v1/worker/generate
POST /v1/worker/extract-memory-candidates
POST /v1/worker/condense-notification
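
The endpoint list can be pictured as a small route table. The sketch below is illustrative only: it assumes the two dedicated POST endpoints are shortcuts for the memory_candidate and notification jobs, which this README does not state, and resolve is a hypothetical helper.

```python
ROUTES = {
    ("GET", "/healthz"): "health",
    ("GET", "/models"): "models",
    ("POST", "/v1/worker/generate"): "generate",
    # Assumption: these map onto the job names listed at the top of this README.
    ("POST", "/v1/worker/extract-memory-candidates"): "memory_candidate",
    ("POST", "/v1/worker/condense-notification"): "notification",
}


def resolve(method: str, path: str):
    """Return (status, handler_name); handler_name is None unless status is 200."""
    path = path.split("?", 1)[0]  # ignore any query string
    handler = ROUTES.get((method.upper(), path))
    if handler is not None:
        return 200, handler
    if any(known_path == path for _, known_path in ROUTES):
        return 405, None  # path exists, but not for this method
    return 404, None
```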

Example:

curl -s http://127.0.0.1:18820/v1/worker/generate \
  -H 'Content-Type: application/json' \
  -d '{"job":"summary","input":"Build a bounded local NPU worker for small generation tasks, no primary routing changes.","max_new_tokens":80}' \
  | python -m json.tool

The response includes npu_busy_delta_us; treat a value of zero as a failure even if the HTTP status is 200.
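
A client that applies this rule might look like the following sketch. It uses only the Python standard library. The names check_response and generate are hypothetical, and the request fields mirror the curl example above rather than a documented schema.

```python
import json
import urllib.request

URL = "http://127.0.0.1:18820/v1/worker/generate"


def check_response(status: int, body: dict) -> dict:
    """Reject responses that are not 200 or that show no NPU activity."""
    if status != 200:
        raise RuntimeError(f"unexpected HTTP status {status}")
    if int(body.get("npu_busy_delta_us", 0)) <= 0:
        raise RuntimeError("npu_busy_delta_us is not positive")
    return body


def generate(job: str, text: str, max_new_tokens: int = 80,
             timeout: float = 60.0) -> dict:
    """POST one job to the local worker and validate the reply."""
    payload = json.dumps(
        {"job": job, "input": text, "max_new_tokens": max_new_tokens}
    ).encode("utf-8")
    request = urllib.request.Request(
        URL, data=payload,
        headers={"Content-Type": "application/json"}, method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return check_response(response.status, json.loads(response.read()))
```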

Unit tests

These tests use only synthetic strings and a fake GenAI pipeline, so they do not load the model or touch private data:

cd /home/will/lab/swarm/openvino-genai-npu-worker
python -m pytest -q
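
To illustrate the approach, a fake pipeline and a synthetic busy-time counter might look like the sketch below. The class and function names are hypothetical and are not copied from tests/test_worker.py.

```python
class FakePipeline:
    """Stands in for the GenAI pipeline; records calls and returns a fixed reply."""

    def __init__(self, reply: str = "synthetic title"):
        self.reply = reply
        self.calls = []

    def generate(self, prompt: str, max_new_tokens: int = 256) -> str:
        self.calls.append((prompt, max_new_tokens))
        return self.reply


class FakeBusyCounter:
    """Synthetic busy-time counter that advances by step_us on every read."""

    def __init__(self, step_us: int = 1000):
        self.value = 0
        self.step_us = step_us

    def read(self) -> int:
        current = self.value
        self.value += self.step_us
        return current


def run_with_fakes(pipeline, counter, prompt: str) -> dict:
    """Mimic the worker flow: read counter, generate, read counter again."""
    before = counter.read()
    text = pipeline.generate(prompt, max_new_tokens=32)
    return {"text": text, "npu_busy_delta_us": counter.read() - before}
```

Setting step_us to 0 simulates an idle NPU, which lets a test exercise the "zero delta is a failure" path without hardware.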

Environment variables

OV_GENAI_NPU_MODEL=/home/will/models/openvino-genai/Qwen2.5-1.5B-Instruct-int4-ov
OV_GENAI_NPU_CACHE=/home/will/.cache/openvino/genai-npu/qwen2.5-1.5b-int4
OV_GENAI_NPU_HOST=127.0.0.1
OV_GENAI_NPU_PORT=18820

Only 127.0.0.1 is accepted by the current prototype; wider binds require an explicit code change and approval.
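
Reading these variables and enforcing the loopback-only rule could be done as in the sketch below. load_settings is a hypothetical helper, and the defaults are the values listed above; the real parsing in worker.py may differ.

```python
import os

DEFAULTS = {
    "OV_GENAI_NPU_MODEL": "/home/will/models/openvino-genai/Qwen2.5-1.5B-Instruct-int4-ov",
    "OV_GENAI_NPU_CACHE": "/home/will/.cache/openvino/genai-npu/qwen2.5-1.5b-int4",
    "OV_GENAI_NPU_HOST": "127.0.0.1",
    "OV_GENAI_NPU_PORT": "18820",
}


def load_settings(env=None) -> dict:
    """Merge environment overrides with defaults and refuse non-loopback binds."""
    env = os.environ if env is None else env
    values = {key: env.get(key, default) for key, default in DEFAULTS.items()}
    if values["OV_GENAI_NPU_HOST"] != "127.0.0.1":
        raise ValueError("only 127.0.0.1 is accepted as bind host")
    port = int(values["OV_GENAI_NPU_PORT"])
    if not 1 <= port <= 65535:
        raise ValueError(f"port out of range: {port}")
    return {
        "model": values["OV_GENAI_NPU_MODEL"],
        "cache": values["OV_GENAI_NPU_CACHE"],
        "host": values["OV_GENAI_NPU_HOST"],
        "port": port,
    }
```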

Optional systemd user service

A draft unit exists at systemd/openvino-genai-npu-worker.service for later review. Do not copy, enable, or autostart it unless Will explicitly approves persistent service enablement. A foreground smoke run on 127.0.0.1:18820 with a positive sysfs NPU busy-time delta is required before any installation discussion.

Safety boundaries

Binds only to 127.0.0.1 by default; non-local bind is refused in code.
No raw request-body logging.
No private external uploads.
No Atlas/Hermes gateway restarts or primary model routing changes.
NPU access is serialized with a process lock because the NPU is a shared resource with existing services.
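
This README does not say how the lock is implemented. One common way to serialize access to a shared device on Linux is an advisory file lock, sketched below. The npu_lock helper and the idea of a lock file are assumptions for illustration, not a description of worker.py.

```python
import fcntl
import os
from contextlib import contextmanager


@contextmanager
def npu_lock(path: str, blocking: bool = True):
    """Hold an exclusive advisory lock on `path` for the duration of the block."""
    fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o600)
    try:
        flags = fcntl.LOCK_EX | (0 if blocking else fcntl.LOCK_NB)
        # Raises BlockingIOError if non-blocking and the lock is already held.
        fcntl.flock(fd, flags)
        yield
    finally:
        # Closing the descriptor releases the lock.
        os.close(fd)
```

An advisory lock only coordinates processes that open the same lock file, so existing NPU services would be serialized with the worker only if they adopt the same convention.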