# OpenVINO GenAI NPU worker prototype

Local-only prototype for cheap, bounded background generation on Will's Intel NPU. It is intentionally isolated from primary Atlas/Hermes routing.

## What it does

- Model: `OpenVINO/Qwen2.5-1.5B-Instruct-int4-ov`.
- Runtime: `/home/will/.venvs/npu` with `openvino-genai==2026.2.0.0`.
- Device: OpenVINO GenAI `NPU`.
- Default bind: `127.0.0.1:18820`.
- Jobs: `title`, `summary`, `notification`, `memory_candidate`.
- Prompt/input limits: input up to 6000 characters, `MAX_PROMPT_LEN=1024` (prompt tokens), at most 256 generated tokens.

The worker does not write memory, does not restart Atlas/Hermes, does not change primary routing, and does not log raw prompt bodies by default.
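
The limits and job list above could be enforced with a small validation step before anything reaches the NPU. The following is a minimal sketch, not the code in `worker.py`: the names `bound_request` and `BoundedRequest` are hypothetical, and rejecting oversized input (rather than truncating it) is an assumption.

```python
from dataclasses import dataclass

# Values taken from the "What it does" list.
ALLOWED_JOBS = {"title", "summary", "notification", "memory_candidate"}
MAX_INPUT_CHARS = 6000
MAX_NEW_TOKENS_CAP = 256


@dataclass(frozen=True)
class BoundedRequest:
    """Hypothetical container for a request that passed all limits."""
    job: str
    text: str
    max_new_tokens: int


def bound_request(job: str, text: str,
                  max_new_tokens: int = MAX_NEW_TOKENS_CAP) -> BoundedRequest:
    """Validate a job request against the documented limits."""
    if job not in ALLOWED_JOBS:
        raise ValueError(f"unsupported job: {job!r}")
    # Assumption: oversized input is rejected instead of silently truncated.
    if len(text) > MAX_INPUT_CHARS:
        raise ValueError(f"input exceeds {MAX_INPUT_CHARS} characters")
    # Clamp the requested token budget into the range [1, 256].
    tokens = max(1, min(int(max_new_tokens), MAX_NEW_TOKENS_CAP))
    return BoundedRequest(job=job, text=text, max_new_tokens=tokens)
```

Clamping `max_new_tokens` instead of rejecting it keeps callers simple while still guaranteeing the 256-token ceiling.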

## Files

- `worker.py` — stdlib HTTP API plus CLI wrapper.
- `smoke_llm_npu.py` — direct GenAI smoke test with NPU busy-time verification.
- `systemd/openvino-genai-npu-worker.service` — optional user-service template; not installed by this prototype.

## Model/cache

Downloaded model path:

```text
/home/will/models/openvino-genai/Qwen2.5-1.5B-Instruct-int4-ov
```

OpenVINO compile cache path:

```text
/home/will/.cache/openvino/genai-npu/qwen2.5-1.5b-int4
```

NPU pipeline config used by the prototype:

```text
CACHE_DIR=/home/will/.cache/openvino/genai-npu/qwen2.5-1.5b-int4
MAX_PROMPT_LEN=1024
MIN_RESPONSE_LEN=64
PREFILL_HINT=DYNAMIC
GENERATE_HINT=FAST_COMPILE
```
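
As a rough illustration of how these settings might reach the runtime, the sketch below builds the config as a dict and passes it to `openvino_genai.LLMPipeline` with the `NPU` device. The helper names `build_pipeline_config` and `load_pipeline` are hypothetical, and how `worker.py` actually wires this up is not shown here.

```python
from pathlib import Path

# Paths as listed in this README.
MODEL_DIR = Path("/home/will/models/openvino-genai/Qwen2.5-1.5B-Instruct-int4-ov")
CACHE_DIR = Path("/home/will/.cache/openvino/genai-npu/qwen2.5-1.5b-int4")


def build_pipeline_config(cache_dir: Path = CACHE_DIR) -> dict:
    """Return the NPU pipeline properties used by the prototype."""
    return {
        "CACHE_DIR": str(cache_dir),
        "MAX_PROMPT_LEN": 1024,
        "MIN_RESPONSE_LEN": 64,
        "PREFILL_HINT": "DYNAMIC",
        "GENERATE_HINT": "FAST_COMPILE",
    }


def load_pipeline(model_dir: Path = MODEL_DIR):
    """Compile the model for the NPU (slow on a cold cache)."""
    # Imported lazily so the config helper works without the NPU venv.
    import openvino_genai as ov_genai

    return ov_genai.LLMPipeline(str(model_dir), "NPU", build_pipeline_config())
```

A caller would then run something like `load_pipeline().generate(prompt, max_new_tokens=80)` inside the `/home/will/.venvs/npu` environment.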

AOT/blob note: first milestone uses `CACHE_DIR` only. Do not switch to manual `EXPORT_BLOB`/`BLOB_PATH` until compile latency is proven to be the bottleneck. If explicit blobs are used later, record OpenVINO version, NPU compiler version, driver version, model id, quantization flags, and source weights path; invalidate blobs after OpenVINO/NPU driver upgrades.
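
If explicit blobs are adopted later, the metadata listed above could be stored next to each blob and compared at startup. This is only a sketch of that idea: `BlobManifest` and `blob_is_stale` are hypothetical names, and how each version string is obtained is left open.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class BlobManifest:
    """Hypothetical record of everything a compiled blob depends on."""
    openvino_version: str
    npu_compiler_version: str
    driver_version: str
    model_id: str
    quantization_flags: str
    source_weights_path: str

    def fingerprint(self) -> str:
        # Stable hash over all fields; any change yields a new fingerprint.
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()


def blob_is_stale(recorded: BlobManifest, current: BlobManifest) -> bool:
    """True when the blob was built under a different environment."""
    return recorded.fingerprint() != current.fingerprint()
```

Treating any field change as stale is deliberately conservative: a needless recompile is cheaper than loading a blob built for a different driver.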

## Direct smoke test

```bash
cd /home/will/lab/swarm/openvino-genai-npu-worker
/home/will/.venvs/npu/bin/python smoke_llm_npu.py
```

Acceptance requires `npu_busy_delta_us > 0`.
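
The busy-time check can be pictured as reading a cumulative counter before and after generation. The sketch below is an assumption about the mechanism, not a copy of `smoke_llm_npu.py`: the sysfs path and the helper names are hypothetical and may differ on this machine.

```python
from pathlib import Path
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")

# Assumption: the Intel NPU driver exposes a cumulative busy-time counter
# in microseconds at a path like this one; verify it locally before use.
NPU_BUSY_PATH = Path("/sys/class/accel/accel0/device/npu_busy_time_us")


def read_busy_us(path: Path = NPU_BUSY_PATH) -> int:
    """Read the cumulative NPU busy time in microseconds."""
    return int(path.read_text().strip())


def run_with_busy_delta(work: Callable[[], T],
                        path: Path = NPU_BUSY_PATH) -> Tuple[T, int]:
    """Run `work` and return its result plus the busy-time delta."""
    before = read_busy_us(path)
    result = work()
    after = read_busy_us(path)
    return result, after - before
```

A delta of zero means the counter did not move while `work` ran, which is the case the acceptance rule is meant to catch (for example, a silent fallback to another device).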

Observed result from a mostly cold run, after download and cache setup:

```json
{
  "text": "\"Atlas Summarizes NPU Worker Options Requested by User\"",
  "timing_ms": {"load": 10989.08, "generate": 3157.94, "total": 14147.02},
  "npu_busy_delta_us": 2650724
}
```

## CLI usage

```bash
/home/will/.venvs/npu/bin/python worker.py \
  --job title \
  --input 'Kanban task asks for a small OpenVINO GenAI NPU worker prototype.'
```

## HTTP usage

Start locally only:

```bash
cd /home/will/lab/swarm/openvino-genai-npu-worker
/home/will/.venvs/npu/bin/python worker.py --host 127.0.0.1 --port 18820
```
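
The local-only rule could be enforced with a loopback check before the server socket is created. This is a minimal sketch under the assumption that only literal IP addresses are accepted; `require_loopback` is a hypothetical name, not necessarily what `worker.py` uses.

```python
import ipaddress


def require_loopback(host: str) -> str:
    """Return `host` if it is a loopback IP address, else raise ValueError."""
    try:
        addr = ipaddress.ip_address(host)
    except ValueError:
        # Assumption: hostnames are refused outright so that DNS or
        # /etc/hosts entries cannot redirect the bind to another interface.
        raise ValueError(f"refusing non-literal bind host: {host!r}") from None
    if not addr.is_loopback:
        raise ValueError(f"refusing non-local bind address: {host}")
    return host
```

With this check, `--host 0.0.0.0` fails at startup instead of exposing the worker on the network.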

Endpoints:

```text
GET  /healthz
GET  /models
POST /v1/worker/generate
POST /v1/worker/extract-memory-candidates
POST /v1/worker/condense-notification
```

Example:

```bash
curl -s http://127.0.0.1:18820/v1/worker/generate \
  -H 'Content-Type: application/json' \
  -d '{"job":"summary","input":"Build a bounded local NPU worker for small generation tasks, no primary routing changes.","max_new_tokens":80}' \
  | python -m json.tool
```

Response includes `npu_busy_delta_us`; treat zero as failure even if HTTP status is 200.
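
A client can apply that rule mechanically. The sketch below uses only the standard library and mirrors the `curl` example; the function names are hypothetical, and response fields other than `npu_busy_delta_us` are not assumed.

```python
import json
import urllib.request


def check_worker_response(status: int, payload: dict) -> dict:
    """Raise unless the call succeeded and the NPU actually did work."""
    if status != 200:
        raise RuntimeError(f"worker returned HTTP {status}")
    delta = payload.get("npu_busy_delta_us")
    if not isinstance(delta, int) or delta <= 0:
        raise RuntimeError(f"no NPU activity recorded (delta={delta!r})")
    return payload


def generate(job: str, text: str, max_new_tokens: int = 80,
             base_url: str = "http://127.0.0.1:18820") -> dict:
    """POST a job to the local worker and validate the response."""
    body = json.dumps(
        {"job": job, "input": text, "max_new_tokens": max_new_tokens}
    ).encode("utf-8")
    req = urllib.request.Request(
        base_url + "/v1/worker/generate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    # Generous timeout: the first request may include model compilation.
    with urllib.request.urlopen(req, timeout=120) as resp:
        return check_worker_response(resp.status, json.load(resp))
```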

## Optional systemd user service

A draft unit exists at `systemd/openvino-genai-npu-worker.service` for later review. Do not copy, enable, or autostart it unless Will explicitly approves persistent service enablement. Foreground smoke on `127.0.0.1:18820` plus positive sysfs NPU busy-time delta is required before any installation discussion.

## Safety boundaries

- Binds only to `127.0.0.1` by default; non-local bind is refused in code.
- No raw request-body logging by default.
- No private external uploads.
- No Atlas/Hermes gateway restarts or primary model routing changes.
- NPU access is serialized with a process lock because the NPU is a shared resource with existing services.
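
One way to serialize NPU access is an advisory file lock held for the duration of each generation. The sketch below is an assumption about the mechanism rather than the prototype's actual lock: the `npu_lock` name and the lock-file location are hypothetical, and it only protects against other processes that take the same lock.

```python
import fcntl
import os
from contextlib import contextmanager


@contextmanager
def npu_lock(lock_path: str):
    """Hold an exclusive advisory lock on `lock_path` while the body runs."""
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o600)
    try:
        # Blocks until no other holder has the lock.
        fcntl.flock(fd, fcntl.LOCK_EX)
        try:
            yield
        finally:
            fcntl.flock(fd, fcntl.LOCK_UN)
    finally:
        os.close(fd)
```

Usage would look like `with npu_lock("/run/user/1000/npu.lock"): ...`, where the path is only an example. The kernel drops a `flock` lock when its holder exits, so a crashed worker cannot leave the NPU permanently locked.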