feat(npu): add bounded OpenVINO GenAI worker

2026-06-04 13:07:51 -07:00
parent d3373e7234
commit 2ef9e3dfd2
7 changed files with 972 additions and 0 deletions
@@ -0,0 +1,306 @@
+# Bounded OpenVINO GenAI NPU worker contract
+
+Status: prototype contract implemented locally; not a live Atlas/Hermes routing dependency.
+Default address: `http://127.0.0.1:18820`.
+
+## Purpose and hard boundary
+
+This worker is a local-only sidecar for small, bounded generation jobs that are useful around the assistant stack but are not primary chat: title drafting, short summaries, notification condensation, and memory-candidate extraction. It must not be used as Atlas/Hermes primary model routing, gateway fallback routing, autonomous tool-calling, or an unbounded chat endpoint without a separate approval gate.
+
+Hard boundaries:
+
+- Bind to `127.0.0.1` by default; non-local bind is a code/ops review item, not a runtime flag to casually change.
+- Do not enable a persistent systemd/Docker service as part of smoke testing.
+- Do not restart or reconfigure Atlas, Hermes, gateway, LiteLLM, RAG, or n8n routing to call this worker without explicit approval from Will.
+- Do not write memory, mutate Chroma/vector collections, trigger RAG reindexing, or process private document/image directories.
+- Do not log raw prompts or raw request bodies by default.
+- Treat HTTP success as insufficient for NPU claims; require positive `/sys/class/accel/accel0/device/npu_busy_time_us` delta for generation.
+
+## Recommended model/runtime
+
+Recommended first model:
+
+- Model id: `OpenVINO/Qwen2.5-1.5B-Instruct-int4-ov`
+- Local path: `/home/will/models/openvino-genai/Qwen2.5-1.5B-Instruct-int4-ov`
+- Runtime: `/home/will/.venvs/npu` with `openvino-genai==2026.2.0.0`
+- Device: OpenVINO GenAI `NPU`
+- Compile cache: `/home/will/.cache/openvino/genai-npu/qwen2.5-1.5b-int4`
+
+Why this model/runtime:
+
+- It is already staged in the repo prototype and has a local smoke observation with positive NPU busy-time delta.
+- It is an OpenVINO IR model with INT4-compressed weights, which keeps memory/compile pressure low enough for a sidecar on the shared NPU.
+- Qwen2.5-1.5B-Instruct is large enough for formatting/summarization/notification jobs but small enough to keep latency bounded. It should not be marketed as a high-quality general assistant model.
+- The Hugging Face model card identifies it as Qwen2.5-1.5B-Instruct converted to OpenVINO IR with INT4_SYM NNCF weight compression and states compatibility with OpenVINO 2025.1.0+; the local runtime is newer than that baseline.
+- OpenVINO GenAI `LLMPipeline` is the right first runtime because the existing local NPU stack already uses OpenVINO GenAI successfully for Whisper, and it exposes a simple bounded generate call with cache controls.
+
+Deferred alternatives:
+
+- Larger 3B/7B local LLMs: defer until the 1.5B contract proves stable; larger models increase compile time, memory pressure, and NPU contention.
+- CPU/GPU fallback inside this service: defer; fallback would blur the NPU verification contract. If fallback is later approved, return `device_actual` and keep NPU-only health separate.
+- Manual `EXPORT_BLOB`/`BLOB_PATH`: defer until compile latency is proven to dominate despite `CACHE_DIR`. If used later, record OpenVINO version, NPU compiler/driver versions, model id, quantization flags, and source model path; invalidate after OpenVINO/NPU driver upgrades.
+
+## Runtime bounds
+
+Pipeline configuration for the first milestone:
+
+```text
+CACHE_DIR=/home/will/.cache/openvino/genai-npu/qwen2.5-1.5b-int4
+MAX_PROMPT_LEN=1024
+MIN_RESPONSE_LEN=64
+PREFILL_HINT=DYNAMIC
+GENERATE_HINT=FAST_COMPILE
+```
+
+Request bounds:
+
+- `input`: required non-empty string; max `6000` characters before prompt templating.
+- `job`: one of `title`, `summary`, `notification`, `memory_candidate`.
+- `max_new_tokens`: optional; default by job; hard max `256`.
+- Concurrency: generation must be serialized inside the process with a lock because the NPU is shared with Whisper/embeddings/prototype sidecars.
+- Logging: log method/path/status and timing only; never log raw `input` or generated text by default.
+
+Expected latency target:
+
+- Cold-ish first generation with cache available: acceptable if roughly 15 seconds or less for a short prompt on the staged model.
+- Warm short jobs: target under 5 seconds for `title`/`notification` and under 10 seconds for `summary`/`memory_candidate`.
+- Defer promotion if p95 warm latency exceeds 15 seconds for 24-96 generated tokens, or if cold compile regularly blocks the NPU long enough to degrade live Whisper/embeddings.
+
+These are prototype acceptance targets, not SLOs for live Atlas routing.
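The promotion gate above can be checked mechanically. What follows is a minimal sketch, assuming warm latencies have already been collected in milliseconds by some measurement harness. `p95_nearest_rank` and `should_defer_promotion` are hypothetical helper names that do not exist in `worker.py`, and nearest-rank is only one of several reasonable p95 definitions.

```python
P95_WARM_LIMIT_MS = 15_000  # "p95 warm latency exceeds 15 seconds" from the contract


def p95_nearest_rank(samples_ms: list[float]) -> float:
    """Return the nearest-rank 95th percentile of a non-empty sample list."""
    if not samples_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(samples_ms)
    # Integer ceil(0.95 * n) avoids floating-point edge cases; rank is 1-based.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def should_defer_promotion(samples_ms: list[float]) -> bool:
    """True when warm p95 latency exceeds the prototype limit (hypothetical gate)."""
    return p95_nearest_rank(samples_ms) > P95_WARM_LIMIT_MS
```

With only a handful of samples the p95 is dominated by the slowest runs, so a review handoff would probably want to state the sample count next to the verdict.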
+
+## CLI contract
+
+Command shape:
+
+```bash
+cd /home/will/lab/swarm/openvino-genai-npu-worker
+/home/will/.venvs/npu/bin/python worker.py \
+  --job title \
+  --input 'Synthetic non-private text to title.' \
+  --max-new-tokens 32
+```
+
+CLI stdout is JSON with the same response shape as HTTP generation. Exit code must be:
+
+- `0` when the job succeeds and `npu_busy_delta_us > 0`.
+- non-zero when input validation fails, model load/generation fails, or NPU busy-time delta is not positive.
+
+The CLI must not write memory, change service routing, or start persistent services.
+
+## HTTP contract
+
+Start temporary local server only:
+
+```bash
+cd /home/will/lab/swarm/openvino-genai-npu-worker
+/home/will/.venvs/npu/bin/python worker.py --host 127.0.0.1 --port 18820
+```
+
+Endpoints:
+
+```text
+GET  /healthz
+GET  /models
+POST /v1/worker/generate
+POST /v1/worker/extract-memory-candidates
+POST /v1/worker/condense-notification
+```
+
+`GET /healthz` response fields:
+
+```json
+{
+  "ok": true,
+  "model": "OpenVINO/Qwen2.5-1.5B-Instruct-int4-ov",
+  "model_path": "/home/will/models/openvino-genai/Qwen2.5-1.5B-Instruct-int4-ov",
+  "device": "NPU",
+  "cache_dir": "/home/will/.cache/openvino/genai-npu/qwen2.5-1.5b-int4",
+  "cache_exists": true,
+  "loaded": false,
+  "initial_load_ms": null,
+  "busy_time_us": 0,
+  "max_input_chars": 6000,
+  "max_new_tokens": 256,
+  "jobs": ["memory_candidate", "notification", "summary", "title"],
+  "bind": "127.0.0.1:18820"
+}
+```
+
+`POST /v1/worker/generate` request:
+
+```json
+{
+  "job": "summary",
+  "input": "Synthetic non-private text to summarize.",
+  "max_new_tokens": 80
+}
+```
+
+Specialized aliases:
+
+- `POST /v1/worker/extract-memory-candidates` implies `job=memory_candidate`.
+- `POST /v1/worker/condense-notification` implies `job=notification`.
+- Backward-compatible request `job=memory` may map to `memory_candidate`, but new clients should use `memory_candidate`.
+
+Successful generation response:
+
+```json
+{
+  "model": "OpenVINO/Qwen2.5-1.5B-Instruct-int4-ov",
+  "device": "NPU",
+  "job": "summary",
+  "text": "...",
+  "json": null,
+  "timing_ms": {
+    "load": 0.0,
+    "initial_load": 10989.08,
+    "generate": 3157.94,
+    "total": 3157.94
+  },
+  "npu_busy_delta_us": 2650724,
+  "npu_busy_before_us": 123,
+  "npu_busy_after_us": 2650847,
+  "cache_dir": "/home/will/.cache/openvino/genai-npu/qwen2.5-1.5b-int4"
+}
+```
+
+Validation/error behavior:
+
+- Unsupported path: `404` JSON `{"error":"not found"}`.
+- Unsupported job, empty input, too-long input, invalid token bound, missing model, or generation failure: JSON `{"error":"..."}` with a non-2xx status. The current stdlib prototype returns `400` for all of these errors; future implementations should keep the status non-2xx.
+- If `npu_busy_delta_us <= 0`, the response should be treated as failed by smoke tests even if an HTTP handler emitted `200`; the refreshed prototype returns `503` with the generation payload plus an `error` field.
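A client-side reading of the rules above can be sketched as a single classification step. This is an illustrative sketch only: `classify_generate_response` is a hypothetical helper, not part of the worker, and the reason strings are made up for the example. It assumes the caller has already obtained the HTTP status and the decoded JSON body.

```python
from typing import Any


def classify_generate_response(status: int, payload: Any) -> str:
    """Return "ok" or a short failure reason for one generate response."""
    if not isinstance(payload, dict):
        return "malformed response body"
    if "error" in payload:
        # Covers the 400 validation errors and the 503 payload with an `error` field.
        return f"worker error: {payload['error']}"
    if not 200 <= status < 300:
        return f"unexpected HTTP status {status}"
    delta = payload.get("npu_busy_delta_us")
    if not isinstance(delta, int) or delta <= 0:
        # HTTP success alone is not proof of NPU execution.
        return "no positive NPU busy-time delta"
    return "ok"
```

Checking the `error` field before the status code means a `503` that still carries generated text is reported with the worker's own message rather than a generic status failure.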
+
+## Prompt/job contract
+
+`title`:
+
+- Input: short task/log/message excerpt.
+- Output: one title, 8 words or fewer, no markdown required.
+- Default `max_new_tokens`: 32.
+
+`summary`:
+
+- Input: synthetic/non-private text excerpt.
+- Output: one short paragraph or up to 4 bullets.
+- Default `max_new_tokens`: 160.
+
+`notification`:
+
+- Input: synthetic/non-private alert/log excerpt.
+- Output target: JSON object with `severity`, `category`, `summary`, `action_needed`.
+- Default `max_new_tokens`: 96.
+- Client must tolerate `json: null` and parse/validate before using output.
+
+`memory_candidate`:
+
+- Input: synthetic/non-private conversation excerpt.
+- Output target: JSON object with `candidates` and `notes`; candidates are proposals only.
+- Default `max_new_tokens`: 192.
+- This worker must never call Hermes memory tools or write durable memory directly.
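Because clients must tolerate `json: null` and validate before use, a consumer of the `notification` job might gate on the documented keys as sketched below. `validated_notification` is a hypothetical client helper and is not part of this commit. The severity set is taken from the worker's notification prompt (`info|warning|error`); treating it as a strict allow-list is an assumption.

```python
from typing import Any, Optional

NOTIFICATION_KEYS = ("severity", "category", "summary", "action_needed")
SEVERITIES = {"info", "warning", "error"}  # assumption: prompt wording used as an allow-list


def validated_notification(payload: dict[str, Any]) -> Optional[dict[str, Any]]:
    """Return the `json` object if usable, else None so the caller can fall back to `text`."""
    obj = payload.get("json")
    if not isinstance(obj, dict):
        return None  # covers `json: null` and non-object model output
    if any(key not in obj for key in NOTIFICATION_KEYS):
        return None
    severity = obj["severity"]
    if not isinstance(severity, str) or severity not in SEVERITIES:
        return None
    return obj
```

The same pattern would apply to `memory_candidate`, checking for `candidates` and `notes` instead; in both cases the result is a proposal for a reviewer or downstream gate, never a direct write.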
+
+## Smoke-test plan using non-private data
+
+Do not use private vault notes, screenshots, email, chat logs, or document/image directories. Use synthetic text like this:
+
+```text
+Atlas received a kanban notification that an OpenVINO NPU prototype finished smoke testing. The reviewer needs a concise status and next action. No live gateway routing changed.
+```
+
+Direct NPU smoke:
+
+```bash
+cd /home/will/lab/swarm/openvino-genai-npu-worker
+before=$(cat /sys/class/accel/accel0/device/npu_busy_time_us)
+/home/will/.venvs/npu/bin/python smoke_llm_npu.py \
+  --prompt 'Write a concise title for: synthetic NPU worker contract smoke.' \
+  --max-new-tokens 24
+status=$?
+after=$(cat /sys/class/accel/accel0/device/npu_busy_time_us)
+printf 'external_busy_delta_us=%s\n' "$((after-before))"
+test "$status" -eq 0
+test "$((after-before))" -gt 0
+```
+
+Temporary HTTP smoke:
+
+```bash
+cd /home/will/lab/swarm/openvino-genai-npu-worker
+/home/will/.venvs/npu/bin/python worker.py --host 127.0.0.1 --port 18820 &
+pid=$!
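+trap 'kill "$pid" 2>/dev/null || true' EXIT
+
+# Wait for the listener before issuing requests; the server needs a moment to bind.
+for _ in $(seq 1 50); do
+  curl -fsS http://127.0.0.1:18820/healthz >/dev/null 2>&1 && break
+  sleep 0.2
+done
+curl -fsS http://127.0.0.1:18820/healthz | python -m json.tool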
+before=$(cat /sys/class/accel/accel0/device/npu_busy_time_us)
+curl -fsS http://127.0.0.1:18820/v1/worker/generate \
+  -H 'Content-Type: application/json' \
+  -d '{"job":"title","input":"Synthetic NPU worker smoke with no routing changes.","max_new_tokens":24}' \
+  | tee /tmp/openvino-genai-worker-smoke.json \
+  | python -m json.tool
+after=$(cat /sys/class/accel/accel0/device/npu_busy_time_us)
+python - <<'PY'
+import json
+p=json.load(open('/tmp/openvino-genai-worker-smoke.json'))
+assert p['npu_busy_delta_us'] > 0, p
+assert p['device'] == 'NPU', p
+PY
+test "$((after-before))" -gt 0
+kill "$pid"
+trap - EXIT
+```
+
+Also verify the temporary listener is gone:
+
+```bash
+ss -ltnp | grep ':18820' && { echo 'temporary smoke server still running'; exit 1; } || true
+```
+
+Unit tests that do not load the model or require private data:
+
+```bash
+cd /home/will/lab/swarm/openvino-genai-npu-worker
+python -m pytest -q
+```
+
+## NPU busy-time verification plan
+
+Acceptance for any NPU claim requires all of the following:
+
+1. Confirm the sysfs counter exists and is readable:
+   `test -r /sys/class/accel/accel0/device/npu_busy_time_us`.
+2. Read `busy_before` immediately before the generation call.
+3. Run exactly one bounded generation against the candidate worker.
+4. Read `busy_after` immediately after generation completes.
+5. Require `busy_after > busy_before` and response `npu_busy_delta_us > 0`.
+6. Record model id, runtime version, prompt chars, max tokens, load/generate timings, and busy delta in the review handoff.
+7. If the counter is unchanged, mark the smoke as failed even if HTTP returned `200` and text was generated.
+
+Because the NPU is shared, a positive external delta proves NPU activity during the window but not exclusive attribution. Prefer a quiet window with no concurrent Whisper/embedding jobs for review-grade measurements; otherwise repeat and compare worker-reported internal delta with the external counter.
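Steps 2 through 5 of the plan above can be sketched as a small wrapper that brackets exactly one call with counter reads. This is a minimal sketch, not the worker's implementation: `measure_busy_delta` is a hypothetical helper, and the `run` callable stands in for whatever performs the single bounded generation.

```python
from pathlib import Path
from typing import Callable, TypeVar

T = TypeVar("T")
BUSY_PATH = Path("/sys/class/accel/accel0/device/npu_busy_time_us")


def measure_busy_delta(run: Callable[[], T], busy_path: Path = BUSY_PATH) -> tuple[T, int]:
    """Run one generation callable and return (result, external busy delta in microseconds)."""
    before = int(busy_path.read_text().strip())  # read immediately before the call
    result = run()
    after = int(busy_path.read_text().strip())  # read immediately after it returns
    return result, after - before
```

A reviewer would then require the returned delta to be positive and compare it with the worker-reported `npu_busy_delta_us`. On a shared NPU the external window can also include busy time from other jobs, so the two values need not match exactly.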
+
+## Docs/diagram implications
+
+If this worker is kept as a prototype, docs and diagrams should show:
+
+- Live baseline remains RAG `:18810`, Whisper NPU `:18816`, embeddings `:18817`.
+- GenAI worker `:18820` is proposed/prototype/not-live unless explicitly approved and enabled.
+- No arrow from Hermes/Atlas gateway or LiteLLM primary routing to `:18820` unless a later approved integration actually exists.
+- Runbooks should include the CLI/HTTP smoke commands, `ss` listener checks, and NPU busy-time counter checks.
+- Service maps should label this as "bounded background generation" rather than "chat" or "assistant model".
+
+## Explicit no-go / defer criteria
+
+No-go for implementation or promotion:
+
+- Model path missing, OpenVINO GenAI import fails, or NPU device is unavailable.
+- `/sys/class/accel/accel0/device/npu_busy_time_us` is unreadable or does not increase during generation.
+- Warm bounded jobs exceed the prototype latency target or starve live Whisper/embedding services.
+- The worker needs private documents/images/chat logs for smoke testing.
+- The worker requires Atlas/Hermes/gateway/LiteLLM/RAG routing changes to demonstrate value.
+- The API starts accepting arbitrary chat history, tool-call instructions, unbounded prompts, or large outputs.
+- The service logs raw prompt bodies by default.
+- Persistent service enablement is requested without an explicit Will approval gate and a reviewer smoke handoff.
+
+Defer, do not solve in this lane:
+
+- Primary assistant routing, LiteLLM model registration, gateway fallback, or tool-calling integration.
+- RAG query rewriting, RAG answer generation, or collection mutation.
+- Private document/image triage.
+- Multi-model selection, CPU/GPU fallback policy, batching, streaming, or auth exposure beyond localhost.
@@ -0,0 +1,142 @@
+# OpenVINO GenAI NPU worker prototype
+
+Local-only prototype for cheap bounded background generation on Will's Intel NPU. It is intentionally isolated from primary Atlas/Hermes routing.
+
+## What it does
+
+- Model: `OpenVINO/Qwen2.5-1.5B-Instruct-int4-ov`.
+- Runtime: `/home/will/.venvs/npu` with `openvino-genai==2026.2.0.0`.
+- Device: OpenVINO GenAI `NPU`.
+- Default bind: `127.0.0.1:18820`.
+- Jobs: `title`, `summary`, `notification`, `memory_candidate`.
+- Prompt/input limits: 6000 chars, `MAX_PROMPT_LEN=1024`, max 256 generated tokens.
+
+The worker does not write memory, does not restart Atlas/Hermes, does not change primary routing, and does not log raw prompt bodies by default.
+
+## Files
+
+- `CONTRACT.md` — bounded-worker service contract, endpoint/CLI API, smoke plan, NPU verification, docs implications, and no-go criteria.
+- `worker.py` — stdlib HTTP API plus CLI wrapper.
+- `smoke_llm_npu.py` — direct GenAI smoke test with NPU busy-time verification.
+- `tests/test_worker.py` — unit tests with a fake GenAI pipeline and synthetic busy-time counter.
+- `systemd/openvino-genai-npu-worker.service` — optional user-service template; not installed by this prototype.
+
+## Model/cache
+
+Downloaded model path:
+
+```text
+/home/will/models/openvino-genai/Qwen2.5-1.5B-Instruct-int4-ov
+```
+
+OpenVINO compile cache path:
+
+```text
+/home/will/.cache/openvino/genai-npu/qwen2.5-1.5b-int4
+```
+
+NPU pipeline config used by the prototype:
+
+```text
+CACHE_DIR=/home/will/.cache/openvino/genai-npu/qwen2.5-1.5b-int4
+MAX_PROMPT_LEN=1024
+MIN_RESPONSE_LEN=64
+PREFILL_HINT=DYNAMIC
+GENERATE_HINT=FAST_COMPILE
+```
+
+AOT/blob note: first milestone uses `CACHE_DIR` only. Do not switch to manual `EXPORT_BLOB`/`BLOB_PATH` until compile latency is proven to be the bottleneck. If explicit blobs are used later, record OpenVINO version, NPU compiler version, driver version, model id, quantization flags, and source weights path; invalidate blobs after OpenVINO/NPU driver upgrades.
+
+## Direct smoke test
+
+```bash
+cd /home/will/lab/swarm/openvino-genai-npu-worker
+/home/will/.venvs/npu/bin/python smoke_llm_npu.py
+```
+
+Acceptance requires `npu_busy_delta_us > 0`.
+
+Observed cold-ish smoke after download/cache setup:
+
+```json
+{
+  "text": "\"Atlas Summarizes NPU Worker Options Requested by User\"",
+  "timing_ms": {"load": 10989.08, "generate": 3157.94, "total": 14147.02},
+  "npu_busy_delta_us": 2650724
+}
+```
+
+## CLI usage
+
+```bash
+/home/will/.venvs/npu/bin/python worker.py \
+  --job title \
+  --input 'Kanban task asks for a small OpenVINO GenAI NPU worker prototype.'
+```
+
+Exit code is non-zero if validation fails, generation fails, or the worker-reported `npu_busy_delta_us` is not positive.
+
+## HTTP usage
+
+Start locally only:
+
+```bash
+cd /home/will/lab/swarm/openvino-genai-npu-worker
+ss -ltnp | grep ':18820' && { echo 'port 18820 already in use'; exit 1; } || true
+/home/will/.venvs/npu/bin/python worker.py --host 127.0.0.1 --port 18820
+```
+
+The server also refuses startup if a listener is already accepting connections on `127.0.0.1:18820`.
+
+Endpoints:
+
+```text
+GET  /healthz
+GET  /models
+POST /v1/worker/generate
+POST /v1/worker/extract-memory-candidates
+POST /v1/worker/condense-notification
+```
+
+Example:
+
+```bash
+curl -s http://127.0.0.1:18820/v1/worker/generate \
+  -H 'Content-Type: application/json' \
+  -d '{"job":"summary","input":"Build a bounded local NPU worker for small generation tasks, no primary routing changes.","max_new_tokens":80}' \
+  | python -m json.tool
+```
+
+Response includes `npu_busy_delta_us`; treat a zero or negative value as failure even if the HTTP status is 200.
+
+## Unit tests
+
+These tests use only synthetic strings and a fake GenAI pipeline, so they do not load the model or touch private data:
+
+```bash
+cd /home/will/lab/swarm/openvino-genai-npu-worker
+python -m pytest -q
+```
+
+## Environment variables
+
+```text
+OV_GENAI_NPU_MODEL=/home/will/models/openvino-genai/Qwen2.5-1.5B-Instruct-int4-ov
+OV_GENAI_NPU_CACHE=/home/will/.cache/openvino/genai-npu/qwen2.5-1.5b-int4
+OV_GENAI_NPU_HOST=127.0.0.1
+OV_GENAI_NPU_PORT=18820
+```
+
+Only `127.0.0.1` is accepted by the current prototype; wider binds require an explicit code change and approval.
+
+## Optional systemd user service
+
+A draft unit exists at `systemd/openvino-genai-npu-worker.service` for later review. Do not copy, enable, or autostart it unless Will explicitly approves persistent service enablement. Foreground smoke on `127.0.0.1:18820` plus positive sysfs NPU busy-time delta is required before any installation discussion.
+
+## Safety boundaries
+
+- Binds only to `127.0.0.1` by default; non-local bind is refused in code.
+- No raw request-body logging.
+- No uploads of private data to external services.
+- No Atlas/Hermes gateway restarts or primary model routing changes.
+- NPU access is serialized with a process lock because the NPU is a shared resource with existing services.
@@ -0,0 +1,2 @@
+[pytest]
+testpaths = tests
@@ -0,0 +1,85 @@
+#!/usr/bin/env python3
+"""Smoke-test OpenVINO GenAI LLMPipeline on Intel NPU.
+
+This verifies NPU execution by reading /sys/class/accel/accel0/device/npu_busy_time_us
+before and after generation. HTTP 200/service success is not considered proof.
+"""
+from __future__ import annotations
+
+import argparse
+import json
+import time
+from pathlib import Path
+from typing import Any
+
+DEFAULT_MODEL = "/home/will/models/openvino-genai/Qwen2.5-1.5B-Instruct-int4-ov"
+DEFAULT_CACHE = "/home/will/.cache/openvino/genai-npu/qwen2.5-1.5b-int4"
+BUSY_PATH = Path("/sys/class/accel/accel0/device/npu_busy_time_us")
+
+
+def import_openvino_genai() -> Any:
+    import openvino_genai as ov_genai  # type: ignore[import-not-found]
+
+    return ov_genai
+
+
+def read_busy(path: Path = BUSY_PATH) -> int:
+    return int(path.read_text().strip())
+
+
+def main() -> int:
+    parser = argparse.ArgumentParser()
+    parser.add_argument("--model", default=DEFAULT_MODEL)
+    parser.add_argument("--cache-dir", default=DEFAULT_CACHE)
+    parser.add_argument("--busy-path", default=str(BUSY_PATH))
+    parser.add_argument("--prompt", default="Write a concise title for: Synthetic NPU worker contract smoke with no routing changes.")
+    parser.add_argument("--max-new-tokens", type=int, default=24)
+    args = parser.parse_args()
+
+    model_path = Path(args.model)
+    cache_dir = Path(args.cache_dir)
+    busy_path = Path(args.busy_path)
+    cache_dir.mkdir(parents=True, exist_ok=True)
+    if not model_path.exists():
+        raise SystemExit(f"model path does not exist: {model_path}")
+    if not busy_path.exists():
+        raise SystemExit(f"NPU busy-time counter does not exist: {busy_path}")
+    if args.max_new_tokens < 1 or args.max_new_tokens > 256:
+        raise SystemExit("max-new-tokens must be between 1 and 256")
+
+    config = {
+        "CACHE_DIR": str(cache_dir),
+        "MAX_PROMPT_LEN": 1024,
+        "MIN_RESPONSE_LEN": 64,
+        "PREFILL_HINT": "DYNAMIC",
+        "GENERATE_HINT": "FAST_COMPILE",
+    }
+
+    ov_genai = import_openvino_genai()
+    load_start = time.monotonic()
+    pipe = ov_genai.LLMPipeline(str(model_path), "NPU", **config)
+    load_ms = round((time.monotonic() - load_start) * 1000, 2)
+
+    # Read the counter after load so the delta brackets generation only.
+    before = read_busy(busy_path)
+    gen_start = time.monotonic()
+    output = pipe.generate(args.prompt, max_new_tokens=args.max_new_tokens)
+    gen_ms = round((time.monotonic() - gen_start) * 1000, 2)
+    after = read_busy(busy_path)
+    result = {
+        "model": str(model_path),
+        "device": "NPU",
+        "cache_dir": str(cache_dir),
+        "prompt_chars": len(args.prompt),
+        "max_new_tokens": args.max_new_tokens,
+        "text": str(output).strip(),
+        "timing_ms": {"load": load_ms, "generate": gen_ms, "total": round(load_ms + gen_ms, 2)},
+        "npu_busy_before_us": before,
+        "npu_busy_after_us": after,
+        "npu_busy_delta_us": after - before,
+    }
+    print(json.dumps(result, indent=2))
+    return 0 if after > before else 2
+
+
+if __name__ == "__main__":
+    raise SystemExit(main())
@@ -0,0 +1,17 @@
+[Unit]
+Description=OpenVINO GenAI NPU worker prototype
+After=network-online.target
+
+[Service]
+Type=simple
+WorkingDirectory=/home/will/lab/swarm/openvino-genai-npu-worker
+Environment=OV_GENAI_NPU_MODEL=/home/will/models/openvino-genai/Qwen2.5-1.5B-Instruct-int4-ov
+Environment=OV_GENAI_NPU_CACHE=/home/will/.cache/openvino/genai-npu/qwen2.5-1.5b-int4
+Environment=OV_GENAI_NPU_HOST=127.0.0.1
+Environment=OV_GENAI_NPU_PORT=18820
+ExecStart=/home/will/.venvs/npu/bin/python /home/will/lab/swarm/openvino-genai-npu-worker/worker.py --host 127.0.0.1 --port 18820
+Restart=on-failure
+RestartSec=5
+
+[Install]
+WantedBy=default.target
@@ -0,0 +1,131 @@
+from __future__ import annotations
+
+import json
+from pathlib import Path
+
+import pytest
+
+import worker
+
+
+class FakePipeline:
+    def __init__(self, model_path: str, device: str, config: dict[str, object], busy_path: Path, output: str = "Synthetic title"):
+        self.model_path = model_path
+        self.device = device
+        self.config = config
+        self.busy_path = busy_path
+        self.output = output
+        self.calls: list[tuple[str, int]] = []
+
+    def generate(self, prompt: str, *, max_new_tokens: int):
+        self.calls.append((prompt, max_new_tokens))
+        before = int(self.busy_path.read_text().strip())
+        self.busy_path.write_text(str(before + 1234))
+        return self.output
+
+
+class FakeGenAI:
+    def __init__(self, busy_path: Path, output: str = "Synthetic title"):
+        self.busy_path = busy_path
+        self.output = output
+        self.pipeline: FakePipeline | None = None
+
+    def LLMPipeline(self, model_path: str, device: str, *args: object, **kwargs: object):  # noqa: N802 - mirrors OpenVINO API
+        if args and isinstance(args[0], dict):
+            config: dict[str, object] = {str(k): v for k, v in args[0].items()}
+        else:
+            config = dict(kwargs)
+        self.pipeline = FakePipeline(model_path, device, config, self.busy_path, self.output)
+        return self.pipeline
+
+
+@pytest.fixture()
+def worker_paths(tmp_path: Path):
+    model_path = tmp_path / "model"
+    cache_dir = tmp_path / "cache"
+    busy_path = tmp_path / "npu_busy_time_us"
+    model_path.mkdir()
+    busy_path.write_text("100")
+    return model_path, cache_dir, busy_path
+
+
+def test_generate_uses_npu_config_and_reports_busy_delta(monkeypatch: pytest.MonkeyPatch, worker_paths):
+    model_path, cache_dir, busy_path = worker_paths
+    fake_genai = FakeGenAI(busy_path)
+    monkeypatch.setattr(worker, "import_openvino_genai", lambda: fake_genai)
+
+    npu_worker = worker.NpuWorker(str(model_path), str(cache_dir), busy_path=busy_path, bind_port=18820)
+    result = npu_worker.generate("title", "Synthetic non-private kanban notification.", max_new_tokens=24)
+
+    assert result.npu_busy_before_us == 100
+    assert result.npu_busy_after_us == 1334
+    assert result.npu_busy_delta_us == 1234
+    assert result.text == "Synthetic title"
+    assert fake_genai.pipeline is not None
+    assert fake_genai.pipeline.device == "NPU"
+    assert fake_genai.pipeline.config["CACHE_DIR"] == str(cache_dir)
+    assert fake_genai.pipeline.config["MAX_PROMPT_LEN"] == 1024
+    assert fake_genai.pipeline.calls[0][1] == 24
+
+
+def test_memory_alias_json_wrapping(monkeypatch: pytest.MonkeyPatch, worker_paths):
+    model_path, cache_dir, busy_path = worker_paths
+    fake_genai = FakeGenAI(busy_path, output='[{"fact":"synthetic stable preference","confidence":0.8}]')
+    monkeypatch.setattr(worker, "import_openvino_genai", lambda: fake_genai)
+
+    npu_worker = worker.NpuWorker(str(model_path), str(cache_dir), busy_path=busy_path)
+    result = npu_worker.generate("memory_candidate", "Synthetic user says they prefer concise answers.")
+
+    assert result.parsed_json is not None
+    assert result.parsed_json["candidates"][0]["fact"] == "synthetic stable preference"
+    assert "wrapped" in result.parsed_json["notes"]
+
+
+@pytest.mark.parametrize(
+    ("job", "user_input", "max_new_tokens", "message"),
+    [
+        ("bad", "hello", 1, "unsupported job"),
+        ("title", "", 1, "non-empty"),
+        ("title", "x" * (worker.MAX_INPUT_CHARS + 1), 1, "input too long"),
+        ("title", "hello", worker.MAX_NEW_TOKENS + 1, "max_new_tokens"),
+    ],
+)
+def test_validation_errors(monkeypatch: pytest.MonkeyPatch, worker_paths, job: str, user_input: str, max_new_tokens: int, message: str):
+    model_path, cache_dir, busy_path = worker_paths
+    monkeypatch.setattr(worker, "import_openvino_genai", lambda: FakeGenAI(busy_path))
+    npu_worker = worker.NpuWorker(str(model_path), str(cache_dir), busy_path=busy_path)
+
+    with pytest.raises(ValueError, match=message):
+        npu_worker.generate(job, user_input, max_new_tokens=max_new_tokens)
+
+
+def test_health_reports_actual_bind_and_limits(worker_paths):
+    model_path, cache_dir, busy_path = worker_paths
+    npu_worker = worker.NpuWorker(str(model_path), str(cache_dir), busy_path=busy_path, bind_host="127.0.0.1", bind_port=18821)
+
+    health = npu_worker.health()
+
+    assert health["bind"] == "127.0.0.1:18821"
+    assert health["max_input_chars"] == 6000
+    assert health["max_new_tokens"] == 256
+    assert health["busy_time_us"] == 100
+
+
+def test_response_payload_shape(worker_paths):
+    model_path, cache_dir, busy_path = worker_paths
+    npu_worker = worker.NpuWorker(str(model_path), str(cache_dir), busy_path=busy_path)
+    result = worker.GenerationResult(
+        text="ok",
+        parsed_json={"severity": "info"},
+        timing_ms={"load": 1.0, "initial_load": 1.0, "generate": 2.0, "total": 3.0},
+        npu_busy_delta_us=5,
+        npu_busy_before_us=10,
+        npu_busy_after_us=15,
+    )
+
+    payload = worker.response_payload(npu_worker, "notification", result)
+
+    assert json.dumps(payload)
+    assert payload["device"] == "NPU"
+    assert payload["job"] == "notification"
+    assert payload["json"] == {"severity": "info"}
@@ -0,0 +1,289 @@
+#!/usr/bin/env python3
+"""Local-only OpenVINO GenAI NPU worker.
+
+Small bounded LLM worker for cheap background tasks. It intentionally does not
+wire into Atlas/Hermes routing and does not log raw prompts by default.
+"""
+from __future__ import annotations
+
+import argparse
+import json
+import os
+import re
+import socket
+import threading
+import time
+from dataclasses import dataclass
+from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
+from pathlib import Path
+from typing import Any, cast
+from urllib.parse import urlparse
+
+MODEL_ID = "OpenVINO/Qwen2.5-1.5B-Instruct-int4-ov"
+DEFAULT_MODEL_PATH = "/home/will/models/openvino-genai/Qwen2.5-1.5B-Instruct-int4-ov"
+DEFAULT_CACHE_DIR = "/home/will/.cache/openvino/genai-npu/qwen2.5-1.5b-int4"
+BUSY_PATH = Path("/sys/class/accel/accel0/device/npu_busy_time_us")
+HOST = "127.0.0.1"
+PORT = 18820
+MAX_INPUT_CHARS = 6000
+MAX_NEW_TOKENS = 256
+GENAI_CONFIG = {
+    "CACHE_DIR": DEFAULT_CACHE_DIR,
+    "MAX_PROMPT_LEN": 1024,
+    "MIN_RESPONSE_LEN": 64,
+    "PREFILL_HINT": "DYNAMIC",
+    "GENERATE_HINT": "FAST_COMPILE",
+}
+DEFAULTS = {
+    "title": 32,
+    "summary": 160,
+    "memory_candidate": 192,
+    "notification": 96,
+}
+PROMPTS = {
+    "title": "Write one concise title, 8 words or fewer. Return only the title.\n\nInput:\n{input}",
+    "summary": "Summarize the input in one short paragraph or up to 4 bullets. Be factual and concise.\n\nInput:\n{input}",
+    "memory_candidate": (
+        "Extract durable memory candidates from the conversation excerpt. "
+        "Return strict JSON with keys: candidates (array of objects with fact, confidence, reason), notes. "
+        "Do not write memory; only propose candidates.\n\nInput:\n{input}"
+    ),
+    "notification": (
+        "Condense this notification or log excerpt for a human. "
+        "Return JSON with keys: severity (info|warning|error), category, summary, action_needed.\n\nInput:\n{input}"
+    ),
+}
+
+
+def import_openvino_genai() -> Any:
+    """Import OpenVINO GenAI lazily so unit tests do not require the NPU venv."""
+
+    import openvino_genai as ov_genai  # type: ignore[import-not-found]
+
+    return ov_genai
+
+
+def listener_exists(host: str, port: int) -> bool:
+    """Return True when a TCP listener already accepts connections."""
+
+    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
+        sock.settimeout(0.2)
+        return sock.connect_ex((host, port)) == 0
+
+
+def coerce_json(text: str) -> Any | None:
+    text = text.strip()
+    if not text:
+        return None
+    try:
+        return json.loads(text)
+    except json.JSONDecodeError:
+        match = re.search(r"(\{.*\}|\[.*\])", text, re.S)
+        if match:
+            try:
+                return json.loads(match.group(1))
+            except json.JSONDecodeError:
+                return None
+    return None
+
+
+@dataclass
+class GenerationResult:
+    text: str
+    parsed_json: Any | None
+    timing_ms: dict[str, float]
+    npu_busy_delta_us: int
+    npu_busy_before_us: int
+    npu_busy_after_us: int
+
+
+class NpuWorker:
+    def __init__(
+        self,
+        model_path: str,
+        cache_dir: str,
+        *,
+        busy_path: Path = BUSY_PATH,
+        bind_host: str = HOST,
+        bind_port: int = PORT,
+    ):
+        self.model_path = Path(model_path)
+        self.cache_dir = Path(cache_dir)
+        self.busy_path = Path(busy_path)
+        self.bind_host = bind_host
+        self.bind_port = bind_port
+        self.cache_dir.mkdir(parents=True, exist_ok=True)
+        self._pipe = None
+        self._load_ms: float | None = None
+        self._lock = threading.Lock()
+        self._loaded_at: float | None = None
+        if not self.model_path.exists():
+            raise FileNotFoundError(f"model path does not exist: {self.model_path}")
+        if not self.busy_path.exists():
+            raise FileNotFoundError(f"NPU busy-time counter does not exist: {self.busy_path}")
+
+    def read_busy(self) -> int:
+        return int(self.busy_path.read_text().strip())
+
+    def load(self) -> None:
+        if self._pipe is not None:
+            return
+        start = time.monotonic()
+        # NPU GenAI requires bounded prompt/response shapes; CACHE_DIR enables compiled blob caching.
+        ov_genai = import_openvino_genai()
+        config = GENAI_CONFIG | {"CACHE_DIR": str(self.cache_dir)}
+        self._pipe = ov_genai.LLMPipeline(str(self.model_path), "NPU", **config)
+        self._load_ms = round((time.monotonic() - start) * 1000, 2)
+        self._loaded_at = time.time()
+
+    def generate(self, job: str, user_input: str, max_new_tokens: int | None = None) -> GenerationResult:
+        if job not in PROMPTS:
+            raise ValueError(f"unsupported job: {job}")
+        if not isinstance(user_input, str) or not user_input.strip():
+            raise ValueError("input must be a non-empty string")
+        if len(user_input) > MAX_INPUT_CHARS:
+            raise ValueError(f"input too long: {len(user_input)} chars > {MAX_INPUT_CHARS}")
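+        # Fall back to the per-job default only when the caller omitted the value;
+        # an explicit 0 must fail the range check below instead of silently becoming the default.
+        max_new_tokens = DEFAULTS[job] if max_new_tokens is None else int(max_new_tokens)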
+        if max_new_tokens < 1 or max_new_tokens > MAX_NEW_TOKENS:
+            raise ValueError(f"max_new_tokens must be between 1 and {MAX_NEW_TOKENS}")
+        prompt = PROMPTS[job].format(input=user_input.strip())
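+        # Serialize load + generate so the busy-time reads bracket exactly one generation in this process.
+        with self._lock: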
+            load_start = time.monotonic()
+            self.load()
+            load_ms = round((time.monotonic() - load_start) * 1000, 2)
+            before = self.read_busy()
+            gen_start = time.monotonic()
+            pipe = cast(Any, self._pipe)
+            text = str(pipe.generate(prompt, max_new_tokens=max_new_tokens)).strip()
+            generate_ms = round((time.monotonic() - gen_start) * 1000, 2)
+            after = self.read_busy()
+        parsed = coerce_json(text) if job in {"memory_candidate", "notification"} else None
+        if job == "memory_candidate" and isinstance(parsed, list):
+            parsed = {
+                "candidates": parsed,
+                "notes": "model returned a top-level array; worker wrapped it to preserve the API contract",
+            }
+        return GenerationResult(
+            text=text,
+            parsed_json=parsed,
+            timing_ms={
+                "load": load_ms,
+                "initial_load": self._load_ms or 0.0,
+                "generate": generate_ms,
+                "total": round(load_ms + generate_ms, 2),
+            },
+            npu_busy_delta_us=after - before,
+            npu_busy_before_us=before,
+            npu_busy_after_us=after,
+        )
+
+    def health(self) -> dict[str, Any]:
+        return {
+            "ok": True,
+            "model": MODEL_ID,
+            "model_path": str(self.model_path),
+            "device": "NPU",
+            "cache_dir": str(self.cache_dir),
+            "cache_exists": self.cache_dir.exists(),
+            "loaded": self._pipe is not None,
+            "initial_load_ms": self._load_ms,
+            "loaded_at": self._loaded_at,
+            "busy_time_us": self.read_busy(),
+            "max_input_chars": MAX_INPUT_CHARS,
+            "max_new_tokens": MAX_NEW_TOKENS,
+            "jobs": sorted(PROMPTS),
+            "bind": f"{self.bind_host}:{self.bind_port}",
+        }
+
+
+def response_payload(worker: NpuWorker, job: str, result: GenerationResult) -> dict[str, Any]:
+    return {
+        "model": MODEL_ID,
+        "device": "NPU",
+        "job": job,
+        "text": result.text,
+        "json": result.parsed_json,
+        "timing_ms": result.timing_ms,
+        "npu_busy_delta_us": result.npu_busy_delta_us,
+        "npu_busy_before_us": result.npu_busy_before_us,
+        "npu_busy_after_us": result.npu_busy_after_us,
+        "cache_dir": str(worker.cache_dir),
+    }
+
+
+def make_handler(worker: NpuWorker):
+    class Handler(BaseHTTPRequestHandler):
+        server_version = "openvino-genai-npu-worker/0.2"
+
+        def log_message(self, format: str, *args: Any) -> None:
+            # Log only method/path/status metadata, not raw request bodies.
+            print(f"{self.client_address[0]} {format % args}")
+
+        def send_json(self, status: int, payload: Any) -> None:
+            body = json.dumps(payload, indent=2).encode("utf-8")
+            self.send_response(status)
+            self.send_header("Content-Type", "application/json")
+            self.send_header("Content-Length", str(len(body)))
+            self.end_headers()
+            self.wfile.write(body)
+
+        def do_GET(self) -> None:  # noqa: N802
+            path = urlparse(self.path).path
+            if path == "/healthz":
+                self.send_json(200, worker.health())
+            elif path == "/models":
+                self.send_json(200, {"models": [{"id": MODEL_ID, "path": str(worker.model_path), "device": "NPU"}]})
+            else:
+                self.send_json(404, {"error": "not found"})
+
+        def do_POST(self) -> None:  # noqa: N802
+            path = urlparse(self.path).path
+            route_job = {
+                "/v1/worker/generate": None,
+                "/v1/worker/extract-memory-candidates": "memory_candidate",
+                "/v1/worker/condense-notification": "notification",
+            }.get(path, "__missing__")
+            if route_job == "__missing__":
+                self.send_json(404, {"error": "not found"})
+                return
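+            try:
+                length = int(self.headers.get("Content-Length", "0"))
+                if length < 0:
+                    raise ValueError("invalid Content-Length")
+                payload = json.loads(self.rfile.read(length) or b"{}")
+                if not isinstance(payload, dict):
+                    raise ValueError("request body must be a JSON object")
+                job = route_job or str(payload.get("job", "summary"))
+                # "memory" is accepted as a short alias for the canonical job name.
+                if job == "memory":
+                    job = "memory_candidate"
+                # Pass the input through unconverted so generate() can reject non-string values.
+                result = worker.generate(job, payload.get("input", ""), payload.get("max_new_tokens"))
+                body = response_payload(worker, job, result)
+                # HTTP success alone is not evidence of NPU execution; require a positive busy-time delta.
+                if result.npu_busy_delta_us <= 0:
+                    body["error"] = "NPU busy-time counter did not increase during generation"
+                    self.send_json(503, body)
+                    return
+                self.send_json(200, body)
+            except (ValueError, TypeError) as exc:
+                # Malformed or out-of-bounds requests are client errors.
+                self.send_json(400, {"error": str(exc)})
+            except Exception as exc:
+                # Model or device failures are server-side, not the caller's fault.
+                self.send_json(500, {"error": str(exc)})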
+
+    return Handler
+
+
+def cli(argv: list[str] | None = None) -> int:
+    parser = argparse.ArgumentParser(description="OpenVINO GenAI NPU worker")
+    parser.add_argument("--model-path", default=os.environ.get("OV_GENAI_NPU_MODEL", DEFAULT_MODEL_PATH))
+    parser.add_argument("--cache-dir", default=os.environ.get("OV_GENAI_NPU_CACHE", DEFAULT_CACHE_DIR))
+    parser.add_argument("--host", default=os.environ.get("OV_GENAI_NPU_HOST", HOST))
+    parser.add_argument("--port", type=int, default=int(os.environ.get("OV_GENAI_NPU_PORT", PORT)))
+    parser.add_argument("--job", choices=sorted(PROMPTS), help="Run one CLI job instead of serving HTTP")
+    parser.add_argument("--input", help="Input text for --job")
+    parser.add_argument("--max-new-tokens", type=int)
+    args = parser.parse_args(argv)
+
+    if args.host != "127.0.0.1":
+        raise SystemExit("Refusing non-local bind without code change/explicit approval")
+
+    worker = NpuWorker(args.model_path, args.cache_dir, bind_host=args.host, bind_port=args.port)
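+    if args.job:
+        if not args.input:
+            raise SystemExit("--job requires a non-empty --input")
+        result = worker.generate(args.job, args.input, args.max_new_tokens)
+        print(json.dumps(response_payload(worker, args.job, result), indent=2))
+        # Exit code 2 signals that generation ran without a positive NPU busy-time delta.
+        return 0 if result.npu_busy_delta_us > 0 else 2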
+
+    if listener_exists(args.host, args.port):
+        raise SystemExit(f"Refusing to start: listener already exists on {args.host}:{args.port}")
+    server = ThreadingHTTPServer((args.host, args.port), make_handler(worker))
+    print(f"serving {MODEL_ID} on http://{args.host}:{args.port}; raw prompts are not logged")
+    server.serve_forever()
+    return 0
+
+
+if __name__ == "__main__":
+    raise SystemExit(cli())