# OpenVINO NPU reranker service

Local-first cross-encoder reranker prototype for second-stage RAG ranking.

- Default bind: `127.0.0.1:18818`
- Default model: `cross-encoder/ms-marco-MiniLM-L6-v2`
- Default device: `NPU`
- Model cache: `/home/will/.cache/openvino-models/rerankers/ms-marco-MiniLM-L6-v2-int8-ov/`
- NPU proof: `/sys/class/accel/accel0/device/npu_busy_time_us` delta before/after inference

This service is intentionally not wired into live RAG by default.

## Files

- `SPEC.md`: endpoint/CLI contract, model/runtime recommendation, smoke/NPU proof plan, RAG integration plan, docs implications, and no-go criteria.
- `server.py`: stdlib HTTP OpenVINO Runtime service with fail-fast localhost listener conflict checks and request validation.
- `smoke.py`: non-private API/ranking/NPU busy-time smoke test.
- `tests/test_server_validation.py`: stdlib unit checks for request validation and listener conflict detection.
- `openvino-reranker.service`: optional user-systemd unit.

## One-time setup

Use a separate venv so the existing Whisper/embeddings NPU venv is not perturbed:

```bash
python -m venv /home/will/.venvs/openvino-reranker
source /home/will/.venvs/openvino-reranker/bin/activate
python -m pip install -U pip
python -m pip install "openvino>=2026.2" "optimum-intel[openvino]" transformers tokenizers nncf numpy
```

Export the model:

```bash
source /home/will/.venvs/openvino-reranker/bin/activate
# --trust-remote-code is a boolean switch that takes no value;
# omitting it keeps remote code disabled.
optimum-cli export openvino \
  --model cross-encoder/ms-marco-MiniLM-L6-v2 \
  --task text-classification \
  --weight-format int8 \
  /home/will/.cache/openvino-models/rerankers/ms-marco-MiniLM-L6-v2-int8-ov
```

If INT8 export or NPU compile fails, export an FP16/FP32 IR to a separate directory and point `OPENVINO_RERANKER_MODEL_DIR` at it while debugging. Do not overwrite existing vector/RAG/Chroma collections.
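
Before the first foreground run, it can help to confirm that the export directory looks complete. The following is a minimal stdlib-only sketch, not part of this repo: `check_model_dir` is a hypothetical helper, and the `openvino_model.xml` / `openvino_model.bin` file names are an assumption about what `optimum-cli export openvino` writes by default, so adjust them if your export differs.

```python
from pathlib import Path

# Assumption: IR file names produced by a default optimum-cli OpenVINO export.
EXPECTED_FILES = ("openvino_model.xml", "openvino_model.bin")


def check_model_dir(model_dir):
    """Return the expected IR files that are missing or empty in model_dir."""
    root = Path(model_dir)
    problems = []
    for name in EXPECTED_FILES:
        path = root / name
        # A zero-byte file usually means the export was interrupted.
        if not path.is_file() or path.stat().st_size == 0:
            problems.append(name)
    return problems
```

Calling `check_model_dir(os.environ["OPENVINO_RERANKER_MODEL_DIR"])` should return an empty list for a usable export. A non-empty list is a hint to re-export into a fresh directory rather than patching the existing one.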

## Run in foreground

Check the port and NPU counter first:

```bash
ss -ltnp | grep ':18818 ' || true
cat /sys/class/accel/accel0/device/npu_busy_time_us
```

Start locally:

```bash
source /home/will/.venvs/openvino-reranker/bin/activate
OPENVINO_RERANKER_HOST=127.0.0.1 \
OPENVINO_RERANKER_PORT=18818 \
OPENVINO_RERANKER_DEVICE=NPU \
OPENVINO_RERANKER_MODEL_DIR=/home/will/.cache/openvino-models/rerankers/ms-marco-MiniLM-L6-v2-int8-ov \
python /home/will/lab/swarm/openvino-reranker-npu/server.py
```

Startup performs a non-private smoke inference and fails closed when `OPENVINO_RERANKER_DEVICE=NPU` but `npu_busy_time_us` does not increase. It also checks whether the requested listener can bind before compiling the OpenVINO model, so obvious port conflicts fail fast; the real server bind still happens immediately after model load.

## API

Health:

```bash
curl -sS http://127.0.0.1:18818/healthz | jq
curl -sS http://127.0.0.1:18818/readyz | jq
```

Rerank:

```bash
curl -sS http://127.0.0.1:18818/rerank \
  -H 'Content-Type: application/json' \
  -d '{
    "query":"how do I verify OpenVINO NPU usage?",
    "documents":[
      {"id":"good","text":"Check /sys/class/accel/accel0/device/npu_busy_time_us before and after inference."},
      {"id":"bad","text":"This note is about making sourdough starter."}
    ],
    "top_k":2
  }' | jq
```

Compatibility alias:

```bash
curl -sS http://127.0.0.1:18818/v1/rerank \
  -H 'Content-Type: application/json' \
  -d '{"model":"local-reranker","query":"npu busy time","documents":["OpenVINO NPU busy time proves accelerator use."],"top_n":1}' | jq
```

## Smoke test

```bash
source /home/will/.venvs/openvino-reranker/bin/activate
python /home/will/lab/swarm/openvino-reranker-npu/smoke.py --url http://127.0.0.1:18818
```

Expected:

- `/readyz` is HTTP 200 and reports `device=NPU`.
- Each fixture returns `ok=true` and a sorted `results` list.
- The top result matches the non-private fixture expectation.
- Response and sysfs `npu_busy_delta_us` are positive.
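
The sysfs half of that last check can be reproduced by hand. The sketch below is an illustration under stated assumptions, not the code in `smoke.py`: `read_busy_us`, `measure_busy_delta`, and `rerank_once` are hypothetical names, and only the counter path, endpoint, and request body come from this README.

```python
import json
import urllib.request
from pathlib import Path

# Counter path as documented in this README.
NPU_BUSY_PATH = "/sys/class/accel/accel0/device/npu_busy_time_us"


def read_busy_us(path=NPU_BUSY_PATH):
    """Read the cumulative NPU busy time in microseconds."""
    return int(Path(path).read_text().strip())


def measure_busy_delta(work, path=NPU_BUSY_PATH):
    """Run work() and return (result, busy-time delta in microseconds)."""
    before = read_busy_us(path)
    result = work()
    after = read_busy_us(path)
    return result, after - before


def rerank_once(base_url="http://127.0.0.1:18818"):
    """POST the non-private example request from the API section."""
    body = json.dumps({
        "query": "how do I verify OpenVINO NPU usage?",
        "documents": [
            {"id": "good", "text": "Check npu_busy_time_us before and after inference."},
            {"id": "bad", "text": "This note is about making sourdough starter."},
        ],
        "top_k": 2,
    }).encode("utf-8")
    request = urllib.request.Request(
        base_url + "/rerank",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)
```

With the server running, `measure_busy_delta(rerank_once)` returns the parsed response and the counter delta. The counter is device-wide, so a positive delta only shows that something used the NPU during the call; run the check while other NPU workloads are idle to attribute it to the reranker.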

## Validation checks

```bash
source /home/will/.venvs/openvino-reranker/bin/activate
PYTHONPATH=/home/will/lab/swarm/openvino-reranker-npu \
python -m unittest discover -s /home/will/lab/swarm/openvino-reranker-npu/tests
```

These checks do not compile the OpenVINO model; they cover request validation and fail-fast listener conflict detection.

## Optional systemd user service

Install the unit only after the foreground command and smoke test pass:

```bash
cp /home/will/lab/swarm/openvino-reranker-npu/openvino-reranker.service /home/will/.config/systemd/user/openvino-reranker.service
systemctl --user daemon-reload
systemctl --user start openvino-reranker.service
systemctl --user status openvino-reranker.service --no-pager
journalctl --user -u openvino-reranker.service -n 100 --no-pager
```

Do not enable or integrate it into live RAG without explicit approval.

## Optional RAG integration plan (disabled by default)

RAG should keep vector search against `obsidian_bge_npu` unchanged, retrieve a larger candidate set, and call this service as a read-only request-time second stage.

Suggested disabled-by-default knobs:

```text
RAG_RERANK_ENABLED=false
RAG_RERANK_URL=http://127.0.0.1:18818/rerank
RAG_RERANK_INITIAL_K=20
RAG_RERANK_TOP_K=5
RAG_RERANK_TIMEOUT_MS=3000
```

On reranker timeout/error, fall back to vector order and include metadata such as `rerank_error`; do not mutate or reindex Chroma collections.
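
A minimal sketch of that fallback behaviour on the RAG side, assuming candidates arrive as `{"id", "text"}` dicts in vector order. The helper names (`load_knobs`, `http_rerank`, `rerank_or_fallback`) are hypothetical, and the assumption that each item in the response's `results` list carries the document `id` should be checked against `SPEC.md`.

```python
import json
import os
import urllib.request


def load_knobs(env=os.environ):
    """Read the suggested knobs, using the disabled-by-default values."""
    return {
        "enabled": env.get("RAG_RERANK_ENABLED", "false").lower() == "true",
        "url": env.get("RAG_RERANK_URL", "http://127.0.0.1:18818/rerank"),
        "initial_k": int(env.get("RAG_RERANK_INITIAL_K", "20")),
        "top_k": int(env.get("RAG_RERANK_TOP_K", "5")),
        "timeout_s": int(env.get("RAG_RERANK_TIMEOUT_MS", "3000")) / 1000.0,
    }


def http_rerank(query, documents, top_k, url, timeout_s):
    """POST one read-only rerank request and return the parsed JSON body."""
    body = json.dumps(
        {"query": query, "documents": documents, "top_k": top_k}
    ).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return json.load(response)


def rerank_or_fallback(query, candidates, knobs, call=http_rerank):
    """Return (documents, metadata); vector order is kept on any failure."""
    top_k = knobs["top_k"]
    vector_order = candidates[:top_k]
    if not knobs["enabled"]:
        return vector_order, {"reranked": False}
    try:
        payload = call(query, candidates, top_k, knobs["url"], knobs["timeout_s"])
        by_id = {doc["id"]: doc for doc in candidates}
        # Assumption: each result item echoes the id of the document it scored.
        ranked = [by_id[item["id"]] for item in payload["results"]]
        return ranked[:top_k], {"reranked": True}
    except Exception as exc:
        # Deliberately broad: timeouts, refused connections, and malformed
        # responses should all degrade to vector order, never fail the query.
        error = f"{type(exc).__name__}: {exc}"
        return vector_order, {"reranked": False, "rerank_error": error}
```

The caller would fetch `initial_k` candidates from the vector store and pass them in unchanged. Because the function only reads the candidate list and issues one HTTP request, it never touches the Chroma collections.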