OpenVINO NPU reranker service

Local-first cross-encoder reranker prototype for second-stage RAG ranking.

  • Default bind: 127.0.0.1:18818
  • Default model: cross-encoder/ms-marco-MiniLM-L6-v2
  • Default device: NPU
  • Model cache: /home/will/.cache/openvino-models/rerankers/ms-marco-MiniLM-L6-v2-int8-ov/
  • NPU proof: /sys/class/accel/accel0/device/npu_busy_time_us delta before/after inference

This service is intentionally not wired into live RAG by default.

Files

  • SPEC.md — endpoint/CLI contract, model/runtime recommendation, smoke/NPU proof plan, RAG integration plan, docs implications, and no-go criteria.
  • server.py — stdlib HTTP OpenVINO Runtime service with fail-fast localhost listener conflict checks and request validation.
  • smoke.py — non-private API/ranking/NPU busy-time smoke test.
  • tests/test_server_validation.py — stdlib unit checks for request validation and listener conflict detection.
  • openvino-reranker.service — optional user-systemd unit.

One-time setup

Use a separate venv so the existing Whisper/embeddings NPU venv is not perturbed:

python -m venv /home/will/.venvs/openvino-reranker
source /home/will/.venvs/openvino-reranker/bin/activate
python -m pip install -U pip
python -m pip install "openvino>=2026.2" "optimum-intel[openvino]" transformers tokenizers nncf numpy

Export the model:

source /home/will/.venvs/openvino-reranker/bin/activate
optimum-cli export openvino \
  --model cross-encoder/ms-marco-MiniLM-L6-v2 \
  --task text-classification \
  --weight-format int8 \
  /home/will/.cache/openvino-models/rerankers/ms-marco-MiniLM-L6-v2-int8-ov

If INT8 export or NPU compile fails, export an FP16/FP32 IR to a separate directory and point OPENVINO_RERANKER_MODEL_DIR at it while debugging. Do not overwrite existing vector/RAG/Chroma collections.

Run in foreground

Check the port and NPU counter first:

ss -ltnp | grep ':18818 ' || true
cat /sys/class/accel/accel0/device/npu_busy_time_us

Start locally:

source /home/will/.venvs/openvino-reranker/bin/activate
OPENVINO_RERANKER_HOST=127.0.0.1 \
OPENVINO_RERANKER_PORT=18818 \
OPENVINO_RERANKER_DEVICE=NPU \
OPENVINO_RERANKER_MODEL_DIR=/home/will/.cache/openvino-models/rerankers/ms-marco-MiniLM-L6-v2-int8-ov \
python /home/will/lab/swarm/openvino-reranker-npu/server.py

At startup the server runs one smoke inference on non-private fixture text and fails closed if OPENVINO_RERANKER_DEVICE=NPU but npu_busy_time_us does not increase. Before compiling the OpenVINO model it also checks that the requested listener can bind, so obvious port conflicts fail fast. The real server bind happens immediately after model load, so the pre-check detects conflicts but does not reserve the port.
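The fail-fast listener check can be sketched as a throwaway bind probe. This is a minimal illustration, not the code in server.py: the function name can_bind is hypothetical, and the probe deliberately skips SO_REUSEADDR, which makes it slightly stricter than a typical HTTP server bind.

```python
import socket


def can_bind(host: str, port: int) -> bool:
    """Return True if a TCP socket can currently bind host:port.

    Hypothetical helper: the probe socket is closed right away, so this
    only detects an existing conflict and does not reserve the port.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError:
            # Typically EADDRINUSE: something else already holds the port.
            return False
    return True


if __name__ == "__main__":
    if not can_bind("127.0.0.1", 18818):
        raise SystemExit("127.0.0.1:18818 is already in use; not loading the model")
    print("port is free; safe to start the expensive model compile")
```

Running the probe before model load keeps a port conflict from costing a full NPU compile.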

API

Health:

curl -sS http://127.0.0.1:18818/healthz | jq
curl -sS http://127.0.0.1:18818/readyz | jq

Rerank:

curl -sS http://127.0.0.1:18818/rerank \
  -H 'Content-Type: application/json' \
  -d '{
    "query":"how do I verify OpenVINO NPU usage?",
    "documents":[
      {"id":"good","text":"Check /sys/class/accel/accel0/device/npu_busy_time_us before and after inference."},
      {"id":"bad","text":"This note is about making sourdough starter."}
    ],
    "top_k":2
  }' | jq

Compatibility alias:

curl -sS http://127.0.0.1:18818/v1/rerank \
  -H 'Content-Type: application/json' \
  -d '{"model":"local-reranker","query":"npu busy time","documents":["OpenVINO NPU busy time proves accelerator use."],"top_n":1}' | jq
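The same /rerank call can be made from Python with only the standard library. This is a sketch under assumptions: the helper names are illustrative, the request body mirrors the curl example above, and the response is returned as parsed JSON without assuming any field names.

```python
import json
import urllib.request


def build_rerank_request(base_url, query, documents, top_k):
    """Build a POST request for /rerank, mirroring the curl example."""
    payload = {"query": query, "documents": documents, "top_k": top_k}
    return urllib.request.Request(
        base_url.rstrip("/") + "/rerank",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def rerank(base_url, query, documents, top_k, timeout_s=3.0):
    """Send the request and return the decoded JSON response body."""
    request = build_rerank_request(base_url, query, documents, top_k)
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    docs = [
        {"id": "good", "text": "Check npu_busy_time_us before and after inference."},
        {"id": "bad", "text": "This note is about making sourdough starter."},
    ]
    print(rerank("http://127.0.0.1:18818", "how do I verify OpenVINO NPU usage?", docs, 2))
```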

Smoke test

source /home/will/.venvs/openvino-reranker/bin/activate
python /home/will/lab/swarm/openvino-reranker-npu/smoke.py --url http://127.0.0.1:18818

Expected:

  • /readyz is HTTP 200 and reports device=NPU.
  • Each fixture returns ok=true and a sorted results list.
  • The top result matches the non-private fixture expectation.
  • npu_busy_delta_us is positive both in the response and when computed from sysfs.
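The sysfs side of the busy-time check amounts to reading the counter before and after an inference and comparing. A minimal sketch, assuming the counter file holds a single integer; the helper names are hypothetical and this is not the code in smoke.py:

```python
from pathlib import Path

NPU_BUSY_PATH = Path("/sys/class/accel/accel0/device/npu_busy_time_us")


def read_busy_us(path=NPU_BUSY_PATH):
    """Read the cumulative NPU busy time in microseconds."""
    return int(Path(path).read_text().strip())


def measure_busy_delta(run, read=read_busy_us):
    """Call run() and return (result, busy-time delta in microseconds).

    `read` is injectable so the logic can be exercised without an NPU.
    """
    before = read()
    result = run()
    after = read()
    return result, after - before


def require_npu_activity(delta_us):
    """Fail closed: a non-positive delta means the NPU did no visible work."""
    if delta_us <= 0:
        raise RuntimeError(f"npu_busy_time_us did not increase (delta={delta_us})")
```

Other workloads on the same NPU also advance the counter, so a positive delta is most convincing when nothing else is using the device during the test.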

Validation checks

source /home/will/.venvs/openvino-reranker/bin/activate
PYTHONPATH=/home/will/lab/swarm/openvino-reranker-npu \
  python -m unittest discover -s /home/will/lab/swarm/openvino-reranker-npu/tests

These checks do not compile the OpenVINO model; they cover request validation and fail-fast listener conflict detection.
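For illustration, request validation of the kind these checks cover might look like the sketch below. The rules are assumptions inferred from the API examples (documents as strings or id/text objects, top_k with top_n as an alias); the function name normalize_request is hypothetical and the actual contract lives in SPEC.md and server.py.

```python
def normalize_request(body):
    """Validate a rerank request body and return a normalized copy.

    Assumed rules, not the service's confirmed contract.
    """
    query = body.get("query")
    if not isinstance(query, str) or not query.strip():
        raise ValueError("query must be a non-empty string")

    raw_docs = body.get("documents")
    if not isinstance(raw_docs, list) or not raw_docs:
        raise ValueError("documents must be a non-empty list")

    docs = []
    for index, doc in enumerate(raw_docs):
        if isinstance(doc, str):
            # Bare strings (as in the /v1/rerank example) get a positional id.
            doc = {"id": str(index), "text": doc}
        if not isinstance(doc, dict) or not isinstance(doc.get("text"), str):
            raise ValueError(f"documents[{index}] needs a string 'text' field")
        docs.append({"id": str(doc.get("id", index)), "text": doc["text"]})

    top_k = body.get("top_k", body.get("top_n", len(docs)))
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(top_k, bool) or not isinstance(top_k, int) or top_k < 1:
        raise ValueError("top_k/top_n must be a positive integer")

    return {"query": query, "documents": docs, "top_k": min(top_k, len(docs))}
```

Pure functions like this are what make it possible to test validation without loading a model.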

Optional systemd user service

Install the unit only after the foreground command and smoke test pass:

cp /home/will/lab/swarm/openvino-reranker-npu/openvino-reranker.service /home/will/.config/systemd/user/openvino-reranker.service
systemctl --user daemon-reload
systemctl --user start openvino-reranker.service
systemctl --user status openvino-reranker.service --no-pager
journalctl --user -u openvino-reranker.service -n 100 --no-pager

Do not enable or integrate it into live RAG without explicit approval.

Optional RAG integration plan (disabled by default)

RAG should keep vector search against obsidian_bge_npu unchanged, retrieve a larger candidate set, and call this service as a read-only request-time second stage. Suggested disabled-by-default knobs:

RAG_RERANK_ENABLED=false
RAG_RERANK_URL=http://127.0.0.1:18818/rerank
RAG_RERANK_INITIAL_K=20
RAG_RERANK_TOP_K=5
RAG_RERANK_TIMEOUT_MS=3000

On reranker timeout/error, fall back to vector order and include metadata such as rerank_error; do not mutate or reindex Chroma collections.
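The fallback behavior can be sketched as a thin wrapper around the vector results. This is an assumption-laden illustration rather than an existing RAG hook: load_rerank_config, second_stage, and the call_reranker callable are hypothetical names, and only the environment variables and the rerank_error key come from the plan above.

```python
import os


def load_rerank_config(env=os.environ):
    """Read the suggested knobs, defaulting to the disabled configuration."""
    return {
        "enabled": env.get("RAG_RERANK_ENABLED", "false").strip().lower() == "true",
        "url": env.get("RAG_RERANK_URL", "http://127.0.0.1:18818/rerank"),
        "initial_k": int(env.get("RAG_RERANK_INITIAL_K", "20")),
        "top_k": int(env.get("RAG_RERANK_TOP_K", "5")),
        "timeout_s": int(env.get("RAG_RERANK_TIMEOUT_MS", "3000")) / 1000.0,
    }


def second_stage(query, candidates, config, call_reranker):
    """Return (final_results, metadata) for candidates given in vector order.

    call_reranker(query, candidates, config) must return the candidates in
    reranked order and raise on timeout or error. Nothing here writes to
    the vector store; the rerank is read-only and request-time.
    """
    meta = {"reranked": False}
    top_k = config["top_k"]
    if not config["enabled"]:
        return candidates[:top_k], meta
    try:
        ranked = call_reranker(query, candidates, config)
    except Exception as exc:  # timeout, connection refused, bad JSON, ...
        meta["rerank_error"] = f"{type(exc).__name__}: {exc}"
        return candidates[:top_k], meta  # fall back to vector order
    meta["reranked"] = True
    return ranked[:top_k], meta
```

The caller would retrieve initial_k candidates from vector search, pass them through second_stage, and surface the metadata next to the results so a silent fallback stays visible.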