feat: add OpenVINO NPU prototype services
# OpenVINO NPU reranker service

Local-first cross-encoder reranker prototype for second-stage RAG ranking.

- Default bind: `127.0.0.1:18818`
- Default model: `cross-encoder/ms-marco-MiniLM-L6-v2`
- Default device: `NPU`
- Model cache: `/home/will/.cache/openvino-models/rerankers/ms-marco-MiniLM-L6-v2-int8-ov/`
- NPU proof: `/sys/class/accel/accel0/device/npu_busy_time_us` delta before/after inference

This service is intentionally not wired into live RAG by default.

## Files

- `server.py`: stdlib HTTP OpenVINO Runtime service.
- `smoke.py`: non-private API/ranking/NPU busy-time smoke test.
- `openvino-reranker.service`: optional user-systemd unit.

## One-time setup

Use a separate venv so the existing Whisper/embeddings NPU venv is not perturbed:

```bash
python -m venv /home/will/.venvs/openvino-reranker
source /home/will/.venvs/openvino-reranker/bin/activate
python -m pip install -U pip
python -m pip install "openvino>=2026.2" "optimum-intel[openvino]" transformers tokenizers nncf numpy
```

Export the model:

```bash
source /home/will/.venvs/openvino-reranker/bin/activate
# --trust-remote-code is a boolean switch that takes no value;
# omit it to keep remote code execution disabled.
optimum-cli export openvino \
  --model cross-encoder/ms-marco-MiniLM-L6-v2 \
  --task text-classification \
  --weight-format int8 \
  /home/will/.cache/openvino-models/rerankers/ms-marco-MiniLM-L6-v2-int8-ov
```

If INT8 export or NPU compile fails, export an FP16/FP32 IR to a separate directory and point `OPENVINO_RERANKER_MODEL_DIR` at it while debugging. Do not overwrite existing vector/RAG/Chroma collections.

## Run in foreground

Check the port and NPU counter first:

```bash
ss -ltnp | grep ':18818 ' || true
cat /sys/class/accel/accel0/device/npu_busy_time_us
```

Start locally:

```bash
source /home/will/.venvs/openvino-reranker/bin/activate
OPENVINO_RERANKER_HOST=127.0.0.1 \
OPENVINO_RERANKER_PORT=18818 \
OPENVINO_RERANKER_DEVICE=NPU \
OPENVINO_RERANKER_MODEL_DIR=/home/will/.cache/openvino-models/rerankers/ms-marco-MiniLM-L6-v2-int8-ov \
python /home/will/lab/swarm/openvino-reranker-npu/server.py
```
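
For orientation, here is a minimal sketch of how a service could read the `OPENVINO_RERANKER_*` variables with the defaults listed at the top of this README. How `server.py` actually parses its environment is not shown here; `RerankerConfig` and `load_config` are hypothetical names used only for illustration.

```python
import os
from dataclasses import dataclass

# Defaults copied from this README; the loader itself is an illustrative assumption.
DEFAULT_MODEL_DIR = (
    "/home/will/.cache/openvino-models/rerankers/ms-marco-MiniLM-L6-v2-int8-ov"
)


@dataclass(frozen=True)
class RerankerConfig:
    host: str
    port: int
    device: str
    model_dir: str


def load_config(env=None) -> RerankerConfig:
    """Read OPENVINO_RERANKER_* settings, falling back to the documented defaults."""
    env = os.environ if env is None else env
    port = int(env.get("OPENVINO_RERANKER_PORT", "18818"))
    if not 0 < port < 65536:
        raise ValueError(f"port out of range: {port}")
    return RerankerConfig(
        host=env.get("OPENVINO_RERANKER_HOST", "127.0.0.1"),
        port=port,
        device=env.get("OPENVINO_RERANKER_DEVICE", "NPU"),
        model_dir=env.get("OPENVINO_RERANKER_MODEL_DIR", DEFAULT_MODEL_DIR),
    )
```

Passing a plain dict instead of `os.environ` keeps the loader easy to exercise without touching the real environment.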

Startup performs a non-private smoke inference and fails closed when `OPENVINO_RERANKER_DEVICE=NPU` but `npu_busy_time_us` does not increase.
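
The fail-closed startup check described above might look roughly like the sketch below: read the sysfs counter, run one inference, read the counter again, and refuse to continue if it did not move. The helper names (`read_npu_busy_us`, `assert_npu_was_used`) are hypothetical and not taken from `server.py`; only the sysfs path comes from this README.

```python
from pathlib import Path
from typing import Callable

NPU_BUSY_PATH = Path("/sys/class/accel/accel0/device/npu_busy_time_us")


def read_npu_busy_us(path: Path = NPU_BUSY_PATH) -> int:
    """Return the cumulative NPU busy time in microseconds from sysfs."""
    return int(path.read_text().strip())


def assert_npu_was_used(
    run_inference: Callable[[], object],
    read_busy: Callable[[], int] = read_npu_busy_us,
) -> int:
    """Run one inference and raise if the busy-time counter did not increase."""
    before = read_busy()
    run_inference()
    delta = read_busy() - before
    if delta <= 0:
        # Fail closed: a flat counter means the work did not run on the NPU.
        raise RuntimeError(f"npu_busy_time_us did not increase (delta={delta})")
    return delta
```

Injecting `read_busy` lets the check be exercised on a machine without an NPU by supplying a fake counter.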

## API

Health:

```bash
curl -sS http://127.0.0.1:18818/healthz | jq
curl -sS http://127.0.0.1:18818/readyz | jq
```

Rerank:

```bash
curl -sS http://127.0.0.1:18818/rerank \
  -H 'Content-Type: application/json' \
  -d '{
    "query":"how do I verify OpenVINO NPU usage?",
    "documents":[
      {"id":"good","text":"Check /sys/class/accel/accel0/device/npu_busy_time_us before and after inference."},
      {"id":"bad","text":"This note is about making sourdough starter."}
    ],
    "top_k":2
  }' | jq
```
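
The same `/rerank` request can be issued from Python with only the standard library. This is a sketch under the assumption that the request body matches the curl example above; the response schema is not documented in this README, so the parsed JSON is returned unchanged. `build_rerank_request` and `rerank` are hypothetical helper names.

```python
import json
import urllib.request


def build_rerank_request(url, query, documents, top_k):
    """Build a POST request whose JSON body mirrors the curl example."""
    payload = {"query": query, "documents": documents, "top_k": top_k}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def rerank(url, query, documents, top_k, timeout_s=3.0):
    """Send the request and return the decoded JSON response as-is."""
    request = build_rerank_request(url, query, documents, top_k)
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the service running, `rerank("http://127.0.0.1:18818/rerank", "npu busy time", [{"id": "a", "text": "..."}], 1)` would send one query; a network error or timeout surfaces as an exception for the caller to handle.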

Compatibility alias:

```bash
curl -sS http://127.0.0.1:18818/v1/rerank \
  -H 'Content-Type: application/json' \
  -d '{"model":"local-reranker","query":"npu busy time","documents":["OpenVINO NPU busy time proves accelerator use."],"top_n":1}' | jq
```

## Smoke test

```bash
source /home/will/.venvs/openvino-reranker/bin/activate
python /home/will/lab/swarm/openvino-reranker-npu/smoke.py --url http://127.0.0.1:18818
```

Expected:

- `/readyz` is HTTP 200 and reports `device=NPU`.
- Each fixture returns `ok=true` and a sorted `results` list.
- The top result matches the non-private fixture expectation.
- The `npu_busy_delta_us` reported in the response and the delta measured from sysfs are both positive.
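
The per-fixture expectations above could be checked with a small validator like the following sketch. It is not the logic of `smoke.py`. The `ok`, `results`, and `npu_busy_delta_us` names come from this README, but the per-result `id` and `score` keys and the top-level placement of `npu_busy_delta_us` are assumptions, which is why the key names are parameters.

```python
def check_rerank_response(body, expected_top_id, score_key="score", id_key="id"):
    """Return a list of problems found in one rerank response (empty means pass)."""
    problems = []
    if body.get("ok") is not True:
        problems.append("ok is not true")

    results = body.get("results") or []
    if not results:
        problems.append("results is empty")

    # Assumed shape: each result carries a numeric score, highest first.
    scores = [item.get(score_key) for item in results]
    if any(score is None for score in scores):
        problems.append("a result is missing its score")
    elif scores != sorted(scores, reverse=True):
        problems.append("results are not sorted by descending score")

    if results and results[0].get(id_key) != expected_top_id:
        problems.append("top result does not match the fixture expectation")

    if body.get("npu_busy_delta_us", 0) <= 0:
        problems.append("npu_busy_delta_us is not positive")
    return problems
```

Returning a list of problems rather than raising on the first one makes a failed fixture easier to diagnose in a single run.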

## Optional systemd user service

Install the unit only after the foreground command and smoke test pass:

```bash
cp /home/will/lab/swarm/openvino-reranker-npu/openvino-reranker.service /home/will/.config/systemd/user/openvino-reranker.service
systemctl --user daemon-reload
systemctl --user start openvino-reranker.service
systemctl --user status openvino-reranker.service --no-pager
journalctl --user -u openvino-reranker.service -n 100 --no-pager
```

Do not enable or integrate it into live RAG without explicit approval.

## Optional RAG integration plan (disabled by default)

RAG should keep vector search against `obsidian_bge_npu` unchanged, retrieve a larger candidate set, and call this service as a read-only request-time second stage. Suggested disabled-by-default knobs:

```text
RAG_RERANK_ENABLED=false
RAG_RERANK_URL=http://127.0.0.1:18818/rerank
RAG_RERANK_INITIAL_K=20
RAG_RERANK_TOP_K=5
RAG_RERANK_TIMEOUT_MS=3000
```

On reranker timeout/error, fall back to vector order and include metadata such as `rerank_error`; do not mutate or reindex Chroma collections.
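
A minimal sketch of that fallback behaviour, assuming the RAG layer already holds its vector-ordered candidates as a list of dicts. `rerank_with_fallback` and the `reranked` flag are hypothetical; only the `rerank_error` metadata key and the top-k default of 5 come from the plan above.

```python
from typing import Callable, Sequence


def rerank_with_fallback(
    candidates: Sequence[dict],
    rerank_fn: Callable[[Sequence[dict]], list],
    top_k: int = 5,
    enabled: bool = False,
) -> dict:
    """Rerank candidates when enabled; on any failure keep the vector order."""
    if not enabled:
        # Disabled by default: return the vector order untouched.
        return {"documents": list(candidates[:top_k]), "reranked": False}
    try:
        ranked = rerank_fn(candidates)
    except Exception as exc:
        # Timeout or service error: fall back and record why.
        return {
            "documents": list(candidates[:top_k]),
            "reranked": False,
            "rerank_error": f"{type(exc).__name__}: {exc}",
        }
    return {"documents": list(ranked[:top_k]), "reranked": True}
```

The function only reads `candidates` and never writes back to the vector store, which matches the read-only, request-time role described above.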