William Valentin 0683253157 feat(npu): add OpenVINO reranker prototype

2026-06-04 13:07:51 -07:00

10 KiB


OpenVINO NPU reranker service spec

Status: proposed localhost prototype; not live RAG integration.
Target port: 127.0.0.1:18818.
Safety posture: foreground smoke first, no persistent enablement, no Atlas/Hermes/RAG routing changes without Will's explicit approval.

Recommendation

Use cross-encoder/ms-marco-MiniLM-L6-v2, exported to OpenVINO IR as INT8 and served on the OpenVINO Runtime NPU device by the local stdlib HTTP service in server.py.

Why this choice:

It is a small BERT-family cross-encoder reranker intended for MS MARCO-style passage ranking, matching the second-stage RAG use case better than another embedding-only similarity pass.
The model shape is simple pairwise text classification/scoring: (query, document) -> score, which maps cleanly to OpenVINO Runtime and avoids introducing a heavier LLM worker for reranking.
INT8 OpenVINO IR keeps memory and compile/runtime cost low enough for a localhost sidecar and is already represented in the repo defaults: /home/will/.cache/openvino-models/rerankers/ms-marco-MiniLM-L6-v2-int8-ov.
The service can fail closed on startup when OPENVINO_RERANKER_DEVICE=NPU but /sys/class/accel/accel0/device/npu_busy_time_us does not increase, preventing false "NPU-backed" claims.
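
The fail-closed startup check described above could look roughly like the following. This is a minimal sketch, not the code in server.py: the function names startup_smoke and read_busy_us and the run_inference callable are hypothetical, and only the sysfs path and the "delta must be positive" rule come from this spec.

```python
from pathlib import Path
from typing import Callable

# Path taken from this spec; everything else in this sketch is illustrative.
BUSY_PATH = Path("/sys/class/accel/accel0/device/npu_busy_time_us")


def read_busy_us(path: Path = BUSY_PATH) -> int:
    """Read the cumulative NPU busy-time counter (microseconds) from sysfs."""
    return int(path.read_text().strip())


def startup_smoke(run_inference: Callable[[], object], device: str,
                  path: Path = BUSY_PATH) -> dict:
    """Run one inference; in NPU mode, raise unless the busy counter increased."""
    if device != "NPU":
        run_inference()
        return {"ok": True, "npu_busy_delta_us": None}
    before = read_busy_us(path)
    run_inference()
    delta = read_busy_us(path) - before
    if delta <= 0:
        # Fail closed: refuse to report an NPU-backed service without evidence.
        raise RuntimeError(f"NPU requested but npu_busy_time_us delta was {delta}")
    return {"ok": True, "npu_busy_delta_us": delta}
```

Raising at startup, rather than silently falling back to CPU, is what keeps the service from reporting device=NPU without counter evidence.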

Runtime defaults:

OPENVINO_RERANKER_HOST=127.0.0.1
OPENVINO_RERANKER_PORT=18818
OPENVINO_RERANKER_DEVICE=NPU
OPENVINO_RERANKER_MODEL=cross-encoder/ms-marco-MiniLM-L6-v2
OPENVINO_RERANKER_MODEL_DIR=/home/will/.cache/openvino-models/rerankers/ms-marco-MiniLM-L6-v2-int8-ov
OPENVINO_RERANKER_MAX_LENGTH=512
OPENVINO_RERANKER_MAX_DOCUMENTS=100
OPENVINO_RERANKER_MAX_BODY_BYTES=5242880
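
One way a Python service might read these variables is sketched below. The RerankerConfig dataclass and load_config function are hypothetical names for illustration; only the variable names and default values are taken from the list above.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class RerankerConfig:
    host: str
    port: int
    device: str
    model: str
    model_dir: str
    max_length: int
    max_documents: int
    max_body_bytes: int


def load_config(env: Optional[Mapping[str, str]] = None) -> RerankerConfig:
    """Build the config from OPENVINO_RERANKER_* variables, using the spec defaults."""
    env = os.environ if env is None else env

    def get(name: str, default: str) -> str:
        return env.get("OPENVINO_RERANKER_" + name, default)

    return RerankerConfig(
        host=get("HOST", "127.0.0.1"),
        port=int(get("PORT", "18818")),
        device=get("DEVICE", "NPU"),
        model=get("MODEL", "cross-encoder/ms-marco-MiniLM-L6-v2"),
        model_dir=get(
            "MODEL_DIR",
            "/home/will/.cache/openvino-models/rerankers/ms-marco-MiniLM-L6-v2-int8-ov",
        ),
        max_length=int(get("MAX_LENGTH", "512")),
        max_documents=int(get("MAX_DOCUMENTS", "100")),
        max_body_bytes=int(get("MAX_BODY_BYTES", "5242880")),  # 5 MiB
    )
```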

Endpoint contract

Health and readiness

GET /healthz and GET /readyz return JSON.

/readyz must return HTTP 200 only when the model is loaded and startup smoke passed. For NPU mode, startup smoke must include a positive npu_busy_delta_us.

Representative ready response:

{
  "status": "ok",
  "ok": true,
  "service": "openvino-reranker",
  "model": "cross-encoder/ms-marco-MiniLM-L6-v2",
  "model_dir": "/home/will/.cache/openvino-models/rerankers/ms-marco-MiniLM-L6-v2-int8-ov",
  "device": "NPU",
  "available_devices": ["CPU", "NPU"],
  "max_length": 512,
  "startup_smoke": {"ok": true, "duration_ms": 12.3, "npu_busy_delta_us": 1234},
  "last_inference": null,
  "ready_error": null
}
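
The readiness rule can be expressed as a small pure function. This is a sketch under assumptions: the function name readyz, the use of HTTP 503 for the not-ready case, and the "not_ready" status string are not specified here; only the HTTP 200 conditions are.

```python
from typing import Optional, Tuple


def readyz(model_loaded: bool, device: str,
           startup_smoke: Optional[dict]) -> Tuple[int, dict]:
    """Return (http_status, body) for /readyz.

    200 only when the model is loaded and startup smoke passed; in NPU mode
    the smoke must also carry a positive npu_busy_delta_us.
    """
    smoke_ok = bool(startup_smoke and startup_smoke.get("ok"))
    npu_ok = device != "NPU" or (
        smoke_ok and (startup_smoke.get("npu_busy_delta_us") or 0) > 0
    )
    ready = model_loaded and smoke_ok and npu_ok
    body = {
        "status": "ok" if ready else "not_ready",  # "not_ready" is an assumption
        "ok": ready,
        "device": device,
        "startup_smoke": startup_smoke,
        "ready_error": None if ready else "model not loaded or startup smoke failed",
    }
    return (200 if ready else 503), body  # 503 for not-ready is an assumption
```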

Rerank

POST /rerank and compatibility alias POST /v1/rerank accept:

{
  "query": "how do I verify OpenVINO NPU usage?",
  "documents": [
    {"id": "good", "text": "Check /sys/class/accel/accel0/device/npu_busy_time_us before and after inference.", "metadata": {"source": "synthetic"}},
    {"id": "bad", "text": "This note is about making sourdough starter."}
  ],
  "top_k": 2,
  "return_documents": false
}

Compatibility notes:

documents may be strings or objects with id, text, and optional object metadata.
top_k is preferred; top_n is accepted for common reranker-client compatibility.
return_documents=false is recommended for RAG integration to avoid echoing private source text into logs or intermediate traces.
The optional model field may be sent by clients but is not used for routing; this sidecar serves one configured model.
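
A sketch of how these compatibility rules might be applied when normalizing a request body. The function name normalize_request is hypothetical, and three behaviors are assumptions rather than part of this spec: defaulting top_k to the number of documents, clamping top_k to the document count, and rejecting requests over the document limit with a validation error.

```python
def normalize_request(payload: dict, max_documents: int = 100) -> dict:
    """Validate a rerank payload and return it in one canonical shape."""
    query = payload.get("query")
    if not isinstance(query, str) or not query.strip():
        raise ValueError("query must be a non-empty string")

    raw_docs = payload.get("documents")
    if not isinstance(raw_docs, list) or not raw_docs:
        raise ValueError("documents must be a non-empty list")
    if len(raw_docs) > max_documents:  # assumption: treated as a validation error
        raise ValueError(f"at most {max_documents} documents are allowed")

    docs = []
    for index, item in enumerate(raw_docs):
        if isinstance(item, str):  # bare strings are accepted
            item = {"text": item}
        if not isinstance(item, dict):
            raise ValueError(f"documents[{index}] must be a string or object")
        text = item.get("text")
        if not isinstance(text, str) or not text.strip():
            raise ValueError(f"documents[{index}].text must be a non-empty string")
        metadata = item.get("metadata")
        if metadata is not None and not isinstance(metadata, dict):
            raise ValueError(f"documents[{index}].metadata must be an object")
        docs.append({"index": index, "id": item.get("id"),
                     "text": text, "metadata": metadata})

    # top_k is preferred; top_n is the compatibility alias.
    top_k = payload.get("top_k", payload.get("top_n"))
    if top_k is None:
        top_k = len(docs)  # assumption: default to returning every document
    if isinstance(top_k, bool) or not isinstance(top_k, int) or top_k <= 0:
        raise ValueError("top_k/top_n must be a positive integer")

    return {
        "query": query,
        "documents": docs,
        "top_k": min(top_k, len(docs)),
        "return_documents": bool(payload.get("return_documents", False)),
    }
```

The model field is deliberately ignored here, matching the note that it is not used for routing.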

Successful response:

{
  "ok": true,
  "model": "cross-encoder/ms-marco-MiniLM-L6-v2",
  "device": "NPU",
  "query": "how do I verify OpenVINO NPU usage?",
  "input_count": 2,
  "top_k": 2,
  "duration_ms": 10.5,
  "npu_busy_delta_us": 1234,
  "results": [
    {"index": 0, "id": "good", "score": 8.1, "raw_score": 8.1, "probability": 0.9997},
    {"index": 1, "id": "bad", "score": -4.2, "raw_score": -4.2, "probability": 0.0148}
  ]
}
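
The example values above are consistent with probability being the logistic sigmoid of raw_score (sigmoid(8.1) rounds to 0.9997 and sigmoid(-4.2) to 0.0148). A minimal sketch of building the sorted results list under that assumption follows; the function names are hypothetical, and server.py may compute these fields differently.

```python
import math
from typing import List, Optional, Sequence


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def build_results(ids: Sequence[Optional[str]], raw_scores: Sequence[float],
                  top_k: int) -> List[dict]:
    """Pair each document with its score and return the top_k, best first."""
    rows = [
        {"index": i, "id": ids[i], "score": s, "raw_score": s,
         "probability": round(sigmoid(s), 4)}
        for i, s in enumerate(raw_scores)
    ]
    rows.sort(key=lambda row: row["score"], reverse=True)
    return rows[:top_k]
```

Keeping index as the position in the request lets a caller map results back to its own candidates even when ids are absent.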

Error response shape:

{"ok": false, "error": "human-readable error", "results": []}

Status behavior:

400: invalid JSON or schema, empty query, missing/empty documents, invalid document text, or non-positive/non-integer top_k/top_n.
413: request body above OPENVINO_RERANKER_MAX_BODY_BYTES.
503: model not ready.
500: unexpected inference/runtime failure.
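
The status table can be implemented as a single mapping from exception type to HTTP status. The exception classes BodyTooLarge and NotReady and the function error_response are hypothetical names for this sketch; the status codes and the error body shape are the ones listed above.

```python
import json
from typing import Tuple


class BodyTooLarge(Exception):
    """Request body exceeded OPENVINO_RERANKER_MAX_BODY_BYTES."""


class NotReady(Exception):
    """Model is not loaded or startup smoke has not passed."""


def error_response(exc: Exception) -> Tuple[int, str]:
    """Map an exception to (http_status, json_body) in the documented error shape."""
    if isinstance(exc, BodyTooLarge):
        status = 413
    elif isinstance(exc, NotReady):
        status = 503
    elif isinstance(exc, ValueError):
        # Covers validation failures and json.JSONDecodeError (a ValueError subclass).
        status = 400
    else:
        status = 500
    body = json.dumps({"ok": False, "error": str(exc), "results": []})
    return status, body
```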

CLI contract

Foreground-only review start:

ss -ltnp | grep ':18818\b' || true
cat /sys/class/accel/accel0/device/npu_busy_time_us
source /home/will/.venvs/openvino-reranker/bin/activate
OPENVINO_RERANKER_HOST=127.0.0.1 \
OPENVINO_RERANKER_PORT=18818 \
OPENVINO_RERANKER_DEVICE=NPU \
OPENVINO_RERANKER_MODEL_DIR=/home/will/.cache/openvino-models/rerankers/ms-marco-MiniLM-L6-v2-int8-ov \
python /home/will/lab/swarm/openvino-reranker-npu/server.py

Client smoke:

source /home/will/.venvs/openvino-reranker/bin/activate
python /home/will/lab/swarm/openvino-reranker-npu/smoke.py --url http://127.0.0.1:18818

An optional user-systemd unit exists as openvino-reranker.service, but this spec does not approve copying, starting, enabling, or wiring it into live paths.

Non-private smoke payload

Use only synthetic public-text fixtures. Do not query the Obsidian vault, private document directories, image folders, or live Chroma documents during smoke.

Minimum cases:

Query: how do I verify OpenVINO NPU usage?
- Expected top document: Check /sys/class/accel/accel0/device/npu_busy_time_us before and after inference.
- Distractor: This note is about making sourdough starter.
Query: what port does the reranker service use?
- Expected top document: The OpenVINO reranker prototype listens locally on port 18818.
- Distractor: Whisper transcription accepts audio uploads.
Query: why should reranking not mutate vector collections?
- Expected top document: Reranking is a read-only second-stage transformation after vector search.
- Distractor: Boil pasta in salted water until al dente.

Pass criteria:

/readyz is HTTP 200 and reports device=NPU.
Every case returns ok=true and a sorted results list with the expected top id.
Response-level npu_busy_delta_us is positive for each case.
The externally measured sysfs delta (after - before) is positive for each case, or at least for the full smoke batch.
Smoke script exits 0 and prints JSON with ok: true.
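
The per-case criteria can be checked with a small predicate over the /rerank response. This is an illustrative sketch with a hypothetical function name, not the logic in smoke.py; the external sysfs delta is measured separately and is not part of it.

```python
def case_passes(response: dict, expected_top_id: str) -> bool:
    """Check one smoke case: ok=true, sorted results, expected top id, positive NPU delta."""
    results = response.get("results") or []
    if response.get("ok") is not True or not results:
        return False
    scores = [r.get("score") for r in results]
    if any(not isinstance(s, (int, float)) for s in scores):
        return False
    is_sorted = scores == sorted(scores, reverse=True)
    delta = response.get("npu_busy_delta_us") or 0
    return is_sorted and results[0].get("id") == expected_top_id and delta > 0
```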

NPU busy-time verification plan

HTTP 200 is not proof of NPU execution. Verification must capture both endpoint-reported and sysfs-observed deltas.

Procedure:

BUSY=/sys/class/accel/accel0/device/npu_busy_time_us
before=$(cat "$BUSY")
curl -fsS http://127.0.0.1:18818/rerank \
  -H 'Content-Type: application/json' \
  -d '{"query":"how do I verify OpenVINO NPU usage?","documents":[{"id":"good","text":"Check /sys/class/accel/accel0/device/npu_busy_time_us before and after inference."},{"id":"bad","text":"This note is about making sourdough starter."}],"top_k":2,"return_documents":false}' \
  | jq '{ok, device, npu_busy_delta_us, top_id:.results[0].id}'
after=$(cat "$BUSY")
echo "sysfs_npu_busy_delta_us=$((after-before))"

Acceptance:

device == "NPU".
Response npu_busy_delta_us > 0.
Shell-computed sysfs_npu_busy_delta_us > 0.
If any value is zero/negative/missing, call the result CPU/unknown and do not claim NPU-backed reranking.
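
The same procedure and acceptance rule, sketched in Python. The function verify_npu and its two callables are hypothetical: call_rerank would perform the POST and return the parsed JSON, and read_busy_us would read the sysfs counter.

```python
from typing import Callable


def verify_npu(call_rerank: Callable[[], dict],
               read_busy_us: Callable[[], int]) -> dict:
    """Measure the sysfs delta around one rerank call and classify the backend."""
    before = read_busy_us()
    response = call_rerank()
    sysfs_delta = read_busy_us() - before

    response_delta = response.get("npu_busy_delta_us")
    response_ok = (
        isinstance(response_delta, (int, float))
        and not isinstance(response_delta, bool)
        and response_delta > 0
    )
    # All three conditions must hold; anything else is reported as CPU/unknown.
    proven = response.get("device") == "NPU" and response_ok and sysfs_delta > 0
    return {
        "backend": "NPU" if proven else "CPU/unknown",
        "response_delta_us": response_delta,
        "sysfs_delta_us": sysfs_delta,
    }
```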

Optional RAG second-stage integration plan (deferred)

This is a plan only. Do not enable it in live RAG without explicit approval.

Design:

Keep existing vector search and Chroma collection obsidian_bge_npu unchanged.
Retrieve more candidates from current vector search, e.g. initial_k=20.
Send only request-time candidate snippets/ids to http://127.0.0.1:18818/rerank.
Use reranker order to choose final top_k, e.g. 5.
On timeout, connection error, invalid response, or non-positive NPU proof when proof is required, fall back to vector order and attach metadata like rerank_error; do not fail the whole RAG request unless explicitly configured.
Log counters and latency, but avoid logging raw private document text.
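
The fallback behavior in this design might be sketched as follows. This is a plan-level illustration only: the function rerank_or_fallback and the call_reranker callable are hypothetical, and nothing here is wired into the live RAG path.

```python
from typing import Callable, List, Tuple


def rerank_or_fallback(candidates: List[dict],
                       call_reranker: Callable[[List[dict]], dict],
                       top_k: int = 5,
                       require_npu_proof: bool = True) -> Tuple[List[dict], dict]:
    """Return (final_candidates, metadata); keep vector order if reranking fails.

    candidates must already be in vector-search order. call_reranker posts them
    to the reranker and returns the parsed response, or raises on timeout or
    connection error.
    """
    try:
        response = call_reranker(candidates)
        if not response.get("ok"):
            raise ValueError(response.get("error") or "reranker returned ok=false")
        if require_npu_proof and not (response.get("npu_busy_delta_us") or 0) > 0:
            raise ValueError("missing or non-positive npu_busy_delta_us")
        order = [r["index"] for r in response["results"]]
        if not all(isinstance(i, int) and 0 <= i < len(candidates) for i in order):
            raise ValueError("result index out of range")
        return [candidates[i] for i in order][:top_k], {"reranked": True}
    except (OSError, ValueError, KeyError, TypeError, AttributeError) as exc:
        # Do not fail the RAG request; record the error class and message only.
        meta = {"reranked": False, "rerank_error": f"{type(exc).__name__}: {exc}"}
        return candidates[:top_k], meta
```

The metadata carries the error type and message but never candidate text, in line with the logging rule above.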

Disabled-by-default knobs:

RAG_RERANK_ENABLED=false
RAG_RERANK_URL=http://127.0.0.1:18818/rerank
RAG_RERANK_INITIAL_K=20
RAG_RERANK_TOP_K=5
RAG_RERANK_TIMEOUT_MS=3000
RAG_RERANK_REQUIRE_NPU_PROOF=true
RAG_RERANK_RETURN_DOCUMENTS=false

Integration tests should use synthetic in-memory candidates first. Live-vault evaluation requires a separate approval and must not mutate or rebuild the vector collection.

Docs and diagram implications

If this prototype advances beyond spec/review, update these surfaces while keeping live/prototype labels clear:

openvino-reranker-npu/README.md: keep model/runtime, endpoint contract, smoke command, and approval gates synchronized with code.
swarm-common/obsidian-vault/will/will-shared-zap/Runbooks/OpenVINO NPU Services Runbook.md: list :18818 as prototype/not enabled, with foreground smoke and NPU sysfs proof.
Service catalog / architecture notes: show live baseline :18810, :18816, :18817; show :18818 as optional second-stage RAG prototype, not live routing.
Diagrams: render RAG :18810 -> optional reranker :18818 as dashed/disabled or "proposed"; do not imply Atlas/Hermes/gateway traffic is using it.
Optional systemd unit: document as installable after approval, not enabled by default.

No-go / defer criteria

Do not ship, enable, or integrate the reranker if any of these hold:

Port 18818 is already owned by another live service.
NPU is unavailable in ov.Core().available_devices or /sys/class/accel/accel0/device/npu_busy_time_us is missing.
Foreground startup smoke fails or has non-positive NPU busy-time delta while configured for NPU.
Synthetic smoke top-1 ranking fails or latency is unacceptable for the intended RAG timeout budget.
Model export requires overwriting the existing model directory or touching Chroma/vector collections.
The service must bind beyond 127.0.0.1 to be useful.
Live RAG integration would require reindexing, collection mutation, private-doc smoke, or Atlas/Hermes/gateway routing changes without explicit approval.
Logs or responses would persist raw private document text outside the existing RAG request path.
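
The mechanically checkable items in this list (port ownership, NPU presence, sysfs counter, bind address) could be gathered by a preflight helper like the sketch below. The function names are hypothetical; available_devices is expected to be the list read from ov.Core().available_devices and is passed in so the sketch does not depend on OpenVINO being importable.

```python
import socket
from pathlib import Path
from typing import List, Sequence


def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(0.5)
        return sock.connect_ex((host, port)) == 0


def preflight_blockers(port: int, busy_path: str,
                       available_devices: Sequence[str],
                       host: str = "127.0.0.1") -> List[str]:
    """Return a list of no-go reasons; an empty list means none were found."""
    blockers = []
    if host != "127.0.0.1":
        blockers.append(f"bind address {host} is not 127.0.0.1")
    if port_in_use(port):
        blockers.append(f"port {port} already has a listener")
    if "NPU" not in available_devices:
        blockers.append("NPU missing from available devices")
    if not Path(busy_path).is_file():
        blockers.append(f"{busy_path} is missing")
    return blockers
```

The remaining criteria (ranking quality, latency budget, approvals, privacy of logs) need human review and are not covered by this check.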

Current local preflight observed during this spec pass

/sys/class/accel/accel0/device/npu_busy_time_us is readable.
/home/will/.cache/openvino-models/rerankers/ms-marco-MiniLM-L6-v2-int8-ov is present.
/home/will/.venvs/openvino-reranker/bin/python is present.
:18818 was not listening during preflight.
server.py and smoke.py pass python -m py_compile.

These observations are preflight only; they are not a live service/NPU smoke result.
