# OpenVINO NPU reranker service spec

Status: proposed localhost prototype; not live RAG integration.
Target port: `127.0.0.1:18818`.
Safety posture: foreground smoke first, no persistent enablement, no Atlas/Hermes/RAG routing changes without Will's explicit approval.

## Recommendation

Use `cross-encoder/ms-marco-MiniLM-L6-v2`, exported to OpenVINO IR as INT8, served by the local stdlib HTTP service in `server.py` on OpenVINO Runtime `NPU`.

Why this choice:

- It is a small BERT-family cross-encoder reranker intended for MS MARCO-style passage ranking, matching the second-stage RAG use case better than another embedding-only similarity pass.
- The model shape is simple pairwise text classification/scoring: `(query, document) -> score`, which maps cleanly to OpenVINO Runtime and avoids introducing a heavier LLM worker for reranking.
- INT8 OpenVINO IR keeps memory and compile/runtime cost low enough for a localhost sidecar and is already represented in the repo defaults:
  `/home/will/.cache/openvino-models/rerankers/ms-marco-MiniLM-L6-v2-int8-ov`.
- The service can fail closed on startup when `OPENVINO_RERANKER_DEVICE=NPU` but `/sys/class/accel/accel0/device/npu_busy_time_us` does not increase, preventing false "NPU-backed" claims.

Runtime default:

```text
OPENVINO_RERANKER_HOST=127.0.0.1
OPENVINO_RERANKER_PORT=18818
OPENVINO_RERANKER_DEVICE=NPU
OPENVINO_RERANKER_MODEL=cross-encoder/ms-marco-MiniLM-L6-v2
OPENVINO_RERANKER_MODEL_DIR=/home/will/.cache/openvino-models/rerankers/ms-marco-MiniLM-L6-v2-int8-ov
OPENVINO_RERANKER_MAX_LENGTH=512
OPENVINO_RERANKER_MAX_DOCUMENTS=100
OPENVINO_RERANKER_MAX_BODY_BYTES=5242880
```

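As an illustration of how these defaults could be consumed, here is a minimal sketch of an environment loader. The names `RerankerConfig` and `load_config` are hypothetical and are not taken from `server.py`; only the variable names and default values come from this spec.

```python
import os
from dataclasses import dataclass

DEFAULT_MODEL_DIR = (
    "/home/will/.cache/openvino-models/rerankers/ms-marco-MiniLM-L6-v2-int8-ov"
)


@dataclass(frozen=True)
class RerankerConfig:
    host: str
    port: int
    device: str
    model: str
    model_dir: str
    max_length: int
    max_documents: int
    max_body_bytes: int


def load_config(env=None):
    """Hypothetical helper: read OPENVINO_RERANKER_* values with the spec defaults."""
    env = os.environ if env is None else env

    def get(name, default):
        return env.get("OPENVINO_RERANKER_" + name, default)

    return RerankerConfig(
        host=get("HOST", "127.0.0.1"),
        port=int(get("PORT", "18818")),
        device=get("DEVICE", "NPU"),
        model=get("MODEL", "cross-encoder/ms-marco-MiniLM-L6-v2"),
        model_dir=get("MODEL_DIR", DEFAULT_MODEL_DIR),
        max_length=int(get("MAX_LENGTH", "512")),
        max_documents=int(get("MAX_DOCUMENTS", "100")),
        max_body_bytes=int(get("MAX_BODY_BYTES", "5242880")),
    )
```

Passing a plain dict instead of `os.environ` keeps the loader testable without touching the process environment.
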
## Endpoint contract

### Health and readiness

`GET /healthz` and `GET /readyz` return JSON.

`/readyz` must return HTTP 200 only when the model is loaded and startup smoke passed. For NPU mode, startup smoke must include a positive `npu_busy_delta_us`.

Representative ready response:

```json
{
  "status": "ok",
  "ok": true,
  "service": "openvino-reranker",
  "model": "cross-encoder/ms-marco-MiniLM-L6-v2",
  "model_dir": "/home/will/.cache/openvino-models/rerankers/ms-marco-MiniLM-L6-v2-int8-ov",
  "device": "NPU",
  "available_devices": ["CPU", "NPU"],
  "max_length": 512,
  "startup_smoke": {"ok": true, "duration_ms": 12.3, "npu_busy_delta_us": 1234},
  "last_inference": null,
  "ready_error": null
}
```

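The readiness rule can be sketched as a pure function over the service state. This is an assumption-heavy illustration: the `model_loaded` key, the `not_ready` status string, and the use of HTTP 503 for a not-ready `/readyz` are hypothetical choices, since the spec only fixes when 200 is allowed.

```python
def readyz_status(state):
    """Return (http_status, body) for /readyz from a hypothetical state dict."""
    smoke = state.get("startup_smoke") or {}
    ready = bool(state.get("model_loaded")) and smoke.get("ok") is True
    error = None
    if not ready:
        error = "model not loaded or startup smoke failed"
    elif state.get("device") == "NPU":
        # NPU mode additionally requires a positive busy-time delta.
        delta = smoke.get("npu_busy_delta_us")
        if type(delta) is not int or delta <= 0:
            ready = False
            error = "startup smoke had no positive npu_busy_delta_us"
    body = {
        "status": "ok" if ready else "not_ready",  # "not_ready" is an assumption
        "ok": ready,
        "ready_error": error,
    }
    # 503 for the not-ready case is an assumption; the spec only defines 200.
    return (200 if ready else 503), body
```

Keeping the decision free of I/O makes it straightforward to unit test the fail-closed path without an NPU present.
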
### Rerank

`POST /rerank` and compatibility alias `POST /v1/rerank` accept:

```json
{
  "query": "how do I verify OpenVINO NPU usage?",
  "documents": [
    {"id": "good", "text": "Check /sys/class/accel/accel0/device/npu_busy_time_us before and after inference.", "metadata": {"source": "synthetic"}},
    {"id": "bad", "text": "This note is about making sourdough starter."}
  ],
  "top_k": 2,
  "return_documents": false
}
```

Compatibility notes:

- `documents` may be strings or objects with `id`, `text`, and optional object `metadata`.
- `top_k` is preferred; `top_n` is accepted for common reranker-client compatibility.
- `return_documents=false` is recommended for RAG integration to avoid echoing private source text into logs or intermediate traces.
- The optional `model` field may be sent by clients but is not used for routing; this sidecar serves one configured model.

Successful response:

```json
{
  "ok": true,
  "model": "cross-encoder/ms-marco-MiniLM-L6-v2",
  "device": "NPU",
  "query": "how do I verify OpenVINO NPU usage?",
  "input_count": 2,
  "top_k": 2,
  "duration_ms": 10.5,
  "npu_busy_delta_us": 1234,
  "results": [
    {"index": 0, "id": "good", "score": 8.1, "raw_score": 8.1, "probability": 0.9997},
    {"index": 1, "id": "bad", "score": -4.2, "raw_score": -4.2, "probability": 0.0148}
  ]
}
```

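The `probability` values in the example are consistent with a logistic sigmoid of `raw_score`, although the spec does not state this. Under that assumption, a minimal sketch of turning model logits into a sorted, truncated `results` list might look like the following; `build_results` is a hypothetical name, not a function known to exist in `server.py`.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def build_results(raw_scores, ids, top_k):
    """Sort by raw score, highest first, and keep the first top_k rows.

    Assumption: probability = sigmoid(raw_score), rounded to 4 places.
    """
    rows = []
    for index, (doc_id, raw) in enumerate(zip(ids, raw_scores)):
        rows.append({
            "index": index,          # position in the request's documents list
            "id": doc_id,
            "score": raw,
            "raw_score": raw,
            "probability": round(sigmoid(raw), 4),
        })
    rows.sort(key=lambda row: row["raw_score"], reverse=True)
    return rows[:top_k]
```

For the two example scores, this reproduces the probabilities shown above: `8.1` maps to `0.9997` and `-4.2` maps to `0.0148`.
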
Error response shape:

```json
{"ok": false, "error": "human-readable error", "results": []}
```

Status behavior:

- 400: invalid JSON schema, empty query, missing/empty documents, invalid document text, or non-positive/non-integer `top_k`/`top_n`.
- 413: request body above `OPENVINO_RERANKER_MAX_BODY_BYTES`.
- 503: model not ready.
- 500: unexpected inference/runtime failure.

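The 400 cases and the compatibility notes can be illustrated with a small request normalizer. This is a sketch under stated assumptions: `BadRequest` and `normalize_request` are hypothetical names, an omitted `top_k`/`top_n` is assumed to mean "return all documents", and exceeding `OPENVINO_RERANKER_MAX_DOCUMENTS` is assumed to map to 400, which the spec does not specify.

```python
class BadRequest(ValueError):
    """Hypothetical exception that a handler would map to HTTP 400."""


def normalize_request(payload, max_documents=100):
    """Validate a parsed /rerank body and return (query, documents, top_k)."""
    if not isinstance(payload, dict):
        raise BadRequest("body must be a JSON object")
    query = payload.get("query")
    if not isinstance(query, str) or not query.strip():
        raise BadRequest("query must be a non-empty string")
    raw_docs = payload.get("documents")
    if not isinstance(raw_docs, list) or not raw_docs:
        raise BadRequest("documents must be a non-empty list")
    if len(raw_docs) > max_documents:  # status code for this case is an assumption
        raise BadRequest("too many documents")
    docs = []
    for index, item in enumerate(raw_docs):
        if isinstance(item, str):  # bare strings are accepted as documents
            item = {"text": item}
        if not isinstance(item, dict):
            raise BadRequest(f"documents[{index}] must be a string or object")
        text = item.get("text")
        if not isinstance(text, str) or not text.strip():
            raise BadRequest(f"documents[{index}] has invalid text")
        metadata = item.get("metadata")
        if metadata is not None and not isinstance(metadata, dict):
            raise BadRequest(f"documents[{index}].metadata must be an object")
        docs.append({"index": index, "id": item.get("id"),
                     "text": text, "metadata": metadata})
    # top_k is preferred; top_n is the compatibility alias.
    top_k = payload.get("top_k", payload.get("top_n", len(docs)))
    if isinstance(top_k, bool) or not isinstance(top_k, int) or top_k <= 0:
        raise BadRequest("top_k/top_n must be a positive integer")
    return query, docs, min(top_k, len(docs))
```

The optional `model` field is deliberately ignored here, matching the note that the sidecar serves one configured model.
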
## CLI contract

Foreground-only review start:

```bash
ss -ltnp | grep ':18818\b' || true
cat /sys/class/accel/accel0/device/npu_busy_time_us
source /home/will/.venvs/openvino-reranker/bin/activate
OPENVINO_RERANKER_HOST=127.0.0.1 \
OPENVINO_RERANKER_PORT=18818 \
OPENVINO_RERANKER_DEVICE=NPU \
OPENVINO_RERANKER_MODEL_DIR=/home/will/.cache/openvino-models/rerankers/ms-marco-MiniLM-L6-v2-int8-ov \
python /home/will/lab/swarm/openvino-reranker-npu/server.py
```

Client smoke:

```bash
source /home/will/.venvs/openvino-reranker/bin/activate
python /home/will/lab/swarm/openvino-reranker-npu/smoke.py --url http://127.0.0.1:18818
```

Optional user-systemd unit exists as `openvino-reranker.service`, but this spec does not approve copying, starting, enabling, or wiring it into live paths.

## Non-private smoke payload

Use only synthetic public-text fixtures. Do not query the Obsidian vault, private document directories, image folders, or live Chroma documents during smoke.

Minimum cases:

1. Query: `how do I verify OpenVINO NPU usage?`
   - Expected top document: `Check /sys/class/accel/accel0/device/npu_busy_time_us before and after inference.`
   - Distractor: `This note is about making sourdough starter.`
2. Query: `what port does the reranker service use?`
   - Expected top document: `The OpenVINO reranker prototype listens locally on port 18818.`
   - Distractor: `Whisper transcription accepts audio uploads.`
3. Query: `why should reranking not mutate vector collections?`
   - Expected top document: `Reranking is a read-only second-stage transformation after vector search.`
   - Distractor: `Boil pasta in salted water until al dente.`

Pass criteria:

- `/readyz` is HTTP 200 and reports `device=NPU`.
- Every case returns `ok=true` and a sorted `results` list with the expected top `id`.
- Response-level `npu_busy_delta_us` is positive for each case.
- External sysfs `after - before` is positive for each case or at least for the full smoke batch.
- Smoke script exits 0 and prints JSON with `ok: true`.

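The per-case pass criteria can be expressed as a small checker over a parsed `/rerank` response. This is a sketch only: `check_case` is a hypothetical helper, not the logic of `smoke.py`, and "sorted" is assumed to mean descending `score`, as in the example response.

```python
def check_case(response, expected_top_id):
    """Return a list of failure strings; an empty list means the case passed."""
    failures = []
    if response.get("ok") is not True:
        failures.append("ok is not true")
    results = response.get("results") or []
    scores = [row.get("score") for row in results]
    if not results or not all(isinstance(s, (int, float)) for s in scores):
        failures.append("results missing or scores not numeric")
    elif scores != sorted(scores, reverse=True):
        failures.append("results are not sorted by descending score")
    if not results or results[0].get("id") != expected_top_id:
        failures.append("unexpected top id")
    delta = response.get("npu_busy_delta_us")
    if type(delta) is not int or delta <= 0:
        failures.append("npu_busy_delta_us is not positive")
    return failures
```

A smoke run would call this once per synthetic case and exit non-zero if any list is non-empty. The `/readyz` check and the external sysfs delta are separate criteria and are not covered by this function.
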
## NPU busy-time verification plan

HTTP 200 is not proof. Verification must capture both endpoint-reported and sysfs-observed deltas.

Procedure:

```bash
BUSY=/sys/class/accel/accel0/device/npu_busy_time_us
before=$(cat "$BUSY")
curl -fsS http://127.0.0.1:18818/rerank \
  -H 'Content-Type: application/json' \
  -d '{"query":"how do I verify OpenVINO NPU usage?","documents":[{"id":"good","text":"Check /sys/class/accel/accel0/device/npu_busy_time_us before and after inference."},{"id":"bad","text":"This note is about making sourdough starter."}],"top_k":2,"return_documents":false}' \
  | jq '{ok, device, npu_busy_delta_us, top_id:.results[0].id}'
after=$(cat "$BUSY")
echo "sysfs_npu_busy_delta_us=$((after-before))"
```

Acceptance:

- `device == "NPU"`.
- Response `npu_busy_delta_us > 0`.
- Shell-computed `sysfs_npu_busy_delta_us > 0`.
- If any value is zero/negative/missing, call the result CPU/unknown and do not claim NPU-backed reranking.

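The acceptance rule above can be sketched in Python for use in a scripted check. The helper names `read_busy_us` and `classify_backend` are hypothetical; only the sysfs path and the three conditions come from this spec.

```python
from pathlib import Path

BUSY_PATH = Path("/sys/class/accel/accel0/device/npu_busy_time_us")


def read_busy_us(path=BUSY_PATH):
    """Return the busy-time counter as an int, or None if missing or unreadable."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


def classify_backend(device, response_delta_us, sysfs_before, sysfs_after):
    """Return "NPU" only if every acceptance check holds, else "CPU/unknown"."""
    values = (response_delta_us, sysfs_before, sysfs_after)
    if device != "NPU" or any(type(v) is not int for v in values):
        return "CPU/unknown"  # wrong device or a missing value
    if response_delta_us <= 0 or sysfs_after - sysfs_before <= 0:
        return "CPU/unknown"  # zero or negative delta
    return "NPU"
```

In use, `read_busy_us()` would be called immediately before and after the HTTP request, and the two readings passed in together with the `device` and `npu_busy_delta_us` fields from the response.
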
## Optional RAG second-stage integration plan (deferred)

This is a plan only. Do not enable it in live RAG without explicit approval.

Design:

1. Keep existing vector search and Chroma collection `obsidian_bge_npu` unchanged.
2. Retrieve more candidates from current vector search, e.g. `initial_k=20`.
3. Send only request-time candidate snippets/ids to `http://127.0.0.1:18818/rerank`.
4. Use reranker order to choose final `top_k`, e.g. `5`.
5. On timeout, connection error, invalid response, or non-positive NPU proof when proof is required, fall back to vector order and attach metadata like `rerank_error`; do not fail the whole RAG request unless explicitly configured.
6. Log counters and latency, but avoid logging raw private document text.

Disabled-by-default knobs:

```text
RAG_RERANK_ENABLED=false
RAG_RERANK_URL=http://127.0.0.1:18818/rerank
RAG_RERANK_INITIAL_K=20
RAG_RERANK_TOP_K=5
RAG_RERANK_TIMEOUT_MS=3000
RAG_RERANK_REQUIRE_NPU_PROOF=true
RAG_RERANK_RETURN_DOCUMENTS=false
```

Integration tests should use synthetic in-memory candidates first. Live-vault evaluation requires a separate approval and must not mutate or rebuild the vector collection.

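The fallback behavior in the design can be exercised entirely in memory. The sketch below is illustrative, not the RAG service's code: `rerank_or_fallback` is a hypothetical name, and `call_reranker` stands in for whatever HTTP client would POST to the rerank URL and return the parsed JSON, raising on timeout or connection errors.

```python
def rerank_or_fallback(query, candidates, call_reranker,
                       top_k=5, require_npu_proof=True):
    """Rerank vector-ordered candidates; fall back to vector order on any failure.

    candidates: list of {"id": ..., "text": ...} in vector-search order.
    Returns (final_candidates, metadata).
    """
    meta = {"reranked": False, "rerank_error": None}
    try:
        response = call_reranker({
            "query": query,
            "documents": [{"id": c["id"], "text": c["text"]} for c in candidates],
            "top_k": top_k,
            "return_documents": False,
        })
        if not isinstance(response, dict) or response.get("ok") is not True:
            raise ValueError("invalid reranker response")
        delta = response.get("npu_busy_delta_us")
        if require_npu_proof and (type(delta) is not int or delta <= 0):
            raise ValueError("missing NPU proof")
        by_id = {c["id"]: c for c in candidates}
        ordered = [by_id[row["id"]] for row in response["results"]]
        if not ordered:
            raise ValueError("empty results")
    except Exception as exc:
        # Record only the error type so no document text reaches logs.
        meta["rerank_error"] = type(exc).__name__
        return candidates[:top_k], meta
    meta["reranked"] = True
    return ordered[:top_k], meta
```

Injecting `call_reranker` lets integration tests substitute a stub that returns canned responses or raises, so no live service, vault, or Chroma collection is touched.
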
## Docs and diagram implications

If this prototype advances beyond spec/review, update these surfaces while keeping live/prototype labels clear:

- `openvino-reranker-npu/README.md`: keep model/runtime, endpoint contract, smoke command, and approval gates synchronized with code.
- `swarm-common/obsidian-vault/will/will-shared-zap/Runbooks/OpenVINO NPU Services Runbook.md`: list `:18818` as prototype/not enabled, with foreground smoke and NPU sysfs proof.
- Service catalog / architecture notes: show live baseline `:18810`, `:18816`, `:18817`; show `:18818` as optional second-stage RAG prototype, not live routing.
- Diagrams: render `RAG :18810 -> optional reranker :18818` as dashed/disabled or "proposed"; do not imply Atlas/Hermes/gateway traffic is using it.
- Optional systemd unit: document as installable after approval, not enabled by default.

## No-go / defer criteria

Do not ship, enable, or integrate the reranker if any of these hold:

- Port `18818` is already owned by another live service.
- `NPU` is unavailable in `ov.Core().available_devices` or `/sys/class/accel/accel0/device/npu_busy_time_us` is missing.
- Foreground startup smoke fails or has non-positive NPU busy-time delta while configured for NPU.
- Synthetic smoke top-1 ranking fails or latency is unacceptable for the intended RAG timeout budget.
- Model export requires overwriting the existing model directory or touching Chroma/vector collections.
- The service must bind beyond `127.0.0.1` to be useful.
- Live RAG integration would require reindexing, collection mutation, private-doc smoke, or Atlas/Hermes/gateway routing changes without explicit approval.
- Logs or responses would persist raw private document text outside the existing RAG request path.

## Current local preflight observed during this spec pass

- `/sys/class/accel/accel0/device/npu_busy_time_us` is readable.
- `/home/will/.cache/openvino-models/rerankers/ms-marco-MiniLM-L6-v2-int8-ov` is present.
- `/home/will/.venvs/openvino-reranker/bin/python` is present.
- `:18818` was not listening during preflight.
- `server.py` and `smoke.py` pass `python -m py_compile`.

These observations are preflight only; they are not a live service/NPU smoke result.