swarm-master/openvino-reranker-npu/SPEC.md
2026-06-04 12:16:15 -07:00

OpenVINO NPU reranker service spec

Status: proposed localhost prototype; not a live RAG integration.
Target port: 127.0.0.1:18818.
Safety posture: foreground smoke first, no persistent enablement, and no Atlas/Hermes/RAG routing changes without Will's explicit approval.

Recommendation

Use cross-encoder/ms-marco-MiniLM-L6-v2, exported to OpenVINO IR as INT8 and served by the local stdlib HTTP service in server.py, with inference running on the OpenVINO Runtime NPU device.

Why this choice:

  • It is a small BERT-family cross-encoder reranker intended for MS MARCO-style passage ranking, matching the second-stage RAG use case better than another embedding-only similarity pass.
  • The model shape is simple pairwise text classification/scoring: (query, document) -> score, which maps cleanly to OpenVINO Runtime and avoids introducing a heavier LLM worker for reranking.
  • INT8 OpenVINO IR keeps memory and compile/runtime cost low enough for a localhost sidecar and is already represented in the repo defaults: /home/will/.cache/openvino-models/rerankers/ms-marco-MiniLM-L6-v2-int8-ov.
  • The service can fail closed on startup: when OPENVINO_RERANKER_DEVICE=NPU but /sys/class/accel/accel0/device/npu_busy_time_us does not increase across the startup smoke inference, it refuses to report ready, preventing false "NPU-backed" claims.
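
The fail-closed check in the last bullet can be sketched as follows. This is a minimal illustration, not the code in server.py: the helper names read_busy_us and startup_smoke are hypothetical, and only the sysfs path and the npu_busy_delta_us field name come from this spec.

```python
from pathlib import Path
from typing import Callable

# Path taken from this spec; the helpers below are hypothetical.
NPU_BUSY_PATH = Path("/sys/class/accel/accel0/device/npu_busy_time_us")


def read_busy_us(path: Path = NPU_BUSY_PATH) -> int:
    """Return the cumulative NPU busy time in microseconds."""
    return int(path.read_text().strip())


def startup_smoke(run_inference: Callable[[], None], device: str,
                  path: Path = NPU_BUSY_PATH) -> dict:
    """Run one inference; in NPU mode, fail closed if busy time did not grow."""
    if device != "NPU":
        # Assumption: non-NPU modes skip the sysfs proof entirely.
        run_inference()
        return {"ok": True, "npu_busy_delta_us": None}
    before = read_busy_us(path)
    run_inference()
    delta = read_busy_us(path) - before
    if delta <= 0:
        raise RuntimeError(f"NPU requested but npu_busy_delta_us={delta}")
    return {"ok": True, "npu_busy_delta_us": delta}
```

Raising at startup, rather than silently falling back to CPU, keeps the service from ever answering /readyz positively while claiming an NPU it did not use.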

Runtime defaults:

OPENVINO_RERANKER_HOST=127.0.0.1
OPENVINO_RERANKER_PORT=18818
OPENVINO_RERANKER_DEVICE=NPU
OPENVINO_RERANKER_MODEL=cross-encoder/ms-marco-MiniLM-L6-v2
OPENVINO_RERANKER_MODEL_DIR=/home/will/.cache/openvino-models/rerankers/ms-marco-MiniLM-L6-v2-int8-ov
OPENVINO_RERANKER_MAX_LENGTH=512
OPENVINO_RERANKER_MAX_DOCUMENTS=100
OPENVINO_RERANKER_MAX_BODY_BYTES=5242880
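
One way to read these variables is sketched below. The RerankerConfig dataclass and load_config function are hypothetical names used for illustration; only the variable names and default values are taken from the list above, and server.py may parse them differently.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

PREFIX = "OPENVINO_RERANKER_"


@dataclass(frozen=True)
class RerankerConfig:
    # Defaults mirror the runtime defaults listed in this spec.
    host: str = "127.0.0.1"
    port: int = 18818
    device: str = "NPU"
    model: str = "cross-encoder/ms-marco-MiniLM-L6-v2"
    model_dir: str = ("/home/will/.cache/openvino-models/rerankers/"
                      "ms-marco-MiniLM-L6-v2-int8-ov")
    max_length: int = 512
    max_documents: int = 100
    max_body_bytes: int = 5242880  # 5 MiB


def load_config(env: Optional[Mapping[str, str]] = None) -> RerankerConfig:
    """Overlay OPENVINO_RERANKER_* variables on the defaults."""
    env = os.environ if env is None else env
    overrides = {}
    for name, default in vars(RerankerConfig()).items():
        raw = env.get(PREFIX + name.upper())
        if raw is not None:
            # Integer fields are parsed; a malformed value raises ValueError.
            overrides[name] = int(raw) if isinstance(default, int) else raw
    return RerankerConfig(**overrides)
```

Passing a plain dict instead of os.environ makes the parsing easy to exercise in tests without touching the process environment.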

Endpoint contract

Health and readiness

GET /healthz and GET /readyz return JSON.

/readyz must return HTTP 200 only when the model is loaded and startup smoke passed. For NPU mode, startup smoke must include a positive npu_busy_delta_us.

Representative ready response:

{
  "status": "ok",
  "ok": true,
  "service": "openvino-reranker",
  "model": "cross-encoder/ms-marco-MiniLM-L6-v2",
  "model_dir": "/home/will/.cache/openvino-models/rerankers/ms-marco-MiniLM-L6-v2-int8-ov",
  "device": "NPU",
  "available_devices": ["CPU", "NPU"],
  "max_length": 512,
  "startup_smoke": {"ok": true, "duration_ms": 12.3, "npu_busy_delta_us": 1234},
  "last_inference": null,
  "ready_error": null
}
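
The readiness rule can be expressed as a small pure function. The sketch below is an assumption about how such a gate might look: the function name readyz_status, the 503 code for the not-ready case, and the "not_ready" status string are illustrative and not confirmed by this spec.

```python
from typing import Optional, Tuple


def readyz_status(model_loaded: bool, device: str,
                  startup_smoke: Optional[dict]) -> Tuple[int, dict]:
    """Return (http_status, body) for /readyz under the rule described above."""
    error = None
    if not model_loaded:
        error = "model not loaded"
    elif not startup_smoke or startup_smoke.get("ok") is not True:
        error = "startup smoke did not pass"
    elif device == "NPU" and not (startup_smoke.get("npu_busy_delta_us") or 0) > 0:
        error = "startup smoke shows no positive npu_busy_delta_us"
    if error is None:
        return 200, {"status": "ok", "ok": True, "device": device,
                     "startup_smoke": startup_smoke, "ready_error": None}
    # Assumption: 503 and "not_ready" are placeholders for the not-ready reply.
    return 503, {"status": "not_ready", "ok": False, "device": device,
                 "startup_smoke": startup_smoke, "ready_error": error}
```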

Rerank

POST /rerank and compatibility alias POST /v1/rerank accept:

{
  "query": "how do I verify OpenVINO NPU usage?",
  "documents": [
    {"id": "good", "text": "Check /sys/class/accel/accel0/device/npu_busy_time_us before and after inference.", "metadata": {"source": "synthetic"}},
    {"id": "bad", "text": "This note is about making sourdough starter."}
  ],
  "top_k": 2,
  "return_documents": false
}

Compatibility notes:

  • documents may be strings or objects with id, text, and optional object metadata.
  • top_k is preferred; top_n is accepted for common reranker-client compatibility.
  • return_documents=false is recommended for RAG integration to avoid echoing private source text into logs or intermediate traces.
  • The optional model field may be sent by clients but is not used for routing; this sidecar serves one configured model.
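
A request normalizer that follows these notes might look like the sketch below. The function name normalize_request is hypothetical, and several behaviors are assumptions the spec does not state: string documents get id None, a missing top_k defaults to all documents, top_k is clamped to the document count, and exceeding the document limit is treated as a validation error.

```python
def normalize_request(payload: dict, max_documents: int = 100) -> dict:
    """Validate a rerank payload and coerce documents to a uniform shape."""
    query = payload.get("query")
    if not isinstance(query, str) or not query.strip():
        raise ValueError("query must be a non-empty string")
    raw_docs = payload.get("documents")
    if not isinstance(raw_docs, list) or not raw_docs:
        raise ValueError("documents must be a non-empty list")
    if len(raw_docs) > max_documents:
        raise ValueError(f"at most {max_documents} documents are accepted")
    docs = []
    for index, doc in enumerate(raw_docs):
        if isinstance(doc, str):
            doc = {"text": doc}  # bare strings become objects without an id
        if not isinstance(doc, dict):
            raise ValueError(f"documents[{index}] must be a string or object")
        text = doc.get("text")
        if not isinstance(text, str) or not text.strip():
            raise ValueError(f"documents[{index}].text must be a non-empty string")
        metadata = doc.get("metadata")
        if metadata is not None and not isinstance(metadata, dict):
            raise ValueError(f"documents[{index}].metadata must be an object")
        docs.append({"index": index, "id": doc.get("id"),
                     "text": text, "metadata": metadata})
    # top_k is preferred; top_n is the compatibility alias.
    top_k = payload.get("top_k", payload.get("top_n", len(docs)))
    if isinstance(top_k, bool) or not isinstance(top_k, int) or top_k <= 0:
        raise ValueError("top_k/top_n must be a positive integer")
    return {"query": query, "documents": docs,
            "top_k": min(top_k, len(docs)),
            "return_documents": bool(payload.get("return_documents", False))}
```

The model field is deliberately ignored here, matching the note that this sidecar serves one configured model.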

Successful response:

{
  "ok": true,
  "model": "cross-encoder/ms-marco-MiniLM-L6-v2",
  "device": "NPU",
  "query": "how do I verify OpenVINO NPU usage?",
  "input_count": 2,
  "top_k": 2,
  "duration_ms": 10.5,
  "npu_busy_delta_us": 1234,
  "results": [
    {"index": 0, "id": "good", "score": 8.1, "raw_score": 8.1, "probability": 0.9997},
    {"index": 1, "id": "bad", "score": -4.2, "raw_score": -4.2, "probability": 0.0148}
  ]
}
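
The probability values in this example are consistent with applying a logistic sigmoid to raw_score (8.1 maps to about 0.9997, -4.2 to about 0.0148). The sketch below assumes that relationship and a descending sort on raw score; build_results is a hypothetical helper, and the actual post-processing in server.py may differ.

```python
import math
from typing import List, Optional, Sequence


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def build_results(ids: Sequence[Optional[str]], raw_scores: Sequence[float],
                  top_k: int) -> List[dict]:
    """Pair each document with its score and keep the top_k by raw score."""
    rows = [
        {"index": i, "id": doc_id, "score": s, "raw_score": s,
         "probability": round(sigmoid(s), 4)}
        for i, (doc_id, s) in enumerate(zip(ids, raw_scores))
    ]
    # index keeps pointing at the position in the request, not the rank.
    rows.sort(key=lambda row: row["raw_score"], reverse=True)
    return rows[:top_k]
```

For instance, build_results(["bad", "good"], [-4.2, 8.1], 1) keeps only the "good" row and reports index 1, its position in the original request.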

Error response shape:

{"ok": false, "error": "human-readable error", "results": []}

Status behavior:

  • 400: malformed JSON or a schema violation, such as an empty query, missing/empty documents, invalid document text, or a non-positive/non-integer top_k/top_n.
  • 413: request body above OPENVINO_RERANKER_MAX_BODY_BYTES.
  • 503: model not ready.
  • 500: unexpected inference/runtime failure.
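
The status mapping above can be kept in one place by tagging exceptions with their HTTP status. This is only a sketch: the exception class names and error_response are hypothetical, while the status codes and the error body shape come from this spec.

```python
from typing import Tuple


class RequestError(Exception):
    """Client-side validation failure (hypothetical base class)."""
    status = 400


class BodyTooLarge(RequestError):
    status = 413


class NotReady(RequestError):
    status = 503


def error_response(exc: Exception) -> Tuple[int, dict]:
    """Map an exception to (http_status, error body); unknown errors become 500."""
    status = exc.status if isinstance(exc, RequestError) else 500
    message = str(exc) or type(exc).__name__
    return status, {"ok": False, "error": message, "results": []}


def check_body_size(content_length: int, max_body_bytes: int = 5242880) -> None:
    """Reject oversized bodies before reading them from the socket."""
    if content_length > max_body_bytes:
        raise BodyTooLarge(f"request body exceeds {max_body_bytes} bytes")
```

Checking Content-Length before reading the body means a 413 can be returned without buffering an oversized payload.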

CLI contract

Foreground-only review start:

ss -ltnp | grep ':18818\b' || true
cat /sys/class/accel/accel0/device/npu_busy_time_us
source /home/will/.venvs/openvino-reranker/bin/activate
OPENVINO_RERANKER_HOST=127.0.0.1 \
OPENVINO_RERANKER_PORT=18818 \
OPENVINO_RERANKER_DEVICE=NPU \
OPENVINO_RERANKER_MODEL_DIR=/home/will/.cache/openvino-models/rerankers/ms-marco-MiniLM-L6-v2-int8-ov \
python /home/will/lab/swarm/openvino-reranker-npu/server.py

Client smoke:

source /home/will/.venvs/openvino-reranker/bin/activate
python /home/will/lab/swarm/openvino-reranker-npu/smoke.py --url http://127.0.0.1:18818

Optional user-systemd unit exists as openvino-reranker.service, but this spec does not approve copying, starting, enabling, or wiring it into live paths.

Non-private smoke payload

Use only synthetic public-text fixtures. Do not query the Obsidian vault, private document directories, image folders, or live Chroma documents during smoke.

Minimum cases:

  1. Query: how do I verify OpenVINO NPU usage?
    • Expected top document: Check /sys/class/accel/accel0/device/npu_busy_time_us before and after inference.
    • Distractor: This note is about making sourdough starter.
  2. Query: what port does the reranker service use?
    • Expected top document: The OpenVINO reranker prototype listens locally on port 18818.
    • Distractor: Whisper transcription accepts audio uploads.
  3. Query: why should reranking not mutate vector collections?
    • Expected top document: Reranking is a read-only second-stage transformation after vector search.
    • Distractor: Boil pasta in salted water until al dente.

Pass criteria:

  • /readyz is HTTP 200 and reports device=NPU.
  • Every case returns ok=true and a sorted results list with the expected top id.
  • Response-level npu_busy_delta_us is positive for each case.
  • The externally measured sysfs delta (after minus before) is positive for each case, or at least across the full smoke batch.
  • Smoke script exits 0 and prints JSON with ok: true.
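
The per-case criteria could be checked with a helper like the one below. It is an illustrative sketch, not the logic in smoke.py; check_case is a hypothetical name, and it covers only the response-level criteria (the external sysfs delta is measured outside the response).

```python
from typing import List


def check_case(response: dict, expected_top_id: str) -> List[str]:
    """Return a list of failed criteria; an empty list means the case passed."""
    failures = []
    if response.get("ok") is not True:
        failures.append("ok is not true")
    results = response.get("results") or []
    scores = [row.get("score") for row in results]
    if not results:
        failures.append("results is empty")
    elif any(isinstance(s, bool) or not isinstance(s, (int, float)) for s in scores):
        failures.append("results contain a non-numeric score")
    elif scores != sorted(scores, reverse=True):
        failures.append("results are not sorted by descending score")
    elif results[0].get("id") != expected_top_id:
        failures.append(f"top id is {results[0].get('id')!r}, expected {expected_top_id!r}")
    delta = response.get("npu_busy_delta_us")
    if isinstance(delta, bool) or not isinstance(delta, (int, float)) or delta <= 0:
        failures.append("npu_busy_delta_us is not positive")
    return failures
```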

NPU busy-time verification plan

HTTP 200 alone is not proof of NPU execution. Verification must capture both the endpoint-reported delta and the sysfs-observed delta.

Procedure:

BUSY=/sys/class/accel/accel0/device/npu_busy_time_us
before=$(cat "$BUSY")
curl -fsS http://127.0.0.1:18818/rerank \
  -H 'Content-Type: application/json' \
  -d '{"query":"how do I verify OpenVINO NPU usage?","documents":[{"id":"good","text":"Check /sys/class/accel/accel0/device/npu_busy_time_us before and after inference."},{"id":"bad","text":"This note is about making sourdough starter."}],"top_k":2,"return_documents":false}' \
  | jq '{ok, device, npu_busy_delta_us, top_id:.results[0].id}'
after=$(cat "$BUSY")
echo "sysfs_npu_busy_delta_us=$((after-before))"

Acceptance:

  • device == "NPU".
  • Response npu_busy_delta_us > 0.
  • Shell-computed sysfs_npu_busy_delta_us > 0.
  • If any value is zero, negative, or missing, report the backend as CPU/unknown and do not claim NPU-backed reranking.
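
The same procedure and acceptance rule can be sketched in Python with only the standard library. The function names post_json and verify_npu are hypothetical; the URL, sysfs path, and field names come from this spec. The HTTP call and the sysfs read are injectable so the decision logic can be exercised without a running service.

```python
import json
import urllib.request
from pathlib import Path
from typing import Callable

BUSY = Path("/sys/class/accel/accel0/device/npu_busy_time_us")
RERANK_URL = "http://127.0.0.1:18818/rerank"


def post_json(url: str, payload: dict, timeout: float = 10.0) -> dict:
    """POST a JSON payload and decode the JSON reply."""
    request = urllib.request.Request(
        url, data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"}, method="POST")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


def verify_npu(payload: dict,
               post: Callable[[str, dict], dict] = post_json,
               read_busy: Callable[[], int] = lambda: int(BUSY.read_text()),
               url: str = RERANK_URL) -> dict:
    """Apply the acceptance rule: device, response delta, and sysfs delta."""
    before = read_busy()
    body = post(url, payload)
    sysfs_delta = read_busy() - before
    reported = body.get("npu_busy_delta_us")
    proven = (
        body.get("device") == "NPU"
        and isinstance(reported, (int, float)) and not isinstance(reported, bool)
        and reported > 0
        and sysfs_delta > 0
    )
    return {"backend": "NPU" if proven else "CPU/unknown",
            "response_npu_busy_delta_us": reported,
            "sysfs_npu_busy_delta_us": sysfs_delta}
```

Note that the sysfs counter is device-wide, so a positive external delta can also be caused by another NPU workload running at the same time; running the check on an otherwise idle NPU makes the result easier to interpret.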

Optional RAG second-stage integration plan (deferred)

This is a plan only. Do not enable it in live RAG without explicit approval.

Design:

  1. Keep existing vector search and Chroma collection obsidian_bge_npu unchanged.
  2. Retrieve more candidates from current vector search, e.g. initial_k=20.
  3. Send only request-time candidate snippets/ids to http://127.0.0.1:18818/rerank.
  4. Use reranker order to choose final top_k, e.g. 5.
  5. On timeout, connection error, invalid response, or non-positive NPU proof when proof is required, fall back to vector order and attach metadata like rerank_error; do not fail the whole RAG request unless explicitly configured.
  6. Log counters and latency, but avoid logging raw private document text.

Disabled-by-default knobs:

RAG_RERANK_ENABLED=false
RAG_RERANK_URL=http://127.0.0.1:18818/rerank
RAG_RERANK_INITIAL_K=20
RAG_RERANK_TOP_K=5
RAG_RERANK_TIMEOUT_MS=3000
RAG_RERANK_REQUIRE_NPU_PROOF=true
RAG_RERANK_RETURN_DOCUMENTS=false
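
The fallback behavior in design step 5 could be sketched as below. Nothing here exists in the RAG service today: rerank_with_fallback and the call_reranker callable are hypothetical, candidates are assumed to be dicts with id and text in vector order, and catching every exception is a deliberate simplification for the sketch.

```python
from typing import Callable, List, Tuple


def rerank_with_fallback(query: str, candidates: List[dict],
                         call_reranker: Callable[[dict], dict],
                         top_k: int = 5,
                         require_npu_proof: bool = True) -> Tuple[List[dict], dict]:
    """Return (final candidates, metadata); fall back to vector order on failure."""
    try:
        body = call_reranker({
            "query": query,
            "documents": [{"id": c["id"], "text": c["text"]} for c in candidates],
            "top_k": top_k,
            "return_documents": False,
        })
        if body.get("ok") is not True:
            raise ValueError(body.get("error") or "reranker returned ok=false")
        if require_npu_proof and not (body.get("npu_busy_delta_us") or 0) > 0:
            raise ValueError("no positive npu_busy_delta_us in response")
        by_id = {c["id"]: c for c in candidates}
        ordered = [by_id[row["id"]] for row in body["results"]]
        return ordered[:top_k], {"reranked": True}
    except Exception as exc:  # timeouts, connection errors, malformed replies
        # Only the exception type and message are recorded, never document text.
        meta = {"reranked": False, "rerank_error": f"{type(exc).__name__}: {exc}"}
        return candidates[:top_k], meta
```

The caller would retrieve RAG_RERANK_INITIAL_K candidates, pass them in vector order, and use the returned list as the final RAG_RERANK_TOP_K context regardless of which path was taken.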

Integration tests should use synthetic in-memory candidates first. Live-vault evaluation requires a separate approval and must not mutate or rebuild the vector collection.

Docs and diagram implications

If this prototype advances beyond spec/review, update these surfaces while keeping live/prototype labels clear:

  • openvino-reranker-npu/README.md: keep model/runtime, endpoint contract, smoke command, and approval gates synchronized with code.
  • swarm-common/obsidian-vault/will/will-shared-zap/Runbooks/OpenVINO NPU Services Runbook.md: list :18818 as prototype/not enabled, with foreground smoke and NPU sysfs proof.
  • Service catalog / architecture notes: show live baseline :18810, :18816, :18817; show :18818 as optional second-stage RAG prototype, not live routing.
  • Diagrams: render RAG :18810 -> optional reranker :18818 as dashed/disabled or "proposed"; do not imply Atlas/Hermes/gateway traffic is using it.
  • Optional systemd unit: document as installable after approval, not enabled by default.

No-go / defer criteria

Do not ship, enable, or integrate the reranker if any of these hold:

  • Port 18818 is already owned by another live service.
  • NPU is unavailable in ov.Core().available_devices or /sys/class/accel/accel0/device/npu_busy_time_us is missing.
  • Foreground startup smoke fails or has non-positive NPU busy-time delta while configured for NPU.
  • Synthetic smoke top-1 ranking fails or latency is unacceptable for the intended RAG timeout budget.
  • Model export requires overwriting the existing model directory or touching Chroma/vector collections.
  • The service must bind beyond 127.0.0.1 to be useful.
  • Live RAG integration would require reindexing, collection mutation, private-doc smoke, or Atlas/Hermes/gateway routing changes without explicit approval.
  • Logs or responses would persist raw private document text outside the existing RAG request path.
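
The mechanically checkable criteria (port ownership, NPU visibility, sysfs counter presence) lend themselves to a small preflight helper. The sketch below is illustrative only: port_in_use and preflight_blockers are hypothetical names, and the caller is expected to pass in the device list, for example from ov.Core().available_devices.

```python
import socket
from pathlib import Path
from typing import Iterable, List

BUSY_PATH = "/sys/class/accel/accel0/device/npu_busy_time_us"


def port_in_use(host: str = "127.0.0.1", port: int = 18818,
                timeout: float = 0.5) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        return sock.connect_ex((host, port)) == 0


def preflight_blockers(available_devices: Iterable[str],
                       busy_path: str = BUSY_PATH,
                       host: str = "127.0.0.1", port: int = 18818) -> List[str]:
    """Collect no-go reasons that can be checked without starting the service."""
    blockers = []
    if port_in_use(host, port):
        blockers.append(f"port {port} is already in use")
    if "NPU" not in set(available_devices):
        blockers.append("NPU is not in available_devices")
    if not Path(busy_path).is_file():
        blockers.append(f"{busy_path} is missing")
    return blockers
```

An empty list only means these three criteria did not trip; the remaining criteria (ranking quality, latency, approval gates) still need the foreground smoke and human review.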

Current local preflight observed during this spec pass

  • /sys/class/accel/accel0/device/npu_busy_time_us is readable.
  • /home/will/.cache/openvino-models/rerankers/ms-marco-MiniLM-L6-v2-int8-ov is present.
  • /home/will/.venvs/openvino-reranker/bin/python is present.
  • :18818 was not listening during preflight.
  • server.py and smoke.py pass python -m py_compile.

These observations are preflight only; they are not a live service/NPU smoke result.