feat(npu): add OpenVINO reranker prototype

2026-06-04 13:07:51 -07:00
parent 0a6f84fbf3
commit 0683253157
6 changed files with 1027 additions and 0 deletions
@@ -0,0 +1,150 @@
+# OpenVINO NPU reranker service
+
+Local-first cross-encoder reranker prototype for second-stage RAG ranking.
+
+- Default bind: `127.0.0.1:18818`
+- Default model: `cross-encoder/ms-marco-MiniLM-L6-v2`
+- Default device: `NPU`
+- Model cache: `/home/will/.cache/openvino-models/rerankers/ms-marco-MiniLM-L6-v2-int8-ov/`
+- NPU proof: `/sys/class/accel/accel0/device/npu_busy_time_us` delta before/after inference
+
+This service is intentionally not wired into live RAG by default.
+
+## Files
+
+- `SPEC.md` — endpoint/CLI contract, model/runtime recommendation, smoke/NPU proof plan, RAG integration plan, docs implications, and no-go criteria.
+- `server.py` — stdlib HTTP OpenVINO Runtime service with fail-fast localhost listener conflict checks and request validation.
+- `smoke.py` — non-private API/ranking/NPU busy-time smoke test.
+- `tests/test_server_validation.py` — stdlib unit checks for request validation and listener conflict detection.
+- `openvino-reranker.service` — optional user-systemd unit.
+
+## One-time setup
+
+Use a separate venv so the existing Whisper/embeddings NPU venv is not perturbed:
+
+```bash
+python -m venv /home/will/.venvs/openvino-reranker
+source /home/will/.venvs/openvino-reranker/bin/activate
+python -m pip install -U pip
+python -m pip install "openvino>=2026.2" "optimum-intel[openvino]" transformers tokenizers nncf numpy
+```
+
+Export the model:
+
+```bash
+source /home/will/.venvs/openvino-reranker/bin/activate
+optimum-cli export openvino \
+  --model cross-encoder/ms-marco-MiniLM-L6-v2 \
+  --task text-classification \
+  --weight-format int8 \
+  /home/will/.cache/openvino-models/rerankers/ms-marco-MiniLM-L6-v2-int8-ov
+```
+
+If INT8 export or NPU compile fails, export an FP16/FP32 IR to a separate directory and point `OPENVINO_RERANKER_MODEL_DIR` at it while debugging. Do not overwrite existing vector/RAG/Chroma collections.
+
+## Run in foreground
+
+Check the port and NPU counter first:
+
+```bash
+ss -ltnp | grep ':18818 ' || true
+cat /sys/class/accel/accel0/device/npu_busy_time_us
+```
+
+Start locally:
+
+```bash
+source /home/will/.venvs/openvino-reranker/bin/activate
+OPENVINO_RERANKER_HOST=127.0.0.1 \
+OPENVINO_RERANKER_PORT=18818 \
+OPENVINO_RERANKER_DEVICE=NPU \
+OPENVINO_RERANKER_MODEL_DIR=/home/will/.cache/openvino-models/rerankers/ms-marco-MiniLM-L6-v2-int8-ov \
+python /home/will/lab/swarm/openvino-reranker-npu/server.py
+```
+
+Startup performs a non-private smoke inference and fails closed when `OPENVINO_RERANKER_DEVICE=NPU` but `npu_busy_time_us` does not increase. Before compiling the OpenVINO model it also probes whether the requested address can be bound, so obvious port conflicts fail fast. The probe socket is closed again and the real server bind happens right after model load, so a conflict that appears in between still surfaces at that point.
+
+## API
+
+Health:
+
+```bash
+curl -sS http://127.0.0.1:18818/healthz | jq
+curl -sS http://127.0.0.1:18818/readyz | jq
+```
+
+Rerank:
+
+```bash
+curl -sS http://127.0.0.1:18818/rerank \
+  -H 'Content-Type: application/json' \
+  -d '{
+    "query":"how do I verify OpenVINO NPU usage?",
+    "documents":[
+      {"id":"good","text":"Check /sys/class/accel/accel0/device/npu_busy_time_us before and after inference."},
+      {"id":"bad","text":"This note is about making sourdough starter."}
+    ],
+    "top_k":2
+  }' | jq
+```
+
+Compatibility alias:
+
+```bash
+curl -sS http://127.0.0.1:18818/v1/rerank \
+  -H 'Content-Type: application/json' \
+  -d '{"model":"local-reranker","query":"npu busy time","documents":["OpenVINO NPU busy time proves accelerator use."],"top_n":1}' | jq
+```
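For callers that prefer Python over `curl`, the same request could be built with the standard library roughly as follows. This is a minimal sketch, not part of the commit: `build_rerank_request` and `top_ids` are hypothetical helper names, and the response is assumed to follow the shape documented in `SPEC.md`.

```python
import json
import urllib.request

RERANK_URL = "http://127.0.0.1:18818/rerank"  # default bind from this README


def build_rerank_request(query, documents, top_k, url=RERANK_URL):
    """Build (but do not send) a POST request for the /rerank endpoint."""
    body = json.dumps(
        {
            "query": query,
            "documents": documents,
            "top_k": top_k,
            # Ask the service not to echo document text back in the response.
            "return_documents": False,
        }
    ).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def top_ids(response):
    """Return result ids in ranked order from a decoded /rerank response."""
    return [item.get("id") for item in response.get("results", [])]
```

With the service running, `urllib.request.urlopen(build_rerank_request(...), timeout=3)` sends the request, and `json.load` on the response object yields the dict that `top_ids` expects.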
+
+## Smoke test
+
+```bash
+source /home/will/.venvs/openvino-reranker/bin/activate
+python /home/will/lab/swarm/openvino-reranker-npu/smoke.py --url http://127.0.0.1:18818
+```
+
+Expected:
+
+- `/readyz` is HTTP 200 and reports `device=NPU`.
+- Each fixture returns `ok=true` and a sorted `results` list.
+- The top result matches the non-private fixture expectation.
+- Both the response's `npu_busy_delta_us` and the sysfs `npu_busy_time_us` delta are positive.
+
+## Validation checks
+
+```bash
+source /home/will/.venvs/openvino-reranker/bin/activate
+PYTHONPATH=/home/will/lab/swarm/openvino-reranker-npu \
+  python -m unittest discover -s /home/will/lab/swarm/openvino-reranker-npu/tests
+```
+
+These checks do not compile the OpenVINO model; they cover request validation and fail-fast listener conflict detection.
+
+## Optional systemd user service
+
+Install the unit only after the foreground command and smoke test pass:
+
+```bash
+cp /home/will/lab/swarm/openvino-reranker-npu/openvino-reranker.service /home/will/.config/systemd/user/openvino-reranker.service
+systemctl --user daemon-reload
+systemctl --user start openvino-reranker.service
+systemctl --user status openvino-reranker.service --no-pager
+journalctl --user -u openvino-reranker.service -n 100 --no-pager
+```
+
+Do not enable or integrate it into live RAG without explicit approval.
+
+## Optional RAG integration plan (disabled by default)
+
+RAG should keep vector search against `obsidian_bge_npu` unchanged, retrieve a larger candidate set, and call this service as a read-only request-time second stage. Suggested disabled-by-default knobs:
+
+```text
+RAG_RERANK_ENABLED=false
+RAG_RERANK_URL=http://127.0.0.1:18818/rerank
+RAG_RERANK_INITIAL_K=20
+RAG_RERANK_TOP_K=5
+RAG_RERANK_TIMEOUT_MS=3000
+```
+
+On reranker timeout/error, fall back to vector order and include metadata such as `rerank_error`; do not mutate or reindex Chroma collections.
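A minimal sketch of how a RAG process might read the knobs above, assuming they are supplied as environment variables. `RerankConfig` and `load_rerank_config` are hypothetical names, not part of this commit. The accepted truthy values mirror how `server.py` parses `OPENVINO_RERANKER_SKIP_STARTUP_SMOKE`, and clamping `top_k` to `initial_k` is an illustrative choice.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class RerankConfig:
    enabled: bool
    url: str
    initial_k: int
    top_k: int
    timeout_s: float


def load_rerank_config(env=None):
    """Read the RAG_RERANK_* knobs, defaulting to the disabled configuration."""
    env = os.environ if env is None else env
    enabled = env.get("RAG_RERANK_ENABLED", "false").strip().lower() in {"1", "true", "yes"}
    initial_k = int(env.get("RAG_RERANK_INITIAL_K", "20"))
    top_k = int(env.get("RAG_RERANK_TOP_K", "5"))
    if initial_k < 1 or top_k < 1:
        raise ValueError("RAG_RERANK_INITIAL_K and RAG_RERANK_TOP_K must be positive")
    return RerankConfig(
        enabled=enabled,
        url=env.get("RAG_RERANK_URL", "http://127.0.0.1:18818/rerank"),
        initial_k=initial_k,
        # The final list cannot be longer than the candidate set.
        top_k=min(top_k, initial_k),
        timeout_s=int(env.get("RAG_RERANK_TIMEOUT_MS", "3000")) / 1000.0,
    )
```

Passing a plain dict instead of `os.environ` keeps the parser easy to exercise in unit tests without touching the real environment.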
@@ -0,0 +1,243 @@
+# OpenVINO NPU reranker service spec
+
+Status: proposed localhost prototype; not live RAG integration.
+Target port: `127.0.0.1:18818`.
+Safety posture: foreground smoke first, no persistent enablement, no Atlas/Hermes/RAG routing changes without Will's explicit approval.
+
+## Recommendation
+
+Use `cross-encoder/ms-marco-MiniLM-L6-v2`, exported to OpenVINO IR as INT8, served by the local stdlib HTTP service in `server.py` on OpenVINO Runtime `NPU`.
+
+Why this choice:
+
+- It is a small BERT-family cross-encoder reranker intended for MS MARCO-style passage ranking, matching the second-stage RAG use case better than another embedding-only similarity pass.
+- The model shape is simple pairwise text classification/scoring: `(query, document) -> score`, which maps cleanly to OpenVINO Runtime and avoids introducing a heavier LLM worker for reranking.
+- INT8 OpenVINO IR keeps memory and compile/runtime cost low enough for a localhost sidecar and is already represented in the repo defaults:
+  `/home/will/.cache/openvino-models/rerankers/ms-marco-MiniLM-L6-v2-int8-ov`.
+- The service can fail closed on startup when `OPENVINO_RERANKER_DEVICE=NPU` but `/sys/class/accel/accel0/device/npu_busy_time_us` does not increase, preventing false "NPU-backed" claims.
+
+Runtime default:
+
+```text
+OPENVINO_RERANKER_HOST=127.0.0.1
+OPENVINO_RERANKER_PORT=18818
+OPENVINO_RERANKER_DEVICE=NPU
+OPENVINO_RERANKER_MODEL=cross-encoder/ms-marco-MiniLM-L6-v2
+OPENVINO_RERANKER_MODEL_DIR=/home/will/.cache/openvino-models/rerankers/ms-marco-MiniLM-L6-v2-int8-ov
+OPENVINO_RERANKER_MAX_LENGTH=512
+OPENVINO_RERANKER_MAX_DOCUMENTS=100
+OPENVINO_RERANKER_MAX_BODY_BYTES=5242880
+```
+
+## Endpoint contract
+
+### Health and readiness
+
+`GET /healthz` and `GET /readyz` return JSON.
+
+`/readyz` must return HTTP 200 only when the model is loaded and startup smoke passed. For NPU mode, startup smoke must include a positive `npu_busy_delta_us`.
+
+Representative ready response:
+
+```json
+{
+  "status": "ok",
+  "ok": true,
+  "service": "openvino-reranker",
+  "model": "cross-encoder/ms-marco-MiniLM-L6-v2",
+  "model_dir": "/home/will/.cache/openvino-models/rerankers/ms-marco-MiniLM-L6-v2-int8-ov",
+  "device": "NPU",
+  "available_devices": ["CPU", "NPU"],
+  "max_length": 512,
+  "startup_smoke": {"ok": true, "duration_ms": 12.3, "npu_busy_delta_us": 1234},
+  "last_inference": null,
+  "ready_error": null
+}
+```
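A client that wants to enforce the readiness rule itself, rather than trusting the HTTP status alone, could inspect the payload along these lines. This is a sketch under the assumption that the payload matches the representative response above; `readiness_problem` is a hypothetical helper, not part of the service.

```python
def readiness_problem(payload):
    """Return None when a decoded /readyz payload looks ready, else a short reason."""
    if payload.get("ok") is not True:
        return payload.get("ready_error") or "service reports ok=false"
    smoke = payload.get("startup_smoke") or {}
    if not smoke.get("ok"):
        return "startup smoke missing or failed"
    if payload.get("device") == "NPU":
        delta = smoke.get("npu_busy_delta_us")
        # NPU mode requires evidence that the accelerator was actually busy.
        if not isinstance(delta, int) or delta <= 0:
            return "NPU configured but startup npu_busy_delta_us is not positive"
    return None
```

The check is deliberately stricter than a bare status code: a payload with no `startup_smoke` record is treated as not ready.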
+
+### Rerank
+
+`POST /rerank` and compatibility alias `POST /v1/rerank` accept:
+
+```json
+{
+  "query": "how do I verify OpenVINO NPU usage?",
+  "documents": [
+    {"id": "good", "text": "Check /sys/class/accel/accel0/device/npu_busy_time_us before and after inference.", "metadata": {"source": "synthetic"}},
+    {"id": "bad", "text": "This note is about making sourdough starter."}
+  ],
+  "top_k": 2,
+  "return_documents": false
+}
+```
+
+Compatibility notes:
+
+- `documents` may be strings or objects with `id`, `text`, and optional object `metadata`.
+- `top_k` is preferred; `top_n` is accepted for common reranker-client compatibility.
+- `return_documents=false` is recommended for RAG integration to avoid echoing private source text into logs or intermediate traces.
+- The optional `model` field may be sent by clients but is not used for routing; this sidecar serves one configured model.
+
+Successful response:
+
+```json
+{
+  "ok": true,
+  "model": "cross-encoder/ms-marco-MiniLM-L6-v2",
+  "device": "NPU",
+  "query": "how do I verify OpenVINO NPU usage?",
+  "input_count": 2,
+  "top_k": 2,
+  "duration_ms": 10.5,
+  "npu_busy_delta_us": 1234,
+  "results": [
+    {"index": 0, "id": "good", "score": 8.1, "raw_score": 8.1, "probability": 0.9997},
+    {"index": 1, "id": "bad", "score": -4.2, "raw_score": -4.2, "probability": 0.0148}
+  ]
+}
+```
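The `probability` values in this example are consistent with a logistic sigmoid of `raw_score`, which is what `server.py` in this commit applies when the model emits a single logit. A short worked check, offered as a sketch rather than the service's exact code path:

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


print(round(sigmoid(8.1), 4))   # 0.9997
print(round(sigmoid(-4.2), 4))  # 0.0148
```

Because the sigmoid is monotonic, sorting by `score` and sorting by `probability` give the same order for single-logit outputs.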
+
+Error response shape:
+
+```json
+{"ok": false, "error": "human-readable error", "results": []}
+```
+
+Status behavior:
+
+- 400: malformed JSON, a non-object JSON body, empty query, missing/empty documents, more documents than `OPENVINO_RERANKER_MAX_DOCUMENTS`, invalid document text, or non-positive/non-integer `top_k`/`top_n`.
+- 413: request body above `OPENVINO_RERANKER_MAX_BODY_BYTES`.
+- 503: model not ready.
+- 500: unexpected inference/runtime failure.
+
+## CLI contract
+
+Foreground-only review start:
+
+```bash
+ss -ltnp | grep ':18818\b' || true
+cat /sys/class/accel/accel0/device/npu_busy_time_us
+source /home/will/.venvs/openvino-reranker/bin/activate
+OPENVINO_RERANKER_HOST=127.0.0.1 \
+OPENVINO_RERANKER_PORT=18818 \
+OPENVINO_RERANKER_DEVICE=NPU \
+OPENVINO_RERANKER_MODEL_DIR=/home/will/.cache/openvino-models/rerankers/ms-marco-MiniLM-L6-v2-int8-ov \
+python /home/will/lab/swarm/openvino-reranker-npu/server.py
+```
+
+Client smoke:
+
+```bash
+source /home/will/.venvs/openvino-reranker/bin/activate
+python /home/will/lab/swarm/openvino-reranker-npu/smoke.py --url http://127.0.0.1:18818
+```
+
+Optional user-systemd unit exists as `openvino-reranker.service`, but this spec does not approve copying, starting, enabling, or wiring it into live paths.
+
+## Non-private smoke payload
+
+Use only synthetic public-text fixtures. Do not query the Obsidian vault, private document directories, image folders, or live Chroma documents during smoke.
+
+Minimum cases:
+
+1. Query: `how do I verify OpenVINO NPU usage?`
+   - Expected top document: `Check /sys/class/accel/accel0/device/npu_busy_time_us before and after inference.`
+   - Distractor: `This note is about making sourdough starter.`
+2. Query: `what port does the reranker service use?`
+   - Expected top document: `The OpenVINO reranker prototype listens locally on port 18818.`
+   - Distractor: `Whisper transcription accepts audio uploads.`
+3. Query: `why should reranking not mutate vector collections?`
+   - Expected top document: `Reranking is a read-only second-stage transformation after vector search.`
+   - Distractor: `Boil pasta in salted water until al dente.`
+
+Pass criteria:
+
+- `/readyz` is HTTP 200 and reports `device=NPU`.
+- Every case returns `ok=true` and a sorted `results` list with the expected top `id`.
+- Response-level `npu_busy_delta_us` is positive for each case.
+- External sysfs `after - before` is positive for each case or at least for the full smoke batch.
+- Smoke script exits 0 and prints JSON with `ok: true`.
+
+## NPU busy-time verification plan
+
+HTTP 200 is not proof. Verification must capture both endpoint-reported and sysfs-observed deltas.
+
+Procedure:
+
+```bash
+BUSY=/sys/class/accel/accel0/device/npu_busy_time_us
+before=$(cat "$BUSY")
+curl -fsS http://127.0.0.1:18818/rerank \
+  -H 'Content-Type: application/json' \
+  -d '{"query":"how do I verify OpenVINO NPU usage?","documents":[{"id":"good","text":"Check /sys/class/accel/accel0/device/npu_busy_time_us before and after inference."},{"id":"bad","text":"This note is about making sourdough starter."}],"top_k":2,"return_documents":false}' \
+  | jq '{ok, device, npu_busy_delta_us, top_id:.results[0].id}'
+after=$(cat "$BUSY")
+echo "sysfs_npu_busy_delta_us=$((after-before))"
+```
+
+Acceptance:
+
+- `device == "NPU"`.
+- Response `npu_busy_delta_us > 0`.
+- Shell-computed `sysfs_npu_busy_delta_us > 0`.
+- If any value is zero/negative/missing, call the result CPU/unknown and do not claim NPU-backed reranking.
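The same before/after measurement can be done from Python, which may be convenient inside a test harness. The following is a minimal sketch of the procedure and acceptance rules above; `read_counter`, `busy_delta`, and `npu_proven` are hypothetical names, not part of this commit.

```python
from contextlib import contextmanager
from pathlib import Path

BUSY = Path("/sys/class/accel/accel0/device/npu_busy_time_us")


def read_counter(path=BUSY):
    """Return the counter as an int, or None if it is missing or unreadable."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


@contextmanager
def busy_delta(path=BUSY):
    """Yield a dict whose 'delta_us' is filled in when the block exits."""
    result = {"delta_us": None}
    before = read_counter(path)
    try:
        yield result
    finally:
        after = read_counter(path)
        if before is not None and after is not None:
            result["delta_us"] = after - before


def npu_proven(device, response_delta, sysfs_delta):
    """Apply the acceptance rules: NPU device and both deltas strictly positive."""
    return (
        device == "NPU"
        and isinstance(response_delta, int) and response_delta > 0
        and isinstance(sysfs_delta, int) and sysfs_delta > 0
    )
```

Wrapping the HTTP call in `with busy_delta() as measured:` gives the sysfs delta in `measured["delta_us"]` afterwards; a missing counter leaves it as `None`, which `npu_proven` treats as unproven.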
+
+## Optional RAG second-stage integration plan (deferred)
+
+This is a plan only. Do not enable it in live RAG without explicit approval.
+
+Design:
+
+1. Keep existing vector search and Chroma collection `obsidian_bge_npu` unchanged.
+2. Retrieve more candidates from current vector search, e.g. `initial_k=20`.
+3. Send only request-time candidate snippets/ids to `http://127.0.0.1:18818/rerank`.
+4. Use reranker order to choose final `top_k`, e.g. `5`.
+5. On timeout, connection error, invalid response, or non-positive NPU proof when proof is required, fall back to vector order and attach metadata like `rerank_error`; do not fail the whole RAG request unless explicitly configured.
+6. Log counters and latency, but avoid logging raw private document text.
+
+Disabled-by-default knobs:
+
+```text
+RAG_RERANK_ENABLED=false
+RAG_RERANK_URL=http://127.0.0.1:18818/rerank
+RAG_RERANK_INITIAL_K=20
+RAG_RERANK_TOP_K=5
+RAG_RERANK_TIMEOUT_MS=3000
+RAG_RERANK_REQUIRE_NPU_PROOF=true
+RAG_RERANK_RETURN_DOCUMENTS=false
+```
+
+Integration tests should use synthetic in-memory candidates first. Live-vault evaluation requires a separate approval and must not mutate or rebuild the vector collection.
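The fallback rule in step 5 can be sketched as a small wrapper around whatever function performs the HTTP call. This is an illustration only: `rerank_or_fallback` and the `rerank_fn` callable are hypothetical, and candidates are assumed to be dicts that carry a unique `id`.

```python
def rerank_or_fallback(candidates, rerank_fn, top_k, require_npu_proof=True):
    """Return (final_candidates, metadata) without raising on reranker failure.

    candidates: list of dicts with "id" and "text", already in vector order.
    rerank_fn:  callable(candidates) -> decoded /rerank response dict; it is
                expected to enforce its own timeout (hypothetical interface).
    """
    try:
        response = rerank_fn(candidates)
        if not response.get("ok"):
            raise ValueError(response.get("error") or "reranker returned ok=false")
        delta = response.get("npu_busy_delta_us")
        if require_npu_proof and not (isinstance(delta, int) and delta > 0):
            raise ValueError("non-positive npu_busy_delta_us")
        by_id = {c["id"]: c for c in candidates}
        # Unknown or missing ids raise KeyError and trigger the fallback.
        ranked = [by_id[r["id"]] for r in response["results"]]
    except Exception as exc:  # timeout, connection error, bad shape, missing proof
        meta = {"reranked": False, "rerank_error": f"{type(exc).__name__}: {exc}"}
        return candidates[:top_k], meta
    return ranked[:top_k], {"reranked": True}
```

Injecting `rerank_fn` keeps the wrapper testable with synthetic in-memory candidates, in line with the testing guidance above, and the error string records only the exception type and message rather than document text.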
+
+## Docs and diagram implications
+
+If this prototype advances beyond spec/review, update these surfaces while keeping live/prototype labels clear:
+
+- `openvino-reranker-npu/README.md`: keep model/runtime, endpoint contract, smoke command, and approval gates synchronized with code.
+- `swarm-common/obsidian-vault/will/will-shared-zap/Runbooks/OpenVINO NPU Services Runbook.md`: list `:18818` as prototype/not enabled, with foreground smoke and NPU sysfs proof.
+- Service catalog / architecture notes: show live baseline `:18810`, `:18816`, `:18817`; show `:18818` as optional second-stage RAG prototype, not live routing.
+- Diagrams: render `RAG :18810 -> optional reranker :18818` as dashed/disabled or "proposed"; do not imply Atlas/Hermes/gateway traffic is using it.
+- Optional systemd unit: document as installable after approval, not enabled by default.
+
+## No-go / defer criteria
+
+Do not ship, enable, or integrate the reranker if any of these hold:
+
+- Port `18818` is already owned by another live service.
+- `NPU` is unavailable in `ov.Core().available_devices` or `/sys/class/accel/accel0/device/npu_busy_time_us` is missing.
+- Foreground startup smoke fails or has non-positive NPU busy-time delta while configured for NPU.
+- Synthetic smoke top-1 ranking fails or latency is unacceptable for the intended RAG timeout budget.
+- Model export requires overwriting the existing model directory or touching Chroma/vector collections.
+- The service must bind beyond `127.0.0.1` to be useful.
+- Live RAG integration would require reindexing, collection mutation, private-doc smoke, or Atlas/Hermes/gateway routing changes without explicit approval.
+- Logs or responses would persist raw private document text outside the existing RAG request path.
+
+## Current local preflight observed during this spec pass
+
+- `/sys/class/accel/accel0/device/npu_busy_time_us` is readable.
+- `/home/will/.cache/openvino-models/rerankers/ms-marco-MiniLM-L6-v2-int8-ov` is present.
+- `/home/will/.venvs/openvino-reranker/bin/python` is present.
+- `:18818` was not listening during preflight.
+- `server.py` and `smoke.py` pass `python -m py_compile`.
+
+These observations are preflight only; they are not a live service/NPU smoke result.
@@ -0,0 +1,19 @@
+[Unit]
+Description=OpenVINO NPU Reranker HTTP Service (port 18818)
+After=network-online.target
+
+[Service]
+Type=simple
+WorkingDirectory=/home/will/lab/swarm/openvino-reranker-npu
+Environment=OPENVINO_RERANKER_HOST=127.0.0.1
+Environment=OPENVINO_RERANKER_PORT=18818
+Environment=OPENVINO_RERANKER_MODEL=cross-encoder/ms-marco-MiniLM-L6-v2
+Environment=OPENVINO_RERANKER_MODEL_DIR=/home/will/.cache/openvino-models/rerankers/ms-marco-MiniLM-L6-v2-int8-ov
+Environment=OPENVINO_RERANKER_DEVICE=NPU
+Environment=OPENVINO_RERANKER_MAX_LENGTH=512
+ExecStart=/home/will/.venvs/openvino-reranker/bin/python /home/will/lab/swarm/openvino-reranker-npu/server.py
+Restart=on-failure
+RestartSec=5
+
+[Install]
+WantedBy=default.target
@@ -0,0 +1,393 @@
+#!/usr/bin/env python3
+"""OpenVINO NPU cross-encoder reranker HTTP service.
+
+Default port: 18818
+Default model: cross-encoder/ms-marco-MiniLM-L6-v2 exported as OpenVINO IR
+Default device: NPU
+
+Endpoints:
+  GET  /, /healthz, /readyz
+  POST /rerank
+  POST /v1/rerank
+"""
+from __future__ import annotations
+
+import argparse
+import json
+import math
+import os
+import socket
+import sys
+import threading
+import time
+from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
+from pathlib import Path
+from typing import Any
+
+import numpy as np
+import openvino as ov
+from transformers import AutoTokenizer
+
+DEFAULT_MODEL_ID = "cross-encoder/ms-marco-MiniLM-L6-v2"
+DEFAULT_MODEL_DIR = Path("/home/will/.cache/openvino-models/rerankers/ms-marco-MiniLM-L6-v2-int8-ov")
+DEFAULT_PORT = 18818
+DEFAULT_MAX_LENGTH = 512
+DEFAULT_MAX_DOCUMENTS = 100
+DEFAULT_MAX_BODY_BYTES = 5 * 1024 * 1024
+NPU_BUSY_FILE = Path("/sys/class/accel/accel0/device/npu_busy_time_us")
+
+
+def npu_busy_time_us() -> int | None:
+    try:
+        return int(NPU_BUSY_FILE.read_text().strip())
+    except Exception:
+        return None
+
+
+def sigmoid(x: float) -> float:
+    if x >= 0:
+        z = math.exp(-x)
+        return 1.0 / (1.0 + z)
+    z = math.exp(x)
+    return z / (1.0 + z)
+
+
+def softmax_prob(logits: np.ndarray, index: int = 1) -> float:
+    row = np.asarray(logits, dtype=np.float64).reshape(-1)
+    shifted = row - np.max(row)
+    probs = np.exp(shifted) / np.sum(np.exp(shifted))
+    return float(probs[index])
+
+
+class RerankerService:
+    def __init__(
+        self,
+        model_dir: Path,
+        model_id: str,
+        device: str,
+        max_length: int,
+        startup_smoke: bool = True,
+    ) -> None:
+        self.model_dir = model_dir
+        self.model_id = model_id
+        self.device = device
+        self.max_length = int(max_length)
+        self.loaded_at = time.time()
+        self.lock = threading.Lock()
+        self.last_inference: dict[str, Any] | None = None
+        self.startup_smoke: dict[str, Any] | None = None
+        self.ready = False
+        self.ready_error: str | None = None
+
+        if not self.model_dir.exists():
+            raise FileNotFoundError(f"model directory not found: {self.model_dir}")
+
+        self.core = ov.Core()
+        self.available_devices = list(self.core.available_devices)
+        if self.device not in self.available_devices:
+            raise RuntimeError(f"OpenVINO device {self.device!r} unavailable; available={self.available_devices}")
+
+        xml_path = self.model_dir / "openvino_model.xml"
+        if not xml_path.exists():
+            raise FileNotFoundError(f"OpenVINO IR not found: {xml_path}")
+
+        self.tokenizer = AutoTokenizer.from_pretrained(str(self.model_dir), local_files_only=True)
+        model = self.core.read_model(str(xml_path))
+        self._reshape_static(model)
+        self.compiled = self.core.compile_model(model, self.device)
+        self.input_names = {inp.get_any_name() for inp in self.compiled.inputs}
+        self.output = self.compiled.output(0)
+
+        if startup_smoke:
+            try:
+                smoke = self.rerank(
+                    "npu busy time",
+                    [{"id": "smoke", "text": "OpenVINO NPU usage is verified by npu_busy_time_us."}],
+                    top_k=1,
+                    return_documents=False,
+                )
+                self.startup_smoke = {
+                    "ok": bool(smoke.get("ok")),
+                    "duration_ms": smoke.get("duration_ms"),
+                    "npu_busy_delta_us": smoke.get("npu_busy_delta_us"),
+                }
+                if self.device == "NPU" and int(smoke.get("npu_busy_delta_us") or 0) <= 0:
+                    raise RuntimeError("startup smoke did not increase npu_busy_time_us")
+            except Exception as exc:
+                self.ready_error = f"startup smoke failed: {type(exc).__name__}: {exc}"
+                raise
+
+        self.ready = True
+
+    def _reshape_static(self, model: ov.Model) -> None:
+        shape_by_name: dict[str, list[int]] = {}
+        for inp in model.inputs:
+            name = inp.get_any_name()
+            if name in {"input_ids", "attention_mask", "token_type_ids"}:
+                shape_by_name[name] = [1, self.max_length]
+        if shape_by_name:
+            model.reshape(shape_by_name)
+
+    def _tokenize(self, query: str, document: str) -> dict[str, np.ndarray]:
+        tokens = self.tokenizer(
+            query,
+            document,
+            max_length=self.max_length,
+            padding="max_length",
+            truncation=True,
+            return_tensors="np",
+        )
+        return {name: np.asarray(value) for name, value in tokens.items() if name in self.input_names}
+
+    def _score_pair(self, query: str, document: str) -> dict[str, float | None]:
+        inputs = self._tokenize(query, document)
+        missing = self.input_names - set(inputs)
+        # Some exported BERT models do not use token_type_ids. input_ids and attention_mask are required.
+        required_missing = missing & {"input_ids", "attention_mask"}
+        if required_missing:
+            raise RuntimeError(f"tokenizer did not produce required inputs: {sorted(required_missing)}")
+        outputs = self.compiled(inputs)
+        logits = np.asarray(outputs[self.output])
+        flat = logits.reshape(-1)
+        if flat.size == 1:
+            raw = float(flat[0])
+            return {"score": raw, "raw_score": raw, "probability": sigmoid(raw)}
+        if flat.size >= 2:
+            raw = float(flat[1])
+            return {"score": raw, "raw_score": raw, "probability": softmax_prob(flat, 1)}
+        raise RuntimeError(f"unexpected empty logits shape: {list(logits.shape)}")
+
+    def rerank(
+        self,
+        query: str,
+        documents: list[dict[str, Any]],
+        *,
+        top_k: int | None,
+        return_documents: bool = True,
+    ) -> dict[str, Any]:
+        started = time.perf_counter()
+        results: list[dict[str, Any]] = []
+        with self.lock:
+            # Sample the counter while holding the lock so concurrent rerank requests are not counted in this delta.
+            before = npu_busy_time_us()
+            for idx, doc in enumerate(documents):
+                scored = self._score_pair(query, str(doc["text"]))
+                item: dict[str, Any] = {
+                    "index": idx,
+                    "score": scored["score"],
+                    "raw_score": scored["raw_score"],
+                    "probability": scored["probability"],
+                }
+                if doc.get("id") is not None:
+                    item["id"] = doc.get("id")
+                if return_documents:
+                    item["text"] = doc["text"]
+                    item["metadata"] = doc.get("metadata") if isinstance(doc.get("metadata"), dict) else {}
+                results.append(item)
+            after = npu_busy_time_us()
+        results.sort(key=lambda item: (-float(item["score"]), int(item["index"])))
+        clamped_top_k = len(results) if top_k is None else max(1, min(int(top_k), len(results)))
+        duration_ms = round((time.perf_counter() - started) * 1000, 3)
+        npu_delta = None if before is None or after is None else after - before
+        payload = {
+            "ok": True,
+            "model": self.model_id,
+            "model_dir": str(self.model_dir),
+            "device": self.device,
+            "query": query,
+            "input_count": len(documents),
+            "top_k": clamped_top_k,
+            "duration_ms": duration_ms,
+            "npu_busy_delta_us": npu_delta,
+            "results": results[:clamped_top_k],
+        }
+        self.last_inference = {
+            "duration_ms": duration_ms,
+            "docs": len(documents),
+            "npu_busy_delta_us": npu_delta,
+        }
+        return payload
+
+    def health(self) -> dict[str, Any]:
+        status = "ok" if self.ready else "degraded"
+        return {
+            "status": status,
+            "ok": self.ready,
+            "service": "openvino-reranker",
+            "model": self.model_id,
+            "model_dir": str(self.model_dir),
+            "device": self.device,
+            "available_devices": self.available_devices,
+            "max_length": self.max_length,
+            "input_names": sorted(self.input_names),
+            "uptime_s": round(time.time() - self.loaded_at, 3),
+            "npu_busy_time_us": npu_busy_time_us(),
+            "startup_smoke": self.startup_smoke,
+            "last_inference": self.last_inference,
+            "ready_error": self.ready_error,
+        }
+
+
+def normalize_documents(value: Any, max_documents: int) -> list[dict[str, Any]]:
+    if not isinstance(value, list) or not value:
+        raise ValueError("documents must be a non-empty list")
+    if len(value) > max_documents:
+        raise ValueError(f"documents exceeds max_documents={max_documents}")
+    docs: list[dict[str, Any]] = []
+    for idx, item in enumerate(value):
+        if isinstance(item, str):
+            text = item
+            doc: dict[str, Any] = {"text": text}
+        elif isinstance(item, dict):
+            text = item.get("text")
+            doc = {
+                "id": item.get("id"),
+                "text": text,
+                "metadata": item.get("metadata") if isinstance(item.get("metadata"), dict) else {},
+            }
+        else:
+            raise ValueError(f"documents[{idx}] must be a string or object")
+        if not isinstance(text, str) or not text.strip():
+            raise ValueError(f"documents[{idx}].text must be a non-empty string")
+        docs.append(doc)
+    return docs
+
+
+def parse_top_k(value: Any, document_count: int) -> int:
+    """Validate top_k/top_n before inference so schema errors return HTTP 400."""
+    if value is None:
+        return document_count
+    if isinstance(value, bool) or not isinstance(value, int):
+        raise ValueError("top_k/top_n must be a positive integer")
+    if value < 1:
+        raise ValueError("top_k/top_n must be a positive integer")
+    return min(value, document_count)
+
+
+def assert_port_available(host: str, port: int) -> None:
+    """Fail fast on listener conflicts before compiling the OpenVINO model."""
+    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
+        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
+        try:
+            sock.bind((host, port))
+        except OSError as exc:
+            raise RuntimeError(f"cannot bind {host}:{port}; listener conflict or invalid bind: {exc}") from exc
+
+
+class Handler(BaseHTTPRequestHandler):
+    server_version = "OpenVINOReranker/0.1"
+
+    @property
+    def svc(self) -> RerankerService:
+        return self.server.reranker_service  # type: ignore[attr-defined]
+
+    @property
+    def max_body_bytes(self) -> int:
+        return self.server.max_body_bytes  # type: ignore[attr-defined]
+
+    @property
+    def max_documents(self) -> int:
+        return self.server.max_documents  # type: ignore[attr-defined]
+
+    def do_GET(self) -> None:
+        path = self.path.split("?", 1)[0].rstrip("/") or "/"
+        if path == "/":
+            self.write_json({"ok": True, "service": "openvino-reranker", "endpoints": ["/healthz", "/readyz", "/rerank", "/v1/rerank"]})
+        elif path in {"/healthz", "/health"}:
+            self.write_json(self.svc.health(), status=200)
+        elif path == "/readyz":
+            health = self.svc.health()
+            self.write_json(health, status=200 if health.get("ok") else 503)
+        else:
+            self.write_json({"ok": False, "error": "not found", "results": []}, status=404)
+
+    def do_POST(self) -> None:
+        path = self.path.split("?", 1)[0].rstrip("/") or "/"
+        try:
+            if path not in {"/rerank", "/v1/rerank"}:
+                self.write_json({"ok": False, "error": "not found", "results": []}, status=404)
+                return
+            if not self.svc.ready:
+                self.write_json({"ok": False, "error": self.svc.ready_error or "model not ready", "results": []}, status=503)
+                return
+            payload = self.read_json()
+            query = payload.get("query")
+            if not isinstance(query, str) or not query.strip():
+                raise ValueError("query is required")
+            top_k = payload.get("top_k", payload.get("top_n"))
+            documents = normalize_documents(payload.get("documents"), self.max_documents)
+            top_k = parse_top_k(top_k, len(documents))
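+            # Reject non-boolean values: bool("false") is True and would echo document text unintentionally.
+            return_documents = payload.get("return_documents", True)
+            if not isinstance(return_documents, bool):
+                raise ValueError("return_documents must be a boolean")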
+            response = self.svc.rerank(query.strip(), documents, top_k=top_k, return_documents=return_documents)
+            self.write_json(response)
+        except RequestTooLarge as exc:
+            self.write_json({"ok": False, "error": str(exc), "results": []}, status=413)
+        except ValueError as exc:
+            self.write_json({"ok": False, "error": str(exc), "results": []}, status=400)
+        except Exception as exc:
+            self.write_json({"ok": False, "error": f"{type(exc).__name__}: {exc}", "results": []}, status=500)
+
+    def read_json(self) -> dict[str, Any]:
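+        length = int(self.headers.get("Content-Length") or 0)
+        if length < 0:
+            # rfile.read(-1) would read until EOF and bypass the body limit.
+            raise ValueError("Content-Length must be non-negative")
+        if length > self.max_body_bytes:
+            raise RequestTooLarge(f"request body exceeds {self.max_body_bytes} bytes")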
+        body = self.rfile.read(length).decode("utf-8", "replace") if length else "{}"
+        payload = json.loads(body or "{}")
+        if not isinstance(payload, dict):
+            raise ValueError("JSON body must be an object")
+        return payload
+
+    def write_json(self, payload: dict[str, Any], status: int = 200) -> None:
+        body = json.dumps(payload, ensure_ascii=False).encode("utf-8")
+        self.send_response(status)
+        self.send_header("Content-Type", "application/json")
+        self.send_header("Content-Length", str(len(body)))
+        self.end_headers()
+        self.wfile.write(body)
+
+    def log_message(self, format: str, *args: Any) -> None:  # noqa: A002 - stdlib override name
+        print(f"{self.address_string()} - {format % args}", file=sys.stderr, flush=True)
+
+
+class RequestTooLarge(ValueError):
+    pass
+
+
+def main() -> int:
+    parser = argparse.ArgumentParser()
+    parser.add_argument("--host", default=os.environ.get("OPENVINO_RERANKER_HOST", "127.0.0.1"))
+    parser.add_argument("--port", type=int, default=int(os.environ.get("OPENVINO_RERANKER_PORT", DEFAULT_PORT)))
+    parser.add_argument("--model-dir", default=os.environ.get("OPENVINO_RERANKER_MODEL_DIR", str(DEFAULT_MODEL_DIR)))
+    parser.add_argument("--model", default=os.environ.get("OPENVINO_RERANKER_MODEL", DEFAULT_MODEL_ID))
+    parser.add_argument("--device", default=os.environ.get("OPENVINO_RERANKER_DEVICE", "NPU"))
+    parser.add_argument("--max-length", type=int, default=int(os.environ.get("OPENVINO_RERANKER_MAX_LENGTH", str(DEFAULT_MAX_LENGTH))))
+    parser.add_argument("--max-documents", type=int, default=int(os.environ.get("OPENVINO_RERANKER_MAX_DOCUMENTS", str(DEFAULT_MAX_DOCUMENTS))))
+    parser.add_argument("--max-body-bytes", type=int, default=int(os.environ.get("OPENVINO_RERANKER_MAX_BODY_BYTES", str(DEFAULT_MAX_BODY_BYTES))))
+    parser.add_argument("--skip-startup-smoke", action="store_true", default=os.environ.get("OPENVINO_RERANKER_SKIP_STARTUP_SMOKE", "").lower() in {"1", "true", "yes"})
+    args = parser.parse_args()
+
+    assert_port_available(args.host, args.port)
+    service = RerankerService(
+        Path(args.model_dir).expanduser(),
+        args.model,
+        args.device,
+        args.max_length,
+        startup_smoke=not args.skip_startup_smoke,
+    )
+    httpd = ThreadingHTTPServer((args.host, args.port), Handler)
+    httpd.reranker_service = service  # type: ignore[attr-defined]
+    httpd.max_body_bytes = args.max_body_bytes  # type: ignore[attr-defined]
+    httpd.max_documents = args.max_documents  # type: ignore[attr-defined]
+    print(
+        f"openvino-reranker listening on {args.host}:{args.port} model={args.model} "
+        f"model_dir={args.model_dir} device={args.device} max_length={args.max_length}",
+        flush=True,
+    )
+    try:
+        httpd.serve_forever()
+    except KeyboardInterrupt:
+        httpd.server_close()  # release the listening socket on Ctrl-C
+    return 0
+
+
+if __name__ == "__main__":
+    raise SystemExit(main())
@@ -0,0 +1,167 @@
+#!/usr/bin/env python3
+"""Smoke/benchmark checks for the OpenVINO reranker service.
+
+Prints a JSON summary and exits non-zero on schema/ranking/NPU verification failure.
+Uses only non-private fixture text.
+"""
+from __future__ import annotations
+
+import argparse
+import json
+import statistics
+import sys
+import time
+import urllib.error
+import urllib.request
+from pathlib import Path
+from typing import Any
+
+NPU_BUSY_FILE = Path("/sys/class/accel/accel0/device/npu_busy_time_us")
+
+FIXTURES = [
+    {
+        "query": "how do I verify OpenVINO NPU usage?",
+        "documents": [
+            {"id": "good", "text": "Check /sys/class/accel/accel0/device/npu_busy_time_us before and after inference."},
+            {"id": "bad", "text": "This note is about making sourdough starter."},
+        ],
+        "expected_top_id": "good",
+    },
+    {
+        "query": "what port does the reranker service use?",
+        "documents": [
+            {"id": "unrelated", "text": "Whisper transcription accepts audio uploads."},
+            {"id": "port", "text": "The OpenVINO reranker prototype listens locally on port 18818."},
+        ],
+        "expected_top_id": "port",
+    },
+    {
+        "query": "why should reranking not mutate vector collections?",
+        "documents": [
+            {"id": "mutation", "text": "Reranking is a read-only second-stage transformation after vector search."},
+            {"id": "cooking", "text": "Boil pasta in salted water until al dente."},
+        ],
+        "expected_top_id": "mutation",
+    },
+]
+
+
+def npu_busy_time_us() -> int | None:
+    try:
+        return int(NPU_BUSY_FILE.read_text().strip())
+    except (OSError, ValueError):  # sysfs file missing/unreadable, or not an integer
+        return None
+
+
+def post_json(url: str, payload: dict[str, Any], timeout: float) -> tuple[int, dict[str, Any]]:
+    data = json.dumps(payload).encode("utf-8")
+    req = urllib.request.Request(url, data=data, headers={"Content-Type": "application/json"}, method="POST")
+    try:
+        with urllib.request.urlopen(req, timeout=timeout) as resp:
+            body = resp.read().decode("utf-8", "replace")
+            return resp.status, json.loads(body)
+    except urllib.error.HTTPError as exc:
+        body = exc.read().decode("utf-8", "replace")
+        try:
+            parsed = json.loads(body)
+        except ValueError:  # includes json.JSONDecodeError
+            parsed = {"error": body}
+        return exc.code, parsed
+
+
+def get_json(url: str, timeout: float) -> tuple[int, dict[str, Any]]:
+    try:
+        with urllib.request.urlopen(url, timeout=timeout) as resp:
+            body = resp.read().decode("utf-8", "replace")
+            return resp.status, json.loads(body)
+    except urllib.error.HTTPError as exc:
+        body = exc.read().decode("utf-8", "replace")
+        try:
+            parsed = json.loads(body)
+        except ValueError:  # includes json.JSONDecodeError
+            parsed = {"error": body}
+        return exc.code, parsed
+
+
+def percentile(values: list[float], pct: float) -> float | None:
+    if not values:
+        return None
+    ordered = sorted(values)
+    idx = min(len(ordered) - 1, max(0, round((pct / 100.0) * (len(ordered) - 1))))
+    return round(ordered[idx], 3)
+
+
+def main() -> int:
+    parser = argparse.ArgumentParser()
+    parser.add_argument("--url", default="http://127.0.0.1:18818")
+    parser.add_argument("--timeout", type=float, default=20.0)
+    parser.add_argument("--allow-cpu", action="store_true", help="do not fail when health reports a non-NPU device")
+    args = parser.parse_args()
+
+    base = args.url.rstrip("/")
+    failures: list[str] = []
+    health_status, health = get_json(f"{base}/readyz", args.timeout)
+    if health_status != 200 or not health.get("ok"):
+        failures.append(f"readyz failed status={health_status} error={health.get('ready_error') or health.get('error')}")
+    device = health.get("device")
+    if device != "NPU" and not args.allow_cpu:
+        failures.append(f"device is {device!r}, expected 'NPU'")
+
+    latencies: list[float] = []
+    response_npu_total = 0
+    sysfs_npu_total = 0
+    top1_passed = 0
+
+    for case in FIXTURES:
+        before = npu_busy_time_us()
+        started = time.perf_counter()
+        status, payload = post_json(
+            f"{base}/rerank",
+            {"query": case["query"], "documents": case["documents"], "top_k": len(case["documents"]), "return_documents": False},
+            args.timeout,
+        )
+        wall_ms = (time.perf_counter() - started) * 1000
+        after = npu_busy_time_us()
+        latencies.append(float(payload.get("duration_ms") or wall_ms))
+        response_delta = payload.get("npu_busy_delta_us")
+        sysfs_delta = None if before is None or after is None else after - before
+        if isinstance(response_delta, int):
+            response_npu_total += response_delta
+        if isinstance(sysfs_delta, int):
+            sysfs_npu_total += sysfs_delta
+        results = payload.get("results") if isinstance(payload, dict) else None
+        top_id = results[0].get("id") if isinstance(results, list) and results else None
+        if status != 200 or not payload.get("ok"):
+            failures.append(f"case {case['expected_top_id']} request failed: status={status} error={payload.get('error')}")
+        if not isinstance(results, list) or len(results) != len(case["documents"]):
+            failures.append(f"case {case['expected_top_id']} returned invalid results")
+        if top_id == case["expected_top_id"]:
+            top1_passed += 1
+        else:
+            failures.append(f"case {case['expected_top_id']} wrong top result: top_id={top_id!r}")
+        if device == "NPU":
+            if not isinstance(response_delta, int) or response_delta <= 0:
+                failures.append(f"case {case['expected_top_id']} response npu delta not positive: {response_delta}")
+            if not isinstance(sysfs_delta, int) or sysfs_delta <= 0:
+                failures.append(f"case {case['expected_top_id']} sysfs npu delta not positive: {sysfs_delta}")
+
+    summary = {
+        "ok": not failures,
+        "url": base,
+        "model": health.get("model"),
+        "device": device,
+        "cases": len(FIXTURES),
+        "top1_passed": top1_passed,
+        "p50_ms": percentile(latencies, 50),
+        "p95_ms": percentile(latencies, 95),
+        "mean_ms": round(statistics.mean(latencies), 3) if latencies else None,
+        "npu_busy_delta_us_total": sysfs_npu_total,
+        "response_npu_busy_delta_us_total": response_npu_total,
+        "failures": failures,
+    }
+    print(json.dumps(summary, indent=2, sort_keys=True))
+    return 0 if not failures else 1
+
+
+if __name__ == "__main__":
+    raise SystemExit(main())
@@ -0,0 +1,55 @@
+#!/usr/bin/env python3
+"""Unit checks for reranker request validation helpers.
+
+These tests intentionally avoid loading an OpenVINO model; they only cover the
+stdlib validation helpers used before inference.
+"""
+from __future__ import annotations
+
+import socket
+import unittest
+
+from server import assert_port_available, normalize_documents, parse_top_k
+
+
+class ValidationTests(unittest.TestCase):
+    def test_normalize_accepts_strings_and_objects(self) -> None:
+        docs = normalize_documents(
+            [
+                "plain text document",
+                {"id": "obj", "text": "object document", "metadata": {"source": "synthetic"}},
+            ],
+            max_documents=2,
+        )
+        self.assertEqual(docs[0], {"text": "plain text document"})
+        self.assertEqual(docs[1]["id"], "obj")
+        self.assertEqual(docs[1]["metadata"], {"source": "synthetic"})
+
+    def test_normalize_rejects_empty_or_too_many_documents(self) -> None:
+        with self.assertRaisesRegex(ValueError, "non-empty"):
+            normalize_documents([], max_documents=2)
+        with self.assertRaisesRegex(ValueError, "max_documents"):
+            normalize_documents(["a", "b", "c"], max_documents=2)
+        with self.assertRaisesRegex(ValueError, "non-empty string"):
+            normalize_documents([{"id": "empty", "text": ""}], max_documents=2)
+
+    def test_parse_top_k_defaults_clamps_and_rejects_invalid_values(self) -> None:
+        self.assertEqual(parse_top_k(None, document_count=3), 3)
+        self.assertEqual(parse_top_k(2, document_count=3), 2)
+        self.assertEqual(parse_top_k(99, document_count=3), 3)
+        for value in (0, -1, True, False, 1.5, "2", "nope"):
+            with self.subTest(value=value):
+                with self.assertRaisesRegex(ValueError, "positive integer"):
+                    parse_top_k(value, document_count=3)
+
+    def test_assert_port_available_detects_listener_conflict(self) -> None:
+        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as listener:
+            listener.bind(("127.0.0.1", 0))
+            listener.listen(1)
+            port = listener.getsockname()[1]
+            with self.assertRaisesRegex(RuntimeError, "cannot bind"):
+                assert_port_available("127.0.0.1", port)
+
+
+if __name__ == "__main__":
+    unittest.main()