diff --git a/openvino-reranker-npu/README.md b/openvino-reranker-npu/README.md
index 30194a4..d6d6186 100644
--- a/openvino-reranker-npu/README.md
+++ b/openvino-reranker-npu/README.md
@@ -12,6 +12,7 @@ This service is intentionally not wired into live RAG by default.
 
 ## Files
 
+- `SPEC.md` — endpoint/CLI contract, model/runtime recommendation, smoke/NPU proof plan, RAG integration plan, docs implications, and no-go criteria.
 - `server.py` — stdlib HTTP OpenVINO Runtime service.
 - `smoke.py` — non-private API/ranking/NPU busy-time smoke test.
 - `openvino-reranker.service` — optional user-systemd unit.
diff --git a/openvino-reranker-npu/SPEC.md b/openvino-reranker-npu/SPEC.md
new file mode 100644
index 0000000..de40a03
--- /dev/null
+++ b/openvino-reranker-npu/SPEC.md
@@ -0,0 +1,243 @@
+# OpenVINO NPU reranker service spec
+
+Status: proposed localhost prototype; not live RAG integration.
+Target port: `127.0.0.1:18818`.
+Safety posture: foreground smoke first, no persistent enablement, no Atlas/Hermes/RAG routing changes without Will's explicit approval.
+
+## Recommendation
+
+Use `cross-encoder/ms-marco-MiniLM-L6-v2`, exported to OpenVINO IR as INT8, served by the local stdlib HTTP service in `server.py` on OpenVINO Runtime `NPU`.
+
+Why this choice:
+
+- It is a small BERT-family cross-encoder reranker intended for MS MARCO-style passage ranking, matching the second-stage RAG use case better than another embedding-only similarity pass.
+- The model shape is simple pairwise text classification/scoring: `(query, document) -> score`, which maps cleanly to OpenVINO Runtime and avoids introducing a heavier LLM worker for reranking.
+- INT8 OpenVINO IR keeps memory and compile/runtime cost low enough for a localhost sidecar and is already represented in the repo defaults:
+  `/home/will/.cache/openvino-models/rerankers/ms-marco-MiniLM-L6-v2-int8-ov`.
+- The service can fail closed on startup when `OPENVINO_RERANKER_DEVICE=NPU` but `/sys/class/accel/accel0/device/npu_busy_time_us` does not increase, preventing false "NPU-backed" claims.
+
+Runtime default:
+
+```text
+OPENVINO_RERANKER_HOST=127.0.0.1
+OPENVINO_RERANKER_PORT=18818
+OPENVINO_RERANKER_DEVICE=NPU
+OPENVINO_RERANKER_MODEL=cross-encoder/ms-marco-MiniLM-L6-v2
+OPENVINO_RERANKER_MODEL_DIR=/home/will/.cache/openvino-models/rerankers/ms-marco-MiniLM-L6-v2-int8-ov
+OPENVINO_RERANKER_MAX_LENGTH=512
+OPENVINO_RERANKER_MAX_DOCUMENTS=100
+OPENVINO_RERANKER_MAX_BODY_BYTES=5242880
+```
+
+## Endpoint contract
+
+### Health and readiness
+
+`GET /healthz` and `GET /readyz` return JSON.
+
+`/readyz` must return HTTP 200 only when the model is loaded and startup smoke passed. For NPU mode, startup smoke must include a positive `npu_busy_delta_us`.
+
+Representative ready response:
+
+```json
+{
+  "status": "ok",
+  "ok": true,
+  "service": "openvino-reranker",
+  "model": "cross-encoder/ms-marco-MiniLM-L6-v2",
+  "model_dir": "/home/will/.cache/openvino-models/rerankers/ms-marco-MiniLM-L6-v2-int8-ov",
+  "device": "NPU",
+  "available_devices": ["CPU", "NPU"],
+  "max_length": 512,
+  "startup_smoke": {"ok": true, "duration_ms": 12.3, "npu_busy_delta_us": 1234},
+  "last_inference": null,
+  "ready_error": null
+}
+```
+
+### Rerank
+
+`POST /rerank` and compatibility alias `POST /v1/rerank` accept:
+
+```json
+{
+  "query": "how do I verify OpenVINO NPU usage?",
+  "documents": [
+    {"id": "good", "text": "Check /sys/class/accel/accel0/device/npu_busy_time_us before and after inference.", "metadata": {"source": "synthetic"}},
+    {"id": "bad", "text": "This note is about making sourdough starter."}
+  ],
+  "top_k": 2,
+  "return_documents": false
+}
+```
+
+Compatibility notes:
+
+- `documents` may be strings or objects with `id`, `text`, and optional object `metadata`.
+- `top_k` is preferred; `top_n` is accepted for common reranker-client compatibility.
+- `return_documents=false` is recommended for RAG integration to avoid echoing private source text into logs or intermediate traces.
+- The optional `model` field may be sent by clients but is not used for routing; this sidecar serves one configured model.
+
+Successful response:
+
+```json
+{
+  "ok": true,
+  "model": "cross-encoder/ms-marco-MiniLM-L6-v2",
+  "device": "NPU",
+  "query": "how do I verify OpenVINO NPU usage?",
+  "input_count": 2,
+  "top_k": 2,
+  "duration_ms": 10.5,
+  "npu_busy_delta_us": 1234,
+  "results": [
+    {"index": 0, "id": "good", "score": 8.1, "raw_score": 8.1, "probability": 0.9997},
+    {"index": 1, "id": "bad", "score": -4.2, "raw_score": -4.2, "probability": 0.0148}
+  ]
+}
+```
+
+Error response shape:
+
+```json
+{"ok": false, "error": "human-readable error", "results": []}
+```
+
+Status behavior:
+
+- 400: invalid JSON schema, empty query, missing/empty documents, invalid document text.
+- 413: request body above `OPENVINO_RERANKER_MAX_BODY_BYTES`.
+- 503: model not ready.
+- 500: unexpected inference/runtime failure.
+
+## CLI contract
+
+Foreground-only review start:
+
+```bash
+ss -ltnp | grep ':18818\b' || true
+cat /sys/class/accel/accel0/device/npu_busy_time_us
+source /home/will/.venvs/openvino-reranker/bin/activate
+OPENVINO_RERANKER_HOST=127.0.0.1 \
+OPENVINO_RERANKER_PORT=18818 \
+OPENVINO_RERANKER_DEVICE=NPU \
+OPENVINO_RERANKER_MODEL_DIR=/home/will/.cache/openvino-models/rerankers/ms-marco-MiniLM-L6-v2-int8-ov \
+python /home/will/lab/swarm/openvino-reranker-npu/server.py
+```
+
+Client smoke:
+
+```bash
+source /home/will/.venvs/openvino-reranker/bin/activate
+python /home/will/lab/swarm/openvino-reranker-npu/smoke.py --url http://127.0.0.1:18818
+```
+
+Optional user-systemd unit exists as `openvino-reranker.service`, but this spec does not approve copying, starting, enabling, or wiring it into live paths.
+
+## Non-private smoke payload
+
+Use only synthetic public-text fixtures.
Do not query the Obsidian vault, private document directories, image folders, or live Chroma documents during smoke.
+
+Minimum cases:
+
+1. Query: `how do I verify OpenVINO NPU usage?`
+   - Expected top document: `Check /sys/class/accel/accel0/device/npu_busy_time_us before and after inference.`
+   - Distractor: `This note is about making sourdough starter.`
+2. Query: `what port does the reranker service use?`
+   - Expected top document: `The OpenVINO reranker prototype listens locally on port 18818.`
+   - Distractor: `Whisper transcription accepts audio uploads.`
+3. Query: `why should reranking not mutate vector collections?`
+   - Expected top document: `Reranking is a read-only second-stage transformation after vector search.`
+   - Distractor: `Boil pasta in salted water until al dente.`
+
+Pass criteria:
+
+- `/readyz` is HTTP 200 and reports `device=NPU`.
+- Every case returns `ok=true` and a sorted `results` list with the expected top `id`.
+- Response-level `npu_busy_delta_us` is positive for each case.
+- External sysfs `after - before` is positive for each case or at least for the full smoke batch.
+- Smoke script exits 0 and prints JSON with `ok: true`.
+
+## NPU busy-time verification plan
+
+HTTP 200 is not proof. Verification must capture both endpoint-reported and sysfs-observed deltas.
+
+Procedure:
+
+```bash
+BUSY=/sys/class/accel/accel0/device/npu_busy_time_us
+before=$(cat "$BUSY")
+curl -fsS http://127.0.0.1:18818/rerank \
+  -H 'Content-Type: application/json' \
+  -d '{"query":"how do I verify OpenVINO NPU usage?","documents":[{"id":"good","text":"Check /sys/class/accel/accel0/device/npu_busy_time_us before and after inference."},{"id":"bad","text":"This note is about making sourdough starter."}],"top_k":2,"return_documents":false}' \
+  | jq '{ok, device, npu_busy_delta_us, top_id:.results[0].id}'
+after=$(cat "$BUSY")
+echo "sysfs_npu_busy_delta_us=$((after-before))"
+```
+
+Acceptance:
+
+- `device == "NPU"`.
+- Response `npu_busy_delta_us > 0`.
+- Shell-computed `sysfs_npu_busy_delta_us > 0`.
+- If any value is zero/negative/missing, call the result CPU/unknown and do not claim NPU-backed reranking.
+
+## Optional RAG second-stage integration plan (deferred)
+
+This is a plan only. Do not enable it in live RAG without explicit approval.
+
+Design:
+
+1. Keep existing vector search and Chroma collection `obsidian_bge_npu` unchanged.
+2. Retrieve more candidates from current vector search, e.g. `initial_k=20`.
+3. Send only request-time candidate snippets/ids to `http://127.0.0.1:18818/rerank`.
+4. Use reranker order to choose final `top_k`, e.g. `5`.
+5. On timeout, connection error, invalid response, or non-positive NPU proof when proof is required, fall back to vector order and attach metadata like `rerank_error`; do not fail the whole RAG request unless explicitly configured.
+6. Log counters and latency, but avoid logging raw private document text.
+
+Disabled-by-default knobs:
+
+```text
+RAG_RERANK_ENABLED=false
+RAG_RERANK_URL=http://127.0.0.1:18818/rerank
+RAG_RERANK_INITIAL_K=20
+RAG_RERANK_TOP_K=5
+RAG_RERANK_TIMEOUT_MS=3000
+RAG_RERANK_REQUIRE_NPU_PROOF=true
+RAG_RERANK_RETURN_DOCUMENTS=false
+```
+
+Integration tests should use synthetic in-memory candidates first. Live-vault evaluation requires a separate approval and must not mutate or rebuild the vector collection.
+
+## Docs and diagram implications
+
+If this prototype advances beyond spec/review, update these surfaces while keeping live/prototype labels clear:
+
+- `openvino-reranker-npu/README.md`: keep model/runtime, endpoint contract, smoke command, and approval gates synchronized with code.
+- `swarm-common/obsidian-vault/will/will-shared-zap/Runbooks/OpenVINO NPU Services Runbook.md`: list `:18818` as prototype/not enabled, with foreground smoke and NPU sysfs proof.
+- Service catalog / architecture notes: show live baseline `:18810`, `:18816`, `:18817`; show `:18818` as optional second-stage RAG prototype, not live routing.
+- Diagrams: render `RAG :18810 -> optional reranker :18818` as dashed/disabled or "proposed"; do not imply Atlas/Hermes/gateway traffic is using it.
+- Optional systemd unit: document as installable after approval, not enabled by default.
+
+## No-go / defer criteria
+
+Do not ship, enable, or integrate the reranker if any of these hold:
+
+- Port `18818` is already owned by another live service.
+- `NPU` is unavailable in `ov.Core().available_devices` or `/sys/class/accel/accel0/device/npu_busy_time_us` is missing.
+- Foreground startup smoke fails or has non-positive NPU busy-time delta while configured for NPU.
+- Synthetic smoke top-1 ranking fails or latency is unacceptable for the intended RAG timeout budget.
+- Model export requires overwriting the existing model directory or touching Chroma/vector collections.
+- The service must bind beyond `127.0.0.1` to be useful.
+- Live RAG integration would require reindexing, collection mutation, private-doc smoke, or Atlas/Hermes/gateway routing changes without explicit approval.
+- Logs or responses would persist raw private document text outside the existing RAG request path.
+
+## Current local preflight observed during this spec pass
+
+- `/sys/class/accel/accel0/device/npu_busy_time_us` is readable.
+- `/home/will/.cache/openvino-models/rerankers/ms-marco-MiniLM-L6-v2-int8-ov` is present.
+- `/home/will/.venvs/openvino-reranker/bin/python` is present.
+- `:18818` was not listening during preflight.
+- `server.py` and `smoke.py` pass `python -m py_compile`.
+
+These observations are preflight only; they are not a live service/NPU smoke result.
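The fallback rule in design step 5 of the deferred RAG integration plan could be sketched roughly as below. This is an illustration under stated assumptions, not code from the patch: the function name `apply_rerank`, its parameters, the error strings, and the reading of each result's `index` as a position in the submitted candidate list are all hypothetical. Only the response fields (`ok`, `device`, `npu_busy_delta_us`, `results`, `error`) come from the spec.

```python
from __future__ import annotations

from typing import Any, Optional


def _positive_number(value: Any) -> bool:
    """True only for a real number strictly greater than zero (bools excluded)."""
    return isinstance(value, (int, float)) and not isinstance(value, bool) and value > 0


def apply_rerank(
    candidates: list[dict[str, Any]],
    response: Optional[dict[str, Any]],
    top_k: int = 5,
    require_npu_proof: bool = True,
) -> tuple[list[dict[str, Any]], Optional[str]]:
    """Return (final_candidates, rerank_error).

    `candidates` is assumed to be in vector-search order. `response` is the
    parsed /rerank JSON body, or None when the call timed out or failed to
    connect. Any problem falls back to vector order with an error string
    instead of raising, so the surrounding RAG request keeps working.
    """
    fallback = candidates[:top_k]
    if response is None:
        return fallback, "no response (timeout or connection error)"
    if response.get("ok") is not True:
        return fallback, str(response.get("error") or "reranker returned ok=false")
    results = response.get("results")
    if not isinstance(results, list) or not results:
        return fallback, "missing or empty results"
    if require_npu_proof:
        # Mirrors the spec's acceptance rule: device must be NPU and the
        # endpoint-reported busy-time delta must be positive.
        if response.get("device") != "NPU" or not _positive_number(
            response.get("npu_busy_delta_us")
        ):
            return fallback, "NPU proof missing or non-positive"
    ordered = []
    for item in results:
        index = item.get("index") if isinstance(item, dict) else None
        # Assumption: `index` refers back to the position in `candidates`.
        if not isinstance(index, int) or isinstance(index, bool):
            return fallback, "invalid result index"
        if not 0 <= index < len(candidates):
            return fallback, "invalid result index"
        ordered.append(candidates[index])
    return ordered[:top_k], None
```

A caller would attach the returned error string as `rerank_error` metadata and log only counters and latency, never candidate text. Note that this sketch checks only the endpoint-reported delta; the sysfs-observed delta from the verification plan would still need to be captured separately.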