merge: integrate OpenVINO NPU assistant services
@@ -37,6 +37,7 @@ For the current host-side AI/search/voice automation stack, n8n watchdogs, and a
|
||||
- [`docs/swarm-infrastructure.md`](docs/swarm-infrastructure.md) — operational overview and quick checks
|
||||
- [`docs/swarm-infrastructure.html`](docs/swarm-infrastructure.html) — dark SVG architecture diagram
|
||||
- [`docs/diagram-maintenance.md`](docs/diagram-maintenance.md) — diagram upkeep conventions
|
||||
- OpenVINO NPU services and prototypes are documented in `swarm-common/obsidian-vault/will/will-shared-zap/Runbooks/OpenVINO NPU Services Runbook.md` and the component READMEs under `openvino-*-npu*/`. Live baseline ports are RAG `:18810`, Whisper NPU `:18816`, and embeddings `:18817`; sidecar ports `:18818`, `:18819`, `:18820`, and optional doc/image triage `:18829` are approved prototypes only, not live Atlas/Hermes routing.
|
||||
|
||||
## VM: zap
|
||||
|
||||
|
||||
@@ -15,6 +15,7 @@ Update the relevant diagram in the same change set when you change any of these:
|
||||
- n8n workflow architecture
|
||||
- Hermes/Atlas routing or gateway responsibilities
|
||||
- local AI/search/voice endpoints
|
||||
- OpenVINO NPU live/prototype status, ports, or safety gates (`:18810`, `:18816`, `:18817`, `:18818`, `:18819`, `:18820`, optional `:18829`)
|
||||
- Obsidian/RAG data flow
|
||||
- OpenClaw/VM operational mode
|
||||
- ownership/source-of-truth paths for a component
|
||||
@@ -27,6 +28,7 @@ Create a new focused diagram when the existing overview would become too dense.
|
||||
- agentmon internals: collectors → NATS → processor → Postgres → query/UI
|
||||
- Obsidian/RAG automation pipeline
|
||||
- local AI routing: Hermes/LiteLLM/llama.cpp/Ollama/provider boundaries
|
||||
- OpenVINO NPU assistant sidecars, with live baseline and approved/not-live prototype lanes separated
|
||||
- messaging/channel routing: Telegram/Discord/email → Hermes/n8n/alerts
|
||||
- disaster recovery / backup topology
|
||||
|
||||
@@ -37,6 +39,7 @@ Create a new focused diagram when the existing overview would become too dense.
|
||||
- Link diagrams from the nearest README or operational doc.
|
||||
- Keep labels operational: service name, port, responsibility, and data direction.
|
||||
- Avoid secrets, credential names that imply secret values, private tokens, raw webhook URLs, or sensitive sample payloads.
|
||||
- Do not imply live Atlas/Hermes/RAG routing to an OpenVINO NPU prototype unless a reviewed implementation actually enabled it; label approved prototypes as `not live` or `approval required`.
|
||||
- If a raw export or live config was used to build the diagram, commit only the sanitized diagram/docs, not the raw sensitive source.
|
||||
|
||||
## Verification before committing
|
||||
|
||||
@@ -0,0 +1,388 @@
|
||||
# OpenVINO/NPU VLM, audio, and wake-word feasibility
|
||||
|
||||
Date: 2026-06-04
|
||||
Scope: feasibility/spec only for lower-priority assistant sidecars. This document does not enable services, alter Atlas/Hermes/gateway routing, mutate RAG/Chroma/vector collections, or process private document/image directories.
|
||||
|
||||
## Existing baseline and constraints
|
||||
|
||||
Live baseline discovered by parent task:
|
||||
|
||||
- RAG endpoint: `127.0.0.1:18810`
|
||||
- RAG health wrapper: `127.0.0.1:18814`
|
||||
- Whisper OpenVINO NPU: `127.0.0.1:18816`
|
||||
- OpenVINO embeddings: `127.0.0.1:18817`
|
||||
- Prototype ports currently reserved/not live: reranker `:18818`, classifier/router `:18819`, GenAI worker `:18820`, optional doc/image triage `:18829`
|
||||
|
||||
Local NPU runtime snapshot from the feasibility run:
|
||||
|
||||
- `/home/will/.venvs/npu` has `openvino==2026.2.0` and `openvino-genai==2026.2.0.0`.
|
||||
- `openvino.Core().available_devices` reports `CPU`, `GPU.0`, `GPU.1`, and `NPU`.
|
||||
- NPU device name: `Intel(R) AI Boost`.
|
||||
- NPU claims must be verified by positive `/sys/class/accel/accel0/device/npu_busy_time_us` deltas around inference.
|
||||
|
||||
External release/project signals checked:
|
||||
|
||||
- OpenVINO 2026.2.0 release notes mention broader GenAI coverage and VLM samples, but the VLM acceleration notes are CPU/GPU-oriented; they do not provide a clear low-risk NPU VLM path.
|
||||
- Prior OpenVINO release notes/search results mention OpenVINO Model Server VLM support for Qwen2-VL, Phi-3.5-Vision, and InternVL2.
|
||||
- `openWakeWord` is an active Apache-2.0 local wake-word framework with ONNX Runtime/TFLite support, pre-trained wake-word models, optional VAD, and 16 kHz PCM streaming examples. It is not installed in the current NPU venv.
|
||||
|
||||
## Recommendation summary
|
||||
|
||||
| Lane | Recommendation | Priority | Why |
| --- | --- | --- | --- |
| VLM / image captioning | Defer NPU-first VLM. If pursued, prototype CPU/GPU VLM CLI first, then attempt NPU only after model/runtime compatibility is proven. | Low | NPU support for VLMs is not clearly mature in the current OpenVINO public notes; VLMs are memory/op-shape heavy; failures could be slow and noisy. Existing doc/image triage already covers practical local image metadata without a full VLM. |
| Lightweight image classification / caption fallback | Extend the existing `openvino-doc-image-triage-npu` lane before adding a new service. | Medium-low | It already has privacy boundaries, synthetic fixtures, CLI/server split, and NPU proof through embeddings. Add static-shape classifier only if a later task needs image labels beyond rule fallback. |
| Audio classification | Defer until a concrete assistant workflow needs it. Consider CPU/GPU/OpenVINO Runtime prototype using Speech Commands/ESC-style classifier before any daemon. | Low | Whisper NPU already covers transcription. Generic audio tags are less useful without a routing/product requirement and need dataset-specific threshold tuning. |
| Wake word | Worth a small CPU-only local smoke prototype; do not spend NPU time first. | Medium | Wake-word detection must be always-on, tiny, and reliable. CPU openWakeWord/ONNX/TFLite is the lowest-risk path and avoids starving existing NPU Whisper/embedding services. NPU use is only worth testing after CPU false-positive/latency behavior is acceptable. |
|
||||
|
||||
## VLM / image-captioning path
|
||||
|
||||
### Recommended model/runtime
|
||||
|
||||
Initial runtime: CLI-first OpenVINO GenAI or OpenVINO Model Server on CPU/GPU, not NPU-first.
|
||||
|
||||
Candidate models to evaluate, in order:
|
||||
|
||||
1. `Qwen2-VL-2B-Instruct` OpenVINO/OVMS-compatible export if a small converted artifact is already available.
|
||||
2. `Phi-3.5-Vision-Instruct` only if memory/startup is acceptable.
|
||||
3. `InternVL2` only as a compatibility reference; likely too heavy for a low-priority local assistant sidecar.
|
||||
|
||||
Why this order:
|
||||
|
||||
- Qwen2-VL is named as supported in OpenVINO Model Server release notes/search results and has smaller variants.
|
||||
- Phi-3.5-Vision is also named in OpenVINO Model Server VLM support, but may be heavier.
|
||||
- NPU is not the first target because public OpenVINO 2026.2 release notes emphasize VLM improvements for CPU/GPU, not NPU. Treat NPU VLM as experimental until a smoke test proves compilation and positive busy-time deltas.
|
||||
|
||||
### Endpoint/CLI contract
|
||||
|
||||
CLI-first contract:
|
||||
|
||||
```bash
python vlm_caption.py \
  --image /path/to/synthetic_or_explicitly_allowed_image.png \
  --prompt "Describe this image in one sentence." \
  --device CPU \
  --max-new-tokens 96 \
  --json
```
|
||||
|
||||
Response shape:
|
||||
|
||||
```json
{
  "ok": true,
  "media_type": "image",
  "source_path_basename": "synthetic_scene.png",
  "source_sha256": "sha256:...",
  "model": "qwen2-vl-small-openvino",
  "runtime": "openvino-genai-or-ovms",
  "device_requested": "CPU",
  "device_observed": "CPU",
  "caption": "A synthetic chart with three colored bars.",
  "safety": {
    "external_uploads": false,
    "raw_image_logged": false,
    "private_paths_allowed": false
  },
  "timing_ms": {
    "load": 0,
    "inference": 0,
    "total": 0
  },
  "npu_busy_delta_us": null
}
```
|
||||
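The `source_path_basename` and `source_sha256` fields above could be produced along these lines. This is a minimal sketch, not the prototype's actual code: the helper name `describe_source` is hypothetical, and it assumes the whole file fits in memory.

```python
import hashlib
from pathlib import Path


def describe_source(path: Path) -> dict:
    """Return only the basename and a SHA-256 digest, never the full path.

    Hypothetical helper: the field names follow the response shape above,
    everything else is an illustration.
    """
    digest = hashlib.sha256(path.read_bytes()).hexdigest()
    return {
        "source_path_basename": path.name,
        "source_sha256": f"sha256:{digest}",
    }
```

Returning the basename instead of the full path keeps private directory names out of logs and responses.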
|
||||
Optional localhost HTTP contract, only after CLI is stable:
|
||||
|
||||
- Bind: `127.0.0.1:18829` or another explicitly approved unused prototype port.
|
||||
- `GET /healthz`
|
||||
- `GET /models`
|
||||
- `POST /v1/vision/caption`
|
||||
|
||||
Request body:
|
||||
|
||||
```json
{
  "path": "/allowed/root/synthetic_scene.png",
  "prompt": "Describe this image in one sentence.",
  "max_new_tokens": 96,
  "device": "CPU"
}
```
|
||||
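Because the request carries a filesystem `path`, a server would need to reject anything outside the allowed root before opening the file. One possible check is sketched below; the function name `is_under_allowed_root` is an assumption for illustration and is not taken from the existing triage code.

```python
from pathlib import Path


def is_under_allowed_root(candidate: str, allowed_root: str) -> bool:
    """Return True only if candidate resolves to a location inside allowed_root.

    resolve() collapses ".." segments and follows symlinks, so traversal
    attempts such as "<root>/../private.png" are rejected.
    """
    root = Path(allowed_root).resolve()
    target = Path(candidate).resolve()
    return root in target.parents
```

The check deliberately returns `False` for the root directory itself, since a request should name a file inside it.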
|
||||
### Smoke-test plan using non-private data
|
||||
|
||||
Use only generated fixtures under the repo, similar to `openvino-doc-image-triage-npu/samples/`:
|
||||
|
||||
1. Create synthetic PNGs: simple chart, receipt-like image, screenshot-like text panel, and blank/noisy image.
2. Run CLI with `--allowed-root "$PWD/samples"` and assert:
   - JSON parses.
   - `external_uploads=false`.
   - only basename and SHA-256 are returned by default.
   - captions are non-empty and under a configured token/character limit.
   - unsupported/private paths are rejected.
3. If an HTTP server is added, start it in foreground on `127.0.0.1`, call `/healthz` and `/v1/vision/caption`, then stop it.
4. No private image/document folders and no Obsidian vault content should be used for smoke tests.
|
||||
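The response assertions in step 2 could be collected into one checker. This is a sketch only: `check_caption_response` and the 400-character default limit are hypothetical, and the field names follow the response shape defined earlier in this document.

```python
import json


def check_caption_response(raw: str, max_caption_chars: int = 400) -> list[str]:
    """Return a list of problems; an empty list means the response passes."""
    try:
        body = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc}"]
    if not isinstance(body, dict):
        return ["response must be a JSON object"]

    problems = []
    safety = body.get("safety")
    if not isinstance(safety, dict) or safety.get("external_uploads") is not False:
        problems.append("safety.external_uploads must be false")

    # Only a basename may be returned by default, never a directory path.
    basename = body.get("source_path_basename", "")
    if "/" in basename or "\\" in basename:
        problems.append("source_path_basename must not contain a directory")

    caption = body.get("caption", "")
    if not isinstance(caption, str) or not caption.strip():
        problems.append("caption is empty")
    elif len(caption) > max_caption_chars:
        problems.append("caption exceeds the configured limit")
    return problems
```

Returning a list of problems rather than raising on the first one lets a smoke run report every violation for a fixture at once.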
|
||||
### NPU busy-time verification plan
|
||||
|
||||
Only claim NPU VLM if all of these pass:
|
||||
|
||||
1. Verify the counter is readable:
|
||||
|
||||
```bash
BUSY=/sys/class/accel/accel0/device/npu_busy_time_us
test -r "$BUSY" && before=$(cat "$BUSY")
```
|
||||
|
||||
2. Run exactly one synthetic-image inference with `device=NPU`.
3. Read `after=$(cat "$BUSY")`.
4. Require `after - before > 0` and a response-level `npu_busy_delta_us > 0` if the server reports it.
5. Repeat with a second synthetic image so that the positive delta cannot be explained by unrelated startup activity alone.
6. If HTTP returns 200 but the sysfs delta is zero, document as `NPU not verified` and do not call it an NPU service.
|
||||
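The before/after procedure above can be wrapped around any inference call. The sketch below assumes the sysfs counter holds a single integer, as the shell snippet implies. The helper names (`read_busy_us`, `busy_delta_us`, `npu_verified`) are hypothetical, and the counter path is a parameter so the logic can be exercised against a plain file.

```python
from pathlib import Path
from typing import Callable

BUSY_COUNTER = Path("/sys/class/accel/accel0/device/npu_busy_time_us")


def read_busy_us(counter: Path = BUSY_COUNTER) -> int:
    """Read the cumulative NPU busy time in microseconds."""
    return int(counter.read_text().strip())


def busy_delta_us(run_inference: Callable[[], object], counter: Path = BUSY_COUNTER) -> int:
    """Run one inference and return how much the busy counter advanced."""
    before = read_busy_us(counter)
    run_inference()
    return read_busy_us(counter) - before


def npu_verified(deltas: list[int]) -> bool:
    """Require at least two runs, each with a strictly positive delta."""
    return len(deltas) >= 2 and all(delta > 0 for delta in deltas)
```

Other NPU clients (Whisper `:18816`, embeddings `:18817`) also advance the same counter, so a positive delta is only meaningful if those services are idle during the measurement.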
|
||||
### No-go / defer criteria
|
||||
|
||||
Defer VLM NPU work if any apply:
|
||||
|
||||
- Model export/compile to NPU fails or requires unsupported ops/custom patches.
|
||||
- First successful inference needs more than 60 seconds cold or more than 10 seconds warm for a small synthetic image.
|
||||
- NPU busy-time delta is zero or inconsistent.
|
||||
- Memory pressure disrupts Whisper `:18816`, embeddings `:18817`, or RAG `:18810`.
|
||||
- The only useful path requires processing private images/docs before synthetic smoke tests are stable.
|
||||
- Captions are too hallucination-prone for automation decisions without a human-review gate.
|
||||
|
||||
## Lightweight image triage/classification path
|
||||
|
||||
### Recommended model/runtime
|
||||
|
||||
Recommended near-term path: keep `openvino-doc-image-triage-npu` as the primary image/document lane and add only a static-shape classifier if rule fallback becomes inadequate.
|
||||
|
||||
Candidate classifier families for a later task:
|
||||
|
||||
- MobileNetV3/EfficientNet-Lite/ResNet-18 style image classifier exported to OpenVINO IR.
|
||||
- Use NPU only if the IR compiles with static shapes and produces positive busy-time deltas.
|
||||
- Keep OCR/PDF rendering CPU-local; do not try to force OCR onto NPU in this phase.
|
||||
|
||||
Why:
|
||||
|
||||
- The current triage prototype already has the right privacy contract and reports CPU vs NPU stages.
|
||||
- A small classifier is much lower risk than a VLM and can be used for labels like `screenshot`, `receipt`, `document`, `photo`, `chart`.
|
||||
|
||||
### Endpoint/CLI contract
|
||||
|
||||
Extend existing CLI shape rather than introduce a new daemon:
|
||||
|
||||
```bash
/home/will/.venvs/npu/bin/python triage.py \
  --allowed-root "$PWD" \
  --image-classifier-model /home/will/models/openvino-image-classifier/model.xml \
  --image-classifier-device NPU \
  --pretty \
  samples/synthetic_invoice.png
```
|
||||
|
||||
Response addition:
|
||||
|
||||
```json
{
  "classification": {
    "label": "receipt_or_invoice",
    "confidence": 0.82,
    "device": "NPU",
    "method": "openvino_image_classifier",
    "npu_busy_delta_us": 12345
  }
}
```
|
||||
|
||||
### Smoke-test plan
|
||||
|
||||
Reuse `openvino-doc-image-triage-npu/make_samples.py` and `tests/smoke_test.py`; add synthetic image-label assertions only after a classifier model exists. Keep `--no-embeddings` mode available so the smoke suite can separate classifier NPU proof from embeddings `:18817` proof.
|
||||
|
||||
### No-go / defer criteria
|
||||
|
||||
- Static-shape classifier cannot compile on NPU.
|
||||
- Labels are not useful enough to drive an assistant workflow.
|
||||
- Classifier output duplicates the existing rule-based fallback.
|
||||
|
||||
## Audio classification path
|
||||
|
||||
### Recommended model/runtime
|
||||
|
||||
Defer implementation. If a concrete workflow appears, start with a CLI-only OpenVINO Runtime classifier on CPU/GPU using synthetic/public audio fixtures, not a persistent service.
|
||||
|
||||
Potential model classes:
|
||||
|
||||
- Speech Commands keyword classifier for short command categories.
|
||||
- ESC-50/AudioSet-like environmental sound classifier only if the task requires non-speech detection.
|
||||
- Whisper transcript + lightweight text classifier may be enough for most assistant routing, using existing Whisper NPU `:18816`.
|
||||
|
||||
Why:
|
||||
|
||||
- The system already has local Whisper NPU transcription.
|
||||
- Generic audio classification needs careful threshold tuning and false-positive analysis.
|
||||
- Always-on audio processing has privacy and resource implications; keep it explicit and local.
|
||||
|
||||
### CLI contract
|
||||
|
||||
```bash
python audio_classify.py \
  --input samples/synthetic_chime.wav \
  --model /home/will/models/openvino-audio-classifier/model.xml \
  --device CPU \
  --json
```
|
||||
|
||||
Response shape:
|
||||
|
||||
```json
{
  "ok": true,
  "source_path_basename": "synthetic_chime.wav",
  "source_sha256": "sha256:...",
  "sample_rate": 16000,
  "duration_seconds": 1.2,
  "labels": [
    {"label": "chime", "confidence": 0.76}
  ],
  "device_requested": "CPU",
  "device_observed": "CPU",
  "npu_busy_delta_us": null,
  "privacy": {"external_uploads": false, "raw_audio_logged": false}
}
```
|
||||
|
||||
An optional HTTP endpoint should wait until a workflow exists. If one is added later, bind it to localhost and choose a port that does not overlap with the ports already in use.
|
||||
|
||||
### Smoke-test plan using non-private data
|
||||
|
||||
1. Generate synthetic WAV files in repo-local `samples/`: sine tone, silence, white noise, simple chime, and a short synthetic spoken phrase if a local TTS fixture is available.
|
||||
2. Run CLI on each file with `--allowed-root "$PWD/samples"`.
|
||||
3. Assert JSON parses, durations are bounded, and confidence values are numeric.
|
||||
4. Do not stream microphone input or scan private audio directories in smoke tests.
|
||||
5. If NPU mode is attempted, wrap each inference in sysfs busy-time reads.
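The fixture generation in step 1 needs only the standard library. The sketch below covers the sine, silence, and white-noise cases; the chime and TTS fixtures are left out. The file names other than `synthetic_noise.wav`, the 440 Hz tone, and the amplitudes are illustrative assumptions; 16 kHz mono 16-bit PCM matches the `sample_rate` in the response shape above.

```python
import math
import random
import struct
import wave
from pathlib import Path

SAMPLE_RATE = 16000  # matches "sample_rate": 16000 in the response shape


def write_wav(path: Path, samples: list[float]) -> None:
    """Write mono 16-bit PCM from floats clamped to [-1.0, 1.0]."""
    frames = b"".join(
        struct.pack("<h", int(max(-1.0, min(1.0, s)) * 32767)) for s in samples
    )
    with wave.open(str(path), "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(SAMPLE_RATE)
        wav.writeframes(frames)


def make_fixtures(out_dir: Path, seconds: float = 1.0) -> list[Path]:
    """Create sine, silence, and seeded white-noise fixtures under out_dir."""
    n = int(SAMPLE_RATE * seconds)
    rng = random.Random(0)  # fixed seed keeps the noise fixture reproducible
    fixtures = {
        "synthetic_sine.wav": [
            0.5 * math.sin(2 * math.pi * 440 * i / SAMPLE_RATE) for i in range(n)
        ],
        "synthetic_silence.wav": [0.0] * n,
        "synthetic_noise.wav": [rng.uniform(-0.3, 0.3) for _ in range(n)],
    }
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for name, samples in fixtures.items():
        path = out_dir / name
        write_wav(path, samples)
        paths.append(path)
    return paths
```

Seeding the noise generator means the fixture's SHA-256 stays stable across runs, which helps when responses report `source_sha256`.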
|
||||
|
||||
### No-go / defer criteria
|
||||
|
||||
- No concrete downstream automation consumes the labels.
|
||||
- False positives cannot be characterized on synthetic/public fixtures.
|
||||
- It competes with Whisper NPU or requires a persistent microphone daemon without explicit approval.
|
||||
|
||||
## Wake-word path
|
||||
|
||||
### Recommended model/runtime
|
||||
|
||||
Recommended first runtime: CPU-only `openWakeWord` CLI/foreground process with ONNX Runtime or TFLite backend.
|
||||
|
||||
NPU recommendation: defer. Try NPU/OpenVINO conversion only after CPU openWakeWord passes false-positive and latency checks.
|
||||
|
||||
Why:
|
||||
|
||||
- Wake-word detection is always-on and latency-sensitive; reliability matters more than accelerator novelty.
|
||||
- The model is small enough that CPU is likely acceptable and simpler.
|
||||
- Keeping wake-word off NPU reduces contention with Whisper NPU and embeddings.
|
||||
- openWakeWord has pre-trained models, optional VAD, and straightforward 16 kHz PCM frame APIs.
|
||||
|
||||
### Endpoint/CLI contract
|
||||
|
||||
CLI smoke contract:
|
||||
|
||||
```bash
python wake_word_smoke.py \
  --model hey_jarvis \
  --positive samples/synthetic_wake_positive.wav \
  --negative samples/synthetic_noise.wav \
  --threshold 0.5 \
  --json
```
|
||||
|
||||
Foreground local stream contract, only for manual experiments:
|
||||
|
||||
```bash
python wake_word_listen.py \
  --model hey_jarvis \
  --threshold 0.5 \
  --vad-threshold 0.3 \
  --oneshot \
  --json
```
|
||||
|
||||
Response/event shape:
|
||||
|
||||
```json
{
  "ok": true,
  "model": "hey_jarvis",
  "runtime": "openwakeword-onnxruntime-or-tflite",
  "device": "CPU",
  "threshold": 0.5,
  "events": [
    {"offset_ms": 1280, "score": 0.83, "detected": true}
  ],
  "false_positive_count": 0,
  "npu_busy_delta_us": null,
  "privacy": {"external_uploads": false, "raw_audio_logged": false}
}
```
|
||||
|
||||
If a localhost HTTP endpoint is ever needed, do not expose raw microphone streaming by default. Prefer events only:
|
||||
|
||||
- `GET /healthz`
|
||||
- `POST /v1/wakeword/evaluate-file` for explicit files under allowed roots
|
||||
- `GET /v1/wakeword/events` for a manually started foreground listener
|
||||
|
||||
### Smoke-test plan using non-private data
|
||||
|
||||
1. Install in a disposable or dedicated venv, not the existing NPU venv unless explicitly approved:
|
||||
|
||||
```bash
python -m venv /tmp/openwakeword-smoke-venv
/tmp/openwakeword-smoke-venv/bin/python -m pip install openwakeword
```
|
||||
|
||||
2. Use public/generated WAVs only:
   - Negative: silence, white noise, generic non-wake speech/TTS if locally generated.
   - Positive: only if a public/pretrained wake phrase fixture is available or generated explicitly for the selected model. If no positive fixture exists, run negative-only false-positive smoke and mark recall untested.
3. Assert no false positives over a bounded negative fixture set.
4. Measure per-frame CPU latency and max RSS.
5. Do not start a persistent microphone listener; manual foreground `--oneshot` only if explicitly approved.
|
||||
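The false-positive assertion in step 3 reduces to thresholding per-frame scores. The sketch below takes those scores as plain lists so it does not depend on any particular wake-word library API. The function names and the 80 ms frame duration are assumptions for illustration; the event fields follow the response/event shape above.

```python
def detections(scores: list[float], threshold: float, frame_ms: int = 80) -> list[dict]:
    """Turn per-frame scores into events for frames at or above the threshold.

    frame_ms is an assumed frame duration; adjust it to the real frame size.
    """
    return [
        {"offset_ms": index * frame_ms, "score": score, "detected": True}
        for index, score in enumerate(scores)
        if score >= threshold
    ]


def false_positive_count(negative_runs: dict[str, list[float]], threshold: float) -> int:
    """Count detections across negative fixtures; any detection is a false positive."""
    return sum(len(detections(scores, threshold)) for scores in negative_runs.values())
```

A negative-only smoke run would then assert `false_positive_count(...) == 0` and report recall as untested when no positive fixture exists.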
|
||||
### NPU busy-time verification plan
|
||||
|
||||
Wake-word should not claim NPU in the initial path. If a later task converts a model to OpenVINO IR and targets NPU:
|
||||
|
||||
1. Read `/sys/class/accel/accel0/device/npu_busy_time_us` before a bounded file evaluation.
|
||||
2. Run NPU inference on a fixed set of WAV frames.
|
||||
3. Read the counter after inference.
|
||||
4. Require positive delta and stable predictions matching CPU baseline.
|
||||
5. Also verify that keeping the wake-word loop active does not starve Whisper `:18816` or embeddings `:18817`.
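The "stable predictions matching CPU baseline" requirement in step 4 could be checked frame by frame. This is a sketch under stated assumptions: `scores_match` is a hypothetical helper, and the default tolerance of 0.05 is a placeholder, not a measured value.

```python
def scores_match(
    cpu: list[float],
    npu: list[float],
    threshold: float,
    tolerance: float = 0.05,
) -> bool:
    """Return True if NPU scores track the CPU baseline on every frame.

    A frame fails if the scores differ by more than the tolerance or if the
    detection decision at the threshold flips between the two devices.
    """
    if len(cpu) != len(npu):
        return False
    for cpu_score, npu_score in zip(cpu, npu):
        if abs(cpu_score - npu_score) > tolerance:
            return False
        if (cpu_score >= threshold) != (npu_score >= threshold):
            return False
    return True
```

Checking for a flipped decision separately matters because two scores close to the threshold can sit within tolerance and still disagree on whether the wake word fired.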
|
||||
|
||||
### No-go / defer criteria
|
||||
|
||||
- CPU openWakeWord has unacceptable false positives on local negative fixtures.
|
||||
- A usable positive fixture cannot be created without recording private audio.
|
||||
- Always-on microphone capture is required before explicit approval.
|
||||
- NPU conversion changes scores materially from CPU baseline.
|
||||
- NPU loop increases contention with Whisper/embedding services.
|
||||
|
||||
## Docs and diagram implications
|
||||
|
||||
If these lanes advance beyond feasibility:
|
||||
|
||||
1. Update `docs/swarm-infrastructure.md` and `docs/swarm-infrastructure.html` to keep live vs prototype labels clear.
2. Update the OpenVINO NPU runbook with smoke commands and the sysfs busy-time proof steps.
3. Update the Service Catalog only after a service is actually approved/live; until then list as `prototype/not live` or omit.
4. Architecture diagrams may show:
   - live: RAG `:18810`, Whisper NPU `:18816`, embeddings `:18817`;
   - prototypes: reranker `:18818`, classifier/router `:18819`, GenAI worker `:18820`, doc/image triage optional `:18829`;
   - VLM/audio/wake-word as `CLI feasibility / not live` unless a later implementation task creates a service.
5. Do not imply Atlas/Hermes routing integration for any of these lanes without explicit approval.
|
||||
|
||||
## Overall go/no-go decision
|
||||
|
||||
- Go later: wake-word CPU-only CLI smoke, because it is useful and low risk if kept foreground/local.
|
||||
- Maybe later: lightweight image classifier inside existing doc/image triage, if rule fallback is not enough.
|
||||
- Defer: NPU-first VLM captioning until OpenVINO VLM-on-NPU compatibility is proven by a minimal synthetic-image smoke.
|
||||
- Defer: generic audio classification until there is a concrete assistant workflow that consumes the output.
|
||||
@@ -27,7 +27,7 @@
|
||||
<div class="wrap">
|
||||
<div class="header"><div class="dot"></div><div><h1>Will's Swarm Infrastructure</h1><div class="sub">Atlas/Hermes gateway + n8n automation + agentmon monitoring + local AI/search/voice services</div></div></div>
|
||||
<div class="card">
|
||||
<svg viewBox="0 0 1280 900" xmlns="http://www.w3.org/2000/svg" role="img" aria-label="Swarm infrastructure architecture diagram">
|
||||
<svg viewBox="0 0 1280 980" xmlns="http://www.w3.org/2000/svg" role="img" aria-label="Swarm infrastructure architecture diagram">
|
||||
<defs>
|
||||
<pattern id="grid" width="40" height="40" patternUnits="userSpaceOnUse"><path d="M 40 0 L 0 0 0 40" fill="none" stroke="#1e293b" stroke-width="0.5"/></pattern>
|
||||
<marker id="arrow" markerWidth="10" markerHeight="10" refX="8" refY="3" orient="auto" markerUnits="strokeWidth"><path d="M0,0 L0,6 L9,3 z" fill="#38bdf8" /></marker>
|
||||
@@ -40,7 +40,7 @@
|
||||
.edge{fill:none; stroke:#38bdf8; stroke-width:1.8; marker-end:url(#arrow); opacity:.8}.edgeG{fill:none; stroke:#34d399; stroke-width:1.8; marker-end:url(#arrowGreen); opacity:.85}.edgeO{fill:none; stroke:#fb923c; stroke-width:1.8; marker-end:url(#arrowOrange); opacity:.85}.edgeR{fill:none; stroke:#fb7185; stroke-width:1.8; stroke-dasharray:5,4; marker-end:url(#arrowRose); opacity:.85}
|
||||
</style>
|
||||
</defs>
|
||||
<rect width="1280" height="900" fill="#020617"/><rect width="1280" height="900" fill="url(#grid)" opacity="0.7"/>
|
||||
<rect width="1280" height="980" fill="#020617"/><rect width="1280" height="980" fill="url(#grid)" opacity="0.7"/>
|
||||
|
||||
<!-- arrows behind nodes -->
|
||||
<path class="edge" d="M140 120 C210 120 210 205 280 205"/>
|
||||
@@ -58,13 +58,14 @@
|
||||
<path class="edge" d="M815 695 C900 695 900 735 965 735"/>
|
||||
<path class="edgeG" d="M625 635 C555 635 555 720 470 720"/>
|
||||
<path class="edge" d="M470 720 C545 720 545 565 620 565"/>
|
||||
<path class="edgeR" d="M490 735 C620 735 790 880 965 880"/>
|
||||
|
||||
<!-- boundaries -->
|
||||
<rect x="250" y="80" width="250" height="260" rx="14" fill="none" stroke="#fbbf24" stroke-width="1.4" stroke-dasharray="8,5" opacity=".75"/>
|
||||
<text x="265" y="103" class="tiny" fill="#fbbf24">Hermes gateway layer</text>
|
||||
<rect x="590" y="105" width="260" height="655" rx="14" fill="none" stroke="#fbbf24" stroke-width="1.4" stroke-dasharray="8,5" opacity=".75"/>
|
||||
<text x="605" y="128" class="tiny" fill="#fbbf24">n8n + agentmon observability</text>
|
||||
<rect x="935" y="95" width="280" height="760" rx="14" fill="none" stroke="#fbbf24" stroke-width="1.4" stroke-dasharray="8,5" opacity=".75"/>
|
||||
<rect x="935" y="95" width="280" height="850" rx="14" fill="none" stroke="#fbbf24" stroke-width="1.4" stroke-dasharray="8,5" opacity=".75"/>
|
||||
<text x="950" y="118" class="tiny" fill="#fbbf24">local swarm services</text>
|
||||
|
||||
<!-- external channels -->
|
||||
@@ -86,28 +87,29 @@
|
||||
<g><rect x="965" y="385" width="210" height="80" rx="9" fill="#0f172a"/><rect x="965" y="385" width="210" height="80" rx="9" fill="rgba(8,51,68,.4)" stroke="#22d3ee" stroke-width="1.6"/><text x="1070" y="415" text-anchor="middle" class="title">Voice</text><text x="1070" y="436" text-anchor="middle" class="tiny">Kokoro + Whisper</text><text x="1070" y="454" text-anchor="middle" class="port">:18805 / :18816</text></g>
|
||||
<g><rect x="965" y="555" width="210" height="80" rx="9" fill="#0f172a"/><rect x="965" y="555" width="210" height="80" rx="9" fill="rgba(76,29,149,.4)" stroke="#a78bfa" stroke-width="1.6"/><text x="1070" y="585" text-anchor="middle" class="title">Docker services</text><text x="1070" y="606" text-anchor="middle" class="tiny">agentmon.monitor=true</text><text x="1070" y="624" text-anchor="middle" class="port">swarm/service snapshots</text></g>
|
||||
<g><rect x="965" y="665" width="210" height="80" rx="9" fill="#0f172a"/><rect x="965" y="665" width="210" height="80" rx="9" fill="rgba(120,53,15,.3)" stroke="#fbbf24" stroke-width="1.6"/><text x="1070" y="695" text-anchor="middle" class="title">OpenClaw VMs</text><text x="1070" y="716" text-anchor="middle" class="tiny">currently dormant</text><text x="1070" y="734" text-anchor="middle" class="port">openclaw.snapshot</text></g>
|
||||
<g><rect x="965" y="775" width="210" height="60" rx="9" fill="#0f172a"/><rect x="965" y="775" width="210" height="60" rx="9" fill="rgba(76,29,149,.4)" stroke="#a78bfa" stroke-width="1.6"/><text x="1070" y="802" text-anchor="middle" class="title">Obsidian / RAG</text><text x="1070" y="822" text-anchor="middle" class="port">:27123/:27124 + ChromaDB</text></g>
|
||||
<g><rect x="965" y="775" width="210" height="75" rx="9" fill="#0f172a"/><rect x="965" y="775" width="210" height="75" rx="9" fill="rgba(76,29,149,.4)" stroke="#a78bfa" stroke-width="1.6"/><text x="1070" y="802" text-anchor="middle" class="title">Obsidian / RAG</text><text x="1070" y="821" text-anchor="middle" class="tiny">RAG endpoint :18810</text><text x="1070" y="840" text-anchor="middle" class="port">Chroma obsidian_bge_npu</text></g>
|
||||
<g><rect x="965" y="870" width="210" height="80" rx="9" fill="#0f172a"/><rect x="965" y="870" width="210" height="80" rx="9" fill="rgba(244,63,94,.16)" stroke="#fb7185" stroke-width="1.6" stroke-dasharray="6,4"/><text x="1070" y="896" text-anchor="middle" class="title">NPU sidecars</text><text x="1070" y="917" text-anchor="middle" class="tiny">approved prototypes; not live</text><text x="1070" y="936" text-anchor="middle" class="port">:18818/:18819/:18820/:18829</text></g>
|
||||
|
||||
<!-- host local ai box -->
|
||||
<g><rect x="280" y="675" width="210" height="120" rx="10" fill="#0f172a"/><rect x="280" y="675" width="210" height="120" rx="10" fill="rgba(76,29,149,.4)" stroke="#a78bfa" stroke-width="1.8"/><text x="385" y="706" text-anchor="middle" class="title">host local AI</text><text x="385" y="730" text-anchor="middle" class="tiny">llama.cpp :18806</text><text x="385" y="752" text-anchor="middle" class="tiny">Ollama fallback :18807</text><text x="385" y="774" text-anchor="middle" class="tiny">OpenVINO NPU embed :18817</text></g>
|
||||
<g><rect x="280" y="675" width="210" height="145" rx="10" fill="#0f172a"/><rect x="280" y="675" width="210" height="145" rx="10" fill="rgba(76,29,149,.4)" stroke="#a78bfa" stroke-width="1.8"/><text x="385" y="706" text-anchor="middle" class="title">host local AI</text><text x="385" y="730" text-anchor="middle" class="tiny">llama.cpp :18806</text><text x="385" y="752" text-anchor="middle" class="tiny">Ollama fallback :18807</text><text x="385" y="774" text-anchor="middle" class="tiny">OpenVINO embed :18817 live</text><text x="385" y="797" text-anchor="middle" class="tiny">Whisper NPU :18816 live</text></g>
|
||||
|
||||
<!-- legend -->
|
||||
<g transform="translate(40,820)">
|
||||
<g transform="translate(40,910)">
|
||||
<text class="tiny" fill="#94a3b8">Legend</text>
|
||||
<rect x="0" y="16" width="14" height="10" fill="rgba(8,51,68,.4)" stroke="#22d3ee"/><text x="22" y="25" class="tiny">Gateway/Search/Voice</text>
|
||||
<rect x="180" y="16" width="14" height="10" fill="rgba(6,78,59,.4)" stroke="#34d399"/><text x="202" y="25" class="tiny">Automation/API</text>
|
||||
<rect x="320" y="16" width="14" height="10" fill="rgba(76,29,149,.4)" stroke="#a78bfa"/><text x="342" y="25" class="tiny">Data/AI stores</text>
|
||||
<rect x="475" y="16" width="14" height="10" fill="rgba(251,146,60,.14)" stroke="#fb923c"/><text x="497" y="25" class="tiny">Event bus/pipeline</text>
|
||||
<line x1="650" y1="22" x2="700" y2="22" class="edgeR"/><text x="710" y="25" class="tiny">Monitoring flows</text>
|
||||
<line x1="650" y1="22" x2="700" y2="22" class="edgeR"/><text x="710" y="25" class="tiny">Monitoring / not-live prototype flows</text>
|
||||
</g>
|
||||
</svg>
|
||||
</div>
|
||||
<div class="cards">
|
||||
<div class="info"><h3>Monitoring model</h3><ul><li>• n8n direct probes critical ports</li><li>• agentmon aggregates Docker/OpenClaw snapshots</li><li>• n8n polls agentmon for stale/degraded state</li></ul></div>
|
||||
<div class="info"><h3>Operational endpoints</h3><ul><li>• n8n: 127.0.0.1:18808</li><li>• agentmon query/UI: 8081 / 8082</li><li>• local LLM/embed: 18806 / 18817</li><li>• Ollama fallback: 18807</li></ul></div>
|
||||
<div class="info"><h3>Operational endpoints</h3><ul><li>• n8n: 127.0.0.1:18808</li><li>• agentmon query/UI: 8081 / 8082</li><li>• live NPU: RAG 18810, Whisper 18816, embeddings 18817</li><li>• prototypes not live-routed: 18818/18819/18820/18829</li></ul></div>
|
||||
<div class="info"><h3>Source paths</h3><ul><li>• Swarm repo: ~/lab/swarm</li><li>• Agentmon repo: ~/lab/agentmon</li><li>• Workflows: swarm-common/n8n-workflows</li></ul></div>
|
||||
</div>
|
||||
<div class="footer">Generated as repo documentation. Open locally in a browser; no JavaScript, all SVG inline.</div>
|
||||
<div class="footer">Generated as repo documentation. Open locally in a browser; no JavaScript, all SVG inline. Dashed red OpenVINO NPU sidecars are approved prototypes only and do not imply live Atlas/Hermes/RAG routing.</div>
|
||||
</div>
|
||||
</body>
|
||||
</html>
|
||||
|
||||
@@ -36,6 +36,7 @@ local AI/search/voice services
|
||||
+--> OpenVINO NPU embeddings :18817
|
||||
+--> Kokoro TTS :18805
|
||||
+--> Whisper NPU :18816
|
||||
+--> approved/not-live NPU sidecars: reranker :18818, router/classifier :18819, GenAI worker :18820, doc/image triage optional :18829
|
||||
```
|
||||
|
||||
See also:
|
||||
@@ -130,6 +131,17 @@ Host/user services:
|
||||
- `voice-memo-processor.service` — `:18813`, voice memo processing
|
||||
- `rag-embedding-health.service` — `:18814`, RAG/embedding health wrapper
|
||||
|
||||
Approved but not live-routed OpenVINO NPU sidecars:
|
||||
|
||||
| Port | Component | State | Safety boundary |
| ---: | --- | --- | --- |
| `18818` | reranker | approved prototype; optional foreground/user-systemd only | request-time only; no Chroma/vector mutation; no live RAG integration unless Will approves |
| `18819` | router/classifier | approved prototype; dry-run only | no Hermes/Atlas routing, memory writes, service restarts, or outbound messages |
| `18820` | bounded GenAI worker | approved prototype | background jobs only; not primary Atlas/Hermes model routing |
| `18829` | document/image triage | CLI-first; optional localhost server | synthetic/non-private smoke data only; no private directory processing; NPU stage is embeddings via `:18817` |
|
||||
|
||||
These sidecars must bind to `127.0.0.1` by default and must not be enabled persistently or wired into live Atlas/Hermes/RAG paths without explicit approval from Will. Any NPU claim requires a positive delta in `/sys/class/accel/accel0/device/npu_busy_time_us`, read before and after inference. HTTP 200 alone is not proof.
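
The busy-time rule can be sketched as a small read-only helper. This is a minimal illustration rather than code from the repo: `read_busy_us`, `busy_delta`, `npu_claim_allowed`, and the injectable `read_counter` parameter are hypothetical names, and only the sysfs path is taken from the text.

```python
from pathlib import Path
from typing import Callable, Optional

# Sysfs counter named in the text; every other name here is illustrative.
NPU_BUSY_FILE = Path("/sys/class/accel/accel0/device/npu_busy_time_us")


def read_busy_us(path: Path = NPU_BUSY_FILE) -> Optional[int]:
    """Return the cumulative NPU busy time in microseconds, or None if unreadable."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None


def busy_delta(
    run_inference: Callable[[], object],
    read_counter: Callable[[], Optional[int]] = read_busy_us,
) -> Optional[int]:
    """Run one inference call and return the counter delta around it.

    None means the counter could not be read, so the NPU claim stays unproven.
    """
    before = read_counter()
    run_inference()
    after = read_counter()
    if before is None or after is None:
        return None
    return after - before


def npu_claim_allowed(delta: Optional[int]) -> bool:
    """Only a strictly positive delta supports an NPU claim; HTTP status is ignored."""
    return delta is not None and delta > 0
```

Passing the counter reader as a parameter keeps the check testable on machines without an NPU; on the host, the default reader uses the real sysfs file.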
|
||||
|
||||
### 5. Obsidian and RAG
|
||||
|
||||
Vault:
|
||||
@@ -201,6 +213,7 @@ From the host:
|
||||
cd /home/will/lab/swarm
|
||||
make status
|
||||
make local-ai-health
|
||||
./scripts/npu-service-health.sh # read-only; includes sysfs busy-time proof for :18817
|
||||
curl -fsS http://127.0.0.1:18808/healthz
|
||||
curl -fsS http://127.0.0.1:8081/healthz
|
||||
curl -fsS 'http://127.0.0.1:8081/v1/events?event_type=swarm.snapshot&limit=1' | jq .
|
||||
@@ -234,3 +247,4 @@ jq '.[0] | {id,name,active,nodes:(.nodes|length)}' /tmp/agentmon-export.json
|
||||
- From `n8n-agent`, use `127.0.0.1:5678` for n8n itself and `172.19.0.1:<host-port>` for host-published swarm services.
|
||||
- Agentmon `/healthz` only proves the web/API process is alive; pair it with snapshot freshness to prove the monitoring pipeline is flowing.
|
||||
- OpenClaw is intentionally dormant unless explicitly re-enabled; do not alert on VMs being shut off by default.
|
||||
- OpenVINO NPU sidecars on `:18818`, `:18819`, `:18820`, and optional `:18829` are prototypes/not-live unless a later approved change installs and routes them. Do not draw live Atlas/Hermes/RAG arrows to them in diagrams until that approval and implementation actually exist.
|
||||
|
||||
@@ -0,0 +1,339 @@
|
||||
# OpenVINO NPU classifier/router dry-run contract
|
||||
|
||||
Status: specification for dry-run prototype refresh
|
||||
Target port: `127.0.0.1:18819`
|
||||
Owner context: Atlas/Hermes local assistant sidecar evaluation
|
||||
|
||||
This service is an advisory classifier for Atlas/Hermes automation hints. It may suggest labels such as tool-needed, memory-candidate type, urgency, workflow category, and safety-confirmation-required, but it must not make or enforce live routing, memory, tool, or safety decisions without a separate explicit approval from Will.
|
||||
|
||||
## Recommended model and runtime
|
||||
|
||||
Recommended v1 runtime: small local Python HTTP/CLI service backed by the existing OpenVINO NPU embeddings service on `127.0.0.1:18817`.
|
||||
|
||||
Recommended v1 model shape:
|
||||
|
||||
- Primary signal: `bge-base-en-v1.5-int8-ov` embeddings from the live embeddings service.
|
||||
- Classifier layer: inspectable deterministic rules plus cosine similarity against curated synthetic/prototype utterances.
|
||||
- Model label: `bge-base-en-v1.5-int8-ov/prototype-router-v0`.
|
||||
- Device proof: request-level `npu_busy_delta_us` from `:18817` plus direct sysfs before/after reads from `/sys/class/accel/accel0/device/npu_busy_time_us`.
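
The prototype-similarity part of the classifier layer can be illustrated with a short sketch. It assumes embeddings arrive as plain lists of floats and uses toy bucket names; `cosine` and `best_prototype_scores` are hypothetical helpers, and how the actual service combines these scores with its rule layer is not shown here.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 when either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_prototype_scores(
    message_vec: Sequence[float],
    prototypes: Dict[str, List[Sequence[float]]],
) -> Dict[str, float]:
    """For each bucket, keep the highest similarity to any of its prototype vectors."""
    return {
        bucket: max((cosine(message_vec, vec) for vec in vectors), default=0.0)
        for bucket, vectors in prototypes.items()
    }


# Toy 2-D vectors stand in for real embeddings from the upstream service.
toy_prototypes = {"devops": [[1.0, 0.0], [0.0, 1.0]], "chat": [[-1.0, 0.0]]}
toy_scores = best_prototype_scores([2.0, 0.0], toy_prototypes)
```

Because each bucket score is a single maximum over named prototypes, a reviewer can trace any label back to the specific prototype utterance that produced it.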
|
||||
|
||||
Why this is preferred for the dry run:
|
||||
|
||||
1. It reuses the already-live NPU embeddings path rather than adding a second model conversion/runtime dependency before contract validation.
|
||||
2. Rules and prototypes are transparent enough for safety-sensitive routing hints; a reviewer can inspect why a message was labeled.
|
||||
3. It avoids fine-tuning or training on private Atlas/Hermes transcripts.
|
||||
4. It keeps the service small, localhost-only, and easy to start/stop during smoke tests.
|
||||
5. It produces NPU activity through the embeddings path while making clear that final decision logic remains advisory.
|
||||
|
||||
Defer a dedicated NPU sequence-classification model such as TinyBERT/MiniLM until the dry-run labels and thresholds have been evaluated against synthetic fixtures and explicitly-approved non-private examples. If pursued later, use OpenVINO Runtime/Optimum export with fixed input shapes suitable for NPU, and keep the rule layer for safety gates.
|
||||
|
||||
## Non-goals and safety invariants
|
||||
|
||||
The service must not:
|
||||
|
||||
- Change Hermes/Atlas model routing, gateway routing, memory writes, tool-use permissions, or safety-confirmation behavior.
|
||||
- Restart, stop, enable, or persist any live Atlas/Hermes/gateway/RAG service.
|
||||
- Bind to anything broader than `127.0.0.1` by default.
|
||||
- Mutate Chroma/vector collections, trigger reindexing, or write to RAG state.
|
||||
- Process private document/image directories or private transcript dumps for smoke testing.
|
||||
- Log raw prompts by default beyond normal foreground stderr during local review.
|
||||
- Claim NPU success from HTTP 200 alone.
|
||||
|
||||
## Endpoint contract
|
||||
|
||||
All HTTP endpoints are local-only by default.
|
||||
|
||||
Base URL:
|
||||
|
||||
```text
|
||||
http://127.0.0.1:18819
|
||||
```
|
||||
|
||||
### GET `/healthz`, `/health`, `/readyz`, `/`
|
||||
|
||||
Purpose: liveness/readiness metadata.
|
||||
|
||||
Response fields:
|
||||
|
||||
- `status`: `starting | ok`
|
||||
- `service`: `atlas-router-classifier`
|
||||
- `version`: service version string
|
||||
- `mode`: always `dry_run`
|
||||
- `model`: model/runtime label
|
||||
- `embed_url`: upstream embeddings URL
|
||||
- `device`: expected to say `NPU-via-embedding-service` or equivalent
|
||||
- `labels`: supported label names
|
||||
- `embedding_dim`: embedding dimension after warmup
|
||||
- `prototype_count`: number of synthetic prototype examples loaded
|
||||
- `prototype_npu_busy_delta_us`: warmup delta reported by upstream embeddings, if available
|
||||
- `npu_busy_time_us`: current sysfs counter value, if readable
|
||||
- `warnings`: list of non-fatal warnings
|
||||
|
||||
A healthy service is not enough to prove NPU execution. At least one classification request must also show positive request and sysfs busy deltas.
|
||||
|
||||
### GET `/v1/labels`
|
||||
|
||||
Purpose: publish schema information without dumping private examples.
|
||||
|
||||
Response fields:
|
||||
|
||||
- `model`
- `thresholds`
  - `tool_needed`: recommended threshold `0.72`
  - `memory_candidate`: recommended threshold `0.78`
  - `safety_confirmation_required`: recommended threshold `0.80`
  - `workflow_category`: recommended threshold `0.52`
- `enums`
  - `memory_candidate`: `none`, `user_preference`, `durable_user_fact`, `environment_fact`, `workflow_convention`, `skill_candidate`
  - `urgency`: `low`, `normal`, `high`, `critical`
  - `workflow_category`: `chat`, `research`, `coding`, `debugging`, `devops`, `smart_home`, `media`, `note_taking`, `productivity`, `kanban`, `unknown`
- `prototype_ids`: names of curated synthetic prototype buckets
|
||||
|
||||
### POST `/v1/classify`
|
||||
|
||||
Purpose: classify one user/task message for advisory dry-run hints.
|
||||
|
||||
Request:
|
||||
|
||||
```json
{
  "id": "optional-trace-id",
  "text": "Urgent: check whether port 18817 is listening and inspect systemd logs.",
  "context": {
    "platform": "cli",
    "source": "user"
  },
  "options": {
    "include_evidence": true,
    "include_embedding_debug": false,
    "dry_run": true
  }
}
```
|
||||
|
||||
Required behavior:
|
||||
|
||||
- Reject empty text with HTTP 400.
|
||||
- Default `dry_run` to true.
|
||||
- Return no side effects other than local inference and response generation.
|
||||
- Include evidence by default unless `include_evidence=false`.
|
||||
- Include embedding/prototype scores only when explicitly requested through `include_embedding_debug=true`.
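
The defaulting rules above can be sketched as a pure normalization step. This is an illustration under stated assumptions, not the service's handler: `BadRequest` and `normalize_classify_request` are hypothetical names, treating whitespace-only text as empty is an assumption, and the sketch does not decide how a request with `dry_run=false` should be handled.

```python
from typing import Any, Dict


class BadRequest(ValueError):
    """Hypothetical error that an HTTP layer would map to status 400."""


def normalize_classify_request(body: Dict[str, Any]) -> Dict[str, Any]:
    """Validate a /v1/classify body and fill in the documented defaults."""
    # Assumption: whitespace-only text counts as empty.
    text = str(body.get("text") or "").strip()
    if not text:
        raise BadRequest("text must be a non-empty string")
    options = body.get("options") or {}
    return {
        "id": body.get("id"),
        "text": text,
        # dry_run and include_evidence default to true; debug scores are opt-in.
        "dry_run": bool(options.get("dry_run", True)),
        "include_evidence": bool(options.get("include_evidence", True)),
        "include_embedding_debug": bool(options.get("include_embedding_debug", False)),
    }
```

Keeping validation separate from inference makes the "no side effects" requirement easier to review, since this step touches nothing but the request body.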
|
||||
|
||||
Response:
|
||||
|
||||
```json
{
  "id": "optional-trace-id",
  "model": "bge-base-en-v1.5-int8-ov/prototype-router-v0",
  "created": 1780590000,
  "duration_ms": 12.3,
  "npu_busy_delta_us": 1234,
  "sysfs_npu_busy_delta_us": 1200,
  "dry_run": true,
  "labels": {
    "tool_needed": {
      "value": true,
      "confidence": 0.84,
      "threshold": 0.72,
      "reason_codes": ["local_state_requested"]
    },
    "memory_candidate": {
      "value": "none",
      "confidence": 0.31,
      "threshold": 0.78,
      "reason_codes": []
    },
    "urgency": {
      "value": "high",
      "confidence": 0.84,
      "scores": {"low": 0.0, "normal": 0.2, "high": 0.84, "critical": 0.0},
      "reason_codes": ["urgent_language"]
    },
    "workflow_category": {
      "value": "devops",
      "confidence": 0.86,
      "scores": {"devops": 0.86, "unknown": 0.14}
    },
    "safety_confirmation_required": {
      "value": false,
      "confidence": 0.0,
      "threshold": 0.8,
      "reason_codes": []
    }
  },
  "warnings": [],
  "evidence": []
}
```
|
||||
|
||||
### POST `/v1/batch_classify`
|
||||
|
||||
Purpose: classify a bounded batch of non-private synthetic or explicitly-approved messages.
|
||||
|
||||
Request:
|
||||
|
||||
```json
{
  "items": [
    {"id": "m1", "text": "What time is it in Seattle right now?"},
    {"id": "m2", "text": "Restart the live Atlas gateway and switch primary routing."}
  ],
  "options": {"include_evidence": false, "dry_run": true}
}
```
|
||||
|
||||
Response:
|
||||
|
||||
- `model`
|
||||
- `duration_ms`
|
||||
- aggregate `npu_busy_delta_us`
|
||||
- `results`: array of `/v1/classify` responses
|
||||
|
||||
Batch limits for prototype review:
|
||||
|
||||
- Keep batches small; the prototype rejects empty batches and batches larger than `OPENVINO_CLASSIFIER_MAX_BATCH_SIZE` (default `32`).
|
||||
- Use only synthetic fixtures unless Will explicitly approves a real non-private sample set.
|
||||
- Do not retain request bodies to disk.
|
||||
|
||||
## CLI contract
|
||||
|
||||
The same implementation should support foreground review from the service directory:
|
||||
|
||||
```bash
cd /home/will/lab/swarm/openvino-classifier-npu
/home/will/.venvs/npu/bin/python router_classifier.py \
  --host 127.0.0.1 \
  --port 18819 \
  --embed-url http://127.0.0.1:18817/v1/embeddings
```
|
||||
|
||||
Required flags/env:
|
||||
|
||||
- `--host` / `OPENVINO_CLASSIFIER_HOST`; default `127.0.0.1`.
|
||||
- `--port` / `OPENVINO_CLASSIFIER_PORT`; default `18819`.
|
||||
- `--embed-url` / `OPENVINO_CLASSIFIER_EMBED_URL`; default `http://127.0.0.1:18817/v1/embeddings`.
|
||||
- `--timeout-s` / `OPENVINO_CLASSIFIER_TIMEOUT_S`; default `30`.
|
||||
- `--max-batch-size` / `OPENVINO_CLASSIFIER_MAX_BATCH_SIZE`; default `32`.
|
||||
- `--no-warmup` to defer prototype embedding until first request.
|
||||
|
||||
A future dedicated CLI mode may be added for one-shot JSONL classification, but foreground HTTP review is sufficient for the dry-run contract.
|
||||
|
||||
## Synthetic smoke-test plan
|
||||
|
||||
Preconditions:
|
||||
|
||||
1. Confirm `:18817` embeddings service is healthy.
|
||||
2. Confirm `:18819` is not already listening.
|
||||
3. Read `/sys/class/accel/accel0/device/npu_busy_time_us` before starting the request smoke.
|
||||
4. Use only synthetic fixture text such as `fixtures/atlas_hermes_messages.jsonl`.
|
||||
|
||||
Unit/schema smoke, no NPU dependency:
|
||||
|
||||
```bash
|
||||
cd /home/will/lab/swarm
|
||||
/home/will/.venvs/npu/bin/python -m unittest discover -s openvino-classifier-npu/tests -v
|
||||
```
|
||||
|
||||
Foreground service smoke:
|
||||
|
||||
```bash
|
||||
ss -ltnp | grep ':18819\b' || true
|
||||
cd /home/will/lab/swarm/openvino-classifier-npu
|
||||
/home/will/.venvs/npu/bin/python router_classifier.py --host 127.0.0.1 --port 18819
|
||||
```
|
||||
|
||||
From another shell:
|
||||
|
||||
```bash
curl -fsS http://127.0.0.1:18819/healthz | jq .
curl -fsS http://127.0.0.1:18819/v1/labels | jq .
curl -fsS http://127.0.0.1:18819/v1/classify \
  -H 'Content-Type: application/json' \
  -d '{"id":"smoke-devops","text":"Urgent: check whether port 18817 is listening and inspect systemd logs.","options":{"include_evidence":true,"dry_run":true}}' | jq .
curl -fsS http://127.0.0.1:18819/v1/classify \
  -H 'Content-Type: application/json' \
  -d '{"id":"smoke-safety","text":"Restart the live Atlas gateway and switch primary routing to the new classifier.","options":{"include_evidence":true,"dry_run":true}}' | jq .
```
|
||||
|
||||
Expected label checks:
|
||||
|
||||
- `smoke-devops`: `tool_needed.value=true`, `urgency.value=high`, `workflow_category.value=devops`.
|
||||
- `smoke-safety`: `safety_confirmation_required.value=true`, no actual restart or routing change.
|
||||
- Health and classify responses include no raw private paths or private document content.
|
||||
|
||||
Shutdown:
|
||||
|
||||
- Stop the foreground server with Ctrl-C.
|
||||
- Re-run `ss -ltnp | grep ':18819\b' || true` and confirm no listener remains.
|
||||
|
||||
## NPU busy-time verification plan
|
||||
|
||||
Use sysfs plus service response fields; do not accept HTTP 200 alone.
|
||||
|
||||
```bash
BUSY=/sys/class/accel/accel0/device/npu_busy_time_us
before=$(cat "$BUSY")
response=$(curl -fsS http://127.0.0.1:18819/v1/classify \
  -H 'Content-Type: application/json' \
  -d '{"id":"npu-proof","text":"Check current systemd service status for the embeddings service.","options":{"include_evidence":false,"dry_run":true}}')
after=$(cat "$BUSY")
echo "$response" | jq '{npu_busy_delta_us, sysfs_npu_busy_delta_us, warnings}'
echo "outer_sysfs_npu_busy_delta_us=$((after-before))"
```
|
||||
|
||||
Optional localhost smoke helper, after starting the foreground service:
|
||||
|
||||
```bash
/home/will/.venvs/npu/bin/python openvino-classifier-npu/smoke_classifier.py \
  --base-url http://127.0.0.1:18819
```
|
||||
|
||||
Acceptance for an NPU-backed classification request:
|
||||
|
||||
- HTTP request succeeds.
|
||||
- Response `npu_busy_delta_us > 0` from upstream embeddings.
|
||||
- Response `sysfs_npu_busy_delta_us > 0` when sysfs is readable.
|
||||
- Outer shell `after-before > 0`.
|
||||
- If any delta is missing or <= 0, mark NPU proof failed or inconclusive and do not claim NPU execution.
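
The acceptance list above can be expressed as one small verdict function. This is a hedged sketch: `npu_proof_verdict` is a hypothetical helper, and the split between "failed" (a reported delta is zero or negative) and "inconclusive" (a delta is missing) is an assumption, since the contract only says to mark such cases failed or inconclusive.

```python
from typing import Any, Dict, Optional


def npu_proof_verdict(response: Dict[str, Any], outer_delta_us: Optional[int]) -> str:
    """Return "proven", "failed", or "inconclusive" for one successful classify response."""
    deltas = [
        response.get("npu_busy_delta_us"),        # reported by upstream embeddings
        response.get("sysfs_npu_busy_delta_us"),  # measured by the classifier itself
        outer_delta_us,                           # measured by the calling shell/script
    ]
    present = [d for d in deltas if d is not None]
    # Assumption: a present but non-positive delta is a definite failure.
    if any(d <= 0 for d in present):
        return "failed"
    # Assumption: any missing delta leaves the proof inconclusive.
    if len(present) < len(deltas):
        return "inconclusive"
    return "proven"
```

Only a "proven" verdict should be reported as NPU execution; both other outcomes mean the claim must not be made.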
|
||||
|
||||
## Docs and diagram implications
|
||||
|
||||
If this prototype is refreshed or reviewed, update documentation to show:
|
||||
|
||||
- Live baseline remains RAG `:18810`, RAG health `:18814`, Whisper NPU `:18816`, and embeddings `:18817`.
|
||||
- Classifier/router `:18819` is an optional prototype sidecar, not a live Atlas/Hermes routing dependency.
|
||||
- Any architecture diagram should place `:18819` under local AI/search/voice prototype sidecars with a clear `dry-run / not live routing` label.
|
||||
- Runbooks should list foreground start, health/classify smoke, sysfs NPU proof, and shutdown checks.
|
||||
- Service catalog entries should state `not installed/enabled` until Will approves persistent service enablement.
|
||||
- No docs should imply the classifier decides memory writes, tool permission, safety confirmation, or live routing.
|
||||
|
||||
Relevant docs inventory:
|
||||
|
||||
- `docs/swarm-infrastructure.md`
|
||||
- `docs/swarm-infrastructure.html`
|
||||
- `docs/diagram-maintenance.md`
|
||||
- `swarm-common/obsidian-vault/will/will-shared-zap/Runbooks/OpenVINO NPU Services Runbook.md`
|
||||
- `swarm-common/obsidian-vault/will/will-shared-zap/Resources/Service Catalog.md`
|
||||
|
||||
## No-go / defer criteria
|
||||
|
||||
Do not proceed to implementation refresh, persistent service enablement, or live integration if any of the following hold:
|
||||
|
||||
- `:18817` embeddings is unavailable and no approved NPU embedding fallback exists.
|
||||
- `/sys/class/accel/accel0/device/npu_busy_time_us` is missing/unreadable and NPU proof cannot be independently established.
|
||||
- Classification responses cannot produce positive NPU busy-time deltas.
|
||||
- `:18819` is already occupied by an unknown or live service.
|
||||
- Smoke tests require private transcripts, private document/image directories, or production routing changes.
|
||||
- Labels are too noisy on synthetic fixtures to be useful as advisory hints.
|
||||
- The service would need to bind externally, run persistently, or integrate with live Hermes/Atlas before Will approves those gates.
|
||||
- Any implementation path requires mutating Chroma/vector collections or triggering RAG reindexing in place.
|
||||
|
||||
## Implementation handoff notes
|
||||
|
||||
Recommended next engineer actions:
|
||||
|
||||
1. Verify or refresh `openvino-classifier-npu/router_classifier.py` to match this contract.
|
||||
2. Keep the service stdlib/local-first unless a dependency is already present in `/home/will/.venvs/npu`.
|
||||
3. Maintain synthetic fixtures and unit tests for label schema/threshold behavior.
|
||||
4. Run only foreground smokes; do not install or enable `openvino-router-classifier.service`.
|
||||
5. Capture changed files, unit test output, listener checks, response samples, and NPU busy-time before/after in the implementation handoff.
|
||||
@@ -2,6 +2,10 @@
|
||||
|
||||
Dry-run Atlas/Hermes message classifier/router prototype.
|
||||
|
||||
The detailed dry-run contract is in [`CONTRACT.md`](./CONTRACT.md), including the
|
||||
recommended model/runtime, HTTP/CLI schema, smoke-test plan, NPU busy-time proof,
|
||||
docs/diagram implications, and no-go/defer criteria.
|
||||
|
||||
It reuses the existing OpenVINO NPU embeddings service on `127.0.0.1:18817` and
|
||||
serves an inspectable stdlib HTTP API on `127.0.0.1:18819`. It does not change
|
||||
live Hermes/Atlas routing, write memory, mutate vector collections, restart
|
||||
@@ -13,6 +17,7 @@ services, or send external messages.
|
||||
- Default port: `18819`
|
||||
- Default bind: `127.0.0.1`
|
||||
- Upstream: `http://127.0.0.1:18817/v1/embeddings`
|
||||
- Batch limit: `OPENVINO_CLASSIFIER_MAX_BATCH_SIZE`, default `32`
|
||||
- Model label: `bge-base-en-v1.5-int8-ov/prototype-router-v0`
|
||||
- NPU proof: `/sys/class/accel/accel0/device/npu_busy_time_us` before/after plus upstream `npu_busy_delta_us`
|
||||
|
||||
@@ -86,6 +91,10 @@ cd /home/will/lab/swarm/openvino-classifier-npu
|
||||
/home/will/.venvs/npu/bin/python router_classifier.py --host 127.0.0.1 --port 18819
|
||||
```
|
||||
|
||||
Environment variables mirror the flags: `OPENVINO_CLASSIFIER_HOST`,
|
||||
`OPENVINO_CLASSIFIER_PORT`, `OPENVINO_CLASSIFIER_EMBED_URL`,
|
||||
`OPENVINO_CLASSIFIER_TIMEOUT_S`, and `OPENVINO_CLASSIFIER_MAX_BATCH_SIZE`.
|
||||
|
||||
Then from another shell:
|
||||
|
||||
```bash
|
||||
@@ -98,6 +107,15 @@ curl -fsS http://127.0.0.1:18819/v1/classify \
|
||||
A valid NPU-backed response must have positive `npu_busy_delta_us`; HTTP 200 by
|
||||
itself is not considered proof.
|
||||
|
||||
Synthetic fixture smoke helper, after the foreground service is running:
|
||||
|
||||
```bash
|
||||
/home/will/.venvs/npu/bin/python smoke_classifier.py --base-url http://127.0.0.1:18819
|
||||
```
|
||||
|
||||
The helper refuses non-local URLs, checks fixture label expectations, and prints
|
||||
response plus outer sysfs NPU busy deltas.
|
||||
|
||||
## Tests
|
||||
|
||||
Unit tests use a fake embedding client and do not touch the NPU:
|
||||
@@ -116,7 +134,8 @@ after review/approval:
|
||||
```bash
|
||||
cp openvino-router-classifier.service ~/.config/systemd/user/openvino-router-classifier.service
|
||||
systemctl --user daemon-reload
|
||||
systemctl --user enable --now openvino-router-classifier.service
|
||||
systemctl --user start openvino-router-classifier.service
|
||||
systemctl --user status openvino-router-classifier.service --no-pager
|
||||
```
|
||||
|
||||
Do not enable it as part of this prototype task without explicit approval.
|
||||
Do not enable it at boot or connect it to live Atlas/Hermes routing as part of this prototype task without explicit approval. Keep classifier decisions dry-run until a separate approved routing change lands.
|
||||
|
||||
@@ -9,6 +9,7 @@ WorkingDirectory=/home/will/lab/swarm/openvino-classifier-npu
|
||||
Environment=OPENVINO_CLASSIFIER_HOST=127.0.0.1
|
||||
Environment=OPENVINO_CLASSIFIER_PORT=18819
|
||||
Environment=OPENVINO_CLASSIFIER_EMBED_URL=http://127.0.0.1:18817/v1/embeddings
|
||||
Environment=OPENVINO_CLASSIFIER_MAX_BATCH_SIZE=32
|
||||
ExecStart=/home/will/.venvs/npu/bin/python /home/will/lab/swarm/openvino-classifier-npu/router_classifier.py
|
||||
Restart=on-failure
|
||||
RestartSec=5
|
||||
|
||||
@@ -30,6 +30,7 @@ MODEL = "bge-base-en-v1.5-int8-ov/prototype-router-v0"
|
||||
DEFAULT_HOST = "127.0.0.1"
|
||||
DEFAULT_PORT = 18819
|
||||
DEFAULT_EMBED_URL = "http://127.0.0.1:18817/v1/embeddings"
|
||||
DEFAULT_MAX_BATCH_SIZE = 32
|
||||
NPU_BUSY_FILE = Path("/sys/class/accel/accel0/device/npu_busy_time_us")
|
||||
|
||||
WORKFLOW_CATEGORIES = [
|
||||
@@ -150,6 +151,26 @@ def npu_busy_time_us() -> int | None:
|
||||
return None
|
||||
|
||||
|
||||
def env_int(name: str, default: int) -> int:
    raw = os.environ.get(name)
    if raw is None:
        return default
    try:
        return int(raw)
    except ValueError as exc:
        raise SystemExit(f"{name} must be an integer, got {raw!r}") from exc


def env_float(name: str, default: float) -> float:
    raw = os.environ.get(name)
    if raw is None:
        return default
    try:
        return float(raw)
    except ValueError as exc:
        raise SystemExit(f"{name} must be a number, got {raw!r}") from exc
|
||||
|
||||
|
||||
def clamp01(value: float) -> float:
|
||||
return max(0.0, min(1.0, value))
|
||||
|
||||
@@ -220,9 +241,10 @@ class EmbeddingClient:
|
||||
|
||||
|
||||
class ClassifierService:
|
||||
def __init__(self, embed_url: str, *, timeout_s: float = 30.0) -> None:
|
||||
def __init__(self, embed_url: str, *, timeout_s: float = 30.0, max_batch_size: int = DEFAULT_MAX_BATCH_SIZE) -> None:
|
||||
self.embed_url = embed_url
|
||||
self.client = EmbeddingClient(embed_url, timeout_s=timeout_s)
|
||||
self.max_batch_size = max(1, int(max_batch_size))
|
||||
self.loaded_at = time.time()
|
||||
self.prototype_texts: list[str] = []
|
||||
self.prototype_keys: list[str] = []
|
||||
@@ -255,6 +277,7 @@ class ClassifierService:
|
||||
"labels": ["tool_needed", "memory_candidate", "urgency", "workflow_category", "safety_confirmation_required"],
|
||||
"embedding_dim": self.embedding_dim,
|
||||
"prototype_count": len(self.prototype_texts),
|
||||
"max_batch_size": self.max_batch_size,
|
||||
"prototype_npu_busy_delta_us": self.prototype_npu_busy_delta_us,
|
||||
"npu_busy_time_us": npu_busy_time_us(),
|
||||
"uptime_s": round(time.time() - self.loaded_at, 3),
|
||||
@@ -271,6 +294,7 @@ class ClassifierService:
|
||||
"workflow_category": 0.52,
|
||||
},
|
||||
"enums": {"memory_candidate": MEMORY_VALUES, "urgency": URGENCY_VALUES, "workflow_category": WORKFLOW_CATEGORIES},
|
||||
"limits": {"max_batch_size": self.max_batch_size},
|
||||
"prototype_ids": sorted(PROTOTYPES),
|
||||
}
|
||||
|
||||
@@ -351,6 +375,10 @@ class ClassifierService:
|
||||
return response
|
||||
|
||||
def batch_classify(self, items: list[dict[str, Any]], options: dict[str, Any] | None = None) -> dict[str, Any]:
|
||||
if not items:
|
||||
raise ValueError("items must contain at least one classification request")
|
||||
if len(items) > self.max_batch_size:
|
||||
raise ValueError(f"items exceeds max_batch_size={self.max_batch_size}")
|
||||
started = time.perf_counter()
|
||||
results = [self.classify(item.get("id"), str(item.get("text") or ""), options) for item in items]
|
||||
return {
|
||||
@@ -400,13 +428,15 @@ class ClassifierService:
|
||||
high_rule, high_codes, high_ev = best_rule(text, "urgency_high")
|
||||
critical_rule, critical_codes, critical_ev = best_rule(text, "urgency_critical")
|
||||
low_rule = 0.82 if re.search(r"\b(no rush|whenever convenient|low priority|someday|backlog)\b", text, re.I) else 0.0
|
||||
# Urgency is safety-sensitive for notifications. Prefer explicit rules;
|
||||
# use prototype scores only when they are unusually strong.
|
||||
# Urgency is safety-sensitive for notifications, so require explicit
|
||||
# language instead of relying on broad prototype similarity.
|
||||
score_map = {
|
||||
"low": max(low_rule, scores.get("urgency_low", 0.0) if scores.get("urgency_low", 0.0) >= 0.9 else 0.0),
|
||||
# Urgency should be explicit; broad embedding similarity otherwise
|
||||
# turns neutral requests such as "what time is it" into low/high/critical urgency.
|
||||
"low": low_rule,
|
||||
"normal": 0.68,
|
||||
"high": max(high_rule, scores.get("urgency_high", 0.0) if scores.get("urgency_high", 0.0) >= 0.9 else 0.0),
|
||||
"critical": max(critical_rule, scores.get("urgency_critical", 0.0) if scores.get("urgency_critical", 0.0) >= 0.92 else 0.0),
|
||||
"high": high_rule,
|
||||
"critical": critical_rule,
|
||||
}
|
||||
if score_map["critical"] >= 0.9:
|
||||
score_map["normal"] = 0.05
|
||||
@@ -509,13 +539,14 @@ class Handler(BaseHTTPRequestHandler):
|
||||
def main() -> int:
|
||||
parser = argparse.ArgumentParser(description="Dry-run Atlas/Hermes router classifier")
|
||||
parser.add_argument("--host", default=os.environ.get("OPENVINO_CLASSIFIER_HOST", DEFAULT_HOST))
|
||||
parser.add_argument("--port", type=int, default=int(os.environ.get("OPENVINO_CLASSIFIER_PORT", DEFAULT_PORT)))
|
||||
parser.add_argument("--port", type=int, default=env_int("OPENVINO_CLASSIFIER_PORT", DEFAULT_PORT))
|
||||
parser.add_argument("--embed-url", default=os.environ.get("OPENVINO_CLASSIFIER_EMBED_URL", DEFAULT_EMBED_URL))
|
||||
parser.add_argument("--timeout-s", type=float, default=float(os.environ.get("OPENVINO_CLASSIFIER_TIMEOUT_S", "30")))
|
||||
parser.add_argument("--timeout-s", type=float, default=env_float("OPENVINO_CLASSIFIER_TIMEOUT_S", 30.0))
|
||||
parser.add_argument("--max-batch-size", type=int, default=env_int("OPENVINO_CLASSIFIER_MAX_BATCH_SIZE", DEFAULT_MAX_BATCH_SIZE))
|
||||
parser.add_argument("--no-warmup", action="store_true", help="skip prototype embedding warmup until first request")
|
||||
args = parser.parse_args()
|
||||
|
||||
service = ClassifierService(args.embed_url, timeout_s=args.timeout_s)
|
||||
service = ClassifierService(args.embed_url, timeout_s=args.timeout_s, max_batch_size=args.max_batch_size)
|
||||
if not args.no_warmup:
|
||||
service.warmup()
|
||||
httpd = ThreadingHTTPServer((args.host, args.port), Handler)
|
||||
|
||||
@@ -0,0 +1,113 @@
|
||||
#!/usr/bin/env python3
"""Local-only smoke test for the dry-run OpenVINO router classifier.

This script uses only synthetic fixture messages. It assumes router_classifier.py is
already running on localhost and never installs/enables a persistent service.
"""
from __future__ import annotations

import argparse
import json
import sys
import time
import urllib.error
import urllib.request
from pathlib import Path
from typing import Any

DEFAULT_BASE_URL = "http://127.0.0.1:18819"
BUSY_FILE = Path("/sys/class/accel/accel0/device/npu_busy_time_us")
FIXTURE = Path(__file__).resolve().parent / "fixtures" / "atlas_hermes_messages.jsonl"


def npu_busy_time_us() -> int | None:
    try:
        return int(BUSY_FILE.read_text().strip())
    except Exception:
        return None


def get_json(url: str, timeout_s: float) -> dict[str, Any]:
    with urllib.request.urlopen(url, timeout=timeout_s) as response:  # noqa: S310 - localhost smoke URL
        return json.loads(response.read().decode("utf-8"))


def post_json(url: str, payload: dict[str, Any], timeout_s: float) -> dict[str, Any]:
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout_s) as response:  # noqa: S310 - localhost smoke URL
        return json.loads(response.read().decode("utf-8"))


def load_fixture(limit: int) -> list[dict[str, Any]]:
    rows = [json.loads(line) for line in FIXTURE.read_text().splitlines() if line.strip()]
    return rows[:limit]


def assert_expected(result: dict[str, Any], expected: dict[str, Any]) -> list[str]:
    failures: list[str] = []
    labels = result.get("labels", {})
    for key, value in expected.items():
        actual_label = labels.get(key, {})
        actual_value = actual_label.get("value")
        if actual_value != value:
            failures.append(f"{result.get('id')}: {key} expected {value!r}, got {actual_value!r}")
    return failures


def main() -> int:
    parser = argparse.ArgumentParser(description="Smoke-test a running localhost router classifier")
    parser.add_argument("--base-url", default=DEFAULT_BASE_URL)
    parser.add_argument("--timeout-s", type=float, default=30.0)
    parser.add_argument("--limit", type=int, default=10)
    args = parser.parse_args()

    if not args.base_url.startswith("http://127.0.0.1:") and not args.base_url.startswith("http://localhost:"):
        raise SystemExit("refusing non-local base URL; this smoke is localhost-only")

    before = npu_busy_time_us()
    started = time.perf_counter()
    try:
        health = get_json(f"{args.base_url.rstrip('/')}/healthz", args.timeout_s)
        labels = get_json(f"{args.base_url.rstrip('/')}/v1/labels", args.timeout_s)
        rows = load_fixture(args.limit)
        results = []
        failures: list[str] = []
        for row in rows:
            result = post_json(
                f"{args.base_url.rstrip('/')}/v1/classify",
                {"id": row["id"], "text": row["text"], "options": {"include_evidence": False, "dry_run": True}},
                args.timeout_s,
            )
            results.append(result)
            failures.extend(assert_expected(result, row.get("expected", {})))
        after = npu_busy_time_us()
    except urllib.error.URLError as exc:
        raise SystemExit(f"smoke failed: {exc}") from exc

    response_npu_delta = sum((r.get("npu_busy_delta_us") or 0) for r in results)
    outer_sysfs_delta = None if before is None or after is None else after - before
    npu_proven = response_npu_delta > 0 and (outer_sysfs_delta is None or outer_sysfs_delta > 0)
    summary = {
        "ok": not failures,
        "service": health.get("service"),
        "mode": health.get("mode"),
        "model": health.get("model"),
        "label_count": len(labels.get("prototype_ids", [])),
        "fixture_count": len(results),
        "duration_ms": round((time.perf_counter() - started) * 1000, 3),
        "response_npu_busy_delta_us": response_npu_delta,
        "outer_sysfs_npu_busy_delta_us": outer_sysfs_delta,
        "npu_proven": npu_proven,
        "failures": failures,
    }
    print(json.dumps(summary, indent=2, sort_keys=True))
    return 0 if not failures and npu_proven else 1


if __name__ == "__main__":
    raise SystemExit(main())
|
||||
@@ -88,6 +88,14 @@ class RouterClassifierTests(unittest.TestCase):
|
||||
self.assertEqual(len(result["results"]), 2)
|
||||
self.assertGreater(result["npu_busy_delta_us"], 0)
|
||||
|
||||
def test_batch_limits_are_enforced(self):
|
||||
svc = self.service()
|
||||
with self.assertRaisesRegex(ValueError, "at least one"):
|
||||
svc.batch_classify([])
|
||||
too_many = [{"id": str(i), "text": "What time is it?"} for i in range(router_classifier.DEFAULT_MAX_BATCH_SIZE + 1)]
|
||||
with self.assertRaisesRegex(ValueError, "max_batch_size"):
|
||||
svc.batch_classify(too_many)
|
||||
|
||||
def test_fixture_file_is_valid_jsonl(self):
|
||||
fixture = ROOT / "fixtures" / "atlas_hermes_messages.jsonl"
|
||||
rows = [json.loads(line) for line in fixture.read_text().splitlines() if line.strip()]
|
||||
@@ -97,6 +105,17 @@ class RouterClassifierTests(unittest.TestCase):
|
||||
self.assertIn("text", row)
|
||||
self.assertIn("expected", row)
|
||||
|
||||
def test_synthetic_fixture_expectations(self):
|
||||
svc = self.service()
|
||||
fixture = ROOT / "fixtures" / "atlas_hermes_messages.jsonl"
|
||||
rows = [json.loads(line) for line in fixture.read_text().splitlines() if line.strip()]
|
||||
for row in rows:
|
||||
with self.subTest(row=row["id"]):
|
||||
result = svc.classify(row["id"], row["text"], {"include_evidence": False})
|
||||
labels = result["labels"]
|
||||
for label_name, expected_value in row["expected"].items():
|
||||
self.assertEqual(labels[label_name]["value"], expected_value)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
unittest.main()
|
||||
|
||||
@@ -1,7 +1,8 @@
|
||||
# OpenVINO NPU document/image triage prototype
|
||||
|
||||
Local-only prototype for triaging screenshots, photos/scans, and PDF page images.
|
||||
Local-only, CLI-first prototype for triaging screenshots, photos/scans, and PDF page images.
|
||||
It returns structured JSON metadata and explicitly reports CPU vs NPU stages.
|
||||
Optional HTTP is a localhost/loopback-only prototype on `127.0.0.1:18829` when explicitly started; non-loopback binds are rejected and it is not a live Atlas/Hermes/RAG integration.
|
||||
|
||||
Location: `/home/will/lab/swarm/openvino-doc-image-triage-npu/`
|
||||
|
||||
@@ -13,6 +14,8 @@ Location: `/home/will/lab/swarm/openvino-doc-image-triage-npu/`
|
||||
- Full source paths are omitted by default; responses include basename and SHA-256.
|
||||
- Allowed roots are enforced for CLI/server requests.
|
||||
- This prototype does not mutate Obsidian, RAG, Chroma, vector collections, routing, or gateway services.
|
||||
- Do not process broad private document/image directories; use generated synthetic fixtures unless Will explicitly approves a narrow source root.
|
||||
- See `SPEC.md` for the full CLI contract, smoke-test plan, NPU verification plan, docs implications, and no-go/defer criteria.
|
||||
|
||||
## CPU vs NPU stages
|
||||
|
||||
@@ -88,29 +91,31 @@ Include OCR/sidecar text in a single response only when explicitly requested:
|
||||
|
||||
## HTTP usage
|
||||
|
||||
Check that port 18820 is free first:
|
||||
The prototype is CLI-first. HTTP is optional and not enabled by default. If a foreground HTTP server is needed for review, prefer optional port `18829` so it does not collide with the GenAI worker prototype on `18820`. Check the port first:
|
||||
|
||||
```bash
|
||||
ss -ltnp | grep ':18820\b' || true
|
||||
ss -ltnp | grep ':18829\b' || true
|
||||
```
|
||||
|
||||
Start local-only server:
|
||||
Start a local-only server and stop it after the smoke:
|
||||
|
||||
```bash
|
||||
cd /home/will/lab/swarm/openvino-doc-image-triage-npu
|
||||
/home/will/.venvs/npu/bin/python server.py --host 127.0.0.1 --port 18820 --allowed-root "$PWD"
|
||||
/home/will/.venvs/npu/bin/python server.py --host 127.0.0.1 --port 18829 --allowed-root "$PWD"
|
||||
```
|
||||
|
||||
Call it:
|
||||
Call it with synthetic/non-private fixtures only:
|
||||
|
||||
```bash
|
||||
curl -sS http://127.0.0.1:18820/healthz | jq
|
||||
curl -sS http://127.0.0.1:18820/models | jq
|
||||
curl -sS -X POST http://127.0.0.1:18820/triage \
|
||||
curl -sS http://127.0.0.1:18829/healthz | jq
|
||||
curl -sS http://127.0.0.1:18829/models | jq
|
||||
curl -sS -X POST http://127.0.0.1:18829/triage \
|
||||
-H 'Content-Type: application/json' \
|
||||
-d '{"path":"/home/will/lab/swarm/openvino-doc-image-triage-npu/samples/synthetic_invoice.png","options":{"allowed_roots":["/home/will/lab/swarm/openvino-doc-image-triage-npu"]}}' | jq
|
||||
```
|
||||
|
||||
Do not install or enable a persistent service for this prototype without explicit approval, and do not point it at private document/image directories during smoke tests.
|
||||
|
||||
## Smoke test
|
||||
|
||||
```bash
|
||||
@@ -118,7 +123,7 @@ cd /home/will/lab/swarm/openvino-doc-image-triage-npu
|
||||
/home/will/.venvs/npu/bin/python tests/smoke_test.py
|
||||
```
|
||||
|
||||
Expected: JSON ending with `"ok": true`. If the embeddings service is up, the result should show positive NPU busy-time delta and each embedded page should report `verified_npu: true`.
|
||||
Expected: JSON ending with `"ok": true`. The smoke test generates only synthetic fixtures, verifies non-loopback HTTP binds are rejected, starts its temporary server on a preflighted free localhost port, and terminates it before exit. If the embeddings service is up, the result should show positive NPU busy-time delta and each embedded page should report `verified_npu: true`.
|
||||
|
||||
## Example output shape
|
||||
|
||||
|
||||
@@ -0,0 +1,146 @@
|
||||
# OpenVINO NPU document/image triage spec
|
||||
|
||||
Status: CLI-first prototype specification; not a live Atlas/Hermes integration.
|
||||
|
||||
## Safety stance
|
||||
|
||||
- Default workflow is local CLI execution against explicitly named files.
|
||||
- Optional HTTP is disabled unless a human starts it, is constrained to loopback (`127.0.0.1`, `::1`, or `localhost`), and is intended for `127.0.0.1:18829` only.
|
||||
- No persistent systemd unit, Docker service, gateway hook, Atlas/Hermes route, RAG route, Chroma/vector collection mutation, or in-place reindexing is part of this spec.
|
||||
- Smoke data must be synthetic/non-private only. Do not point this tool at Will's private document, image, screenshot, Downloads, Desktop, Obsidian, or photo-library directories without explicit approval.
|
||||
- NPU claims require `/sys/class/accel/accel0/device/npu_busy_time_us` before/after deltas. HTTP 200, JSON output, or model-load success alone is not NPU proof.
|
||||
|
||||
## Recommended model/runtime
|
||||
|
||||
Recommended v1 runtime:
|
||||
|
||||
- File intake, hashing, MIME/extension checks, image/PDF rendering, sidecar/native PDF text extraction, metadata extraction, and category fallback: local Python CPU path using Pillow plus optional `pypdf`/`pypdfium2`.
|
||||
- Needs-attention semantic check: reuse the live localhost OpenVINO embeddings service on `127.0.0.1:18817`, currently `bge-base-en-v1.5-int8-ov`, and verify each embedding call with `npu_busy_time_us` deltas.
|
||||
- Category classification in v1: CPU rule fallback, explicitly reported as not an NPU image model.
|
||||
|
||||
Why this is the recommended v1:
|
||||
|
||||
- It avoids private-data exposure: no external upload path and no broader local file scanning.
|
||||
- It avoids collection/routing risk by using the existing embeddings API as a stateless feature extractor only; it does not write to RAG or Chroma.
|
||||
- It gives a real NPU verification hook for the semantic stage without overclaiming that OCR/image classification are NPU-backed.
|
||||
- It keeps the prototype useful even when optional PDF dependencies or the embeddings service are unavailable: it can fall back to CPU-only metadata/rule output and mark NPU verification false.
|
||||
|
||||
Deferred model work:
|
||||
|
||||
- NPU image category classifier: defer until a static-shape OpenVINO IR image model such as MobileNet/EfficientNet/ResNet is selected, calibrated for the label set, and smoke-tested with busy-time deltas.
|
||||
- NPU OCR/VLM: defer; OCR remains local CPU text plumbing in v1.
|
||||
|
||||
## CLI contract
|
||||
|
||||
Command:
|
||||
|
||||
```bash
|
||||
cd /home/will/lab/swarm/openvino-doc-image-triage-npu
|
||||
/home/will/.venvs/npu/bin/python triage.py \
|
||||
--allowed-root /home/will/lab/swarm/openvino-doc-image-triage-npu \
|
||||
--max-pages 3 \
|
||||
--pretty \
|
||||
samples/synthetic_invoice.png samples/synthetic_invoice.pdf
|
||||
```
|
||||
|
||||
Inputs:
|
||||
|
||||
- Positional `paths`: one or more local image/PDF paths.
|
||||
- `--allowed-root ROOT`: may repeat; every requested path must resolve under one of these roots. Default is current directory.
|
||||
- `--max-pages N`: maximum rendered/extracted PDF pages; default 3.
|
||||
- `--no-embeddings`: disables the localhost `:18817` embedding/NPU check and reports CPU fallback/no text.
|
||||
- `--dry-run`: skip image/PDF rendering while still checking intake/hash/text/metadata where available.
|
||||
- `--include-ocr-text`: include raw extracted/sidecar text in this single response only; off by default.
|
||||
- `--include-full-path`: include resolved full paths; off by default.
|
||||
- `--pretty`: pretty-print JSON.
|
||||
|
||||
Output:
|
||||
|
||||
- Batch JSON: `{ "ok": bool, "files": [...], "generated_at": "..." }`.
|
||||
- Per file result includes `file_id` as `sha256:<digest>`, `source_path_basename`, media type, file size, pages, classification, needs-attention result, metadata counts/flags, privacy flags, and processing-device summary.
|
||||
- Raw OCR/text and full paths are omitted unless explicitly requested.
|
||||
- NPU evidence is per embedding call: `used`, `verified_npu`, `npu_busy_delta_us`, endpoint, and wall time.
|
||||
|
||||
Exit behavior:
|
||||
|
||||
- Exit 0 when all files triage successfully.
|
||||
- Exit 2 when one or more files fail policy/intake/processing checks.
|
||||
|
||||
## Optional localhost HTTP contract
|
||||
|
||||
HTTP is optional and not enabled by this spec. If explicitly started for a smoke or local demo, use localhost and port 18829:
|
||||
|
||||
```bash
|
||||
cd /home/will/lab/swarm/openvino-doc-image-triage-npu
|
||||
ss -ltnp | grep ':18829\b' || true
|
||||
/home/will/.venvs/npu/bin/python server.py --host 127.0.0.1 --port 18829 --allowed-root "$PWD"
|
||||
```
|
||||
|
||||
Endpoints:
|
||||
|
||||
- `GET /healthz` or `/health`: service name, bind policy, configured allowed roots, privacy flags, and current `npu_busy_time_us`.
|
||||
- `GET /models`: reports v1 stages and whether each is CPU or NPU-backed.
|
||||
- `POST /triage`: `{ "path": "/local/file", "options": {...} }` -> `{ "ok": true, "result": ... }`.
|
||||
- `POST /triage/batch`: `{ "paths": ["/local/file"], "options": {...} }` -> batch JSON.
|
||||
|
||||
HTTP privacy/policy rules:
|
||||
|
||||
- Server startup `--allowed-root` is the outer allowlist.
|
||||
- Request `options.allowed_roots` may narrow that allowlist but must not widen it.
|
||||
- Request `options.embedding_url` may only target the configured local loopback embeddings route `http://127.0.0.1:18817/v1/embeddings` (or localhost equivalent); external or alternate endpoints are rejected.
|
||||
- Request bodies and raw text are not logged by the stdlib handler.
|
||||
- Stop the temporary server after the smoke/demo.
|
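The two request-level rules above (roots may only narrow, embedding URLs must stay on the local route) can be sketched as follows. This is a minimal illustration, not the prototype's actual code: `narrow_roots` and `is_allowed_embedding_url` are hypothetical names, and accepting any loopback IP literal as the "localhost equivalent" is an assumption.

```python
import ipaddress
from pathlib import Path
from typing import Iterable, List
from urllib.parse import urlparse

EMBED_PORT = 18817              # live embeddings port named in this spec
EMBED_PATH = "/v1/embeddings"   # route named in this spec


def narrow_roots(requested: Iterable[str], configured: List[Path]) -> List[Path]:
    """Hypothetical helper: accept request roots only if each sits under a startup root."""
    narrowed: List[Path] = []
    for raw in requested:
        candidate = Path(raw).expanduser().resolve()
        for root in configured:
            try:
                candidate.relative_to(root)  # raises ValueError when outside root
            except ValueError:
                continue
            narrowed.append(candidate)
            break
        else:
            # No configured root contains the candidate: this would widen the allowlist.
            raise ValueError("requested root is outside the startup allowlist")
    return narrowed


def is_allowed_embedding_url(url: str) -> bool:
    """Hypothetical helper: allow only the local loopback embeddings route."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    if host != "localhost":
        try:
            if not ipaddress.ip_address(host).is_loopback:
                return False
        except ValueError:
            return False  # not an IP literal and not "localhost"
    try:
        port = parsed.port
    except ValueError:
        return False  # malformed port
    return parsed.scheme == "http" and port == EMBED_PORT and parsed.path == EMBED_PATH
```

Resolving paths before comparison means a symlink or `..` segment cannot be used to step outside the configured root.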
||||
|
||||
## Synthetic smoke-test plan
|
||||
|
||||
Use only generated fixtures under the prototype directory:
|
||||
|
||||
```bash
|
||||
cd /home/will/lab/swarm/openvino-doc-image-triage-npu
|
||||
/home/will/.venvs/npu/bin/python make_samples.py
|
||||
/home/will/.venvs/npu/bin/python tests/smoke_test.py
|
||||
```
|
||||
|
||||
Expected smoke coverage:
|
||||
|
||||
- Creates synthetic invoice/receipt/form-like image/PDF fixtures.
|
||||
- Runs CLI triage against the synthetic invoice image/PDF under an explicit allowed root.
|
||||
- Asserts privacy flags (`external_uploads: false`, no full path by default).
|
||||
- Asserts invoice category/needs-attention behavior on synthetic text.
|
||||
- Starts a temporary localhost HTTP server on a preflighted free ephemeral port, calls `/healthz` and `/triage`, verifies no full path leakage, rejects attempts to widen allowed roots, rejects external embedding URLs, and verifies non-loopback binds are rejected.
|
||||
- Terminates the temporary server.
|
||||
|
||||
The smoke port in tests should stay OS-assigned ephemeral/non-live to avoid claiming `18829` as a persistent service.
|
||||
|
||||
## NPU busy-time verification plan
|
||||
|
||||
For every test that claims NPU use:
|
||||
|
||||
1. Read `/sys/class/accel/accel0/device/npu_busy_time_us` before the operation.
|
||||
2. Perform an operation that should call the live embeddings service on `127.0.0.1:18817` with non-empty synthetic text.
|
||||
3. Read `npu_busy_time_us` after the operation.
|
||||
4. Require both:
|
||||
- the per-result embedding object reports `used: true`, `verified_npu: true`, and `npu_busy_delta_us > 0`; and
|
||||
- the outer before/after sysfs value increased.
|
||||
5. If sysfs is missing or `:18817` is unavailable, do not claim NPU success; report CPU fallback / embedding unavailable and keep the smoke result honest.
|
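The verification steps above can be sketched in Python. This is a minimal illustration under stated assumptions: `read_busy_us` and `verify_npu_use` are hypothetical names, and `operation` stands for any callable that returns the per-result embedding object described in step 4.

```python
from pathlib import Path
from typing import Callable, Optional

COUNTER = Path("/sys/class/accel/accel0/device/npu_busy_time_us")


def read_busy_us(path: Path = COUNTER) -> Optional[int]:
    """Return the busy-time counter, or None when sysfs is missing or unreadable."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None


def verify_npu_use(operation: Callable[[], dict], path: Path = COUNTER) -> dict:
    """Run one operation between two counter reads and require both checks from step 4."""
    before = read_busy_us(path)
    embedding = operation()  # assumed to return the per-result embedding object
    after = read_busy_us(path)
    outer_delta = None if before is None or after is None else after - before
    inner_ok = (
        embedding.get("used") is True
        and embedding.get("verified_npu") is True
        and (embedding.get("npu_busy_delta_us") or 0) > 0
    )
    # Step 5: a missing counter never counts as NPU proof.
    outer_ok = outer_delta is not None and outer_delta > 0
    return {"outer_delta_us": outer_delta, "npu_proven": inner_ok and outer_ok}
```

Passing the counter path as a parameter lets the check run against a temporary file in tests, without touching real hardware.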
||||
|
||||
## Docs and diagram implications
|
||||
|
||||
- Service maps should list document/image triage as CLI-first and optional prototype `127.0.0.1:18829`, not live unless explicitly started.
|
||||
- Diagrams must not draw live Atlas/Hermes/gateway/RAG routing to this triage lane.
|
||||
- If shown with other candidate sidecars, label it separately from live services: live baseline remains RAG `:18810`, Whisper NPU `:18816`, and embeddings `:18817`; prototype sidecars are reranker `:18818`, classifier/router `:18819`, GenAI worker `:18820`, and optional doc/image triage `:18829`.
|
||||
- Runbooks should include CLI smoke, localhost listener checks, busy-time delta verification, and server shutdown instructions.
|
||||
- Documentation should state CPU vs NPU stages explicitly so the prototype does not imply NPU OCR or NPU image classification.
|
||||
|
||||
## No-go / defer criteria
|
||||
|
||||
Do not proceed to implementation, live integration, or persistent service enablement if any of these are true:
|
||||
|
||||
- Will has not explicitly approved live routing or persistent service enablement.
|
||||
- The requested source path is a private document/image directory or broad home-directory scan rather than synthetic fixtures or an explicitly approved narrow root.
|
||||
- The workflow would mutate Obsidian, RAG, Chroma/vector collections, or reindex in place.
|
||||
- The optional server would need to bind anywhere other than localhost.
|
||||
- NPU busy-time does not increase for an operation being described as NPU-backed.
|
||||
- Raw OCR text or full paths would be logged, uploaded, stored durably, or returned without explicit request.
|
||||
- PDF/image dependencies are missing and the task requires rendered page analysis rather than metadata/text-only fallback.
|
||||
- A future image classifier/OCR/VLM model has not been selected, converted/quantized to OpenVINO, calibrated for the task, and verified on synthetic fixtures with busy-time deltas.
|
||||
@@ -13,6 +13,7 @@ configured allowed roots. It never uploads document/image contents externally.
|
||||
from __future__ import annotations
|
||||
|
||||
import argparse
|
||||
import ipaddress
|
||||
import json
|
||||
import os
|
||||
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
|
||||
@@ -23,6 +24,19 @@ from urllib.parse import urlparse
|
||||
from triage import DEFAULT_EMBED_URL, TriageOptions, read_npu_busy, triage_batch, triage_file
|
||||
|
||||
|
||||
def _validate_loopback_host(host: str) -> str:
|
||||
"""Reject non-loopback binds; this prototype is never a LAN service."""
|
||||
normalized = host.strip()
|
||||
if normalized == "localhost":
|
||||
return normalized
|
||||
try:
|
||||
if ipaddress.ip_address(normalized).is_loopback:
|
||||
return normalized
|
||||
except ValueError:
|
||||
pass
|
||||
raise ValueError("host must be localhost/loopback for this prototype")
|
||||
|
||||
|
||||
def _roots_within_configured(requested_roots: list[Any], configured_roots: list[Path]) -> list[Path]:
|
||||
"""Return request roots only when they narrow the startup allowlist."""
|
||||
narrowed: list[Path] = []
|
||||
@@ -163,13 +177,17 @@ class Handler(BaseHTTPRequestHandler):
|
||||
def main() -> int:
|
||||
parser = argparse.ArgumentParser(description="Local-only doc/image triage HTTP server")
|
||||
parser.add_argument("--host", default=os.environ.get("DOC_IMAGE_TRIAGE_HOST", "127.0.0.1"))
|
||||
parser.add_argument("--port", type=int, default=int(os.environ.get("DOC_IMAGE_TRIAGE_PORT", "18820")))
|
||||
parser.add_argument("--port", type=int, default=int(os.environ.get("DOC_IMAGE_TRIAGE_PORT", "18829")))
|
||||
parser.add_argument("--allowed-root", action="append", default=[], help="allowed local root; may repeat")
|
||||
args = parser.parse_args()
|
||||
try:
|
||||
host = _validate_loopback_host(args.host)
|
||||
except ValueError as exc:
|
||||
parser.error(str(exc))
|
||||
roots = [Path(p).expanduser().resolve() for p in args.allowed_root] or [Path.cwd().resolve()]
|
||||
httpd = ThreadingHTTPServer((args.host, args.port), Handler)
|
||||
httpd = ThreadingHTTPServer((host, args.port), Handler)
|
||||
httpd.allowed_roots = roots # type: ignore[attr-defined]
|
||||
print(json.dumps({"service": "openvino-doc-image-triage-npu", "host": args.host, "port": args.port, "allowed_roots": [str(p) for p in roots]}), flush=True)
|
||||
print(json.dumps({"service": "openvino-doc-image-triage-npu", "host": host, "port": args.port, "allowed_roots": [str(p) for p in roots]}), flush=True)
|
||||
httpd.serve_forever()
|
||||
return 0
|
||||
|
||||
|
||||
@@ -2,6 +2,7 @@
|
||||
from __future__ import annotations
|
||||
|
||||
import json
|
||||
import socket
|
||||
import subprocess
|
||||
import sys
|
||||
import tempfile
|
||||
@@ -42,6 +43,29 @@ def busy() -> int | None:
|
||||
return None
|
||||
|
||||
|
||||
def choose_free_loopback_port() -> int:
|
||||
"""Ask the OS for a free localhost port and verify it is not listening yet."""
|
||||
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
|
||||
sock.bind(("127.0.0.1", 0))
|
||||
port = int(sock.getsockname()[1])
|
||||
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
|
||||
probe.settimeout(0.25)
|
||||
assert probe.connect_ex(("127.0.0.1", port)) != 0, f"selected port already has a listener: {port}"
|
||||
return port
|
||||
|
||||
|
||||
def assert_loopback_bind_policy() -> None:
|
||||
blocked = subprocess.run(
|
||||
[sys.executable, "server.py", "--host", "0.0.0.0", "--port", "0", "--allowed-root", str(ROOT)],
|
||||
cwd=ROOT,
|
||||
stdout=subprocess.PIPE,
|
||||
stderr=subprocess.PIPE,
|
||||
text=True,
|
||||
)
|
||||
assert blocked.returncode != 0, blocked.stdout + blocked.stderr
|
||||
assert "loopback" in blocked.stderr.lower(), blocked.stderr
|
||||
|
||||
|
||||
def main() -> int:
|
||||
run([sys.executable, "make_samples.py"])
|
||||
invoice = SAMPLES / "synthetic_invoice.png"
|
||||
@@ -69,20 +93,23 @@ def main() -> int:
|
||||
assert (emb.get("npu_busy_delta_us") or 0) > 0, emb
|
||||
assert after > before, {"before": before, "after": after, "embedding": emb}
|
||||
|
||||
# HTTP smoke on an ephemeral localhost port so we do not collide with 18820 during tests.
|
||||
proc = subprocess.Popen([sys.executable, "server.py", "--host", "127.0.0.1", "--port", "18828", "--allowed-root", str(ROOT)], cwd=ROOT, stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True)
|
||||
# HTTP smoke on a preflighted free localhost port so we do not collide with live/prototype ports.
|
||||
assert_loopback_bind_policy()
|
||||
smoke_port = choose_free_loopback_port()
|
||||
base_url = f"http://127.0.0.1:{smoke_port}"
|
||||
proc = subprocess.Popen([sys.executable, "server.py", "--host", "127.0.0.1", "--port", str(smoke_port), "--allowed-root", str(ROOT)], cwd=ROOT, stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True)
|
||||
try:
|
||||
deadline = time.time() + 5
|
||||
while time.time() < deadline:
|
||||
try:
|
||||
health = urllib.request.urlopen("http://127.0.0.1:18828/healthz", timeout=1).read()
|
||||
health = urllib.request.urlopen(f"{base_url}/healthz", timeout=1).read()
|
||||
assert b"openvino-doc-image-triage-npu" in health
|
||||
break
|
||||
except Exception:
|
||||
time.sleep(0.1)
|
||||
else:
|
||||
raise AssertionError("server did not become ready")
|
||||
resp = post_json("http://127.0.0.1:18828/triage", {"path": str(invoice), "options": {"allowed_roots": [str(ROOT)]}})
|
||||
resp = post_json(f"{base_url}/triage", {"path": str(invoice), "options": {"allowed_roots": [str(ROOT)]}})
|
||||
assert resp["ok"] is True, resp
|
||||
assert resp["result"]["source_path_basename"] == "synthetic_invoice.png"
|
||||
assert "source_path" not in resp["result"]
|
||||
@@ -92,7 +119,7 @@ def main() -> int:
|
||||
outside.write(b"sensitive text outside configured artifact root")
|
||||
outside.flush()
|
||||
status, blocked = post_json_status(
|
||||
"http://127.0.0.1:18828/triage",
|
||||
f"{base_url}/triage",
|
||||
{"path": outside.name, "options": {"allowed_roots": ["/tmp"], "dry_run": True, "use_embeddings": False}},
|
||||
)
|
||||
assert status == 400, blocked
|
||||
@@ -101,7 +128,7 @@ def main() -> int:
|
||||
|
||||
# Request bodies must not redirect extracted text to caller-supplied endpoints.
|
||||
status, blocked = post_json_status(
|
||||
"http://127.0.0.1:18828/triage",
|
||||
f"{base_url}/triage",
|
||||
{"path": str(invoice), "options": {"embedding_url": "http://198.51.100.1:9/v1/embeddings"}},
|
||||
)
|
||||
assert status == 400, blocked
|
||||
|
||||
@@ -0,0 +1,306 @@
|
||||
# Bounded OpenVINO GenAI NPU worker contract
|
||||
|
||||
Status: prototype contract implemented locally; not a live Atlas/Hermes routing dependency.
|
||||
Default address: `http://127.0.0.1:18820`.
|
||||
|
||||
## Purpose and hard boundary
|
||||
|
||||
This worker is a local-only sidecar for small, bounded generation jobs that are useful around the assistant stack but are not primary chat: title drafting, short summaries, notification condensation, and memory-candidate extraction. It must not be used as Atlas/Hermes primary model routing, gateway fallback routing, autonomous tool-calling, or an unbounded chat endpoint without a separate approval gate.
|
||||
|
||||
Hard boundaries:
|
||||
|
||||
- Bind to `127.0.0.1` by default; non-local bind is a code/ops review item, not a runtime flag to casually change.
|
||||
- Do not enable a persistent systemd/Docker service as part of smoke testing.
|
||||
- Do not restart or reconfigure Atlas, Hermes, gateway, LiteLLM, RAG, or n8n routing to call this worker without explicit approval from Will.
|
||||
- Do not write memory, mutate Chroma/vector collections, trigger RAG reindexing, or process private document/image directories.
|
||||
- Do not log raw prompts or raw request bodies by default.
|
||||
- Treat HTTP success as insufficient for NPU claims; require positive `/sys/class/accel/accel0/device/npu_busy_time_us` delta for generation.
|
||||
|
||||
## Recommended model/runtime
|
||||
|
||||
Recommended first model:
|
||||
|
||||
- Model id: `OpenVINO/Qwen2.5-1.5B-Instruct-int4-ov`
|
||||
- Local path: `/home/will/models/openvino-genai/Qwen2.5-1.5B-Instruct-int4-ov`
|
||||
- Runtime: `/home/will/.venvs/npu` with `openvino-genai==2026.2.0.0`
|
||||
- Device: OpenVINO GenAI `NPU`
|
||||
- Compile cache: `/home/will/.cache/openvino/genai-npu/qwen2.5-1.5b-int4`
|
||||
|
||||
Why this model/runtime:
|
||||
|
||||
- It is already staged in the repo prototype and has a local smoke observation with positive NPU busy-time delta.
|
||||
- It is an OpenVINO IR model with INT4-compressed weights, which keeps memory/compile pressure low enough for a sidecar on the shared NPU.
|
||||
- Qwen2.5-1.5B-Instruct is large enough for formatting/summarization/notification jobs but small enough to keep latency bounded. It should not be marketed as a high-quality general assistant model.
|
||||
- The Hugging Face model card identifies it as Qwen2.5-1.5B-Instruct converted to OpenVINO IR with INT4_SYM NNCF weight compression and states compatibility with OpenVINO 2025.1.0+; the local runtime is newer than that baseline.
|
||||
- OpenVINO GenAI `LLMPipeline` is the right first runtime because the existing local NPU stack already uses OpenVINO GenAI successfully for Whisper, and it exposes a simple bounded generate call with cache controls.
|
||||
|
||||
Deferred alternatives:
|
||||
|
||||
- Larger 3B/7B local LLMs: defer until the 1.5B contract proves stable; larger models increase compile time, memory pressure, and NPU contention.
|
||||
- CPU/GPU fallback inside this service: defer; fallback would blur the NPU verification contract. If fallback is later approved, return `device_actual` and keep NPU-only health separate.
|
||||
- Manual `EXPORT_BLOB`/`BLOB_PATH`: defer until compile latency is proven to dominate despite `CACHE_DIR`. If used later, record OpenVINO version, NPU compiler/driver versions, model id, quantization flags, and source model path; invalidate after OpenVINO/NPU driver upgrades.
|
||||
|
||||
## Runtime bounds
|
||||
|
||||
Pipeline configuration for the first milestone:
|
||||
|
||||
```text
|
||||
CACHE_DIR=/home/will/.cache/openvino/genai-npu/qwen2.5-1.5b-int4
|
||||
MAX_PROMPT_LEN=1024
|
||||
MIN_RESPONSE_LEN=64
|
||||
PREFILL_HINT=DYNAMIC
|
||||
GENERATE_HINT=FAST_COMPILE
|
||||
```
|
||||
|
||||
Request bounds:
|
||||
|
||||
- `input`: required non-empty string; max `6000` characters before prompt templating.
|
||||
- `job`: one of `title`, `summary`, `notification`, `memory_candidate`.
|
||||
- `max_new_tokens`: optional; default by job; hard max `256`.
|
||||
- Concurrency: generation must be serialized inside the process with a lock because the NPU is shared with Whisper/embeddings/prototype sidecars.
|
||||
- Logging: log method/path/status and timing only; never log raw `input` or generated text by default.
|
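The request bounds above could be enforced with a small validator like the one below. It is a sketch only: `validate_request` is a hypothetical name, the per-job defaults are the ones listed in the prompt/job contract of this document, and the lower bound of 1 for `max_new_tokens` is an assumption.

```python
from typing import Any, Dict

MAX_INPUT_CHARS = 6000
HARD_MAX_NEW_TOKENS = 256
# Per-job defaults as listed in the prompt/job contract.
DEFAULT_MAX_NEW_TOKENS = {
    "title": 32,
    "summary": 160,
    "notification": 96,
    "memory_candidate": 192,
}
# Backward-compatible alias described in the HTTP contract.
JOB_ALIASES = {"memory": "memory_candidate"}


def validate_request(body: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical validator: return a normalized request or raise ValueError."""
    job = body.get("job")
    if not isinstance(job, str):
        raise ValueError("job must be a string")
    job = JOB_ALIASES.get(job, job)
    if job not in DEFAULT_MAX_NEW_TOKENS:
        raise ValueError("unsupported job")

    text = body.get("input")
    if not isinstance(text, str) or not text:
        raise ValueError("input must be a non-empty string")
    if len(text) > MAX_INPUT_CHARS:
        raise ValueError("input exceeds 6000 characters")

    tokens = body.get("max_new_tokens", DEFAULT_MAX_NEW_TOKENS[job])
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(tokens, bool) or not isinstance(tokens, int):
        raise ValueError("max_new_tokens must be an integer")
    if not 1 <= tokens <= HARD_MAX_NEW_TOKENS:  # lower bound of 1 is an assumption
        raise ValueError("max_new_tokens must be between 1 and 256")
    return {"job": job, "input": text, "max_new_tokens": tokens}
```

The error messages contain no request content, which fits the rule that raw `input` is never logged.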
||||
|
||||
Expected latency target:
|
||||
|
||||
- Cold-ish first generation with cache available: acceptable if roughly 15 seconds or less for a short prompt on the staged model.
|
||||
- Warm short jobs: target under 5 seconds for `title`/`notification` and under 10 seconds for `summary`/`memory_candidate`.
|
||||
- Defer promotion if p95 warm latency exceeds 15 seconds for 24-96 generated tokens, or if cold compile regularly blocks the NPU long enough to degrade live Whisper/embeddings.
|
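As a rough illustration of the defer rule above, a reviewer could compute p95 over recorded warm latencies as follows. The helper names and the nearest-rank percentile method are assumptions; the samples are assumed to come from warm runs that generated 24-96 tokens.

```python
from typing import Sequence

P95_LIMIT_MS = 15_000  # 15 seconds, the defer threshold stated above


def p95_nearest_rank(samples_ms: Sequence[float]) -> float:
    """Nearest-rank 95th percentile: the value at rank ceil(0.95 * n)."""
    if not samples_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(samples_ms)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]


def should_defer(samples_ms: Sequence[float]) -> bool:
    """True when warm p95 latency exceeds the prototype acceptance target."""
    return p95_nearest_rank(samples_ms) > P95_LIMIT_MS
```

With 20 samples the rank is 19, so a single slow outlier does not trigger a deferral but two do.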
||||
|
||||
These are prototype acceptance targets, not SLOs for live Atlas routing.
|
||||
|
||||
## CLI contract
|
||||
|
||||
Command shape:
|
||||
|
||||
```bash
|
||||
cd /home/will/lab/swarm/openvino-genai-npu-worker
|
||||
/home/will/.venvs/npu/bin/python worker.py \
|
||||
--job title \
|
||||
--input 'Synthetic non-private text to title.' \
|
||||
--max-new-tokens 32
|
||||
```
|
||||
|
||||
CLI stdout is JSON with the same response shape as HTTP generation. Exit code must be:
|
||||
|
||||
- `0` when the job succeeds and `npu_busy_delta_us > 0`.
|
||||
- non-zero when input validation fails, model load/generation fails, or NPU busy-time delta is not positive.
|
||||
|
||||
The CLI must not write memory, change service routing, or start persistent services.
|
||||
|
||||
## HTTP contract
|
||||
|
||||
Start temporary local server only:
|
||||
|
||||
```bash
|
||||
cd /home/will/lab/swarm/openvino-genai-npu-worker
|
||||
/home/will/.venvs/npu/bin/python worker.py --host 127.0.0.1 --port 18820
|
||||
```
|
||||
|
||||
Endpoints:
|
||||
|
||||
```text
|
||||
GET /healthz
|
||||
GET /models
|
||||
POST /v1/worker/generate
|
||||
POST /v1/worker/extract-memory-candidates
|
||||
POST /v1/worker/condense-notification
|
||||
```
|
||||
|
||||
`GET /healthz` response fields:
|
||||
|
||||
```json
|
||||
{
|
||||
"ok": true,
|
||||
"model": "OpenVINO/Qwen2.5-1.5B-Instruct-int4-ov",
|
||||
"model_path": "/home/will/models/openvino-genai/Qwen2.5-1.5B-Instruct-int4-ov",
|
||||
"device": "NPU",
|
||||
"cache_dir": "/home/will/.cache/openvino/genai-npu/qwen2.5-1.5b-int4",
|
||||
"cache_exists": true,
|
||||
"loaded": false,
|
||||
"initial_load_ms": null,
|
||||
"busy_time_us": 0,
|
||||
"max_input_chars": 6000,
|
||||
"jobs": ["memory_candidate", "notification", "summary", "title"],
|
||||
"bind": "127.0.0.1:18820"
|
||||
}
|
||||
```
|
||||
|
||||
`POST /v1/worker/generate` request:
|
||||
|
||||
```json
|
||||
{
|
||||
"job": "summary",
|
||||
"input": "Synthetic non-private text to summarize.",
|
||||
"max_new_tokens": 80
|
||||
}
|
||||
```
|
||||
|
||||
Specialized aliases:
|
||||
|
||||
- `POST /v1/worker/extract-memory-candidates` implies `job=memory_candidate`.
|
||||
- `POST /v1/worker/condense-notification` implies `job=notification`.
|
||||
- Backward-compatible request `job=memory` may map to `memory_candidate`, but new clients should use `memory_candidate`.
|
||||
|
||||
Successful generation response:
|
||||
|
||||
```json
|
||||
{
|
||||
"model": "OpenVINO/Qwen2.5-1.5B-Instruct-int4-ov",
|
||||
"device": "NPU",
|
||||
"job": "summary",
|
||||
"text": "...",
|
||||
"json": null,
|
||||
"timing_ms": {
|
||||
"load": 0.0,
|
||||
"initial_load": 10989.08,
|
||||
"generate": 3157.94,
|
||||
"total": 3157.94
|
||||
},
|
||||
"npu_busy_delta_us": 2650724,
|
||||
"npu_busy_before_us": 123,
|
||||
"npu_busy_after_us": 2650847,
|
||||
"cache_dir": "/home/will/.cache/openvino/genai-npu/qwen2.5-1.5b-int4"
|
||||
}
|
||||
```
|
||||
|
||||
Validation/error behavior:
|
||||
|
||||
- Unsupported path: `404` JSON `{"error":"not found"}`.
|
||||
- Unsupported job, empty input, too-long input, invalid token bound, missing model, or generation failure: JSON `{"error":"..."}` with a non-2xx status. The current stdlib prototype returns `400` for all of these errors, and future implementations should also stay non-2xx.
|
||||
- If `npu_busy_delta_us <= 0`, the response should be treated as failed by smoke tests even if an HTTP handler emitted `200`; the refreshed prototype returns `503` with the generation payload plus an `error` field.
|
||||
|
||||
## Prompt/job contract
|
||||
|
||||
`title`:
|
||||
|
||||
- Input: short task/log/message excerpt.
|
||||
- Output: one title, 8 words or fewer, no markdown required.
|
||||
- Default `max_new_tokens`: 32.
|
||||
|
||||
`summary`:
|
||||
|
||||
- Input: synthetic/non-private text excerpt.
|
||||
- Output: one short paragraph or up to 4 bullets.
|
||||
- Default `max_new_tokens`: 160.
|
||||
|
||||
`notification`:
|
||||
|
||||
- Input: synthetic/non-private alert/log excerpt.
|
||||
- Output target: JSON object with `severity`, `category`, `summary`, `action_needed`.
|
||||
- Default `max_new_tokens`: 96.
|
||||
- Client must tolerate `json: null` and parse/validate before using output.
|
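A client that tolerates `json: null` might look like the sketch below. `parse_notification` is a hypothetical client-side helper, and falling back to parsing the `text` field is an assumption about how a caller could recover structured output; the required keys are the ones listed above.

```python
import json
from typing import Any, Dict, Optional

REQUIRED_KEYS = ("severity", "category", "summary", "action_needed")


def parse_notification(response: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Return a validated notification object, or None when output is unusable."""
    candidate = response.get("json")
    if candidate is None:
        # Assumed fallback: the model may have emitted JSON as plain text.
        try:
            candidate = json.loads(response.get("text") or "")
        except (TypeError, ValueError):
            return None
    if not isinstance(candidate, dict):
        return None
    if any(key not in candidate for key in REQUIRED_KEYS):
        return None
    return candidate
```

Returning `None` rather than raising lets the caller fall back to showing the original alert unchanged.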
||||
|
||||
`memory_candidate`:
|
||||
|
||||
- Input: synthetic/non-private conversation excerpt.
|
||||
- Output target: JSON object with `candidates` and `notes`; candidates are proposals only.
|
||||
- Default `max_new_tokens`: 192.
|
||||
- This worker must never call Hermes memory tools or write durable memory directly.
|
||||
|
||||
## Smoke-test plan using non-private data
|
||||
|
||||
Do not use private vault notes, screenshots, email, chat logs, or document/image directories. Use synthetic text like this:
|
||||
|
||||
```text
|
||||
Atlas received a kanban notification that an OpenVINO NPU prototype finished smoke testing. The reviewer needs a concise status and next action. No live gateway routing changed.
|
||||
```
|
||||
|
||||
Direct NPU smoke:
|
||||
|
||||
```bash
|
||||
cd /home/will/lab/swarm/openvino-genai-npu-worker
|
||||
before=$(cat /sys/class/accel/accel0/device/npu_busy_time_us)
|
||||
/home/will/.venvs/npu/bin/python smoke_llm_npu.py \
|
||||
--prompt 'Write a concise title for: synthetic NPU worker contract smoke.' \
|
||||
--max-new-tokens 24
|
||||
status=$?
|
||||
after=$(cat /sys/class/accel/accel0/device/npu_busy_time_us)
|
||||
printf 'external_busy_delta_us=%s\n' "$((after-before))"
|
||||
test "$status" -eq 0
|
||||
test "$((after-before))" -gt 0
|
||||
```
|
||||
|
||||
Temporary HTTP smoke:
|
||||
|
||||
```bash
|
||||
cd /home/will/lab/swarm/openvino-genai-npu-worker
|
||||
/home/will/.venvs/npu/bin/python worker.py --host 127.0.0.1 --port 18820 &
|
||||
pid=$!
|
||||
trap 'kill "$pid" 2>/dev/null || true' EXIT
|
||||
|
||||
curl -fsS --retry 10 --retry-delay 1 --retry-connrefused http://127.0.0.1:18820/healthz | python -m json.tool
|
||||
before=$(cat /sys/class/accel/accel0/device/npu_busy_time_us)
|
||||
curl -fsS http://127.0.0.1:18820/v1/worker/generate \
|
||||
-H 'Content-Type: application/json' \
|
||||
-d '{"job":"title","input":"Synthetic NPU worker smoke with no routing changes.","max_new_tokens":24}' \
|
||||
| tee /tmp/openvino-genai-worker-smoke.json \
|
||||
| python -m json.tool
|
||||
after=$(cat /sys/class/accel/accel0/device/npu_busy_time_us)
|
||||
python - <<'PY'
|
||||
import json
|
||||
p=json.load(open('/tmp/openvino-genai-worker-smoke.json'))
|
||||
assert p['npu_busy_delta_us'] > 0, p
|
||||
assert p['device'] == 'NPU', p
|
||||
PY
|
||||
test "$((after-before))" -gt 0
|
||||
kill "$pid"
|
||||
trap - EXIT
|
||||
```
|
||||
|
||||
Also verify the temporary listener is gone:
|
||||
|
||||
```bash
|
||||
ss -ltnp | grep ':18820' && { echo 'temporary smoke server still running'; exit 1; } || true
|
||||
```
|
||||
|
||||
Unit tests that do not load the model or require private data:
|
||||
|
||||
```bash
|
||||
cd /home/will/lab/swarm/openvino-genai-npu-worker
|
||||
python -m pytest -q
|
||||
```
|
||||
|
||||
## NPU busy-time verification plan
|
||||
|
||||
Acceptance for any NPU claim requires all of the following:
|
||||
|
||||
1. Confirm the sysfs counter exists and is readable:
|
||||
`test -r /sys/class/accel/accel0/device/npu_busy_time_us`.
|
||||
2. Read `busy_before` immediately before the generation call.
|
||||
3. Run exactly one bounded generation against the candidate worker.
|
||||
4. Read `busy_after` immediately after generation completes.
|
||||
5. Require `busy_after > busy_before` and response `npu_busy_delta_us > 0`.
|
||||
6. Record model id, runtime version, prompt chars, max tokens, load/generate timings, and busy delta in the review handoff.
|
||||
7. If the counter is unchanged, mark the smoke as failed even if HTTP returned `200` and text was generated.
|
||||
|
||||
Because the NPU is shared, a positive external delta proves NPU activity during the window but not exclusive attribution. Prefer a quiet window with no concurrent Whisper/embedding jobs for review-grade measurements; otherwise repeat the run and compare the worker-reported internal delta with the external counter.
|
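
Steps 2 through 5 and step 7 of the plan above can be sketched as a small wrapper around one generation call. This is an illustration under assumptions, not the worker's implementation: `run_with_busy_check` is a hypothetical name, and `generate` stands in for whatever bounded call the reviewer makes.

```python
from __future__ import annotations

from pathlib import Path
from typing import Callable, TypeVar

T = TypeVar("T")

BUSY_PATH = Path("/sys/class/accel/accel0/device/npu_busy_time_us")


def read_busy(path: Path = BUSY_PATH) -> int:
    """Read the cumulative NPU busy time in microseconds."""
    return int(path.read_text().strip())


def run_with_busy_check(generate: Callable[[], T], path: Path = BUSY_PATH) -> tuple[T, int]:
    """Run one generation and fail unless the busy-time counter increased."""
    before = read_busy(path)   # step 2: read immediately before the call
    result = generate()        # step 3: exactly one bounded generation
    after = read_busy(path)    # step 4: read immediately after completion
    delta = after - before
    if delta <= 0:
        # step 7: an unchanged counter is a failure even if text was generated
        raise RuntimeError(f"NPU busy-time counter did not increase (delta={delta})")
    return result, delta
```

Because the counter is shared across NPU clients, a positive delta from this wrapper shows activity during the window, not exclusive attribution.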
||||
|
||||
## Docs/diagram implications
|
||||
|
||||
If this worker is kept as a prototype, docs and diagrams should show:
|
||||
|
||||
- Live baseline remains RAG `:18810`, Whisper NPU `:18816`, embeddings `:18817`.
|
||||
- GenAI worker `:18820` is proposed/prototype/not-live unless explicitly approved and enabled.
|
||||
- No arrow from Hermes/Atlas gateway or LiteLLM primary routing to `:18820` unless a later approved integration actually exists.
|
||||
- Runbooks should include the CLI/HTTP smoke commands, `ss` listener checks, and NPU busy-time counter checks.
|
||||
- Service maps should label this as "bounded background generation" rather than "chat" or "assistant model".
|
||||
|
||||
## Explicit no-go / defer criteria
|
||||
|
||||
No-go for implementation or promotion:
|
||||
|
||||
- Model path missing, OpenVINO GenAI import fails, or NPU device is unavailable.
|
||||
- `/sys/class/accel/accel0/device/npu_busy_time_us` is unreadable or does not increase during generation.
|
||||
- Warm bounded jobs exceed the prototype latency target or starve live Whisper/embedding services.
|
||||
- The worker needs private documents/images/chat logs for smoke testing.
|
||||
- The worker requires Atlas/Hermes/gateway/LiteLLM/RAG routing changes to demonstrate value.
|
||||
- The API starts accepting arbitrary chat history, tool-call instructions, unbounded prompts, or large outputs.
|
||||
- The service logs raw prompt bodies by default.
|
||||
- Persistent service enablement is requested without Will's explicit approval and a reviewer smoke handoff.
|
||||
|
||||
Defer, do not solve in this lane:
|
||||
|
||||
- Primary assistant routing, LiteLLM model registration, gateway fallback, or tool-calling integration.
|
||||
- RAG query rewriting, RAG answer generation, or collection mutation.
|
||||
- Private document/image triage.
|
||||
- Multi-model selection, CPU/GPU fallback policy, batching, streaming, or auth exposure beyond localhost.
|
||||
@@ -15,8 +15,10 @@ The worker does not write memory, does not restart Atlas/Hermes, does not change
|
||||
|
||||
## Files
|
||||
|
||||
- `CONTRACT.md` — bounded-worker service contract, endpoint/CLI API, smoke plan, NPU verification, docs implications, and no-go criteria.
|
||||
- `worker.py` — stdlib HTTP API plus CLI wrapper.
|
||||
- `smoke_llm_npu.py` — direct GenAI smoke test with NPU busy-time verification.
|
||||
- `tests/test_worker.py` — unit tests with a fake GenAI pipeline and synthetic busy-time counter.
|
||||
- `systemd/openvino-genai-npu-worker.service` — optional user-service template; not installed by this prototype.
|
||||
|
||||
## Model/cache
|
||||
@@ -72,15 +74,20 @@ Observed cold-ish smoke after download/cache setup:
|
||||
--input 'Kanban task asks for a small OpenVINO GenAI NPU worker prototype.'
|
||||
```
|
||||
|
||||
Exit code is non-zero if validation fails, generation fails, or the worker-reported `npu_busy_delta_us` is not positive.
|
||||
|
||||
## HTTP usage
|
||||
|
||||
Start locally only:
|
||||
|
||||
```bash
|
||||
cd /home/will/lab/swarm/openvino-genai-npu-worker
|
||||
ss -ltnp | grep ':18820' && { echo 'port 18820 already in use'; exit 1; } || true
|
||||
/home/will/.venvs/npu/bin/python worker.py --host 127.0.0.1 --port 18820
|
||||
```
|
||||
|
||||
The server also refuses startup if a listener is already accepting connections on `127.0.0.1:18820`.
|
||||
|
||||
Endpoints:
|
||||
|
||||
```text
|
||||
@@ -102,6 +109,30 @@ curl -s http://127.0.0.1:18820/v1/worker/generate \
|
||||
|
||||
Response includes `npu_busy_delta_us`; treat zero as failure even if HTTP status is 200.
|
||||
|
||||
## Unit tests
|
||||
|
||||
These tests use only synthetic strings and a fake GenAI pipeline, so they do not load the model or touch private data:
|
||||
|
||||
```bash
|
||||
cd /home/will/lab/swarm/openvino-genai-npu-worker
|
||||
python -m pytest -q
|
||||
```
|
||||
|
||||
## Environment variables
|
||||
|
||||
```text
|
||||
OV_GENAI_NPU_MODEL=/home/will/models/openvino-genai/Qwen2.5-1.5B-Instruct-int4-ov
|
||||
OV_GENAI_NPU_CACHE=/home/will/.cache/openvino/genai-npu/qwen2.5-1.5b-int4
|
||||
OV_GENAI_NPU_HOST=127.0.0.1
|
||||
OV_GENAI_NPU_PORT=18820
|
||||
```
|
||||
|
||||
Only `127.0.0.1` is accepted by the current prototype; wider binds require an explicit code change and approval.
|
||||
|
||||
## Optional systemd user service
|
||||
|
||||
A draft unit exists at `systemd/openvino-genai-npu-worker.service` for later review. Do not copy, enable, or autostart it unless Will explicitly approves persistent service enablement. Foreground smoke on `127.0.0.1:18820` plus positive sysfs NPU busy-time delta is required before any installation discussion.
|
||||
|
||||
## Safety boundaries
|
||||
|
||||
- Binds only to `127.0.0.1` by default; non-local bind is refused in code.
|
||||
|
||||
@@ -0,0 +1,2 @@
|
||||
[pytest]
|
||||
testpaths = tests
|
||||
@@ -10,31 +10,42 @@ import argparse
|
||||
import json
|
||||
import time
|
||||
from pathlib import Path
|
||||
|
||||
import openvino_genai as ov_genai
|
||||
from typing import Any
|
||||
|
||||
DEFAULT_MODEL = "/home/will/models/openvino-genai/Qwen2.5-1.5B-Instruct-int4-ov"
|
||||
DEFAULT_CACHE = "/home/will/.cache/openvino/genai-npu/qwen2.5-1.5b-int4"
|
||||
BUSY_PATH = Path("/sys/class/accel/accel0/device/npu_busy_time_us")
|
||||
|
||||
|
||||
def read_busy() -> int:
|
||||
return int(BUSY_PATH.read_text().strip())
|
||||
def import_openvino_genai() -> Any:
|
||||
import openvino_genai as ov_genai # type: ignore[import-not-found]
|
||||
|
||||
return ov_genai
|
||||
|
||||
|
||||
def read_busy(path: Path = BUSY_PATH) -> int:
|
||||
return int(path.read_text().strip())
|
||||
|
||||
|
||||
def main() -> int:
|
||||
parser = argparse.ArgumentParser()
|
||||
parser.add_argument("--model", default=DEFAULT_MODEL)
|
||||
parser.add_argument("--cache-dir", default=DEFAULT_CACHE)
|
||||
parser.add_argument("--prompt", default="Write a concise title for: User asked Atlas to summarize NPU worker options.")
|
||||
parser.add_argument("--busy-path", default=str(BUSY_PATH))
|
||||
parser.add_argument("--prompt", default="Write a concise title for: Synthetic NPU worker contract smoke with no routing changes.")
|
||||
parser.add_argument("--max-new-tokens", type=int, default=24)
|
||||
args = parser.parse_args()
|
||||
|
||||
model_path = Path(args.model)
|
||||
cache_dir = Path(args.cache_dir)
|
||||
busy_path = Path(args.busy_path)
|
||||
cache_dir.mkdir(parents=True, exist_ok=True)
|
||||
if not model_path.exists():
|
||||
raise SystemExit(f"model path does not exist: {model_path}")
|
||||
if not busy_path.exists():
|
||||
raise SystemExit(f"NPU busy-time counter does not exist: {busy_path}")
|
||||
if args.max_new_tokens < 1 or args.max_new_tokens > 256:
|
||||
raise SystemExit("max-new-tokens must be between 1 and 256")
|
||||
|
||||
config = {
|
||||
"CACHE_DIR": str(cache_dir),
|
||||
@@ -44,15 +55,16 @@ def main() -> int:
|
||||
"GENERATE_HINT": "FAST_COMPILE",
|
||||
}
|
||||
|
||||
before = read_busy()
|
||||
ov_genai = import_openvino_genai()
|
||||
before = read_busy(busy_path)
|
||||
load_start = time.monotonic()
|
||||
pipe = ov_genai.LLMPipeline(str(model_path), "NPU", config)
|
||||
pipe = ov_genai.LLMPipeline(str(model_path), "NPU", **config)
|
||||
load_ms = round((time.monotonic() - load_start) * 1000, 2)
|
||||
|
||||
gen_start = time.monotonic()
|
||||
output = pipe.generate(args.prompt, max_new_tokens=args.max_new_tokens)
|
||||
gen_ms = round((time.monotonic() - gen_start) * 1000, 2)
|
||||
after = read_busy()
|
||||
after = read_busy(busy_path)
|
||||
result = {
|
||||
"model": str(model_path),
|
||||
"device": "NPU",
|
||||
|
||||
@@ -7,6 +7,7 @@ Type=simple
|
||||
WorkingDirectory=/home/will/lab/swarm/openvino-genai-npu-worker
|
||||
Environment=OV_GENAI_NPU_MODEL=/home/will/models/openvino-genai/Qwen2.5-1.5B-Instruct-int4-ov
|
||||
Environment=OV_GENAI_NPU_CACHE=/home/will/.cache/openvino/genai-npu/qwen2.5-1.5b-int4
|
||||
Environment=OV_GENAI_NPU_HOST=127.0.0.1
|
||||
Environment=OV_GENAI_NPU_PORT=18820
|
||||
ExecStart=/home/will/.venvs/npu/bin/python /home/will/lab/swarm/openvino-genai-npu-worker/worker.py --host 127.0.0.1 --port 18820
|
||||
Restart=on-failure
|
||||
|
||||
@@ -0,0 +1,131 @@
|
||||
from __future__ import annotations
|
||||
|
||||
import json
|
||||
from pathlib import Path
|
||||
|
||||
import pytest
|
||||
|
||||
import worker
|
||||
|
||||
|
||||
class FakePipeline:
|
||||
def __init__(self, model_path: str, device: str, config: dict[str, object], busy_path: Path, output: str = "Synthetic title"):
|
||||
self.model_path = model_path
|
||||
self.device = device
|
||||
self.config = config
|
||||
self.busy_path = busy_path
|
||||
self.output = output
|
||||
self.calls: list[tuple[str, int]] = []
|
||||
|
||||
def generate(self, prompt: str, *, max_new_tokens: int):
|
||||
self.calls.append((prompt, max_new_tokens))
|
||||
before = int(self.busy_path.read_text().strip())
|
||||
self.busy_path.write_text(str(before + 1234))
|
||||
return self.output
|
||||
|
||||
|
||||
class FakeGenAI:
|
||||
def __init__(self, busy_path: Path, output: str = "Synthetic title"):
|
||||
self.busy_path = busy_path
|
||||
self.output = output
|
||||
self.pipeline: FakePipeline | None = None
|
||||
|
||||
def LLMPipeline(self, model_path: str, device: str, *args: object, **kwargs: object): # noqa: N802 - mirrors OpenVINO API
|
||||
if args and isinstance(args[0], dict):
|
||||
config: dict[str, object] = {str(k): v for k, v in args[0].items()}
|
||||
else:
|
||||
config = dict(kwargs)
|
||||
self.pipeline = FakePipeline(model_path, device, config, self.busy_path, self.output)
|
||||
return self.pipeline
|
||||
|
||||
|
||||
@pytest.fixture()
|
||||
def worker_paths(tmp_path: Path):
|
||||
model_path = tmp_path / "model"
|
||||
cache_dir = tmp_path / "cache"
|
||||
busy_path = tmp_path / "npu_busy_time_us"
|
||||
model_path.mkdir()
|
||||
busy_path.write_text("100")
|
||||
return model_path, cache_dir, busy_path
|
||||
|
||||
|
||||
def test_generate_uses_npu_config_and_reports_busy_delta(monkeypatch: pytest.MonkeyPatch, worker_paths):
|
||||
model_path, cache_dir, busy_path = worker_paths
|
||||
fake_genai = FakeGenAI(busy_path)
|
||||
monkeypatch.setattr(worker, "import_openvino_genai", lambda: fake_genai)
|
||||
|
||||
npu_worker = worker.NpuWorker(str(model_path), str(cache_dir), busy_path=busy_path, bind_port=18820)
|
||||
result = npu_worker.generate("title", "Synthetic non-private kanban notification.", max_new_tokens=24)
|
||||
|
||||
assert result.npu_busy_before_us == 100
|
||||
assert result.npu_busy_after_us == 1334
|
||||
assert result.npu_busy_delta_us == 1234
|
||||
assert result.text == "Synthetic title"
|
||||
assert fake_genai.pipeline is not None
|
||||
assert fake_genai.pipeline.device == "NPU"
|
||||
assert fake_genai.pipeline.config["CACHE_DIR"] == str(cache_dir)
|
||||
assert fake_genai.pipeline.config["MAX_PROMPT_LEN"] == 1024
|
||||
assert fake_genai.pipeline.calls[0][1] == 24
|
||||
|
||||
|
||||
def test_memory_alias_json_wrapping(monkeypatch: pytest.MonkeyPatch, worker_paths):
|
||||
model_path, cache_dir, busy_path = worker_paths
|
||||
fake_genai = FakeGenAI(busy_path, output='[{"fact":"synthetic stable preference","confidence":0.8}]')
|
||||
monkeypatch.setattr(worker, "import_openvino_genai", lambda: fake_genai)
|
||||
|
||||
npu_worker = worker.NpuWorker(str(model_path), str(cache_dir), busy_path=busy_path)
|
||||
result = npu_worker.generate("memory_candidate", "Synthetic user says they prefer concise answers.")
|
||||
|
||||
assert result.parsed_json is not None
|
||||
assert result.parsed_json["candidates"][0]["fact"] == "synthetic stable preference"
|
||||
assert "wrapped" in result.parsed_json["notes"]
|
||||
|
||||
|
||||
@pytest.mark.parametrize(
|
||||
("job", "user_input", "max_new_tokens", "message"),
|
||||
[
|
||||
("bad", "hello", 1, "unsupported job"),
|
||||
("title", "", 1, "non-empty"),
|
||||
("title", "x" * (worker.MAX_INPUT_CHARS + 1), 1, "input too long"),
|
||||
("title", "hello", worker.MAX_NEW_TOKENS + 1, "max_new_tokens"),
|
||||
],
|
||||
)
|
||||
def test_validation_errors(monkeypatch: pytest.MonkeyPatch, worker_paths, job: str, user_input: str, max_new_tokens: int, message: str):
|
||||
model_path, cache_dir, busy_path = worker_paths
|
||||
monkeypatch.setattr(worker, "import_openvino_genai", lambda: FakeGenAI(busy_path))
|
||||
npu_worker = worker.NpuWorker(str(model_path), str(cache_dir), busy_path=busy_path)
|
||||
|
||||
with pytest.raises(ValueError, match=message):
|
||||
npu_worker.generate(job, user_input, max_new_tokens=max_new_tokens)
|
||||
|
||||
|
||||
def test_health_reports_actual_bind_and_limits(worker_paths):
|
||||
model_path, cache_dir, busy_path = worker_paths
|
||||
npu_worker = worker.NpuWorker(str(model_path), str(cache_dir), busy_path=busy_path, bind_host="127.0.0.1", bind_port=18821)
|
||||
|
||||
health = npu_worker.health()
|
||||
|
||||
assert health["bind"] == "127.0.0.1:18821"
|
||||
assert health["max_input_chars"] == 6000
|
||||
assert health["max_new_tokens"] == 256
|
||||
assert health["busy_time_us"] == 100
|
||||
|
||||
|
||||
def test_response_payload_shape(worker_paths):
|
||||
model_path, cache_dir, busy_path = worker_paths
|
||||
npu_worker = worker.NpuWorker(str(model_path), str(cache_dir), busy_path=busy_path)
|
||||
result = worker.GenerationResult(
|
||||
text="ok",
|
||||
parsed_json={"severity": "info"},
|
||||
timing_ms={"load": 1.0, "initial_load": 1.0, "generate": 2.0, "total": 3.0},
|
||||
npu_busy_delta_us=5,
|
||||
npu_busy_before_us=10,
|
||||
npu_busy_after_us=15,
|
||||
)
|
||||
|
||||
payload = worker.response_payload(npu_worker, "notification", result)
|
||||
|
||||
assert json.dumps(payload)
|
||||
assert payload["device"] == "NPU"
|
||||
assert payload["job"] == "notification"
|
||||
assert payload["json"] == {"severity": "info"}
|
||||
@@ -10,6 +10,7 @@ import argparse
|
||||
import json
|
||||
import os
|
||||
import re
|
||||
import socket
|
||||
import threading
|
||||
import time
|
||||
from dataclasses import dataclass
|
||||
@@ -18,8 +19,6 @@ from pathlib import Path
|
||||
from typing import Any, cast
|
||||
from urllib.parse import urlparse
|
||||
|
||||
import openvino_genai as ov_genai # type: ignore[import-not-found]
|
||||
|
||||
MODEL_ID = "OpenVINO/Qwen2.5-1.5B-Instruct-int4-ov"
|
||||
DEFAULT_MODEL_PATH = "/home/will/models/openvino-genai/Qwen2.5-1.5B-Instruct-int4-ov"
|
||||
DEFAULT_CACHE_DIR = "/home/will/.cache/openvino/genai-npu/qwen2.5-1.5b-int4"
|
||||
@@ -27,6 +26,14 @@ BUSY_PATH = Path("/sys/class/accel/accel0/device/npu_busy_time_us")
|
||||
HOST = "127.0.0.1"
|
||||
PORT = 18820
|
||||
MAX_INPUT_CHARS = 6000
|
||||
MAX_NEW_TOKENS = 256
|
||||
GENAI_CONFIG = {
|
||||
"CACHE_DIR": DEFAULT_CACHE_DIR,
|
||||
"MAX_PROMPT_LEN": 1024,
|
||||
"MIN_RESPONSE_LEN": 64,
|
||||
"PREFILL_HINT": "DYNAMIC",
|
||||
"GENERATE_HINT": "FAST_COMPILE",
|
||||
}
|
||||
DEFAULTS = {
|
||||
"title": 32,
|
||||
"summary": 160,
|
||||
@@ -48,8 +55,20 @@ PROMPTS = {
|
||||
}
|
||||
|
||||
|
||||
def read_busy() -> int:
|
||||
return int(BUSY_PATH.read_text().strip())
|
||||
def import_openvino_genai() -> Any:
|
||||
"""Import OpenVINO GenAI lazily so unit tests do not require the NPU venv."""
|
||||
|
||||
import openvino_genai as ov_genai # type: ignore[import-not-found]
|
||||
|
||||
return ov_genai
|
||||
|
||||
|
||||
def listener_exists(host: str, port: int) -> bool:
|
||||
"""Return True when a TCP listener already accepts connections."""
|
||||
|
||||
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
|
||||
sock.settimeout(0.2)
|
||||
return sock.connect_ex((host, port)) == 0
|
||||
|
||||
|
||||
def coerce_json(text: str) -> Any | None:
|
||||
@@ -79,9 +98,20 @@ class GenerationResult:
|
||||
|
||||
|
||||
class NpuWorker:
|
||||
def __init__(self, model_path: str, cache_dir: str):
|
||||
def __init__(
|
||||
self,
|
||||
model_path: str,
|
||||
cache_dir: str,
|
||||
*,
|
||||
busy_path: Path = BUSY_PATH,
|
||||
bind_host: str = HOST,
|
||||
bind_port: int = PORT,
|
||||
):
|
||||
self.model_path = Path(model_path)
|
||||
self.cache_dir = Path(cache_dir)
|
||||
self.busy_path = Path(busy_path)
|
||||
self.bind_host = bind_host
|
||||
self.bind_port = bind_port
|
||||
self.cache_dir.mkdir(parents=True, exist_ok=True)
|
||||
self._pipe = None
|
||||
self._load_ms: float | None = None
|
||||
@@ -89,21 +119,20 @@ class NpuWorker:
|
||||
self._loaded_at: float | None = None
|
||||
if not self.model_path.exists():
|
||||
raise FileNotFoundError(f"model path does not exist: {self.model_path}")
|
||||
if not self.busy_path.exists():
|
||||
raise FileNotFoundError(f"NPU busy-time counter does not exist: {self.busy_path}")
|
||||
|
||||
def read_busy(self) -> int:
|
||||
return int(self.busy_path.read_text().strip())
|
||||
|
||||
def load(self) -> None:
|
||||
if self._pipe is not None:
|
||||
return
|
||||
start = time.monotonic()
|
||||
# NPU GenAI requires bounded prompt/response shapes; CACHE_DIR enables compiled blob caching.
|
||||
self._pipe = ov_genai.LLMPipeline(
|
||||
str(self.model_path),
|
||||
"NPU",
|
||||
CACHE_DIR=str(self.cache_dir),
|
||||
MAX_PROMPT_LEN=1024,
|
||||
MIN_RESPONSE_LEN=64,
|
||||
PREFILL_HINT="DYNAMIC",
|
||||
GENERATE_HINT="FAST_COMPILE",
|
||||
)
|
||||
ov_genai = import_openvino_genai()
|
||||
config = GENAI_CONFIG | {"CACHE_DIR": str(self.cache_dir)}
|
||||
self._pipe = ov_genai.LLMPipeline(str(self.model_path), "NPU", **config)
|
||||
self._load_ms = round((time.monotonic() - start) * 1000, 2)
|
||||
self._loaded_at = time.time()
|
||||
|
||||
@@ -115,19 +144,19 @@ class NpuWorker:
|
||||
if len(user_input) > MAX_INPUT_CHARS:
|
||||
raise ValueError(f"input too long: {len(user_input)} chars > {MAX_INPUT_CHARS}")
|
||||
max_new_tokens = int(max_new_tokens or DEFAULTS[job])
|
||||
if max_new_tokens < 1 or max_new_tokens > 256:
|
||||
raise ValueError("max_new_tokens must be between 1 and 256")
|
||||
if max_new_tokens < 1 or max_new_tokens > MAX_NEW_TOKENS:
|
||||
raise ValueError(f"max_new_tokens must be between 1 and {MAX_NEW_TOKENS}")
|
||||
prompt = PROMPTS[job].format(input=user_input.strip())
|
||||
with self._lock:
|
||||
load_start = time.monotonic()
|
||||
self.load()
|
||||
load_ms = round((time.monotonic() - load_start) * 1000, 2)
|
||||
before = read_busy()
|
||||
before = self.read_busy()
|
||||
gen_start = time.monotonic()
|
||||
pipe = cast(Any, self._pipe)
|
||||
text = str(pipe.generate(prompt, max_new_tokens=max_new_tokens)).strip()
|
||||
generate_ms = round((time.monotonic() - gen_start) * 1000, 2)
|
||||
after = read_busy()
|
||||
after = self.read_busy()
|
||||
parsed = coerce_json(text) if job in {"memory_candidate", "notification"} else None
|
||||
if job == "memory_candidate" and isinstance(parsed, list):
|
||||
parsed = {"candidates": parsed, "notes": "model returned a top-level array; worker wrapped it to preserve the API contract"}
|
||||
@@ -151,10 +180,11 @@ class NpuWorker:
|
||||
"loaded": self._pipe is not None,
|
||||
"initial_load_ms": self._load_ms,
|
||||
"loaded_at": self._loaded_at,
|
||||
"busy_time_us": read_busy(),
|
||||
"busy_time_us": self.read_busy(),
|
||||
"max_input_chars": MAX_INPUT_CHARS,
|
||||
"max_new_tokens": MAX_NEW_TOKENS,
|
||||
"jobs": sorted(PROMPTS),
|
||||
"bind": f"{HOST}:{PORT}",
|
||||
"bind": f"{self.bind_host}:{self.bind_port}",
|
||||
}
|
||||
|
||||
|
||||
@@ -175,7 +205,7 @@ def response_payload(worker: NpuWorker, job: str, result: GenerationResult) -> d
|
||||
|
||||
def make_handler(worker: NpuWorker):
|
||||
class Handler(BaseHTTPRequestHandler):
|
||||
server_version = "openvino-genai-npu-worker/0.1"
|
||||
server_version = "openvino-genai-npu-worker/0.2"
|
||||
|
||||
def log_message(self, format: str, *args: Any) -> None:
|
||||
# Log only method/path/status metadata, not raw request bodies.
|
||||
@@ -215,7 +245,12 @@ def make_handler(worker: NpuWorker):
|
||||
if job == "memory":
|
||||
job = "memory_candidate"
|
||||
result = worker.generate(job, str(payload.get("input", "")), payload.get("max_new_tokens"))
|
||||
self.send_json(200, response_payload(worker, job, result))
|
||||
body = response_payload(worker, job, result)
|
||||
if result.npu_busy_delta_us <= 0:
|
||||
body["error"] = "NPU busy-time counter did not increase during generation"
|
||||
self.send_json(503, body)
|
||||
return
|
||||
self.send_json(200, body)
|
||||
except Exception as exc:
|
||||
self.send_json(400, {"error": str(exc)})
|
||||
|
||||
@@ -226,21 +261,24 @@ def cli(argv: list[str] | None = None) -> int:
|
||||
parser = argparse.ArgumentParser(description="OpenVINO GenAI NPU worker")
|
||||
parser.add_argument("--model-path", default=os.environ.get("OV_GENAI_NPU_MODEL", DEFAULT_MODEL_PATH))
|
||||
parser.add_argument("--cache-dir", default=os.environ.get("OV_GENAI_NPU_CACHE", DEFAULT_CACHE_DIR))
|
||||
parser.add_argument("--host", default=HOST)
|
||||
parser.add_argument("--host", default=os.environ.get("OV_GENAI_NPU_HOST", HOST))
|
||||
parser.add_argument("--port", type=int, default=int(os.environ.get("OV_GENAI_NPU_PORT", PORT)))
|
||||
parser.add_argument("--job", choices=sorted(PROMPTS), help="Run one CLI job instead of serving HTTP")
|
||||
parser.add_argument("--input", help="Input text for --job")
|
||||
parser.add_argument("--max-new-tokens", type=int)
|
||||
args = parser.parse_args(argv)
|
||||
|
||||
worker = NpuWorker(args.model_path, args.cache_dir)
|
||||
if args.host != "127.0.0.1":
|
||||
raise SystemExit("Refusing non-local bind without code change/explicit approval")
|
||||
|
||||
worker = NpuWorker(args.model_path, args.cache_dir, bind_host=args.host, bind_port=args.port)
|
||||
if args.job:
|
||||
result = worker.generate(args.job, args.input or "", args.max_new_tokens)
|
||||
print(json.dumps(response_payload(worker, args.job, result), indent=2))
|
||||
return 0 if result.npu_busy_delta_us > 0 else 2
|
||||
|
||||
if args.host != "127.0.0.1":
|
||||
raise SystemExit("Refusing non-local bind without code change/explicit approval")
|
||||
if listener_exists(args.host, args.port):
|
||||
raise SystemExit(f"Refusing to start: listener already exists on {args.host}:{args.port}")
|
||||
server = ThreadingHTTPServer((args.host, args.port), make_handler(worker))
|
||||
print(f"serving {MODEL_ID} on http://{args.host}:{args.port}; raw prompts are not logged")
|
||||
server.serve_forever()
|
||||
|
||||
@@ -12,8 +12,10 @@ This service is intentionally not wired into live RAG by default.
|
||||
|
||||
## Files
|
||||
|
||||
- `server.py` — stdlib HTTP OpenVINO Runtime service.
|
||||
- `SPEC.md` — endpoint/CLI contract, model/runtime recommendation, smoke/NPU proof plan, RAG integration plan, docs implications, and no-go criteria.
|
||||
- `server.py` — stdlib HTTP OpenVINO Runtime service with fail-fast localhost listener conflict checks and request validation.
|
||||
- `smoke.py` — non-private API/ranking/NPU busy-time smoke test.
|
||||
- `tests/test_server_validation.py` — stdlib unit checks for request validation and listener conflict detection.
|
||||
- `openvino-reranker.service` — optional user-systemd unit.
|
||||
|
||||
## One-time setup
|
||||
@@ -61,7 +63,7 @@ OPENVINO_RERANKER_MODEL_DIR=/home/will/.cache/openvino-models/rerankers/ms-marco
|
||||
python /home/will/lab/swarm/openvino-reranker-npu/server.py
|
||||
```
|
||||
|
||||
Startup performs a non-private smoke inference and fails closed when `OPENVINO_RERANKER_DEVICE=NPU` but `npu_busy_time_us` does not increase.
|
||||
Startup performs a non-private smoke inference and fails closed when `OPENVINO_RERANKER_DEVICE=NPU` but `npu_busy_time_us` does not increase. It also checks whether the requested listener can bind before compiling the OpenVINO model, so obvious port conflicts fail fast; the real server bind still happens immediately after model load.
|
||||
|
||||
## API
|
||||
|
||||
@@ -109,6 +111,16 @@ Expected:
|
||||
- The top result matches the non-private fixture expectation.
|
||||
- Response and sysfs `npu_busy_delta_us` are positive.
|
||||
|
||||
## Validation checks
|
||||
|
||||
```bash
|
||||
source /home/will/.venvs/openvino-reranker/bin/activate
|
||||
PYTHONPATH=/home/will/lab/swarm/openvino-reranker-npu \
|
||||
python -m unittest discover -s /home/will/lab/swarm/openvino-reranker-npu/tests
|
||||
```
|
||||
|
||||
These checks do not compile the OpenVINO model; they cover request validation and fail-fast listener conflict detection.
|
||||
|
||||
## Optional systemd user service
|
||||
|
||||
Install the unit only after the foreground command and smoke test pass:
|
||||
|
||||
@@ -0,0 +1,243 @@
|
||||
# OpenVINO NPU reranker service spec
|
||||
|
||||
Status: proposed localhost prototype; not live RAG integration.
|
||||
Target port: `127.0.0.1:18818`.
|
||||
Safety posture: foreground smoke first, no persistent enablement, no Atlas/Hermes/RAG routing changes without Will's explicit approval.
|
||||
|
||||
## Recommendation
|
||||
|
||||
Use `cross-encoder/ms-marco-MiniLM-L6-v2`, exported to OpenVINO IR as INT8, served by the local stdlib HTTP service in `server.py` on OpenVINO Runtime `NPU`.
|
||||
|
||||
Why this choice:
|
||||
|
||||
- It is a small BERT-family cross-encoder reranker intended for MS MARCO-style passage ranking, matching the second-stage RAG use case better than another embedding-only similarity pass.
|
||||
- The model shape is simple pairwise text classification/scoring: `(query, document) -> score`, which maps cleanly to OpenVINO Runtime and avoids introducing a heavier LLM worker for reranking.
|
||||
- INT8 OpenVINO IR keeps memory and compile/runtime cost low enough for a localhost sidecar and is already represented in the repo defaults:
|
||||
`/home/will/.cache/openvino-models/rerankers/ms-marco-MiniLM-L6-v2-int8-ov`.
|
||||
- The service can fail closed on startup when `OPENVINO_RERANKER_DEVICE=NPU` but `/sys/class/accel/accel0/device/npu_busy_time_us` does not increase, preventing false "NPU-backed" claims.
|
||||
|
||||
Runtime default:
|
||||
|
||||
```text
|
||||
OPENVINO_RERANKER_HOST=127.0.0.1
|
||||
OPENVINO_RERANKER_PORT=18818
|
||||
OPENVINO_RERANKER_DEVICE=NPU
|
||||
OPENVINO_RERANKER_MODEL=cross-encoder/ms-marco-MiniLM-L6-v2
|
||||
OPENVINO_RERANKER_MODEL_DIR=/home/will/.cache/openvino-models/rerankers/ms-marco-MiniLM-L6-v2-int8-ov
|
||||
OPENVINO_RERANKER_MAX_LENGTH=512
|
||||
OPENVINO_RERANKER_MAX_DOCUMENTS=100
|
||||
OPENVINO_RERANKER_MAX_BODY_BYTES=5242880
|
||||
```
|
||||
|
||||
## Endpoint contract
|
||||
|
||||
### Health and readiness
|
||||
|
||||
`GET /healthz` and `GET /readyz` return JSON.
|
||||
|
||||
`/readyz` must return HTTP 200 only when the model is loaded and startup smoke passed. For NPU mode, startup smoke must include a positive `npu_busy_delta_us`.
|
||||
|
||||
Representative ready response:
|
||||
|
||||
```json
|
||||
{
|
||||
"status": "ok",
|
||||
"ok": true,
|
||||
"service": "openvino-reranker",
|
||||
"model": "cross-encoder/ms-marco-MiniLM-L6-v2",
|
||||
"model_dir": "/home/will/.cache/openvino-models/rerankers/ms-marco-MiniLM-L6-v2-int8-ov",
|
||||
"device": "NPU",
|
||||
"available_devices": ["CPU", "NPU"],
|
||||
"max_length": 512,
|
||||
"startup_smoke": {"ok": true, "duration_ms": 12.3, "npu_busy_delta_us": 1234},
|
||||
"last_inference": null,
|
||||
"ready_error": null
|
||||
}
|
||||
```
|
||||
|
||||
### Rerank
|
||||
|
||||
`POST /rerank` and compatibility alias `POST /v1/rerank` accept:
|
||||
|
||||
```json
|
||||
{
|
||||
"query": "how do I verify OpenVINO NPU usage?",
|
||||
"documents": [
|
||||
{"id": "good", "text": "Check /sys/class/accel/accel0/device/npu_busy_time_us before and after inference.", "metadata": {"source": "synthetic"}},
|
||||
{"id": "bad", "text": "This note is about making sourdough starter."}
|
||||
],
|
||||
"top_k": 2,
|
||||
"return_documents": false
|
||||
}
|
||||
```
|
||||
|
||||
Compatibility notes:
|
||||
|
||||
- `documents` may be strings or objects with `id`, `text`, and optional object `metadata`.
|
||||
- `top_k` is preferred; `top_n` is accepted for common reranker-client compatibility.
|
||||
- `return_documents=false` is recommended for RAG integration to avoid echoing private source text into logs or intermediate traces.
|
||||
- The optional `model` field may be sent by clients but is not used for routing; this sidecar serves one configured model.
|
||||
|
||||
Successful response:
|
||||
|
||||
```json
|
||||
{
|
||||
"ok": true,
|
||||
"model": "cross-encoder/ms-marco-MiniLM-L6-v2",
|
||||
"device": "NPU",
|
||||
"query": "how do I verify OpenVINO NPU usage?",
|
||||
"input_count": 2,
|
||||
"top_k": 2,
|
||||
"duration_ms": 10.5,
|
||||
"npu_busy_delta_us": 1234,
|
||||
"results": [
|
||||
{"index": 0, "id": "good", "score": 8.1, "raw_score": 8.1, "probability": 0.9997},
|
||||
{"index": 1, "id": "bad", "score": -4.2, "raw_score": -4.2, "probability": 0.0148}
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||
Error response shape:
|
||||
|
||||
```json
|
||||
{"ok": false, "error": "human-readable error", "results": []}
|
||||
```
|
||||
|
||||
Status behavior:
|
||||
|
||||
- 400: invalid JSON schema, empty query, missing/empty documents, invalid document text, or non-positive/non-integer `top_k`/`top_n`.
|
||||
- 413: request body above `OPENVINO_RERANKER_MAX_BODY_BYTES`.
|
||||
- 503: model not ready.
|
||||
- 500: unexpected inference/runtime failure.
|
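
The 400-class rules above can be sketched as a single validation function. This is a hypothetical illustration, not the code in `server.py`. It assumes that string documents get their list index as `id`, that empty text counts as invalid, and that a missing `top_k`/`top_n` defaults to the number of documents.

```python
from __future__ import annotations

from typing import Any


def validate_rerank_request(payload: Any) -> tuple[str, list[dict[str, Any]], int]:
    """Return (query, documents, top_k) or raise ValueError for a 400 response."""
    if not isinstance(payload, dict):
        raise ValueError("request body must be a JSON object")
    query = payload.get("query")
    if not isinstance(query, str) or not query.strip():
        raise ValueError("query must be a non-empty string")
    raw_docs = payload.get("documents")
    if not isinstance(raw_docs, list) or not raw_docs:
        raise ValueError("documents must be a non-empty list")
    docs: list[dict[str, Any]] = []
    for index, doc in enumerate(raw_docs):
        if isinstance(doc, str):
            # Assumption: plain strings are wrapped with their index as id.
            doc = {"id": str(index), "text": doc}
        if not isinstance(doc, dict):
            raise ValueError(f"documents[{index}] must be a string or object")
        text = doc.get("text")
        if not isinstance(text, str) or not text.strip():
            raise ValueError(f"documents[{index}].text must be a non-empty string")
        docs.append({"index": index, "id": doc.get("id", str(index)), "text": text})
    # top_k is preferred; top_n is accepted for client compatibility.
    top_k = payload.get("top_k", payload.get("top_n", len(docs)))
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(top_k, bool) or not isinstance(top_k, int) or top_k < 1:
        raise ValueError("top_k/top_n must be a positive integer")
    return query, docs, min(top_k, len(docs))
```

A handler could map `ValueError` from this function to 400 and reserve 500 for failures raised during inference, which keeps the two status classes from blurring together.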
||||
|
||||
## CLI contract
|
||||
|
||||
Foreground-only review start:
|
||||
|
||||
```bash
|
||||
ss -ltnp | grep ':18818\b' || true
|
||||
cat /sys/class/accel/accel0/device/npu_busy_time_us
|
||||
source /home/will/.venvs/openvino-reranker/bin/activate
|
||||
OPENVINO_RERANKER_HOST=127.0.0.1 \
|
||||
OPENVINO_RERANKER_PORT=18818 \
|
||||
OPENVINO_RERANKER_DEVICE=NPU \
|
||||
OPENVINO_RERANKER_MODEL_DIR=/home/will/.cache/openvino-models/rerankers/ms-marco-MiniLM-L6-v2-int8-ov \
|
||||
python /home/will/lab/swarm/openvino-reranker-npu/server.py
|
||||
```
|
||||
|
||||
Client smoke:
|
||||
|
||||
```bash
|
||||
source /home/will/.venvs/openvino-reranker/bin/activate
|
||||
python /home/will/lab/swarm/openvino-reranker-npu/smoke.py --url http://127.0.0.1:18818
|
||||
```
|
||||
|
||||
Optional user-systemd unit exists as `openvino-reranker.service`, but this spec does not approve copying, starting, enabling, or wiring it into live paths.

## Non-private smoke payload

Use only synthetic public-text fixtures. Do not query the Obsidian vault, private document directories, image folders, or live Chroma documents during smoke.

Minimum cases:

1. Query: `how do I verify OpenVINO NPU usage?`
   - Expected top document: `Check /sys/class/accel/accel0/device/npu_busy_time_us before and after inference.`
   - Distractor: `This note is about making sourdough starter.`
2. Query: `what port does the reranker service use?`
   - Expected top document: `The OpenVINO reranker prototype listens locally on port 18818.`
   - Distractor: `Whisper transcription accepts audio uploads.`
3. Query: `why should reranking not mutate vector collections?`
   - Expected top document: `Reranking is a read-only second-stage transformation after vector search.`
   - Distractor: `Boil pasta in salted water until al dente.`

Pass criteria:

- `/readyz` is HTTP 200 and reports `device=NPU`.
- Every case returns `ok=true` and a sorted `results` list with the expected top `id`.
- Response-level `npu_busy_delta_us` is positive for each case.
- External sysfs `after - before` is positive for each case or at least for the full smoke batch.
- Smoke script exits 0 and prints JSON with `ok: true`.
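
The per-case response criteria can be checked with a small pure function. This is a sketch only: `check_smoke_case` is a hypothetical name, it is not the logic in `smoke.py`, and it assumes "sorted" means descending by `score` as in the example response.

```python
from __future__ import annotations

from typing import Any


def check_smoke_case(response: dict[str, Any], expected_top_id: str) -> list[str]:
    """Return the failed criteria; an empty list means the case passed."""
    problems: list[str] = []
    if response.get("ok") is not True:
        problems.append("ok is not true")
    results = response.get("results") or []
    scores = [item.get("score") for item in results]
    numeric = all(isinstance(s, (int, float)) and not isinstance(s, bool) for s in scores)
    # Assumption: "sorted" means descending by score.
    if not numeric or scores != sorted(scores, reverse=True):
        problems.append("results are not sorted by descending score")
    if not results or results[0].get("id") != expected_top_id:
        problems.append("unexpected top id")
    delta = response.get("npu_busy_delta_us")
    if not isinstance(delta, (int, float)) or isinstance(delta, bool) or delta <= 0:
        problems.append("npu_busy_delta_us is not positive")
    return problems


sample = {
    "ok": True,
    "npu_busy_delta_us": 1234,
    "results": [
        {"index": 0, "id": "good", "score": 8.1},
        {"index": 1, "id": "bad", "score": -4.2},
    ],
}
print(check_smoke_case(sample, "good"))
```

The `/readyz` check, the external sysfs delta, and the script exit code are outside this function and still need to be verified separately.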

## NPU busy-time verification plan

HTTP 200 is not proof. Verification must capture both endpoint-reported and sysfs-observed deltas.

Procedure:

```bash
BUSY=/sys/class/accel/accel0/device/npu_busy_time_us
before=$(cat "$BUSY")
curl -fsS http://127.0.0.1:18818/rerank \
  -H 'Content-Type: application/json' \
  -d '{"query":"how do I verify OpenVINO NPU usage?","documents":[{"id":"good","text":"Check /sys/class/accel/accel0/device/npu_busy_time_us before and after inference."},{"id":"bad","text":"This note is about making sourdough starter."}],"top_k":2,"return_documents":false}' \
  | jq '{ok, device, npu_busy_delta_us, top_id:.results[0].id}'
after=$(cat "$BUSY")
echo "sysfs_npu_busy_delta_us=$((after-before))"
```

Acceptance:

- `device == "NPU"`.
- Response `npu_busy_delta_us > 0`.
- Shell-computed `sysfs_npu_busy_delta_us > 0`.
- If any value is zero/negative/missing, call the result CPU/unknown and do not claim NPU-backed reranking.
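
The acceptance rule reduces to a three-way conjunction, sketched here as a pure function. `classify_npu_proof` is a hypothetical name; the inputs are the `device` and `npu_busy_delta_us` fields from the response plus the two sysfs readings taken around the request.

```python
from __future__ import annotations


def is_positive(value: object) -> bool:
    """True only for real numbers greater than zero (bools are rejected)."""
    return isinstance(value, (int, float)) and not isinstance(value, bool) and value > 0


def classify_npu_proof(
    device: str | None,
    response_delta_us: int | None,
    sysfs_before_us: int | None,
    sysfs_after_us: int | None,
) -> str:
    """Return "NPU" only when all three acceptance checks pass."""
    sysfs_delta = None
    if sysfs_before_us is not None and sysfs_after_us is not None:
        sysfs_delta = sysfs_after_us - sysfs_before_us
    if device == "NPU" and is_positive(response_delta_us) and is_positive(sysfs_delta):
        return "NPU"
    # Zero, negative, or missing values are never reported as NPU-backed.
    return "CPU/unknown"


print(classify_npu_proof("NPU", 1234, 1_000_000, 1_001_300))
```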

## Optional RAG second-stage integration plan (deferred)

This is a plan only. Do not enable it in live RAG without explicit approval.

Design:

1. Keep existing vector search and Chroma collection `obsidian_bge_npu` unchanged.
2. Retrieve more candidates from current vector search, e.g. `initial_k=20`.
3. Send only request-time candidate snippets/ids to `http://127.0.0.1:18818/rerank`.
4. Use reranker order to choose final `top_k`, e.g. `5`.
5. On timeout, connection error, invalid response, or non-positive NPU proof when proof is required, fall back to vector order and attach metadata like `rerank_error`; do not fail the whole RAG request unless explicitly configured.
6. Log counters and latency, but avoid logging raw private document text.
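
Steps 4 and 5 can be exercised without a running service by injecting the HTTP call. In this sketch `rerank_with_fallback` and `call_reranker` are hypothetical names, and it assumes each result's `index` is the position of the document in the submitted list, as the example response suggests.

```python
from __future__ import annotations

from typing import Any, Callable


def rerank_with_fallback(
    query: str,
    candidates: list[dict[str, Any]],
    call_reranker: Callable[[str, list[dict[str, Any]]], dict[str, Any]],
    top_k: int = 5,
    require_npu_proof: bool = True,
) -> tuple[list[dict[str, Any]], dict[str, Any]]:
    """Return (final candidates, metadata); never raises on reranker failure."""
    try:
        response = call_reranker(query, candidates)
        if response.get("ok") is not True:
            raise ValueError(response.get("error", "reranker returned ok=false"))
        delta = response.get("npu_busy_delta_us")
        if require_npu_proof and not (isinstance(delta, (int, float)) and delta > 0):
            raise ValueError("non-positive NPU busy-time delta")
        # Assumption: "index" points back into the submitted candidate list.
        ranked = [candidates[item["index"]] for item in response["results"]]
        return ranked[:top_k], {"reranked": True}
    except Exception as exc:  # timeout, connection error, or malformed response
        # Fall back to vector order and record why, without document text.
        return candidates[:top_k], {"reranked": False, "rerank_error": str(exc)}


def fake_reranker(query: str, docs: list[dict[str, Any]]) -> dict[str, Any]:
    """Synthetic stand-in that simply reverses the vector order."""
    order = range(len(docs) - 1, -1, -1)
    return {"ok": True, "npu_busy_delta_us": 10, "results": [{"index": i} for i in order]}


docs = [{"id": "a"}, {"id": "b"}, {"id": "c"}]
print(rerank_with_fallback("q", docs, fake_reranker, top_k=2))
```

Passing the reranker call as a parameter keeps the fallback logic testable with synthetic in-memory candidates only.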

Disabled-by-default knobs:

```text
RAG_RERANK_ENABLED=false
RAG_RERANK_URL=http://127.0.0.1:18818/rerank
RAG_RERANK_INITIAL_K=20
RAG_RERANK_TOP_K=5
RAG_RERANK_TIMEOUT_MS=3000
RAG_RERANK_REQUIRE_NPU_PROOF=true
RAG_RERANK_RETURN_DOCUMENTS=false
```
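
One way the RAG endpoint might read these knobs is sketched below. `RerankSettings` and `load_rerank_settings` are hypothetical names, not existing code; the truthy set `{"1", "true", "yes"}` mirrors how `server.py` parses `OPENVINO_RERANKER_SKIP_STARTUP_SMOKE`, and the defaults are the values listed above.

```python
from __future__ import annotations

import os
from dataclasses import dataclass
from typing import Mapping

TRUE_VALUES = {"1", "true", "yes"}


@dataclass(frozen=True)
class RerankSettings:
    enabled: bool
    url: str
    initial_k: int
    top_k: int
    timeout_ms: int
    require_npu_proof: bool
    return_documents: bool


def _flag(env: Mapping[str, str], name: str, default: str) -> bool:
    return env.get(name, default).strip().lower() in TRUE_VALUES


def load_rerank_settings(env: Mapping[str, str] = os.environ) -> RerankSettings:
    """Read the RAG_RERANK_* knobs, falling back to the documented defaults."""
    return RerankSettings(
        enabled=_flag(env, "RAG_RERANK_ENABLED", "false"),
        url=env.get("RAG_RERANK_URL", "http://127.0.0.1:18818/rerank"),
        initial_k=int(env.get("RAG_RERANK_INITIAL_K", "20")),
        top_k=int(env.get("RAG_RERANK_TOP_K", "5")),
        timeout_ms=int(env.get("RAG_RERANK_TIMEOUT_MS", "3000")),
        require_npu_proof=_flag(env, "RAG_RERANK_REQUIRE_NPU_PROOF", "true"),
        return_documents=_flag(env, "RAG_RERANK_RETURN_DOCUMENTS", "false"),
    )


print(load_rerank_settings({}))
```

With an empty environment the feature stays disabled, which matches the disabled-by-default posture of the plan.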

Integration tests should use synthetic in-memory candidates first. Live-vault evaluation requires a separate approval and must not mutate or rebuild the vector collection.

## Docs and diagram implications

If this prototype advances beyond spec/review, update these surfaces while keeping live/prototype labels clear:

- `openvino-reranker-npu/README.md`: keep model/runtime, endpoint contract, smoke command, and approval gates synchronized with code.
- `swarm-common/obsidian-vault/will/will-shared-zap/Runbooks/OpenVINO NPU Services Runbook.md`: list `:18818` as prototype/not enabled, with foreground smoke and NPU sysfs proof.
- Service catalog / architecture notes: show live baseline `:18810`, `:18816`, `:18817`; show `:18818` as optional second-stage RAG prototype, not live routing.
- Diagrams: render `RAG :18810 -> optional reranker :18818` as dashed/disabled or "proposed"; do not imply Atlas/Hermes/gateway traffic is using it.
- Optional systemd unit: document as installable after approval, not enabled by default.

## No-go / defer criteria

Do not ship, enable, or integrate the reranker if any of these hold:

- Port `18818` is already owned by another live service.
- `NPU` is unavailable in `ov.Core().available_devices` or `/sys/class/accel/accel0/device/npu_busy_time_us` is missing.
- Foreground startup smoke fails or has non-positive NPU busy-time delta while configured for NPU.
- Synthetic smoke top-1 ranking fails or latency is unacceptable for the intended RAG timeout budget.
- Model export requires overwriting the existing model directory or touching Chroma/vector collections.
- The service must bind beyond `127.0.0.1` to be useful.
- Live RAG integration would require reindexing, collection mutation, private-doc smoke, or Atlas/Hermes/gateway routing changes without explicit approval.
- Logs or responses would persist raw private document text outside the existing RAG request path.

## Current local preflight observed during this spec pass

- `/sys/class/accel/accel0/device/npu_busy_time_us` is readable.
- `/home/will/.cache/openvino-models/rerankers/ms-marco-MiniLM-L6-v2-int8-ov` is present.
- `/home/will/.venvs/openvino-reranker/bin/python` is present.
- `:18818` was not listening during preflight.
- `server.py` and `smoke.py` pass `python -m py_compile`.

These observations are preflight only; they are not a live service/NPU smoke result.
@@ -16,6 +16,7 @@ import argparse
import json
import math
import os
import socket
import sys
import threading
import time
@@ -251,6 +252,27 @@ def normalize_documents(value: Any, max_documents: int) -> list[dict[str, Any]]:
    return docs


def parse_top_k(value: Any, document_count: int) -> int:
    """Validate top_k/top_n before inference so schema errors return HTTP 400."""
    if value is None:
        return document_count
    if isinstance(value, bool) or not isinstance(value, int):
        raise ValueError("top_k/top_n must be a positive integer")
    if value < 1:
        raise ValueError("top_k/top_n must be a positive integer")
    return min(value, document_count)


def assert_port_available(host: str, port: int) -> None:
    """Fail fast on listener conflicts before compiling the OpenVINO model."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        try:
            sock.bind((host, port))
        except OSError as exc:
            raise RuntimeError(f"cannot bind {host}:{port}; listener conflict or invalid bind: {exc}") from exc


class Handler(BaseHTTPRequestHandler):
    server_version = "OpenVINOReranker/0.1"

@@ -293,6 +315,7 @@ class Handler(BaseHTTPRequestHandler):
            raise ValueError("query is required")
        top_k = payload.get("top_k", payload.get("top_n"))
        documents = normalize_documents(payload.get("documents"), self.max_documents)
        top_k = parse_top_k(top_k, len(documents))
        return_documents = bool(payload.get("return_documents", True))
        response = self.svc.rerank(query.strip(), documents, top_k=top_k, return_documents=return_documents)
        self.write_json(response)
@@ -342,6 +365,7 @@ def main() -> int:
    parser.add_argument("--skip-startup-smoke", action="store_true", default=os.environ.get("OPENVINO_RERANKER_SKIP_STARTUP_SMOKE", "").lower() in {"1", "true", "yes"})
    args = parser.parse_args()

    assert_port_available(args.host, args.port)
    service = RerankerService(
        Path(args.model_dir).expanduser(),
        args.model,

@@ -0,0 +1,55 @@
#!/usr/bin/env python3
"""Unit checks for reranker request validation helpers.

These tests intentionally avoid loading an OpenVINO model; they only cover the
stdlib validation helpers used before inference.
"""
from __future__ import annotations

import socket
import unittest

from server import assert_port_available, normalize_documents, parse_top_k


class ValidationTests(unittest.TestCase):
    def test_normalize_accepts_strings_and_objects(self) -> None:
        docs = normalize_documents(
            [
                "plain text document",
                {"id": "obj", "text": "object document", "metadata": {"source": "synthetic"}},
            ],
            max_documents=2,
        )
        self.assertEqual(docs[0], {"text": "plain text document"})
        self.assertEqual(docs[1]["id"], "obj")
        self.assertEqual(docs[1]["metadata"], {"source": "synthetic"})

    def test_normalize_rejects_empty_or_too_many_documents(self) -> None:
        with self.assertRaisesRegex(ValueError, "non-empty"):
            normalize_documents([], max_documents=2)
        with self.assertRaisesRegex(ValueError, "max_documents"):
            normalize_documents(["a", "b", "c"], max_documents=2)
        with self.assertRaisesRegex(ValueError, "non-empty string"):
            normalize_documents([{"id": "empty", "text": ""}], max_documents=2)

    def test_parse_top_k_defaults_clamps_and_rejects_invalid_values(self) -> None:
        self.assertEqual(parse_top_k(None, document_count=3), 3)
        self.assertEqual(parse_top_k(2, document_count=3), 2)
        self.assertEqual(parse_top_k(99, document_count=3), 3)
        for value in (0, -1, True, False, 1.5, "2", "nope"):
            with self.subTest(value=value):
                with self.assertRaisesRegex(ValueError, "positive integer"):
                    parse_top_k(value, document_count=3)

    def test_assert_port_available_detects_listener_conflict(self) -> None:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as listener:
            listener.bind(("127.0.0.1", 0))
            listener.listen(1)
            port = listener.getsockname()[1]
            with self.assertRaisesRegex(RuntimeError, "cannot bind"):
                assert_port_available("127.0.0.1", port)


if __name__ == "__main__":
    unittest.main()
@@ -45,6 +45,10 @@ printf 'busy_path=%s\n' "$BUSY_PATH"
printf 'busy_time_us=%s\n' "$(busy_value)"

section "Listeners"
# Required OpenVINO/NPU program ports: live baseline 18810/18816/18817,
# approved prototypes 18818/18819/18820, and optional doc/image triage 18829.
# 18814 is the existing RAG/embedding health wrapper; 18828 is a review-only
# alternate used to avoid collisions during prior smoke tests.
ss -ltnp | grep -E ':(18810|18814|18816|18817|18818|18819|18820|18828|18829)\b' || true

section "User service states"
@@ -73,6 +77,7 @@ http_json "OpenVINO embeddings" "http://127.0.0.1:18817/healthz" || true
http_json "NPU reranker prototype" "http://127.0.0.1:18818/readyz" || true
http_json "NPU router classifier prototype" "http://127.0.0.1:18819/healthz" || true
http_json "NPU GenAI worker prototype" "http://127.0.0.1:18820/healthz" || true
http_json "NPU doc/image triage prototype" "http://127.0.0.1:18829/healthz" || true

section "Embeddings NPU busy-time proof"
if [[ ! -r "$BUSY_PATH" ]]; then

@@ -257,9 +257,9 @@ Profile Model Gateway Alias Distribu
| Web search | SearXNG `18803` or Brave MCP `18802` | Hermes web search and MCP Brave Search are both available |
| Model proxy | LiteLLM `18804` | Use for OpenAI-compatible routed models |
| Direct local LLM | llama.cpp `18806` | Current model id: `gemma-4-26B-A4B-it-UD-IQ2_M.gguf`; useful for n8n/local automation |
| Embeddings | Ollama `18807` | Use raw Ollama API root, not `/v1`, for `/api/embed` |
| Embeddings | OpenVINO NPU `18817`; Ollama `18807` fallback | Live RAG uses `bge-base-en-v1.5-int8-ov` via OpenVINO and collection `obsidian_bge_npu`; Ollama remains a legacy/CPU fallback |
| Text-to-speech | Kokoro `18805` / Hermes TTS tool | Local speech generation |
| Speech-to-text | Whisper `18811` and wrappers | Local transcription fallback |
| Speech-to-text | Whisper OpenVINO NPU `18816`; Whisper CPU `18811` fallback | NPU service is the live default; CPU remains fallback |
| Workflow automation | n8n `18808` | Durable jobs and webhooks |
| Knowledge store | Obsidian REST `27123`; RAG/Chroma local store | Obsidian notes plus Hermes rag-search index |

@@ -293,6 +293,7 @@ Profile Model Gateway Alias Distribu
- Use file-based workflow updates for large n8n JSON payloads.
- After structural n8n workflow edits, deactivate/reactivate the workflow.
- Prefer `make` targets in `~/lab/swarm` for routine service operations.
- OpenVINO NPU prototype sidecars `:18818`, `:18819`, `:18820`, and optional `:18829` are approved prototypes only; do not enable persistent services, live Atlas/Hermes/RAG routing, vector DB mutation, or private document/image processing without explicit approval. Verify NPU usage with `/sys/class/accel/accel0/device/npu_busy_time_us`; HTTP 200 alone is not proof.
- Check git status before committing; commit only targeted non-secret source/config/docs.

## Refresh procedure

+31
-13
@@ -35,15 +35,15 @@ Safety posture:
| Obsidian/RAG endpoint | 18810 | `obsidian-reindex-endpoint.service` / local Python endpoint | `~/lab/swarm/scripts/` | live baseline; uses collection `obsidian_bge_npu` | `http://127.0.0.1:18810/healthz` | indirect via embeddings `:18817`; do not mutate existing collection |
| RAG/embedding health wrapper | 18814 | `rag-embedding-health.service` | `~/lab/swarm/swarm-common/rag-embedding-health.service` | live baseline | `http://127.0.0.1:18814/healthz` | should exercise embeddings path when configured |
| Whisper transcription, OpenVINO NPU | 18816 | Docker Compose service/container `whisper-server-npu` | `~/lab/swarm/whisper-openvino-npu/` | live baseline | `http://127.0.0.1:18816/health` | transcription response includes `npu_busy_delta_us`; sysfs delta must increase |
| OpenVINO embeddings | 18817 | user systemd `openvino-embeddings.service` | `~/lab/swarm/scripts/openvino-embeddings-server.py`; unit in `~/lab/swarm/swarm-common/openvino-embeddings.service` | live baseline, enabled | `http://127.0.0.1:18817/health` | embedding response and sysfs delta must be positive |
| OpenVINO embeddings | 18817 | user systemd `openvino-embeddings.service` | `~/lab/swarm/scripts/openvino-embeddings-server.py`; unit in `~/lab/swarm/swarm-common/openvino-embeddings.service` | live baseline, enabled | `http://127.0.0.1:18817/healthz` | embedding response and sysfs delta must be positive |
| NPU reranker prototype | 18818 | optional user systemd `openvino-reranker.service` | `~/lab/swarm/openvino-reranker-npu/` | approved prototype; not installed/enabled | `http://127.0.0.1:18818/readyz` | `/readyz` reports `device=NPU`; `/v1/rerank` response and sysfs delta must be positive |
| NPU router/classifier prototype | 18819 | optional user systemd `openvino-router-classifier.service` | `~/lab/swarm/openvino-classifier-npu/` | approved prototype; not installed/enabled | `http://127.0.0.1:18819/healthz` | `/v1/classify` response has positive `npu_busy_delta_us` and `sysfs_npu_busy_delta_us` |
| Small OpenVINO GenAI NPU worker | 18820 | optional user systemd `openvino-genai-npu-worker.service` | `~/lab/swarm/openvino-genai-npu-worker/` | approved prototype; not installed/enabled | `http://127.0.0.1:18820/healthz`; `GET /models` | generation response includes positive `npu_busy_delta_us` |
| Document/image triage prototype | 18828 or 18829 for review only | foreground local-only server; no persistent unit yet | `~/lab/swarm/openvino-doc-image-triage-npu/` | approved prototype; not installed/enabled | `http://127.0.0.1:<port>/healthz`; `GET /models` | v1 NPU stage is semantic embedding through `:18817`; image classification/OCR remain CPU/local |
| Document/image triage prototype | optional 18829 for review only; 18828 was an earlier smoke alternate | CLI-first; foreground local-only server if needed; no persistent unit yet | `~/lab/swarm/openvino-doc-image-triage-npu/` | approved prototype; not installed/enabled | `http://127.0.0.1:18829/healthz`; `GET /models` | v1 NPU stage is semantic embedding through `:18817`; image classification/OCR remain CPU/local |

Port notes:
- `18818`, `18819`, and `18820` are reserved prototype ports from the program plan; check listeners before binding.
- `18820` was used by the GenAI worker prototype. The document/image triage prototype README still contains a `18820` example, but review used `18828`/`18829` to avoid collision. Prefer `18828`/`18829` for triage foreground review until Will approves a final persistent port.
- `18820` is reserved for the GenAI worker prototype. Use optional `18829` for document/image triage foreground review until Will approves a final persistent port. `18828` was used in earlier review smoke only and should not be treated as the preferred documented port.
- Existing `:18817` is currently bound on `0.0.0.0` by the user service; prototype services should still default to `127.0.0.1`.

## Read-only unified health check
@@ -55,17 +55,17 @@ cd ~/lab/swarm
./scripts/npu-service-health.sh
```

The script is read-only. It checks listeners, user service state, Docker Compose state for `whisper-server-npu`, JSON health endpoints, and performs a non-private embeddings request while measuring `/sys/class/accel/accel0/device/npu_busy_time_us` before and after. A positive sysfs delta is required for the embeddings proof.
The script is read-only. It checks listeners for `18810`, `18816`, `18817`, `18818`, `18819`, `18820`, `18829` plus the existing `18814` wrapper and `18828` review alternate, user service state, Docker Compose state for `whisper-server-npu`, JSON health endpoints, and performs a non-private embeddings request while measuring `/sys/class/accel/accel0/device/npu_busy_time_us` before and after. A positive sysfs delta is required for the embeddings proof.

Manual minimal checks:

```bash
BUSY=/sys/class/accel/accel0/device/npu_busy_time_us
cat "$BUSY"
ss -ltnp | grep -E ':(18810|18814|18816|18817|18818|18819|18820|18828|18829)\b' || true
ss -ltnp | grep -E ':(18810|18816|18817|18818|18819|18820|18829)\b' || true
systemctl --user is-active openvino-embeddings.service rag-embedding-health.service
cd ~/lab/swarm && docker compose ps whisper-server-npu
curl -fsS http://127.0.0.1:18817/health | jq .
curl -fsS http://127.0.0.1:18817/healthz | jq .
```

Embedding NPU proof:
@@ -87,6 +87,24 @@ A healthy NPU path has:

## Service-specific smoke checks

For any foreground prototype server below, run it in a terminal you control or capture its PID and stop it at the end of the smoke. Do not use `systemctl --user enable`, Docker Compose `up -d`, `nohup`, or shell disowning for these review smokes unless Will explicitly approved persistent service enablement.

Safe foreground-server pattern:

```bash
server_pid=""
cleanup() {
  if [[ -n "$server_pid" ]] && kill -0 "$server_pid" 2>/dev/null; then
    kill "$server_pid"
    wait "$server_pid" 2>/dev/null || true
  fi
}
trap cleanup EXIT
# start prototype server with --host 127.0.0.1 --port <port> &
# server_pid=$!
# run curl/smoke commands, then let trap stop it
```

### Whisper NPU (`:18816`)

```bash
@@ -104,7 +122,7 @@ Operational notes:

```bash
systemctl --user status openvino-embeddings.service --no-pager
curl -fsS http://127.0.0.1:18817/health | jq .
curl -fsS http://127.0.0.1:18817/healthz | jq .
```

Operational notes:
@@ -186,21 +204,21 @@ Approval gate:
- May be installed as `openvino-genai-npu-worker.service` only after Will approves persistent service enablement.
- Must not become primary Atlas/Hermes model routing. Use only for bounded background jobs such as title, summary, notification condensation, and memory-candidate drafting.

### Document/image triage prototype (`:18828`/`:18829` review ports)
### Document/image triage prototype (`:18829` optional review port)

Foreground review start only, after confirming port is free:
Foreground review start only, after confirming the port is free:

```bash
ss -ltnp | grep -E ':(18828|18829)\b' || true
ss -ltnp | grep ':18829\b' || true
cd ~/lab/swarm/openvino-doc-image-triage-npu
/home/will/.venvs/npu/bin/python server.py --host 127.0.0.1 --port 18828 --allowed-root "$PWD"
/home/will/.venvs/npu/bin/python server.py --host 127.0.0.1 --port 18829 --allowed-root "$PWD"
```

Smoke:

```bash
curl -fsS http://127.0.0.1:18828/healthz | jq .
curl -fsS http://127.0.0.1:18828/models | jq .
curl -fsS http://127.0.0.1:18829/healthz | jq .
curl -fsS http://127.0.0.1:18829/models | jq .
/home/will/.venvs/npu/bin/python tests/smoke_test.py
```