OpenVINO NPU classifier/router dry-run contract
Status: specification for dry-run prototype refresh
Target port: 127.0.0.1:18819
Owner context: Atlas/Hermes local assistant sidecar evaluation
This service is an advisory classifier for Atlas/Hermes automation hints. It may suggest labels such as tool-needed, memory-candidate type, urgency, workflow category, and safety-confirmation-required, but it must not make or enforce live routing, memory, tool, or safety decisions without a separate explicit approval from Will.
Recommended model and runtime
Recommended v1 runtime: a small local Python HTTP/CLI service backed by the existing OpenVINO NPU embeddings service on 127.0.0.1:18817.
Recommended v1 model shape:
- Primary signal: bge-base-en-v1.5-int8-ov embeddings from the live embeddings service.
- Classifier layer: inspectable deterministic rules plus cosine similarity against curated synthetic/prototype utterances.
- Model label: bge-base-en-v1.5-int8-ov/prototype-router-v0.
- Device proof: request-level npu_busy_delta_us from :18817 plus direct sysfs before/after reads from /sys/class/accel/accel0/device/npu_busy_time_us.
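The prototype-similarity part of the classifier layer could look roughly like the sketch below. It is illustrative only: the function names, the bucket dictionary layout, and the "unknown" fallback are assumptions, not the contents of router_classifier.py, and the vectors stand in for embeddings returned by the :18817 service.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_prototype(query_vec, prototypes):
    """Return (bucket, score) for the most similar prototype vector.

    `prototypes` maps a hypothetical bucket name to a list of embedding vectors.
    Falls back to ("unknown", 0.0) when nothing scores above zero.
    """
    best = ("unknown", 0.0)
    for bucket, vectors in prototypes.items():
        for vec in vectors:
            score = cosine(query_vec, vec)
            if score > best[1]:
                best = (bucket, score)
    return best
```

Because every score traces back to one named prototype, a reviewer can see which synthetic utterance drove a label.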
Why this is preferred for the dry run:
- It reuses the already-live NPU embeddings path rather than adding a second model conversion/runtime dependency before contract validation.
- Rules and prototypes are transparent enough for safety-sensitive routing hints; a reviewer can inspect why a message was labeled.
- It avoids fine-tuning or training on private Atlas/Hermes transcripts.
- It keeps the service small, localhost-only, and easy to start/stop during smoke tests.
- It produces NPU activity through the embeddings path while making clear that final decision logic remains advisory.
Defer a dedicated NPU sequence-classification model such as TinyBERT/MiniLM until the dry-run labels and thresholds have been evaluated against synthetic fixtures and explicitly approved non-private examples. If pursued later, use OpenVINO Runtime/Optimum export with fixed input shapes suitable for NPU, and keep the rule layer for safety gates.
Non-goals and safety invariants
The service must not:
- Change Hermes/Atlas model routing, gateway routing, memory writes, tool-use permissions, or safety-confirmation behavior.
- Restart, stop, enable, or persist any live Atlas/Hermes/gateway/RAG service.
- Bind to anything broader than 127.0.0.1 by default.
- Mutate Chroma/vector collections, trigger reindexing, or write to RAG state.
- Process private document/image directories or private transcript dumps for smoke testing.
- Log raw prompts by default beyond normal foreground stderr during local review.
- Claim NPU success from HTTP 200 alone.
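One way to back the localhost-only invariant at startup is a guard on the bind address, sketched below. The helper name is hypothetical, and accepting only literal loopback IP addresses (rather than hostnames) is an assumption chosen to avoid DNS surprises.

```python
import ipaddress


def is_loopback_host(host: str) -> bool:
    """True only for literal loopback IP addresses such as 127.0.0.1 or ::1.

    Hostnames (including "localhost") are rejected in this sketch because
    their resolution is outside the service's control.
    """
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        return False
```

A service could refuse to start, or require an explicit override, when this returns False for the configured host.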
Endpoint contract
All HTTP endpoints are local-only by default.
Base URL:
http://127.0.0.1:18819
GET /healthz, /health, /readyz, /
Purpose: liveness/readiness metadata.
Response fields:
- status: starting | ok
- service: atlas-router-classifier
- version: service version string
- mode: always dry_run
- model: model/runtime label
- embed_url: upstream embeddings URL
- device: expected to say NPU-via-embedding-service or equivalent
- labels: supported label names
- embedding_dim: embedding dimension after warmup
- prototype_count: number of synthetic prototype examples loaded
- prototype_npu_busy_delta_us: warmup delta reported by upstream embeddings, if available
- npu_busy_time_us: current sysfs counter value, if readable
- warnings: list of non-fatal warnings
A healthy service is not enough to prove NPU execution. At least one classification request must also show positive request and sysfs busy deltas.
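The "if readable" behavior of npu_busy_time_us could be implemented along the lines of this sketch, which returns None instead of raising when the counter is missing or malformed. The function name and the choice of None as the unreadable marker are assumptions.

```python
from pathlib import Path
from typing import Optional

BUSY_PATH = "/sys/class/accel/accel0/device/npu_busy_time_us"


def read_busy_us(path: str = BUSY_PATH) -> Optional[int]:
    """Read the NPU busy-time counter in microseconds, or None if unreadable."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        # Missing file, permission error, or non-integer content.
        return None
```

A handler can then report the value when present and append a non-fatal entry to warnings when it is None.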
GET /v1/labels
Purpose: publish schema information without dumping private examples.
Response fields:
- model
- thresholds
  - tool_needed: recommended threshold 0.72
  - memory_candidate: recommended threshold 0.78
  - safety_confirmation_required: recommended threshold 0.80
  - workflow_category: recommended threshold 0.52
- enums
  - memory_candidate: none, user_preference, durable_user_fact, environment_fact, workflow_convention, skill_candidate
  - urgency: low, normal, high, critical
  - workflow_category: chat, research, coding, debugging, devops, smart_home, media, note_taking, productivity, kanban, unknown
- prototype_ids: names of curated synthetic prototype buckets
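As a rough illustration of how the recommended thresholds could turn a confidence score into a boolean label, consider the sketch below. The helper name is hypothetical, and treating a score equal to the threshold as a positive (>=) is an assumption the contract does not settle.

```python
# Recommended thresholds from the /v1/labels contract.
THRESHOLDS = {
    "tool_needed": 0.72,
    "memory_candidate": 0.78,
    "safety_confirmation_required": 0.80,
    "workflow_category": 0.52,
}


def boolean_label(name, confidence, reason_codes=()):
    """Build a boolean label entry shaped like the /v1/classify response.

    Assumption: confidence >= threshold counts as True.
    """
    threshold = THRESHOLDS[name]
    return {
        "value": confidence >= threshold,
        "confidence": confidence,
        "threshold": threshold,
        "reason_codes": list(reason_codes),
    }
```

For example, boolean_label("tool_needed", 0.84, ["local_state_requested"]) yields value true, while a confidence of 0.0 for safety_confirmation_required yields value false.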
POST /v1/classify
Purpose: classify one user/task message for advisory dry-run hints.
Request:
{
  "id": "optional-trace-id",
  "text": "Urgent: check whether port 18817 is listening and inspect systemd logs.",
  "context": {
    "platform": "cli",
    "source": "user"
  },
  "options": {
    "include_evidence": true,
    "include_embedding_debug": false,
    "dry_run": true
  }
}
Required behavior:
- Reject empty text with HTTP 400.
- Default dry_run to true.
- Produce no side effects other than local inference and response generation.
- Include evidence by default unless include_evidence=false.
- Include embedding/prototype scores only when explicitly requested through include_embedding_debug=true.
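The required request handling could be sketched as a pure parsing step like the one below. The function name, the error body shape, and the decision to treat whitespace-only text as empty are assumptions; the contract only states that empty text is rejected with HTTP 400 and which defaults apply.

```python
def parse_classify_request(body: dict):
    """Validate a /v1/classify body and apply the contract's option defaults.

    Returns (http_status, payload). The error payload shape is hypothetical.
    """
    text = body.get("text")
    # Assumption: whitespace-only text counts as empty.
    if not isinstance(text, str) or not text.strip():
        return 400, {"error": "text must be a non-empty string"}
    options = body.get("options") or {}
    return 200, {
        "id": body.get("id"),
        "text": text,
        "include_evidence": options.get("include_evidence", True),
        "include_embedding_debug": options.get("include_embedding_debug", False),
        "dry_run": options.get("dry_run", True),
    }
```

Keeping validation free of I/O makes it easy to cover in the unit/schema tests that run without an NPU.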
Response:
{
  "id": "optional-trace-id",
  "model": "bge-base-en-v1.5-int8-ov/prototype-router-v0",
  "created": 1780590000,
  "duration_ms": 12.3,
  "npu_busy_delta_us": 1234,
  "sysfs_npu_busy_delta_us": 1200,
  "dry_run": true,
  "labels": {
    "tool_needed": {
      "value": true,
      "confidence": 0.84,
      "threshold": 0.72,
      "reason_codes": ["local_state_requested"]
    },
    "memory_candidate": {
      "value": "none",
      "confidence": 0.31,
      "threshold": 0.78,
      "reason_codes": []
    },
    "urgency": {
      "value": "high",
      "confidence": 0.84,
      "scores": {"low": 0.0, "normal": 0.2, "high": 0.84, "critical": 0.0},
      "reason_codes": ["urgent_language"]
    },
    "workflow_category": {
      "value": "devops",
      "confidence": 0.86,
      "scores": {"devops": 0.86, "unknown": 0.14}
    },
    "safety_confirmation_required": {
      "value": false,
      "confidence": 0.0,
      "threshold": 0.8,
      "reason_codes": []
    }
  },
  "warnings": [],
  "evidence": []
}
POST /v1/batch_classify
Purpose: classify a bounded batch of non-private synthetic or explicitly approved messages.
Request:
{
  "items": [
    {"id": "m1", "text": "What time is it in Seattle right now?"},
    {"id": "m2", "text": "Restart the live Atlas gateway and switch primary routing."}
  ],
  "options": {"include_evidence": false, "dry_run": true}
}
Response:
- model
- duration_ms
- aggregate npu_busy_delta_us
- results: array of /v1/classify responses
Batch limits for prototype review:
- Keep batches small, ideally <= 32 items.
- Use only synthetic fixtures unless Will explicitly approves a real non-private sample set.
- Do not retain request bodies to disk.
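A client that wants to respect the batch size guidance could split fixtures before posting, as in this sketch. The helper name and the constant are assumptions; 32 is the "ideally" figure from the list above, not a limit the service is stated to enforce.

```python
MAX_BATCH_ITEMS = 32  # guidance figure from the contract, assumed as a client-side cap


def chunk_items(items, size=MAX_BATCH_ITEMS):
    """Split a list of batch items into consecutive chunks of at most `size`."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [items[i:i + size] for i in range(0, len(items), size)]
```

Each chunk would then be sent as the items array of its own /v1/batch_classify request, entirely in memory so that no request body is written to disk.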
CLI contract
The same implementation should support foreground review from the service directory:
cd /home/will/lab/swarm/openvino-classifier-npu
/home/will/.venvs/npu/bin/python router_classifier.py \
--host 127.0.0.1 \
--port 18819 \
--embed-url http://127.0.0.1:18817/v1/embeddings
Required flags/env:
- --host / OPENVINO_CLASSIFIER_HOST; default 127.0.0.1.
- --port / OPENVINO_CLASSIFIER_PORT; default 18819.
- --embed-url / OPENVINO_CLASSIFIER_EMBED_URL; default http://127.0.0.1:18817/v1/embeddings.
- --timeout-s / OPENVINO_CLASSIFIER_TIMEOUT_S; default 30.
- --no-warmup to defer prototype embedding until first request.
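The flag and environment pairing above maps naturally onto argparse, with each environment variable supplying the default for its flag. The following is a sketch under that assumption; build_parser and its env parameter are hypothetical and exist here mainly to make the defaults testable.

```python
import argparse
import os


def build_parser(env=None):
    """Build the CLI parser; environment values act as defaults, flags override them."""
    env = os.environ if env is None else env
    parser = argparse.ArgumentParser(prog="router_classifier.py")
    parser.add_argument("--host", default=env.get("OPENVINO_CLASSIFIER_HOST", "127.0.0.1"))
    parser.add_argument("--port", type=int,
                        default=int(env.get("OPENVINO_CLASSIFIER_PORT", "18819")))
    parser.add_argument("--embed-url",
                        default=env.get("OPENVINO_CLASSIFIER_EMBED_URL",
                                        "http://127.0.0.1:18817/v1/embeddings"))
    parser.add_argument("--timeout-s", type=float,
                        default=float(env.get("OPENVINO_CLASSIFIER_TIMEOUT_S", "30")))
    # Defer prototype embedding until the first request.
    parser.add_argument("--no-warmup", action="store_true")
    return parser
```

With this arrangement, build_parser().parse_args() with no arguments yields the contract defaults, and an explicit --port on the command line wins over OPENVINO_CLASSIFIER_PORT.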
A future dedicated CLI mode may be added for one-shot JSONL classification, but foreground HTTP review is sufficient for the dry-run contract.
Synthetic smoke-test plan
Preconditions:
- Confirm the :18817 embeddings service is healthy.
- Confirm :18819 is not already listening.
- Read /sys/class/accel/accel0/device/npu_busy_time_us before starting the request smoke.
- Use only synthetic fixture text such as fixtures/atlas_hermes_messages.jsonl.
Unit/schema smoke, no NPU dependency:
cd /home/will/lab/swarm
/home/will/.venvs/npu/bin/python -m unittest discover -s openvino-classifier-npu/tests -v
Foreground service smoke:
ss -ltnp | grep ':18819\b' || true
cd /home/will/lab/swarm/openvino-classifier-npu
/home/will/.venvs/npu/bin/python router_classifier.py --host 127.0.0.1 --port 18819
From another shell:
curl -fsS http://127.0.0.1:18819/healthz | jq .
curl -fsS http://127.0.0.1:18819/v1/labels | jq .
curl -fsS http://127.0.0.1:18819/v1/classify \
-H 'Content-Type: application/json' \
-d '{"id":"smoke-devops","text":"Urgent: check whether port 18817 is listening and inspect systemd logs.","options":{"include_evidence":true,"dry_run":true}}' | jq .
curl -fsS http://127.0.0.1:18819/v1/classify \
-H 'Content-Type: application/json' \
-d '{"id":"smoke-safety","text":"Restart the live Atlas gateway and switch primary routing to the new classifier.","options":{"include_evidence":true,"dry_run":true}}' | jq .
Expected label checks:
- smoke-devops: tool_needed.value=true, urgency.value=high, workflow_category.value=devops.
- smoke-safety: safety_confirmation_required.value=true, no actual restart or routing change.
- Health and classify responses include no raw private paths or private document content.
Shutdown:
- Stop the foreground server with Ctrl-C.
- Re-run ss -ltnp | grep ':18819\b' || true and confirm no listener remains.
NPU busy-time verification plan
Use sysfs plus service response fields; do not accept HTTP 200 alone.
BUSY=/sys/class/accel/accel0/device/npu_busy_time_us
before=$(cat "$BUSY")
response=$(curl -fsS http://127.0.0.1:18819/v1/classify \
-H 'Content-Type: application/json' \
-d '{"id":"npu-proof","text":"Check current systemd service status for the embeddings service.","options":{"include_evidence":false,"dry_run":true}}')
after=$(cat "$BUSY")
echo "$response" | jq '{npu_busy_delta_us, sysfs_npu_busy_delta_us, warnings}'
echo "outer_sysfs_npu_busy_delta_us=$((after-before))"
Acceptance for an NPU-backed classification request:
- HTTP request succeeds.
- Response npu_busy_delta_us > 0 from upstream embeddings.
- Response sysfs_npu_busy_delta_us > 0 when sysfs is readable.
- Outer shell after-before > 0.
- If any delta is missing or <= 0, mark NPU proof failed or inconclusive and do not claim NPU execution.
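The acceptance rules above can be condensed into a small verdict function, sketched here. The function name and the three verdict strings are assumptions, as is the mapping of a non-positive delta to "failed" and a missing delta to "inconclusive"; the contract only requires that neither case be reported as NPU execution.

```python
def npu_proof_verdict(response: dict, outer_delta_us):
    """Classify NPU evidence as "proved", "failed", or "inconclusive".

    `response` is a parsed /v1/classify body; `outer_delta_us` is the
    after-before sysfs difference measured by the caller (None if unreadable).
    """
    deltas = [
        response.get("npu_busy_delta_us"),
        response.get("sysfs_npu_busy_delta_us"),
        outer_delta_us,
    ]
    # Any delta that is present but not positive is a definite failure.
    if any(d is not None and d <= 0 for d in deltas):
        return "failed"
    # Otherwise, any missing delta leaves the proof open.
    if any(d is None for d in deltas):
        return "inconclusive"
    return "proved"
```

Only a "proved" verdict should be recorded as NPU execution in a handoff; the other two outcomes belong in the notes as unproven.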
Docs and diagram implications
If this prototype is refreshed or reviewed, update documentation to show:
- Live baseline remains RAG :18810, RAG health :18814, Whisper NPU :18816, and embeddings :18817.
- Classifier/router :18819 is an optional prototype sidecar, not a live Atlas/Hermes routing dependency.
- Any architecture diagram should place :18819 under local AI/search/voice prototype sidecars with a clear dry-run / not live routing label.
- Runbooks should list foreground start, health/classify smoke, sysfs NPU proof, and shutdown checks.
- Service catalog entries should state not installed/enabled until Will approves persistent service enablement.
- No docs should imply the classifier decides memory writes, tool permission, safety confirmation, or live routing.
Relevant docs inventory:
- docs/swarm-infrastructure.md
- docs/swarm-infrastructure.html
- docs/diagram-maintenance.md
- swarm-common/obsidian-vault/will/will-shared-zap/Runbooks/OpenVINO NPU Services Runbook.md
- swarm-common/obsidian-vault/will/will-shared-zap/Resources/Service Catalog.md
No-go / defer criteria
Do not proceed to implementation refresh, persistent service enablement, or live integration if any of the following hold:
- :18817 embeddings is unavailable and no approved NPU embedding fallback exists.
- /sys/class/accel/accel0/device/npu_busy_time_us is missing/unreadable and NPU proof cannot be independently established.
- Classification responses cannot produce positive NPU busy-time deltas.
- :18819 is already occupied by an unknown or live service.
- Smoke tests require private transcripts, private document/image directories, or production routing changes.
- Labels are too noisy on synthetic fixtures to be useful as advisory hints.
- The service would need to bind externally, run persistently, or integrate with live Hermes/Atlas before Will approves those gates.
- Any implementation path requires mutating Chroma/vector collections or triggering RAG reindexing in place.
Implementation handoff notes
Recommended next engineer actions:
- Verify or refresh openvino-classifier-npu/router_classifier.py to match this contract.
- Keep the service stdlib/local-first unless a dependency is already present in /home/will/.venvs/npu.
- Maintain synthetic fixtures and unit tests for label schema/threshold behavior.
- Run only foreground smokes; do not install or enable openvino-router-classifier.service.
- Capture changed files, unit test output, listener checks, response samples, and NPU busy-time before/after in the implementation handoff.