William Valentin ea452886f3 feat(npu): add dry-run classifier router prototype

2026-06-04 13:07:51 -07:00

OpenVINO NPU classifier/router dry-run contract

Status: specification for dry-run prototype refresh
Target port: 127.0.0.1:18819
Owner context: Atlas/Hermes local assistant sidecar evaluation

This service is an advisory classifier for Atlas/Hermes automation hints. It may suggest labels such as tool-needed, memory-candidate type, urgency, workflow category, and safety-confirmation-required, but it must not make or enforce live routing, memory, tool, or safety decisions without a separate explicit approval from Will.

Recommended model and runtime

Recommended v1 runtime: a small local Python HTTP/CLI service backed by the existing OpenVINO NPU embeddings service on 127.0.0.1:18817.

Recommended v1 model shape:

Primary signal: bge-base-en-v1.5-int8-ov embeddings from the live embeddings service.
Classifier layer: inspectable deterministic rules plus cosine similarity against curated synthetic/prototype utterances.
Model label: bge-base-en-v1.5-int8-ov/prototype-router-v0.
Device proof: request-level npu_busy_delta_us from :18817 plus direct sysfs before/after reads from /sys/class/accel/accel0/device/npu_busy_time_us.
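The prototype-matching step of the classifier layer can be illustrated with a minimal sketch. It assumes each prototype bucket holds a list of embedding vectors and that a message is assigned to the bucket containing its single closest prototype; the function names, the toy 3-dimensional vectors, and the "unknown" fallback are hypothetical and not taken from router_classifier.py.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_bucket(
    query: Sequence[float],
    prototypes: Dict[str, List[Sequence[float]]],
) -> Tuple[str, float]:
    """Return (bucket_name, score) for the bucket holding the closest prototype."""
    best_name, best_score = "unknown", 0.0  # hypothetical fallback
    for name, vectors in prototypes.items():
        for vec in vectors:
            score = cosine_similarity(query, vec)
            if score > best_score:
                best_name, best_score = name, score
    return best_name, best_score


# Toy vectors stand in for real bge-base-en-v1.5 embeddings.
toy_prototypes = {
    "devops": [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]],
    "chat": [[0.0, 1.0, 0.0]],
}
print(best_bucket([0.8, 0.2, 0.0], toy_prototypes))
```

Scoring by the single best prototype keeps the result easy to inspect, since a reviewer can point at the exact utterance that matched. The real service may aggregate scores differently.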

Why this is preferred for the dry run:

It reuses the already-live NPU embeddings path rather than adding a second model conversion/runtime dependency before contract validation.
Rules and prototypes are transparent enough for safety-sensitive routing hints; a reviewer can inspect why a message was labeled.
It avoids fine-tuning or training on private Atlas/Hermes transcripts.
It keeps the service small, localhost-only, and easy to start/stop during smoke tests.
It produces NPU activity through the embeddings path while making clear that final decision logic remains advisory.

Defer a dedicated NPU sequence-classification model such as TinyBERT or MiniLM until the dry-run labels and thresholds have been evaluated against synthetic fixtures and explicitly approved non-private examples. If one is pursued later, use OpenVINO Runtime/Optimum export with fixed input shapes suitable for the NPU, and keep the rule layer for safety gates.

Non-goals and safety invariants

The service must not:

Change Hermes/Atlas model routing, gateway routing, memory writes, tool-use permissions, or safety-confirmation behavior.
Restart, stop, enable, or persist any live Atlas/Hermes/gateway/RAG service.
Bind to anything broader than 127.0.0.1 by default.
Mutate Chroma/vector collections, trigger reindexing, or write to RAG state.
Process private document/image directories or private transcript dumps for smoke testing.
Log raw prompts by default beyond normal foreground stderr during local review.
Claim NPU success from HTTP 200 alone.

Endpoint contract

All HTTP endpoints are local-only by default.

Base URL:

http://127.0.0.1:18819

GET `/healthz`, `/health`, `/readyz`, `/`

Purpose: liveness/readiness metadata.

Response fields:

status: starting | ok
service: atlas-router-classifier
version: service version string
mode: always dry_run
model: model/runtime label
embed_url: upstream embeddings URL
device: expected to read NPU-via-embedding-service or equivalent
labels: supported label names
embedding_dim: embedding dimension after warmup
prototype_count: number of synthetic prototype examples loaded
prototype_npu_busy_delta_us: warmup delta reported by upstream embeddings, if available
npu_busy_time_us: current sysfs counter value, if readable
warnings: list of non-fatal warnings

A healthy service is not enough to prove NPU execution. At least one classification request must also show positive request and sysfs busy deltas.
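A smoke script might sanity-check the health payload before moving on to classification. The sketch below assumes a subset of fields is always present and treats the rest (embedding_dim, prototype_count, prototype_npu_busy_delta_us, npu_busy_time_us) as optional; that split and the helper name are assumptions, not part of the contract.

```python
from typing import Any, Dict, List

# Assumption: these fields are always present; the remaining fields listed
# in the contract are treated as optional here.
ALWAYS_PRESENT = (
    "status", "service", "version", "mode", "model",
    "embed_url", "device", "labels", "warnings",
)


def health_problems(payload: Dict[str, Any]) -> List[str]:
    """Return human-readable problems in a health payload (empty list if none)."""
    problems = [
        f"missing field: {name}" for name in ALWAYS_PRESENT if name not in payload
    ]
    if payload.get("status") not in ("starting", "ok"):
        problems.append("status must be 'starting' or 'ok'")
    if payload.get("mode") != "dry_run":
        problems.append("mode must be 'dry_run'")
    return problems
```

An empty result only says the payload is well-formed. It says nothing about NPU execution, which still needs the busy-time deltas from a classification request.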

GET `/v1/labels`

Purpose: publish schema information without dumping private examples.

Response fields:

model
thresholds
- tool_needed: recommended threshold 0.72
- memory_candidate: recommended threshold 0.78
- safety_confirmation_required: recommended threshold 0.80
- workflow_category: recommended threshold 0.52
enums
- memory_candidate: none, user_preference, durable_user_fact, environment_fact, workflow_convention, skill_candidate
- urgency: low, normal, high, critical
- workflow_category: chat, research, coding, debugging, devops, smart_home, media, note_taking, productivity, kanban, unknown
prototype_ids: names of curated synthetic prototype buckets
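One way the recommended thresholds could be applied is sketched below. It assumes a boolean label fires when its confidence is at or above the threshold, and that an enum label falls back to "none" or "unknown" when its best score is below the threshold; both rules and the helper names are assumptions for illustration, since the contract only publishes the threshold values.

```python
from typing import Any, Dict

# Recommended thresholds from the /v1/labels schema above.
THRESHOLDS = {
    "tool_needed": 0.72,
    "memory_candidate": 0.78,
    "safety_confirmation_required": 0.80,
    "workflow_category": 0.52,
}
# Assumed fallback values for enum labels that do not clear their threshold.
FALLBACKS = {"memory_candidate": "none", "workflow_category": "unknown"}


def boolean_label(name: str, confidence: float) -> Dict[str, Any]:
    """Boolean label: true when confidence reaches the threshold (assumed >=)."""
    threshold = THRESHOLDS[name]
    return {
        "value": confidence >= threshold,
        "confidence": confidence,
        "threshold": threshold,
    }


def enum_label(name: str, scores: Dict[str, float]) -> Dict[str, Any]:
    """Enum label: best-scoring value, or the fallback if it is below threshold."""
    threshold = THRESHOLDS[name]
    if not scores:
        return {"value": FALLBACKS[name], "confidence": 0.0, "threshold": threshold}
    best, confidence = max(scores.items(), key=lambda item: item[1])
    value = best if confidence >= threshold else FALLBACKS[name]
    return {"value": value, "confidence": confidence, "threshold": threshold}
```

With these rules, a tool_needed confidence of 0.84 yields true and a memory_candidate best score of 0.31 yields "none", which matches the sample /v1/classify response.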

POST `/v1/classify`

Purpose: classify one user/task message for advisory dry-run hints.

Request:

{
  "id": "optional-trace-id",
  "text": "Urgent: check whether port 18817 is listening and inspect systemd logs.",
  "context": {
    "platform": "cli",
    "source": "user"
  },
  "options": {
    "include_evidence": true,
    "include_embedding_debug": false,
    "dry_run": true
  }
}

Required behavior:

Reject empty text with HTTP 400.
Default dry_run to true.
Produce no side effects other than local inference and response generation.
Include evidence unless include_evidence=false.
Include embedding/prototype scores only when explicitly requested through include_embedding_debug=true.
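The request rules above can be sketched as a small normalization step. The BadRequest exception and function name are hypothetical, and treating whitespace-only text as empty is an assumption; the handler would map BadRequest to HTTP 400.

```python
from typing import Any, Dict


class BadRequest(ValueError):
    """Hypothetical error type that the HTTP layer would map to status 400."""


def normalize_classify_request(body: Any) -> Dict[str, Any]:
    """Validate a /v1/classify body and fill in the documented option defaults."""
    if not isinstance(body, dict):
        raise BadRequest("request body must be a JSON object")
    text = body.get("text")
    # Assumption: whitespace-only text counts as empty.
    if not isinstance(text, str) or not text.strip():
        raise BadRequest("text must be a non-empty string")
    options = body.get("options") or {}
    if not isinstance(options, dict):
        raise BadRequest("options must be a JSON object")
    return {
        "id": body.get("id"),
        "text": text,
        "context": body.get("context") or {},
        "options": {
            "include_evidence": bool(options.get("include_evidence", True)),
            "include_embedding_debug": bool(
                options.get("include_embedding_debug", False)
            ),
            "dry_run": bool(options.get("dry_run", True)),
        },
    }
```

The contract does not say how an explicit dry_run=false should be handled; this sketch only applies the default and passes the value through.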

Response:

{
  "id": "optional-trace-id",
  "model": "bge-base-en-v1.5-int8-ov/prototype-router-v0",
  "created": 1780590000,
  "duration_ms": 12.3,
  "npu_busy_delta_us": 1234,
  "sysfs_npu_busy_delta_us": 1200,
  "dry_run": true,
  "labels": {
    "tool_needed": {
      "value": true,
      "confidence": 0.84,
      "threshold": 0.72,
      "reason_codes": ["local_state_requested"]
    },
    "memory_candidate": {
      "value": "none",
      "confidence": 0.31,
      "threshold": 0.78,
      "reason_codes": []
    },
    "urgency": {
      "value": "high",
      "confidence": 0.84,
      "scores": {"low": 0.0, "normal": 0.2, "high": 0.84, "critical": 0.0},
      "reason_codes": ["urgent_language"]
    },
    "workflow_category": {
      "value": "devops",
      "confidence": 0.86,
      "scores": {"devops": 0.86, "unknown": 0.14}
    },
    "safety_confirmation_required": {
      "value": false,
      "confidence": 0.0,
      "threshold": 0.8,
      "reason_codes": []
    }
  },
  "warnings": [],
  "evidence": []
}

POST `/v1/batch_classify`

Purpose: classify a bounded batch of non-private synthetic or explicitly approved messages.

Request:

{
  "items": [
    {"id": "m1", "text": "What time is it in Seattle right now?"},
    {"id": "m2", "text": "Restart the live Atlas gateway and switch primary routing."}
  ],
  "options": {"include_evidence": false, "dry_run": true}
}

Response:

model
duration_ms
npu_busy_delta_us: aggregate for the whole batch
results: array of per-item responses in the /v1/classify response shape

Batch limits for prototype review:

Keep batches small; the prototype rejects empty batches and batches larger than OPENVINO_CLASSIFIER_MAX_BATCH_SIZE (default 32).
Use only synthetic fixtures unless Will explicitly approves a real non-private sample set.
Do not retain request bodies to disk.
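The batch size limit can be sketched as follows. The helper names are hypothetical; only the environment variable name, the default of 32, and the rejection of empty and oversized batches come from the contract.

```python
import os
from typing import Any, List, Mapping, Optional


def max_batch_size(env: Optional[Mapping[str, str]] = None) -> int:
    """Read the batch limit from the environment, defaulting to 32."""
    env = os.environ if env is None else env
    return int(env.get("OPENVINO_CLASSIFIER_MAX_BATCH_SIZE", "32"))


def validate_batch(items: Any, limit: int) -> List[Any]:
    """Reject empty batches and batches larger than the limit."""
    if not isinstance(items, list) or not items:
        raise ValueError("items must be a non-empty list")
    if len(items) > limit:
        raise ValueError(f"batch of {len(items)} exceeds limit of {limit}")
    return items
```

Passing the environment as a mapping keeps the limit testable without touching the process environment.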

CLI contract

The same implementation should support foreground review from the service directory:

cd /home/will/lab/swarm/openvino-classifier-npu
/home/will/.venvs/npu/bin/python router_classifier.py \
  --host 127.0.0.1 \
  --port 18819 \
  --embed-url http://127.0.0.1:18817/v1/embeddings

Required flags/env:

--host / OPENVINO_CLASSIFIER_HOST; default 127.0.0.1.
--port / OPENVINO_CLASSIFIER_PORT; default 18819.
--embed-url / OPENVINO_CLASSIFIER_EMBED_URL; default http://127.0.0.1:18817/v1/embeddings.
--timeout-s / OPENVINO_CLASSIFIER_TIMEOUT_S; default 30.
--max-batch-size / OPENVINO_CLASSIFIER_MAX_BATCH_SIZE; default 32.
--no-warmup; defer prototype embedding until the first request.
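A minimal argparse sketch of the flag and environment contract is shown below. It assumes environment variables supply the defaults and explicit flags override them; the precedence and the build_parser name are assumptions, not confirmed behavior of router_classifier.py.

```python
import argparse
import os
from typing import Mapping, Optional


def build_parser(env: Optional[Mapping[str, str]] = None) -> argparse.ArgumentParser:
    """Build a parser whose defaults come from OPENVINO_CLASSIFIER_* variables."""
    env = os.environ if env is None else env
    parser = argparse.ArgumentParser(description="Dry-run router classifier")
    parser.add_argument(
        "--host", default=env.get("OPENVINO_CLASSIFIER_HOST", "127.0.0.1")
    )
    parser.add_argument(
        "--port", type=int,
        default=int(env.get("OPENVINO_CLASSIFIER_PORT", "18819")),
    )
    parser.add_argument(
        "--embed-url",
        default=env.get(
            "OPENVINO_CLASSIFIER_EMBED_URL",
            "http://127.0.0.1:18817/v1/embeddings",
        ),
    )
    parser.add_argument(
        "--timeout-s", type=float,
        default=float(env.get("OPENVINO_CLASSIFIER_TIMEOUT_S", "30")),
    )
    parser.add_argument(
        "--max-batch-size", type=int,
        default=int(env.get("OPENVINO_CLASSIFIER_MAX_BATCH_SIZE", "32")),
    )
    # Defer prototype embedding until the first request.
    parser.add_argument("--no-warmup", action="store_true")
    return parser
```

For example, `build_parser().parse_args(["--port", "18819", "--no-warmup"])` yields a namespace with `port=18819` and `no_warmup=True`.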

A future dedicated CLI mode may be added for one-shot JSONL classification, but foreground HTTP review is sufficient for the dry-run contract.

Synthetic smoke-test plan

Preconditions:

Confirm :18817 embeddings service is healthy.
Confirm :18819 is not already listening.
Read /sys/class/accel/accel0/device/npu_busy_time_us before starting the request smoke test.
Use only synthetic fixture text such as fixtures/atlas_hermes_messages.jsonl.

Unit/schema smoke, no NPU dependency:

cd /home/will/lab/swarm
/home/will/.venvs/npu/bin/python -m unittest discover -s openvino-classifier-npu/tests -v

Foreground service smoke:

ss -ltnp | grep ':18819\b' || true
cd /home/will/lab/swarm/openvino-classifier-npu
/home/will/.venvs/npu/bin/python router_classifier.py --host 127.0.0.1 --port 18819

From another shell:

curl -fsS http://127.0.0.1:18819/healthz | jq .
curl -fsS http://127.0.0.1:18819/v1/labels | jq .
curl -fsS http://127.0.0.1:18819/v1/classify \
  -H 'Content-Type: application/json' \
  -d '{"id":"smoke-devops","text":"Urgent: check whether port 18817 is listening and inspect systemd logs.","options":{"include_evidence":true,"dry_run":true}}' | jq .
curl -fsS http://127.0.0.1:18819/v1/classify \
  -H 'Content-Type: application/json' \
  -d '{"id":"smoke-safety","text":"Restart the live Atlas gateway and switch primary routing to the new classifier.","options":{"include_evidence":true,"dry_run":true}}' | jq .

Expected label checks:

smoke-devops: tool_needed.value=true, urgency.value=high, workflow_category.value=devops.
smoke-safety: safety_confirmation_required.value=true, and no actual restart or routing change occurs.
Health and classify responses include no raw private paths or private document content.
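The expected label checks can also be automated against saved responses. This is a sketch only: the EXPECTED table mirrors the two smoke IDs above, and the function name is hypothetical. It does not cover the side-effect or privacy checks, which still need manual review.

```python
from typing import Any, Dict, List

# Expected label values for the two smoke requests, keyed by request id.
EXPECTED = {
    "smoke-devops": {
        "tool_needed": True,
        "urgency": "high",
        "workflow_category": "devops",
    },
    "smoke-safety": {"safety_confirmation_required": True},
}


def label_mismatches(response: Dict[str, Any]) -> List[str]:
    """Compare a /v1/classify response with the expectations for its id."""
    expected = EXPECTED.get(response.get("id"), {})
    labels = response.get("labels", {})
    mismatches = []
    for name, want in expected.items():
        got = labels.get(name, {}).get("value")
        if got != want:
            mismatches.append(f"{name}: expected {want!r}, got {got!r}")
    return mismatches
```

An empty list means every expected label matched. A response with an unrecognized id is not checked at all, so the caller should confirm the id first.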

Shutdown:

Stop the foreground server with Ctrl-C.
Re-run ss -ltnp | grep ':18819\b' || true and confirm no listener remains.

NPU busy-time verification plan

Use sysfs plus service response fields; do not accept HTTP 200 alone.

BUSY=/sys/class/accel/accel0/device/npu_busy_time_us
before=$(cat "$BUSY")
response=$(curl -fsS http://127.0.0.1:18819/v1/classify \
  -H 'Content-Type: application/json' \
  -d '{"id":"npu-proof","text":"Check current systemd service status for the embeddings service.","options":{"include_evidence":false,"dry_run":true}}')
after=$(cat "$BUSY")
echo "$response" | jq '{npu_busy_delta_us, sysfs_npu_busy_delta_us, warnings}'
echo "outer_sysfs_npu_busy_delta_us=$((after-before))"

Optional localhost smoke helper, run from /home/will/lab/swarm after starting the foreground service:

/home/will/.venvs/npu/bin/python openvino-classifier-npu/smoke_classifier.py \
  --base-url http://127.0.0.1:18819

Acceptance for an NPU-backed classification request:

HTTP request succeeds.
Response npu_busy_delta_us > 0 from upstream embeddings.
Response sysfs_npu_busy_delta_us > 0 when sysfs is readable.
Outer shell after-before > 0.
If any delta is missing or <= 0, mark NPU proof failed or inconclusive and do not claim NPU execution.
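The acceptance rules above reduce to a small decision function once the HTTP request has succeeded. The mapping used here (a non-positive delta means "failed", a missing delta means "inconclusive") is an assumption; the contract only says that either case must not be reported as NPU execution. The function name and status strings are hypothetical.

```python
from typing import Optional


def npu_proof_status(
    response_delta: Optional[int],
    sysfs_delta: Optional[int],
    outer_delta: Optional[int],
) -> str:
    """Classify NPU proof from the three busy-time deltas, in microseconds.

    response_delta: npu_busy_delta_us from the classify response
    sysfs_delta:    sysfs_npu_busy_delta_us from the classify response
    outer_delta:    after-before measured by the caller around the request
    """
    deltas = (response_delta, sysfs_delta, outer_delta)
    # Any measured delta that is zero or negative counts as a failed proof.
    if any(d is not None and d <= 0 for d in deltas):
        return "failed"
    # A missing delta (for example, unreadable sysfs) leaves the proof open.
    if any(d is None for d in deltas):
        return "inconclusive"
    return "proven"
```

Only "proven" should be reported as NPU execution; the other two outcomes belong in the handoff notes alongside the raw numbers.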

Docs and diagram implications

If this prototype is refreshed or reviewed, update documentation to show:

Live baseline remains RAG :18810, RAG health :18814, Whisper NPU :18816, and embeddings :18817.
Classifier/router :18819 is an optional prototype sidecar, not a live Atlas/Hermes routing dependency.
Any architecture diagram should place :18819 under local AI/search/voice prototype sidecars with a clear "dry-run / not live routing" label.
Runbooks should list foreground start, health/classify smoke, sysfs NPU proof, and shutdown checks.
Service catalog entries should state "not installed/enabled" until Will approves persistent service enablement.
No docs should imply the classifier decides memory writes, tool permission, safety confirmation, or live routing.

Relevant docs inventory:

docs/swarm-infrastructure.md
docs/swarm-infrastructure.html
docs/diagram-maintenance.md
swarm-common/obsidian-vault/will/will-shared-zap/Runbooks/OpenVINO NPU Services Runbook.md
swarm-common/obsidian-vault/will/will-shared-zap/Resources/Service Catalog.md

No-go / defer criteria

Do not proceed to implementation refresh, persistent service enablement, or live integration if any of the following hold:

:18817 embeddings is unavailable and no approved NPU embedding fallback exists.
/sys/class/accel/accel0/device/npu_busy_time_us is missing/unreadable and NPU proof cannot be independently established.
Classification responses cannot produce positive NPU busy-time deltas.
:18819 is already occupied by an unknown or live service.
Smoke tests require private transcripts, private document/image directories, or production routing changes.
Labels are too noisy on synthetic fixtures to be useful as advisory hints.
The service would need to bind externally, run persistently, or integrate with live Hermes/Atlas before Will approves those gates.
Any implementation path requires mutating Chroma/vector collections or triggering RAG reindexing in place.

Implementation handoff notes

Recommended next engineer actions:

Verify or refresh openvino-classifier-npu/router_classifier.py to match this contract.
Keep the service stdlib/local-first unless a dependency is already present in /home/will/.venvs/npu.
Maintain synthetic fixtures and unit tests for label schema/threshold behavior.
Run only foreground smoke tests; do not install or enable openvino-router-classifier.service.
Capture changed files, unit test output, listener checks, response samples, and NPU busy-time before/after in the implementation handoff.
