NPU advisory decision schema and dry-run evaluation metrics
This document defines the compact npu_advisory_decision_v1 record and the
minimum dry-run metrics required before any OpenVINO/NPU advisory lane is
considered for promotion. The schema is advisory-only: it creates audit evidence
and comparison data, not live authority.
Scope and safety defaults:
- Local audit records only; no outbound sends, service restarts, tool execution, memory writes, routing changes, vector-store mutation, or broad private scans.
- Synthetic or explicitly non-private fixtures only for dry-run evaluation.
- Raw prompts, transcripts, documents, images, headers, secrets, and full upstream JSON payloads are not persisted by default.
- NPU output is evidence for a gate. It must never directly perform or trigger an action.
npu_advisory_decision_v1
Required top-level fields:
| Field | Type | Required | Notes |
|---|---|---|---|
| `schema_version` | string | yes | Always `npu_advisory_decision_v1`. |
| `decision_id` | string | yes | Locally generated UUID/ULID. No payload-derived PII. |
| `timestamp` | string | yes | RFC3339/ISO-8601 UTC timestamp. |
| `source` | object | yes | Where the dry-run input came from. |
| `service` | object | yes | Advisory lane/service that produced the recommendation. |
| `input_class` | string | yes | Normalized class such as `context_gate`, `cron_n8n_event`, `batch_doc_triage`, `voice_audio`, `kanban_hygiene`, or `advisory_gateway_envelope`. |
| `recommendation` | object | yes | NPU/advisory recommendation and rationale metadata. |
| `confidence` | object | yes | Score, bucket, and calibration notes. |
| `authority_flags` | object | yes | Explicit booleans for authority boundaries; every `can_*` flag defaults to false. |
| `allowed_actions` | array[string] | yes | Actions a downstream gate may consider. Defaults to advisory-only actions. |
| `actual_action` | object | yes | What really happened. In this gate it should always be no-op/record-only. |
| `human_or_atlas_decision` | object | yes | Comparison target from fixture expected label, human label, or Atlas decision. |
| `outcome` | object | yes | Agreement/error bucket used by the eval harness. |
| `npu_proof` | object | yes | Evidence that a real NPU-backed inference ran, where available. |
| `latency` | object | yes | Request latency and optional queue/processing timings. |
| `fallback` | object | yes | Whether CPU/offline/health-only fallback happened and why. |
| `privacy` | object | yes | What was redacted/hashed and what retention class applies. |
| `notes` | array[string] | no | Short non-private audit notes. |
Field details
source:
- `kind`: `fixture`, `manual_label`, `atlas_shadow`, `human_review`, or `service_health_probe`.
- `fixture_id`: stable fixture identifier when applicable.
- `fixture_set`: fixture collection name/version.
- `artifact_ref`: optional local path or opaque run id; do not include raw private content.
- `content_hash`: optional SHA-256 over sanitized fixture content.
- `privacy_class`: `synthetic`, `public`, `non_private`, `redacted`, or `private_disallowed`.
service:
- `name`: e.g. `openvino_context_gate`, `cron_n8n_advisory`, `npu_batch_triage`, `npu_voice_audio_pipeline`, `kanban_hygiene_advisory`, `openvino_advisory_gateway`.
- `endpoint`: local endpoint label or script name; avoid sensitive URL params.
- `mode`: `dry_run`, `shadow`, `health_only`, or `offline_fixture`.
- `model`: optional model/backend label, if safe to log.
recommendation:
- `label`: normalized recommendation, e.g. `suppress`, `log`, `summarize`, `escalate`, `retrieve_more_context`, `skip_private_root`, `needs_human`, `no_action`, or `unknown`.
- `severity`: `none`, `info`, `low`, `medium`, `high`, or `critical`.
- `reasons`: short non-private reason codes, not raw excerpts.
- `evidence_refs`: bounded references to sanitized fixture fields or artifact ids.
- `raw_output_ref`: optional local artifact pointer; default null.
confidence:
- `score`: float from 0.0 to 1.0 when available, otherwise null.
- `bucket`: one of `very_low`, `low`, `medium`, `high`, `very_high`, or `unknown`.
- `bucket_rule`: the threshold rule used by the harness.
- `calibrated`: boolean; false until enough labeled dry-run data exists.
Recommended confidence buckets:
| Bucket | Score range | Gate behavior |
|---|---|---|
| `very_low` | < 0.40 | Treat as uncertain; never escalate automatically. |
| `low` | 0.40-0.59 | Advisory note only; human/Atlas decides. |
| `medium` | 0.60-0.79 | Eligible for comparison metrics; no live action. |
| `high` | 0.80-0.94 | Strong advisory evidence; still gated. |
| `very_high` | >= 0.95 | Promotion candidate only after repeated eval success. |
| `unknown` | null/missing | Count separately; do not coerce to zero. |
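The bucket table can be expressed as a small lookup. The following is a minimal sketch, not the harness implementation: `bucket_for_score` is a hypothetical helper name, and treating each range as a half-open interval (so a score such as 0.595 still falls in exactly one bucket) is an assumption, since the table only lists two-decimal ranges.

```python
from typing import Optional

# Lower bounds taken from the bucket table. Half-open intervals are an
# assumption so that every score in [0.0, 1.0] maps to exactly one bucket.
_BUCKET_FLOORS = [
    (0.95, "very_high"),
    (0.80, "high"),
    (0.60, "medium"),
    (0.40, "low"),
]


def bucket_for_score(score: Optional[float]) -> str:
    """Map a confidence score to a bucket name (hypothetical helper)."""
    if score is None:
        # Missing scores are counted separately, never coerced to zero.
        return "unknown"
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score out of range: {score!r}")
    for floor, name in _BUCKET_FLOORS:
        if score >= floor:
            return name
    return "very_low"
```

With this sketch, the score 0.91 used in the example record maps to `high`, and a null score maps to `unknown` rather than `very_low`.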
authority_flags:
All `can_*` flags default to false and must remain false for this gate. The two guard flags, `requires_human_approval` and `advisory_only`, are set as described below the list.

- `can_route_atlas`
- `can_write_memory`
- `can_execute_tools`
- `can_restart_services`
- `can_send_outbound`
- `can_scan_private_roots`
- `can_mutate_vector_store`
- `can_post_advisory_event`
- `can_change_gateway_config`
- `requires_human_approval`
- `advisory_only`

For this gate, `advisory_only=true` and `requires_human_approval=true` for any
recommendation that could eventually affect live behavior.
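A harness could count authority violations with a check along these lines. This is a sketch under stated assumptions: `authority_violations` is a hypothetical helper, a missing flag is treated as a violation, and `requires_human_approval` is required to be true on every record, which is stricter than the wording above.

```python
from typing import Any, Dict, List

CAN_FLAGS = (
    "can_route_atlas", "can_write_memory", "can_execute_tools",
    "can_restart_services", "can_send_outbound", "can_scan_private_roots",
    "can_mutate_vector_store", "can_post_advisory_event",
    "can_change_gateway_config",
)
# Assumption: both guard flags must be true on every record in this gate.
MUST_BE_TRUE = ("requires_human_approval", "advisory_only")


def authority_violations(flags: Dict[str, Any]) -> List[str]:
    """Return the names of flags that break the advisory-only contract."""
    violations = []
    for name in CAN_FLAGS:
        # Anything other than an explicit False (including a missing key)
        # is reported, because the schema asks for explicit booleans.
        if flags.get(name) is not False:
            violations.append(name)
    for name in MUST_BE_TRUE:
        if flags.get(name) is not True:
            violations.append(name)
    return violations
```

A non-empty result would increment `authority_flag_violation_count` for the run.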
allowed_actions:
Allowed by default:
- `record_metric`
- `compare_with_expected_label`
- `include_in_digest`
- `open_review_ticket_candidate`
- `recommend_human_review`
Disallowed unless a later approval explicitly changes scope:
- `route_atlas`
- `write_memory`
- `execute_tool`
- `restart_service`
- `send_message`
- `scan_private_root`
- `mutate_vector_store`
- `post_gateway_event`
actual_action:
- `kind`: should be `none`, `recorded_metric`, or `dry_run_reported`.
- `performed`: boolean; false for live side effects in this gate.
- `performed_by`: `harness`, `human`, `atlas`, or null.
- `side_effects`: array; should be empty except local report/artifact writes.
human_or_atlas_decision:
- `source`: `fixture_expected`, `human_label`, `atlas_shadow`, or `missing`.
- `label`: normalized decision label using the same label set as `recommendation.label` when possible.
- `severity`: normalized severity when applicable.
- `confidence`: optional Atlas/human confidence if available.
- `decision_ref`: optional review id, fixture id, or session/run id.
- `timestamp`: optional timestamp for the comparison decision.
outcome:
- `comparison`: `agree`, `disagree`, `uncertain`, `missing_reference`, or `not_applicable`.
- `error_type`: null or one of `false_positive`, `false_negative`, `severity_overcall`, `severity_undercall`, `unsafe_authority`, `privacy_violation`, `fallback_unexpected`, `latency_slo_miss`, `npu_proof_missing`.
- `human_review_required`: boolean.
- `promotion_blocker`: boolean.
npu_proof:
- `proof_mode`: `sysfs_busy_delta`, `service_reported_delta`, `health_only`, `offline_fixture`, or `unavailable`.
- `busy_delta_us`: integer or null.
- `service_reported_delta_us`: integer or null.
- `inference_ran`: boolean.
- `proof_ok`: boolean or null. Null means not measurable, not false.
- `counter_path`: usually `/sys/class/accel/accel0/device/npu_busy_time_us`, if logged safely.
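One way to produce a `sysfs_busy_delta` proof is to read the busy-time counter before and after the inference call. The sketch below is illustrative only: `read_counter` and `measure_npu_proof` are hypothetical helpers, the rule "`proof_ok` is true when the delta is positive" is an assumption, and `inference_ran` simply records that the callable returned.

```python
from pathlib import Path
from typing import Callable, Dict, Optional

COUNTER_PATH = "/sys/class/accel/accel0/device/npu_busy_time_us"


def read_counter(path: str = COUNTER_PATH) -> Optional[int]:
    """Read the busy-time counter; return None if it is missing or unparsable."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


def measure_npu_proof(
    run_inference: Callable[[], None],
    reader: Callable[[], Optional[int]] = read_counter,
) -> Dict[str, object]:
    """Run one inference and build an npu_proof object from the counter delta."""
    before = reader()
    run_inference()
    after = reader()
    proof: Dict[str, object] = {
        "service_reported_delta_us": None,
        "inference_ran": True,  # assumption: the callable returning counts as "ran"
        "counter_path": COUNTER_PATH,
    }
    if before is None or after is None:
        # Not measurable is reported as null, never as false.
        proof.update(proof_mode="unavailable", busy_delta_us=None, proof_ok=None)
    else:
        delta = after - before
        # Assumption: any positive busy-time delta counts as proof.
        proof.update(proof_mode="sysfs_busy_delta", busy_delta_us=delta,
                     proof_ok=delta > 0)
    return proof
```

Passing a fake `reader` keeps the sketch testable on machines without an NPU, which fits the synthetic-fixture constraint of this gate.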
latency:
- `total_ms`: end-to-end harness timing.
- `service_ms`: service-reported processing time when available.
- `queue_ms`: optional queue time.
- `timeout`: boolean.
fallback:
- `occurred`: boolean.
- `kind`: null, `cpu`, `offline`, `health_only`, `service_unavailable`, `skipped_cold_load`, `private_root_blocked`, or `proof_unavailable`.
- `reason`: short reason code.
- `expected`: boolean. Expected fallbacks are counted but do not fail promotion unless their rate exceeds the threshold for that lane.
privacy:
- `payload_logged`: must default false.
- `redaction`: `none_needed`, `hash_only`, `paths_only`, `metadata_only`, or `blocked_private`.
- `retention`: `ephemeral`, `local_audit`, or `review_artifact`.
- `contains_private_payload`: must be false for committed fixtures.
Minimal JSON shape
```json
{
  "schema_version": "npu_advisory_decision_v1",
  "decision_id": "01J00000000000000000000000",
  "timestamp": "2026-06-06T00:00:00Z",
  "source": {
    "kind": "fixture",
    "fixture_id": "cron_duplicate_success_001",
    "fixture_set": "npu_advisory_eval_v1",
    "artifact_ref": null,
    "content_hash": "sha256:example",
    "privacy_class": "synthetic"
  },
  "service": {
    "name": "cron_n8n_advisory",
    "endpoint": "openvino-advisory-gateway/examples/cron-advisory-dry-run.sh",
    "mode": "dry_run",
    "model": "openvino-local"
  },
  "input_class": "cron_n8n_event",
  "recommendation": {
    "label": "suppress",
    "severity": "info",
    "reasons": ["duplicate_success", "no_action_required"],
    "evidence_refs": ["fixture:event_kind", "fixture:status"],
    "raw_output_ref": null
  },
  "confidence": {
    "score": 0.91,
    "bucket": "high",
    "bucket_rule": "v1_default",
    "calibrated": false
  },
  "authority_flags": {
    "can_route_atlas": false,
    "can_write_memory": false,
    "can_execute_tools": false,
    "can_restart_services": false,
    "can_send_outbound": false,
    "can_scan_private_roots": false,
    "can_mutate_vector_store": false,
    "can_post_advisory_event": false,
    "can_change_gateway_config": false,
    "requires_human_approval": true,
    "advisory_only": true
  },
  "allowed_actions": [
    "record_metric",
    "compare_with_expected_label",
    "include_in_digest"
  ],
  "actual_action": {
    "kind": "dry_run_reported",
    "performed": false,
    "performed_by": "harness",
    "side_effects": []
  },
  "human_or_atlas_decision": {
    "source": "fixture_expected",
    "label": "suppress",
    "severity": "info",
    "confidence": null,
    "decision_ref": "cron_duplicate_success_001",
    "timestamp": null
  },
  "outcome": {
    "comparison": "agree",
    "error_type": null,
    "human_review_required": false,
    "promotion_blocker": false
  },
  "npu_proof": {
    "proof_mode": "sysfs_busy_delta",
    "busy_delta_us": 1200,
    "service_reported_delta_us": 1180,
    "inference_ran": true,
    "proof_ok": true,
    "counter_path": "/sys/class/accel/accel0/device/npu_busy_time_us"
  },
  "latency": {
    "total_ms": 42.5,
    "service_ms": 39.1,
    "queue_ms": null,
    "timeout": false
  },
  "fallback": {
    "occurred": false,
    "kind": null,
    "reason": null,
    "expected": false
  },
  "privacy": {
    "payload_logged": false,
    "redaction": "metadata_only",
    "retention": "local_audit",
    "contains_private_payload": false
  },
  "notes": []
}
```
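Before comparing anything, a harness would likely want a cheap structural check on each record. The sketch below is one possible shape, not a full schema validator: `structural_problems` and `load_jsonl` are hypothetical helpers, and rejecting any record whose `privacy.payload_logged` is not explicitly false is an assumption that is stricter than "must default false".

```python
import json
from typing import Any, Dict, List

# Required top-level fields from the field table (notes is optional).
REQUIRED_FIELDS = (
    "schema_version", "decision_id", "timestamp", "source", "service",
    "input_class", "recommendation", "confidence", "authority_flags",
    "allowed_actions", "actual_action", "human_or_atlas_decision", "outcome",
    "npu_proof", "latency", "fallback", "privacy",
)


def structural_problems(record: Dict[str, Any]) -> List[str]:
    """Return short problem codes for one decoded decision record."""
    problems = [f"missing:{name}" for name in REQUIRED_FIELDS if name not in record]
    if record.get("schema_version") != "npu_advisory_decision_v1":
        problems.append("bad_schema_version")
    privacy = record.get("privacy")
    # Assumption: payload_logged must be an explicit false in this gate.
    if not isinstance(privacy, dict) or privacy.get("payload_logged") is not False:
        problems.append("payload_logged_not_false")
    return problems


def load_jsonl(text: str) -> List[Dict[str, Any]]:
    """Decode one JSON object per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]
```

The problem codes are deliberately short and free of payload content so they can be logged without violating the privacy defaults.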
Dry-run comparison strategy
Each fixture or shadow input should produce one npu_advisory_decision_v1
record. The harness compares recommendation to human_or_atlas_decision in
this order:
1. Use `fixture_expected` labels for synthetic/non-private regression fixtures.
2. Use explicit `human_label` for reviewed samples.
3. Use `atlas_shadow` only as a comparison signal, not ground truth, when a human label is unavailable.
4. Mark `missing_reference` rather than inventing a target decision.
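The priority order above could be implemented as a simple lookup over whatever reference decisions are available for an input. This is a sketch under assumptions: `pick_reference` is a hypothetical helper, and `is_ground_truth` is an illustrative bookkeeping key that is not part of the schema.

```python
from typing import Any, Dict, List

# Priority order from the comparison strategy.
REFERENCE_PRIORITY = ("fixture_expected", "human_label", "atlas_shadow")


def pick_reference(candidates: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Pick the highest-priority reference decision that carries a label."""
    for source in REFERENCE_PRIORITY:
        for candidate in candidates:
            if candidate.get("source") == source and candidate.get("label") is not None:
                result = dict(candidate)
                # atlas_shadow is only a comparison signal, never ground truth.
                result["is_ground_truth"] = source != "atlas_shadow"
                return result
    # Nothing usable: report a missing reference instead of inventing one.
    return {"source": "missing", "label": None, "is_ground_truth": False}
```

A record whose reference comes back with `source` set to `missing` would then be recorded with `outcome.comparison` set to `missing_reference`.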
Comparison categories:
- `agree`: normalized label and severity are compatible.
- `disagree`: label conflicts with the reference decision.
- `uncertain`: NPU bucket is `very_low`, `low`, or `unknown`, or the service returned a deliberate `needs_human`/`unknown` label.
- `false_positive`: NPU recommended escalation/action but reference says suppress/no-op.
- `false_negative`: NPU recommended suppress/no-op but reference says escalate or action-needed.
- `severity_overcall`/`severity_undercall`: label matches but severity differs by more than one level.
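The categories above can be read as a decision procedure. The sketch below is one possible interpretation, not the harness logic: `classify` is a hypothetical helper, the split of labels into "action" and "quiet" groups is an assumption, and so is reporting error categories as `error_type` alongside a `disagree` comparison.

```python
from typing import Optional, Tuple

# Severity order as listed for recommendation.severity.
SEVERITY_RANK = {name: i for i, name in enumerate(
    ["none", "info", "low", "medium", "high", "critical"])}
UNCERTAIN_BUCKETS = {"very_low", "low", "unknown"}
UNCERTAIN_LABELS = {"needs_human", "unknown"}
# Assumption: which labels count as escalation/action versus suppress/no-op.
ACTION_LABELS = {"escalate"}
QUIET_LABELS = {"suppress", "no_action"}


def classify(rec_label: str, rec_severity: Optional[str], bucket: str,
             ref_label: Optional[str], ref_severity: Optional[str]
             ) -> Tuple[str, Optional[str]]:
    """Return (comparison, error_type) for one record."""
    if ref_label is None:
        return ("missing_reference", None)
    if bucket in UNCERTAIN_BUCKETS or rec_label in UNCERTAIN_LABELS:
        return ("uncertain", None)
    if rec_label != ref_label:
        if rec_label in ACTION_LABELS and ref_label in QUIET_LABELS:
            return ("disagree", "false_positive")
        if rec_label in QUIET_LABELS and ref_label in ACTION_LABELS:
            return ("disagree", "false_negative")
        return ("disagree", None)
    # Labels match: compare severities only when both are known.
    if rec_severity in SEVERITY_RANK and ref_severity in SEVERITY_RANK:
        gap = SEVERITY_RANK[rec_severity] - SEVERITY_RANK[ref_severity]
        if gap > 1:
            return ("disagree", "severity_overcall")
        if gap < -1:
            return ("disagree", "severity_undercall")
    return ("agree", None)
```

For the example record (`suppress`/`info` in the `high` bucket against a `suppress`/`info` fixture label), this returns `("agree", None)`.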
The summary should be grouped by lane (input_class and service.name) and by
confidence bucket. Unknown metrics remain null/n/a; do not coerce missing data
to zero.
Metrics
Minimum per-run metrics:
- `total_records`
- `records_by_input_class`
- `records_by_service`
- `confidence_bucket_counts`
- `recommendation_counts`
- `authority_flag_violation_count`
- `privacy_violation_count`
- `actual_side_effect_count`
- `agree_count`, `disagree_count`, `uncertain_count`, `missing_reference_count`
- `false_positive_count`, `false_negative_count`
- `severity_overcall_count`, `severity_undercall_count`
- `fallback_count` and `fallback_counts_by_kind`
- `expected_fallback_count` vs `unexpected_fallback_count`
- `npu_proof_ok_count`, `npu_proof_missing_count`, `npu_proof_not_applicable_count`
- p50/p95 `latency.total_ms` by service and input class
- `timeout_count`
Recommended derived rates:
- `agreement_rate = agree / (agree + disagree + false_positive + false_negative + severity_overcall + severity_undercall)`
- `uncertain_rate = uncertain / total_records`
- `false_positive_rate = false_positive / comparable_records`
- `false_negative_rate = false_negative / comparable_records`
- `unsafe_authority_rate = authority_flag_violation_count / total_records`
- `privacy_violation_rate = privacy_violation_count / total_records`
- `unexpected_fallback_rate = unexpected_fallback_count / total_records`
- `proof_ok_rate = npu_proof_ok_count / proof_required_records`
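The derived rates can be computed directly from the per-run counts. The following is a minimal sketch: `derived_rates` and `safe_rate` are hypothetical helpers, the count categories are assumed to be mutually exclusive, and `comparable_records` is assumed to equal the denominator of `agreement_rate`, since the formulas do not define it separately.

```python
from typing import Dict, Optional


def safe_rate(numerator: int, denominator: int) -> Optional[float]:
    """Return None (not zero) when the rate is not measurable."""
    return None if denominator == 0 else numerator / denominator


def derived_rates(c: Dict[str, int]) -> Dict[str, Optional[float]]:
    """Compute the recommended rates from a dict of per-run counts."""
    # Assumption: comparable_records is the agreement_rate denominator.
    comparable = (c["agree_count"] + c["disagree_count"]
                  + c["false_positive_count"] + c["false_negative_count"]
                  + c["severity_overcall_count"] + c["severity_undercall_count"])
    total = c["total_records"]
    return {
        "agreement_rate": safe_rate(c["agree_count"], comparable),
        "uncertain_rate": safe_rate(c["uncertain_count"], total),
        "false_positive_rate": safe_rate(c["false_positive_count"], comparable),
        "false_negative_rate": safe_rate(c["false_negative_count"], comparable),
        "unsafe_authority_rate": safe_rate(c["authority_flag_violation_count"], total),
        "privacy_violation_rate": safe_rate(c["privacy_violation_count"], total),
        "unexpected_fallback_rate": safe_rate(c["unexpected_fallback_count"], total),
        "proof_ok_rate": safe_rate(c["npu_proof_ok_count"], c["proof_required_records"]),
    }
```

Returning `None` for an empty denominator keeps unmeasurable rates distinct from a genuine rate of zero.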
Acceptance thresholds before future promotion
These thresholds are for considering a later, separately approved promotion. They do not grant authority by themselves.
Global blockers for every lane:
- `authority_flag_violation_count == 0`.
- `actual_side_effect_count == 0` for dry-run harness execution.
- `privacy_violation_count == 0` and no committed private fixtures/secrets.
- No raw private payloads in logs, reports, artifacts, or test fixtures.
- No service bind, route, memory, tool, send, restart, or vector-store mutation introduced by the eval code.
Minimum data quality before promotion discussion:
- At least 30 comparable synthetic/non-private records per lane, or all available lane fixtures if the lane is explicitly scoped smaller.
- Every advisory lane has at least one normal case, one low-confidence case, one false-alarm/noise case, and one action-needed/escalation case.
- `missing_reference_count == 0` for promotion-candidate fixture sets.
- Confidence bucket distribution is reported and stable across at least three dry-run executions.
Suggested metric thresholds:
| Metric | Threshold for promotion discussion |
|---|---|
| Agreement rate | >= 0.95 overall and >= 0.90 per lane |
| False positive rate | <= 0.03 overall and no repeated high-severity false positives |
| False negative rate | <= 0.01 for action-needed/escalation cases |
| Uncertain rate | <= 0.15 overall, unless lane is intentionally conservative |
| Unexpected fallback rate | <= 0.02 and every fallback has a reason code |
| NPU proof OK rate | >= 0.98 for proof-required lanes |
| p95 latency | Within the lane-specific SLO documented by the implementation task |
| Authority/privacy violations | exactly 0 |
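The overall thresholds in the table could be checked mechanically once the rates are available. The sketch below covers only the overall numeric thresholds; `promotion_blockers` is a hypothetical helper, per-lane agreement, repeated high-severity false positives, and the lane-specific p95 SLO are left out, and treating an unmeasurable (null) rate as a blocker is an assumption.

```python
from typing import Dict, List, Optional

MIN_RATES = {"agreement_rate": 0.95, "proof_ok_rate": 0.98}
MAX_RATES = {
    "false_positive_rate": 0.03,
    "false_negative_rate": 0.01,
    "uncertain_rate": 0.15,
    "unexpected_fallback_rate": 0.02,
}
ZERO_COUNTS = ("authority_flag_violation_count", "privacy_violation_count")


def promotion_blockers(rates: Dict[str, Optional[float]],
                       counts: Dict[str, int]) -> List[str]:
    """List reasons a run is not yet eligible for promotion discussion."""
    blockers = []
    for name in ZERO_COUNTS:
        if counts.get(name) != 0:
            blockers.append(f"{name} must be exactly 0")
    for name, floor in MIN_RATES.items():
        value = rates.get(name)
        if value is None:
            # Assumption: a rate that cannot be measured blocks discussion.
            blockers.append(f"{name} not measurable")
        elif value < floor:
            blockers.append(f"{name} below {floor}")
    for name, ceiling in MAX_RATES.items():
        value = rates.get(name)
        if value is None:
            blockers.append(f"{name} not measurable")
        elif value > ceiling:
            blockers.append(f"{name} above {ceiling}")
    return blockers
```

An empty list means only that the run is a candidate for discussion; it grants no authority.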
Promotion remains lane-specific. A passing context-gate eval does not promote cron/n8n, voice/audio, batch triage, Kanban hygiene, or advisory gateway lanes. Each lane needs its own human-approved scope, rollback plan, and review.
Output formats
The dry-run harness should emit:
- JSONL decisions: one `npu_advisory_decision_v1` object per line.
- Compact JSON summary: aggregate counts/rates for dashboards and follow-up digest scripts.
- Compact Markdown/text summary: suitable for terminal, Telegram, or Discord.
The Markdown/text summary should include:
- run id, fixture set, generated-at timestamp;
- records by lane/service;
- agreement/uncertain/false-positive/false-negative counts;
- confidence bucket distribution;
- fallback counts;
- NPU proof counts;
- authority/privacy violation counts;
- promotion blockers and caveats.
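A text summary along these lines could be rendered from the compact JSON summary. This is an illustrative sketch: `render_summary` and `fmt` are hypothetical helpers, the summary key names are assumptions, and only a subset of the listed items is shown.

```python
from typing import Any, Dict


def fmt(value: Any) -> str:
    """Render unknown metrics as n/a instead of a fake zero."""
    if value is None:
        return "n/a"
    if isinstance(value, float):
        return f"{value:.3f}"
    return str(value)


# Hypothetical summary keys; a real harness may name them differently.
SUMMARY_KEYS = (
    "total_records", "agree_count", "uncertain_count",
    "false_positive_count", "false_negative_count",
    "confidence_bucket_counts", "fallback_count", "npu_proof_ok_count",
    "authority_flag_violation_count", "privacy_violation_count",
    "promotion_blockers",
)


def render_summary(summary: Dict[str, Any]) -> str:
    """Build a plain-text summary with one metric per line."""
    header = (f"run {fmt(summary.get('run_id'))} | "
              f"fixtures {fmt(summary.get('fixture_set'))} | "
              f"generated {fmt(summary.get('generated_at'))}")
    lines = [header]
    lines += [f"{key}: {fmt(summary.get(key))}" for key in SUMMARY_KEYS]
    return "\n".join(lines)
```

Plain `key: value` lines keep the output readable in a terminal as well as in chat clients that do not render Markdown tables.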
Fixture expectations
Use synthetic/non-private fixtures only. Required lanes:
- `context_gate`: retrieve/no-retrieve decisions with missing, conflicting, and sufficient context cases.
- `cron_n8n_event`: duplicate success, stale warning, urgent false alarm, and action-needed failure.
- `batch_doc_triage`: private-root blocked, approved synthetic sample, noisy OCR, and needs-human cases.
- `voice_audio`: bounded generated audio, low-confidence transcript, harmless background noise, and action-needed command-like utterance that must not execute.
- `kanban_hygiene`: no-op healthy card, stale/card-needs-review, false alarm, and action-needed label.
- `advisory_gateway_envelope`: valid classify/generate/triage envelope examples plus malformed/unsafe authority-request examples.
Any fixture that resembles private content should be replaced with a synthetic fixture or reduced to metadata/hash-only form before committing.
Review checklist
Before implementation or docs depending on this spec are accepted, verify:
- `schema_version` is present and all authority flags default closed.
- Dry-run execution produces no live side effects beyond local report/artifact writes.
- Unknown/missing metrics are represented as null/`n/a`, not fake zero.
- Raw payloads and private paths are not persisted by default.
- Summary metrics include confidence buckets, fallback counts, NPU proof, and authority/privacy violations.
- Promotion language says "candidate" or "discussion" only; no automatic live authority is granted by a passing eval.