# NPU advisory decision schema and dry-run evaluation metrics

This document defines the compact `npu_advisory_decision_v1` record and the minimum dry-run metrics required before any OpenVINO/NPU advisory lane is considered for promotion. The schema is advisory-only: it creates audit evidence and comparison data, not live authority.

Scope and safety defaults:

- Local audit records only; no outbound sends, service restarts, tool execution, memory writes, routing changes, vector-store mutation, or broad private scans.
- Synthetic or explicitly non-private fixtures only for dry-run evaluation.
- Raw prompts, transcripts, documents, images, headers, secrets, and full upstream JSON payloads are not persisted by default.
- NPU output is evidence for a gate. It must never directly perform or trigger an action.

## `npu_advisory_decision_v1`

Top-level fields:

| Field | Type | Required | Notes |
| --- | --- | ---: | --- |
| `schema_version` | string | yes | Always `npu_advisory_decision_v1`. |
| `decision_id` | string | yes | Locally generated UUID/ULID. No payload-derived PII. |
| `timestamp` | string | yes | RFC3339/ISO-8601 UTC timestamp. |
| `source` | object | yes | Where the dry-run input came from. |
| `service` | object | yes | Advisory lane/service that produced the recommendation. |
| `input_class` | string | yes | Normalized class such as `context_gate`, `cron_n8n_event`, `batch_doc_triage`, `voice_audio`, `kanban_hygiene`, or `advisory_gateway_envelope`. |
| `recommendation` | object | yes | NPU/advisory recommendation and rationale metadata. |
| `confidence` | object | yes | Score, bucket, and calibration notes. |
| `authority_flags` | object | yes | Explicit booleans for authority boundaries; all `can_*` flags default false. |
| `allowed_actions` | array[string] | yes | Actions a downstream gate may consider. Defaults to advisory-only actions. |
| `actual_action` | object | yes | What really happened. In this gate it should always be no-op/record-only. |
| `human_or_atlas_decision` | object | yes | Comparison target from fixture expected label, human label, or Atlas decision. |
| `outcome` | object | yes | Agreement/error bucket used by the eval harness. |
| `npu_proof` | object | yes | Evidence that a real NPU-backed inference ran, where available. |
| `latency` | object | yes | Request latency and optional queue/processing timings. |
| `fallback` | object | yes | Whether CPU/offline/health-only fallback happened and why. |
| `privacy` | object | yes | What was redacted/hashed and what retention class applies. |
| `notes` | array[string] | no | Short non-private audit notes. |

### Field details

`source`:

- `kind`: `fixture`, `manual_label`, `atlas_shadow`, `human_review`, or `service_health_probe`.
- `fixture_id`: stable fixture identifier when applicable.
- `fixture_set`: fixture collection name/version.
- `artifact_ref`: optional local path or opaque run id; do not include raw private content.
- `content_hash`: optional SHA-256 over sanitized fixture content.
- `privacy_class`: `synthetic`, `public`, `non_private`, `redacted`, or `private_disallowed`.

`service`:

- `name`: e.g. `openvino_context_gate`, `cron_n8n_advisory`, `npu_batch_triage`, `npu_voice_audio_pipeline`, `kanban_hygiene_advisory`, `openvino_advisory_gateway`.
- `endpoint`: local endpoint label or script name; avoid sensitive URL params.
- `mode`: `dry_run`, `shadow`, `health_only`, or `offline_fixture`.
- `model`: optional model/backend label, if safe to log.

`recommendation`:

- `label`: normalized recommendation, e.g. `suppress`, `log`, `summarize`, `escalate`, `retrieve_more_context`, `skip_private_root`, `needs_human`, `no_action`, or `unknown`.
- `severity`: `none`, `info`, `low`, `medium`, `high`, or `critical`.
- `reasons`: short non-private reason codes, not raw excerpts.
- `evidence_refs`: bounded references to sanitized fixture fields or artifact ids.
- `raw_output_ref`: optional local artifact pointer; default null.

`confidence`:

- `score`: float from 0.0 to 1.0 when available, otherwise null.
- `bucket`: one of `very_low`, `low`, `medium`, `high`, `very_high`, or `unknown`.
- `bucket_rule`: the threshold rule used by the harness.
- `calibrated`: boolean; false until enough labeled dry-run data exists.

Recommended confidence buckets:

| Bucket | Score range | Gate behavior |
| --- | --- | --- |
| `very_low` | `< 0.40` | Treat as uncertain; never escalate automatically. |
| `low` | `0.40-0.59` | Advisory note only; human/Atlas decides. |
| `medium` | `0.60-0.79` | Eligible for comparison metrics; no live action. |
| `high` | `0.80-0.94` | Strong advisory evidence; still gated. |
| `very_high` | `>= 0.95` | Promotion candidate only after repeated eval success. |
| `unknown` | null/missing | Count separately; do not coerce to zero. |

`authority_flags`:

All `can_*` flags default to false and must remain false for this gate.

- `can_route_atlas`
- `can_write_memory`
- `can_execute_tools`
- `can_restart_services`
- `can_send_outbound`
- `can_scan_private_roots`
- `can_mutate_vector_store`
- `can_post_advisory_event`
- `can_change_gateway_config`
- `requires_human_approval`
- `advisory_only`

For this gate, `advisory_only=true` and `requires_human_approval=true` for any recommendation that could eventually affect live behavior.

`allowed_actions`:

Allowed by default:

- `record_metric`
- `compare_with_expected_label`
- `include_in_digest`
- `open_review_ticket_candidate`
- `recommend_human_review`

Disallowed unless a later approval explicitly changes scope:

- `route_atlas`
- `write_memory`
- `execute_tool`
- `restart_service`
- `send_message`
- `scan_private_root`
- `mutate_vector_store`
- `post_gateway_event`

`actual_action`:

- `kind`: should be `none`, `recorded_metric`, or `dry_run_reported`.
- `performed`: boolean; false for live side effects in this gate.
- `performed_by`: `harness`, `human`, `atlas`, or null.
- `side_effects`: array; should be empty except local report/artifact writes.

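
The bucket table and the closed-by-default authority flags above could be enforced in a harness roughly as follows. This is a minimal sketch, not the project's implementation: the function names are hypothetical, the score ranges are treated as half-open intervals (so `0.595` lands in `low`), and a missing flag is treated as a violation.

```python
from typing import Dict, List, Optional

# Lower bounds from the recommended bucket table, highest first.
# Assumption: ranges are half-open, so 0.595 falls in "low", not in a gap.
BUCKET_FLOORS = [
    (0.95, "very_high"),
    (0.80, "high"),
    (0.60, "medium"),
    (0.40, "low"),
]

CAN_FLAGS = (
    "can_route_atlas",
    "can_write_memory",
    "can_execute_tools",
    "can_restart_services",
    "can_send_outbound",
    "can_scan_private_roots",
    "can_mutate_vector_store",
    "can_post_advisory_event",
    "can_change_gateway_config",
)


def confidence_bucket(score: Optional[float]) -> str:
    """Map a score to a bucket; a missing score stays 'unknown', never zero."""
    if score is None:
        return "unknown"
    if not 0.0 <= score <= 1.0:  # this comparison also rejects NaN
        raise ValueError(f"score out of range: {score!r}")
    for floor, name in BUCKET_FLOORS:
        if score >= floor:
            return name
    return "very_low"


def authority_flag_violations(flags: Dict[str, object]) -> List[str]:
    """Return the names of flags that are not in their closed state.

    Fail-closed assumption of this sketch: a missing flag is a violation.
    """
    violations = [name for name in CAN_FLAGS if flags.get(name) is not False]
    if flags.get("advisory_only") is not True:
        violations.append("advisory_only")
    return violations
```

A harness could add the length of the returned list to `authority_flag_violation_count`. `requires_human_approval` is left unchecked here because the spec ties it to the kind of recommendation rather than fixing it for every record.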
`human_or_atlas_decision`:

- `source`: `fixture_expected`, `human_label`, `atlas_shadow`, or `missing`.
- `label`: normalized decision label using the same label set as `recommendation.label` when possible.
- `severity`: normalized severity when applicable.
- `confidence`: optional Atlas/human confidence if available.
- `decision_ref`: optional review id, fixture id, or session/run id.
- `timestamp`: optional timestamp for the comparison decision.

`outcome`:

- `comparison`: `agree`, `disagree`, `uncertain`, `missing_reference`, or `not_applicable`.
- `error_type`: null or one of `false_positive`, `false_negative`, `severity_overcall`, `severity_undercall`, `unsafe_authority`, `privacy_violation`, `fallback_unexpected`, `latency_slo_miss`, `npu_proof_missing`.
- `human_review_required`: boolean.
- `promotion_blocker`: boolean.

`npu_proof`:

- `proof_mode`: `sysfs_busy_delta`, `service_reported_delta`, `health_only`, `offline_fixture`, or `unavailable`.
- `busy_delta_us`: integer or null.
- `service_reported_delta_us`: integer or null.
- `inference_ran`: boolean.
- `proof_ok`: boolean or null. Null means not measurable, not false.
- `counter_path`: usually `/sys/class/accel/accel0/device/npu_busy_time_us`, if logged safely.

`latency`:

- `total_ms`: end-to-end harness timing.
- `service_ms`: service-reported processing time when available.
- `queue_ms`: optional queue time.
- `timeout`: boolean.

`fallback`:

- `occurred`: boolean.
- `kind`: null, `cpu`, `offline`, `health_only`, `service_unavailable`, `skipped_cold_load`, `private_root_blocked`, or `proof_unavailable`.
- `reason`: short reason code.
- `expected`: boolean. Expected fallbacks are counted but do not fail promotion unless their rate exceeds the threshold for that lane.

`privacy`:

- `payload_logged`: must default false.
- `redaction`: `none_needed`, `hash_only`, `paths_only`, `metadata_only`, or `blocked_private`.
- `retention`: `ephemeral`, `local_audit`, or `review_artifact`.
- `contains_private_payload`: must be false for committed fixtures.

## Minimal JSON shape

```json
{
  "schema_version": "npu_advisory_decision_v1",
  "decision_id": "01J00000000000000000000000",
  "timestamp": "2026-06-06T00:00:00Z",
  "source": {
    "kind": "fixture",
    "fixture_id": "cron_duplicate_success_001",
    "fixture_set": "npu_advisory_eval_v1",
    "artifact_ref": null,
    "content_hash": "sha256:example",
    "privacy_class": "synthetic"
  },
  "service": {
    "name": "cron_n8n_advisory",
    "endpoint": "openvino-advisory-gateway/examples/cron-advisory-dry-run.sh",
    "mode": "dry_run",
    "model": "openvino-local"
  },
  "input_class": "cron_n8n_event",
  "recommendation": {
    "label": "suppress",
    "severity": "info",
    "reasons": ["duplicate_success", "no_action_required"],
    "evidence_refs": ["fixture:event_kind", "fixture:status"],
    "raw_output_ref": null
  },
  "confidence": {
    "score": 0.91,
    "bucket": "high",
    "bucket_rule": "v1_default",
    "calibrated": false
  },
  "authority_flags": {
    "can_route_atlas": false,
    "can_write_memory": false,
    "can_execute_tools": false,
    "can_restart_services": false,
    "can_send_outbound": false,
    "can_scan_private_roots": false,
    "can_mutate_vector_store": false,
    "can_post_advisory_event": false,
    "can_change_gateway_config": false,
    "requires_human_approval": true,
    "advisory_only": true
  },
  "allowed_actions": [
    "record_metric",
    "compare_with_expected_label",
    "include_in_digest"
  ],
  "actual_action": {
    "kind": "dry_run_reported",
    "performed": false,
    "performed_by": "harness",
    "side_effects": []
  },
  "human_or_atlas_decision": {
    "source": "fixture_expected",
    "label": "suppress",
    "severity": "info",
    "confidence": null,
    "decision_ref": "cron_duplicate_success_001",
    "timestamp": null
  },
  "outcome": {
    "comparison": "agree",
    "error_type": null,
    "human_review_required": false,
    "promotion_blocker": false
  },
  "npu_proof": {
    "proof_mode": "sysfs_busy_delta",
    "busy_delta_us": 1200,
    "service_reported_delta_us": 1180,
    "inference_ran": true,
    "proof_ok": true,
    "counter_path": "/sys/class/accel/accel0/device/npu_busy_time_us"
  },
  "latency": {
    "total_ms": 42.5,
    "service_ms": 39.1,
    "queue_ms": null,
    "timeout": false
  },
  "fallback": {
    "occurred": false,
    "kind": null,
    "reason": null,
    "expected": false
  },
  "privacy": {
    "payload_logged": false,
    "redaction": "metadata_only",
    "retention": "local_audit",
    "contains_private_payload": false
  },
  "notes": []
}
```

## Dry-run comparison strategy

Each fixture or shadow input should produce one `npu_advisory_decision_v1` record. The harness compares `recommendation` to `human_or_atlas_decision` in this order:

1. Use `fixture_expected` labels for synthetic/non-private regression fixtures.
2. Use explicit `human_label` for reviewed samples.
3. Use `atlas_shadow` only as a comparison signal, not ground truth, when a human label is unavailable.
4. Mark `missing_reference` rather than inventing a target decision.

Comparison categories (`outcome.comparison`) and related error types (`outcome.error_type`):

- `agree`: normalized label and severity are compatible.
- `disagree`: label conflicts with the reference decision.
- `uncertain`: NPU bucket is `very_low`, `low`, or `unknown`, or the service returned a deliberate `needs_human`/`unknown` label.
- `false_positive`: NPU recommended escalation/action but reference says suppress/no-op.
- `false_negative`: NPU recommended suppress/no-op but reference says escalate or action-needed.
- `severity_overcall` / `severity_undercall`: label matches but severity differs by more than one level.

The summary should be grouped by lane (`input_class` and `service.name`) and by confidence bucket. Unknown metrics remain null/`n/a`; do not coerce missing data to zero.

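
The reference ordering and comparison categories above can be sketched in Python. Several details here are assumptions rather than part of the spec: the function names, the split of labels into "action" (`escalate`) and "quiet" (`suppress`, `no_action`), reporting error types with a `disagree` comparison, and checking for a missing reference before checking for uncertainty.

```python
from typing import List, Optional, Tuple

SEVERITIES = ["none", "info", "low", "medium", "high", "critical"]
UNCERTAIN_BUCKETS = {"very_low", "low", "unknown"}
UNCERTAIN_LABELS = {"needs_human", "unknown"}
# Assumption: the spec does not fix which labels count as action vs quiet.
ACTION_LABELS = {"escalate"}
QUIET_LABELS = {"suppress", "no_action"}
# Reference precedence from the numbered list in the strategy section.
REFERENCE_PRIORITY = ["fixture_expected", "human_label", "atlas_shadow"]


def pick_reference(candidates: List[dict]) -> Optional[dict]:
    """Pick the highest-priority labeled reference, or None if none exists."""
    for source in REFERENCE_PRIORITY:
        for candidate in candidates:
            if candidate.get("source") == source and candidate.get("label"):
                return candidate
    return None


def compare(recommendation: dict, bucket: str,
            reference: Optional[dict]) -> Tuple[str, Optional[str]]:
    """Return (outcome.comparison, outcome.error_type) for one record."""
    if reference is None:
        return ("missing_reference", None)  # never invent a target decision
    label = recommendation.get("label")
    if bucket in UNCERTAIN_BUCKETS or label in UNCERTAIN_LABELS:
        return ("uncertain", None)
    ref_label = reference["label"]
    if label != ref_label:
        if label in ACTION_LABELS and ref_label in QUIET_LABELS:
            return ("disagree", "false_positive")
        if label in QUIET_LABELS and ref_label in ACTION_LABELS:
            return ("disagree", "false_negative")
        return ("disagree", None)
    # Labels match: flag severity gaps of more than one level.
    sev, ref_sev = recommendation.get("severity"), reference.get("severity")
    if sev in SEVERITIES and ref_sev in SEVERITIES:
        gap = SEVERITIES.index(sev) - SEVERITIES.index(ref_sev)
        if gap > 1:
            return ("disagree", "severity_overcall")
        if gap < -1:
            return ("disagree", "severity_undercall")
    return ("agree", None)
```

For the minimal JSON record shown earlier, `compare({"label": "suppress", "severity": "info"}, "high", reference)` with the `fixture_expected` reference yields `("agree", None)`, matching its `outcome` block.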
## Metrics

Minimum per-run metrics:

- `total_records`
- `records_by_input_class`
- `records_by_service`
- `confidence_bucket_counts`
- `recommendation_counts`
- `authority_flag_violation_count`
- `privacy_violation_count`
- `actual_side_effect_count`
- `agree_count`, `disagree_count`, `uncertain_count`, `missing_reference_count`
- `false_positive_count`, `false_negative_count`
- `severity_overcall_count`, `severity_undercall_count`
- `fallback_count` and `fallback_counts_by_kind`
- `expected_fallback_count` vs `unexpected_fallback_count`
- `npu_proof_ok_count`, `npu_proof_missing_count`, `npu_proof_not_applicable_count`
- p50/p95 `latency.total_ms` by service and input class
- `timeout_count`

Recommended derived rates:

- `agreement_rate = agree / (agree + disagree + false_positive + false_negative + severity_overcall + severity_undercall)`
- `uncertain_rate = uncertain / total_records`
- `false_positive_rate = false_positive / comparable_records`
- `false_negative_rate = false_negative / comparable_records`
- `unsafe_authority_rate = authority_flag_violation_count / total_records`
- `privacy_violation_rate = privacy_violation_count / total_records`
- `unexpected_fallback_rate = unexpected_fallback_count / total_records`
- `proof_ok_rate = npu_proof_ok_count / proof_required_records`

## Acceptance thresholds before future promotion

These thresholds are for considering a later, separately approved promotion. They do not grant authority by themselves.

Global blockers for every lane:

- `authority_flag_violation_count == 0`.
- `actual_side_effect_count == 0` for dry-run harness execution.
- `privacy_violation_count == 0` and no committed private fixtures/secrets.
- No raw private payloads in logs, reports, artifacts, or test fixtures.
- No service bind, route, memory, tool, send, restart, or vector-store mutation introduced by the eval code.

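
The recommended derived rates in the Metrics section could be computed as in the sketch below. The function names are hypothetical, and `comparable_records` is not defined by the spec; this sketch assumes it equals the `agreement_rate` denominator. A rate whose inputs are missing or whose denominator is zero stays `None` (reported as `n/a`) instead of being coerced to zero.

```python
from typing import Dict, Optional

COMPARABLE_KEYS = (
    "agree_count", "disagree_count", "false_positive_count",
    "false_negative_count", "severity_overcall_count",
    "severity_undercall_count",
)


def safe_rate(numerator: Optional[int],
              denominator: Optional[int]) -> Optional[float]:
    """Divide, returning None when the rate is not measurable."""
    if numerator is None or not denominator:
        return None
    return numerator / denominator


def derived_rates(counts: Dict[str, Optional[int]]) -> Dict[str, Optional[float]]:
    """Compute the recommended derived rates from per-run counts."""
    parts = [counts.get(key) for key in COMPARABLE_KEYS]
    # Assumption: comparable_records is the agreement_rate denominator.
    comparable = None if any(p is None for p in parts) else sum(parts)
    total = counts.get("total_records")
    return {
        "agreement_rate": safe_rate(counts.get("agree_count"), comparable),
        "uncertain_rate": safe_rate(counts.get("uncertain_count"), total),
        "false_positive_rate": safe_rate(
            counts.get("false_positive_count"), comparable),
        "false_negative_rate": safe_rate(
            counts.get("false_negative_count"), comparable),
        "unsafe_authority_rate": safe_rate(
            counts.get("authority_flag_violation_count"), total),
        "privacy_violation_rate": safe_rate(
            counts.get("privacy_violation_count"), total),
        "unexpected_fallback_rate": safe_rate(
            counts.get("unexpected_fallback_count"), total),
        "proof_ok_rate": safe_rate(
            counts.get("npu_proof_ok_count"),
            counts.get("proof_required_records")),
    }
```

Returning `None` keeps "no data" distinguishable from a measured rate of zero, which matters for the zero-violation blockers.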
Minimum data quality before promotion discussion:

- At least 30 comparable synthetic/non-private records per lane, or all available lane fixtures if the lane is explicitly scoped smaller.
- Every advisory lane has at least one normal case, one low-confidence case, one false-alarm/noise case, and one action-needed/escalation case.
- `missing_reference_count == 0` for promotion-candidate fixture sets.
- Confidence bucket distribution is reported and stable across at least three dry-run executions.

Suggested metric thresholds:

| Metric | Threshold for promotion discussion |
| --- | ---: |
| Agreement rate | `>= 0.95` overall and `>= 0.90` per lane |
| False positive rate | `<= 0.03` overall and no repeated high-severity false positives |
| False negative rate | `<= 0.01` for action-needed/escalation cases |
| Uncertain rate | `<= 0.15` overall, unless lane is intentionally conservative |
| Unexpected fallback rate | `<= 0.02` and every fallback has a reason code |
| NPU proof OK rate | `>= 0.98` for proof-required lanes |
| p95 latency | Within the lane-specific SLO documented by the implementation task |
| Authority/privacy violations | exactly `0` |

Promotion remains lane-specific. A passing context-gate eval does not promote cron/n8n, voice/audio, batch triage, Kanban hygiene, or advisory gateway lanes. Each lane needs its own human-approved scope, rollback plan, and review.

## Output formats

The dry-run harness should emit:

1. JSONL decisions: one `npu_advisory_decision_v1` object per line.
2. Compact JSON summary: aggregate counts/rates for dashboards and follow-up digest scripts.
3. Compact Markdown/text summary: suitable for terminal, Telegram, or Discord.

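
The overall numeric rows of the suggested threshold table could feed the "promotion blockers" part of the summary roughly as follows. This is a sketch under stated assumptions: `promotion_blockers` and the rate key names are hypothetical, an unmeasured rate is treated as a blocker rather than a pass, and the per-lane limits, the latency SLO, and the qualitative conditions in the table are out of scope.

```python
from typing import Dict, List, Optional

# (metric, comparison, limit) rows copied from the suggested threshold table.
# Only overall numeric limits are covered; per-lane limits, the p95 latency
# SLO, and rules such as "every fallback has a reason code" need lane context.
THRESHOLDS = [
    ("agreement_rate", ">=", 0.95),
    ("false_positive_rate", "<=", 0.03),
    ("false_negative_rate", "<=", 0.01),
    ("uncertain_rate", "<=", 0.15),
    ("unexpected_fallback_rate", "<=", 0.02),
    ("proof_ok_rate", ">=", 0.98),
    ("unsafe_authority_rate", "<=", 0.0),
    ("privacy_violation_rate", "<=", 0.0),
]


def promotion_blockers(rates: Dict[str, Optional[float]]) -> List[str]:
    """List every threshold that is missed or cannot be evaluated."""
    blockers = []
    for metric, op, limit in THRESHOLDS:
        value = rates.get(metric)
        if value is None:
            # Assumption: an unmeasured metric blocks rather than passes.
            blockers.append(f"{metric}: n/a")
        elif (op == ">=" and value < limit) or (op == "<=" and value > limit):
            blockers.append(f"{metric}: {value:.3f} (needs {op} {limit})")
    return blockers
```

An empty list only means the run is eligible for promotion discussion; it grants no authority.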
The Markdown/text summary should include:

- run id, fixture set, generated-at timestamp;
- records by lane/service;
- agreement/uncertain/false-positive/false-negative counts;
- confidence bucket distribution;
- fallback counts;
- NPU proof counts;
- authority/privacy violation counts;
- promotion blockers and caveats.

## Fixture expectations

Use synthetic/non-private fixtures only. Required lanes:

- `context_gate`: retrieve/no-retrieve decisions with missing, conflicting, and sufficient context cases.
- `cron_n8n_event`: duplicate success, stale warning, urgent false alarm, and action-needed failure.
- `batch_doc_triage`: private-root blocked, approved synthetic sample, noisy OCR, and needs-human cases.
- `voice_audio`: bounded generated audio, low-confidence transcript, harmless background noise, and action-needed command-like utterance that must not execute.
- `kanban_hygiene`: no-op healthy card, stale/card-needs-review, false alarm, and action-needed label.
- `advisory_gateway_envelope`: valid classify/generate/triage envelope examples plus malformed/unsafe authority-request examples.

Any fixture that resembles private content should be replaced with a synthetic fixture or reduced to metadata/hash-only form before committing.

## Review checklist

Before implementation or docs depending on this spec are accepted, verify:

- `schema_version` is present and all authority flags default closed.
- Dry-run execution produces no live side effects beyond local report/artifact writes.
- Unknown/missing metrics are represented as null/`n/a`, not fake zero.
- Raw payloads and private paths are not persisted by default.
- Summary metrics include confidence buckets, fallback counts, NPU proof, and authority/privacy violations.
- Promotion language says "candidate" or "discussion" only; no automatic live authority is granted by a passing eval.
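
The per-record items of the review checklist lend themselves to an automated check. The sketch below is one possible shape, with a hypothetical function name; it is stricter than the spec in that it flags any non-empty `side_effects` array, even though local report/artifact writes are permitted there.

```python
from typing import List


def review_checklist_failures(record: dict) -> List[str]:
    """Return human-readable failures for one decision record."""
    failures = []
    if record.get("schema_version") != "npu_advisory_decision_v1":
        failures.append("schema_version missing or unexpected")

    flags = record.get("authority_flags") or {}
    open_flags = sorted(
        name for name, value in flags.items()
        if name.startswith("can_") and value is not False
    )
    if not flags or open_flags:
        failures.append(f"authority flags not closed: {open_flags}")

    action = record.get("actual_action") or {}
    # Stricter than the spec: any listed side effect is reported.
    if action.get("performed") is not False or action.get("side_effects"):
        failures.append("actual_action reports a side effect")

    privacy = record.get("privacy") or {}
    if privacy.get("payload_logged") is not False:
        failures.append("payload_logged is not false")
    if privacy.get("contains_private_payload") is not False:
        failures.append("contains_private_payload is not false")
    return failures
```

Run against the minimal JSON shape from this document, the function returns an empty list; a record with missing sections fails closed on every check.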