dae2a57124
Add npu_advisory_decision_v1 schema, synthetic fixture set, comparison harness, docs, and focused tests for advisory-only NPU evaluation.
457 lines
17 KiB
Markdown
# NPU advisory decision schema and dry-run evaluation metrics

This document defines the compact `npu_advisory_decision_v1` record and the
minimum dry-run metrics required before any OpenVINO/NPU advisory lane is
considered for promotion. The schema is advisory-only: it creates audit evidence
and comparison data, not live authority.

Scope and safety defaults:

- Local audit records only; no outbound sends, service restarts, tool execution,
  memory writes, routing changes, vector-store mutation, or broad private scans.
- Synthetic or explicitly non-private fixtures only for dry-run evaluation.
- Raw prompts, transcripts, documents, images, headers, secrets, and full upstream
  JSON payloads are not persisted by default.
- NPU output is evidence for a gate. It must never directly perform or trigger
  an action.

## `npu_advisory_decision_v1`

Top-level fields:

| Field | Type | Required | Notes |
| --- | --- | ---: | --- |
| `schema_version` | string | yes | Always `npu_advisory_decision_v1`. |
| `decision_id` | string | yes | Locally generated UUID/ULID. No payload-derived PII. |
| `timestamp` | string | yes | RFC3339/ISO-8601 UTC timestamp. |
| `source` | object | yes | Where the dry-run input came from. |
| `service` | object | yes | Advisory lane/service that produced the recommendation. |
| `input_class` | string | yes | Normalized class such as `context_gate`, `cron_n8n_event`, `batch_doc_triage`, `voice_audio`, `kanban_hygiene`, or `advisory_gateway_envelope`. |
| `recommendation` | object | yes | NPU/advisory recommendation and rationale metadata. |
| `confidence` | object | yes | Score, bucket, and calibration notes. |
| `authority_flags` | object | yes | Explicit booleans for authority boundaries; all `can_*` flags default false. |
| `allowed_actions` | array[string] | yes | Actions a downstream gate may consider. Defaults to advisory-only actions. |
| `actual_action` | object | yes | What really happened. In this gate it should always be no-op/record-only. |
| `human_or_atlas_decision` | object | yes | Comparison target from fixture expected label, human label, or Atlas decision. |
| `outcome` | object | yes | Agreement/error bucket used by the eval harness. |
| `npu_proof` | object | yes | Evidence that a real NPU-backed inference ran, where available. |
| `latency` | object | yes | Request latency and optional queue/processing timings. |
| `fallback` | object | yes | Whether CPU/offline/health-only fallback happened and why. |
| `privacy` | object | yes | What was redacted/hashed and what retention class applies. |
| `notes` | array[string] | no | Short non-private audit notes. |

### Field details

`source`:

- `kind`: `fixture`, `manual_label`, `atlas_shadow`, `human_review`, or
  `service_health_probe`.
- `fixture_id`: stable fixture identifier when applicable.
- `fixture_set`: fixture collection name/version.
- `artifact_ref`: optional local path or opaque run id; do not include raw
  private content.
- `content_hash`: optional SHA-256 over sanitized fixture content.
- `privacy_class`: `synthetic`, `public`, `non_private`, `redacted`, or
  `private_disallowed`.

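As an illustration of `content_hash`, the sketch below hashes a canonical JSON encoding of an already sanitized fixture. The canonical form (sorted keys, compact separators), the `sha256:` prefix taken from the sample record, and the function name `content_hash` are assumptions for this example; the spec does not prescribe an encoding.

```python
import hashlib
import json
from typing import Any, Dict


def content_hash(sanitized_fixture: Dict[str, Any]) -> str:
    """Return 'sha256:<hex>' over a canonical JSON encoding of the fixture.

    Assumption: the caller has already removed or redacted private content,
    and the fixture is JSON-serializable.
    """
    # Sorted keys and compact separators make the digest independent of
    # dictionary insertion order and whitespace.
    canonical = json.dumps(sanitized_fixture, sort_keys=True, separators=(",", ":"))
    return "sha256:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Hashing a canonical form keeps the digest stable across runs, so the same fixture can be matched between dry-run executions without storing its content.
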
`service`:

- `name`: e.g. `openvino_context_gate`, `cron_n8n_advisory`,
  `npu_batch_triage`, `npu_voice_audio_pipeline`, `kanban_hygiene_advisory`,
  `openvino_advisory_gateway`.
- `endpoint`: local endpoint label or script name; avoid sensitive URL params.
- `mode`: `dry_run`, `shadow`, `health_only`, or `offline_fixture`.
- `model`: optional model/backend label, if safe to log.

`recommendation`:

- `label`: normalized recommendation, e.g. `suppress`, `log`, `summarize`,
  `escalate`, `retrieve_more_context`, `skip_private_root`, `needs_human`,
  `no_action`, or `unknown`.
- `severity`: `none`, `info`, `low`, `medium`, `high`, or `critical`.
- `reasons`: short non-private reason codes, not raw excerpts.
- `evidence_refs`: bounded references to sanitized fixture fields or artifact ids.
- `raw_output_ref`: optional local artifact pointer; default null.

`confidence`:

- `score`: float from 0.0 to 1.0 when available, otherwise null.
- `bucket`: one of `very_low`, `low`, `medium`, `high`, `very_high`, or
  `unknown`.
- `bucket_rule`: the threshold rule used by the harness.
- `calibrated`: boolean; false until enough labeled dry-run data exists.

Recommended confidence buckets:

| Bucket | Score range | Gate behavior |
| --- | --- | --- |
| `very_low` | `< 0.40` | Treat as uncertain; never escalate automatically. |
| `low` | `0.40-0.59` | Advisory note only; human/Atlas decides. |
| `medium` | `0.60-0.79` | Eligible for comparison metrics; no live action. |
| `high` | `0.80-0.94` | Strong advisory evidence; still gated. |
| `very_high` | `>= 0.95` | Promotion candidate only after repeated eval success. |
| `unknown` | null/missing | Count separately; do not coerce to zero. |

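The bucket table can be expressed as a small lookup. This is a minimal sketch rather than the harness implementation: the function name `bucket_for_score` is hypothetical, and the ranges are read as half-open intervals (for example `0.40 <= score < 0.60` for `low`), which is an assumption about how scores such as 0.595 should be handled.

```python
from typing import Optional


def bucket_for_score(score: Optional[float]) -> str:
    """Map a confidence score to a bucket name from the table above."""
    if score is None:
        # Missing scores are counted separately and never coerced to zero.
        return "unknown"
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score out of range: {score!r}")
    # Assumption: ranges are half-open, so each upper bound belongs to the
    # next bucket up.
    if score < 0.40:
        return "very_low"
    if score < 0.60:
        return "low"
    if score < 0.80:
        return "medium"
    if score < 0.95:
        return "high"
    return "very_high"
```

With these boundaries, the sample score of `0.91` used later in this document lands in `high`.
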
`authority_flags`:

All `can_*` capability flags default to false and must remain false for this gate.

- `can_route_atlas`
- `can_write_memory`
- `can_execute_tools`
- `can_restart_services`
- `can_send_outbound`
- `can_scan_private_roots`
- `can_mutate_vector_store`
- `can_post_advisory_event`
- `can_change_gateway_config`
- `requires_human_approval`
- `advisory_only`

For this gate, set `advisory_only=true` and `requires_human_approval=true` on any
recommendation that could eventually affect live behavior.

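A harness could count authority violations with a check along these lines. It is a sketch under stated assumptions: the helper name `authority_violations` is hypothetical, a missing flag is treated as a violation (fail closed), and `requires_human_approval` and `advisory_only` are required to be true on every record, which is stricter than the text above.

```python
from typing import Any, Dict, List

# Capability flags that must stay false for this gate.
CAPABILITY_FLAGS = (
    "can_route_atlas",
    "can_write_memory",
    "can_execute_tools",
    "can_restart_services",
    "can_send_outbound",
    "can_scan_private_roots",
    "can_mutate_vector_store",
    "can_post_advisory_event",
    "can_change_gateway_config",
)

# Guard flags that this sketch requires to be true (assumption: on all records).
GUARD_FLAGS = ("requires_human_approval", "advisory_only")


def authority_violations(flags: Dict[str, Any]) -> List[str]:
    """Return the names of flags that break the advisory-only boundary."""
    violations = []
    for name in CAPABILITY_FLAGS:
        # Anything other than an explicit False (including a missing key)
        # counts as a violation.
        if flags.get(name) is not False:
            violations.append(name)
    for name in GUARD_FLAGS:
        if flags.get(name) is not True:
            violations.append(name)
    return violations
```

A record contributes to `authority_flag_violation_count` whenever the returned list is non-empty.
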
`allowed_actions`:

Allowed by default:

- `record_metric`
- `compare_with_expected_label`
- `include_in_digest`
- `open_review_ticket_candidate`
- `recommend_human_review`

Disallowed unless a later approval explicitly changes scope:

- `route_atlas`
- `write_memory`
- `execute_tool`
- `restart_service`
- `send_message`
- `scan_private_root`
- `mutate_vector_store`
- `post_gateway_event`

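One way to enforce the two lists is an allow-list check, sketched below. The helper name `out_of_scope_actions` is hypothetical, and treating every action outside the default advisory set as out of scope (not only the explicitly disallowed ones) is an assumption chosen so that unknown action names fail closed.

```python
from typing import Iterable, List

# Actions allowed by default, copied from the list above.
ADVISORY_ACTIONS = frozenset({
    "record_metric",
    "compare_with_expected_label",
    "include_in_digest",
    "open_review_ticket_candidate",
    "recommend_human_review",
})


def out_of_scope_actions(allowed_actions: Iterable[str]) -> List[str]:
    """Return, sorted, any requested action that is not advisory-only."""
    return sorted(set(allowed_actions) - ADVISORY_ACTIONS)
```

For example, a record listing `send_message` in `allowed_actions` would be reported here even though its `authority_flags` look clean.
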
`actual_action`:

- `kind`: should be `none`, `recorded_metric`, or `dry_run_reported`.
- `performed`: boolean; whether a live side effect was performed. Must be false
  in this gate.
- `performed_by`: `harness`, `human`, `atlas`, or null.
- `side_effects`: array; should be empty except local report/artifact writes.

`human_or_atlas_decision`:

- `source`: `fixture_expected`, `human_label`, `atlas_shadow`, or `missing`.
- `label`: normalized decision label using the same label set as
  `recommendation.label` when possible.
- `severity`: normalized severity when applicable.
- `confidence`: optional Atlas/human confidence if available.
- `decision_ref`: optional review id, fixture id, or session/run id.
- `timestamp`: optional timestamp for the comparison decision.

`outcome`:

- `comparison`: `agree`, `disagree`, `uncertain`, `missing_reference`, or
  `not_applicable`.
- `error_type`: null or one of `false_positive`, `false_negative`,
  `severity_overcall`, `severity_undercall`, `unsafe_authority`,
  `privacy_violation`, `fallback_unexpected`, `latency_slo_miss`,
  `npu_proof_missing`.
- `human_review_required`: boolean.
- `promotion_blocker`: boolean.

`npu_proof`:

- `proof_mode`: `sysfs_busy_delta`, `service_reported_delta`, `health_only`,
  `offline_fixture`, or `unavailable`.
- `busy_delta_us`: integer or null.
- `service_reported_delta_us`: integer or null.
- `inference_ran`: boolean.
- `proof_ok`: boolean or null. Null means not measurable, not false.
- `counter_path`: usually `/sys/class/accel/accel0/device/npu_busy_time_us`, if
  logged safely.

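The `sysfs_busy_delta` proof mode can be sketched as a before/after read of the busy-time counter. The helper names `read_busy_counter` and `busy_delta_proof` are hypothetical, and two rules here are assumptions rather than part of the spec: a positive delta counts as proof, and a negative delta (counter reset) is treated as not measurable.

```python
from typing import Optional, Tuple


def read_busy_counter(path: str) -> Optional[int]:
    """Read an integer microsecond counter, or return None if unreadable."""
    try:
        with open(path, "r", encoding="ascii") as handle:
            return int(handle.read().strip())
    except (OSError, ValueError):
        # Missing device, permission error, or unexpected content.
        return None


def busy_delta_proof(
    before_us: Optional[int], after_us: Optional[int]
) -> Tuple[Optional[int], Optional[bool]]:
    """Return (busy_delta_us, proof_ok) for one inference call."""
    if before_us is None or after_us is None:
        # Not measurable: report null, never False.
        return None, None
    delta = after_us - before_us
    if delta < 0:
        # Assumption: a counter that went backwards was reset, so the
        # sample is unusable.
        return None, None
    # Assumption: any positive busy time counts as evidence of NPU work.
    return delta, delta > 0
```

The harness would call `read_busy_counter` immediately before and after the request and store the result in `busy_delta_us` and `proof_ok`.
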
`latency`:

- `total_ms`: end-to-end harness timing.
- `service_ms`: service-reported processing time when available.
- `queue_ms`: optional queue time.
- `timeout`: boolean.

`fallback`:

- `occurred`: boolean.
- `kind`: null, `cpu`, `offline`, `health_only`, `service_unavailable`,
  `skipped_cold_load`, `private_root_blocked`, or `proof_unavailable`.
- `reason`: short reason code.
- `expected`: boolean. Expected fallbacks are counted but do not fail promotion
  unless their rate exceeds the threshold for that lane.

`privacy`:

- `payload_logged`: must default false.
- `redaction`: `none_needed`, `hash_only`, `paths_only`, `metadata_only`, or
  `blocked_private`.
- `retention`: `ephemeral`, `local_audit`, or `review_artifact`.
- `contains_private_payload`: must be false for committed fixtures.

## Minimal JSON shape

```json
{
  "schema_version": "npu_advisory_decision_v1",
  "decision_id": "01J00000000000000000000000",
  "timestamp": "2026-06-06T00:00:00Z",
  "source": {
    "kind": "fixture",
    "fixture_id": "cron_duplicate_success_001",
    "fixture_set": "npu_advisory_eval_v1",
    "artifact_ref": null,
    "content_hash": "sha256:example",
    "privacy_class": "synthetic"
  },
  "service": {
    "name": "cron_n8n_advisory",
    "endpoint": "openvino-advisory-gateway/examples/cron-advisory-dry-run.sh",
    "mode": "dry_run",
    "model": "openvino-local"
  },
  "input_class": "cron_n8n_event",
  "recommendation": {
    "label": "suppress",
    "severity": "info",
    "reasons": ["duplicate_success", "no_action_required"],
    "evidence_refs": ["fixture:event_kind", "fixture:status"],
    "raw_output_ref": null
  },
  "confidence": {
    "score": 0.91,
    "bucket": "high",
    "bucket_rule": "v1_default",
    "calibrated": false
  },
  "authority_flags": {
    "can_route_atlas": false,
    "can_write_memory": false,
    "can_execute_tools": false,
    "can_restart_services": false,
    "can_send_outbound": false,
    "can_scan_private_roots": false,
    "can_mutate_vector_store": false,
    "can_post_advisory_event": false,
    "can_change_gateway_config": false,
    "requires_human_approval": true,
    "advisory_only": true
  },
  "allowed_actions": [
    "record_metric",
    "compare_with_expected_label",
    "include_in_digest"
  ],
  "actual_action": {
    "kind": "dry_run_reported",
    "performed": false,
    "performed_by": "harness",
    "side_effects": []
  },
  "human_or_atlas_decision": {
    "source": "fixture_expected",
    "label": "suppress",
    "severity": "info",
    "confidence": null,
    "decision_ref": "cron_duplicate_success_001",
    "timestamp": null
  },
  "outcome": {
    "comparison": "agree",
    "error_type": null,
    "human_review_required": false,
    "promotion_blocker": false
  },
  "npu_proof": {
    "proof_mode": "sysfs_busy_delta",
    "busy_delta_us": 1200,
    "service_reported_delta_us": 1180,
    "inference_ran": true,
    "proof_ok": true,
    "counter_path": "/sys/class/accel/accel0/device/npu_busy_time_us"
  },
  "latency": {
    "total_ms": 42.5,
    "service_ms": 39.1,
    "queue_ms": null,
    "timeout": false
  },
  "fallback": {
    "occurred": false,
    "kind": null,
    "reason": null,
    "expected": false
  },
  "privacy": {
    "payload_logged": false,
    "redaction": "metadata_only",
    "retention": "local_audit",
    "contains_private_payload": false
  },
  "notes": []
}
```

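A record with this shape can be given a cheap structural check before it is written to the audit log. The sketch below only verifies that the required top-level keys exist and that `schema_version` matches; the helper name `record_problems` is hypothetical, and nested field validation is deliberately left out.

```python
from typing import Any, Dict, List

SCHEMA_VERSION = "npu_advisory_decision_v1"

# Required top-level fields from the schema table (`notes` is optional).
REQUIRED_FIELDS = (
    "schema_version", "decision_id", "timestamp", "source", "service",
    "input_class", "recommendation", "confidence", "authority_flags",
    "allowed_actions", "actual_action", "human_or_atlas_decision",
    "outcome", "npu_proof", "latency", "fallback", "privacy",
)


def record_problems(record: Dict[str, Any]) -> List[str]:
    """Return human-readable problems; an empty list means the shape is OK."""
    problems = [
        f"missing field: {name}" for name in REQUIRED_FIELDS if name not in record
    ]
    # Only report a version mismatch when the field is actually present,
    # so a missing field is not reported twice.
    if "schema_version" in record and record["schema_version"] != SCHEMA_VERSION:
        problems.append("unexpected schema_version")
    return problems
```

Running this on each parsed JSONL line gives an early signal when a lane emits a truncated or mislabeled record.
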
## Dry-run comparison strategy

Each fixture or shadow input should produce one `npu_advisory_decision_v1`
record. The harness compares `recommendation` to `human_or_atlas_decision` in
this order:

1. Use `fixture_expected` labels for synthetic/non-private regression fixtures.
2. Use explicit `human_label` for reviewed samples.
3. Use `atlas_shadow` only as a comparison signal, not ground truth, when a human
   label is unavailable.
4. Mark `missing_reference` rather than inventing a target decision.

Comparison categories (`outcome.comparison`) and the main error types
(`outcome.error_type`):

- `agree`: normalized label matches and severity is compatible (at most one
  level apart).
- `disagree`: label conflicts with the reference decision.
- `uncertain`: NPU bucket is `very_low`, `low`, or `unknown`, or the service
  returned a deliberate `needs_human`/`unknown` label.
- `false_positive`: NPU recommended escalation/action but reference says
  suppress/no-op.
- `false_negative`: NPU recommended suppress/no-op but reference says escalate or
  action-needed.
- `severity_overcall` / `severity_undercall`: label matches but severity differs
  by more than one level.

The summary should be grouped by lane (`input_class` and `service.name`) and by
confidence bucket. Unknown metrics remain null/`n/a`; do not coerce missing data
to zero.

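The reference order and comparison categories in this section can be sketched as two small functions. Several choices below are assumptions, not rules from the spec: `suppress` and `no_action` are treated as the quiet labels and `escalate` as the only action label, a missing reference is checked before uncertainty, and severity overcalls or undercalls are reported with `comparison="disagree"`. The names `pick_reference` and `compare` are hypothetical.

```python
from typing import Any, Dict, Optional, Tuple

REFERENCE_ORDER = ("fixture_expected", "human_label", "atlas_shadow")
SEVERITIES = ("none", "info", "low", "medium", "high", "critical")
UNCERTAIN_BUCKETS = {"very_low", "low", "unknown"}
UNCERTAIN_LABELS = {"needs_human", "unknown"}
QUIET_LABELS = {"suppress", "no_action"}  # assumption
ACTION_LABELS = {"escalate"}              # assumption


def pick_reference(
    candidates: Dict[str, Any]
) -> Tuple[str, Optional[Any]]:
    """Pick the first available reference decision in the documented order."""
    for source in REFERENCE_ORDER:
        if candidates.get(source) is not None:
            return source, candidates[source]
    return "missing", None


def compare(
    rec_label: str,
    rec_severity: Optional[str],
    bucket: str,
    ref_label: Optional[str],
    ref_severity: Optional[str],
) -> Tuple[str, Optional[str]]:
    """Return (comparison, error_type) for one record."""
    if ref_label is None:
        return "missing_reference", None
    if bucket in UNCERTAIN_BUCKETS or rec_label in UNCERTAIN_LABELS:
        return "uncertain", None
    if rec_label != ref_label:
        if rec_label in ACTION_LABELS and ref_label in QUIET_LABELS:
            return "disagree", "false_positive"
        if rec_label in QUIET_LABELS and ref_label in ACTION_LABELS:
            return "disagree", "false_negative"
        return "disagree", None
    if rec_severity in SEVERITIES and ref_severity in SEVERITIES:
        gap = SEVERITIES.index(rec_severity) - SEVERITIES.index(ref_severity)
        if gap > 1:
            return "disagree", "severity_overcall"
        if gap < -1:
            return "disagree", "severity_undercall"
    return "agree", None
```

Keeping the label sets as module-level constants makes the assumptions easy to review and adjust per lane.
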
## Metrics

Minimum per-run metrics:

- `total_records`
- `records_by_input_class`
- `records_by_service`
- `confidence_bucket_counts`
- `recommendation_counts`
- `authority_flag_violation_count`
- `privacy_violation_count`
- `actual_side_effect_count`
- `agree_count`, `disagree_count`, `uncertain_count`, `missing_reference_count`
- `false_positive_count`, `false_negative_count`
- `severity_overcall_count`, `severity_undercall_count`
- `fallback_count` and `fallback_counts_by_kind`
- `expected_fallback_count` vs `unexpected_fallback_count`
- `npu_proof_ok_count`, `npu_proof_missing_count`, `npu_proof_not_applicable_count`
- p50/p95 `latency.total_ms` by service and input class
- `timeout_count`

Recommended derived rates:

- `agreement_rate = agree / (agree + disagree + false_positive + false_negative + severity_overcall + severity_undercall)`
- `uncertain_rate = uncertain / total_records`
- `false_positive_rate = false_positive / comparable_records`
- `false_negative_rate = false_negative / comparable_records`
- `unsafe_authority_rate = authority_flag_violation_count / total_records`
- `privacy_violation_rate = privacy_violation_count / total_records`
- `unexpected_fallback_rate = unexpected_fallback_count / total_records`
- `proof_ok_rate = npu_proof_ok_count / proof_required_records`

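The derived rates can be computed from the per-run counts as sketched below. This example makes three assumptions that the text does not state: the six counts in the `agreement_rate` denominator are disjoint (each comparable record is counted once), `comparable_records` equals that same denominator, and `proof_required_records` is supplied by the caller as a count. Any rate with a missing input or a zero denominator is returned as `None` so it can be reported as `n/a`.

```python
from typing import Dict, Iterable, Optional

COMPARABLE_KEYS = (
    "agree_count", "disagree_count", "false_positive_count",
    "false_negative_count", "severity_overcall_count",
    "severity_undercall_count",
)


def _sum_or_none(values: Iterable[Optional[int]]) -> Optional[int]:
    """Sum the values, or return None if any of them is missing."""
    items = list(values)
    return None if any(v is None for v in items) else sum(items)


def rate(numerator: Optional[int], denominator: Optional[int]) -> Optional[float]:
    """Divide, returning None instead of inventing a value."""
    if numerator is None or denominator is None or denominator == 0:
        return None
    return numerator / denominator


def derived_rates(counts: Dict[str, Optional[int]]) -> Dict[str, Optional[float]]:
    """Compute the recommended rates from per-run counts."""
    comparable = _sum_or_none(counts.get(key) for key in COMPARABLE_KEYS)
    total = counts.get("total_records")
    return {
        "agreement_rate": rate(counts.get("agree_count"), comparable),
        "uncertain_rate": rate(counts.get("uncertain_count"), total),
        "false_positive_rate": rate(counts.get("false_positive_count"), comparable),
        "false_negative_rate": rate(counts.get("false_negative_count"), comparable),
        "unsafe_authority_rate": rate(
            counts.get("authority_flag_violation_count"), total),
        "privacy_violation_rate": rate(counts.get("privacy_violation_count"), total),
        "unexpected_fallback_rate": rate(
            counts.get("unexpected_fallback_count"), total),
        "proof_ok_rate": rate(
            counts.get("npu_proof_ok_count"), counts.get("proof_required_records")),
    }
```

Returning `None` rather than `0.0` keeps "not measured" distinguishable from "measured and zero" in the summary.
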
## Acceptance thresholds before future promotion

These thresholds are for considering a later, separately approved promotion.
They do not grant authority by themselves.

Global blockers for every lane:

- `authority_flag_violation_count == 0`.
- `actual_side_effect_count == 0` for dry-run harness execution.
- `privacy_violation_count == 0` and no committed private fixtures/secrets.
- No raw private payloads in logs, reports, artifacts, or test fixtures.
- No service bind, route, memory, tool, send, restart, or vector-store mutation
  introduced by the eval code.

Minimum data quality before promotion discussion:

- At least 30 comparable synthetic/non-private records per lane, or all available
  lane fixtures if the lane is explicitly scoped smaller.
- Every advisory lane has at least one normal case, one low-confidence case, one
  false-alarm/noise case, and one action-needed/escalation case.
- `missing_reference_count == 0` for promotion-candidate fixture sets.
- Confidence bucket distribution is reported and stable across at least three
  dry-run executions.

Suggested metric thresholds:

| Metric | Threshold for promotion discussion |
| --- | ---: |
| Agreement rate | `>= 0.95` overall and `>= 0.90` per lane |
| False positive rate | `<= 0.03` overall and no repeated high-severity false positives |
| False negative rate | `<= 0.01` for action-needed/escalation cases |
| Uncertain rate | `<= 0.15` overall, unless lane is intentionally conservative |
| Unexpected fallback rate | `<= 0.02` and every fallback has a reason code |
| NPU proof OK rate | `>= 0.98` for proof-required lanes |
| p95 latency | Within the lane-specific SLO documented by the implementation task |
| Authority/privacy violations | exactly `0` |

Promotion remains lane-specific. A passing context-gate eval does not promote
cron/n8n, voice/audio, batch triage, Kanban hygiene, or advisory gateway lanes.
Each lane needs its own human-approved scope, rollback plan, and review.

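The overall numeric thresholds lend themselves to a table-driven check, sketched here. It covers only the overall values from the table; per-lane agreement, repeated high-severity false positives, reason-code coverage, and the lane-specific latency SLO are out of scope for this example. The function name `promotion_blockers` is hypothetical, and treating an unmeasured metric as a blocker is an assumption. An empty result means "eligible for promotion discussion", never "promoted".

```python
import operator
from typing import Dict, List, Optional

OPS = {">=": operator.ge, "<=": operator.le, "==": operator.eq}

# Overall thresholds copied from the table above.
THRESHOLDS = {
    "agreement_rate": (">=", 0.95),
    "false_positive_rate": ("<=", 0.03),
    "false_negative_rate": ("<=", 0.01),
    "uncertain_rate": ("<=", 0.15),
    "unexpected_fallback_rate": ("<=", 0.02),
    "proof_ok_rate": (">=", 0.98),
    "authority_flag_violation_count": ("==", 0),
    "privacy_violation_count": ("==", 0),
}


def promotion_blockers(metrics: Dict[str, Optional[float]]) -> List[str]:
    """List every threshold that is unmet or could not be evaluated."""
    blockers = []
    for name, (op, limit) in THRESHOLDS.items():
        value = metrics.get(name)
        if value is None:
            # Assumption: missing evidence blocks discussion instead of
            # passing silently.
            blockers.append(f"{name}: not measured")
        elif not OPS[op](value, limit):
            blockers.append(f"{name}: {value} fails {op} {limit}")
    return blockers
```

Lanes without a proof requirement would need `proof_ok_rate` removed from the table or handled separately, since this sketch applies every threshold to every input.
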
## Output formats

The dry-run harness should emit:

1. JSONL decisions: one `npu_advisory_decision_v1` object per line.
2. Compact JSON summary: aggregate counts/rates for dashboards and follow-up
   digest scripts.
3. Compact Markdown/text summary: suitable for terminal, Telegram, or Discord.

The Markdown/text summary should include:

- run id, fixture set, generated-at timestamp;
- records by lane/service;
- agreement/uncertain/false-positive/false-negative counts;
- confidence bucket distribution;
- fallback counts;
- NPU proof counts;
- authority/privacy violation counts;
- promotion blockers and caveats.

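A minimal sketch of the JSONL and text outputs follows. The helper names and the summary keys `run_id`, `fixture_set`, and `generated_at` are hypothetical; the count names come from the metrics list in this document. Only a subset of the summary items is rendered, and missing values are shown as `n/a` rather than zero.

```python
import json
from typing import Any, Dict, Iterable

SUMMARY_FIELDS = (
    "run_id", "fixture_set", "generated_at", "total_records",
    "agree_count", "uncertain_count", "false_positive_count",
    "false_negative_count", "fallback_count", "npu_proof_ok_count",
    "authority_flag_violation_count", "privacy_violation_count",
)


def to_jsonl(records: Iterable[Dict[str, Any]]) -> str:
    """Serialize records as JSON Lines: one compact object per line."""
    return "".join(
        json.dumps(record, sort_keys=True, separators=(",", ":")) + "\n"
        for record in records
    )


def show(value: Any) -> str:
    """Render a metric, keeping unknown values distinct from zero."""
    return "n/a" if value is None else str(value)


def text_summary(summary: Dict[str, Any]) -> str:
    """Build a plain-text summary suitable for a terminal or chat message."""
    return "\n".join(
        f"{name}: {show(summary.get(name))}" for name in SUMMARY_FIELDS
    )
```

Sorting keys in the JSONL output keeps diffs between dry-run executions readable.
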
## Fixture expectations

Use synthetic/non-private fixtures only. Required lanes:

- `context_gate`: retrieve/no-retrieve decisions with missing, conflicting, and
  sufficient context cases.
- `cron_n8n_event`: duplicate success, stale warning, urgent false alarm, and
  action-needed failure.
- `batch_doc_triage`: private-root blocked, approved synthetic sample, noisy OCR,
  and needs-human cases.
- `voice_audio`: bounded generated audio, low-confidence transcript, harmless
  background noise, and action-needed command-like utterance that must not
  execute.
- `kanban_hygiene`: no-op healthy card, stale/card-needs-review, false alarm, and
  action-needed label.
- `advisory_gateway_envelope`: valid classify/generate/triage envelope examples
  plus malformed/unsafe authority-request examples.

Any fixture that resembles private content should be replaced with a synthetic
fixture or reduced to metadata/hash-only form before committing.

## Review checklist

Before implementation or docs depending on this spec are accepted, verify:

- `schema_version` is present and all authority flags default closed.
- Dry-run execution produces no live side effects beyond local report/artifact
  writes.
- Unknown/missing metrics are represented as null/`n/a`, not fake zero.
- Raw payloads and private paths are not persisted by default.
- Summary metrics include confidence buckets, fallback counts, NPU proof, and
  authority/privacy violations.
- Promotion language says "candidate" or "discussion" only; no automatic live
  authority is granted by a passing eval.