William Valentin dae2a57124 feat(npu): add advisory dry-run comparison harness

Add npu_advisory_decision_v1 schema, synthetic fixture set, comparison harness, docs, and focused tests for advisory-only NPU evaluation.

2026-06-06 15:30:31 -07:00


NPU advisory decision schema and dry-run evaluation metrics

This document defines the compact npu_advisory_decision_v1 record and the minimum dry-run metrics required before any OpenVINO/NPU advisory lane is considered for promotion. The schema is advisory-only: it creates audit evidence and comparison data, not live authority.

Scope and safety defaults:

Local audit records only; no outbound sends, service restarts, tool execution, memory writes, routing changes, vector-store mutation, or broad private scans.
Synthetic or explicitly non-private fixtures only for dry-run evaluation.
Raw prompts, transcripts, documents, images, headers, secrets, and full upstream JSON payloads are not persisted by default.
NPU output is evidence for a gate. It must never directly perform or trigger an action.

`npu_advisory_decision_v1`

Required top-level fields:

Field	Type	Required	Notes
`schema_version`	string	yes	Always `npu_advisory_decision_v1`.
`decision_id`	string	yes	Locally generated UUID/ULID. No payload-derived PII.
`timestamp`	string	yes	RFC3339/ISO-8601 UTC timestamp.
`source`	object	yes	Where the dry-run input came from.
`service`	object	yes	Advisory lane/service that produced the recommendation.
`input_class`	string	yes	Normalized class such as `context_gate`, `cron_n8n_event`, `batch_doc_triage`, `voice_audio`, `kanban_hygiene`, or `advisory_gateway_envelope`.
`recommendation`	object	yes	NPU/advisory recommendation and rationale metadata.
`confidence`	object	yes	Score, bucket, and calibration notes.
`authority_flags`	object	yes	Explicit booleans for authority boundaries; every `can_*` flag defaults to false, while `requires_human_approval` and `advisory_only` are true for this gate.
`allowed_actions`	array[string]	yes	Actions a downstream gate may consider. Defaults to advisory-only actions.
`actual_action`	object	yes	What really happened. In this gate it should always be no-op/record-only.
`human_or_atlas_decision`	object	yes	Comparison target from fixture expected label, human label, or Atlas decision.
`outcome`	object	yes	Agreement/error bucket used by the eval harness.
`npu_proof`	object	yes	Evidence that a real NPU-backed inference ran, where available.
`latency`	object	yes	Request latency and optional queue/processing timings.
`fallback`	object	yes	Whether CPU/offline/health-only fallback happened and why.
`privacy`	object	yes	What was redacted/hashed and what retention class applies.
`notes`	array[string]	no	Short non-private audit notes.
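
As a rough illustration of how a harness might enforce the table above, the following sketch checks that every required top-level field is present. The function name `top_level_problems` and the problem-code strings are hypothetical; only the field names and the schema version come from this document.

```python
SCHEMA_VERSION = "npu_advisory_decision_v1"

# Required top-level fields from the table above; "notes" is optional.
REQUIRED_FIELDS = (
    "schema_version", "decision_id", "timestamp", "source", "service",
    "input_class", "recommendation", "confidence", "authority_flags",
    "allowed_actions", "actual_action", "human_or_atlas_decision",
    "outcome", "npu_proof", "latency", "fallback", "privacy",
)


def top_level_problems(record: dict) -> list[str]:
    """Return problem codes; an empty list means the top-level shape is OK."""
    problems = [f"missing:{name}" for name in REQUIRED_FIELDS if name not in record]
    if "schema_version" in record and record["schema_version"] != SCHEMA_VERSION:
        problems.append("schema_version:unexpected_value")
    return problems
```

This only checks presence, not the type or contents of each field; deeper validation would need per-field rules from the sections below.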

Field details

source:

kind: fixture, manual_label, atlas_shadow, human_review, or service_health_probe.
fixture_id: stable fixture identifier when applicable.
fixture_set: fixture collection name/version.
artifact_ref: optional local path or opaque run id; do not include raw private content.
content_hash: optional SHA-256 over sanitized fixture content.
privacy_class: synthetic, public, non_private, redacted, or private_disallowed.
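
One possible way to produce content_hash is sketched below. It assumes the sanitized fixture is a JSON-serializable dict and that hashing a canonical JSON serialization is acceptable; the `sha256:` prefix mirrors the example record later in this document, and the function name is hypothetical.

```python
import hashlib
import json


def sanitized_content_hash(sanitized_fixture: dict) -> str:
    """Hash a canonical JSON form so key order does not change the digest."""
    canonical = json.dumps(
        sanitized_fixture,
        sort_keys=True,
        separators=(",", ":"),
        ensure_ascii=False,
    )
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return f"sha256:{digest}"
```

The hash should be computed only after sanitization, so that it never acts as a fingerprint of raw private content.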

service:

name: e.g. openvino_context_gate, cron_n8n_advisory, npu_batch_triage, npu_voice_audio_pipeline, kanban_hygiene_advisory, openvino_advisory_gateway.
endpoint: local endpoint label or script name; avoid sensitive URL params.
mode: dry_run, shadow, health_only, or offline_fixture.
model: optional model/backend label, if safe to log.

recommendation:

label: normalized recommendation, e.g. suppress, log, summarize, escalate, retrieve_more_context, skip_private_root, needs_human, no_action, or unknown.
severity: none, info, low, medium, high, or critical.
reasons: short non-private reason codes, not raw excerpts.
evidence_refs: bounded references to sanitized fixture fields or artifact ids.
raw_output_ref: optional local artifact pointer; default null.

confidence:

score: float from 0.0 to 1.0 when available, otherwise null.
bucket: one of very_low, low, medium, high, very_high, or unknown.
bucket_rule: the threshold rule used by the harness.
calibrated: boolean; false until enough labeled dry-run data exists.

Recommended confidence buckets:

Bucket	Score range	Gate behavior
`very_low`	`< 0.40`	Treat as uncertain; never escalate automatically.
`low`	`>= 0.40 and < 0.60`	Advisory note only; human/Atlas decides.
`medium`	`>= 0.60 and < 0.80`	Eligible for comparison metrics; no live action.
`high`	`>= 0.80 and < 0.95`	Strong advisory evidence; still gated.
`very_high`	`>= 0.95`	Promotion candidate only after repeated eval success.
`unknown`	null/missing	Count separately; do not coerce to zero.
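
The bucket table can be expressed as a small mapping function. This is a minimal sketch: it treats each range as a half-open interval so that no score falls between buckets, and the function name `bucket_for_score` is hypothetical.

```python
from typing import Optional


def bucket_for_score(score: Optional[float]) -> str:
    """Map a confidence score to a bucket name; None stays 'unknown'."""
    if score is None:
        return "unknown"  # never coerce a missing score to zero
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"confidence score out of range: {score!r}")
    if score < 0.40:
        return "very_low"
    if score < 0.60:
        return "low"
    if score < 0.80:
        return "medium"
    if score < 0.95:
        return "high"
    return "very_high"
```

For example, the score 0.91 used in the example record maps to `high`.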

authority_flags:

All can_* flags default to false and must remain false for this gate.

can_route_atlas
can_write_memory
can_execute_tools
can_restart_services
can_send_outbound
can_scan_private_roots
can_mutate_vector_store
can_post_advisory_event
can_change_gateway_config
requires_human_approval
advisory_only

The last two flags are the exceptions: for this gate, advisory_only is always true, and requires_human_approval is true for any recommendation that could eventually affect live behavior.
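
A harness could count authority violations along these lines. The sketch fails closed: a flag that is missing or not a real boolean is treated as a violation, and requires_human_approval is required to be true on every record. Both choices are assumptions that are stricter than the text above, and the function name is hypothetical.

```python
# Flags that must stay false for this gate.
CAN_FLAGS = (
    "can_route_atlas", "can_write_memory", "can_execute_tools",
    "can_restart_services", "can_send_outbound", "can_scan_private_roots",
    "can_mutate_vector_store", "can_post_advisory_event",
    "can_change_gateway_config",
)
# Flags that must be true for this gate (assumed strict for every record).
MUST_BE_TRUE = ("requires_human_approval", "advisory_only")


def authority_flag_violations(flags: dict) -> list[str]:
    """Return the names of flags whose value breaks the gate's defaults."""
    violations = [name for name in CAN_FLAGS if flags.get(name) is not False]
    violations += [name for name in MUST_BE_TRUE if flags.get(name) is not True]
    return violations
```

A record contributes to authority_flag_violation_count whenever this list is non-empty.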

allowed_actions:

Allowed by default:

record_metric
compare_with_expected_label
include_in_digest
open_review_ticket_candidate
recommend_human_review

Disallowed unless a later approval explicitly changes scope:

route_atlas
write_memory
execute_tool
restart_service
send_message
scan_private_root
mutate_vector_store
post_gateway_event
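
The two lists above lend themselves to a simple scope check. In this sketch, the split into `disallowed` and `unrecognized` and the function name are hypothetical; the action names are taken from the lists.

```python
DEFAULT_ALLOWED = frozenset({
    "record_metric", "compare_with_expected_label", "include_in_digest",
    "open_review_ticket_candidate", "recommend_human_review",
})
DISALLOWED = frozenset({
    "route_atlas", "write_memory", "execute_tool", "restart_service",
    "send_message", "scan_private_root", "mutate_vector_store",
    "post_gateway_event",
})


def action_scope_problems(allowed_actions: list[str]) -> dict:
    """Split out-of-scope entries into known-disallowed and unrecognized."""
    return {
        "disallowed": sorted(a for a in allowed_actions if a in DISALLOWED),
        "unrecognized": sorted(
            a for a in allowed_actions
            if a not in DEFAULT_ALLOWED and a not in DISALLOWED
        ),
    }
```

Treating unrecognized actions as problems keeps the check closed by default when new action names appear.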

actual_action:

kind: should be none, recorded_metric, or dry_run_reported.
performed: boolean; must be false in this gate, since no live side effects are allowed.
performed_by: harness, human, atlas, or null.
side_effects: array; should be empty except local report/artifact writes.

human_or_atlas_decision:

source: fixture_expected, human_label, atlas_shadow, or missing.
label: normalized decision label using the same label set as recommendation.label when possible.
severity: normalized severity when applicable.
confidence: optional Atlas/human confidence if available.
decision_ref: optional review id, fixture id, or session/run id.
timestamp: optional timestamp for the comparison decision.

outcome:

comparison: agree, disagree, uncertain, missing_reference, or not_applicable.
error_type: null or one of false_positive, false_negative, severity_overcall, severity_undercall, unsafe_authority, privacy_violation, fallback_unexpected, latency_slo_miss, npu_proof_missing.
human_review_required: boolean.
promotion_blocker: boolean.

npu_proof:

proof_mode: sysfs_busy_delta, service_reported_delta, health_only, offline_fixture, or unavailable.
busy_delta_us: integer or null.
service_reported_delta_us: integer or null.
inference_ran: boolean.
proof_ok: boolean or null. Null means not measurable, not false.
counter_path: usually /sys/class/accel/accel0/device/npu_busy_time_us, if logged safely.
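
For the sysfs_busy_delta proof mode, a harness might read the busy-time counter before and after an inference call. The sketch below assumes the counter file holds a single integer in microseconds and that a strictly positive delta counts as proof; both are assumptions, as are the function names. An unreadable counter yields proof_ok of null rather than false.

```python
from pathlib import Path
from typing import Callable, Optional

DEFAULT_COUNTER = "/sys/class/accel/accel0/device/npu_busy_time_us"


def read_counter_us(path: str) -> Optional[int]:
    """Read the busy-time counter; None if it is absent or unparsable."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


def measure_npu_proof(run_inference: Callable[[], object],
                      counter_path: str = DEFAULT_COUNTER) -> dict:
    """Wrap one inference call and build an npu_proof object from the delta."""
    before = read_counter_us(counter_path)
    run_inference()  # caller-supplied callable; exceptions propagate
    after = read_counter_us(counter_path)
    measurable = before is not None and after is not None
    delta = after - before if measurable else None
    return {
        "proof_mode": "sysfs_busy_delta" if measurable else "unavailable",
        "busy_delta_us": delta,
        "service_reported_delta_us": None,
        "inference_ran": True,  # the callable returned without raising
        "proof_ok": (delta > 0) if measurable else None,  # null: not measurable
        "counter_path": counter_path,
    }
```

Because other workloads may also advance the counter, a positive delta is supporting evidence rather than conclusive proof that this particular request ran on the NPU.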

latency:

total_ms: end-to-end harness timing.
service_ms: service-reported processing time when available.
queue_ms: optional queue time.
timeout: boolean.

fallback:

occurred: boolean.
kind: null, cpu, offline, health_only, service_unavailable, skipped_cold_load, private_root_blocked, or proof_unavailable.
reason: short reason code.
expected: boolean. Expected fallbacks are counted but do not fail promotion unless their rate exceeds the threshold for that lane.

privacy:

payload_logged: must default false.
redaction: none_needed, hash_only, paths_only, metadata_only, or blocked_private.
retention: ephemeral, local_audit, or review_artifact.
contains_private_payload: must be false for committed fixtures.

Minimal JSON shape

{
  "schema_version": "npu_advisory_decision_v1",
  "decision_id": "01J00000000000000000000000",
  "timestamp": "2026-06-06T00:00:00Z",
  "source": {
    "kind": "fixture",
    "fixture_id": "cron_duplicate_success_001",
    "fixture_set": "npu_advisory_eval_v1",
    "artifact_ref": null,
    "content_hash": "sha256:example",
    "privacy_class": "synthetic"
  },
  "service": {
    "name": "cron_n8n_advisory",
    "endpoint": "openvino-advisory-gateway/examples/cron-advisory-dry-run.sh",
    "mode": "dry_run",
    "model": "openvino-local"
  },
  "input_class": "cron_n8n_event",
  "recommendation": {
    "label": "suppress",
    "severity": "info",
    "reasons": ["duplicate_success", "no_action_required"],
    "evidence_refs": ["fixture:event_kind", "fixture:status"],
    "raw_output_ref": null
  },
  "confidence": {
    "score": 0.91,
    "bucket": "high",
    "bucket_rule": "v1_default",
    "calibrated": false
  },
  "authority_flags": {
    "can_route_atlas": false,
    "can_write_memory": false,
    "can_execute_tools": false,
    "can_restart_services": false,
    "can_send_outbound": false,
    "can_scan_private_roots": false,
    "can_mutate_vector_store": false,
    "can_post_advisory_event": false,
    "can_change_gateway_config": false,
    "requires_human_approval": true,
    "advisory_only": true
  },
  "allowed_actions": [
    "record_metric",
    "compare_with_expected_label",
    "include_in_digest"
  ],
  "actual_action": {
    "kind": "dry_run_reported",
    "performed": false,
    "performed_by": "harness",
    "side_effects": []
  },
  "human_or_atlas_decision": {
    "source": "fixture_expected",
    "label": "suppress",
    "severity": "info",
    "confidence": null,
    "decision_ref": "cron_duplicate_success_001",
    "timestamp": null
  },
  "outcome": {
    "comparison": "agree",
    "error_type": null,
    "human_review_required": false,
    "promotion_blocker": false
  },
  "npu_proof": {
    "proof_mode": "sysfs_busy_delta",
    "busy_delta_us": 1200,
    "service_reported_delta_us": 1180,
    "inference_ran": true,
    "proof_ok": true,
    "counter_path": "/sys/class/accel/accel0/device/npu_busy_time_us"
  },
  "latency": {
    "total_ms": 42.5,
    "service_ms": 39.1,
    "queue_ms": null,
    "timeout": false
  },
  "fallback": {
    "occurred": false,
    "kind": null,
    "reason": null,
    "expected": false
  },
  "privacy": {
    "payload_logged": false,
    "redaction": "metadata_only",
    "retention": "local_audit",
    "contains_private_payload": false
  },
  "notes": []
}

Dry-run comparison strategy

Each fixture or shadow input should produce one npu_advisory_decision_v1 record. The harness compares recommendation to human_or_atlas_decision in this order:

Use fixture_expected labels for synthetic/non-private regression fixtures.
Use explicit human_label for reviewed samples.
Use atlas_shadow only as a comparison signal, not ground truth, when a human label is unavailable.
Mark missing_reference rather than inventing a target decision.
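
The selection order above could be implemented roughly as follows. The sketch assumes the harness gathers zero or more candidate decisions per input, each shaped like human_or_atlas_decision; the function name and the candidate-list input are hypothetical.

```python
# Priority order from the list above; earlier sources win.
REFERENCE_PRIORITY = ("fixture_expected", "human_label", "atlas_shadow")


def select_reference(candidates: list[dict]) -> dict:
    """Pick the highest-priority labeled candidate, or an explicit 'missing'."""
    for source in REFERENCE_PRIORITY:
        for candidate in candidates:
            if candidate.get("source") == source and candidate.get("label") is not None:
                return candidate
    # Never invent a target decision; record that none was available.
    return {
        "source": "missing",
        "label": None,
        "severity": None,
        "confidence": None,
        "decision_ref": None,
        "timestamp": None,
    }
```

A record whose selected reference has source `missing` would then be counted under missing_reference in the comparison.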

Comparison categories:

agree: normalized label is compatible with the reference and severity differs by at most one level.
disagree: label conflicts with the reference decision.
uncertain: NPU bucket is very_low, low, or unknown, or the service returned a deliberate needs_human/unknown label.
false_positive: NPU recommended escalation/action but reference says suppress/no-op.
false_negative: NPU recommended suppress/no-op but reference says escalate or action-needed.
severity_overcall / severity_undercall: label matches but the NPU severity is more than one level above (overcall) or below (undercall) the reference severity.
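
The categories above can be sketched as a single classification function. Several details are assumptions not fixed by this document: which labels count as action versus no-op (`ACTION_LABELS`, `NOOP_LABELS`), checking for a missing reference before uncertainty, and treating label compatibility as exact equality. The function name is hypothetical.

```python
from typing import Optional

SEVERITY_ORDER = ("none", "info", "low", "medium", "high", "critical")
UNCERTAIN_BUCKETS = {"very_low", "low", "unknown"}
UNCERTAIN_LABELS = {"needs_human", "unknown"}
ACTION_LABELS = {"escalate"}            # assumed grouping
NOOP_LABELS = {"suppress", "no_action"}  # assumed grouping


def comparison_category(rec_label: str, rec_severity: Optional[str], bucket: str,
                        ref_label: Optional[str], ref_severity: Optional[str]) -> str:
    """Return exactly one comparison category for a recommendation."""
    if ref_label is None:
        return "missing_reference"
    if bucket in UNCERTAIN_BUCKETS or rec_label in UNCERTAIN_LABELS:
        return "uncertain"
    if rec_label in ACTION_LABELS and ref_label in NOOP_LABELS:
        return "false_positive"
    if rec_label in NOOP_LABELS and ref_label in ACTION_LABELS:
        return "false_negative"
    if rec_label != ref_label:
        return "disagree"
    if rec_severity is not None and ref_severity is not None:
        gap = SEVERITY_ORDER.index(rec_severity) - SEVERITY_ORDER.index(ref_severity)
        if gap > 1:
            return "severity_overcall"
        if gap < -1:
            return "severity_undercall"
    return "agree"
```

Returning one category per record keeps the counts disjoint, which matches how the derived rates sum them in a denominator.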

The summary should be grouped by lane (input_class and service.name) and by confidence bucket. Unknown metrics remain null/n/a; do not coerce missing data to zero.

Metrics

Minimum per-run metrics:

total_records
records_by_input_class
records_by_service
confidence_bucket_counts
recommendation_counts
authority_flag_violation_count
privacy_violation_count
actual_side_effect_count
agree_count, disagree_count, uncertain_count, missing_reference_count
false_positive_count, false_negative_count
severity_overcall_count, severity_undercall_count
fallback_count and fallback_counts_by_kind
expected_fallback_count and unexpected_fallback_count
npu_proof_ok_count, npu_proof_missing_count, npu_proof_not_applicable_count
p50/p95 latency.total_ms by service and input class
timeout_count
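
The p50/p95 latency metric could be computed per service and input class as sketched below. The nearest-rank percentile method is an assumption (this document does not name a method), as are the function names and the output key names. Groups with no measured samples report null instead of zero.

```python
import math
from collections import defaultdict


def nearest_rank_percentile(values, pct):
    """Nearest-rank percentile over non-null samples; None if there are none."""
    samples = sorted(v for v in values if v is not None)
    if not samples:
        return None
    rank = max(1, math.ceil(pct * len(samples) / 100))
    return samples[rank - 1]


def latency_summary(records):
    """Group latency.total_ms by (service.name, input_class) and summarize."""
    groups = defaultdict(list)
    for record in records:
        key = (record["service"]["name"], record["input_class"])
        groups[key].append(record["latency"].get("total_ms"))
    return {
        key: {
            "samples": sum(1 for v in values if v is not None),
            "p50_ms": nearest_rank_percentile(values, 50),
            "p95_ms": nearest_rank_percentile(values, 95),
        }
        for key, values in groups.items()
    }
```

With small fixture sets, p95 is close to the maximum, so it is worth reporting the sample count alongside it.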

Recommended derived rates:

agreement_rate = agree / (agree + disagree + false_positive + false_negative + severity_overcall + severity_undercall)
uncertain_rate = uncertain / total_records
false_positive_rate = false_positive / comparable_records
false_negative_rate = false_negative / comparable_records
unsafe_authority_rate = authority_flag_violation_count / total_records
privacy_violation_rate = privacy_violation_count / total_records
unexpected_fallback_rate = unexpected_fallback_count / total_records
proof_ok_rate = npu_proof_ok_count / proof_required_records
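
The rates above can be computed from the per-run counts as sketched below. Two points are assumptions: comparable_records is taken to be the same sum used in the agreement_rate denominator, and proof_required_records is supplied by the caller as a count. A zero denominator yields null rather than zero, and a missing count raises instead of being silently treated as zero.

```python
COMPARABLE_KEYS = (
    "agree_count", "disagree_count", "false_positive_count",
    "false_negative_count", "severity_overcall_count", "severity_undercall_count",
)


def safe_rate(numerator, denominator):
    """Return numerator/denominator, or None when the rate is undefined."""
    if not denominator:
        return None
    return numerator / denominator


def derived_rates(counts: dict) -> dict:
    """Compute the recommended rates; missing counts raise KeyError."""
    comparable = sum(counts[key] for key in COMPARABLE_KEYS)  # assumed definition
    total = counts["total_records"]
    return {
        "agreement_rate": safe_rate(counts["agree_count"], comparable),
        "uncertain_rate": safe_rate(counts["uncertain_count"], total),
        "false_positive_rate": safe_rate(counts["false_positive_count"], comparable),
        "false_negative_rate": safe_rate(counts["false_negative_count"], comparable),
        "unsafe_authority_rate": safe_rate(counts["authority_flag_violation_count"], total),
        "privacy_violation_rate": safe_rate(counts["privacy_violation_count"], total),
        "unexpected_fallback_rate": safe_rate(counts["unexpected_fallback_count"], total),
        "proof_ok_rate": safe_rate(counts["npu_proof_ok_count"],
                                   counts["proof_required_records"]),
    }
```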

Acceptance thresholds before future promotion

These thresholds are for considering a later, separately approved promotion. They do not grant authority by themselves.

Global blockers for every lane:

authority_flag_violation_count == 0.
actual_side_effect_count == 0 for dry-run harness execution.
privacy_violation_count == 0 and no committed private fixtures/secrets.
No raw private payloads in logs, reports, artifacts, or test fixtures.
The eval code introduces no service binds, routing changes, memory writes, tool execution, outbound sends, service restarts, or vector-store mutations.

Minimum data quality before promotion discussion:

At least 30 comparable synthetic/non-private records per lane, or all available lane fixtures if the lane is explicitly scoped smaller.
Every advisory lane has at least one normal case, one low-confidence case, one false-alarm/noise case, and one action-needed/escalation case.
missing_reference_count == 0 for promotion-candidate fixture sets.
Confidence bucket distribution is reported and stable across at least three dry-run executions.

Suggested metric thresholds:

Metric	Threshold for promotion discussion
Agreement rate	`>= 0.95` overall and `>= 0.90` per lane
False positive rate	`<= 0.03` overall and no repeated high-severity false positives
False negative rate	`<= 0.01` for action-needed/escalation cases
Uncertain rate	`<= 0.15` overall, unless lane is intentionally conservative
Unexpected fallback rate	`<= 0.02` and every fallback has a reason code
NPU proof OK rate	`>= 0.98` for proof-required lanes
p95 latency	Within the lane-specific SLO documented by the implementation task
Authority/privacy violations	exactly `0`

Promotion remains lane-specific. A passing context-gate eval does not promote cron/n8n, voice/audio, batch triage, Kanban hygiene, or advisory gateway lanes. Each lane needs its own human-approved scope, rollback plan, and review.
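
A threshold check over one scope (overall, or a single lane) might look like the sketch below. It covers only the numeric rows of the table plus the zero-count blockers; the repeated high-severity false-positive rule and the lane-specific p95 SLO are left out. Treating an unmeasured rate as a blocker is an assumption, and the function name and blocker strings are hypothetical. An empty result means the scope is eligible for promotion discussion only.

```python
def promotion_blockers(rates: dict, counts: dict, agreement_floor: float = 0.95) -> list[str]:
    """List reasons this scope cannot yet be discussed for promotion.

    Use agreement_floor=0.95 for the overall run and 0.90 for a single lane.
    """
    blockers = []

    def require(name, is_ok):
        value = rates.get(name)
        if value is None:
            blockers.append(f"{name}:not_measured")  # assumed: fail closed
        elif not is_ok(value):
            blockers.append(f"{name}:{value:.3f}")

    require("agreement_rate", lambda v: v >= agreement_floor)
    require("false_positive_rate", lambda v: v <= 0.03)
    require("false_negative_rate", lambda v: v <= 0.01)
    require("uncertain_rate", lambda v: v <= 0.15)
    require("unexpected_fallback_rate", lambda v: v <= 0.02)
    require("proof_ok_rate", lambda v: v >= 0.98)

    for name in ("authority_flag_violation_count",
                 "privacy_violation_count",
                 "actual_side_effect_count"):
        if counts.get(name) != 0:
            blockers.append(f"{name}:{counts.get(name)}")
    return blockers
```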

Output formats

The dry-run harness should emit:

JSONL decisions: one npu_advisory_decision_v1 object per line.
Compact JSON summary: aggregate counts/rates for dashboards and follow-up digest scripts.
Compact Markdown/text summary: suitable for terminal, Telegram, or Discord.
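
The JSONL output can be produced with the standard library alone. This is a minimal sketch with hypothetical function names; it writes to any text stream so the caller decides where the local artifact lives.

```python
import json
from typing import Iterable, Iterator, TextIO


def write_jsonl(records: Iterable[dict], stream: TextIO) -> int:
    """Write one compact JSON object per line; return the number written."""
    count = 0
    for record in records:
        stream.write(json.dumps(record, sort_keys=True, separators=(",", ":")) + "\n")
        count += 1
    return count


def read_jsonl(stream: TextIO) -> Iterator[dict]:
    """Yield one decoded object per non-empty line."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)
```

Null values survive the round trip as `None`, so unknown metrics are not turned into zeros by serialization.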

The Markdown/text summary should include:

run id, fixture set, generated-at timestamp;
records by lane/service;
agreement/uncertain/false-positive/false-negative counts;
confidence bucket distribution;
fallback counts;
NPU proof counts;
authority/privacy violation counts;
promotion blockers and caveats.
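
A text summary covering the items above could be rendered as sketched below. All summary key names here (`run_id`, `fixture_set`, `generated_at`, and so on) are hypothetical, since this document does not fix the summary JSON layout. Missing values are printed as n/a rather than zero.

```python
def fmt(value):
    """Render None as n/a so unknown metrics are never shown as zero."""
    if value is None:
        return "n/a"
    if isinstance(value, float):
        return f"{value:.3f}"
    return str(value)


def render_summary(summary: dict) -> str:
    """Build a compact plain-text summary from a (hypothetical) summary dict."""
    get = summary.get
    lines = [
        f"run: {fmt(get('run_id'))} | fixtures: {fmt(get('fixture_set'))}"
        f" | generated: {fmt(get('generated_at'))}",
        f"records by lane/service: {fmt(get('records_by_service'))}",
        f"agree/uncertain/FP/FN: {fmt(get('agree_count'))}/{fmt(get('uncertain_count'))}"
        f"/{fmt(get('false_positive_count'))}/{fmt(get('false_negative_count'))}",
        f"confidence buckets: {fmt(get('confidence_bucket_counts'))}",
        f"fallbacks: {fmt(get('fallback_count'))}",
        f"NPU proof ok/missing: {fmt(get('npu_proof_ok_count'))}"
        f"/{fmt(get('npu_proof_missing_count'))}",
        f"authority/privacy violations: {fmt(get('authority_flag_violation_count'))}"
        f"/{fmt(get('privacy_violation_count'))}",
    ]
    blockers = get("promotion_blockers") or []
    lines.append("promotion blockers: " + (", ".join(blockers) if blockers else "none reported"))
    return "\n".join(lines)
```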

Fixture expectations

Use synthetic/non-private fixtures only. Required lanes:

context_gate: retrieve/no-retrieve decisions with missing, conflicting, and sufficient context cases.
cron_n8n_event: duplicate success, stale warning, urgent false alarm, and action-needed failure.
batch_doc_triage: private-root blocked, approved synthetic sample, noisy OCR, and needs-human cases.
voice_audio: bounded generated audio, low-confidence transcript, harmless background noise, and action-needed command-like utterance that must not execute.
kanban_hygiene: no-op healthy card, stale card needing review, false alarm, and action-needed label.
advisory_gateway_envelope: valid classify/generate/triage envelope examples plus malformed/unsafe authority-request examples.

Any fixture that resembles private content should be replaced with a synthetic fixture or reduced to metadata/hash-only form before committing.

Review checklist

Before implementation or docs depending on this spec are accepted, verify:

schema_version is present and all authority flags default closed.
Dry-run execution produces no live side effects beyond local report/artifact writes.
Unknown/missing metrics are represented as null/n/a, not fake zero.
Raw payloads and private paths are not persisted by default.
Summary metrics include confidence buckets, fallback counts, NPU proof, and authority/privacy violations.
Promotion language says "candidate" or "discussion" only; no automatic live authority is granted by a passing eval.
