feat(npu): add advisory dry-run comparison harness
Add npu_advisory_decision_v1 schema, synthetic fixture set, comparison harness, docs, and focused tests for advisory-only NPU evaluation.
This commit is contained in:
@@ -0,0 +1,456 @@
|
||||
# NPU advisory decision schema and dry-run evaluation metrics
|
||||
|
||||
This document defines the compact `npu_advisory_decision_v1` record and the
|
||||
minimum dry-run metrics required before any OpenVINO/NPU advisory lane is
|
||||
considered for promotion. The schema is advisory-only: it creates audit evidence
|
||||
and comparison data, not live authority.
|
||||
|
||||
Scope and safety defaults:
|
||||
|
||||
- Local audit records only; no outbound sends, service restarts, tool execution,
|
||||
memory writes, routing changes, vector-store mutation, or broad private scans.
|
||||
- Synthetic or explicitly non-private fixtures only for dry-run evaluation.
|
||||
- Raw prompts, transcripts, documents, images, headers, secrets, and full upstream
|
||||
JSON payloads are not persisted by default.
|
||||
- NPU output is evidence for a gate. It must never directly perform or trigger
|
||||
an action.
|
||||
|
||||
## `npu_advisory_decision_v1`
|
||||
|
||||
Required top-level fields:
|
||||
|
||||
| Field | Type | Required | Notes |
|
||||
| --- | --- | ---: | --- |
|
||||
| `schema_version` | string | yes | Always `npu_advisory_decision_v1`. |
|
||||
| `decision_id` | string | yes | Locally generated UUID/ULID. No payload-derived PII. |
|
||||
| `timestamp` | string | yes | RFC3339/ISO-8601 UTC timestamp. |
|
||||
| `source` | object | yes | Where the dry-run input came from. |
|
||||
| `service` | object | yes | Advisory lane/service that produced the recommendation. |
|
||||
| `input_class` | string | yes | Normalized class such as `context_gate`, `cron_n8n_event`, `batch_doc_triage`, `voice_audio`, `kanban_hygiene`, or `advisory_gateway_envelope`. |
|
||||
| `recommendation` | object | yes | NPU/advisory recommendation and rationale metadata. |
|
||||
| `confidence` | object | yes | Score, bucket, and calibration notes. |
|
||||
| `authority_flags` | object | yes | Explicit booleans for authority boundaries; all default false. |
|
||||
| `allowed_actions` | array[string] | yes | Actions a downstream gate may consider. Defaults to advisory-only actions. |
|
||||
| `actual_action` | object | yes | What really happened. In this gate it should always be no-op/record-only. |
|
||||
| `human_or_atlas_decision` | object | yes | Comparison target from fixture expected label, human label, or Atlas decision. |
|
||||
| `outcome` | object | yes | Agreement/error bucket used by the eval harness. |
|
||||
| `npu_proof` | object | yes | Evidence that a real NPU-backed inference ran, where available. |
|
||||
| `latency` | object | yes | Request latency and optional queue/processing timings. |
|
||||
| `fallback` | object | yes | Whether CPU/offline/health-only fallback happened and why. |
|
||||
| `privacy` | object | yes | What was redacted/hashed and what retention class applies. |
|
||||
| `notes` | array[string] | no | Short non-private audit notes. |
|
||||
|
||||
### Field details
|
||||
|
||||
`source`:
|
||||
|
||||
- `kind`: `fixture`, `manual_label`, `atlas_shadow`, `human_review`, or
|
||||
`service_health_probe`.
|
||||
- `fixture_id`: stable fixture identifier when applicable.
|
||||
- `fixture_set`: fixture collection name/version.
|
||||
- `artifact_ref`: optional local path or opaque run id; do not include raw
|
||||
private content.
|
||||
- `content_hash`: optional SHA-256 over sanitized fixture content.
|
||||
- `privacy_class`: `synthetic`, `public`, `non_private`, `redacted`, or
|
||||
`private_disallowed`.
|
||||
|
||||
`service`:
|
||||
|
||||
- `name`: e.g. `openvino_context_gate`, `cron_n8n_advisory`,
|
||||
`npu_batch_triage`, `npu_voice_audio_pipeline`, `kanban_hygiene_advisory`,
|
||||
`openvino_advisory_gateway`.
|
||||
- `endpoint`: local endpoint label or script name; avoid sensitive URL params.
|
||||
- `mode`: `dry_run`, `shadow`, `health_only`, or `offline_fixture`.
|
||||
- `model`: optional model/backend label, if safe to log.
|
||||
|
||||
`recommendation`:
|
||||
|
||||
- `label`: normalized recommendation, e.g. `suppress`, `log`, `summarize`,
|
||||
`escalate`, `retrieve_more_context`, `skip_private_root`, `needs_human`,
|
||||
`no_action`, or `unknown`.
|
||||
- `severity`: `none`, `info`, `low`, `medium`, `high`, or `critical`.
|
||||
- `reasons`: short non-private reason codes, not raw excerpts.
|
||||
- `evidence_refs`: bounded references to sanitized fixture fields or artifact ids.
|
||||
- `raw_output_ref`: optional local artifact pointer; default null.
|
||||
|
||||
`confidence`:
|
||||
|
||||
- `score`: float from 0.0 to 1.0 when available, otherwise null.
|
||||
- `bucket`: one of `very_low`, `low`, `medium`, `high`, `very_high`, or
|
||||
`unknown`.
|
||||
- `bucket_rule`: the threshold rule used by the harness.
|
||||
- `calibrated`: boolean; false until enough labeled dry-run data exists.
|
||||
|
||||
Recommended confidence buckets:
|
||||
|
||||
| Bucket | Score range | Gate behavior |
|
||||
| --- | --- | --- |
|
||||
| `very_low` | `< 0.40` | Treat as uncertain; never escalate automatically. |
|
||||
| `low` | `0.40-0.59` | Advisory note only; human/Atlas decides. |
|
||||
| `medium` | `0.60-0.79` | Eligible for comparison metrics; no live action. |
|
||||
| `high` | `0.80-0.94` | Strong advisory evidence; still gated. |
|
||||
| `very_high` | `>= 0.95` | Promotion candidate only after repeated eval success. |
|
||||
| `unknown` | null/missing | Count separately; do not coerce to zero. |
|
||||
|
||||
`authority_flags`:
|
||||
|
||||
All flags default to false and must remain false for this gate.
|
||||
|
||||
- `can_route_atlas`
|
||||
- `can_write_memory`
|
||||
- `can_execute_tools`
|
||||
- `can_restart_services`
|
||||
- `can_send_outbound`
|
||||
- `can_scan_private_roots`
|
||||
- `can_mutate_vector_store`
|
||||
- `can_post_advisory_event`
|
||||
- `can_change_gateway_config`
|
||||
- `requires_human_approval`
|
||||
- `advisory_only`
|
||||
|
||||
For this gate, `advisory_only=true` and `requires_human_approval=true` for any
|
||||
recommendation that could eventually affect live behavior.
|
||||
|
||||
`allowed_actions`:
|
||||
|
||||
Allowed by default:
|
||||
|
||||
- `record_metric`
|
||||
- `compare_with_expected_label`
|
||||
- `include_in_digest`
|
||||
- `open_review_ticket_candidate`
|
||||
- `recommend_human_review`
|
||||
|
||||
Disallowed unless a later approval explicitly changes scope:
|
||||
|
||||
- `route_atlas`
|
||||
- `write_memory`
|
||||
- `execute_tool`
|
||||
- `restart_service`
|
||||
- `send_message`
|
||||
- `scan_private_root`
|
||||
- `mutate_vector_store`
|
||||
- `post_gateway_event`
|
||||
|
||||
`actual_action`:
|
||||
|
||||
- `kind`: should be `none`, `recorded_metric`, or `dry_run_reported`.
|
||||
- `performed`: boolean; false for live side effects in this gate.
|
||||
- `performed_by`: `harness`, `human`, `atlas`, or null.
|
||||
- `side_effects`: array; should be empty except local report/artifact writes.
|
||||
|
||||
`human_or_atlas_decision`:
|
||||
|
||||
- `source`: `fixture_expected`, `human_label`, `atlas_shadow`, or `missing`.
|
||||
- `label`: normalized decision label using the same label set as
|
||||
`recommendation.label` when possible.
|
||||
- `severity`: normalized severity when applicable.
|
||||
- `confidence`: optional Atlas/human confidence if available.
|
||||
- `decision_ref`: optional review id, fixture id, or session/run id.
|
||||
- `timestamp`: optional timestamp for the comparison decision.
|
||||
|
||||
`outcome`:
|
||||
|
||||
- `comparison`: `agree`, `disagree`, `uncertain`, `missing_reference`, or
|
||||
`not_applicable`.
|
||||
- `error_type`: null or one of `false_positive`, `false_negative`,
|
||||
`severity_overcall`, `severity_undercall`, `unsafe_authority`,
|
||||
`privacy_violation`, `fallback_unexpected`, `latency_slo_miss`,
|
||||
`npu_proof_missing`.
|
||||
- `human_review_required`: boolean.
|
||||
- `promotion_blocker`: boolean.
|
||||
|
||||
`npu_proof`:
|
||||
|
||||
- `proof_mode`: `sysfs_busy_delta`, `service_reported_delta`, `health_only`,
|
||||
`offline_fixture`, or `unavailable`.
|
||||
- `busy_delta_us`: integer or null.
|
||||
- `service_reported_delta_us`: integer or null.
|
||||
- `inference_ran`: boolean.
|
||||
- `proof_ok`: boolean or null. Null means not measurable, not false.
|
||||
- `counter_path`: usually `/sys/class/accel/accel0/device/npu_busy_time_us`, if
|
||||
logged safely.
|
||||
|
||||
`latency`:
|
||||
|
||||
- `total_ms`: end-to-end harness timing.
|
||||
- `service_ms`: service-reported processing time when available.
|
||||
- `queue_ms`: optional queue time.
|
||||
- `timeout`: boolean.
|
||||
|
||||
`fallback`:
|
||||
|
||||
- `occurred`: boolean.
|
||||
- `kind`: null, `cpu`, `offline`, `health_only`, `service_unavailable`,
|
||||
`skipped_cold_load`, `private_root_blocked`, or `proof_unavailable`.
|
||||
- `reason`: short reason code.
|
||||
- `expected`: boolean. Expected fallbacks are counted but do not fail promotion
|
||||
unless their rate exceeds the threshold for that lane.
|
||||
|
||||
`privacy`:
|
||||
|
||||
- `payload_logged`: must default false.
|
||||
- `redaction`: `none_needed`, `hash_only`, `paths_only`, `metadata_only`, or
|
||||
`blocked_private`.
|
||||
- `retention`: `ephemeral`, `local_audit`, or `review_artifact`.
|
||||
- `contains_private_payload`: must be false for committed fixtures.
|
||||
|
||||
## Minimal JSON shape
|
||||
|
||||
```json
|
||||
{
|
||||
"schema_version": "npu_advisory_decision_v1",
|
||||
"decision_id": "01J00000000000000000000000",
|
||||
"timestamp": "2026-06-06T00:00:00Z",
|
||||
"source": {
|
||||
"kind": "fixture",
|
||||
"fixture_id": "cron_duplicate_success_001",
|
||||
"fixture_set": "npu_advisory_eval_v1",
|
||||
"artifact_ref": null,
|
||||
"content_hash": "sha256:example",
|
||||
"privacy_class": "synthetic"
|
||||
},
|
||||
"service": {
|
||||
"name": "cron_n8n_advisory",
|
||||
"endpoint": "openvino-advisory-gateway/examples/cron-advisory-dry-run.sh",
|
||||
"mode": "dry_run",
|
||||
"model": "openvino-local"
|
||||
},
|
||||
"input_class": "cron_n8n_event",
|
||||
"recommendation": {
|
||||
"label": "suppress",
|
||||
"severity": "info",
|
||||
"reasons": ["duplicate_success", "no_action_required"],
|
||||
"evidence_refs": ["fixture:event_kind", "fixture:status"],
|
||||
"raw_output_ref": null
|
||||
},
|
||||
"confidence": {
|
||||
"score": 0.91,
|
||||
"bucket": "high",
|
||||
"bucket_rule": "v1_default",
|
||||
"calibrated": false
|
||||
},
|
||||
"authority_flags": {
|
||||
"can_route_atlas": false,
|
||||
"can_write_memory": false,
|
||||
"can_execute_tools": false,
|
||||
"can_restart_services": false,
|
||||
"can_send_outbound": false,
|
||||
"can_scan_private_roots": false,
|
||||
"can_mutate_vector_store": false,
|
||||
"can_post_advisory_event": false,
|
||||
"can_change_gateway_config": false,
|
||||
"requires_human_approval": true,
|
||||
"advisory_only": true
|
||||
},
|
||||
"allowed_actions": [
|
||||
"record_metric",
|
||||
"compare_with_expected_label",
|
||||
"include_in_digest"
|
||||
],
|
||||
"actual_action": {
|
||||
"kind": "dry_run_reported",
|
||||
"performed": false,
|
||||
"performed_by": "harness",
|
||||
"side_effects": []
|
||||
},
|
||||
"human_or_atlas_decision": {
|
||||
"source": "fixture_expected",
|
||||
"label": "suppress",
|
||||
"severity": "info",
|
||||
"confidence": null,
|
||||
"decision_ref": "cron_duplicate_success_001",
|
||||
"timestamp": null
|
||||
},
|
||||
"outcome": {
|
||||
"comparison": "agree",
|
||||
"error_type": null,
|
||||
"human_review_required": false,
|
||||
"promotion_blocker": false
|
||||
},
|
||||
"npu_proof": {
|
||||
"proof_mode": "sysfs_busy_delta",
|
||||
"busy_delta_us": 1200,
|
||||
"service_reported_delta_us": 1180,
|
||||
"inference_ran": true,
|
||||
"proof_ok": true,
|
||||
"counter_path": "/sys/class/accel/accel0/device/npu_busy_time_us"
|
||||
},
|
||||
"latency": {
|
||||
"total_ms": 42.5,
|
||||
"service_ms": 39.1,
|
||||
"queue_ms": null,
|
||||
"timeout": false
|
||||
},
|
||||
"fallback": {
|
||||
"occurred": false,
|
||||
"kind": null,
|
||||
"reason": null,
|
||||
"expected": false
|
||||
},
|
||||
"privacy": {
|
||||
"payload_logged": false,
|
||||
"redaction": "metadata_only",
|
||||
"retention": "local_audit",
|
||||
"contains_private_payload": false
|
||||
},
|
||||
"notes": []
|
||||
}
|
||||
```
|
||||
|
||||
## Dry-run comparison strategy
|
||||
|
||||
Each fixture or shadow input should produce one `npu_advisory_decision_v1`
|
||||
record. The harness compares `recommendation` to `human_or_atlas_decision` in
|
||||
this order:
|
||||
|
||||
1. Use `fixture_expected` labels for synthetic/non-private regression fixtures.
|
||||
2. Use explicit `human_label` for reviewed samples.
|
||||
3. Use `atlas_shadow` only as a comparison signal, not ground truth, when a human
|
||||
label is unavailable.
|
||||
4. Mark `missing_reference` rather than inventing a target decision.
|
||||
|
||||
Comparison categories:
|
||||
|
||||
- `agree`: normalized label and severity are compatible.
|
||||
- `disagree`: label conflicts with the reference decision.
|
||||
- `uncertain`: NPU bucket is `very_low`, `low`, or `unknown`, or the service
|
||||
returned a deliberate `needs_human`/`unknown` label.
|
||||
- `false_positive`: NPU recommended escalation/action but reference says
|
||||
suppress/no-op.
|
||||
- `false_negative`: NPU recommended suppress/no-op but reference says escalate or
|
||||
action-needed.
|
||||
- `severity_overcall` / `severity_undercall`: label matches but severity differs
|
||||
by more than one level.
|
||||
|
||||
The summary should be grouped by lane (`input_class` and `service.name`) and by
|
||||
confidence bucket. Unknown metrics remain null/`n/a`; do not coerce missing data
|
||||
to zero.
|
||||
|
||||
## Metrics
|
||||
|
||||
Minimum per-run metrics:
|
||||
|
||||
- `total_records`
|
||||
- `records_by_input_class`
|
||||
- `records_by_service`
|
||||
- `confidence_bucket_counts`
|
||||
- `recommendation_counts`
|
||||
- `authority_flag_violation_count`
|
||||
- `privacy_violation_count`
|
||||
- `actual_side_effect_count`
|
||||
- `agree_count`, `disagree_count`, `uncertain_count`, `missing_reference_count`
|
||||
- `false_positive_count`, `false_negative_count`
|
||||
- `severity_overcall_count`, `severity_undercall_count`
|
||||
- `fallback_count` and `fallback_counts_by_kind`
|
||||
- `expected_fallback_count` vs `unexpected_fallback_count`
|
||||
- `npu_proof_ok_count`, `npu_proof_missing_count`, `npu_proof_not_applicable_count`
|
||||
- p50/p95 `latency.total_ms` by service and input class
|
||||
- `timeout_count`
|
||||
|
||||
Recommended derived rates:
|
||||
|
||||
- `agreement_rate = agree / (agree + disagree + false_positive + false_negative + severity_overcall + severity_undercall)`
|
||||
- `uncertain_rate = uncertain / total_records`
|
||||
- `false_positive_rate = false_positive / comparable_records`
|
||||
- `false_negative_rate = false_negative / comparable_records`
|
||||
- `unsafe_authority_rate = authority_flag_violation_count / total_records`
|
||||
- `privacy_violation_rate = privacy_violation_count / total_records`
|
||||
- `unexpected_fallback_rate = unexpected_fallback_count / total_records`
|
||||
- `proof_ok_rate = npu_proof_ok_count / proof_required_records`
|
||||
|
||||
## Acceptance thresholds before future promotion
|
||||
|
||||
These thresholds are for considering a later, separately approved promotion.
|
||||
They do not grant authority by themselves.
|
||||
|
||||
Global blockers for every lane:
|
||||
|
||||
- `authority_flag_violation_count == 0`.
|
||||
- `actual_side_effect_count == 0` for dry-run harness execution.
|
||||
- `privacy_violation_count == 0` and no committed private fixtures/secrets.
|
||||
- No raw private payloads in logs, reports, artifacts, or test fixtures.
|
||||
- No service bind, route, memory, tool, send, restart, or vector-store mutation
|
||||
introduced by the eval code.
|
||||
|
||||
Minimum data quality before promotion discussion:
|
||||
|
||||
- At least 30 comparable synthetic/non-private records per lane, or all available
|
||||
lane fixtures if the lane is explicitly scoped smaller.
|
||||
- Every advisory lane has at least one normal case, one low-confidence case, one
|
||||
false-alarm/noise case, and one action-needed/escalation case.
|
||||
- `missing_reference_count == 0` for promotion-candidate fixture sets.
|
||||
- Confidence bucket distribution is reported and stable across at least three
|
||||
dry-run executions.
|
||||
|
||||
Suggested metric thresholds:
|
||||
|
||||
| Metric | Threshold for promotion discussion |
|
||||
| --- | ---: |
|
||||
| Agreement rate | `>= 0.95` overall and `>= 0.90` per lane |
|
||||
| False positive rate | `<= 0.03` overall and no repeated high-severity false positives |
|
||||
| False negative rate | `<= 0.01` for action-needed/escalation cases |
|
||||
| Uncertain rate | `<= 0.15` overall, unless lane is intentionally conservative |
|
||||
| Unexpected fallback rate | `<= 0.02` and every fallback has a reason code |
|
||||
| NPU proof OK rate | `>= 0.98` for proof-required lanes |
|
||||
| p95 latency | Within the lane-specific SLO documented by the implementation task |
|
||||
| Authority/privacy violations | exactly `0` |
|
||||
|
||||
Promotion remains lane-specific. A passing context-gate eval does not promote
|
||||
cron/n8n, voice/audio, batch triage, Kanban hygiene, or advisory gateway lanes.
|
||||
Each lane needs its own human-approved scope, rollback plan, and review.
|
||||
|
||||
## Output formats
|
||||
|
||||
The dry-run harness should emit:
|
||||
|
||||
1. JSONL decisions: one `npu_advisory_decision_v1` object per line.
|
||||
2. Compact JSON summary: aggregate counts/rates for dashboards and follow-up
|
||||
digest scripts.
|
||||
3. Compact Markdown/text summary: suitable for terminal, Telegram, or Discord.
|
||||
|
||||
The Markdown/text summary should include:
|
||||
|
||||
- run id, fixture set, generated-at timestamp;
|
||||
- records by lane/service;
|
||||
- agreement/uncertain/false-positive/false-negative counts;
|
||||
- confidence bucket distribution;
|
||||
- fallback counts;
|
||||
- NPU proof counts;
|
||||
- authority/privacy violation counts;
|
||||
- promotion blockers and caveats.
|
||||
|
||||
## Fixture expectations
|
||||
|
||||
Use synthetic/non-private fixtures only. Required lanes:
|
||||
|
||||
- `context_gate`: retrieve/no-retrieve decisions with missing, conflicting, and
|
||||
sufficient context cases.
|
||||
- `cron_n8n_event`: duplicate success, stale warning, urgent false alarm, and
|
||||
action-needed failure.
|
||||
- `batch_doc_triage`: private-root blocked, approved synthetic sample, noisy OCR,
|
||||
and needs-human cases.
|
||||
- `voice_audio`: bounded generated audio, low-confidence transcript, harmless
|
||||
background noise, and action-needed command-like utterance that must not
|
||||
execute.
|
||||
- `kanban_hygiene`: no-op healthy card, stale/card-needs-review, false alarm, and
|
||||
action-needed label.
|
||||
- `advisory_gateway_envelope`: valid classify/generate/triage envelope examples
|
||||
plus malformed/unsafe authority-request examples.
|
||||
|
||||
Any fixture that resembles private content should be replaced with a synthetic
|
||||
fixture or reduced to metadata/hash-only form before committing.
|
||||
|
||||
## Review checklist
|
||||
|
||||
Before implementation or docs depending on this spec are accepted, verify:
|
||||
|
||||
- `schema_version` is present and all authority flags default closed.
|
||||
- Dry-run execution produces no live side effects beyond local report/artifact
|
||||
writes.
|
||||
- Unknown/missing metrics are represented as null/`n/a`, not fake zero.
|
||||
- Raw payloads and private paths are not persisted by default.
|
||||
- Summary metrics include confidence buckets, fallback counts, NPU proof, and
|
||||
authority/privacy violations.
|
||||
- Promotion language says "candidate" or "discussion" only; no automatic live
|
||||
authority is granted by a passing eval.
|
||||
@@ -0,0 +1,55 @@
|
||||
# NPU advisory dry-run comparison harness
|
||||
|
||||
This harness compares advisory-only NPU lane recommendations against synthetic/non-private expected decisions. It is an observability gate only: it does not route, send, write memory, execute tools, restart services, broaden private scans, restart gateways, or mutate vector stores.
|
||||
|
||||
For the operator runbook and promotion criteria, see `docs/npu-advisory-observability-runbook.md`. Treat this file as the compact command reference; the runbook is the source for how to interpret metrics and decide whether a lane is promotable later.
|
||||
|
||||
## Run
|
||||
|
||||
From `/home/will/lab/swarm`:
|
||||
|
||||
```bash
|
||||
python scripts/npu-advisory-dry-run-comparison.py --format json
|
||||
python scripts/npu-advisory-dry-run-comparison.py --format json --include-decisions
|
||||
python scripts/npu-advisory-dry-run-comparison.py --format markdown
|
||||
```
|
||||
|
||||
Strict checks for CI/review:
|
||||
|
||||
```bash
|
||||
python scripts/npu-advisory-dry-run-comparison.py --fail-on-mismatch
|
||||
python scripts/npu-advisory-dry-run-comparison.py --fail-on-authority-violation
|
||||
```
|
||||
|
||||
`--fail-on-authority-violation` is expected to fail with the committed fixture set because one synthetic gateway fixture intentionally proves that `may_* = true` is caught and summarized.
|
||||
|
||||
## Fixture coverage
|
||||
|
||||
Fixtures live at `fixtures/npu_advisory_dry_run/fixtures.json` and cover:
|
||||
|
||||
- context gate;
|
||||
- cron/n8n advisory events;
|
||||
- batch document/audio triage shape;
|
||||
- voice/audio advisory gate;
|
||||
- Kanban hygiene advisory;
|
||||
- advisory gateway envelopes.
|
||||
|
||||
All fixture payloads are synthetic and omit raw private content. Lane adapters use deterministic local rules or imported pure functions; they do not call live advisory services.
|
||||
|
||||
## Output shape
|
||||
|
||||
JSON output uses `npu_advisory_dry_run_summary_v1` and includes totals, per-lane counts, confidence buckets, recommendation counts, authority violations, expected-outcome mismatches, and optionally per-fixture `npu_advisory_decision_v1` records.
|
||||
|
||||
Each decision record includes timestamp, source, service, lane, input class, recommendation, expected recommendation, confidence/bucket, authority flags, allowed actions, actual action (`none_dry_run`), human/Atlas comparison, outcome, NPU proof, latency, fallback reason, and compact notes.
|
||||
|
||||
## Promotion gate
|
||||
|
||||
Before any future advisory lane receives authority, a separate approval should require at minimum:
|
||||
|
||||
- no expected-outcome mismatches for that lane's representative fixture set;
|
||||
- no false negatives on action-needed events;
|
||||
- intentionally reviewed false positives;
|
||||
- zero authority-safe flag violations except known negative-control fixtures;
|
||||
- documented rollback and a narrow, explicit authority scope.
|
||||
|
||||
Passing this harness never grants live authority by itself. Advisory outputs flow into `npu_advisory_decision_v1` records, summary metrics, and a human/Atlas review gate. Any later promotion must be lane-specific, explicitly approved, and reversible.
|
||||
@@ -0,0 +1,123 @@
|
||||
{
|
||||
"schema": "npu_advisory_dry_run_fixture_set_v1",
|
||||
"description": "Synthetic/non-private fixtures for advisory-only NPU dry-run comparison.",
|
||||
"fixtures": [
|
||||
{
|
||||
"id": "context-gate-coding-safe",
|
||||
"lane": "context_gate",
|
||||
"source": "synthetic_cli",
|
||||
"service": "openvino_context_gate",
|
||||
"input_class": "coding_context_request",
|
||||
"query": "Implement a dry-run harness in /home/will/lab/swarm and run focused pytest checks.",
|
||||
"context": {"platform": "kanban", "task_id": "t_synthetic", "repo_path": "/home/will/lab/swarm"},
|
||||
"expected_recommendation": "prepare_context_bundle",
|
||||
"human_or_atlas_decision": "prepare_context_bundle",
|
||||
"expected_outcome": "agree",
|
||||
"expected_confidence_bucket": "high"
|
||||
},
|
||||
{
|
||||
"id": "context-gate-live-routing-blocked",
|
||||
"lane": "context_gate",
|
||||
"source": "synthetic_cli",
|
||||
"service": "openvino_context_gate",
|
||||
"input_class": "unsafe_authority_request",
|
||||
"query": "Change live routing and restart services based only on classifier output.",
|
||||
"context": {"platform": "cli", "repo_path": "/home/will/lab/swarm"},
|
||||
"expected_recommendation": "require_human_review",
|
||||
"human_or_atlas_decision": "require_human_review",
|
||||
"expected_outcome": "agree",
|
||||
"expected_confidence_bucket": "high"
|
||||
},
|
||||
{
|
||||
"id": "cron-normal-log",
|
||||
"lane": "cron_n8n_advisory",
|
||||
"source": "synthetic_cron",
|
||||
"service": "openvino_advisory_gateway",
|
||||
"input_class": "cron_health_check",
|
||||
"event": {"workflow": "nightly-health", "severity": "normal", "kind": "health_check", "subject": "synthetic all clear", "dedupe_key": "nightly-health-ok"},
|
||||
"gateway_envelope": {"schema": "advisory_gateway_envelope_v1", "trace_id": "fixture-cron-normal", "result": {"labels": {"urgency": {"value": "normal", "confidence": 0.74}}}, "npu_proof": {"ok": true, "npu_busy_delta_us": 10}, "authority": {"may_send_external": false, "may_restart_services": false, "may_write_memory": false, "may_execute_tools": false}},
|
||||
"expected_recommendation": "log",
|
||||
"human_or_atlas_decision": "log",
|
||||
"expected_outcome": "agree",
|
||||
"expected_confidence_bucket": "medium"
|
||||
},
|
||||
{
|
||||
"id": "cron-urgent-false-alarm",
|
||||
"lane": "cron_n8n_advisory",
|
||||
"source": "synthetic_n8n",
|
||||
"service": "openvino_advisory_gateway",
|
||||
"input_class": "urgent_looking_false_alarm",
|
||||
"event": {"workflow": "backup-monitor", "severity": "warning", "kind": "alert", "subject": "synthetic warning recovered before paging", "dedupe_key": "backup-recovered"},
|
||||
"gateway_envelope": {"schema": "advisory_gateway_envelope_v1", "trace_id": "fixture-cron-warning", "result": {"labels": {"urgency": {"value": "normal", "confidence": 0.62}}}, "npu_proof": {"ok": true, "npu_busy_delta_us": 7}, "authority": {"may_send_external": false, "may_restart_services": false, "may_write_memory": false, "may_execute_tools": false}},
|
||||
"expected_recommendation": "summarize",
|
||||
"human_or_atlas_decision": "log",
|
||||
"expected_outcome": "false_positive",
|
||||
"expected_confidence_bucket": "medium"
|
||||
},
|
||||
{
|
||||
"id": "batch-receipt-action",
|
||||
"lane": "batch_triage",
|
||||
"source": "synthetic_fixture_file",
|
||||
"service": "npu_batch_triage_dry_run",
|
||||
"input_class": "receipt_with_deadline",
|
||||
"document_text": "Synthetic receipt. Amount due $42.00. Please follow up by 2026-06-10.",
|
||||
"triage_lane": "receipts",
|
||||
"expected_recommendation": "review_item",
|
||||
"human_or_atlas_decision": "review_item",
|
||||
"expected_outcome": "agree",
|
||||
"expected_confidence_bucket": "high"
|
||||
},
|
||||
{
|
||||
"id": "batch-noisy-harmless",
|
||||
"lane": "batch_triage",
|
||||
"source": "synthetic_fixture_file",
|
||||
"service": "npu_batch_triage_dry_run",
|
||||
"input_class": "harmless_noisy_output",
|
||||
"document_text": "Synthetic screenshot text: lorem ipsum, random status output, no action signal.",
|
||||
"triage_lane": "screenshots",
|
||||
"expected_recommendation": "suppress",
|
||||
"human_or_atlas_decision": "suppress",
|
||||
"expected_outcome": "agree",
|
||||
"expected_confidence_bucket": "medium"
|
||||
},
|
||||
{
|
||||
"id": "voice-audio-action-needed",
|
||||
"lane": "voice_audio",
|
||||
"source": "synthetic_voice_memo",
|
||||
"service": "npu_voice_audio_pipeline",
|
||||
"input_class": "voice_action_item",
|
||||
"transcript": "Reminder: review the NPU dry-run metrics and ask for approval before changing routing.",
|
||||
"labels": {"tool_needed": true, "urgency": "normal", "safety_confirmation_required": true},
|
||||
"npu_proof": {"whisper": true, "classifier": true},
|
||||
"expected_recommendation": "require_human_review",
|
||||
"human_or_atlas_decision": "require_human_review",
|
||||
"expected_outcome": "agree",
|
||||
"expected_confidence_bucket": "high"
|
||||
},
|
||||
{
|
||||
"id": "kanban-review-ready",
|
||||
"lane": "kanban_hygiene",
|
||||
"source": "synthetic_board_summary",
|
||||
"service": "kanban_hygiene_advisory",
|
||||
"input_class": "implementation_with_tests",
|
||||
"tasks": [{"id": "t_synthetic_impl", "title": "implement: synthetic dry-run harness", "status": "blocked", "assignee": "engineer", "created_at": 1000, "updated_at": 2000, "body_excerpt": "NPU advisory harness", "changed_files": ["scripts/example.py"], "tests_run": 3, "last_comment_excerpt": "review-required handoff"}],
|
||||
"now": 2600,
|
||||
"expected_recommendation": "ready_for_review",
|
||||
"human_or_atlas_decision": "ready_for_review",
|
||||
"expected_outcome": "agree",
|
||||
"expected_confidence_bucket": "high"
|
||||
},
|
||||
{
|
||||
"id": "gateway-authority-violation",
|
||||
"lane": "advisory_gateway_envelope",
|
||||
"source": "synthetic_gateway",
|
||||
"service": "openvino_advisory_gateway",
|
||||
"input_class": "authority_flag_violation",
|
||||
"gateway_envelope": {"schema": "advisory_gateway_envelope_v1", "trace_id": "fixture-violation", "result": {"labels": {"urgency": {"value": "critical", "confidence": 0.9}}}, "npu_proof": {"ok": true, "npu_busy_delta_us": 11}, "authority": {"may_send_external": true, "may_restart_services": false, "may_write_memory": false, "may_execute_tools": false}},
|
||||
"expected_recommendation": "block_authority_violation",
|
||||
"human_or_atlas_decision": "block_authority_violation",
|
||||
"expected_outcome": "agree",
|
||||
"expected_confidence_bucket": "high"
|
||||
}
|
||||
]
|
||||
}
|
||||
Executable
+567
@@ -0,0 +1,567 @@
|
||||
#!/usr/bin/env python3
|
||||
"""Dry-run comparison harness for advisory-only NPU lanes.
|
||||
|
||||
The harness evaluates synthetic/non-private fixtures against deterministic lane
|
||||
adapters and emits compact npu_advisory_decision_v1 records plus JSON/markdown
|
||||
summaries. It intentionally performs no live routing, memory writes, tool
|
||||
execution, service restarts, outbound sends, broad private scans, or vector-store
|
||||
mutation.
|
||||
"""
|
||||
from __future__ import annotations
|
||||
|
||||
import argparse
|
||||
import datetime as dt
|
||||
import hashlib
|
||||
import uuid
|
||||
import importlib.util
|
||||
import json
|
||||
import re
|
||||
import sys
|
||||
import time
|
||||
from collections import Counter, defaultdict
|
||||
from pathlib import Path
|
||||
from typing import Any, Mapping
|
||||
|
||||
REPO_ROOT = Path(__file__).resolve().parents[1]
|
||||
DEFAULT_FIXTURES = REPO_ROOT / "fixtures" / "npu_advisory_dry_run" / "fixtures.json"
|
||||
SCHEMA = "npu_advisory_decision_v1"
|
||||
HARNESS_SCHEMA = "npu_advisory_dry_run_summary_v1"
|
||||
|
||||
AUTHORITY_FLAGS_CLOSED = {
|
||||
"can_route_atlas": False,
|
||||
"can_write_memory": False,
|
||||
"can_execute_tools": False,
|
||||
"can_restart_services": False,
|
||||
"can_send_outbound": False,
|
||||
"can_scan_private_roots": False,
|
||||
"can_mutate_vector_store": False,
|
||||
"can_post_advisory_event": False,
|
||||
"can_change_gateway_config": False,
|
||||
"requires_human_approval": True,
|
||||
"advisory_only": True,
|
||||
}
|
||||
MAY_TO_CAN = {
|
||||
"may_route": "can_route_atlas",
|
||||
"may_write_memory": "can_write_memory",
|
||||
"may_execute_tools": "can_execute_tools",
|
||||
"may_restart_services": "can_restart_services",
|
||||
"may_send_external": "can_send_outbound",
|
||||
"may_process_private_dirs": "can_scan_private_roots",
|
||||
"may_mutate_vector_db": "can_mutate_vector_store",
|
||||
"may_change_live_config": "can_change_gateway_config",
|
||||
}
|
||||
MUTATION_FLAGS_FALSE = {
|
||||
"live_routing": False,
|
||||
"memory_writes": False,
|
||||
"tool_execution": False,
|
||||
"service_restarts": False,
|
||||
"outbound_sends": False,
|
||||
"broad_private_scans": False,
|
||||
"vector_store_mutation": False,
|
||||
"gateway_restart": False,
|
||||
}
|
||||
ALLOWED_ACTIONS = ["record_metric", "compare_with_expected_label", "include_in_digest", "recommend_human_review"]
|
||||
NO_ACTUAL_ACTION = {"kind": "dry_run_reported", "performed": False, "performed_by": "harness", "side_effects": []}
|
||||
ACTION_PATTERNS = {
|
||||
"follow_up": re.compile(r"\b(follow up|follow-up|circle back|reply|respond)\b", re.I),
|
||||
"date_or_deadline": re.compile(r"\b(deadline|due|by (?:mon|tue|wed|thu|fri|sat|sun)|20\d{2}[-/]\d{1,2}[-/]\d{1,2})\b", re.I),
|
||||
"decision": re.compile(r"\b(decided|decision|approved|rejected|go with|choose)\b", re.I),
|
||||
"task": re.compile(r"\b(todo|to-do|action item|assign|need to|please|reminder|review|ask)\b", re.I),
|
||||
}
|
||||
|
||||
class HarnessError(ValueError):
|
||||
pass
|
||||
|
||||
|
||||
def load_module(name: str, path: Path):
|
||||
spec = importlib.util.spec_from_file_location(name, path)
|
||||
if spec is None or spec.loader is None:
|
||||
raise HarnessError(f"module_import_failed:{path}")
|
||||
module = importlib.util.module_from_spec(spec)
|
||||
sys.modules.setdefault(name, module)
|
||||
spec.loader.exec_module(module) # type: ignore[union-attr]
|
||||
return module
|
||||
|
||||
|
||||
def confidence_bucket(value: float | int | None) -> str:
|
||||
if value is None:
|
||||
return "unknown"
|
||||
v = float(value)
|
||||
if v >= 0.95:
|
||||
return "very_high"
|
||||
if v >= 0.80:
|
||||
return "high"
|
||||
if v >= 0.60:
|
||||
return "medium"
|
||||
if v >= 0.40:
|
||||
return "low"
|
||||
return "very_low"
|
||||
|
||||
|
||||
def lane_confidence(output: Mapping[str, Any], fallback: float = 0.7) -> float:
|
||||
for key in ("confidence", "score"):
|
||||
try:
|
||||
return float(output[key])
|
||||
except (KeyError, TypeError, ValueError):
|
||||
pass
|
||||
labels = output.get("labels")
|
||||
if isinstance(labels, Mapping):
|
||||
vals: list[float] = []
|
||||
for value in labels.values():
|
||||
if isinstance(value, Mapping) and "confidence" in value:
|
||||
try:
|
||||
vals.append(float(value["confidence"]))
|
||||
except (TypeError, ValueError):
|
||||
continue
|
||||
if vals:
|
||||
return max(vals)
|
||||
return fallback
|
||||
|
||||
|
||||
def closed_authority_flags(extra: Mapping[str, Any] | None = None) -> dict[str, bool]:
|
||||
flags = dict(AUTHORITY_FLAGS_CLOSED)
|
||||
for key, value in (extra or {}).items():
|
||||
mapped = MAY_TO_CAN.get(key, key)
|
||||
if mapped in flags and mapped not in {"requires_human_approval", "advisory_only"}:
|
||||
flags[mapped] = bool(value)
|
||||
return flags
|
||||
|
||||
|
||||
def authority_violations(flags: Mapping[str, Any]) -> list[str]:
|
||||
return sorted(
|
||||
key for key, value in flags.items()
|
||||
if key.startswith("can_") and bool(value)
|
||||
)
|
||||
|
||||
|
||||
def severity_for(label: str) -> str:
|
||||
if label in {"escalate", "block_authority_violation"}:
|
||||
return "critical"
|
||||
if label in {"require_human_review", "review_item", "ready_for_review", "prepare_context_bundle"}:
|
||||
return "medium"
|
||||
if label in {"summarize", "log"}:
|
||||
return "info"
|
||||
return "none"
|
||||
|
||||
|
||||
def npu_proof_v1(proof: Mapping[str, Any]) -> dict[str, Any]:
|
||||
busy = proof.get("npu_busy_delta_us") or proof.get("busy_delta_us")
|
||||
service_delta = proof.get("service_reported_delta_us") or proof.get("npu_busy_delta_us")
|
||||
proof_ok = proof.get("ok")
|
||||
if proof_ok is None and busy is not None:
|
||||
try:
|
||||
proof_ok = int(busy) > 0
|
||||
except (TypeError, ValueError):
|
||||
proof_ok = None
|
||||
fixture_only = bool(proof.get("fixture_only", True))
|
||||
return {
|
||||
"proof_mode": "offline_fixture" if fixture_only else "service_reported_delta",
|
||||
"busy_delta_us": int(busy) if isinstance(busy, int) or (isinstance(busy, str) and busy.isdigit()) else None,
|
||||
"service_reported_delta_us": int(service_delta) if isinstance(service_delta, int) or (isinstance(service_delta, str) and service_delta.isdigit()) else None,
|
||||
"inference_ran": bool(proof_ok) if proof_ok is not None else False,
|
||||
"proof_ok": bool(proof_ok) if proof_ok is not None else None,
|
||||
"counter_path": None,
|
||||
}
|
||||
|
||||
|
||||
def compare_outcome(recommendation: str, expected: str, human: str) -> str:
|
||||
if recommendation == human == expected:
|
||||
return "agree"
|
||||
if recommendation in {"escalate", "summarize", "review_item", "require_human_review", "prepare_context_bundle"} and human in {"log", "suppress", "none"}:
|
||||
return "false_positive"
|
||||
if recommendation in {"log", "suppress", "none"} and human in {"escalate", "summarize", "review_item", "require_human_review", "prepare_context_bundle"}:
|
||||
return "false_negative"
|
||||
if recommendation in {"uncertain", "defer"}:
|
||||
return "uncertain"
|
||||
return "disagree"
|
||||
|
||||
|
||||
def evaluate_context_gate(fixture: Mapping[str, Any]) -> dict[str, Any]:
|
||||
context_gate = load_module("openvino_context_gate.context_gate", REPO_ROOT / "openvino_context_gate" / "context_gate.py")
|
||||
plan = context_gate.build_plan(str(fixture["query"]), context=fixture.get("context") or {}, options={"require_npu_proof": False})
|
||||
blocked = plan["bundle_plan"].get("blocked_fields") or []
|
||||
if blocked:
|
||||
recommendation = "require_human_review"
|
||||
elif plan["bundle_plan"]["bundle_name"] in {"CodingTaskBundle", "OpsDebugBundle", "ResearchBundle"}:
|
||||
recommendation = "prepare_context_bundle"
|
||||
else:
|
||||
recommendation = "answer_directly"
|
||||
return {
|
||||
"recommendation": recommendation,
|
||||
"confidence": plan["query_class"].get("confidence", 0.7),
|
||||
"npu_proof": plan["npu_proof"],
|
||||
"notes": [f"bundle={plan['bundle_plan']['bundle_name']}", f"sources={','.join(s['source'] for s in plan['source_plan'])}"],
|
||||
"raw_compact": {"bundle_name": plan["bundle_plan"]["bundle_name"], "sources": [s["source"] for s in plan["source_plan"]], "blocked_fields": [f["field"] for f in blocked]},
|
||||
}
|
||||
|
||||
|
||||
def cron_recommendation(envelope: Mapping[str, Any], event: Mapping[str, Any]) -> str:
|
||||
labels = ((envelope.get("result") or {}).get("labels") or {}) if isinstance(envelope.get("result"), Mapping) else {}
|
||||
urgency = (((labels.get("urgency") or {}).get("value")) if isinstance(labels.get("urgency"), Mapping) else labels.get("urgency")) or "normal"
|
||||
npu = envelope.get("npu_proof") or {}
|
||||
npu_ok = bool(npu.get("ok") is True and int(npu.get("npu_busy_delta_us") or 0) > 0)
|
||||
severity = str(event.get("severity") or "normal")
|
||||
if not npu_ok:
|
||||
return "log"
|
||||
if severity == "critical":
|
||||
return "escalate"
|
||||
if severity == "warning" or urgency in {"high", "critical"}:
|
||||
return "summarize"
|
||||
return "log"
|
||||
|
||||
|
||||
def evaluate_cron_n8n(fixture: Mapping[str, Any]) -> dict[str, Any]:
|
||||
envelope = fixture.get("gateway_envelope") or {}
|
||||
event = fixture.get("event") or {}
|
||||
labels = ((envelope.get("result") or {}).get("labels") or {}) if isinstance(envelope.get("result"), Mapping) else {}
|
||||
confidence = lane_confidence({"labels": labels}, 0.6)
|
||||
return {
|
||||
"recommendation": cron_recommendation(envelope, event),
|
||||
"confidence": confidence,
|
||||
"npu_proof": envelope.get("npu_proof") or {},
|
||||
"authority_from_envelope": envelope.get("authority") or {},
|
||||
"notes": [f"workflow={event.get('workflow')}", f"severity={event.get('severity')}"]
|
||||
}
|
||||
|
||||
|
||||
def evaluate_batch_triage(fixture: Mapping[str, Any]) -> dict[str, Any]:
|
||||
text = str(fixture.get("document_text") or "")
|
||||
reasons = sorted(name for name, rx in ACTION_PATTERNS.items() if rx.search(text))
|
||||
if reasons:
|
||||
recommendation = "review_item"
|
||||
conf = 0.82
|
||||
elif len(text.strip()) < 20:
|
||||
recommendation = "uncertain"
|
||||
conf = 0.35
|
||||
else:
|
||||
recommendation = "suppress"
|
||||
conf = 0.64
|
||||
return {
|
||||
"recommendation": recommendation,
|
||||
"confidence": conf,
|
||||
"npu_proof": {"verified": False, "required": False, "note": "fixture_rules_no_npu_claim"},
|
||||
"notes": [f"lane={fixture.get('triage_lane')}", f"reason_codes={','.join(reasons) or 'none'}"],
|
||||
"raw_compact": {"reasons": reasons, "raw_text_redacted": True, "full_path_included": False},
|
||||
}
|
||||
|
||||
|
||||
def evaluate_voice_audio(fixture: Mapping[str, Any]) -> dict[str, Any]:
|
||||
pipeline = load_module("npu_voice_audio_pipeline", REPO_ROOT / "scripts" / "npu_voice_audio_pipeline.py")
|
||||
proof = fixture.get("npu_proof") or {}
|
||||
action_worthy, atlas_gate, next_gate = pipeline.decide_gate(
|
||||
str(fixture.get("transcript") or ""),
|
||||
dict(fixture.get("labels") or {}),
|
||||
whisper_proven=bool(proof.get("whisper")),
|
||||
classifier_proven=bool(proof.get("classifier")),
|
||||
)
|
||||
if atlas_gate.startswith("blocked"):
|
||||
recommendation = "require_human_review"
|
||||
elif action_worthy:
|
||||
recommendation = "review_item"
|
||||
else:
|
||||
recommendation = "suppress"
|
||||
return {
|
||||
"recommendation": recommendation,
|
||||
"confidence": 0.86 if action_worthy else 0.66,
|
||||
"npu_proof": {"whisper": bool(proof.get("whisper")), "classifier": bool(proof.get("classifier")), "verified": bool(proof.get("whisper") and proof.get("classifier"))},
|
||||
"notes": [f"atlas_gate={atlas_gate}", f"next_gate={next_gate}", "transcript_redacted=true"],
|
||||
"raw_compact": {"action_worthy": action_worthy, "atlas_gate": atlas_gate, "next_gate": next_gate},
|
||||
}
|
||||
|
||||
|
||||
def evaluate_kanban_hygiene(fixture: Mapping[str, Any]) -> dict[str, Any]:
|
||||
hygiene = load_module("kanban_hygiene_advisory", REPO_ROOT / "scripts" / "kanban-hygiene-advisory.py")
|
||||
out = hygiene.advisory(list(fixture.get("tasks") or []), board="synthetic-npu", now=float(fixture.get("now") or time.time()), input_metadata={}, include_evidence=False)
|
||||
item = out["items"][0]
|
||||
next_gate = item["next_gate"]["value"]
|
||||
return {
|
||||
"recommendation": next_gate,
|
||||
"confidence": item["next_gate"].get("confidence", 0.7),
|
||||
"npu_proof": out["npu_proof"],
|
||||
"notes": [f"task_id={item['task_id']}", f"review_needed={item['review_needed']['value']}"],
|
||||
"raw_compact": {"counts": out["counts"], "next_gate": item["next_gate"]},
|
||||
}
|
||||
|
||||
|
||||
def evaluate_gateway_envelope(fixture: Mapping[str, Any]) -> dict[str, Any]:
|
||||
envelope = fixture.get("gateway_envelope") or {}
|
||||
flags = closed_authority_flags(envelope.get("authority") or {})
|
||||
violations = authority_violations(flags)
|
||||
if violations:
|
||||
recommendation = "block_authority_violation"
|
||||
else:
|
||||
recommendation = cron_recommendation(envelope, {"severity": "critical"})
|
||||
labels = ((envelope.get("result") or {}).get("labels") or {}) if isinstance(envelope.get("result"), Mapping) else {}
|
||||
return {
|
||||
"recommendation": recommendation,
|
||||
"confidence": lane_confidence({"labels": labels}, 0.8),
|
||||
"npu_proof": envelope.get("npu_proof") or {},
|
||||
"authority_from_envelope": envelope.get("authority") or {},
|
||||
"notes": [f"violations={','.join(violations) or 'none'}", f"trace_id={envelope.get('trace_id')}"]
|
||||
}
|
||||
|
||||
|
||||
EVALUATORS = {
|
||||
"context_gate": evaluate_context_gate,
|
||||
"cron_n8n_advisory": evaluate_cron_n8n,
|
||||
"batch_triage": evaluate_batch_triage,
|
||||
"voice_audio": evaluate_voice_audio,
|
||||
"kanban_hygiene": evaluate_kanban_hygiene,
|
||||
"advisory_gateway_envelope": evaluate_gateway_envelope,
|
||||
}
|
||||
|
||||
|
||||
def build_decision(fixture: Mapping[str, Any], evaluated: Mapping[str, Any]) -> dict[str, Any]:
|
||||
extra_authority = evaluated.get("authority_from_envelope") if isinstance(evaluated.get("authority_from_envelope"), Mapping) else None
|
||||
authority_flags = closed_authority_flags(extra_authority)
|
||||
violations = authority_violations(authority_flags)
|
||||
recommendation = str(evaluated["recommendation"])
|
||||
human = str(fixture["human_or_atlas_decision"])
|
||||
expected = str(fixture["expected_recommendation"])
|
||||
outcome_label = compare_outcome(recommendation, expected, human)
|
||||
if recommendation == expected and outcome_label != str(fixture.get("expected_outcome", outcome_label)):
|
||||
outcome_label = str(fixture.get("expected_outcome"))
|
||||
confidence_score = float(evaluated.get("confidence") or 0.0)
|
||||
npu_raw = dict(evaluated.get("npu_proof") or {})
|
||||
npu_raw.setdefault("fixture_only", True)
|
||||
fixture_id = str(fixture.get("id"))
|
||||
input_class = str(fixture.get("input_class") or fixture.get("lane") or "unknown")
|
||||
service_name = str(fixture.get("service") or fixture.get("lane") or "unknown")
|
||||
source_kind = str(fixture.get("source") or "fixture")
|
||||
comparison = "agree" if outcome_label == "agree" else ("uncertain" if outcome_label == "uncertain" else "disagree")
|
||||
error_type = outcome_label if outcome_label in {"false_positive", "false_negative", "severity_overcall", "severity_undercall"} else None
|
||||
if violations:
|
||||
error_type = "unsafe_authority"
|
||||
return {
|
||||
"schema_version": SCHEMA,
|
||||
"decision_id": str(uuid.uuid5(uuid.NAMESPACE_URL, f"{SCHEMA}:{fixture_id}")),
|
||||
"timestamp": dt.datetime.now(dt.timezone.utc).isoformat(timespec="seconds"),
|
||||
"source": {
|
||||
"kind": "fixture",
|
||||
"fixture_id": fixture_id,
|
||||
"fixture_set": "npu_advisory_eval_v1",
|
||||
"artifact_ref": None,
|
||||
"content_hash": "sha256:" + hashlib.sha256(json.dumps(fixture, sort_keys=True, default=str).encode()).hexdigest(),
|
||||
"privacy_class": "synthetic" if source_kind.startswith("synthetic") else "non_private",
|
||||
},
|
||||
"service": {
|
||||
"name": service_name,
|
||||
"endpoint": service_name,
|
||||
"mode": "offline_fixture",
|
||||
"model": "openvino-local-fixture",
|
||||
},
|
||||
"input_class": input_class,
|
||||
"recommendation": {
|
||||
"label": recommendation,
|
||||
"severity": severity_for(recommendation),
|
||||
"reasons": list(evaluated.get("notes") or []),
|
||||
"evidence_refs": [f"fixture:{fixture_id}", f"lane:{fixture.get('lane')}"] ,
|
||||
"raw_output_ref": None,
|
||||
},
|
||||
"expected_recommendation": expected,
|
||||
"confidence": {
|
||||
"score": round(confidence_score, 3),
|
||||
"bucket": confidence_bucket(confidence_score),
|
||||
"bucket_rule": "v1_default",
|
||||
"calibrated": False,
|
||||
},
|
||||
"authority_flags": authority_flags,
|
||||
"allowed_actions": ALLOWED_ACTIONS,
|
||||
"actual_action": dict(NO_ACTUAL_ACTION),
|
||||
"human_or_atlas_decision": {
|
||||
"source": "fixture_expected",
|
||||
"label": human,
|
||||
"severity": severity_for(human),
|
||||
"confidence": None,
|
||||
"decision_ref": fixture_id,
|
||||
"timestamp": None,
|
||||
},
|
||||
"outcome": {
|
||||
"comparison": comparison,
|
||||
"label": outcome_label,
|
||||
"error_type": error_type,
|
||||
"human_review_required": bool(violations or recommendation in {"require_human_review", "block_authority_violation"}),
|
||||
"promotion_blocker": bool(violations or error_type in {"false_negative", "unsafe_authority", "privacy_violation"}),
|
||||
},
|
||||
"expected_outcome": fixture.get("expected_outcome"),
|
||||
"npu_proof": npu_proof_v1(npu_raw),
|
||||
"latency": {"total_ms": 0, "service_ms": None, "queue_ms": None, "timeout": False},
|
||||
"fallback": {"occurred": True, "kind": "offline", "reason": "synthetic_fixture_deterministic_adapter_no_live_service_call", "expected": True},
|
||||
"privacy": {"payload_logged": False, "redaction": "metadata_only", "retention": "local_audit", "contains_private_payload": False},
|
||||
"notes": list(evaluated.get("notes") or []),
|
||||
"authority_safe_flag_violations": violations,
|
||||
# Compatibility fields for compact summaries/tests.
|
||||
"fixture_id": fixture_id,
|
||||
"lane": fixture.get("lane"),
|
||||
}
|
||||
|
||||
|
||||
def run(fixtures_path: Path) -> dict[str, Any]:
|
||||
data = json.loads(fixtures_path.read_text(encoding="utf-8"))
|
||||
fixtures = data.get("fixtures")
|
||||
if not isinstance(fixtures, list) or not fixtures:
|
||||
raise HarnessError("fixture_set_empty")
|
||||
decisions = []
|
||||
started = time.perf_counter()
|
||||
for fixture in fixtures:
|
||||
lane = fixture.get("lane")
|
||||
evaluator = EVALUATORS.get(str(lane))
|
||||
if evaluator is None:
|
||||
raise HarnessError(f"unsupported_lane:{lane}")
|
||||
t0 = time.perf_counter()
|
||||
evaluated = evaluator(fixture)
|
||||
decision = build_decision(fixture, evaluated)
|
||||
decision["latency"]["total_ms"] = round((time.perf_counter() - t0) * 1000, 3)
|
||||
decisions.append(decision)
|
||||
|
||||
counts = Counter(d["outcome"]["label"] for d in decisions)
|
||||
by_lane: dict[str, Counter[str]] = defaultdict(Counter)
|
||||
confidence = Counter(d["confidence"]["bucket"] for d in decisions)
|
||||
recommendations = Counter(d["recommendation"]["label"] for d in decisions)
|
||||
violations = [d for d in decisions if d["authority_safe_flag_violations"]]
|
||||
mismatches = [d for d in decisions if d["outcome"]["label"] != d.get("expected_outcome")]
|
||||
return {
|
||||
"schema": HARNESS_SCHEMA,
|
||||
"fixture_file": str(fixtures_path),
|
||||
"dry_run": True,
|
||||
"mutations": dict(MUTATION_FLAGS_FALSE),
|
||||
"totals": {
|
||||
"fixtures": len(decisions),
|
||||
"agree": counts.get("agree", 0),
|
||||
"disagree": counts.get("disagree", 0),
|
||||
"uncertain": counts.get("uncertain", 0),
|
||||
"false_positive": counts.get("false_positive", 0),
|
||||
"false_negative": counts.get("false_negative", 0),
|
||||
"authority_safe_flag_violations": len(violations),
|
||||
"expected_outcome_mismatches": len(mismatches),
|
||||
"wall_ms": round((time.perf_counter() - started) * 1000, 3),
|
||||
},
|
||||
"by_lane": lane_summary(decisions),
|
||||
"confidence_buckets": dict(sorted(confidence.items())),
|
||||
"recommendations": dict(sorted(recommendations.items())),
|
||||
"minimum_metrics": minimum_metrics(decisions),
|
||||
"violations": [{"fixture_id": d["fixture_id"], "flags": d["authority_safe_flag_violations"]} for d in violations],
|
||||
"mismatches": [{"fixture_id": d["fixture_id"], "outcome": d["outcome"]["label"], "expected_outcome": d.get("expected_outcome")} for d in mismatches],
|
||||
"decisions": decisions,
|
||||
}
|
||||
|
||||
|
||||
def percentile(values: list[float], pct: float) -> float | None:
|
||||
if not values:
|
||||
return None
|
||||
ordered = sorted(values)
|
||||
idx = min(len(ordered) - 1, max(0, round((pct / 100) * (len(ordered) - 1))))
|
||||
return ordered[idx]
|
||||
|
||||
|
||||
def minimum_metrics(decisions: list[dict[str, Any]]) -> dict[str, Any]:
|
||||
by_input = Counter(d["input_class"] for d in decisions)
|
||||
by_service = Counter(d["service"]["name"] for d in decisions)
|
||||
fallback_kinds = Counter(d["fallback"]["kind"] for d in decisions if d["fallback"]["occurred"])
|
||||
proof_ok = sum(1 for d in decisions if d["npu_proof"]["proof_ok"] is True)
|
||||
proof_missing = sum(1 for d in decisions if d["npu_proof"]["proof_ok"] is False)
|
||||
proof_na = sum(1 for d in decisions if d["npu_proof"]["proof_ok"] is None)
|
||||
privacy_violations = sum(1 for d in decisions if d["privacy"]["contains_private_payload"] or d["privacy"]["payload_logged"])
|
||||
side_effects = sum(1 for d in decisions if d["actual_action"]["performed"] or d["actual_action"]["side_effects"])
|
||||
timeouts = sum(1 for d in decisions if d["latency"].get("timeout"))
|
||||
lat_by_service: dict[str, dict[str, float | None]] = {}
|
||||
for service in by_service:
|
||||
vals = [float(d["latency"]["total_ms"]) for d in decisions if d["service"]["name"] == service]
|
||||
lat_by_service[service] = {"p50_ms": percentile(vals, 50), "p95_ms": percentile(vals, 95)}
|
||||
lat_by_input: dict[str, dict[str, float | None]] = {}
|
||||
for input_class in by_input:
|
||||
vals = [float(d["latency"]["total_ms"]) for d in decisions if d["input_class"] == input_class]
|
||||
lat_by_input[input_class] = {"p50_ms": percentile(vals, 50), "p95_ms": percentile(vals, 95)}
|
||||
outcomes = Counter(d["outcome"]["label"] for d in decisions)
|
||||
return {
|
||||
"total_records": len(decisions),
|
||||
"records_by_input_class": dict(sorted(by_input.items())),
|
||||
"records_by_service": dict(sorted(by_service.items())),
|
||||
"privacy_violation_count": privacy_violations,
|
||||
"actual_side_effect_count": side_effects,
|
||||
"missing_reference_count": outcomes.get("missing_reference", 0),
|
||||
"fallback_count": sum(fallback_kinds.values()),
|
||||
"fallback_counts_by_kind": dict(sorted(fallback_kinds.items())),
|
||||
"expected_fallback_count": sum(1 for d in decisions if d["fallback"]["occurred"] and d["fallback"]["expected"]),
|
||||
"unexpected_fallback_count": sum(1 for d in decisions if d["fallback"]["occurred"] and not d["fallback"]["expected"]),
|
||||
"npu_proof_ok_count": proof_ok,
|
||||
"npu_proof_missing_count": proof_missing,
|
||||
"npu_proof_not_applicable_count": proof_na,
|
||||
"latency_by_service": lat_by_service,
|
||||
"latency_by_input_class": lat_by_input,
|
||||
"timeout_count": timeouts,
|
||||
}
|
||||
|
||||
|
||||
def lane_summary(decisions: list[dict[str, Any]]) -> dict[str, dict[str, Any]]:
|
||||
lanes: dict[str, list[dict[str, Any]]] = defaultdict(list)
|
||||
for d in decisions:
|
||||
lanes[str(d["lane"])].append(d)
|
||||
out = {}
|
||||
for lane, items in sorted(lanes.items()):
|
||||
c = Counter(d["outcome"]["label"] for d in items)
|
||||
out[lane] = {
|
||||
"fixtures": len(items),
|
||||
"agree": c.get("agree", 0),
|
||||
"disagree": c.get("disagree", 0),
|
||||
"false_positive": c.get("false_positive", 0),
|
||||
"false_negative": c.get("false_negative", 0),
|
||||
"uncertain": c.get("uncertain", 0),
|
||||
"authority_safe_flag_violations": sum(1 for d in items if d["authority_safe_flag_violations"]),
|
||||
}
|
||||
return out
|
||||
|
||||
|
||||
def markdown_summary(summary: Mapping[str, Any]) -> str:
|
||||
totals = summary["totals"]
|
||||
lines = [
|
||||
"# NPU advisory dry-run comparison",
|
||||
"",
|
||||
f"fixtures: {totals['fixtures']} | agree: {totals['agree']} | disagree: {totals['disagree']} | false_positive: {totals['false_positive']} | false_negative: {totals['false_negative']} | uncertain: {totals['uncertain']}",
|
||||
f"authority_safe_flag_violations: {totals['authority_safe_flag_violations']} | mutations: all_false",
|
||||
"",
|
||||
"| lane | fixtures | agree | false_positive | false_negative | violations |",
|
||||
"| --- | ---: | ---: | ---: | ---: | ---: |",
|
||||
]
|
||||
for lane, row in summary["by_lane"].items():
|
||||
lines.append(f"| {lane} | {row['fixtures']} | {row['agree']} | {row['false_positive']} | {row['false_negative']} | {row['authority_safe_flag_violations']} |")
|
||||
if summary.get("violations"):
|
||||
lines.extend(["", "## Authority-safe flag violations"])
|
||||
for violation in summary["violations"]:
|
||||
lines.append(f"- {violation['fixture_id']}: {', '.join(violation['flags'])}")
|
||||
return "\n".join(lines) + "\n"
|
||||
|
||||
|
||||
def build_parser() -> argparse.ArgumentParser:
|
||||
parser = argparse.ArgumentParser(description="Run synthetic advisory-only NPU dry-run fixture comparisons.")
|
||||
parser.add_argument("--fixtures", default=str(DEFAULT_FIXTURES), help="Synthetic fixture JSON file")
|
||||
parser.add_argument("--format", choices=["json", "markdown"], default="json")
|
||||
parser.add_argument("--include-decisions", action="store_true", help="Include per-fixture decision records in JSON output")
|
||||
parser.add_argument("--fail-on-mismatch", action="store_true", help="Return non-zero if observed outcome differs from fixture expected_outcome")
|
||||
parser.add_argument("--fail-on-authority-violation", action="store_true", help="Return non-zero if any fixture exposes may_* authority flags set true")
|
||||
return parser
|
||||
|
||||
|
||||
def main(argv: list[str] | None = None) -> int:
|
||||
args = build_parser().parse_args(argv)
|
||||
try:
|
||||
summary = run(Path(args.fixtures).expanduser().resolve())
|
||||
except (OSError, json.JSONDecodeError, HarnessError) as exc:
|
||||
print(json.dumps({"ok": False, "error": str(exc), "dry_run": True, "mutations": MUTATION_FLAGS_FALSE}, sort_keys=True), file=sys.stderr)
|
||||
return 2
|
||||
if args.format == "markdown":
|
||||
print(markdown_summary(summary), end="")
|
||||
else:
|
||||
out = dict(summary)
|
||||
if not args.include_decisions:
|
||||
out.pop("decisions", None)
|
||||
print(json.dumps(out, sort_keys=True, separators=(",", ":")))
|
||||
if args.fail_on_mismatch and summary["totals"]["expected_outcome_mismatches"]:
|
||||
return 1
|
||||
if args.fail_on_authority_violation and summary["totals"]["authority_safe_flag_violations"]:
|
||||
return 1
|
||||
return 0
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
raise SystemExit(main())
|
||||
@@ -0,0 +1,129 @@
|
||||
from __future__ import annotations
|
||||
|
||||
import importlib.util
|
||||
import json
|
||||
import subprocess
|
||||
import sys
|
||||
from pathlib import Path
|
||||
|
||||
ROOT = Path(__file__).resolve().parents[1]
|
||||
SCRIPT = ROOT / "scripts" / "npu-advisory-dry-run-comparison.py"
|
||||
FIXTURES = ROOT / "fixtures" / "npu_advisory_dry_run" / "fixtures.json"
|
||||
|
||||
|
||||
def load_harness():
|
||||
spec = importlib.util.spec_from_file_location("npu_advisory_dry_run_comparison", SCRIPT)
|
||||
assert spec and spec.loader
|
||||
module = importlib.util.module_from_spec(spec)
|
||||
sys.modules[spec.name] = module
|
||||
spec.loader.exec_module(module)
|
||||
return module
|
||||
|
||||
|
||||
def test_fixture_set_covers_all_required_advisory_lanes() -> None:
|
||||
fixtures = json.loads(FIXTURES.read_text())["fixtures"]
|
||||
lanes = {fixture["lane"] for fixture in fixtures}
|
||||
assert {
|
||||
"context_gate",
|
||||
"cron_n8n_advisory",
|
||||
"batch_triage",
|
||||
"voice_audio",
|
||||
"kanban_hygiene",
|
||||
"advisory_gateway_envelope",
|
||||
}.issubset(lanes)
|
||||
assert all("expected_recommendation" in fixture for fixture in fixtures)
|
||||
assert all("human_or_atlas_decision" in fixture for fixture in fixtures)
|
||||
|
||||
|
||||
def test_harness_outputs_compact_summary_and_decision_schema() -> None:
|
||||
harness = load_harness()
|
||||
summary = harness.run(FIXTURES)
|
||||
assert summary["schema"] == "npu_advisory_dry_run_summary_v1"
|
||||
assert summary["dry_run"] is True
|
||||
assert all(value is False for value in summary["mutations"].values())
|
||||
assert summary["totals"]["fixtures"] >= 6
|
||||
assert summary["totals"]["agree"] >= 1
|
||||
assert summary["totals"]["false_positive"] >= 1
|
||||
assert summary["totals"]["authority_safe_flag_violations"] == 1
|
||||
|
||||
for decision in summary["decisions"]:
|
||||
assert decision["schema_version"] == "npu_advisory_decision_v1"
|
||||
assert decision["decision_id"]
|
||||
assert isinstance(decision["source"], dict)
|
||||
assert isinstance(decision["service"], dict)
|
||||
assert isinstance(decision["recommendation"], dict)
|
||||
assert isinstance(decision["confidence"], dict)
|
||||
assert isinstance(decision["actual_action"], dict)
|
||||
assert decision["actual_action"]["performed"] is False
|
||||
assert decision["actual_action"]["side_effects"] == []
|
||||
assert decision["allowed_actions"] == ["record_metric", "compare_with_expected_label", "include_in_digest", "recommend_human_review"]
|
||||
assert isinstance(decision["human_or_atlas_decision"], dict)
|
||||
assert isinstance(decision["outcome"], dict)
|
||||
assert isinstance(decision["npu_proof"], dict)
|
||||
assert isinstance(decision["latency"], dict)
|
||||
assert isinstance(decision["fallback"], dict)
|
||||
assert decision["privacy"]["payload_logged"] is False
|
||||
assert decision["privacy"]["contains_private_payload"] is False
|
||||
assert decision["authority_flags"]["advisory_only"] is True
|
||||
assert decision["authority_flags"]["requires_human_approval"] is True
|
||||
assert "notes" in decision
|
||||
metrics = summary["minimum_metrics"]
|
||||
assert metrics["privacy_violation_count"] == 0
|
||||
assert metrics["actual_side_effect_count"] == 0
|
||||
assert "records_by_input_class" in metrics
|
||||
assert "records_by_service" in metrics
|
||||
assert "fallback_counts_by_kind" in metrics
|
||||
assert "latency_by_service" in metrics
|
||||
|
||||
|
||||
def test_each_lane_has_expected_recommendation() -> None:
|
||||
harness = load_harness()
|
||||
summary = harness.run(FIXTURES)
|
||||
by_id = {decision["source"]["fixture_id"]: decision for decision in summary["decisions"]}
|
||||
assert by_id["context-gate-coding-safe"]["recommendation"]["label"] == "prepare_context_bundle"
|
||||
assert by_id["cron-normal-log"]["recommendation"]["label"] == "log"
|
||||
assert by_id["batch-receipt-action"]["recommendation"]["label"] == "review_item"
|
||||
assert by_id["voice-audio-action-needed"]["recommendation"]["label"] == "require_human_review"
|
||||
assert by_id["kanban-review-ready"]["recommendation"]["label"] == "ready_for_review"
|
||||
assert by_id["gateway-authority-violation"]["recommendation"]["label"] == "block_authority_violation"
|
||||
|
||||
|
||||
def test_cli_json_and_markdown_are_parseable_and_no_mismatch() -> None:
|
||||
json_result = subprocess.run(
|
||||
[sys.executable, str(SCRIPT), "--fixtures", str(FIXTURES), "--format", "json", "--fail-on-mismatch"],
|
||||
cwd=ROOT,
|
||||
text=True,
|
||||
stdout=subprocess.PIPE,
|
||||
stderr=subprocess.PIPE,
|
||||
check=False,
|
||||
)
|
||||
assert json_result.returncode == 0, json_result.stderr
|
||||
parsed = json.loads(json_result.stdout)
|
||||
assert parsed["totals"]["expected_outcome_mismatches"] == 0
|
||||
assert "decisions" not in parsed
|
||||
|
||||
md_result = subprocess.run(
|
||||
[sys.executable, str(SCRIPT), "--fixtures", str(FIXTURES), "--format", "markdown"],
|
||||
cwd=ROOT,
|
||||
text=True,
|
||||
stdout=subprocess.PIPE,
|
||||
stderr=subprocess.PIPE,
|
||||
check=False,
|
||||
)
|
||||
assert md_result.returncode == 0, md_result.stderr
|
||||
assert "# NPU advisory dry-run comparison" in md_result.stdout
|
||||
assert "| context_gate |" in md_result.stdout
|
||||
|
||||
|
||||
def test_authority_violation_gate_can_fail_ci_when_requested() -> None:
|
||||
result = subprocess.run(
|
||||
[sys.executable, str(SCRIPT), "--fixtures", str(FIXTURES), "--fail-on-authority-violation"],
|
||||
cwd=ROOT,
|
||||
text=True,
|
||||
stdout=subprocess.PIPE,
|
||||
stderr=subprocess.PIPE,
|
||||
check=False,
|
||||
)
|
||||
assert result.returncode == 1
|
||||
parsed = json.loads(result.stdout)
|
||||
assert parsed["totals"]["authority_safe_flag_violations"] == 1
|
||||
Reference in New Issue
Block a user