William Valentin dae2a57124 feat(npu): add advisory dry-run comparison harness

Add npu_advisory_decision_v1 schema, synthetic fixture set, comparison harness, docs, and focused tests for advisory-only NPU evaluation.

2026-06-06 15:30:31 -07:00


NPU advisory dry-run comparison harness

This harness compares advisory-only NPU lane recommendations against synthetic/non-private expected decisions. It is an observability gate only: it does not route, send, write memory, execute tools, restart services, broaden private scans, restart gateways, or mutate vector stores.

For the operator runbook and promotion criteria, see docs/npu-advisory-observability-runbook.md. Treat this file as the compact command reference; the runbook is the source for how to interpret metrics and decide whether a lane is promotable later.

Run

From /home/will/lab/swarm:

python scripts/npu-advisory-dry-run-comparison.py --format json
python scripts/npu-advisory-dry-run-comparison.py --format json --include-decisions
python scripts/npu-advisory-dry-run-comparison.py --format markdown

Strict checks for CI/review:

python scripts/npu-advisory-dry-run-comparison.py --fail-on-mismatch
python scripts/npu-advisory-dry-run-comparison.py --fail-on-authority-violation

--fail-on-authority-violation is expected to fail against the committed fixture set. One synthetic gateway fixture is a deliberate negative control: it sets a may_* flag to true to prove that such a violation is caught and summarized.
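
As a rough illustration of what "caught" could mean here, the sketch below scans one decision record for may_* flags set to true. The authority_flags key, the flag names may_route and may_send, and the lane names are assumptions made for this example; they are not taken from the harness or the npu_advisory_decision_v1 schema.

```python
from typing import Any, Dict, List


def find_authority_violations(decision: Dict[str, Any]) -> List[str]:
    """Return the names of may_* flags that are set to True in one record."""
    # "authority_flags" is a hypothetical key name used for illustration.
    flags = decision.get("authority_flags", {})
    return sorted(
        name
        for name, value in flags.items()
        if name.startswith("may_") and value is True
    )


# Hypothetical records: one advisory-safe, one negative control.
safe = {
    "lane": "context_gate",
    "authority_flags": {"may_route": False, "may_send": False},
}
negative_control = {
    "lane": "gateway",
    "authority_flags": {"may_route": True, "may_send": False},
}

print(find_authority_violations(safe))
print(find_authority_violations(negative_control))
```

A strict mode built on a check like this would exit non-zero whenever any record returns a non-empty list, which is why a committed negative control makes the strict run fail by design.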

Fixture coverage

Fixtures live at fixtures/npu_advisory_dry_run/fixtures.json and cover:

context gate;
cron/n8n advisory events;
batch document/audio triage shape;
voice/audio advisory gate;
Kanban hygiene advisory;
advisory gateway envelopes.

All fixture payloads are synthetic and omit raw private content. Lane adapters use deterministic local rules or imported pure functions; they do not call live advisory services.
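
A lane adapter built on deterministic local rules might look something like the sketch below. The function name, the stale_cards input field, the thresholds, and the recommendation labels are all hypothetical; only the none_dry_run action value comes from this document.

```python
from typing import Any, Dict


def kanban_hygiene_adapter(fixture: Dict[str, int]) -> Dict[str, Any]:
    """Map a synthetic fixture to an advisory record using fixed rules.

    No network calls, clocks, or randomness: the same fixture always
    produces the same output.
    """
    stale = fixture.get("stale_cards", 0)  # hypothetical input field
    if stale >= 5:
        recommendation, confidence = "flag_for_review", 0.9
    elif stale > 0:
        recommendation, confidence = "note_only", 0.6
    else:
        recommendation, confidence = "no_action", 0.95
    return {
        "lane": "kanban_hygiene",
        "recommendation": recommendation,
        "confidence": confidence,
        "actual_action": "none_dry_run",  # the adapter never acts
    }


print(kanban_hygiene_adapter({"stale_cards": 7}))
```

Keeping adapters pure in this way means a fixture run can be repeated in CI and compared byte for byte.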

Output shape

JSON output uses npu_advisory_dry_run_summary_v1 and includes totals, per-lane counts, confidence buckets, recommendation counts, authority violations, expected-outcome mismatches, and optionally per-fixture npu_advisory_decision_v1 records.

Each decision record includes timestamp, source, service, lane, input class, recommendation, expected recommendation, confidence/bucket, authority flags, allowed actions, actual action (none_dry_run), human/Atlas comparison, outcome, NPU proof, latency, fallback reason, and compact notes.
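
To make the relationship between decision records and the summary concrete, here is a minimal sketch that folds a list of records into summary counts. The dictionary keys, the bucket thresholds (0.8 and 0.5), and the lane and recommendation labels are assumptions for illustration; the real schema may name and bucket things differently.

```python
from collections import Counter
from typing import Any, Dict, List


def confidence_bucket(confidence: float) -> str:
    """Bucket a confidence score; the thresholds are illustrative only."""
    if confidence >= 0.8:
        return "high"
    if confidence >= 0.5:
        return "medium"
    return "low"


def summarize(decisions: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Aggregate per-fixture records into summary counts."""
    mismatches = [
        d for d in decisions
        if d["recommendation"] != d["expected_recommendation"]
    ]
    return {
        "schema": "npu_advisory_dry_run_summary_v1",
        "total": len(decisions),
        "per_lane": dict(Counter(d["lane"] for d in decisions)),
        "confidence_buckets": dict(
            Counter(confidence_bucket(d["confidence"]) for d in decisions)
        ),
        "recommendations": dict(Counter(d["recommendation"] for d in decisions)),
        "mismatches": len(mismatches),
    }


# Hypothetical synthetic records.
decisions = [
    {"lane": "context_gate", "recommendation": "no_action",
     "expected_recommendation": "no_action", "confidence": 0.9},
    {"lane": "context_gate", "recommendation": "escalate",
     "expected_recommendation": "no_action", "confidence": 0.4},
    {"lane": "voice_audio", "recommendation": "escalate",
     "expected_recommendation": "escalate", "confidence": 0.7},
]
summary = summarize(decisions)
print(summary)
```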

Promotion gate

Before any future advisory lane receives authority, a separate approval should require at minimum:

no expected-outcome mismatches for that lane's representative fixture set;
no false negatives on action-needed events;
every false positive deliberately reviewed;
zero authority-flag violations, except in known negative-control fixtures;
documented rollback and a narrow, explicit authority scope.
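
The mechanical parts of this checklist can be sketched as a per-lane blocker report. This is an illustration under assumed names: the fixture_id, authority_flags, and recommendation keys and the no_action label are hypothetical. False-positive review, rollback documentation, and authority scoping are human judgments and are deliberately not automated here.

```python
from typing import Any, Dict, FrozenSet, List


def lane_blockers(
    decisions: List[Dict[str, Any]],
    lane: str,
    negative_controls: FrozenSet[str] = frozenset(),
) -> List[str]:
    """List the reasons a lane fails the minimum mechanical criteria."""
    blockers: List[str] = []
    for d in decisions:
        if d["lane"] != lane:
            continue
        fid = d["fixture_id"]
        got, want = d["recommendation"], d["expected_recommendation"]
        if got != want:
            blockers.append(f"{fid}: expected-outcome mismatch")
        if want != "no_action" and got == "no_action":
            blockers.append(f"{fid}: false negative on action-needed event")
        violated = any(
            value is True
            for name, value in d.get("authority_flags", {}).items()
            if name.startswith("may_")
        )
        if violated and fid not in negative_controls:
            blockers.append(f"{fid}: authority flag violation")
    return blockers


# Hypothetical synthetic records.
records = [
    {"fixture_id": "gw-1", "lane": "gateway", "recommendation": "no_action",
     "expected_recommendation": "no_action",
     "authority_flags": {"may_route": True}},
    {"fixture_id": "vc-1", "lane": "voice_audio", "recommendation": "no_action",
     "expected_recommendation": "escalate", "authority_flags": {}},
]

print(lane_blockers(records, "gateway", frozenset({"gw-1"})))
print(lane_blockers(records, "voice_audio"))
```

An empty list means only that the mechanical checks found nothing; it is an input to the separate approval, not a substitute for it.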

Passing this harness never grants live authority by itself. Advisory outputs flow into npu_advisory_decision_v1 records, summary metrics, and a human/Atlas review gate. Any later promotion must be lane-specific, explicitly approved, and reversible.
