Files
swarm-master/docs/npu-advisory-dry-run-comparison.md
William Valentin dae2a57124 feat(npu): add advisory dry-run comparison harness
Add npu_advisory_decision_v1 schema, synthetic fixture set, comparison harness, docs, and focused tests for advisory-only NPU evaluation.
2026-06-06 15:30:31 -07:00

56 lines
2.8 KiB
Markdown

# NPU advisory dry-run comparison harness
This harness compares advisory-only NPU lane recommendations against synthetic/non-private expected decisions. It is an observability gate only: it does not route, send, write memory, execute tools, restart services, broaden private scans, restart gateways, or mutate vector stores.
For the operator runbook and promotion criteria, see `docs/npu-advisory-observability-runbook.md`. Treat this file as the compact command reference; the runbook is the source for how to interpret metrics and decide whether a lane is promotable later.
## Run
From `/home/will/lab/swarm`:
```bash
python scripts/npu-advisory-dry-run-comparison.py --format json
python scripts/npu-advisory-dry-run-comparison.py --format json --include-decisions
python scripts/npu-advisory-dry-run-comparison.py --format markdown
```
Strict checks for CI/review:
```bash
python scripts/npu-advisory-dry-run-comparison.py --fail-on-mismatch
python scripts/npu-advisory-dry-run-comparison.py --fail-on-authority-violation
```
`--fail-on-authority-violation` is expected to fail with the committed fixture set because one synthetic gateway fixture intentionally proves that `may_* = true` is caught and summarized.
## Fixture coverage
Fixtures live at `fixtures/npu_advisory_dry_run/fixtures.json` and cover:
- context gate;
- cron/n8n advisory events;
- batch document/audio triage shape;
- voice/audio advisory gate;
- Kanban hygiene advisory;
- advisory gateway envelopes.
All fixture payloads are synthetic and omit raw private content. Lane adapters use deterministic local rules or imported pure functions; they do not call live advisory services.
## Output shape
JSON output uses `npu_advisory_dry_run_summary_v1` and includes totals, per-lane counts, confidence buckets, recommendation counts, authority violations, expected-outcome mismatches, and optionally per-fixture `npu_advisory_decision_v1` records.
Each decision record includes timestamp, source, service, lane, input class, recommendation, expected recommendation, confidence/bucket, authority flags, allowed actions, actual action (`none_dry_run`), human/Atlas comparison, outcome, NPU proof, latency, fallback reason, and compact notes.
## Promotion gate
Before any future advisory lane receives authority, a separate approval should require at minimum:
- no expected-outcome mismatches for that lane's representative fixture set;
- no false negatives on action-needed events;
- intentionally reviewed false positives;
- zero authority-safe flag violations except known negative-control fixtures;
- documented rollback and a narrow, explicit authority scope.
Passing this harness never grants live authority by itself. Advisory outputs flow into `npu_advisory_decision_v1` records, summary metrics, and a human/Atlas review gate. Any later promotion must be lane-specific, explicitly approved, and reversible.