swarm-master/docs/npu-advisory-dry-run-comparison.md

# NPU advisory dry-run comparison harness

This harness compares advisory-only NPU lane recommendations against synthetic/non-private expected decisions. It is an observability gate only: it does not route, send, write memory, execute tools, restart services or gateways, broaden private scans, or mutate vector stores.

For the operator runbook and promotion criteria, see `docs/npu-advisory-observability-runbook.md`. Treat this file as the compact command reference; the runbook is the authoritative source for interpreting metrics and deciding whether a lane can later be promoted.

## Run

From `/home/will/lab/swarm`:

```bash
python scripts/npu-advisory-dry-run-comparison.py --format json
python scripts/npu-advisory-dry-run-comparison.py --format json --include-decisions
python scripts/npu-advisory-dry-run-comparison.py --format markdown
```

Strict checks for CI/review:

```bash
python scripts/npu-advisory-dry-run-comparison.py --fail-on-mismatch
python scripts/npu-advisory-dry-run-comparison.py --fail-on-authority-violation
```

`--fail-on-authority-violation` is expected to fail with the committed fixture set. One synthetic gateway fixture is an intentional negative control: it sets a `may_*` flag to `true` to prove that such a violation is caught and summarized.
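The check behind this flag can be pictured as a scan for any `may_*` flag set to `true`. The sketch below is illustrative only: the function name `find_authority_violations` and the individual flag names (`may_route`, `may_send`, `may_write_memory`) are assumptions, not the harness's actual identifiers.

```python
from typing import Any, Mapping


def find_authority_violations(authority_flags: Mapping[str, Any]) -> list[str]:
    """Return the names of `may_*` flags that are set to True."""
    return sorted(
        name
        for name, value in authority_flags.items()
        if name.startswith("may_") and value is True
    )


# Hypothetical flag names, for illustration only.
safe_flags = {"may_route": False, "may_send": False, "may_write_memory": False}
negative_control = {"may_route": True, "may_send": False, "may_write_memory": False}

print(find_authority_violations(safe_flags))        # []
print(find_authority_violations(negative_control))  # ['may_route']
```

A fixture shaped like `negative_control` is what would make the strict check fail by design.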

## Fixture coverage

Fixtures live at `fixtures/npu_advisory_dry_run/fixtures.json` and cover:

- context gate;
- cron/n8n advisory events;
- batch document/audio triage shape;
- voice/audio advisory gate;
- Kanban hygiene advisory;
- advisory gateway envelopes.

All fixture payloads are synthetic and omit raw private content. Lane adapters use deterministic local rules or imported pure functions; they do not call live advisory services.
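As a rough illustration of a deterministic local rule, a lane adapter can be a pure function from a synthetic fixture input to a recommendation. Everything in this sketch is hypothetical: the function name, the `days_since_update` field, the 14-day threshold, and the recommendation strings are assumptions chosen for the example, not the rules the harness ships.

```python
from typing import Any


def kanban_hygiene_adapter(fixture_input: dict[str, Any]) -> dict[str, Any]:
    """Map a synthetic Kanban fixture to an advisory recommendation.

    Pure function: no network calls, no file writes, no clock reads,
    so the same input always yields the same output.
    """
    # Hypothetical rule: a card untouched for more than 14 days is stale.
    stale = fixture_input["days_since_update"] > 14
    return {
        "lane": "kanban_hygiene",
        "recommendation": "flag_stale_card" if stale else "no_action",
        "actual_action": "none_dry_run",  # the harness never acts
    }


fresh = kanban_hygiene_adapter({"days_since_update": 3})
stale = kanban_hygiene_adapter({"days_since_update": 30})
```

Keeping adapters pure is what lets the comparison run repeatably in CI without touching live advisory services.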

## Output shape

JSON output uses `npu_advisory_dry_run_summary_v1` and includes totals, per-lane counts, confidence buckets, recommendation counts, authority violations, expected-outcome mismatches, and optionally per-fixture `npu_advisory_decision_v1` records.
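A minimal sketch of how such a summary could be aggregated from decision records follows. The key names (`schema`, `per_lane`, `mismatches`, and so on) and the confidence thresholds are assumptions made for illustration; the real field names and bucket boundaries are defined by the script and the runbook.

```python
from collections import Counter
from typing import Any


def confidence_bucket(confidence: float) -> str:
    """Bucket a confidence score. Thresholds here are hypothetical."""
    if confidence >= 0.8:
        return "high"
    if confidence >= 0.5:
        return "medium"
    return "low"


def summarize(decisions: list[dict[str, Any]]) -> dict[str, Any]:
    """Aggregate per-fixture decisions into a summary dictionary."""
    return {
        "schema": "npu_advisory_dry_run_summary_v1",
        "total": len(decisions),
        "per_lane": dict(Counter(d["lane"] for d in decisions)),
        "confidence_buckets": dict(
            Counter(confidence_bucket(d["confidence"]) for d in decisions)
        ),
        "recommendations": dict(Counter(d["recommendation"] for d in decisions)),
        # Count decisions whose recommendation differs from the expected one.
        "mismatches": sum(
            d["recommendation"] != d["expected_recommendation"] for d in decisions
        ),
    }
```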

Each decision record includes timestamp, source, service, lane, input class, recommendation, expected recommendation, confidence/bucket, authority flags, allowed actions, actual action (`none_dry_run`), human/Atlas comparison, outcome, NPU proof, latency, fallback reason, and compact notes.
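To make the record shape concrete, here is a sketch of a decision record as a Python dataclass covering a subset of the fields listed above. The attribute names, the `match`/`mismatch` outcome values, and the `schema` key are assumptions for illustration; only `none_dry_run` and `npu_advisory_decision_v1` come from this document.

```python
from dataclasses import asdict, dataclass, field


@dataclass(frozen=True)
class AdvisoryDecision:
    lane: str
    recommendation: str
    expected_recommendation: str
    confidence: float
    authority_flags: dict[str, bool] = field(default_factory=dict)
    # In a dry run the harness never acts, so the action is fixed.
    actual_action: str = "none_dry_run"

    @property
    def outcome(self) -> str:
        """Hypothetical outcome label derived from the comparison."""
        if self.recommendation == self.expected_recommendation:
            return "match"
        return "mismatch"


record = AdvisoryDecision(
    lane="context_gate",
    recommendation="no_action",
    expected_recommendation="no_action",
    confidence=0.9,
    authority_flags={"may_route": False},  # hypothetical flag name
)
payload = {
    "schema": "npu_advisory_decision_v1",
    **asdict(record),
    "outcome": record.outcome,
}
```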

## Promotion gate

Before any future advisory lane receives authority, a separate approval should require at minimum:

- no expected-outcome mismatches for that lane's representative fixture set;
- no false negatives on action-needed events;
- intentionally reviewed false positives;
- zero authority-safe flag violations except known negative-control fixtures;
- documented rollback and a narrow, explicit authority scope.
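The machine-checkable part of these criteria could be sketched as a function that lists blockers for one lane's decisions. This is not part of the harness: the function name and every record key used here (`fixture_id`, `action_needed`, `expected_action_needed`, `false_positive_reviewed`, `authority_violations`, `negative_control`) are hypothetical. Rollback documentation and authority scope remain matters for human review and are not modeled.

```python
from typing import Any


def promotion_blockers(decisions: list[dict[str, Any]]) -> list[str]:
    """Return reasons a lane fails the checkable promotion criteria."""
    blockers: list[str] = []
    for d in decisions:
        fid = d["fixture_id"]
        if d["recommendation"] != d["expected_recommendation"]:
            blockers.append(f"{fid}: expected-outcome mismatch")
        if d["expected_action_needed"] and not d["action_needed"]:
            blockers.append(f"{fid}: false negative on action-needed event")
        is_false_positive = d["action_needed"] and not d["expected_action_needed"]
        if is_false_positive and not d.get("false_positive_reviewed", False):
            blockers.append(f"{fid}: unreviewed false positive")
        # Violations are tolerated only on known negative-control fixtures.
        if d.get("authority_violations") and not d.get("negative_control", False):
            blockers.append(f"{fid}: authority violation outside negative controls")
    return blockers
```

An empty list means only that the automated criteria hold for the fixtures supplied; it does not grant authority.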

Passing this harness never grants live authority by itself. Advisory outputs flow into `npu_advisory_decision_v1` records, summary metrics, and a human/Atlas review gate. Any later promotion must be lane-specific, explicitly approved, and reversible.