# NPU advisory dry-run comparison harness

This harness compares advisory-only NPU lane recommendations against synthetic, non-private expected decisions. It is an observability gate only: it does not route, send, write memory, execute tools, restart services or gateways, broaden private scans, or mutate vector stores.

For the operator runbook and promotion criteria, see `docs/npu-advisory-observability-runbook.md`. Treat this file as the compact command reference; the runbook is the source for how to interpret metrics and decide whether a lane is promotable later.
## Run

From `/home/will/lab/swarm`:

```bash
python scripts/npu-advisory-dry-run-comparison.py --format json
python scripts/npu-advisory-dry-run-comparison.py --format json --include-decisions
python scripts/npu-advisory-dry-run-comparison.py --format markdown
```

Strict checks for CI/review:

```bash
python scripts/npu-advisory-dry-run-comparison.py --fail-on-mismatch
python scripts/npu-advisory-dry-run-comparison.py --fail-on-authority-violation
```

`--fail-on-authority-violation` is expected to fail with the committed fixture set: one synthetic gateway fixture is a deliberate negative control that sets `may_* = true` to prove the violation is caught and summarized.
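Because one strict check is expected to fail, a CI wrapper may want to treat that failure as the passing condition. The following is a minimal sketch, not part of the harness. It assumes that a failed check is reported as a nonzero exit code and that `--fail-on-mismatch` passes on the committed fixtures; `run_flag` and `strict_gate` are hypothetical helper names.

```python
import subprocess
import sys
from typing import Callable

SCRIPT = "scripts/npu-advisory-dry-run-comparison.py"


def run_flag(flag: str) -> int:
    """Run the harness with one strict flag and return its exit code."""
    return subprocess.run([sys.executable, SCRIPT, flag], check=False).returncode


def strict_gate(run: Callable[[str], int] = run_flag) -> bool:
    """Return True when both strict checks behave as documented.

    Assumption: a failed check exits nonzero, a passing check exits 0.
    """
    # Assumed to pass on the committed fixture set.
    mismatch_ok = run("--fail-on-mismatch") == 0
    # The negative-control fixture should trip this check, so a zero exit
    # code here would mean the violation went undetected.
    violation_caught = run("--fail-on-authority-violation") != 0
    return mismatch_ok and violation_caught
```

Passing the runner in as a parameter keeps the gate logic testable without invoking the real script.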
## Fixture coverage

Fixtures live at `fixtures/npu_advisory_dry_run/fixtures.json` and cover:

- context gate;
- cron/n8n advisory events;
- batch document/audio triage shape;
- voice/audio advisory gate;
- Kanban hygiene advisory;
- advisory gateway envelopes.

All fixture payloads are synthetic and omit raw private content. Lane adapters use deterministic local rules or imported pure functions; they do not call live advisory services.
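To illustrate what "deterministic local rules" can mean for a lane adapter, here is a small sketch of a pure function evaluated against a synthetic fixture. Everything in it is hypothetical: the payload keys, the recommendation strings, the threshold, and the function names are invented for illustration and are not taken from the committed fixtures or adapters.

```python
from typing import Any

STALE_AFTER_DAYS = 14  # hypothetical threshold, for illustration only


def kanban_hygiene_adapter(payload: dict[str, Any]) -> str:
    """Map a synthetic Kanban card summary to an advisory recommendation.

    Pure function: no network calls, no writes, same input gives same output.
    """
    if payload["days_since_update"] > STALE_AFTER_DAYS:
        return "flag_stale_card"
    return "no_action"


def check_fixture(fixture: dict[str, Any]) -> bool:
    """True when the adapter's recommendation matches the fixture's expectation."""
    return kanban_hygiene_adapter(fixture["payload"]) == fixture["expected_recommendation"]
```

Keeping adapters pure is what makes a fixture run reproducible: a mismatch then points at the rule or the expectation, never at a live service.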
## Output shape

JSON output uses `npu_advisory_dry_run_summary_v1` and includes totals, per-lane counts, confidence buckets, recommendation counts, authority violations, expected-outcome mismatches, and optionally per-fixture `npu_advisory_decision_v1` records.

Each decision record includes timestamp, source, service, lane, input class, recommendation, expected recommendation, confidence and confidence bucket, authority flags, allowed actions, actual action (always `none_dry_run`), human/Atlas comparison, outcome, NPU proof, latency, fallback reason, and compact notes.
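As a rough picture of how decision records could roll up into a summary, the sketch below counts lanes, collects mismatches, and lists any `may_*` flag set to true. The key names (`fixture_id`, `lane`, `authority_flags`, and the summary keys) are assumptions made for this example; the real schemas may name and nest these fields differently.

```python
from collections import Counter
from typing import Any


def authority_violations(record: dict[str, Any]) -> list[str]:
    """Return the names of `may_*` authority flags that are set to true."""
    flags = record.get("authority_flags", {})
    return sorted(k for k, v in flags.items() if k.startswith("may_") and v is True)


def summarize(records: list[dict[str, Any]]) -> dict[str, Any]:
    """Build a compact summary from per-fixture decision records (hypothetical keys)."""
    violations = {}
    for rec in records:
        violated = authority_violations(rec)
        if violated:
            violations[rec["fixture_id"]] = violated
    return {
        "schema": "npu_advisory_dry_run_summary_v1",
        "total": len(records),
        "per_lane": dict(Counter(rec["lane"] for rec in records)),
        "mismatches": [
            rec["fixture_id"]
            for rec in records
            if rec["recommendation"] != rec["expected_recommendation"]
        ],
        "authority_violations": violations,
    }
```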
## Promotion gate

Before any future advisory lane receives authority, a separate approval should require, at a minimum:

- no expected-outcome mismatches for that lane's representative fixture set;
- no false negatives on action-needed events;
- false positives that have each been deliberately reviewed;
- zero authority-safe flag violations, except in known negative-control fixtures;
- documented rollback and a narrow, explicit authority scope.

Passing this harness never grants live authority by itself. Advisory outputs flow into `npu_advisory_decision_v1` records, summary metrics, and a human/Atlas review gate. Any later promotion must be lane-specific, explicitly approved, and reversible.
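The machine-checkable part of the criteria above could be sketched as a function that lists blocking reasons for one lane's records. This is an illustration only: the keys `fixture_id`, `outcome`, `reviewed`, `negative_control`, and `authority_flags`, and the outcome strings, are assumptions rather than the real schema. Documented rollback and authority scope are not checked here because they are review items, not record fields.

```python
from typing import Any


def promotion_blockers(records: list[dict[str, Any]]) -> list[str]:
    """List reasons a lane's fixture run fails the minimum promotion criteria.

    An empty list only means the lane can be put forward for separate,
    explicit approval; it never grants authority by itself.
    """
    blockers: list[str] = []
    for rec in records:
        fixture = rec.get("fixture_id", "<unknown>")
        if rec["recommendation"] != rec["expected_recommendation"]:
            blockers.append(f"{fixture}: expected-outcome mismatch")
        if rec.get("outcome") == "false_negative":
            blockers.append(f"{fixture}: false negative on action-needed event")
        if rec.get("outcome") == "false_positive" and not rec.get("reviewed", False):
            blockers.append(f"{fixture}: unreviewed false positive")
        flags = rec.get("authority_flags", {})
        violated = any(v is True for k, v in flags.items() if k.startswith("may_"))
        # Known negative-control fixtures are allowed to carry a violation.
        if violated and not rec.get("negative_control", False):
            blockers.append(f"{fixture}: authority flag violation")
    return blockers
```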