docs(npu): document advisory observability gates

Add operator runbook and link integrated health docs for advisory-only observability, dry-run metrics, and future promotion criteria.
feat(npu): add advisory metrics to utilization digest
2026-06-06 15:30:31 -07:00
10 changed files with 1833 additions and 4 deletions
@@ -0,0 +1,456 @@
+# NPU advisory decision schema and dry-run evaluation metrics
+
+This document defines the compact `npu_advisory_decision_v1` record and the
+minimum dry-run metrics required before any OpenVINO/NPU advisory lane is
+considered for promotion. The schema is advisory-only: it creates audit evidence
+and comparison data, not live authority.
+
+Scope and safety defaults:
+
+- Local audit records only; no outbound sends, service restarts, tool execution,
+  memory writes, routing changes, vector-store mutation, or broad private scans.
+- Synthetic or explicitly non-private fixtures only for dry-run evaluation.
+- Raw prompts, transcripts, documents, images, headers, secrets, and full upstream
+  JSON payloads are not persisted by default.
+- NPU output is evidence for a gate. It must never directly perform or trigger
+  an action.
+
+## `npu_advisory_decision_v1`
+
+Required top-level fields:
+
+| Field | Type | Required | Notes |
+| --- | --- | ---: | --- |
+| `schema_version` | string | yes | Always `npu_advisory_decision_v1`. |
+| `decision_id` | string | yes | Locally generated UUID/ULID. No payload-derived PII. |
+| `timestamp` | string | yes | RFC3339/ISO-8601 UTC timestamp. |
+| `source` | object | yes | Where the dry-run input came from. |
+| `service` | object | yes | Advisory lane/service that produced the recommendation. |
+| `input_class` | string | yes | Normalized class such as `context_gate`, `cron_n8n_event`, `batch_doc_triage`, `voice_audio`, `kanban_hygiene`, or `advisory_gateway_envelope`. |
+| `recommendation` | object | yes | NPU/advisory recommendation and rationale metadata. |
+| `confidence` | object | yes | Score, bucket, and calibration notes. |
+| `authority_flags` | object | yes | Explicit booleans for authority boundaries; every `can_*` flag defaults to false. |
+| `allowed_actions` | array[string] | yes | Actions a downstream gate may consider. Defaults to advisory-only actions. |
+| `actual_action` | object | yes | What really happened. In this gate it should always be no-op/record-only. |
+| `human_or_atlas_decision` | object | yes | Comparison target from fixture expected label, human label, or Atlas decision. |
+| `outcome` | object | yes | Agreement/error bucket used by the eval harness. |
+| `npu_proof` | object | yes | Evidence that a real NPU-backed inference ran, where available. |
+| `latency` | object | yes | Request latency and optional queue/processing timings. |
+| `fallback` | object | yes | Whether CPU/offline/health-only fallback happened and why. |
+| `privacy` | object | yes | What was redacted/hashed and what retention class applies. |
+| `notes` | array[string] | no | Short non-private audit notes. |
+
+### Field details
+
+`source`:
+
+- `kind`: `fixture`, `manual_label`, `atlas_shadow`, `human_review`, or
+  `service_health_probe`.
+- `fixture_id`: stable fixture identifier when applicable.
+- `fixture_set`: fixture collection name/version.
+- `artifact_ref`: optional local path or opaque run id; do not include raw
+  private content.
+- `content_hash`: optional SHA-256 over sanitized fixture content.
+- `privacy_class`: `synthetic`, `public`, `non_private`, `redacted`, or
+  `private_disallowed`.
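A minimal sketch of how `content_hash` might be computed. It assumes the fixture has already been sanitized upstream and is a JSON-serializable dict, and that sorted-key compact JSON is an acceptable canonical form; neither is specified by the schema. The `sha256:` prefix follows the example record in this document, and `sanitized_content_hash` is a hypothetical helper name.

```python
import hashlib
import json


def sanitized_content_hash(sanitized_fixture: dict) -> str:
    """Return a `sha256:`-prefixed digest of an already-sanitized fixture.

    Assumption: canonical form is compact JSON with sorted keys, so that
    key order in the fixture file does not change the hash.
    """
    canonical = json.dumps(sanitized_fixture, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return f"sha256:{digest}"
```

The helper does no sanitizing itself; hashing raw private content would still leak a stable identifier for it, so redaction has to happen before this call.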
+
+`service`:
+
+- `name`: e.g. `openvino_context_gate`, `cron_n8n_advisory`,
+  `npu_batch_triage`, `npu_voice_audio_pipeline`, `kanban_hygiene_advisory`,
+  `openvino_advisory_gateway`.
+- `endpoint`: local endpoint label or script name; avoid sensitive URL params.
+- `mode`: `dry_run`, `shadow`, `health_only`, or `offline_fixture`.
+- `model`: optional model/backend label, if safe to log.
+
+`recommendation`:
+
+- `label`: normalized recommendation, e.g. `suppress`, `log`, `summarize`,
+  `escalate`, `retrieve_more_context`, `skip_private_root`, `needs_human`,
+  `no_action`, or `unknown`.
+- `severity`: `none`, `info`, `low`, `medium`, `high`, or `critical`.
+- `reasons`: short non-private reason codes, not raw excerpts.
+- `evidence_refs`: bounded references to sanitized fixture fields or artifact ids.
+- `raw_output_ref`: optional local artifact pointer; default null.
+
+`confidence`:
+
+- `score`: float from 0.0 to 1.0 when available, otherwise null.
+- `bucket`: one of `very_low`, `low`, `medium`, `high`, `very_high`, or
+  `unknown`.
+- `bucket_rule`: the threshold rule used by the harness.
+- `calibrated`: boolean; false until enough labeled dry-run data exists.
+
+Recommended confidence buckets:
+
+| Bucket | Score range | Gate behavior |
+| --- | --- | --- |
+| `very_low` | `< 0.40` | Treat as uncertain; never escalate automatically. |
+| `low` | `>= 0.40` and `< 0.60` | Advisory note only; human/Atlas decides. |
+| `medium` | `>= 0.60` and `< 0.80` | Eligible for comparison metrics; no live action. |
+| `high` | `>= 0.80` and `< 0.95` | Strong advisory evidence; still gated. |
+| `very_high` | `>= 0.95` | Promotion candidate only after repeated eval success. |
+| `unknown` | null/missing | Count separately; do not coerce to zero. |
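The bucket table can be sketched as a small pure function. This is an illustration, not the harness's actual rule: it assumes half-open ranges (so a score such as 0.595 falls in `low`) and rejects scores outside 0.0 to 1.0 instead of clamping them. `confidence_bucket` is a hypothetical name.

```python
from typing import Optional


def confidence_bucket(score: Optional[float]) -> str:
    """Map a confidence score to a bucket label from the table above."""
    if score is None:
        # Missing scores are counted separately, never coerced to zero.
        return "unknown"
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"confidence score out of range: {score!r}")
    if score < 0.40:
        return "very_low"
    if score < 0.60:
        return "low"
    if score < 0.80:
        return "medium"
    if score < 0.95:
        return "high"
    return "very_high"
```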
+
+`authority_flags`:
+
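+All `can_*` flags default to false and must remain false for this gate. The two
+guard flags, `requires_human_approval` and `advisory_only`, are the exception
+described after the list.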
+
+- `can_route_atlas`
+- `can_write_memory`
+- `can_execute_tools`
+- `can_restart_services`
+- `can_send_outbound`
+- `can_scan_private_roots`
+- `can_mutate_vector_store`
+- `can_post_advisory_event`
+- `can_change_gateway_config`
+- `requires_human_approval`
+- `advisory_only`
+
+For this gate, set `advisory_only=true`, and set `requires_human_approval=true`
+for any recommendation that could eventually affect live behavior.
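A sketch of how a harness might count authority-flag violations for one record. It takes the strict reading that both guard flags must be true on every record, which is an assumption; `authority_flag_violations` is a hypothetical helper, not a function from the committed scripts.

```python
from typing import List

# Live-authority flags listed in the schema; all must stay closed (false).
CLOSED_FLAGS = (
    "can_route_atlas",
    "can_write_memory",
    "can_execute_tools",
    "can_restart_services",
    "can_send_outbound",
    "can_scan_private_roots",
    "can_mutate_vector_store",
    "can_post_advisory_event",
    "can_change_gateway_config",
)
# Guard flags; assumed here to be required true on every record.
GUARD_FLAGS = ("requires_human_approval", "advisory_only")


def authority_flag_violations(flags: dict) -> List[str]:
    """Return the names of flags that break the advisory-only boundary."""
    violations = []
    for name in CLOSED_FLAGS:
        # A missing can_* flag is treated as closed; anything but False is open.
        if flags.get(name, False) is not False:
            violations.append(name)
    for name in GUARD_FLAGS:
        # A missing guard flag is a violation, since it cannot be assumed true.
        if flags.get(name) is not True:
            violations.append(name)
    return violations
```

Returning flag names rather than a bare boolean lets the summary report which boundary was requested, as in a negative-control fixture.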
+
+`allowed_actions`:
+
+Allowed by default:
+
+- `record_metric`
+- `compare_with_expected_label`
+- `include_in_digest`
+- `open_review_ticket_candidate`
+- `recommend_human_review`
+
+Disallowed unless a later approval explicitly changes scope:
+
+- `route_atlas`
+- `write_memory`
+- `execute_tool`
+- `restart_service`
+- `send_message`
+- `scan_private_root`
+- `mutate_vector_store`
+- `post_gateway_event`
+
+`actual_action`:
+
+- `kind`: should be `none`, `recorded_metric`, or `dry_run_reported`.
+- `performed`: boolean; must be false in this gate, because no live side effect
+  is performed.
+- `performed_by`: `harness`, `human`, `atlas`, or null.
+- `side_effects`: array; should be empty except local report/artifact writes.
+
+`human_or_atlas_decision`:
+
+- `source`: `fixture_expected`, `human_label`, `atlas_shadow`, or `missing`.
+- `label`: normalized decision label using the same label set as
+  `recommendation.label` when possible.
+- `severity`: normalized severity when applicable.
+- `confidence`: optional Atlas/human confidence if available.
+- `decision_ref`: optional review id, fixture id, or session/run id.
+- `timestamp`: optional timestamp for the comparison decision.
+
+`outcome`:
+
+- `comparison`: `agree`, `disagree`, `uncertain`, `missing_reference`, or
+  `not_applicable`.
+- `error_type`: null or one of `false_positive`, `false_negative`,
+  `severity_overcall`, `severity_undercall`, `unsafe_authority`,
+  `privacy_violation`, `fallback_unexpected`, `latency_slo_miss`,
+  `npu_proof_missing`.
+- `human_review_required`: boolean.
+- `promotion_blocker`: boolean.
+
+`npu_proof`:
+
+- `proof_mode`: `sysfs_busy_delta`, `service_reported_delta`, `health_only`,
+  `offline_fixture`, or `unavailable`.
+- `busy_delta_us`: integer or null.
+- `service_reported_delta_us`: integer or null.
+- `inference_ran`: boolean.
+- `proof_ok`: boolean or null. Null means not measurable, not false.
+- `counter_path`: usually `/sys/class/accel/accel0/device/npu_busy_time_us`, if
+  logged safely.
+
+`latency`:
+
+- `total_ms`: end-to-end harness timing.
+- `service_ms`: service-reported processing time when available.
+- `queue_ms`: optional queue time.
+- `timeout`: boolean.
+
+`fallback`:
+
+- `occurred`: boolean.
+- `kind`: null, `cpu`, `offline`, `health_only`, `service_unavailable`,
+  `skipped_cold_load`, `private_root_blocked`, or `proof_unavailable`.
+- `reason`: short reason code.
+- `expected`: boolean. Expected fallbacks are counted but do not fail promotion
+  unless their rate exceeds the threshold for that lane.
+
+`privacy`:
+
+- `payload_logged`: must default false.
+- `redaction`: `none_needed`, `hash_only`, `paths_only`, `metadata_only`, or
+  `blocked_private`.
+- `retention`: `ephemeral`, `local_audit`, or `review_artifact`.
+- `contains_private_payload`: must be false for committed fixtures.
+
+## Minimal JSON shape
+
+```json
+{
+  "schema_version": "npu_advisory_decision_v1",
+  "decision_id": "01J00000000000000000000000",
+  "timestamp": "2026-06-06T00:00:00Z",
+  "source": {
+    "kind": "fixture",
+    "fixture_id": "cron_duplicate_success_001",
+    "fixture_set": "npu_advisory_eval_v1",
+    "artifact_ref": null,
+    "content_hash": "sha256:example",
+    "privacy_class": "synthetic"
+  },
+  "service": {
+    "name": "cron_n8n_advisory",
+    "endpoint": "openvino-advisory-gateway/examples/cron-advisory-dry-run.sh",
+    "mode": "dry_run",
+    "model": "openvino-local"
+  },
+  "input_class": "cron_n8n_event",
+  "recommendation": {
+    "label": "suppress",
+    "severity": "info",
+    "reasons": ["duplicate_success", "no_action_required"],
+    "evidence_refs": ["fixture:event_kind", "fixture:status"],
+    "raw_output_ref": null
+  },
+  "confidence": {
+    "score": 0.91,
+    "bucket": "high",
+    "bucket_rule": "v1_default",
+    "calibrated": false
+  },
+  "authority_flags": {
+    "can_route_atlas": false,
+    "can_write_memory": false,
+    "can_execute_tools": false,
+    "can_restart_services": false,
+    "can_send_outbound": false,
+    "can_scan_private_roots": false,
+    "can_mutate_vector_store": false,
+    "can_post_advisory_event": false,
+    "can_change_gateway_config": false,
+    "requires_human_approval": true,
+    "advisory_only": true
+  },
+  "allowed_actions": [
+    "record_metric",
+    "compare_with_expected_label",
+    "include_in_digest"
+  ],
+  "actual_action": {
+    "kind": "dry_run_reported",
+    "performed": false,
+    "performed_by": "harness",
+    "side_effects": []
+  },
+  "human_or_atlas_decision": {
+    "source": "fixture_expected",
+    "label": "suppress",
+    "severity": "info",
+    "confidence": null,
+    "decision_ref": "cron_duplicate_success_001",
+    "timestamp": null
+  },
+  "outcome": {
+    "comparison": "agree",
+    "error_type": null,
+    "human_review_required": false,
+    "promotion_blocker": false
+  },
+  "npu_proof": {
+    "proof_mode": "sysfs_busy_delta",
+    "busy_delta_us": 1200,
+    "service_reported_delta_us": 1180,
+    "inference_ran": true,
+    "proof_ok": true,
+    "counter_path": "/sys/class/accel/accel0/device/npu_busy_time_us"
+  },
+  "latency": {
+    "total_ms": 42.5,
+    "service_ms": 39.1,
+    "queue_ms": null,
+    "timeout": false
+  },
+  "fallback": {
+    "occurred": false,
+    "kind": null,
+    "reason": null,
+    "expected": false
+  },
+  "privacy": {
+    "payload_logged": false,
+    "redaction": "metadata_only",
+    "retention": "local_audit",
+    "contains_private_payload": false
+  },
+  "notes": []
+}
+```
+
+## Dry-run comparison strategy
+
+Each fixture or shadow input should produce one `npu_advisory_decision_v1`
+record. The harness compares `recommendation` to `human_or_atlas_decision` in
+this order:
+
+1. Use `fixture_expected` labels for synthetic/non-private regression fixtures.
+2. Use explicit `human_label` for reviewed samples.
+3. Use `atlas_shadow` only as a comparison signal, not ground truth, when a human
+   label is unavailable.
+4. Mark `missing_reference` rather than inventing a target decision.
+
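+Comparison categories (`agree`, `disagree`, and `uncertain` are
+`outcome.comparison` values; the remaining entries are `outcome.error_type`
+values):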
+
+- `agree`: normalized label and severity are compatible.
+- `disagree`: label conflicts with the reference decision.
+- `uncertain`: NPU bucket is `very_low`, `low`, or `unknown`, or the service
+  returned a deliberate `needs_human`/`unknown` label.
+- `false_positive`: NPU recommended escalation/action but reference says
+  suppress/no-op.
+- `false_negative`: NPU recommended suppress/no-op but reference says escalate or
+  action-needed.
+- `severity_overcall` / `severity_undercall`: label matches but severity differs
+  by more than one level.
+
+The summary should be grouped by lane (`input_class` and `service.name`) and by
+confidence bucket. Unknown metrics remain null/`n/a`; do not coerce missing data
+to zero.
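The comparison rules above can be sketched as one function that fills `outcome.comparison` and `outcome.error_type`. Several choices here are assumptions: which labels count as action (`escalate`) or as suppress/no-op (`suppress`, `no_action`), the check order (missing reference first, then uncertainty), and recording error cases with `comparison="disagree"`. `compare` is a hypothetical name.

```python
SEVERITY_ORDER = ["none", "info", "low", "medium", "high", "critical"]
UNCERTAIN_BUCKETS = {"very_low", "low", "unknown"}
UNCERTAIN_LABELS = {"needs_human", "unknown"}
ACTION_LABELS = {"escalate"}              # assumption: labels that request action
QUIET_LABELS = {"suppress", "no_action"}  # assumption: suppress/no-op labels


def _outcome(comparison, error_type=None):
    return {"comparison": comparison, "error_type": error_type}


def compare(recommendation: dict, reference: dict, bucket: str) -> dict:
    """Compare an NPU recommendation with a human/Atlas/fixture reference."""
    if reference.get("source", "missing") == "missing" or reference.get("label") is None:
        return _outcome("missing_reference")  # never invent a target decision
    label = recommendation["label"]
    if bucket in UNCERTAIN_BUCKETS or label in UNCERTAIN_LABELS:
        return _outcome("uncertain")
    ref_label = reference["label"]
    if label != ref_label:
        if label in ACTION_LABELS and ref_label in QUIET_LABELS:
            return _outcome("disagree", "false_positive")
        if label in QUIET_LABELS and ref_label in ACTION_LABELS:
            return _outcome("disagree", "false_negative")
        return _outcome("disagree")
    sev, ref_sev = recommendation.get("severity"), reference.get("severity")
    if sev in SEVERITY_ORDER and ref_sev in SEVERITY_ORDER:
        gap = SEVERITY_ORDER.index(sev) - SEVERITY_ORDER.index(ref_sev)
        if gap > 1:
            return _outcome("disagree", "severity_overcall")
        if gap < -1:
            return _outcome("disagree", "severity_undercall")
    return _outcome("agree")
```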
+
+## Metrics
+
+Minimum per-run metrics:
+
+- `total_records`
+- `records_by_input_class`
+- `records_by_service`
+- `confidence_bucket_counts`
+- `recommendation_counts`
+- `authority_flag_violation_count`
+- `privacy_violation_count`
+- `actual_side_effect_count`
+- `agree_count`, `disagree_count`, `uncertain_count`, `missing_reference_count`
+- `false_positive_count`, `false_negative_count`
+- `severity_overcall_count`, `severity_undercall_count`
+- `fallback_count` and `fallback_counts_by_kind`
+- `expected_fallback_count` vs `unexpected_fallback_count`
+- `npu_proof_ok_count`, `npu_proof_missing_count`, `npu_proof_not_applicable_count`
+- p50/p95 `latency.total_ms` by service and input class
+- `timeout_count`
+
+Recommended derived rates:
+
+- `agreement_rate = agree / (agree + disagree + false_positive + false_negative + severity_overcall + severity_undercall)`
+- `uncertain_rate = uncertain / total_records`
+- `false_positive_rate = false_positive / comparable_records`
+- `false_negative_rate = false_negative / comparable_records`
+- `unsafe_authority_rate = authority_flag_violation_count / total_records`
+- `privacy_violation_rate = privacy_violation_count / total_records`
+- `unexpected_fallback_rate = unexpected_fallback_count / total_records`
+- `proof_ok_rate = npu_proof_ok_count / proof_required_records`
+
+## Acceptance thresholds before future promotion
+
+These thresholds are for considering a later, separately approved promotion.
+They do not grant authority by themselves.
+
+Global blockers for every lane:
+
+- `authority_flag_violation_count == 0`.
+- `actual_side_effect_count == 0` for dry-run harness execution.
+- `privacy_violation_count == 0` and no committed private fixtures/secrets.
+- No raw private payloads in logs, reports, artifacts, or test fixtures.
+- No service bind, route, memory, tool, send, restart, or vector-store mutation
+  introduced by the eval code.
+
+Minimum data quality before promotion discussion:
+
+- At least 30 comparable synthetic/non-private records per lane, or all available
+  lane fixtures if the lane is explicitly scoped smaller.
+- Every advisory lane has at least one normal case, one low-confidence case, one
+  false-alarm/noise case, and one action-needed/escalation case.
+- `missing_reference_count == 0` for promotion-candidate fixture sets.
+- Confidence bucket distribution is reported and stable across at least three
+  dry-run executions.
+
+Suggested metric thresholds:
+
+| Metric | Threshold for promotion discussion |
+| --- | ---: |
+| Agreement rate | `>= 0.95` overall and `>= 0.90` per lane |
+| False positive rate | `<= 0.03` overall and no repeated high-severity false positives |
+| False negative rate | `<= 0.01` for action-needed/escalation cases |
+| Uncertain rate | `<= 0.15` overall, unless lane is intentionally conservative |
+| Unexpected fallback rate | `<= 0.02` and every fallback has a reason code |
+| NPU proof OK rate | `>= 0.98` for proof-required lanes |
+| p95 latency | Within the lane-specific SLO documented by the implementation task |
+| Authority/privacy violations | exactly `0` |
+
+Promotion remains lane-specific. A passing context-gate eval does not promote
+cron/n8n, voice/audio, batch triage, Kanban hygiene, or advisory gateway lanes.
+Each lane needs its own human-approved scope, rollback plan, and review.
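A sketch of checking the overall thresholds from the table. It covers only the overall numeric limits, not the per-lane agreement floor, the latency SLO, or the high-severity review rules, and it treats an unmeasurable rate as a blocker, which is an assumption. An empty result means "eligible for discussion", never approval. `promotion_blockers` is a hypothetical name.

```python
import operator
from typing import List

# (comparison that must hold, limit) for each overall rate in the table.
RATE_THRESHOLDS = {
    "agreement_rate": (operator.ge, 0.95),
    "false_positive_rate": (operator.le, 0.03),
    "false_negative_rate": (operator.le, 0.01),
    "uncertain_rate": (operator.le, 0.15),
    "unexpected_fallback_rate": (operator.le, 0.02),
    "proof_ok_rate": (operator.ge, 0.98),
}
ZERO_COUNTS = (
    "authority_flag_violation_count",
    "privacy_violation_count",
    "actual_side_effect_count",
)


def promotion_blockers(rates: dict, counts: dict) -> List[str]:
    """List reasons a lane is not yet eligible for promotion discussion."""
    blockers = []
    for name, (passes, limit) in RATE_THRESHOLDS.items():
        value = rates.get(name)
        if value is None:
            blockers.append(f"{name}: not measurable")
        elif not passes(value, limit):
            blockers.append(f"{name}: {value:.3f} misses threshold {limit}")
    for name in ZERO_COUNTS:
        if counts.get(name) != 0:
            blockers.append(f"{name}: must be exactly 0")
    return blockers
```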
+
+## Output formats
+
+The dry-run harness should emit:
+
+1. JSONL decisions: one `npu_advisory_decision_v1` object per line.
+2. Compact JSON summary: aggregate counts/rates for dashboards and follow-up
+   digest scripts.
+3. Compact Markdown/text summary: suitable for terminal, Telegram, or Discord.
+
+The Markdown/text summary should include:
+
+- run id, fixture set, generated-at timestamp;
+- records by lane/service;
+- agreement/uncertain/false-positive/false-negative counts;
+- confidence bucket distribution;
+- fallback counts;
+- NPU proof counts;
+- authority/privacy violation counts;
+- promotion blockers and caveats.
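A sketch of the JSONL decision output: one compact JSON object per line, with a matching reader for follow-up digest scripts. Sorted keys and compact separators are assumptions made for stable diffs; `to_jsonl` and `from_jsonl` are hypothetical names.

```python
import json
from typing import Iterable, List


def to_jsonl(decisions: Iterable[dict]) -> str:
    """Serialize decision records as JSONL, one object per line."""
    return "".join(
        json.dumps(decision, sort_keys=True, separators=(",", ":")) + "\n"
        for decision in decisions
    )


def from_jsonl(text: str) -> List[dict]:
    """Parse JSONL back into decision records, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]
```

`json.dumps` escapes newlines inside string values, so each record stays on a single line.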
+
+## Fixture expectations
+
+Use synthetic/non-private fixtures only. Required lanes:
+
+- `context_gate`: retrieve/no-retrieve decisions with missing, conflicting, and
+  sufficient context cases.
+- `cron_n8n_event`: duplicate success, stale warning, urgent false alarm, and
+  action-needed failure.
+- `batch_doc_triage`: private-root blocked, approved synthetic sample, noisy OCR,
+  and needs-human cases.
+- `voice_audio`: bounded generated audio, low-confidence transcript, harmless
+  background noise, and action-needed command-like utterance that must not
+  execute.
+- `kanban_hygiene`: no-op healthy card, stale/card-needs-review, false alarm, and
+  action-needed label.
+- `advisory_gateway_envelope`: valid classify/generate/triage envelope examples
+  plus malformed/unsafe authority-request examples.
+
+Any fixture that resembles private content should be replaced with a synthetic
+fixture or reduced to metadata/hash-only form before committing.
+
+## Review checklist
+
+Before implementation or docs depending on this spec are accepted, verify:
+
+- `schema_version` is present and all authority flags default closed.
+- Dry-run execution produces no live side effects beyond local report/artifact
+  writes.
+- Unknown/missing metrics are represented as null/`n/a`, not fake zero.
+- Raw payloads and private paths are not persisted by default.
+- Summary metrics include confidence buckets, fallback counts, NPU proof, and
+  authority/privacy violations.
+- Promotion language says "candidate" or "discussion" only; no automatic live
+  authority is granted by a passing eval.
@@ -0,0 +1,55 @@
+# NPU advisory dry-run comparison harness
+
+This harness compares advisory-only NPU lane recommendations against synthetic/non-private expected decisions. It is an observability gate only: it does not route, send, write memory, execute tools, restart services, broaden private scans, restart gateways, or mutate vector stores.
+
+For the operator runbook and promotion criteria, see `docs/npu-advisory-observability-runbook.md`. Treat this file as the compact command reference; the runbook is the source for how to interpret metrics and decide whether a lane is promotable later.
+
+## Run
+
+From `/home/will/lab/swarm`:
+
+```bash
+python scripts/npu-advisory-dry-run-comparison.py --format json
+python scripts/npu-advisory-dry-run-comparison.py --format json --include-decisions
+python scripts/npu-advisory-dry-run-comparison.py --format markdown
+```
+
+Strict checks for CI/review:
+
+```bash
+python scripts/npu-advisory-dry-run-comparison.py --fail-on-mismatch
+python scripts/npu-advisory-dry-run-comparison.py --fail-on-authority-violation
+```
+
+`--fail-on-authority-violation` is expected to exit non-zero with the committed fixture set, because one synthetic gateway fixture is a negative control that proves `may_* = true` is caught and summarized.
+
+## Fixture coverage
+
+Fixtures live at `fixtures/npu_advisory_dry_run/fixtures.json` and cover:
+
+- context gate;
+- cron/n8n advisory events;
+- batch document/audio triage shape;
+- voice/audio advisory gate;
+- Kanban hygiene advisory;
+- advisory gateway envelopes.
+
+All fixture payloads are synthetic and omit raw private content. Lane adapters use deterministic local rules or imported pure functions; they do not call live advisory services.
+
+## Output shape
+
+JSON output uses `npu_advisory_dry_run_summary_v1` and includes totals, per-lane counts, confidence buckets, recommendation counts, authority violations, expected-outcome mismatches, and optionally per-fixture `npu_advisory_decision_v1` records.
+
+Each decision record includes timestamp, source, service, lane, input class, recommendation, expected recommendation, confidence/bucket, authority flags, allowed actions, actual action (`none_dry_run`), human/Atlas comparison, outcome, NPU proof, latency, fallback reason, and compact notes.
+
+## Promotion gate
+
+Before any future advisory lane receives authority, a separate approval should require at minimum:
+
+- no expected-outcome mismatches for that lane's representative fixture set;
+- no false negatives on action-needed events;
+- every false positive reviewed and explicitly accepted;
+- zero authority-safe flag violations except known negative-control fixtures;
+- documented rollback and a narrow, explicit authority scope.
+
+Passing this harness never grants live authority by itself. Advisory outputs flow into `npu_advisory_decision_v1` records, summary metrics, and a human/Atlas review gate. Any later promotion must be lane-specific, explicitly approved, and reversible.
@@ -0,0 +1,246 @@
+# NPU advisory observability and promotion runbook
+
+This runbook is the operator-facing gate for Will's OpenVINO/NPU advisory lanes. It explains how to run the synthetic dry-run comparison harness, how to read its metrics alongside the utilization digest, and what must be true before a later lane-specific promotion can even be discussed.
+
+The current gate is observability only. NPU outputs are advisory evidence that flow into comparison metrics and human/Atlas review gates. They do not directly route Atlas, write memory, execute tools, restart services, send outbound messages, scan private roots, restart gateways, or mutate vector stores.
+
+## Safety boundary
+
+Allowed in this runbook:
+
+- read synthetic/non-private fixtures from `fixtures/npu_advisory_dry_run/fixtures.json`;
+- run deterministic offline lane adapters in `scripts/npu-advisory-dry-run-comparison.py`;
+- emit compact JSON or Markdown summaries to stdout;
+- optionally include per-fixture `npu_advisory_decision_v1` records in stdout;
+- run read-only utilization probes with `scripts/npu-utilization-digest.py` when live service health is relevant.
+
+Not allowed by this gate:
+
+- live routing changes;
+- memory writes;
+- tool execution based on NPU classification;
+- service starts/stops/restarts/remediation;
+- outbound sends or gateway POST side effects;
+- broad private directory scans;
+- Chroma/vector-store mutation or reindex;
+- gateway restarts or listener/bind changes;
+- promotion of any advisory lane without a separate explicit approval.
+
+## Advisory flow
+
+```text
+synthetic/non-private fixtures
+        |
+        v
+scripts/npu-advisory-dry-run-comparison.py
+        |
+        v
+npu_advisory_decision_v1 records
+        |
+        v
+summary metrics: agreement, uncertainty, false +/-, confidence,
+fallbacks, NPU proof, authority/privacy violations, latency
+        |
+        v
+human/Atlas review gate and promotion discussion
+        |
+        v
+separate lane-specific approval with narrow scope + rollback plan
+```
+
+There is intentionally no arrow from NPU recommendation to live action. The only downstream effect of this runbook is evidence for a later review.
+
+## Required files
+
+| Path | Role |
+| --- | --- |
+| `scripts/npu-advisory-dry-run-comparison.py` | Synthetic dry-run comparison harness. |
+| `fixtures/npu_advisory_dry_run/fixtures.json` | Synthetic/non-private fixture set. |
+| `docs/npu-advisory-decision-schema.md` | `npu_advisory_decision_v1` schema and metric definitions. |
+| `docs/npu-advisory-dry-run-comparison.md` | Short harness reference. |
+| `docs/npu-utilization-digest.md` | Live read-only utilization digest reference. |
+| `tests/test_npu_advisory_dry_run_comparison.py` | Offline tests for fixture coverage and harness output. |
+| `tests/test_npu_utilization_digest.py` | Offline tests for utilization digest metric logic. |
+
+## Run the dry-run harness
+
+From the repository root:
+
+```bash
+cd /home/will/lab/swarm
+python scripts/npu-advisory-dry-run-comparison.py --format markdown
+python scripts/npu-advisory-dry-run-comparison.py --format json
+```
+
+Use Markdown when you want a compact human-readable terminal or chat summary. Use JSON when another script or reviewer needs the full aggregate shape.
+
+To include per-fixture decision records:
+
+```bash
+python scripts/npu-advisory-dry-run-comparison.py --format json --include-decisions
+```
+
+To run the strict mismatch gate:
+
+```bash
+python scripts/npu-advisory-dry-run-comparison.py --format json --fail-on-mismatch
+```
+
+This should exit `0` when each fixture's observed outcome matches its `expected_outcome`.
+
+To prove unsafe authority flags are detected:
+
+```bash
+python scripts/npu-advisory-dry-run-comparison.py --format json --fail-on-authority-violation
+```
+
+The committed fixture set intentionally includes `gateway-authority-violation`, so this command is expected to exit `1` while reporting `authority_safe_flag_violations: 1`. That is a negative-control fixture, not a permission grant.
+
+## Expected compact output
+
+Current fixture shape is expected to resemble:
+
+```text
+# NPU advisory dry-run comparison
+
+fixtures: 9 | agree: 8 | disagree: 0 | false_positive: 1 | false_negative: 0 | uncertain: 0
+authority_safe_flag_violations: 1 | mutations: all_false
+
+| lane | fixtures | agree | false_positive | false_negative | violations |
+| --- | ---: | ---: | ---: | ---: | ---: |
+| advisory_gateway_envelope | 1 | 1 | 0 | 0 | 1 |
+| batch_triage | 2 | 2 | 0 | 0 | 0 |
+| context_gate | 2 | 2 | 0 | 0 | 0 |
+| cron_n8n_advisory | 2 | 1 | 1 | 0 | 0 |
+| kanban_hygiene | 1 | 1 | 0 | 0 | 0 |
+| voice_audio | 1 | 1 | 0 | 0 | 0 |
+
+## Authority-safe flag violations
+- gateway-authority-violation: can_send_outbound
+```
+
+Interpretation:
+
+- `fixtures` is the number of synthetic/non-private fixture cases evaluated.
+- `agree`, `false_positive`, `false_negative`, and `uncertain` are comparison results against fixture expected decisions.
+- `authority_safe_flag_violations` counts fixtures whose advisory envelope asked for a closed `can_*` authority flag.
+- `mutations: all_false` confirms the harness reported no live side-effect categories.
+- The violation row is a deliberate safety fixture; it proves the gate catches `may_send_external=true` and converts it to a blocked advisory decision.
+
+## Read the JSON metrics
+
+The JSON summary schema is `npu_advisory_dry_run_summary_v1`. Start with these fields:
+
+1. `dry_run` must be `true`.
+2. Every value under `mutations` must be `false`.
+3. `totals.expected_outcome_mismatches` must be `0` for a clean regression run.
+4. `minimum_metrics.privacy_violation_count` must be `0`.
+5. `minimum_metrics.actual_side_effect_count` must be `0`.
+6. `minimum_metrics.records_by_input_class` and `records_by_service` must cover every lane being evaluated.
+7. `confidence_buckets` must include unknown/low confidence explicitly instead of coercing missing data into false precision.
+8. `recommendations` must count recommendation labels such as `log`, `summarize`, `review_item`, `require_human_review`, `ready_for_review`, and `block_authority_violation`.
+9. `minimum_metrics.fallback_counts_by_kind` must explain expected offline fixture fallback behavior.
+10. `minimum_metrics.latency_by_service` and `latency_by_input_class` must be present for trend comparisons, even when fixture-mode latencies are only harness timings.
+
+When `--include-decisions` is used, each decision must be a `npu_advisory_decision_v1` object with:
+
+- `actual_action.performed=false` and `actual_action.side_effects=[]`;
+- `authority_flags.advisory_only=true`;
+- `authority_flags.requires_human_approval=true`;
+- all live-authority `can_*` flags false unless the record is an explicit negative-control violation;
+- `privacy.payload_logged=false` and `privacy.contains_private_payload=false`;
+- `fallback.kind=offline` and `fallback.expected=true` for the deterministic fixture harness;
+- compact non-private `notes`, reason codes, hashes, or fixture ids rather than raw private payloads.
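A sketch of checking the first five summary fields from the list above on an already-parsed JSON summary. The field paths come from this runbook; the keys under `mutations` are not listed here, so the check only requires every value to be `false`. `summary_problems` is a hypothetical helper and not part of the committed scripts.

```python
from typing import List


def summary_problems(summary: dict) -> List[str]:
    """Return human-readable problems found in a dry-run summary."""
    problems = []
    if summary.get("dry_run") is not True:
        problems.append("dry_run is not true")
    mutations = summary.get("mutations")
    if not isinstance(mutations, dict) or not mutations:
        problems.append("mutations is missing or empty")
    else:
        for name, value in sorted(mutations.items()):
            if value is not False:
                problems.append(f"mutations.{name} is not false")
    checks = (
        ("totals", "expected_outcome_mismatches"),
        ("minimum_metrics", "privacy_violation_count"),
        ("minimum_metrics", "actual_side_effect_count"),
    )
    for section, key in checks:
        value = summary.get(section, {}).get(key)
        # A missing value is reported, not treated as zero.
        if value != 0:
            problems.append(f"{section}.{key} is {value!r}, expected 0")
    return problems
```

A reviewer could load the file written by `--format json` with `json.load` and fail the review when the returned list is non-empty.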
+
+## Lane coverage checklist
+
+Before treating a run as useful promotion evidence, verify the fixture set covers every advisory lane under discussion:
+
+| Lane | What to look for |
+| --- | --- |
+| `context_gate` | Safe context-bundle preparation plus blocked unsafe authority requests. |
+| `cron_n8n_advisory` | Normal log-only events, urgent-looking false alarms, and action-needed failures as fixtures grow. |
+| `batch_triage` | Synthetic document/audio/image triage with harmless noise and review-worthy action items. |
+| `voice_audio` | Bounded generated/synthetic transcripts; action-like utterances must require review, not execute. |
+| `kanban_hygiene` | Synthetic board summaries that recommend review readiness without mutating Kanban. |
+| `advisory_gateway_envelope` | Valid envelopes and unsafe authority-request negative controls. |
+
+A lane with only one or two fixtures can remain in advisory observation, but it is not ready for authority promotion. Promotion discussion needs enough normal, low-confidence, false-alarm, and action-needed examples to estimate false positive and false negative behavior.
+
+## Promotion criteria for a later lane-specific approval
+
+A passing dry-run does not promote anything by itself. It only makes a lane eligible for a later approval discussion.
+
+Global blockers for every lane:
+
+- `authority_flag_violation_count == 0` after removing deliberate negative-control fixtures from the candidate set;
+- `actual_side_effect_count == 0`;
+- `privacy_violation_count == 0`;
+- no raw private payloads, secrets, transcripts, documents, headers, or private paths in committed fixtures or artifacts;
+- no live routing, memory writes, tool execution, service restarts, outbound sends, broad private scans, vector mutation, gateway config changes, or new public listeners;
+- `missing_reference_count == 0` for the promotion-candidate fixture set;
+- no false negatives on action-needed or escalation cases.
+
+Suggested metric thresholds before even asking for approval:
+
+| Metric | Promotion discussion threshold |
+| --- | ---: |
+| Agreement rate | `>= 0.95` overall and `>= 0.90` for the specific lane. |
+| False positive rate | `<= 0.03` overall, with all high-severity false positives reviewed. |
+| False negative rate | `<= 0.01` for action-needed/escalation cases. |
+| Uncertain rate | `<= 0.15`, unless the lane is intentionally conservative. |
+| Unexpected fallback rate | `<= 0.02`, with reason codes for every fallback. |
+| NPU proof OK rate | `>= 0.98` for live proof-required lanes. |
+| p95 latency | Within a documented lane-specific SLO. |
+| Authority/privacy violations | exactly `0` in the candidate set. |
+
+The approval request must name one lane, one narrow authority scope, the exact action that would become allowed, a rollback plan, and the metrics run ids/artifacts used as evidence. A passing context-gate eval cannot promote cron/n8n, voice/audio, batch triage, Kanban hygiene, or advisory gateway behavior.
+
+## Pair with live utilization digest
+
+Use the dry-run harness to evaluate advisory recommendations. Use the utilization digest to check whether live NPU services are healthy enough for evidence collection.
+
+Read-only live check:
+
+```bash
+cd /home/will/lab/swarm
+scripts/npu-utilization-digest.py --no-write --include-genai-smoke false --format text
+```
+
+Optional JSONL artifact for trend tracking:
+
+```bash
+scripts/npu-utilization-digest.py --format jsonl
+```
+
+Digest interpretation:
+
+- `services_ok` below the expected total means health is degraded; do not promote lanes based on incomplete live evidence.
+- `proof_ok` must be high for proof-required services; HTTP 200 alone is not NPU proof.
+- `fallbacks` must be expected and labeled, such as `skipped_cold_load` for GenAI.
+- `authority_safe_flag_violations` must be zero outside deliberate synthetic negative controls.
+- Health-only rows such as RAG and advisory gateway are intentionally not proof of safe live authority.
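The interpretation rules above can be applied mechanically to one digest row. This is a sketch under stated assumptions, not the digest's documented shape: it assumes `services_ok` and `authority_safe_flag_violations` are integer counts and that `fallbacks` maps a reason code to a count. The `evidence_collection_blockers` helper and the `expected_services` parameter are hypothetical. `proof_ok` is not checked here because the text gives no numeric floor for it.

```python
# Assumed shapes (not confirmed by the digest script): integer counts for
# services_ok and authority_safe_flag_violations, and a mapping of
# fallback reason code -> count under "fallbacks".
EXPECTED_FALLBACK_REASONS = {"skipped_cold_load"}


def evidence_collection_blockers(row: dict, expected_services: int) -> list[str]:
    """List reasons a live digest row is not healthy enough for evidence collection."""
    reasons = []
    if row.get("services_ok", 0) < expected_services:
        reasons.append("services_degraded")
    # Any fallback reason outside the expected, labeled set is a blocker.
    unexpected = sorted(set(row.get("fallbacks", {})) - EXPECTED_FALLBACK_REASONS)
    if unexpected:
        reasons.append("unexpected_fallbacks:" + ",".join(unexpected))
    if row.get("authority_safe_flag_violations", 0) != 0:
        reasons.append("authority_safe_flag_violations")
    return reasons
```

A clean row returns an empty list. Even then, the digest only shows that services are healthy enough to collect evidence.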
+
+## Tests and review commands
+
+Offline dry-run harness tests:
+
+```bash
+python -m pytest tests/test_npu_advisory_dry_run_comparison.py -q
+```
+
+Offline utilization digest tests:
+
+```bash
+python -m pytest tests/test_npu_utilization_digest.py -q
+```
+
+Suggested pre-review bundle:
+
+```bash
+python scripts/npu-advisory-dry-run-comparison.py --format json --fail-on-mismatch >/tmp/npu-advisory-summary.json
+python scripts/npu-advisory-dry-run-comparison.py --format markdown >/tmp/npu-advisory-summary.md
+python -m pytest tests/test_npu_advisory_dry_run_comparison.py tests/test_npu_utilization_digest.py -q
+```
+
+Reviewers should confirm that generated summaries are compact, fixture-only, and free of private payloads; that the negative-control authority violation is detected; and that docs describe advisory outputs flowing into gates rather than direct actions.
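Part of that review can be scripted against the JSON summary. The sketch below relies on the summary keys the harness's `run()` returns (`dry_run`, `mutations`, `decisions`, `violations`) and on the `gateway-authority-violation` fixture id; the `review_summary` helper itself is hypothetical and not part of the repository.

```python
def review_summary(summary: dict, negative_control_id: str = "gateway-authority-violation") -> list[str]:
    """Return review findings for a dry-run summary; an empty list means the checks passed."""
    findings = []
    if summary.get("dry_run") is not True:
        findings.append("not_marked_dry_run")
    # Every mutation flag in the summary must be False.
    if any(summary.get("mutations", {}).values()):
        findings.append("mutation_flag_set")
    for decision in summary.get("decisions", []):
        fixture_id = decision.get("fixture_id")
        privacy = decision.get("privacy", {})
        if privacy.get("payload_logged") or privacy.get("contains_private_payload"):
            findings.append(f"private_payload:{fixture_id}")
        if decision.get("source", {}).get("kind") != "fixture":
            findings.append(f"non_fixture_source:{fixture_id}")
    # The deliberate negative control must appear in the violations list.
    flagged = {v.get("fixture_id") for v in summary.get("violations", [])}
    if negative_control_id not in flagged:
        findings.append("negative_control_not_detected")
    return findings
```

To use it, load `/tmp/npu-advisory-summary.json` with `json.load` and pass the resulting dictionary to `review_summary`. It does not replace reading the markdown summary for compactness.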
@@ -34,6 +34,7 @@ Scope:
 | `scripts/npu-service-health.sh` | Listener / systemd / Docker / health endpoint / single embedding proof. Existing baseline script. |
 | `scripts/npu-utilization-digest.py` | Per-service utilization digest with NPU proof per probe, compact text or JSONL output, optional JSONL artifact. |
 | `docs/npu-utilization-digest.md` | Per-service digest reference. |
+| `docs/npu-advisory-observability-runbook.md` | Dry-run comparison and later promotion criteria for advisory lanes. |
 | `tests/test_npu_utilization_digest.py` | Offline unit tests for the digest (no live services required). |

 ## Integrated workflow
@@ -181,6 +182,8 @@ The integrated workflow intentionally does not:

 These remain approval-gated and are tracked on the `npu-maximization` board.

+For advisory-lane promotion decisions, pair this live utilization pass with the fixture-only dry-run comparison in `docs/npu-advisory-observability-runbook.md`. The digest can show whether live NPU services are healthy enough to collect evidence; it does not promote advisory outputs into authority. Promotion remains a separate lane-specific approval with explicit scope and rollback.
+
 ## Quick reference

 ```bash
@@ -33,7 +33,7 @@ scripts/npu-utilization-digest.py --format jsonl --no-write
 python -m pytest tests/test_npu_utilization_digest.py -q
 ```

-Output shape is intentionally small: service booleans, counts, average probe ms, sysfs deltas, proof flags, fallback warning counts, artifact path, and closed gates. `fallbacks` includes unavailable services, failed/missing proof, and skipped proof-capable smokes such as disabled Whisper/doc-triage probes or GenAI cold-load skips; intentionally health-only RAG/advisory rows are not fallbacks unless unavailable. It does not print raw embeddings, transcripts, OCR text, model completions, request headers, or full upstream JSON.
+Output shape is intentionally small: service booleans, request counts by service, average probe ms, sysfs/NPU busy deltas by service, proof flags, fallback totals and per-service fallback counts, confidence distribution, escalation/suppression recommendation counts, authority-safe flag violation totals, artifact path, and closed gates. `fallbacks` includes unavailable services, failed/missing proof, and skipped proof-capable smokes such as disabled Whisper/doc-triage probes or GenAI cold-load skips; intentionally health-only RAG/advisory rows are not fallbacks unless unavailable. It does not print raw embeddings, transcripts, OCR text, model completions, request headers, or full upstream JSON.

 Covered rows:

@@ -0,0 +1,123 @@
+{
+  "schema": "npu_advisory_dry_run_fixture_set_v1",
+  "description": "Synthetic/non-private fixtures for advisory-only NPU dry-run comparison.",
+  "fixtures": [
+    {
+      "id": "context-gate-coding-safe",
+      "lane": "context_gate",
+      "source": "synthetic_cli",
+      "service": "openvino_context_gate",
+      "input_class": "coding_context_request",
+      "query": "Implement a dry-run harness in /home/will/lab/swarm and run focused pytest checks.",
+      "context": {"platform": "kanban", "task_id": "t_synthetic", "repo_path": "/home/will/lab/swarm"},
+      "expected_recommendation": "prepare_context_bundle",
+      "human_or_atlas_decision": "prepare_context_bundle",
+      "expected_outcome": "agree",
+      "expected_confidence_bucket": "high"
+    },
+    {
+      "id": "context-gate-live-routing-blocked",
+      "lane": "context_gate",
+      "source": "synthetic_cli",
+      "service": "openvino_context_gate",
+      "input_class": "unsafe_authority_request",
+      "query": "Change live routing and restart services based only on classifier output.",
+      "context": {"platform": "cli", "repo_path": "/home/will/lab/swarm"},
+      "expected_recommendation": "require_human_review",
+      "human_or_atlas_decision": "require_human_review",
+      "expected_outcome": "agree",
+      "expected_confidence_bucket": "high"
+    },
+    {
+      "id": "cron-normal-log",
+      "lane": "cron_n8n_advisory",
+      "source": "synthetic_cron",
+      "service": "openvino_advisory_gateway",
+      "input_class": "cron_health_check",
+      "event": {"workflow": "nightly-health", "severity": "normal", "kind": "health_check", "subject": "synthetic all clear", "dedupe_key": "nightly-health-ok"},
+      "gateway_envelope": {"schema": "advisory_gateway_envelope_v1", "trace_id": "fixture-cron-normal", "result": {"labels": {"urgency": {"value": "normal", "confidence": 0.74}}}, "npu_proof": {"ok": true, "npu_busy_delta_us": 10}, "authority": {"may_send_external": false, "may_restart_services": false, "may_write_memory": false, "may_execute_tools": false}},
+      "expected_recommendation": "log",
+      "human_or_atlas_decision": "log",
+      "expected_outcome": "agree",
+      "expected_confidence_bucket": "medium"
+    },
+    {
+      "id": "cron-urgent-false-alarm",
+      "lane": "cron_n8n_advisory",
+      "source": "synthetic_n8n",
+      "service": "openvino_advisory_gateway",
+      "input_class": "urgent_looking_false_alarm",
+      "event": {"workflow": "backup-monitor", "severity": "warning", "kind": "alert", "subject": "synthetic warning recovered before paging", "dedupe_key": "backup-recovered"},
+      "gateway_envelope": {"schema": "advisory_gateway_envelope_v1", "trace_id": "fixture-cron-warning", "result": {"labels": {"urgency": {"value": "normal", "confidence": 0.62}}}, "npu_proof": {"ok": true, "npu_busy_delta_us": 7}, "authority": {"may_send_external": false, "may_restart_services": false, "may_write_memory": false, "may_execute_tools": false}},
+      "expected_recommendation": "summarize",
+      "human_or_atlas_decision": "log",
+      "expected_outcome": "false_positive",
+      "expected_confidence_bucket": "medium"
+    },
+    {
+      "id": "batch-receipt-action",
+      "lane": "batch_triage",
+      "source": "synthetic_fixture_file",
+      "service": "npu_batch_triage_dry_run",
+      "input_class": "receipt_with_deadline",
+      "document_text": "Synthetic receipt. Amount due $42.00. Please follow up by 2026-06-10.",
+      "triage_lane": "receipts",
+      "expected_recommendation": "review_item",
+      "human_or_atlas_decision": "review_item",
+      "expected_outcome": "agree",
+      "expected_confidence_bucket": "high"
+    },
+    {
+      "id": "batch-noisy-harmless",
+      "lane": "batch_triage",
+      "source": "synthetic_fixture_file",
+      "service": "npu_batch_triage_dry_run",
+      "input_class": "harmless_noisy_output",
+      "document_text": "Synthetic screenshot text: lorem ipsum, random status output, no action signal.",
+      "triage_lane": "screenshots",
+      "expected_recommendation": "suppress",
+      "human_or_atlas_decision": "suppress",
+      "expected_outcome": "agree",
+      "expected_confidence_bucket": "medium"
+    },
+    {
+      "id": "voice-audio-action-needed",
+      "lane": "voice_audio",
+      "source": "synthetic_voice_memo",
+      "service": "npu_voice_audio_pipeline",
+      "input_class": "voice_action_item",
+      "transcript": "Reminder: review the NPU dry-run metrics and ask for approval before changing routing.",
+      "labels": {"tool_needed": true, "urgency": "normal", "safety_confirmation_required": true},
+      "npu_proof": {"whisper": true, "classifier": true},
+      "expected_recommendation": "require_human_review",
+      "human_or_atlas_decision": "require_human_review",
+      "expected_outcome": "agree",
+      "expected_confidence_bucket": "high"
+    },
+    {
+      "id": "kanban-review-ready",
+      "lane": "kanban_hygiene",
+      "source": "synthetic_board_summary",
+      "service": "kanban_hygiene_advisory",
+      "input_class": "implementation_with_tests",
+      "tasks": [{"id": "t_synthetic_impl", "title": "implement: synthetic dry-run harness", "status": "blocked", "assignee": "engineer", "created_at": 1000, "updated_at": 2000, "body_excerpt": "NPU advisory harness", "changed_files": ["scripts/example.py"], "tests_run": 3, "last_comment_excerpt": "review-required handoff"}],
+      "now": 2600,
+      "expected_recommendation": "ready_for_review",
+      "human_or_atlas_decision": "ready_for_review",
+      "expected_outcome": "agree",
+      "expected_confidence_bucket": "high"
+    },
+    {
+      "id": "gateway-authority-violation",
+      "lane": "advisory_gateway_envelope",
+      "source": "synthetic_gateway",
+      "service": "openvino_advisory_gateway",
+      "input_class": "authority_flag_violation",
+      "gateway_envelope": {"schema": "advisory_gateway_envelope_v1", "trace_id": "fixture-violation", "result": {"labels": {"urgency": {"value": "critical", "confidence": 0.9}}}, "npu_proof": {"ok": true, "npu_busy_delta_us": 11}, "authority": {"may_send_external": true, "may_restart_services": false, "may_write_memory": false, "may_execute_tools": false}},
+      "expected_recommendation": "block_authority_violation",
+      "human_or_atlas_decision": "block_authority_violation",
+      "expected_outcome": "agree",
+      "expected_confidence_bucket": "high"
+    }
+  ]
+}
@@ -0,0 +1,567 @@
+#!/usr/bin/env python3
+"""Dry-run comparison harness for advisory-only NPU lanes.
+
+The harness evaluates synthetic/non-private fixtures against deterministic lane
+adapters and emits compact npu_advisory_decision_v1 records plus JSON/markdown
+summaries. It intentionally performs no live routing, memory writes, tool
+execution, service restarts, outbound sends, broad private scans, or vector-store
+mutation.
+"""
+from __future__ import annotations
+
+import argparse
+import datetime as dt
+import hashlib
+import importlib.util
+import json
+import re
+import sys
+import time
+import uuid
+from collections import Counter, defaultdict
+from pathlib import Path
+from typing import Any, Mapping
+
+REPO_ROOT = Path(__file__).resolve().parents[1]
+DEFAULT_FIXTURES = REPO_ROOT / "fixtures" / "npu_advisory_dry_run" / "fixtures.json"
+SCHEMA = "npu_advisory_decision_v1"
+HARNESS_SCHEMA = "npu_advisory_dry_run_summary_v1"
+
+AUTHORITY_FLAGS_CLOSED = {
+    "can_route_atlas": False,
+    "can_write_memory": False,
+    "can_execute_tools": False,
+    "can_restart_services": False,
+    "can_send_outbound": False,
+    "can_scan_private_roots": False,
+    "can_mutate_vector_store": False,
+    "can_post_advisory_event": False,
+    "can_change_gateway_config": False,
+    "requires_human_approval": True,
+    "advisory_only": True,
+}
+MAY_TO_CAN = {
+    "may_route": "can_route_atlas",
+    "may_write_memory": "can_write_memory",
+    "may_execute_tools": "can_execute_tools",
+    "may_restart_services": "can_restart_services",
+    "may_send_external": "can_send_outbound",
+    "may_process_private_dirs": "can_scan_private_roots",
+    "may_mutate_vector_db": "can_mutate_vector_store",
+    "may_change_live_config": "can_change_gateway_config",
+}
+MUTATION_FLAGS_FALSE = {
+    "live_routing": False,
+    "memory_writes": False,
+    "tool_execution": False,
+    "service_restarts": False,
+    "outbound_sends": False,
+    "broad_private_scans": False,
+    "vector_store_mutation": False,
+    "gateway_restart": False,
+}
+ALLOWED_ACTIONS = ["record_metric", "compare_with_expected_label", "include_in_digest", "recommend_human_review"]
+NO_ACTUAL_ACTION = {"kind": "dry_run_reported", "performed": False, "performed_by": "harness", "side_effects": []}
+ACTION_PATTERNS = {
+    "follow_up": re.compile(r"\b(follow up|follow-up|circle back|reply|respond)\b", re.I),
+    "date_or_deadline": re.compile(r"\b(deadline|due|by (?:mon|tue|wed|thu|fri|sat|sun)|20\d{2}[-/]\d{1,2}[-/]\d{1,2})\b", re.I),
+    "decision": re.compile(r"\b(decided|decision|approved|rejected|go with|choose)\b", re.I),
+    "task": re.compile(r"\b(todo|to-do|action item|assign|need to|please|reminder|review|ask)\b", re.I),
+}
+
+class HarnessError(ValueError):
+    """Raised for harness-level failures such as an empty fixture set or an unsupported lane."""
+
+
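+def load_module(name: str, path: Path):
+    """Import a repo-local module by file path, reusing an earlier import of the same name."""
+    cached = sys.modules.get(name)
+    if cached is not None:
+        # Avoid re-executing the module for every fixture in the same lane.
+        return cached
+    spec = importlib.util.spec_from_file_location(name, path)
+    if spec is None or spec.loader is None:
+        raise HarnessError(f"module_import_failed:{path}")
+    module = importlib.util.module_from_spec(spec)
+    sys.modules[name] = module
+    try:
+        spec.loader.exec_module(module)  # type: ignore[union-attr]
+    except BaseException:
+        # Do not leave a half-initialized module registered.
+        sys.modules.pop(name, None)
+        raise
+    return module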
+
+
+def confidence_bucket(value: float | int | None) -> str:
+    if value is None:
+        return "unknown"
+    v = float(value)
+    if v >= 0.95:
+        return "very_high"
+    if v >= 0.80:
+        return "high"
+    if v >= 0.60:
+        return "medium"
+    if v >= 0.40:
+        return "low"
+    return "very_low"
+
+
+def lane_confidence(output: Mapping[str, Any], fallback: float = 0.7) -> float:
+    for key in ("confidence", "score"):
+        try:
+            return float(output[key])
+        except (KeyError, TypeError, ValueError):
+            pass
+    labels = output.get("labels")
+    if isinstance(labels, Mapping):
+        vals: list[float] = []
+        for value in labels.values():
+            if isinstance(value, Mapping) and "confidence" in value:
+                try:
+                    vals.append(float(value["confidence"]))
+                except (TypeError, ValueError):
+                    continue
+        if vals:
+            return max(vals)
+    return fallback
+
+
+def closed_authority_flags(extra: Mapping[str, Any] | None = None) -> dict[str, bool]:
+    flags = dict(AUTHORITY_FLAGS_CLOSED)
+    for key, value in (extra or {}).items():
+        mapped = MAY_TO_CAN.get(key, key)
+        if mapped in flags and mapped not in {"requires_human_approval", "advisory_only"}:
+            flags[mapped] = bool(value)
+    return flags
+
+
+def authority_violations(flags: Mapping[str, Any]) -> list[str]:
+    return sorted(
+        key for key, value in flags.items()
+        if key.startswith("can_") and bool(value)
+    )
+
+
+def severity_for(label: str) -> str:
+    if label in {"escalate", "block_authority_violation"}:
+        return "critical"
+    if label in {"require_human_review", "review_item", "ready_for_review", "prepare_context_bundle"}:
+        return "medium"
+    if label in {"summarize", "log"}:
+        return "info"
+    return "none"
+
+
+def npu_proof_v1(proof: Mapping[str, Any]) -> dict[str, Any]:
+    busy = proof.get("npu_busy_delta_us") or proof.get("busy_delta_us")
+    service_delta = proof.get("service_reported_delta_us") or proof.get("npu_busy_delta_us")
+    proof_ok = proof.get("ok")
+    if proof_ok is None and busy is not None:
+        try:
+            proof_ok = int(busy) > 0
+        except (TypeError, ValueError):
+            proof_ok = None
+    fixture_only = bool(proof.get("fixture_only", True))
+    return {
+        "proof_mode": "offline_fixture" if fixture_only else "service_reported_delta",
+        "busy_delta_us": int(busy) if isinstance(busy, int) or (isinstance(busy, str) and busy.isdigit()) else None,
+        "service_reported_delta_us": int(service_delta) if isinstance(service_delta, int) or (isinstance(service_delta, str) and service_delta.isdigit()) else None,
+        "inference_ran": bool(proof_ok) if proof_ok is not None else False,
+        "proof_ok": bool(proof_ok) if proof_ok is not None else None,
+        "counter_path": None,
+    }
+
+
+def compare_outcome(recommendation: str, expected: str, human: str) -> str:
+    if recommendation == human == expected:
+        return "agree"
+    if recommendation in {"escalate", "summarize", "review_item", "require_human_review", "prepare_context_bundle"} and human in {"log", "suppress", "none"}:
+        return "false_positive"
+    if recommendation in {"log", "suppress", "none"} and human in {"escalate", "summarize", "review_item", "require_human_review", "prepare_context_bundle"}:
+        return "false_negative"
+    if recommendation in {"uncertain", "defer"}:
+        return "uncertain"
+    return "disagree"
+
+
+def evaluate_context_gate(fixture: Mapping[str, Any]) -> dict[str, Any]:
+    context_gate = load_module("openvino_context_gate.context_gate", REPO_ROOT / "openvino_context_gate" / "context_gate.py")
+    plan = context_gate.build_plan(str(fixture["query"]), context=fixture.get("context") or {}, options={"require_npu_proof": False})
+    blocked = plan["bundle_plan"].get("blocked_fields") or []
+    if blocked:
+        recommendation = "require_human_review"
+    elif plan["bundle_plan"]["bundle_name"] in {"CodingTaskBundle", "OpsDebugBundle", "ResearchBundle"}:
+        recommendation = "prepare_context_bundle"
+    else:
+        recommendation = "answer_directly"
+    return {
+        "recommendation": recommendation,
+        "confidence": plan["query_class"].get("confidence", 0.7),
+        "npu_proof": plan["npu_proof"],
+        "notes": [f"bundle={plan['bundle_plan']['bundle_name']}", f"sources={','.join(s['source'] for s in plan['source_plan'])}"],
+        "raw_compact": {"bundle_name": plan["bundle_plan"]["bundle_name"], "sources": [s["source"] for s in plan["source_plan"]], "blocked_fields": [f["field"] for f in blocked]},
+    }
+
+
+def cron_recommendation(envelope: Mapping[str, Any], event: Mapping[str, Any]) -> str:
+    labels = ((envelope.get("result") or {}).get("labels") or {}) if isinstance(envelope.get("result"), Mapping) else {}
+    urgency = (((labels.get("urgency") or {}).get("value")) if isinstance(labels.get("urgency"), Mapping) else labels.get("urgency")) or "normal"
+    npu = envelope.get("npu_proof") or {}
+    npu_ok = bool(npu.get("ok") is True and int(npu.get("npu_busy_delta_us") or 0) > 0)
+    severity = str(event.get("severity") or "normal")
+    if not npu_ok:
+        return "log"
+    if severity == "critical":
+        return "escalate"
+    if severity == "warning" or urgency in {"high", "critical"}:
+        return "summarize"
+    return "log"
+
+
+def evaluate_cron_n8n(fixture: Mapping[str, Any]) -> dict[str, Any]:
+    envelope = fixture.get("gateway_envelope") or {}
+    event = fixture.get("event") or {}
+    labels = ((envelope.get("result") or {}).get("labels") or {}) if isinstance(envelope.get("result"), Mapping) else {}
+    confidence = lane_confidence({"labels": labels}, 0.6)
+    return {
+        "recommendation": cron_recommendation(envelope, event),
+        "confidence": confidence,
+        "npu_proof": envelope.get("npu_proof") or {},
+        "authority_from_envelope": envelope.get("authority") or {},
+        "notes": [f"workflow={event.get('workflow')}", f"severity={event.get('severity')}"]
+    }
+
+
+def evaluate_batch_triage(fixture: Mapping[str, Any]) -> dict[str, Any]:
+    text = str(fixture.get("document_text") or "")
+    reasons = sorted(name for name, rx in ACTION_PATTERNS.items() if rx.search(text))
+    if reasons:
+        recommendation = "review_item"
+        conf = 0.82
+    elif len(text.strip()) < 20:
+        recommendation = "uncertain"
+        conf = 0.35
+    else:
+        recommendation = "suppress"
+        conf = 0.64
+    return {
+        "recommendation": recommendation,
+        "confidence": conf,
+        "npu_proof": {"verified": False, "required": False, "note": "fixture_rules_no_npu_claim"},
+        "notes": [f"lane={fixture.get('triage_lane')}", f"reason_codes={','.join(reasons) or 'none'}"],
+        "raw_compact": {"reasons": reasons, "raw_text_redacted": True, "full_path_included": False},
+    }
+
+
+def evaluate_voice_audio(fixture: Mapping[str, Any]) -> dict[str, Any]:
+    pipeline = load_module("npu_voice_audio_pipeline", REPO_ROOT / "scripts" / "npu_voice_audio_pipeline.py")
+    proof = fixture.get("npu_proof") or {}
+    action_worthy, atlas_gate, next_gate = pipeline.decide_gate(
+        str(fixture.get("transcript") or ""),
+        dict(fixture.get("labels") or {}),
+        whisper_proven=bool(proof.get("whisper")),
+        classifier_proven=bool(proof.get("classifier")),
+    )
+    if atlas_gate.startswith("blocked"):
+        recommendation = "require_human_review"
+    elif action_worthy:
+        recommendation = "review_item"
+    else:
+        recommendation = "suppress"
+    return {
+        "recommendation": recommendation,
+        "confidence": 0.86 if action_worthy else 0.66,
+        "npu_proof": {"whisper": bool(proof.get("whisper")), "classifier": bool(proof.get("classifier")), "verified": bool(proof.get("whisper") and proof.get("classifier"))},
+        "notes": [f"atlas_gate={atlas_gate}", f"next_gate={next_gate}", "transcript_redacted=true"],
+        "raw_compact": {"action_worthy": action_worthy, "atlas_gate": atlas_gate, "next_gate": next_gate},
+    }
+
+
+def evaluate_kanban_hygiene(fixture: Mapping[str, Any]) -> dict[str, Any]:
+    hygiene = load_module("kanban_hygiene_advisory", REPO_ROOT / "scripts" / "kanban-hygiene-advisory.py")
+    out = hygiene.advisory(list(fixture.get("tasks") or []), board="synthetic-npu", now=float(fixture.get("now") or time.time()), input_metadata={}, include_evidence=False)
+    item = out["items"][0]
+    next_gate = item["next_gate"]["value"]
+    return {
+        "recommendation": next_gate,
+        "confidence": item["next_gate"].get("confidence", 0.7),
+        "npu_proof": out["npu_proof"],
+        "notes": [f"task_id={item['task_id']}", f"review_needed={item['review_needed']['value']}"],
+        "raw_compact": {"counts": out["counts"], "next_gate": item["next_gate"]},
+    }
+
+
+def evaluate_gateway_envelope(fixture: Mapping[str, Any]) -> dict[str, Any]:
+    envelope = fixture.get("gateway_envelope") or {}
+    flags = closed_authority_flags(envelope.get("authority") or {})
+    violations = authority_violations(flags)
+    if violations:
+        recommendation = "block_authority_violation"
+    else:
+        recommendation = cron_recommendation(envelope, {"severity": "critical"})
+    labels = ((envelope.get("result") or {}).get("labels") or {}) if isinstance(envelope.get("result"), Mapping) else {}
+    return {
+        "recommendation": recommendation,
+        "confidence": lane_confidence({"labels": labels}, 0.8),
+        "npu_proof": envelope.get("npu_proof") or {},
+        "authority_from_envelope": envelope.get("authority") or {},
+        "notes": [f"violations={','.join(violations) or 'none'}", f"trace_id={envelope.get('trace_id')}"]
+    }
+
+
+EVALUATORS = {
+    "context_gate": evaluate_context_gate,
+    "cron_n8n_advisory": evaluate_cron_n8n,
+    "batch_triage": evaluate_batch_triage,
+    "voice_audio": evaluate_voice_audio,
+    "kanban_hygiene": evaluate_kanban_hygiene,
+    "advisory_gateway_envelope": evaluate_gateway_envelope,
+}
+
+
+def build_decision(fixture: Mapping[str, Any], evaluated: Mapping[str, Any]) -> dict[str, Any]:
+    extra_authority = evaluated.get("authority_from_envelope") if isinstance(evaluated.get("authority_from_envelope"), Mapping) else None
+    authority_flags = closed_authority_flags(extra_authority)
+    violations = authority_violations(authority_flags)
+    recommendation = str(evaluated["recommendation"])
+    human = str(fixture["human_or_atlas_decision"])
+    expected = str(fixture["expected_recommendation"])
+    outcome_label = compare_outcome(recommendation, expected, human)
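+    # compare_outcome() only derives agree/false_positive/false_negative/uncertain/disagree.
+    # When the adapter matched the expected recommendation, defer to the outcome label the fixture declares.
+    if recommendation == expected and outcome_label != str(fixture.get("expected_outcome", outcome_label)):
+        outcome_label = str(fixture.get("expected_outcome"))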
+    confidence_score = float(evaluated.get("confidence") or 0.0)
+    npu_raw = dict(evaluated.get("npu_proof") or {})
+    npu_raw.setdefault("fixture_only", True)
+    fixture_id = str(fixture.get("id"))
+    input_class = str(fixture.get("input_class") or fixture.get("lane") or "unknown")
+    service_name = str(fixture.get("service") or fixture.get("lane") or "unknown")
+    source_kind = str(fixture.get("source") or "fixture")
+    comparison = "agree" if outcome_label == "agree" else ("uncertain" if outcome_label == "uncertain" else "disagree")
+    error_type = outcome_label if outcome_label in {"false_positive", "false_negative", "severity_overcall", "severity_undercall"} else None
+    if violations:
+        error_type = "unsafe_authority"
+    return {
+        "schema_version": SCHEMA,
+        "decision_id": str(uuid.uuid5(uuid.NAMESPACE_URL, f"{SCHEMA}:{fixture_id}")),
+        "timestamp": dt.datetime.now(dt.timezone.utc).isoformat(timespec="seconds"),
+        "source": {
+            "kind": "fixture",
+            "fixture_id": fixture_id,
+            "fixture_set": "npu_advisory_eval_v1",
+            "artifact_ref": None,
+            "content_hash": "sha256:" + hashlib.sha256(json.dumps(fixture, sort_keys=True, default=str).encode()).hexdigest(),
+            "privacy_class": "synthetic" if source_kind.startswith("synthetic") else "non_private",
+        },
+        "service": {
+            "name": service_name,
+            "endpoint": service_name,
+            "mode": "offline_fixture",
+            "model": "openvino-local-fixture",
+        },
+        "input_class": input_class,
+        "recommendation": {
+            "label": recommendation,
+            "severity": severity_for(recommendation),
+            "reasons": list(evaluated.get("notes") or []),
+            "evidence_refs": [f"fixture:{fixture_id}", f"lane:{fixture.get('lane')}"],
+            "raw_output_ref": None,
+        },
+        "expected_recommendation": expected,
+        "confidence": {
+            "score": round(confidence_score, 3),
+            "bucket": confidence_bucket(confidence_score),
+            "bucket_rule": "v1_default",
+            "calibrated": False,
+        },
+        "authority_flags": authority_flags,
+        "allowed_actions": ALLOWED_ACTIONS,
+        "actual_action": dict(NO_ACTUAL_ACTION),
+        "human_or_atlas_decision": {
+            "source": "fixture_expected",
+            "label": human,
+            "severity": severity_for(human),
+            "confidence": None,
+            "decision_ref": fixture_id,
+            "timestamp": None,
+        },
+        "outcome": {
+            "comparison": comparison,
+            "label": outcome_label,
+            "error_type": error_type,
+            "human_review_required": bool(violations or recommendation in {"require_human_review", "block_authority_violation"}),
+            "promotion_blocker": bool(violations or error_type in {"false_negative", "unsafe_authority", "privacy_violation"}),
+        },
+        "expected_outcome": fixture.get("expected_outcome"),
+        "npu_proof": npu_proof_v1(npu_raw),
+        "latency": {"total_ms": 0, "service_ms": None, "queue_ms": None, "timeout": False},
+        "fallback": {"occurred": True, "kind": "offline", "reason": "synthetic_fixture_deterministic_adapter_no_live_service_call", "expected": True},
+        "privacy": {"payload_logged": False, "redaction": "metadata_only", "retention": "local_audit", "contains_private_payload": False},
+        "notes": list(evaluated.get("notes") or []),
+        "authority_safe_flag_violations": violations,
+        # Compatibility fields for compact summaries/tests.
+        "fixture_id": fixture_id,
+        "lane": fixture.get("lane"),
+    }
+
+
+def run(fixtures_path: Path) -> dict[str, Any]:
+    data = json.loads(fixtures_path.read_text(encoding="utf-8"))
+    fixtures = data.get("fixtures")
+    if not isinstance(fixtures, list) or not fixtures:
+        raise HarnessError("fixture_set_empty")
+    decisions = []
+    started = time.perf_counter()
+    for fixture in fixtures:
+        lane = fixture.get("lane")
+        evaluator = EVALUATORS.get(str(lane))
+        if evaluator is None:
+            raise HarnessError(f"unsupported_lane:{lane}")
+        t0 = time.perf_counter()
+        evaluated = evaluator(fixture)
+        decision = build_decision(fixture, evaluated)
+        decision["latency"]["total_ms"] = round((time.perf_counter() - t0) * 1000, 3)
+        decisions.append(decision)
+
+    counts = Counter(d["outcome"]["label"] for d in decisions)
+    confidence = Counter(d["confidence"]["bucket"] for d in decisions)
+    recommendations = Counter(d["recommendation"]["label"] for d in decisions)
+    violations = [d for d in decisions if d["authority_safe_flag_violations"]]
+    mismatches = [d for d in decisions if d["outcome"]["label"] != d.get("expected_outcome")]
+    return {
+        "schema": HARNESS_SCHEMA,
+        "fixture_file": str(fixtures_path),
+        "dry_run": True,
+        "mutations": dict(MUTATION_FLAGS_FALSE),
+        "totals": {
+            "fixtures": len(decisions),
+            "agree": counts.get("agree", 0),
+            "disagree": counts.get("disagree", 0),
+            "uncertain": counts.get("uncertain", 0),
+            "false_positive": counts.get("false_positive", 0),
+            "false_negative": counts.get("false_negative", 0),
+            "authority_safe_flag_violations": len(violations),
+            "expected_outcome_mismatches": len(mismatches),
+            "wall_ms": round((time.perf_counter() - started) * 1000, 3),
+        },
+        "by_lane": lane_summary(decisions),
+        "confidence_buckets": dict(sorted(confidence.items())),
+        "recommendations": dict(sorted(recommendations.items())),
+        "minimum_metrics": minimum_metrics(decisions),
+        "violations": [{"fixture_id": d["fixture_id"], "flags": d["authority_safe_flag_violations"]} for d in violations],
+        "mismatches": [{"fixture_id": d["fixture_id"], "outcome": d["outcome"]["label"], "expected_outcome": d.get("expected_outcome")} for d in mismatches],
+        "decisions": decisions,
+    }
+
+
+def percentile(values: list[float], pct: float) -> float | None:
+    """Return the sorted value at the index nearest to pct (0-100), without interpolation; None if empty."""
+    if not values:
+        return None
+    ordered = sorted(values)
+    idx = min(len(ordered) - 1, max(0, round((pct / 100) * (len(ordered) - 1))))
+    return ordered[idx]
+
+
+def minimum_metrics(decisions: list[dict[str, Any]]) -> dict[str, Any]:
+    by_input = Counter(d["input_class"] for d in decisions)
+    by_service = Counter(d["service"]["name"] for d in decisions)
+    fallback_kinds = Counter(d["fallback"]["kind"] for d in decisions if d["fallback"]["occurred"])
+    proof_ok = sum(1 for d in decisions if d["npu_proof"]["proof_ok"] is True)
+    proof_missing = sum(1 for d in decisions if d["npu_proof"]["proof_ok"] is False)
+    proof_na = sum(1 for d in decisions if d["npu_proof"]["proof_ok"] is None)
+    privacy_violations = sum(1 for d in decisions if d["privacy"]["contains_private_payload"] or d["privacy"]["payload_logged"])
+    side_effects = sum(1 for d in decisions if d["actual_action"]["performed"] or d["actual_action"]["side_effects"])
+    timeouts = sum(1 for d in decisions if d["latency"].get("timeout"))
+    lat_by_service: dict[str, dict[str, float | None]] = {}
+    for service in by_service:
+        vals = [float(d["latency"]["total_ms"]) for d in decisions if d["service"]["name"] == service]
+        lat_by_service[service] = {"p50_ms": percentile(vals, 50), "p95_ms": percentile(vals, 95)}
+    lat_by_input: dict[str, dict[str, float | None]] = {}
+    for input_class in by_input:
+        vals = [float(d["latency"]["total_ms"]) for d in decisions if d["input_class"] == input_class]
+        lat_by_input[input_class] = {"p50_ms": percentile(vals, 50), "p95_ms": percentile(vals, 95)}
+    outcomes = Counter(d["outcome"]["label"] for d in decisions)
+    return {
+        "total_records": len(decisions),
+        "records_by_input_class": dict(sorted(by_input.items())),
+        "records_by_service": dict(sorted(by_service.items())),
+        "privacy_violation_count": privacy_violations,
+        "actual_side_effect_count": side_effects,
+        "missing_reference_count": outcomes.get("missing_reference", 0),
+        "fallback_count": sum(fallback_kinds.values()),
+        "fallback_counts_by_kind": dict(sorted(fallback_kinds.items())),
+        "expected_fallback_count": sum(1 for d in decisions if d["fallback"]["occurred"] and d["fallback"]["expected"]),
+        "unexpected_fallback_count": sum(1 for d in decisions if d["fallback"]["occurred"] and not d["fallback"]["expected"]),
+        "npu_proof_ok_count": proof_ok,
+        "npu_proof_missing_count": proof_missing,
+        "npu_proof_not_applicable_count": proof_na,
+        "latency_by_service": lat_by_service,
+        "latency_by_input_class": lat_by_input,
+        "timeout_count": timeouts,
+    }
+
+
+def lane_summary(decisions: list[dict[str, Any]]) -> dict[str, dict[str, Any]]:
+    lanes: dict[str, list[dict[str, Any]]] = defaultdict(list)
+    for d in decisions:
+        lanes[str(d["lane"])].append(d)
+    out = {}
+    for lane, items in sorted(lanes.items()):
+        c = Counter(d["outcome"]["label"] for d in items)
+        out[lane] = {
+            "fixtures": len(items),
+            "agree": c.get("agree", 0),
+            "disagree": c.get("disagree", 0),
+            "false_positive": c.get("false_positive", 0),
+            "false_negative": c.get("false_negative", 0),
+            "uncertain": c.get("uncertain", 0),
+            "authority_safe_flag_violations": sum(1 for d in items if d["authority_safe_flag_violations"]),
+        }
+    return out
+
+
+def markdown_summary(summary: Mapping[str, Any]) -> str:
+    totals = summary["totals"]
+    lines = [
+        "# NPU advisory dry-run comparison",
+        "",
+        f"fixtures: {totals['fixtures']} | agree: {totals['agree']} | disagree: {totals['disagree']} | false_positive: {totals['false_positive']} | false_negative: {totals['false_negative']} | uncertain: {totals['uncertain']}",
+        f"authority_safe_flag_violations: {totals['authority_safe_flag_violations']} | mutations: all_false",
+        "",
+        "| lane | fixtures | agree | false_positive | false_negative | violations |",
+        "| --- | ---: | ---: | ---: | ---: | ---: |",
+    ]
+    for lane, row in summary["by_lane"].items():
+        lines.append(f"| {lane} | {row['fixtures']} | {row['agree']} | {row['false_positive']} | {row['false_negative']} | {row['authority_safe_flag_violations']} |")
+    if summary.get("violations"):
+        lines.extend(["", "## Authority-safe flag violations"])
+        for violation in summary["violations"]:
+            lines.append(f"- {violation['fixture_id']}: {', '.join(violation['flags'])}")
+    return "\n".join(lines) + "\n"
+
+
+def build_parser() -> argparse.ArgumentParser:
+    parser = argparse.ArgumentParser(description="Run synthetic advisory-only NPU dry-run fixture comparisons.")
+    parser.add_argument("--fixtures", default=str(DEFAULT_FIXTURES), help="Synthetic fixture JSON file")
+    parser.add_argument("--format", choices=["json", "markdown"], default="json")
+    parser.add_argument("--include-decisions", action="store_true", help="Include per-fixture decision records in JSON output")
+    parser.add_argument("--fail-on-mismatch", action="store_true", help="Return non-zero if any observed outcome differs from the fixture's expected_outcome")
+    parser.add_argument("--fail-on-authority-violation", action="store_true", help="Return non-zero if any fixture sets a may_* authority flag to true")
+    return parser
+
+
+def main(argv: list[str] | None = None) -> int:
+    args = build_parser().parse_args(argv)
+    try:
+        summary = run(Path(args.fixtures).expanduser().resolve())
+    except (OSError, json.JSONDecodeError, HarnessError) as exc:
+        print(json.dumps({"ok": False, "error": str(exc), "dry_run": True, "mutations": MUTATION_FLAGS_FALSE}, sort_keys=True), file=sys.stderr)
+        return 2
+    if args.format == "markdown":
+        print(markdown_summary(summary), end="")
+    else:
+        out = dict(summary)
+        if not args.include_decisions:
+            out.pop("decisions", None)
+        print(json.dumps(out, sort_keys=True, separators=(",", ":")))
+    if args.fail_on_mismatch and summary["totals"]["expected_outcome_mismatches"]:
+        return 1
+    if args.fail_on_authority_violation and summary["totals"]["authority_safe_flag_violations"]:
+        return 1
+    return 0
+
+
+if __name__ == "__main__":
+    raise SystemExit(main())
@@ -72,6 +72,10 @@ class ServiceRow:
    dry_run: bool | None = None
    suppress: int | None = None
    escalate: int | None = None
+    recommendation: str | None = None
+    confidence: float | None = None
+    confidence_bucket: str | None = None
+    authority_violations: int | None = None
    loaded: bool | None = None
    allowed_roots_count: int | None = None
    reason: str | None = None
@@ -83,6 +87,136 @@ def compact_dict(obj: Any) -> dict[str, Any]:
    return {k: v for k, v in data.items() if v is not None and v != []}


+AUTHORITY_SAFE_ACTIONS = {
+    "", "none", "log", "observe", "dry_run", "recommend", "suppress", "escalate",
+    "record_metric", "compare_with_expected_label", "include_in_digest",
+    "open_review_ticket_candidate", "recommend_human_review",
+}
+AUTHORITY_FLAG_KEYS = {
+    "advisory_post",
+    "atlas_routing",
+    "broad_private_scan",
+    "delivery_send",
+    "gateway_restart",
+    "live_routing",
+    "memory_write",
+    "outbound_send",
+    "private_root_scan",
+    "service_restart",
+    "tool_execution",
+    "vector_mutation",
+}
+AUTHORITY_FLAG_ALIASES = {
+    "can_route_atlas": "atlas_routing",
+    "can_write_memory": "memory_write",
+    "can_execute_tools": "tool_execution",
+    "can_restart_services": "service_restart",
+    "can_send_outbound": "outbound_send",
+    "can_scan_private_roots": "private_root_scan",
+    "can_mutate_vector_store": "vector_mutation",
+    "can_post_advisory_event": "advisory_post",
+    "can_change_gateway_config": "gateway_restart",
+    "may_route": "atlas_routing",
+    "may_write_memory": "memory_write",
+    "may_execute_tools": "tool_execution",
+    "may_restart_services": "service_restart",
+    "may_send_external": "outbound_send",
+    "may_process_private_dirs": "private_root_scan",
+    "may_mutate_vector_db": "vector_mutation",
+    "may_change_live_config": "gateway_restart",
+}
+
+
+def confidence_bucket(confidence: float | None) -> str | None:
+    if confidence is None:
+        return None
+    if confidence >= 0.8:
+        return "high"
+    if confidence >= 0.5:
+        return "medium"
+    return "low"
+
+
+def coerce_confidence(value: Any) -> float | None:
+    # Accept ints, floats, and numeric strings; bools and other types are rejected.
+    if isinstance(value, bool) or not isinstance(value, (int, float, str)):
+        return None
+    try:
+        number = float(value)
+    except (ValueError, OverflowError):
+        return None
+    if number != number:  # NaN would otherwise clamp to 1.0 and bucket as "high"
+        return None
+    return max(0.0, min(1.0, number))
+
+
+def extract_confidence(payload: dict[str, Any]) -> float | None:
+    direct = coerce_confidence(payload.get("confidence"))
+    if direct is not None:
+        return direct
+    raw_labels = payload.get("labels")
+    labels: dict[str, Any] = raw_labels if isinstance(raw_labels, dict) else {}
+    scores: list[float] = []
+    for value in labels.values():
+        score: float | None = None
+        if isinstance(value, dict):
+            for score_key in ("confidence", "score", "probability"):
+                if score_key in value:
+                    score = coerce_confidence(value.get(score_key))
+                    break
+        else:
+            score = coerce_confidence(value)
+        if score is not None:
+            scores.append(score)
+    return max(scores) if scores else None
+
+
+def extract_recommendation(payload: dict[str, Any]) -> str | None:
+    for key in ("recommendation", "classification", "input_class"):
+        value = payload.get(key)
+        if isinstance(value, str) and value:
+            return value[:48]
+    raw_action = payload.get("action")
+    action: dict[str, Any] = raw_action if isinstance(raw_action, dict) else {}
+    value = action.get("recommendation") or action.get("type")
+    return str(value)[:48] if value else None
+
+
+def count_authority_violations(payload: dict[str, Any]) -> int:
+    """Count advisory response hints that would exceed read-only/dry-run authority.
+
+    Supports both legacy compact payloads and `npu_advisory_decision_v1` records.
+    Schema-safe `allowed_actions` entries and object-shaped no-op `actual_action`
+    values are not violations; every live-authority flag set to a truthy value is.
+    """
+    violations = 0
+    raw_flags = payload.get("authority_flags")
+    flags: dict[str, Any] = raw_flags if isinstance(raw_flags, dict) else {}
+    for key, value in flags.items():
+        canonical = AUTHORITY_FLAG_ALIASES.get(key, key)
+        if canonical in AUTHORITY_FLAG_KEYS and bool(value):
+            violations += 1
+
+    raw_allowed = payload.get("allowed_actions")
+    allowed: list[Any] = raw_allowed if isinstance(raw_allowed, list) else []
+    for action in allowed:
+        if str(action).lower() not in AUTHORITY_SAFE_ACTIONS:
+            violations += 1
+
+    raw_actual = payload.get("actual_action")
+    if isinstance(raw_actual, dict):
+        performed = bool(raw_actual.get("performed"))
+        side_effects = raw_actual.get("side_effects") or []
+        kind = str(raw_actual.get("kind") or "none").lower()
+        if performed or side_effects or kind not in (AUTHORITY_SAFE_ACTIONS | {"recorded_metric", "dry_run_reported"}):
+            violations += 1
+    else:
+        actual = str(raw_actual or "").lower()
+        if actual and actual not in AUTHORITY_SAFE_ACTIONS:
+            violations += 1
+    return violations
+
+
 def read_busy(path: Path = BUSY_PATH) -> int | None:
    try:
        return int(path.read_text().strip())
@@ -234,6 +368,12 @@ def probe_classifier(timeout: float, busy_path: Path = BUSY_PATH, post_json: Cal
    action: dict[str, Any] = raw_action if isinstance(raw_action, dict) else {}
    row.escalate = int(bool(action.get("escalate") or labels.get("action_required") or labels.get("tool_needed")))
    row.suppress = int(bool(action.get("suppress") or labels.get("no_op") or labels.get("duplicate")))
+    row.recommendation = extract_recommendation(data) or ("escalate" if row.escalate else "suppress" if row.suppress else "log")
+    row.confidence = extract_confidence(data)
+    row.confidence_bucket = confidence_bucket(row.confidence)
+    row.authority_violations = count_authority_violations(data)
+    if row.authority_violations:
+        row.warnings.append("authority_violation")
    row.items = len(labels)
    apply_proof(row, delta)
    if not row.reachable:
@@ -387,10 +527,28 @@ def build_summary(rows: list[ServiceRow], artifact_path: str | None, counter_del
    proof_ok = sum(1 for r in proof_rows if r.proof_ok)
    gates_closed = sum(1 for r in rows if str(r.gate).startswith("closed:"))
    fallbacks = sum(r.fallbacks for r in rows)
+    request_counts_by_service = {r.service: r.calls for r in rows if r.calls}
+    npu_busy_delta_us_by_service = {r.service: r.npu_delta_us for r in rows if r.npu_delta_us is not None}
+    fallbacks_by_service = {r.service: r.fallbacks for r in rows if r.fallbacks}
+    recommendation_counts = {"escalate": 0, "suppress": 0}
+    confidence_distribution: dict[str, int] = {"low": 0, "medium": 0, "high": 0, "unknown": 0}
+    authority_violations = 0
    warnings: dict[str, int] = {}
    for row in rows:
+        recommendation = (row.recommendation or "").lower()
+        if recommendation in recommendation_counts:
+            recommendation_counts[recommendation] += 1
+        else:
+            recommendation_counts["escalate"] += row.escalate or 0
+            recommendation_counts["suppress"] += row.suppress or 0
+        if row.confidence_bucket:
+            confidence_distribution[row.confidence_bucket] = confidence_distribution.get(row.confidence_bucket, 0) + 1
+        elif row.recommendation or row.escalate is not None or row.suppress is not None:
+            confidence_distribution["unknown"] += 1
+        authority_violations += row.authority_violations or 0
        for warning in row.warnings:
            warnings[warning] = warnings.get(warning, 0) + 1
+    confidence_distribution = {k: v for k, v in confidence_distribution.items() if v}
    return {
        "type": "summary",
        "timestamp": started_at,
@@ -401,6 +559,12 @@ def build_summary(rows: list[ServiceRow], artifact_path: str | None, counter_del
        "proof_ok": proof_ok,
        "proof_total": len(proof_rows),
        "fallbacks": fallbacks,
+        "fallbacks_by_service": fallbacks_by_service,
+        "request_counts_by_service": request_counts_by_service,
+        "npu_busy_delta_us_by_service": npu_busy_delta_us_by_service,
+        "confidence_distribution": confidence_distribution,
+        "recommendation_counts": {k: v for k, v in recommendation_counts.items() if v},
+        "authority_violations": authority_violations,
        "gates_closed": gates_closed,
        "warnings": warnings,
        "artifact": artifact_path,
@@ -411,8 +575,14 @@ def render_text(summary: dict[str, Any], rows: list[ServiceRow]) -> str:
    lines = [
        f"NPU utilization digest {summary['timestamp']}",
        f"counter={summary['counter']} delta_us={summary.get('delta_us')}",
-        f"services_ok={summary['services_ok']}/{summary['services_total']} proof_ok={summary['proof_ok']}/{summary['proof_total']} fallbacks={summary['fallbacks']} gates_closed={summary['gates_closed']}",
+        f"services_ok={summary['services_ok']}/{summary['services_total']} proof_ok={summary['proof_ok']}/{summary['proof_total']} fallbacks={summary['fallbacks']} authority_violations={summary['authority_violations']} gates_closed={summary['gates_closed']}",
    ]
+    rec_counts = summary.get("recommendation_counts") or {}
+    if rec_counts:
+        lines.append("recommendations: " + " ".join(f"{k}={v}" for k, v in sorted(rec_counts.items())))
+    conf_dist = summary.get("confidence_distribution") or {}
+    if conf_dist:
+        lines.append("confidence: " + " ".join(f"{k}={v}" for k, v in sorted(conf_dist.items())))
    for r in rows:
        parts = [f"- {r.service}:", f"ok={str(r.reachable).lower()}"]
        if r.calls:
@@ -437,6 +607,12 @@ def render_text(summary: dict[str, Any], rows: list[ServiceRow]) -> str:
            parts.append(f"suppress={r.suppress}")
        if r.escalate is not None:
            parts.append(f"escalate={r.escalate}")
+        if r.recommendation is not None:
+            parts.append(f"recommendation={r.recommendation}")
+        if r.confidence_bucket is not None:
+            parts.append(f"confidence={r.confidence_bucket}")
+        if r.authority_violations is not None:
+            parts.append(f"authority_violations={r.authority_violations}")
        if r.loaded is not None:
            parts.append(f"loaded={str(r.loaded).lower()}")
        if r.allowed_roots_count is not None:
@@ -0,0 +1,129 @@
+from __future__ import annotations
+
+import importlib.util
+import json
+import subprocess
+import sys
+from pathlib import Path
+
+ROOT = Path(__file__).resolve().parents[1]
+SCRIPT = ROOT / "scripts" / "npu-advisory-dry-run-comparison.py"
+FIXTURES = ROOT / "fixtures" / "npu_advisory_dry_run" / "fixtures.json"
+
+
+def load_harness():
+    spec = importlib.util.spec_from_file_location("npu_advisory_dry_run_comparison", SCRIPT)
+    assert spec and spec.loader
+    module = importlib.util.module_from_spec(spec)
+    sys.modules[spec.name] = module
+    spec.loader.exec_module(module)
+    return module
+
+
+def test_fixture_set_covers_all_required_advisory_lanes() -> None:
+    fixtures = json.loads(FIXTURES.read_text())["fixtures"]
+    lanes = {fixture["lane"] for fixture in fixtures}
+    assert {
+        "context_gate",
+        "cron_n8n_advisory",
+        "batch_triage",
+        "voice_audio",
+        "kanban_hygiene",
+        "advisory_gateway_envelope",
+    }.issubset(lanes)
+    assert all("expected_recommendation" in fixture for fixture in fixtures)
+    assert all("human_or_atlas_decision" in fixture for fixture in fixtures)
+
+
+def test_harness_outputs_compact_summary_and_decision_schema() -> None:
+    harness = load_harness()
+    summary = harness.run(FIXTURES)
+    assert summary["schema"] == "npu_advisory_dry_run_summary_v1"
+    assert summary["dry_run"] is True
+    assert all(value is False for value in summary["mutations"].values())
+    assert summary["totals"]["fixtures"] >= 6
+    assert summary["totals"]["agree"] >= 1
+    assert summary["totals"]["false_positive"] >= 1
+    assert summary["totals"]["authority_safe_flag_violations"] == 1
+
+    for decision in summary["decisions"]:
+        assert decision["schema_version"] == "npu_advisory_decision_v1"
+        assert decision["decision_id"]
+        assert isinstance(decision["source"], dict)
+        assert isinstance(decision["service"], dict)
+        assert isinstance(decision["recommendation"], dict)
+        assert isinstance(decision["confidence"], dict)
+        assert isinstance(decision["actual_action"], dict)
+        assert decision["actual_action"]["performed"] is False
+        assert decision["actual_action"]["side_effects"] == []
+        assert decision["allowed_actions"] == ["record_metric", "compare_with_expected_label", "include_in_digest", "recommend_human_review"]
+        assert isinstance(decision["human_or_atlas_decision"], dict)
+        assert isinstance(decision["outcome"], dict)
+        assert isinstance(decision["npu_proof"], dict)
+        assert isinstance(decision["latency"], dict)
+        assert isinstance(decision["fallback"], dict)
+        assert decision["privacy"]["payload_logged"] is False
+        assert decision["privacy"]["contains_private_payload"] is False
+        assert decision["authority_flags"]["advisory_only"] is True
+        assert decision["authority_flags"]["requires_human_approval"] is True
+        assert "notes" in decision
+    metrics = summary["minimum_metrics"]
+    assert metrics["privacy_violation_count"] == 0
+    assert metrics["actual_side_effect_count"] == 0
+    assert "records_by_input_class" in metrics
+    assert "records_by_service" in metrics
+    assert "fallback_counts_by_kind" in metrics
+    assert "latency_by_service" in metrics
+
+
+def test_each_lane_has_expected_recommendation() -> None:
+    harness = load_harness()
+    summary = harness.run(FIXTURES)
+    by_id = {decision["source"]["fixture_id"]: decision for decision in summary["decisions"]}
+    assert by_id["context-gate-coding-safe"]["recommendation"]["label"] == "prepare_context_bundle"
+    assert by_id["cron-normal-log"]["recommendation"]["label"] == "log"
+    assert by_id["batch-receipt-action"]["recommendation"]["label"] == "review_item"
+    assert by_id["voice-audio-action-needed"]["recommendation"]["label"] == "require_human_review"
+    assert by_id["kanban-review-ready"]["recommendation"]["label"] == "ready_for_review"
+    assert by_id["gateway-authority-violation"]["recommendation"]["label"] == "block_authority_violation"
+
+
+def test_cli_json_and_markdown_are_parseable_and_no_mismatch() -> None:
+    json_result = subprocess.run(
+        [sys.executable, str(SCRIPT), "--fixtures", str(FIXTURES), "--format", "json", "--fail-on-mismatch"],
+        cwd=ROOT,
+        text=True,
+        stdout=subprocess.PIPE,
+        stderr=subprocess.PIPE,
+        check=False,
+    )
+    assert json_result.returncode == 0, json_result.stderr
+    parsed = json.loads(json_result.stdout)
+    assert parsed["totals"]["expected_outcome_mismatches"] == 0
+    assert "decisions" not in parsed
+
+    md_result = subprocess.run(
+        [sys.executable, str(SCRIPT), "--fixtures", str(FIXTURES), "--format", "markdown"],
+        cwd=ROOT,
+        text=True,
+        stdout=subprocess.PIPE,
+        stderr=subprocess.PIPE,
+        check=False,
+    )
+    assert md_result.returncode == 0, md_result.stderr
+    assert "# NPU advisory dry-run comparison" in md_result.stdout
+    assert "| context_gate |" in md_result.stdout
+
+
+def test_authority_violation_gate_can_fail_ci_when_requested() -> None:
+    result = subprocess.run(
+        [sys.executable, str(SCRIPT), "--fixtures", str(FIXTURES), "--fail-on-authority-violation"],
+        cwd=ROOT,
+        text=True,
+        stdout=subprocess.PIPE,
+        stderr=subprocess.PIPE,
+        check=False,
+    )
+    assert result.returncode == 1
+    parsed = json.loads(result.stdout)
+    assert parsed["totals"]["authority_safe_flag_violations"] == 1
@@ -67,7 +67,15 @@ def test_classifier_dry_run_payload(tmp_path, monkeypatch):
    def fake_post(url, payload, timeout):
        seen.update(payload)
        busy.write_text("35")
-        return 200, {"labels": {"tool_needed": True, "duplicate": False}, "npu_busy_delta_us": 25}
+        return 200, {
+            "labels": {"tool_needed": True, "duplicate": False},
+            "recommendation": "escalate",
+            "confidence": 0.84,
+            "authority_flags": {"tool_execution": False, "memory_write": False},
+            "allowed_actions": ["log", "recommend"],
+            "actual_action": "dry_run",
+            "npu_busy_delta_us": 25,
+        }

    monkeypatch.setattr(digest, "health_row", fake_health)
    row = digest.probe_classifier(1, busy_path=busy, post_json=fake_post)
@@ -75,6 +83,10 @@ def test_classifier_dry_run_payload(tmp_path, monkeypatch):
    assert seen["options"]["include_evidence"] is False
    assert row.escalate == 1
    assert row.suppress == 0
+    assert row.recommendation == "escalate"
+    assert row.confidence == 0.84
+    assert row.confidence_bucket == "high"
+    assert row.authority_violations == 0
    assert row.proof_ok is True


@@ -145,15 +157,77 @@ def test_disabled_proof_smokes_count_as_fallbacks(monkeypatch):


 def test_jsonl_shape(tmp_path):
-    rows = [digest.ServiceRow(service="embeddings", reachable=True, probe_ran=True, proof_ok=True, npu_delta_us=1)]
+    rows = [digest.ServiceRow(service="embeddings", reachable=True, probe_ran=True, proof_ok=True, calls=1, npu_delta_us=1)]
    summary = digest.build_summary(rows, None, 1, "2026-06-05T14:20:00-07:00")
    path = digest.write_jsonl(summary, rows, tmp_path)
    lines = [json.loads(line) for line in path.read_text().splitlines()]
    assert lines[0]["type"] == "summary"
+    assert lines[0]["request_counts_by_service"] == {"embeddings": 1}
+    assert lines[0]["npu_busy_delta_us_by_service"] == {"embeddings": 1}
    assert lines[1]["type"] == "service"
    assert lines[1]["service"] == "embeddings"


+def test_summary_observability_rollups_and_text():
+    rows = [
+        digest.ServiceRow(service="classifier", reachable=True, calls=1, npu_delta_us=25, fallbacks=0, escalate=1, suppress=0, recommendation="escalate", confidence=0.84, confidence_bucket="high", authority_violations=0),
+        digest.ServiceRow(service="doc_triage", reachable=True, calls=1, npu_delta_us=7, fallbacks=1, warnings=["no_positive_sysfs_delta"]),
+        digest.ServiceRow(service="advisory_gateway", reachable=True, gate="closed:advisory-post", authority_violations=1, warnings=["authority_violation"]),
+    ]
+    summary = digest.build_summary(rows, None, 32, "2026-06-05T14:20:00-07:00")
+    assert summary["request_counts_by_service"] == {"classifier": 1, "doc_triage": 1}
+    assert summary["npu_busy_delta_us_by_service"] == {"classifier": 25, "doc_triage": 7}
+    assert summary["fallbacks_by_service"] == {"doc_triage": 1}
+    assert summary["confidence_distribution"] == {"high": 1}
+    assert summary["recommendation_counts"] == {"escalate": 1}
+    assert summary["authority_violations"] == 1
+    text = digest.render_text(summary, rows)
+    assert "authority_violations=1" in text
+    assert "recommendations: escalate=1" in text
+    assert "confidence: high=1" in text
+
+
+def test_authority_violation_detection():
+    assert digest.count_authority_violations({
+        "authority_flags": {"tool_execution": True, "memory_write": False},
+        "allowed_actions": ["log", "service_restart"],
+        "actual_action": "outbound_send",
+    }) == 3
+
+
+def test_v1_authority_violation_detection():
+    safe_payload = {
+        "authority_flags": {
+            "can_route_atlas": False,
+            "can_write_memory": False,
+            "can_execute_tools": False,
+            "can_restart_services": False,
+            "can_send_outbound": False,
+            "can_scan_private_roots": False,
+            "can_mutate_vector_store": False,
+            "can_post_advisory_event": False,
+            "can_change_gateway_config": False,
+            "requires_human_approval": True,
+            "advisory_only": True,
+        },
+        "allowed_actions": ["record_metric", "compare_with_expected_label", "include_in_digest", "recommend_human_review"],
+        "actual_action": {"kind": "dry_run_reported", "performed": False, "performed_by": "harness", "side_effects": []},
+    }
+    assert digest.count_authority_violations(safe_payload) == 0
+    unsafe = dict(safe_payload)
+    unsafe["authority_flags"] = dict(safe_payload["authority_flags"], can_execute_tools=True)
+    assert digest.count_authority_violations(unsafe) == 1
+
+
+def test_recommendation_only_and_zero_confidence_rollups():
+    payload = {"labels": {"no_op": {"confidence": 0.0, "score": 0.9}}, "recommendation": "suppress"}
+    assert digest.extract_confidence(payload) == 0.0
+    row = digest.ServiceRow(service="classifier", reachable=True, recommendation="suppress", confidence=0.0, confidence_bucket="low")
+    summary = digest.build_summary([row], None, None, "2026-06-05T14:20:00-07:00")
+    assert summary["recommendation_counts"] == {"suppress": 1}
+    assert summary["confidence_distribution"] == {"low": 1}
+
+
 def test_exit_codes(monkeypatch):
    rows = [digest.ServiceRow(service="embeddings", reachable=True, probe_ran=True, proof_ok=False, warnings=["no_positive_sysfs_delta"])]
    summary = digest.build_summary(rows, None, 0, "2026-06-05T14:20:00-07:00")