docs(npu): document advisory observability gates
Add operator runbook and link integrated health docs for advisory-only observability, dry-run metrics, and future promotion criteria.
@@ -0,0 +1,246 @@
# NPU advisory observability and promotion runbook

This runbook is the operator-facing gate for Will's OpenVINO/NPU advisory lanes. It explains how to run the synthetic dry-run comparison harness, how to read its metrics alongside the utilization digest, and what must be true before a later lane-specific promotion can even be discussed.

The current gate is observability only. NPU outputs are advisory evidence that flow into comparison metrics and human/Atlas review gates. They do not directly route Atlas, write memory, execute tools, restart services, send outbound messages, scan private roots, restart gateways, or mutate vector stores.

## Safety boundary

Allowed in this runbook:

- read synthetic/non-private fixtures from `fixtures/npu_advisory_dry_run/fixtures.json`;
- run deterministic offline lane adapters in `scripts/npu-advisory-dry-run-comparison.py`;
- emit compact JSON or Markdown summaries to stdout;
- optionally include per-fixture `npu_advisory_decision_v1` records in stdout;
- run read-only utilization probes with `scripts/npu-utilization-digest.py` when live service health is relevant.

Not allowed by this gate:

- live routing changes;
- memory writes;
- tool execution based on NPU classification;
- service starts/stops/restarts/remediation;
- outbound sends or gateway POST side effects;
- broad private directory scans;
- Chroma/vector-store mutation or reindex;
- gateway restarts or listener/bind changes;
- promotion of any advisory lane without a separate explicit approval.

## Advisory flow

```text
synthetic/non-private fixtures
        |
        v
scripts/npu-advisory-dry-run-comparison.py
        |
        v
npu_advisory_decision_v1 records
        |
        v
summary metrics: agreement, uncertainty, false +/-, confidence,
                 fallbacks, NPU proof, authority/privacy violations, latency
        |
        v
human/Atlas review gate and promotion discussion
        |
        v
separate lane-specific approval with narrow scope + rollback plan
```

There is intentionally no arrow from NPU recommendation to live action. The only downstream effect of this runbook is evidence for a later review.

## Required files

| Path | Role |
| --- | --- |
| `scripts/npu-advisory-dry-run-comparison.py` | Synthetic dry-run comparison harness. |
| `fixtures/npu_advisory_dry_run/fixtures.json` | Synthetic/non-private fixture set. |
| `docs/npu-advisory-decision-schema.md` | `npu_advisory_decision_v1` schema and metric definitions. |
| `docs/npu-advisory-dry-run-comparison.md` | Short harness reference. |
| `docs/npu-utilization-digest.md` | Live read-only utilization digest reference. |
| `tests/test_npu_advisory_dry_run_comparison.py` | Offline tests for fixture coverage and harness output. |
| `tests/test_npu_utilization_digest.py` | Offline tests for utilization digest metric logic. |

## Run the dry-run harness

From the repository root:

```bash
cd /home/will/lab/swarm
python scripts/npu-advisory-dry-run-comparison.py --format markdown
python scripts/npu-advisory-dry-run-comparison.py --format json
```

Use Markdown when you want a compact human-readable terminal or chat summary. Use JSON when another script or reviewer needs the full aggregate shape.

To include per-fixture decision records:

```bash
python scripts/npu-advisory-dry-run-comparison.py --format json --include-decisions
```

To run the strict mismatch gate:

```bash
python scripts/npu-advisory-dry-run-comparison.py --format json --fail-on-mismatch
```

This should exit `0` when each fixture's observed outcome matches its `expected_outcome`.

To prove unsafe authority flags are detected:

```bash
python scripts/npu-advisory-dry-run-comparison.py --format json --fail-on-authority-violation
```

The committed fixture set intentionally includes `gateway-authority-violation`, so this command is expected to exit `1` while reporting `authority_safe_flag_violations: 1`. That is a negative-control fixture, not a permission grant.
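The two gate commands have opposite expected exit codes with the committed fixtures, which is easy to get backwards in automation. The following is a minimal Python sketch of a wrapper that treats the expected `1` from the authority gate as a pass. It assumes only the flags and exit codes described above; the names `run_gate`, `check_gates`, and `GATES` are hypothetical and not part of the harness.

```python
import subprocess
import sys

# Expected exit codes with the committed fixture set, as described above.
# The negative-control fixture makes the authority gate exit 1 on purpose.
GATES = {
    "--fail-on-mismatch": 0,
    "--fail-on-authority-violation": 1,
}


def run_gate(command, expected_exit):
    """Run a command and report whether its exit code matches the expectation."""
    result = subprocess.run(command, capture_output=True, text=True)
    return result.returncode == expected_exit, result.returncode


def check_gates(harness="scripts/npu-advisory-dry-run-comparison.py"):
    """Run both gates from the repository root; return True if both behave as expected."""
    all_ok = True
    for flag, expected in GATES.items():
        command = [sys.executable, harness, "--format", "json", flag]
        ok, actual = run_gate(command, expected)
        print(f"{flag}: exit={actual} expected={expected} ok={ok}")
        all_ok = all_ok and ok
    return all_ok
```

Calling `check_gates()` from the repository root runs both commands. If the negative-control fixture is ever removed from the fixture set, the expected code for `--fail-on-authority-violation` in `GATES` would need to change with it.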

## Expected compact output

Current fixture shape is expected to resemble:

```text
# NPU advisory dry-run comparison

fixtures: 9 | agree: 8 | disagree: 0 | false_positive: 1 | false_negative: 0 | uncertain: 0
authority_safe_flag_violations: 1 | mutations: all_false

| lane | fixtures | agree | false_positive | false_negative | violations |
| --- | ---: | ---: | ---: | ---: | ---: |
| advisory_gateway_envelope | 1 | 1 | 0 | 0 | 1 |
| batch_triage | 2 | 2 | 0 | 0 | 0 |
| context_gate | 2 | 2 | 0 | 0 | 0 |
| cron_n8n_advisory | 2 | 1 | 1 | 0 | 0 |
| kanban_hygiene | 1 | 1 | 0 | 0 | 0 |
| voice_audio | 1 | 1 | 0 | 0 | 0 |

## Authority-safe flag violations
- gateway-authority-violation: can_send_outbound
```

Interpretation:

- `fixtures` is the number of synthetic/non-private fixture cases evaluated.
- `agree`, `disagree`, `false_positive`, `false_negative`, and `uncertain` are comparison results against each fixture's expected decision.
- `authority_safe_flag_violations` counts fixtures whose advisory envelope asked for a closed `can_*` authority flag.
- `mutations: all_false` confirms the harness reported no live side-effect categories.
- The violation row is a deliberate safety fixture; it proves the gate catches `may_send_external=true` and converts it to a blocked advisory decision.
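A quick sanity check on the totals line is that the outcome buckets account for every fixture. The sketch below parses the `key: value | key: value` line format shown above and checks that sum. It assumes the five outcome buckets are mutually exclusive, which holds for the sample output but is not confirmed here by the schema; `parse_totals` and `outcomes_add_up` are hypothetical helper names.

```python
def parse_totals(line):
    """Parse 'key: value | key: value' into a dict, converting integer values."""
    totals = {}
    for part in line.split("|"):
        key, _, value = part.partition(":")
        value = value.strip()
        totals[key.strip()] = int(value) if value.isdigit() else value
    return totals


def outcomes_add_up(totals):
    """Check that the outcome buckets sum to the fixture count.

    Assumption: each fixture lands in exactly one bucket.
    """
    buckets = ("agree", "disagree", "false_positive", "false_negative", "uncertain")
    return sum(totals[name] for name in buckets) == totals["fixtures"]


# Totals line copied from the sample output above.
line = (
    "fixtures: 9 | agree: 8 | disagree: 0 | false_positive: 1 "
    "| false_negative: 0 | uncertain: 0"
)
totals = parse_totals(line)
print(totals["fixtures"], outcomes_add_up(totals))
```

For scripted checks, the JSON format is the better input; this parser is only useful when a Markdown summary has been pasted into a chat or ticket.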

## Read the JSON metrics

The JSON summary schema is `npu_advisory_dry_run_summary_v1`. Start with these fields:

1. `dry_run` must be `true`.
2. Every value under `mutations` must be `false`.
3. `totals.expected_outcome_mismatches` must be `0` for a clean regression run.
4. `minimum_metrics.privacy_violation_count` must be `0`.
5. `minimum_metrics.actual_side_effect_count` must be `0`.
6. `minimum_metrics.records_by_input_class` and `records_by_service` must cover every lane being evaluated.
7. `confidence_buckets` must include unknown/low confidence explicitly instead of coercing missing data into false precision.
8. `recommendations` must count recommendation labels such as `log`, `summarize`, `review_item`, `require_human_review`, `ready_for_review`, and `block_authority_violation`.
9. `minimum_metrics.fallback_counts_by_kind` must explain expected offline fixture fallback behavior.
10. `minimum_metrics.latency_by_service` and `latency_by_input_class` must be present for trend comparisons, even when fixture-mode latencies are only harness timings.
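The first five checks in the list above are mechanical and can be scripted. The following is a minimal sketch, assuming the dotted field names map directly to nested JSON objects; the helper names and the mutation category names in the example are hypothetical placeholders, not the harness's actual keys.

```python
# Fields that must be exactly zero, taken from the checklist above.
ZERO_FIELDS = (
    "totals.expected_outcome_mismatches",
    "minimum_metrics.privacy_violation_count",
    "minimum_metrics.actual_side_effect_count",
)


def get_path(data, dotted):
    """Follow a dotted path such as 'totals.expected_outcome_mismatches'."""
    for key in dotted.split("."):
        data = data[key]
    return data


def check_summary(summary):
    """Return a list of problems; an empty list means checks 1-5 pass."""
    problems = []
    if summary.get("dry_run") is not True:
        problems.append("dry_run is not true")
    mutations = summary.get("mutations")
    if not isinstance(mutations, dict) or not mutations:
        problems.append("mutations is missing or empty")
    else:
        dirty = sorted(name for name, value in mutations.items() if value is not False)
        if dirty:
            problems.append(f"mutations not false: {dirty}")
    for path in ZERO_FIELDS:
        try:
            value = get_path(summary, path)
        except (KeyError, TypeError):
            problems.append(f"{path} is missing")
            continue
        if value != 0:
            problems.append(f"{path} is {value}, expected 0")
    return problems


# Hypothetical minimal summary; the mutation category names are placeholders.
example_summary = {
    "dry_run": True,
    "mutations": {"memory_write": False, "outbound_send": False},
    "totals": {"expected_outcome_mismatches": 0},
    "minimum_metrics": {"privacy_violation_count": 0, "actual_side_effect_count": 0},
}
print(check_summary(example_summary))
```

Checks 6 through 10 depend on which lanes are under evaluation and on the metric definitions in `docs/npu-advisory-decision-schema.md`, so they are left to reviewer judgment here.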

When `--include-decisions` is used, each decision must be an `npu_advisory_decision_v1` object with:

- `actual_action.performed=false` and `actual_action.side_effects=[]`;
- `authority_flags.advisory_only=true`;
- `authority_flags.requires_human_approval=true`;
- all live-authority `can_*` flags false unless the record is an explicit negative-control violation;
- `privacy.payload_logged=false` and `privacy.contains_private_payload=false`;
- `fallback.kind=offline` and `fallback.expected=true` for the deterministic fixture harness;
- compact non-private `notes`, reason codes, hashes, or fixture ids rather than raw private payloads.
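The per-record requirements above can be checked the same way. This is a sketch under one stated assumption: that the `can_*` flags sit inside `authority_flags` next to `advisory_only`. The function name `check_decision` and its `negative_control` parameter are hypothetical.

```python
def check_decision(decision, negative_control=False):
    """Return a list of problems for one npu_advisory_decision_v1 record."""
    problems = []

    action = decision.get("actual_action", {})
    if action.get("performed") is not False or action.get("side_effects") != []:
        problems.append("actual_action reports a performed action or side effects")

    flags = decision.get("authority_flags", {})
    if flags.get("advisory_only") is not True:
        problems.append("authority_flags.advisory_only is not true")
    if flags.get("requires_human_approval") is not True:
        problems.append("authority_flags.requires_human_approval is not true")

    # Assumption: live-authority can_* flags are stored under authority_flags.
    open_flags = sorted(
        name for name, value in flags.items() if name.startswith("can_") and value
    )
    if open_flags and not negative_control:
        problems.append(f"live authority flags set: {open_flags}")

    privacy = decision.get("privacy", {})
    if privacy.get("payload_logged") is not False:
        problems.append("privacy.payload_logged is not false")
    if privacy.get("contains_private_payload") is not False:
        problems.append("privacy.contains_private_payload is not false")

    fallback = decision.get("fallback", {})
    if fallback.get("kind") != "offline" or fallback.get("expected") is not True:
        problems.append("fallback is not the expected offline fixture fallback")

    return problems


# Hypothetical record containing only the fields named in the list above.
example_decision = {
    "actual_action": {"performed": False, "side_effects": []},
    "authority_flags": {
        "advisory_only": True,
        "requires_human_approval": True,
        "can_send_outbound": False,
    },
    "privacy": {"payload_logged": False, "contains_private_payload": False},
    "fallback": {"kind": "offline", "expected": True},
}
print(check_decision(example_decision))
```

Passing `negative_control=True` is meant only for records such as `gateway-authority-violation`, where an open `can_*` flag is the point of the fixture. The last requirement in the list (no raw private payloads in notes) is not something this sketch can verify.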

## Lane coverage checklist

Before treating a run as useful promotion evidence, verify the fixture set covers every advisory lane under discussion:

| Lane | What to look for |
| --- | --- |
| `context_gate` | Safe context-bundle preparation plus blocked unsafe authority requests. |
| `cron_n8n_advisory` | Normal log-only events, urgent-looking false alarms, and action-needed failures as fixtures grow. |
| `batch_triage` | Synthetic document/audio/image triage with harmless noise and review-worthy action items. |
| `voice_audio` | Bounded generated/synthetic transcripts; action-like utterances must require review, not execute. |
| `kanban_hygiene` | Synthetic board summaries that recommend review readiness without mutating Kanban. |
| `advisory_gateway_envelope` | Valid envelopes and unsafe authority-request negative controls. |

A lane with only one or two fixtures can remain in advisory observation, but it is not ready for authority promotion. Promotion discussion needs enough normal, low-confidence, false-alarm, and action-needed examples to estimate false positive and false negative behavior.
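The "one or two fixtures" rule can be turned into a simple floor check. The sketch below counts fixtures per lane and lists lanes below a minimum of three. That number is a hypothetical floor derived from the sentence above, and clearing it is necessary but not sufficient for promotion discussion. How `fixtures.json` records the lane of each fixture is not shown here, so the function takes a plain list of lane names.

```python
from collections import Counter

# Lane names from the coverage table above.
LANES = (
    "context_gate",
    "cron_n8n_advisory",
    "batch_triage",
    "voice_audio",
    "kanban_hygiene",
    "advisory_gateway_envelope",
)

# Hypothetical floor: "one or two fixtures" is explicitly not enough.
MIN_FIXTURES_FOR_DISCUSSION = 3


def thin_lanes(fixture_lanes, lanes=LANES, minimum=MIN_FIXTURES_FOR_DISCUSSION):
    """Return {lane: fixture_count} for lanes below the minimum, including absent lanes."""
    counts = Counter(fixture_lanes)
    return {lane: counts[lane] for lane in lanes if counts[lane] < minimum}


# Per-lane fixture counts taken from the sample compact output.
sample = (
    ["advisory_gateway_envelope"] * 1
    + ["batch_triage"] * 2
    + ["context_gate"] * 2
    + ["cron_n8n_advisory"] * 2
    + ["kanban_hygiene"] * 1
    + ["voice_audio"] * 1
)
print(thin_lanes(sample))
```

With the committed nine-fixture set, every lane falls below this floor, which is consistent with the runbook's position that the current gate is observation only.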

## Promotion criteria for a later lane-specific approval

A passing dry-run does not promote anything by itself. It only makes a lane eligible for a later approval discussion.

Global blockers for every lane:

- `authority_flag_violation_count == 0` after removing deliberate negative-control fixtures from the candidate set;
- `actual_side_effect_count == 0`;
- `privacy_violation_count == 0`;
- no raw private payloads, secrets, transcripts, documents, headers, or private paths in committed fixtures or artifacts;
- no live routing, memory writes, tool execution, service restarts, outbound sends, broad private scans, vector mutation, gateway config changes, or new public listeners;
- `missing_reference_count == 0` for the promotion-candidate fixture set;
- no false negatives on action-needed or escalation cases.

Suggested metric thresholds before even asking for approval:

| Metric | Promotion discussion threshold |
| --- | ---: |
| Agreement rate | `>= 0.95` overall and `>= 0.90` for the specific lane. |
| False positive rate | `<= 0.03` overall, with all high-severity false positives reviewed. |
| False negative rate | `<= 0.01` for action-needed/escalation cases. |
| Uncertain rate | `<= 0.15`, unless the lane is intentionally conservative. |
| Unexpected fallback rate | `<= 0.02`, with reason codes for every fallback. |
| NPU proof OK rate | `>= 0.98` for live proof-required lanes. |
| p95 latency | Within a documented lane-specific SLO. |
| Authority/privacy violations | exactly `0` in the candidate set. |

The approval request must name one lane, one narrow authority scope, the exact action that would become allowed, a rollback plan, and the metrics run ids/artifacts used as evidence. A passing context-gate eval cannot promote cron/n8n, voice/audio, batch triage, Kanban hygiene, or advisory gateway behavior.
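The numeric rows of the threshold table can be expressed as a small lookup. The sketch below is illustrative only: the rate names are hypothetical keys rather than fields of the summary schema, and it assumes each rate has already been computed according to the metric definitions in `docs/npu-advisory-decision-schema.md`. It covers neither the global blockers nor the p95 latency SLO.

```python
# Numeric thresholds copied from the table above; key names are hypothetical.
THRESHOLDS = {
    "agreement_rate": (">=", 0.95),
    "lane_agreement_rate": (">=", 0.90),
    "false_positive_rate": ("<=", 0.03),
    "false_negative_rate": ("<=", 0.01),
    "uncertain_rate": ("<=", 0.15),
    "unexpected_fallback_rate": ("<=", 0.02),
    "npu_proof_ok_rate": (">=", 0.98),
}


def threshold_failures(rates):
    """Return one message per threshold that is missing or not met."""
    failures = []
    for name, (op, limit) in THRESHOLDS.items():
        if name not in rates:
            failures.append(f"{name}: missing")
            continue
        value = rates[name]
        ok = value >= limit if op == ">=" else value <= limit
        if not ok:
            failures.append(f"{name}: {value:.3f} not {op} {limit}")
    return failures


# Rates derived from the sample output by dividing counts by fixture count,
# which is an assumption about the rate definitions. The fallback and proof
# values are placeholders, since fixture mode does not exercise them.
sample_rates = {
    "agreement_rate": 8 / 9,
    "lane_agreement_rate": 1 / 2,  # cron_n8n_advisory: 1 of 2 agree
    "false_positive_rate": 1 / 9,
    "false_negative_rate": 0.0,
    "uncertain_rate": 0.0,
    "unexpected_fallback_rate": 0.0,
    "npu_proof_ok_rate": 1.0,
}
for failure in threshold_failures(sample_rates):
    print(failure)
```

Under these assumptions the committed fixture set fails the agreement and false positive thresholds, which is expected for a nine-fixture set with a deliberate false-alarm case. An empty result from this check would still only make a lane eligible for discussion.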

## Pair with live utilization digest

Use the dry-run harness to evaluate advisory recommendations. Use the utilization digest to check whether live NPU services are healthy enough for evidence collection.

Read-only live check:

```bash
cd /home/will/lab/swarm
scripts/npu-utilization-digest.py --no-write --include-genai-smoke false --format text
```

Optional JSONL artifact for trend tracking:

```bash
scripts/npu-utilization-digest.py --format jsonl
```

Digest interpretation:

- `services_ok` below the expected total means health is degraded; do not promote lanes based on incomplete live evidence.
- `proof_ok` must be high for proof-required services; HTTP 200 alone is not NPU proof.
- `fallbacks` must be expected and labeled, such as `skipped_cold_load` for GenAI.
- `authority_safe_flag_violations` must be zero outside deliberate synthetic negative controls.
- Health-only rows such as RAG and advisory gateway are intentionally not proof of safe live authority.

## Tests and review commands

Offline dry-run harness tests:

```bash
python -m pytest tests/test_npu_advisory_dry_run_comparison.py -q
```

Offline utilization digest tests:

```bash
python -m pytest tests/test_npu_utilization_digest.py -q
```

Suggested pre-review bundle:

```bash
python scripts/npu-advisory-dry-run-comparison.py --format json --fail-on-mismatch >/tmp/npu-advisory-summary.json
python scripts/npu-advisory-dry-run-comparison.py --format markdown >/tmp/npu-advisory-summary.md
python -m pytest tests/test_npu_advisory_dry_run_comparison.py tests/test_npu_utilization_digest.py -q
```

Reviewers should confirm that generated summaries are compact, fixture-only, and free of private payloads; that the negative-control authority violation is detected; and that docs describe advisory outputs flowing into gates rather than direct actions.
@@ -34,6 +34,7 @@ Scope:
| `scripts/npu-service-health.sh` | Listener / systemd / Docker / health endpoint / single embedding proof. Existing baseline script. |
| `scripts/npu-utilization-digest.py` | Per-service utilization digest with NPU proof per probe, compact text or JSONL output, optional JSONL artifact. |
| `docs/npu-utilization-digest.md` | Per-service digest reference. |
| `docs/npu-advisory-observability-runbook.md` | Dry-run comparison and later promotion criteria for advisory lanes. |
| `tests/test_npu_utilization_digest.py` | Offline unit tests for the digest (no live services required). |

## Integrated workflow

@@ -181,6 +182,8 @@ The integrated workflow intentionally does not:

These remain approval-gated and are tracked on the `npu-maximization` board.

For advisory-lane promotion decisions, pair this live utilization pass with the fixture-only dry-run comparison in `docs/npu-advisory-observability-runbook.md`. The digest can show whether live NPU services are healthy enough to collect evidence; it does not promote advisory outputs into authority. Promotion remains a separate lane-specific approval with explicit scope and rollback.

## Quick reference

```bash