feat(audit): add reaction-rate drift gate thresholds
Add optional reaction match-drop and skip-increase drift thresholds, expose CLI flags, and enable conservative defaults in cadence package scripts. Includes tests and docs/state sync.
@@ -1644,6 +1644,7 @@ Backend drift/freshness gate for backend-scoped artifacts (`pi_embedded` vs `nat
 ```bash
 pnpm audit:phase0-baseline:live:drift
 ```
+Optional drift thresholds now include reaction decision deltas (`--max-reaction-match-rate-drop-pp`, `--max-reaction-skip-rate-increase-pp`) in addition to run/cancel/error/cancel-latency thresholds.
 This command writes drift reports to:
 - `docs/plans/artifacts/phase0_baseline_live_backend_drift_<UTC-date>.md`
 - `docs/plans/artifacts/phase0_baseline_live_backend_drift_<UTC-date>.json`
@@ -23,7 +23,7 @@ The gateway provides:
 - **HTTP Server**: Serves static dashboard and handles webhook endpoints
 - **Node Capability Negotiation**: Optional companion-node role/capability registration
 
-Operational note: onboarding (`flynn setup` / `flynn onboard`) now runs post-save live readiness checks (model/channel/memory/automation) and prints a guided first-success task flow. Companion CLI now also supports bootstrap-manifest export (`flynn companion --export-bootstrap <path|->`), release-bundle export (`--export-release-bundle <dir>` with optional `--signing-key`/`--signing-key-id` signature output), release-bundle verification (`--verify-release-bundle <dir>` with optional `--verify-signing-key`/`--verify-signing-key-id`/`--require-signature`), platform shell-template export (`--export-shell-template <dir>`), plus richer shell bootstrap flags for status/location/push (`--app-version`, `--latitude/--longitude`, `--push-token`, etc.) for desktop/mobile app packaging without changing JSON-RPC method/event shapes. Audit observability now includes live phase-0 baseline capture flows: `pnpm audit:phase0-baseline:live` for channel-origin windows, backend-scoped variants (`pnpm audit:phase0-baseline:live:pi` / `pnpm audit:phase0-baseline:live:native`) via `--backend`, `pnpm audit:phase0-baseline:live:gateway` (auto-detected cancel window) for gateway-origin windows, `pnpm audit:phase0-baseline:live:refresh` for one-shot refresh of all live windows (channel + gateway + backend-scoped), `pnpm audit:phase0-baseline:live:drift` for backend artifact freshness/drift gates (writing `phase0_baseline_live_backend_drift_<tag>.md/.json` reports), `pnpm audit:phase0-baseline:live:refresh:drift:rolling` for cadence runs that stamp each capture with a unique UTC timestamp tag (`YYYY-MM-DD-HHMMSS`) so drift comparisons can immediately use a prior snapshot (or externally supplied `TAG`), `pnpm audit:phase0-baseline:live:prune` / `pnpm audit:phase0-baseline:live:prune:apply` for rolling-tag artifact retention management (writing `phase0_baseline_live_prune_<tag>.md/.json` reports and retaining those prune reports as part of managed rolling families), and `pnpm audit:phase0-baseline:live:refresh:drift:rolling:prune` for one-command cadence refresh+drift+retention apply reusing the same rolling tag (non-negative integer `KEEP_PER_FAMILY` override supported for retention depth). These scripts default to current UTC-date tags unless `--tag` is explicitly provided.
+Operational note: onboarding (`flynn setup` / `flynn onboard`) now runs post-save live readiness checks (model/channel/memory/automation) and prints a guided first-success task flow. Companion CLI now also supports bootstrap-manifest export (`flynn companion --export-bootstrap <path|->`), release-bundle export (`--export-release-bundle <dir>` with optional `--signing-key`/`--signing-key-id` signature output), release-bundle verification (`--verify-release-bundle <dir>` with optional `--verify-signing-key`/`--verify-signing-key-id`/`--require-signature`), platform shell-template export (`--export-shell-template <dir>`), plus richer shell bootstrap flags for status/location/push (`--app-version`, `--latitude/--longitude`, `--push-token`, etc.) for desktop/mobile app packaging without changing JSON-RPC method/event shapes. Audit observability now includes live phase-0 baseline capture flows: `pnpm audit:phase0-baseline:live` for channel-origin windows, backend-scoped variants (`pnpm audit:phase0-baseline:live:pi` / `pnpm audit:phase0-baseline:live:native`) via `--backend`, `pnpm audit:phase0-baseline:live:gateway` (auto-detected cancel window) for gateway-origin windows, `pnpm audit:phase0-baseline:live:refresh` for one-shot refresh of all live windows (channel + gateway + backend-scoped), `pnpm audit:phase0-baseline:live:drift` for backend artifact freshness/drift gates (including optional reaction-rate thresholds, writing `phase0_baseline_live_backend_drift_<tag>.md/.json` reports), `pnpm audit:phase0-baseline:live:refresh:drift:rolling` for cadence runs that stamp each capture with a unique UTC timestamp tag (`YYYY-MM-DD-HHMMSS`) so drift comparisons can immediately use a prior snapshot (or externally supplied `TAG`), `pnpm audit:phase0-baseline:live:prune` / `pnpm audit:phase0-baseline:live:prune:apply` for rolling-tag artifact retention management (writing `phase0_baseline_live_prune_<tag>.md/.json` reports and retaining those prune reports as part of managed rolling families), and `pnpm audit:phase0-baseline:live:refresh:drift:rolling:prune` for one-command cadence refresh+drift+retention apply reusing the same rolling tag (non-negative integer `KEEP_PER_FAMILY` override supported for retention depth). These scripts default to current UTC-date tags unless `--tag` is explicitly provided.
 
 ### Execution Model (Sessions + Per-Session Queue)
 
@@ -170,7 +170,7 @@ Gateway streaming UX signals:
 - `pnpm audit:phase0-baseline:live:pi` and `pnpm audit:phase0-baseline:live:native` capture backend-scoped channel windows using `backend.route` timelines.
 - `pnpm audit:phase0-baseline:live:gateway` captures gateway-origin baseline windows by auto-selecting the latest cancel/cancelled session window (or use `scripts/capture-phase0-live-baseline.ts --source gateway --since ... --until ...` for explicit windows).
 - `pnpm audit:phase0-baseline:live:refresh` runs channel + gateway + backend-scoped (`pi_embedded` and `native`) capture commands in one cadence step.
-- `pnpm audit:phase0-baseline:live:drift` evaluates backend-scoped artifact freshness/drift gates and writes `docs/plans/artifacts/phase0_baseline_live_backend_drift_<UTC-date>.md/.json`; `pnpm audit:phase0-baseline:live:refresh:drift` runs capture + drift checks in one cadence step.
+- `pnpm audit:phase0-baseline:live:drift` evaluates backend-scoped artifact freshness/drift gates (including optional reaction match/skip rate thresholds) and writes `docs/plans/artifacts/phase0_baseline_live_backend_drift_<UTC-date>.md/.json`; `pnpm audit:phase0-baseline:live:refresh:drift` runs capture + drift checks in one cadence step.
 - `pnpm audit:phase0-baseline:live:refresh:drift:rolling` runs the same full refresh+drift flow with a shared UTC timestamp tag (`YYYY-MM-DD-HHMMSS`) so each cadence run keeps distinct backend/drift artifacts for immediate baseline-vs-prior comparisons.
 - `pnpm audit:phase0-baseline:live:prune` provides dry-run retention planning for rolling-tag artifacts; `pnpm audit:phase0-baseline:live:prune:apply` deletes older rolling snapshots while keeping the newest tags per artifact family.
 - `pnpm audit:phase0-baseline:live:refresh:drift:rolling:prune` now reuses the rolling refresh+drift pipeline via shared `TAG` env wiring, then applies retention (non-negative integer `KEEP_PER_FAMILY`) and writes prune reports tagged to that same rolling run (`phase0_baseline_live_prune_<tag>.md/.json`).
@@ -35,7 +35,7 @@ If you only want the protocol surface, see `docs/api/PROTOCOL.md`.
 - Backend-scoped channel snapshots can be regenerated with `pnpm audit:phase0-baseline:live:pi` / `pnpm audit:phase0-baseline:live:native` (`--backend` filtering via `backend.route` timelines).
 - Gateway-origin phase-0 windows (including cancel-path samples) can be captured with `pnpm audit:phase0-baseline:live:gateway` (auto-detect latest cancel window) or `scripts/capture-phase0-live-baseline.ts --source gateway --since ... --until ...` for explicit bounds.
 - `pnpm audit:phase0-baseline:live:refresh` runs channel + gateway + backend-scoped (`pi_embedded` and `native`) capture paths in one command.
-- `pnpm audit:phase0-baseline:live:drift` checks backend-scoped artifact freshness/drift gates and writes `phase0_baseline_live_backend_drift_<UTC-date>.md/.json`; `pnpm audit:phase0-baseline:live:refresh:drift` chains refresh + drift checks for scheduled cadence runs.
+- `pnpm audit:phase0-baseline:live:drift` checks backend-scoped artifact freshness/drift gates (including optional reaction match/skip rate thresholds) and writes `phase0_baseline_live_backend_drift_<UTC-date>.md/.json`; `pnpm audit:phase0-baseline:live:refresh:drift` chains refresh + drift checks for scheduled cadence runs.
 - `pnpm audit:phase0-baseline:live:refresh:drift:rolling` performs the same chain using one UTC timestamp tag (`YYYY-MM-DD-HHMMSS`) across channel/gateway/backend/drift outputs so each cadence run preserves a distinct comparison point.
 - `pnpm audit:phase0-baseline:live:prune` (dry-run) and `pnpm audit:phase0-baseline:live:prune:apply` (delete) manage retention of rolling-tag artifacts to control artifact growth while preserving newest snapshots per family.
 - `pnpm audit:phase0-baseline:live:refresh:drift:rolling:prune` combines rolling refresh+drift with retention apply for one-command cron scheduling using a shared `TAG`; adjust retention depth with non-negative integer `KEEP_PER_FAMILY` and use generated `phase0_baseline_live_prune_<tag>.md/.json` artifacts for retention audit traceability.
@@ -203,7 +203,7 @@ Phase 0 is complete when:
 2. A baseline summary artifact is generated and committed under `docs/plans/artifacts/`.
 3. No user-visible response behavior changed compared to pre-phase baseline.
 
-Follow-up status (2026-02-27): live channel-session artifacts exist under `docs/plans/artifacts/phase0_baseline_live_2026-02-27.*` via `pnpm audit:phase0-baseline:live` (anonymized IDs), and a second gateway-origin live window (including `run.cancel` + `cancel_requested`/`cancelled`) exists under `docs/plans/artifacts/phase0_baseline_live_gateway_2026-02-27.*`. Gateway window refreshes can now run via `pnpm audit:phase0-baseline:live:gateway` (auto-selected cancel window), all live windows can be refreshed together with `pnpm audit:phase0-baseline:live:refresh` (channel + gateway + backend-scoped `pi`/`native`; scheduling example included in README), backend artifact freshness/drift checks are now available via `pnpm audit:phase0-baseline:live:drift` (or chained with `pnpm audit:phase0-baseline:live:refresh:drift`) with drift report artifacts written to `docs/plans/artifacts/phase0_baseline_live_backend_drift_<tag>.{md,json}`, cadence runs can preserve distinct timestamped comparison points via `pnpm audit:phase0-baseline:live:refresh:drift:rolling` (supports shared `TAG` override), rolling-tag retention can be managed via `pnpm audit:phase0-baseline:live:prune` (dry-run) / `pnpm audit:phase0-baseline:live:prune:apply` with prune report artifacts written to `phase0_baseline_live_prune_<tag>.{md,json}` (and retained as a managed rolling family), and one-command cadence scheduling is available via `pnpm audit:phase0-baseline:live:refresh:drift:rolling:prune` (non-negative integer `KEEP_PER_FAMILY` optional override).
+Follow-up status (2026-02-27): live channel-session artifacts exist under `docs/plans/artifacts/phase0_baseline_live_2026-02-27.*` via `pnpm audit:phase0-baseline:live` (anonymized IDs), and a second gateway-origin live window (including `run.cancel` + `cancel_requested`/`cancelled`) exists under `docs/plans/artifacts/phase0_baseline_live_gateway_2026-02-27.*`. Gateway window refreshes can now run via `pnpm audit:phase0-baseline:live:gateway` (auto-selected cancel window), all live windows can be refreshed together with `pnpm audit:phase0-baseline:live:refresh` (channel + gateway + backend-scoped `pi`/`native`; scheduling example included in README), backend artifact freshness/drift checks are now available via `pnpm audit:phase0-baseline:live:drift` (or chained with `pnpm audit:phase0-baseline:live:refresh:drift`) with drift report artifacts written to `docs/plans/artifacts/phase0_baseline_live_backend_drift_<tag>.{md,json}` and optional reaction match/skip drift thresholds, cadence runs can preserve distinct timestamped comparison points via `pnpm audit:phase0-baseline:live:refresh:drift:rolling` (supports shared `TAG` override), rolling-tag retention can be managed via `pnpm audit:phase0-baseline:live:prune` (dry-run) / `pnpm audit:phase0-baseline:live:prune:apply` with prune report artifacts written to `phase0_baseline_live_prune_<tag>.{md,json}` (and retained as a managed rolling family), and one-command cadence scheduling is available via `pnpm audit:phase0-baseline:live:refresh:drift:rolling:prune` (non-negative integer `KEEP_PER_FAMILY` optional override).
 
 ## Subagent Model Assignment Plan
 
@@ -533,6 +533,25 @@
       ],
       "test_status": "pnpm test:run src/audit/phase0GatewayWindow.test.ts + pnpm typecheck passing"
     },
+    "phase0-live-baseline-drift-reaction-threshold-gates": {
+      "status": "completed",
+      "date": "2026-02-27",
+      "updated": "2026-02-27",
+      "summary": "Extended backend drift gating with optional reaction decision thresholds (`maxReactionMatchRateDropPp`, `maxReactionSkipRateIncreasePp`), wired new CLI flags, and enabled conservative defaults in package drift cadence commands.",
+      "files_modified": [
+        "src/audit/phase0BaselineDrift.ts",
+        "src/audit/phase0BaselineDrift.test.ts",
+        "scripts/check-phase0-baseline-backend-drift.ts",
+        "package.json",
+        "README.md",
+        "docs/api/PROTOCOL.md",
+        "docs/architecture/AGENT_DIAGRAM.md",
+        "docs/architecture/GATEWAY_SESSIONS_AND_QUEUE.md",
+        "docs/plans/2026-02-25-phase0-instrumentation-ticket-checklist.md",
+        "docs/plans/state.json"
+      ],
+      "test_status": "pnpm test:run src/audit/phase0BaselineDrift.test.ts + pnpm typecheck passing"
+    },
     "phase0-instrumentation-ticket-checklist": {
       "status": "completed",
       "date": "2026-02-25",
@@ -27,9 +27,9 @@
     "audit:phase0-baseline:live:native": "node --import tsx/esm scripts/capture-phase0-live-baseline.ts --audit ~/.local/share/flynn/audit.log --source channel --backend native --exclude-session-substring probe",
     "audit:phase0-baseline:live:gateway": "node --import tsx/esm scripts/capture-phase0-live-baseline.ts --audit ~/.local/share/flynn/audit.log --source gateway --auto-gateway-cancel-window",
     "audit:phase0-baseline:live:refresh": "pnpm audit:phase0-baseline:live && pnpm audit:phase0-baseline:live:gateway && pnpm audit:phase0-baseline:live:pi && pnpm audit:phase0-baseline:live:native",
-    "audit:phase0-baseline:live:drift": "node --import tsx/esm scripts/check-phase0-baseline-backend-drift.ts --artifacts-dir docs/plans/artifacts --backend pi_embedded,native --max-age-hours 36 --min-candidate-sampled-events 10 --max-sampled-events-drop-pct 80 --max-run-outcomes-drop-pct 80 --max-completion-rate-drop-pp 35 --max-cancel-rate-increase-pp 25 --max-error-rate-increase-pp 25 --max-cancel-latency-p95-increase-ms 6000 --write-default-artifacts",
+    "audit:phase0-baseline:live:drift": "node --import tsx/esm scripts/check-phase0-baseline-backend-drift.ts --artifacts-dir docs/plans/artifacts --backend pi_embedded,native --max-age-hours 36 --min-candidate-sampled-events 10 --max-sampled-events-drop-pct 80 --max-run-outcomes-drop-pct 80 --max-completion-rate-drop-pp 35 --max-cancel-rate-increase-pp 25 --max-error-rate-increase-pp 25 --max-cancel-latency-p95-increase-ms 6000 --max-reaction-match-rate-drop-pp 50 --max-reaction-skip-rate-increase-pp 50 --write-default-artifacts",
     "audit:phase0-baseline:live:refresh:drift": "pnpm audit:phase0-baseline:live:refresh && pnpm audit:phase0-baseline:live:drift",
-    "audit:phase0-baseline:live:refresh:drift:rolling": "TAG=${TAG:-$(date -u +%F-%H%M%S)} && node --import tsx/esm scripts/capture-phase0-live-baseline.ts --audit ~/.local/share/flynn/audit.log --source channel --exclude-session-substring probe --tag \"$TAG\" && node --import tsx/esm scripts/capture-phase0-live-baseline.ts --audit ~/.local/share/flynn/audit.log --source gateway --auto-gateway-cancel-window --tag \"$TAG\" && node --import tsx/esm scripts/capture-phase0-live-baseline.ts --audit ~/.local/share/flynn/audit.log --source channel --backend pi_embedded --exclude-session-substring probe --tag \"$TAG\" && node --import tsx/esm scripts/capture-phase0-live-baseline.ts --audit ~/.local/share/flynn/audit.log --source channel --backend native --exclude-session-substring probe --tag \"$TAG\" && node --import tsx/esm scripts/check-phase0-baseline-backend-drift.ts --artifacts-dir docs/plans/artifacts --backend pi_embedded,native --max-age-hours 36 --min-candidate-sampled-events 10 --max-sampled-events-drop-pct 80 --max-run-outcomes-drop-pct 80 --max-completion-rate-drop-pp 35 --max-cancel-rate-increase-pp 25 --max-error-rate-increase-pp 25 --max-cancel-latency-p95-increase-ms 6000 --write-default-artifacts --tag \"$TAG\" --report-tag \"$TAG\"",
+    "audit:phase0-baseline:live:refresh:drift:rolling": "TAG=${TAG:-$(date -u +%F-%H%M%S)} && node --import tsx/esm scripts/capture-phase0-live-baseline.ts --audit ~/.local/share/flynn/audit.log --source channel --exclude-session-substring probe --tag \"$TAG\" && node --import tsx/esm scripts/capture-phase0-live-baseline.ts --audit ~/.local/share/flynn/audit.log --source gateway --auto-gateway-cancel-window --tag \"$TAG\" && node --import tsx/esm scripts/capture-phase0-live-baseline.ts --audit ~/.local/share/flynn/audit.log --source channel --backend pi_embedded --exclude-session-substring probe --tag \"$TAG\" && node --import tsx/esm scripts/capture-phase0-live-baseline.ts --audit ~/.local/share/flynn/audit.log --source channel --backend native --exclude-session-substring probe --tag \"$TAG\" && node --import tsx/esm scripts/check-phase0-baseline-backend-drift.ts --artifacts-dir docs/plans/artifacts --backend pi_embedded,native --max-age-hours 36 --min-candidate-sampled-events 10 --max-sampled-events-drop-pct 80 --max-run-outcomes-drop-pct 80 --max-completion-rate-drop-pp 35 --max-cancel-rate-increase-pp 25 --max-error-rate-increase-pp 25 --max-cancel-latency-p95-increase-ms 6000 --max-reaction-match-rate-drop-pp 50 --max-reaction-skip-rate-increase-pp 50 --write-default-artifacts --tag \"$TAG\" --report-tag \"$TAG\"",
     "audit:phase0-baseline:live:prune": "KEEP_PER_FAMILY=${KEEP_PER_FAMILY:-8} && node --import tsx/esm scripts/prune-phase0-baseline-artifacts.ts --artifacts-dir docs/plans/artifacts --keep-per-family \"$KEEP_PER_FAMILY\" --write-default-artifacts",
     "audit:phase0-baseline:live:prune:apply": "KEEP_PER_FAMILY=${KEEP_PER_FAMILY:-8} && node --import tsx/esm scripts/prune-phase0-baseline-artifacts.ts --artifacts-dir docs/plans/artifacts --keep-per-family \"$KEEP_PER_FAMILY\" --apply --write-default-artifacts",
     "audit:phase0-baseline:live:refresh:drift:rolling:prune": "TAG=${TAG:-$(date -u +%F-%H%M%S)} && TAG=\"$TAG\" pnpm audit:phase0-baseline:live:refresh:drift:rolling && KEEP_PER_FAMILY=${KEEP_PER_FAMILY:-8} node --import tsx/esm scripts/prune-phase0-baseline-artifacts.ts --artifacts-dir docs/plans/artifacts --keep-per-family \"$KEEP_PER_FAMILY\" --apply --write-default-artifacts --report-tag \"$TAG\"",
@@ -77,6 +77,8 @@ function usage(): string {
     ' --max-cancel-rate-increase-pp <number>',
     ' --max-error-rate-increase-pp <number>',
     ' --max-cancel-latency-p95-increase-ms <number>',
+    ' --max-reaction-match-rate-drop-pp <number>',
+    ' --max-reaction-skip-rate-increase-pp <number>',
   ].join('\n');
 }
 
@@ -184,6 +186,8 @@ function buildThresholds(values: Record<string, string | boolean | undefined>):
     maxCancelRateIncreasePp: parseOptionalNumber(values['max-cancel-rate-increase-pp'] as string | undefined, '--max-cancel-rate-increase-pp'),
     maxErrorRateIncreasePp: parseOptionalNumber(values['max-error-rate-increase-pp'] as string | undefined, '--max-error-rate-increase-pp'),
     maxCancelLatencyP95IncreaseMs: parseOptionalNumber(values['max-cancel-latency-p95-increase-ms'] as string | undefined, '--max-cancel-latency-p95-increase-ms'),
+    maxReactionMatchRateDropPp: parseOptionalNumber(values['max-reaction-match-rate-drop-pp'] as string | undefined, '--max-reaction-match-rate-drop-pp'),
+    maxReactionSkipRateIncreasePp: parseOptionalNumber(values['max-reaction-skip-rate-increase-pp'] as string | undefined, '--max-reaction-skip-rate-increase-pp'),
   };
 }
 
@@ -345,6 +349,8 @@ async function main(): Promise<void> {
       'max-cancel-rate-increase-pp': { type: 'string' },
       'max-error-rate-increase-pp': { type: 'string' },
       'max-cancel-latency-p95-increase-ms': { type: 'string' },
+      'max-reaction-match-rate-drop-pp': { type: 'string' },
+      'max-reaction-skip-rate-increase-pp': { type: 'string' },
       format: { type: 'string' },
       out: { type: 'string' },
       help: { type: 'boolean', short: 'h' },
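The hunks above pass every new flag through `parseOptionalNumber`, whose body is not part of this diff. As a rough illustration only, the helper below (`parseOptionalNumberSketch`, a hypothetical name) shows one way such a parser could behave, assuming an absent flag maps to `undefined` and a non-numeric value is rejected. Range checks such as "must be non-negative" are left out here because the tests in this commit show the gate itself rejecting negative thresholds.

```typescript
// Hypothetical sketch; the real parseOptionalNumber is not shown in this diff.
function parseOptionalNumberSketch(raw: string | undefined, flag: string): number | undefined {
  // An absent flag means the threshold is not configured.
  if (raw === undefined) return undefined;
  const trimmed = raw.trim();
  const value = Number(trimmed);
  // Number('') is 0, so the empty string is rejected explicitly.
  if (trimmed === '' || !Number.isFinite(value)) {
    throw new Error(`${flag} expects a finite number, got "${raw}"`);
  }
  return value;
}

// Same value shape as the `values` parameter of buildThresholds (all flags arrive as strings).
const values: Record<string, string | boolean | undefined> = {
  'max-reaction-match-rate-drop-pp': '50',
};

const thresholds = {
  maxReactionMatchRateDropPp: parseOptionalNumberSketch(
    values['max-reaction-match-rate-drop-pp'] as string | undefined,
    '--max-reaction-match-rate-drop-pp',
  ),
  maxReactionSkipRateIncreasePp: parseOptionalNumberSketch(
    values['max-reaction-skip-rate-increase-pp'] as string | undefined,
    '--max-reaction-skip-rate-increase-pp',
  ),
};
```

With only the match-drop flag supplied, the skip-increase threshold stays `undefined`, which keeps that criterion optional.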
@@ -381,6 +381,12 @@ describe('phase0BaselineDrift', () => {
     expect(() => evaluatePhase0BaselineDriftGate(comparison, {
       minCandidateSampledEvents: -5,
     })).toThrow('minCandidateSampledEvents');
+    expect(() => evaluatePhase0BaselineDriftGate(comparison, {
+      maxReactionMatchRateDropPp: -1,
+    })).toThrow('maxReactionMatchRateDropPp');
+    expect(() => evaluatePhase0BaselineDriftGate(comparison, {
+      maxReactionSkipRateIncreasePp: -1,
+    })).toThrow('maxReactionSkipRateIncreasePp');
   });
 
   it('rejects non-integer sampled-event minimum thresholds', () => {
@@ -427,4 +433,95 @@ describe('phase0BaselineDrift', () => {
       minBaselineSampledEvents: 8.2,
     })).toThrow('minBaselineSampledEvents');
   });
+
+  it('evaluates optional reaction rate drift thresholds', () => {
+    const comparison = comparePhase0BaselineDrift(
+      {
+        sampled_event_count: 50,
+        summary: {
+          event_counts: {
+            run_state: 0,
+            run_cancel: 0,
+            reaction_match: 0,
+            reaction_skip: 0,
+          },
+          run_outcomes: {
+            overall: {
+              total_outcomes: 20,
+              complete: 18,
+              cancelled: 1,
+              error: 1,
+              cancel_requested: 0,
+              start: 20,
+              completion_rate_pct: 90,
+              cancel_rate_pct: 5,
+              error_rate_pct: 5,
+            },
+            by_channel: [],
+            by_session: [],
+          },
+          cancel_latency_ms: null,
+          reactions: {
+            matched: 4,
+            skipped: 6,
+            total: 10,
+            match_rate_pct: 40,
+            skip_rate_pct: 60,
+            skip_reasons: [],
+          },
+        },
+      },
+      {
+        sampled_event_count: 50,
+        summary: {
+          event_counts: {
+            run_state: 0,
+            run_cancel: 0,
+            reaction_match: 0,
+            reaction_skip: 0,
+          },
+          run_outcomes: {
+            overall: {
+              total_outcomes: 20,
+              complete: 18,
+              cancelled: 1,
+              error: 1,
+              cancel_requested: 0,
+              start: 20,
+              completion_rate_pct: 90,
+              cancel_rate_pct: 5,
+              error_rate_pct: 5,
+            },
+            by_channel: [],
+            by_session: [],
+          },
+          cancel_latency_ms: null,
+          reactions: {
+            matched: 7,
+            skipped: 3,
+            total: 10,
+            match_rate_pct: 70,
+            skip_rate_pct: 30,
+            skip_reasons: [],
+          },
+        },
+      },
+    );
+
+    const pass = evaluatePhase0BaselineDriftGate(comparison, {
+      maxReactionMatchRateDropPp: 35,
+      maxReactionSkipRateIncreasePp: 35,
+    });
+    expect(pass.pass).toBe(true);
+
+    const fail = evaluatePhase0BaselineDriftGate(comparison, {
+      maxReactionMatchRateDropPp: 20,
+      maxReactionSkipRateIncreasePp: 20,
+    });
+    expect(fail.pass).toBe(false);
+    expect(fail.criteria.filter((row) => !row.pass).map((row) => row.criterion)).toEqual([
+      'reaction_match_rate_drop_pp',
+      'reaction_skip_rate_increase_pp',
+    ]);
+  });
 });
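The fixture in the new test can be checked by hand. The sketch below recomputes the match and skip rates from the raw counts and derives the percentage-point deltas. It assumes the first argument to `comparePhase0BaselineDrift` is the candidate snapshot and the second is the baseline, and that a delta is `null` when either side has no reactions; neither assumption is confirmed by this diff, and the helper names are hypothetical.

```typescript
// Hypothetical helpers for illustration; not the module's real API.
interface ReactionCountsSketch {
  matched: number;
  skipped: number;
  total: number;
}

// Rate in percent, or null when there is nothing to divide by.
function ratePct(part: number, total: number): number | null {
  return total > 0 ? (part * 100) / total : null;
}

// Percentage-point delta, assumed here to be candidate minus baseline.
function deltaPp(candidatePct: number | null, baselinePct: number | null): number | null {
  if (candidatePct === null || baselinePct === null) return null;
  return candidatePct - baselinePct;
}

// Counts copied from the test fixture above.
const candidate: ReactionCountsSketch = { matched: 4, skipped: 6, total: 10 };
const baseline: ReactionCountsSketch = { matched: 7, skipped: 3, total: 10 };

const matchDeltaPp = deltaPp(
  ratePct(candidate.matched, candidate.total),
  ratePct(baseline.matched, baseline.total),
);
const skipDeltaPp = deltaPp(
  ratePct(candidate.skipped, candidate.total),
  ratePct(baseline.skipped, baseline.total),
);

console.log({ matchDeltaPp, skipDeltaPp });
```

Under these assumptions the match rate falls by 30 percentage points and the skip rate rises by 30, which is consistent with the test passing at a threshold of 35 and failing at 20.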
@@ -46,6 +46,8 @@ export interface Phase0BaselineDriftGateThresholds {
   maxCancelRateIncreasePp?: number;
   maxErrorRateIncreasePp?: number;
   maxCancelLatencyP95IncreaseMs?: number;
+  maxReactionMatchRateDropPp?: number;
+  maxReactionSkipRateIncreasePp?: number;
 }
 
 export interface Phase0BaselineDriftGateCriterion {
@@ -332,6 +334,48 @@ export function evaluatePhase0BaselineDriftGate(
     }
   }
 
+  const maxReactionMatchRateDropPp = readThreshold(thresholds.maxReactionMatchRateDropPp, 'maxReactionMatchRateDropPp');
+  if (typeof maxReactionMatchRateDropPp === 'number') {
+    const delta = comparison.deltas.reaction_match_rate_pp;
+    if (delta === null) {
+      criteria.push({
+        criterion: 'reaction_match_rate_drop_pp',
+        pass: !requireBaselineHistory,
+        actual: 'n/a',
+        threshold: `<= ${maxReactionMatchRateDropPp}`,
+      });
+    } else {
+      const drop = Math.max(0, -delta);
+      criteria.push({
+        criterion: 'reaction_match_rate_drop_pp',
+        pass: drop <= maxReactionMatchRateDropPp,
+        actual: `${Math.round(drop * 100) / 100}`,
+        threshold: `<= ${maxReactionMatchRateDropPp}`,
+      });
+    }
+  }
+
+  const maxReactionSkipRateIncreasePp = readThreshold(thresholds.maxReactionSkipRateIncreasePp, 'maxReactionSkipRateIncreasePp');
+  if (typeof maxReactionSkipRateIncreasePp === 'number') {
+    const delta = comparison.deltas.reaction_skip_rate_pp;
+    if (delta === null) {
+      criteria.push({
+        criterion: 'reaction_skip_rate_increase_pp',
+        pass: !requireBaselineHistory,
+        actual: 'n/a',
+        threshold: `<= ${maxReactionSkipRateIncreasePp}`,
+      });
+    } else {
+      const increase = Math.max(0, delta);
+      criteria.push({
+        criterion: 'reaction_skip_rate_increase_pp',
+        pass: increase <= maxReactionSkipRateIncreasePp,
+        actual: `${Math.round(increase * 100) / 100}`,
+        threshold: `<= ${maxReactionSkipRateIncreasePp}`,
+      });
+    }
+  }
+
   return {
     pass: criteria.every((row) => row.pass),
     criteria,
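The two added blocks differ only in the sign they treat as adverse: a match-rate drop is a negative delta, a skip-rate increase is a positive one. The standalone sketch below folds both into a single hypothetical helper (`evaluateRateCriterion`) to make that rule explicit. The `CriterionSketch` shape and the `requireBaselineHistory` parameter are simplified stand-ins for names used in the diff, not the module's actual exports.

```typescript
// Simplified stand-in for Phase0BaselineDriftGateCriterion (assumed shape).
interface CriterionSketch {
  criterion: string;
  pass: boolean;
  actual: string;
  threshold: string;
}

type AdverseDirection = 'drop' | 'increase';

// Hypothetical helper mirroring the per-criterion logic in the hunk above.
function evaluateRateCriterion(
  criterion: string,
  deltaPp: number | null,
  maxAllowedPp: number,
  direction: AdverseDirection,
  requireBaselineHistory = false,
): CriterionSketch {
  const threshold = `<= ${maxAllowedPp}`;
  // No comparable delta: pass unless baseline history is required.
  if (deltaPp === null) {
    return { criterion, pass: !requireBaselineHistory, actual: 'n/a', threshold };
  }
  // Only movement in the adverse direction counts; improvements clamp to 0.
  const adverse = direction === 'drop' ? Math.max(0, -deltaPp) : Math.max(0, deltaPp);
  return {
    criterion,
    pass: adverse <= maxAllowedPp,
    actual: `${Math.round(adverse * 100) / 100}`,
    threshold,
  };
}

// A 30pp match-rate drop and a 30pp skip-rate increase against a 20pp limit.
const matchRow = evaluateRateCriterion('reaction_match_rate_drop_pp', -30, 20, 'drop');
const skipRow = evaluateRateCriterion('reaction_skip_rate_increase_pp', 30, 20, 'increase');
console.log(matchRow.pass, skipRow.pass);
```

Clamping with `Math.max(0, ...)` means a rising match rate or a falling skip rate is reported as `0` and always passes, so the gate reacts only to regressions.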