docs(eval): close pi canary phase with rollback decision and probe evidence

William Valentin
2026-02-23 22:44:30 -08:00
parent 959216ac5c
commit 8b1f2d9a67
6 changed files with 231 additions and 14 deletions
+44 -9
@@ -1,6 +1,6 @@
# Pi Embedded Canary Evaluation (Phase 2)
Status: in progress
Status: completed
Owner: Flynn maintainers
Started: 2026-02-24
@@ -93,6 +93,26 @@ pnpm audit:backend-canary \
--format markdown
```
Generate a local router guard-probe log (synthetic inbound turns through `createMessageRouter`):
```bash
pnpm audit:backend-canary:probes
```
Then evaluate guard coverage against the probe log:
```bash
pnpm audit:backend-canary \
--audit docs/plans/artifacts/pi_embedded_eval_window_c_guard_probes.jsonl \
--backend pi_embedded \
--baseline native \
--session telegram:8367012007 \
--gate-min-guard-pi-no-tools-count 1 \
--gate-min-guard-capability-query-count 1 \
--gate-min-guard-attachments-present-count 1 \
--format markdown
```
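For orientation, the minimum-count guard gates can be pictured as a simple tally over the probe log. The following is a minimal sketch, not the audit script's implementation: the record shape (a JSONL line with an optional `guard_reason` field), the function names, and the sample log are all assumptions made for illustration. Only the three guard reason values come from this document.

```typescript
// Hypothetical record shape: only the guard reason values appear in this document;
// the JSONL schema itself is assumed.
type ProbeRecord = { guard_reason?: string };

// Tally guard hits per reason from JSONL text, skipping blank lines.
function countGuardHits(jsonl: string): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const line of jsonl.split("\n")) {
    const trimmed = line.trim();
    if (trimmed === "") continue;
    const record = JSON.parse(trimmed) as ProbeRecord;
    const reason = record.guard_reason;
    if (reason) counts[reason] = (counts[reason] ?? 0) + 1;
  }
  return counts;
}

// Return the guard reasons whose hit count is below the configured minimum.
function failedGuardGates(
  counts: Record<string, number>,
  minimums: Record<string, number>,
): string[] {
  return Object.keys(minimums).filter(
    (reason) => (counts[reason] ?? 0) < (minimums[reason] ?? 0),
  );
}

// Illustrative sample log (synthetic, not the real artifact).
const sampleLog = [
  '{"guard_reason":"pi_no_tools_mode"}',
  '{"guard_reason":"capability_query"}',
  '{"guard_reason":"attachments_present"}',
  "{}",
].join("\n");

// Minimums mirroring the three --gate-min-guard-*-count 1 flags above.
const guardMinimums: Record<string, number> = {
  pi_no_tools_mode: 1,
  capability_query: 1,
  attachments_present: 1,
};

const guardFailures = failedGuardGates(countGuardHits(sampleLog), guardMinimums);
console.log(
  guardFailures.length === 0
    ? "guard gates pass"
    : `guard gates fail: ${guardFailures.join(", ")}`,
);
```

A reason that never appears in the log counts as zero hits, so a log with no guard hits fails all three gates.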
## Evaluation Log
### Window A
@@ -154,26 +174,41 @@ pnpm audit:backend-canary \
| P95 latency delta | +5695ms (fail) | gate <= +700ms |
| Fallback rate | 25.00% (fail) | 2 fallbacks / 8 attempts; gate <= 5.00% |
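The arithmetic behind the two gate rows above can be sketched as follows. This is an illustrative sketch only: the thresholds (`<= +700ms`, `<= 5.00%`) and the inputs (`+5695ms`, 2 fallbacks / 8 attempts) are taken from the table, while the function and type names are hypothetical and do not describe how the audit script is written.

```typescript
// Thresholds quoted in the table above; all names here are hypothetical.
const MAX_P95_DELTA_MS = 700;
const MAX_FALLBACK_RATE = 0.05;

type GateResult = { check: string; value: string; pass: boolean };

// Fallback rate is fallbacks divided by attempts.
function fallbackRate(fallbacks: number, attempts: number): number {
  if (attempts <= 0) throw new Error("attempts must be positive");
  return fallbacks / attempts;
}

// Evaluate the P95 latency delta and fallback-rate gates for one window.
function evaluateWindow(
  p95DeltaMs: number,
  fallbacks: number,
  attempts: number,
): GateResult[] {
  const rate = fallbackRate(fallbacks, attempts);
  const sign = p95DeltaMs >= 0 ? "+" : "";
  return [
    {
      check: "P95 latency delta",
      value: `${sign}${p95DeltaMs}ms`,
      pass: p95DeltaMs <= MAX_P95_DELTA_MS,
    },
    {
      check: "Fallback rate",
      value: `${(rate * 100).toFixed(2)}%`,
      pass: rate <= MAX_FALLBACK_RATE,
    },
  ];
}

for (const gate of evaluateWindow(5695, 2, 8)) {
  console.log(`${gate.check}: ${gate.value} (${gate.pass ? "pass" : "fail"})`);
}
// prints:
// P95 latency delta: +5695ms (fail)
// Fallback rate: 25.00% (fail)
```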
### Window D (Guard Coverage Controlled Probes)
- Dates: February 24, 2026 (local synthetic probe run)
- Route volume: 4 total routes (`pi_embedded`: 1, `native`: 3)
- Summary artifacts:
- `docs/plans/artifacts/pi_embedded_eval_window_c_guard_probes.jsonl`
- `docs/plans/artifacts/pi_embedded_eval_window_c_2026-02-24_guard_postprobe.md`
- `docs/plans/artifacts/pi_embedded_eval_window_c_2026-02-24_guard_postprobe.json`
| Check | Result | Notes |
| --- | --- | --- |
| Minimum `pi_no_tools_mode` guard hits | 1 (pass) | gate >= 1 |
| Minimum `capability_query` guard hits | 1 (pass) | gate >= 1 |
| Minimum `attachments_present` guard hits | 1 (pass) | gate >= 1 |
## Tool Compatibility Findings
Track all tool-adjacent/risky prompts that were force-routed to native (`no_tools_mode`) and any misses.
| Class | Observed behavior | Action |
| --- | --- | --- |
| Tool-adjacent prompts | Not observed in Window A/B (`forced_native_guard` count 0). | Run controlled probe turns that should trigger `pi_no_tools_mode`. |
| Capability-query prompts | Not observed in Window A/B (`guard_reason=capability_query` count 0). | Run explicit capability-query probe prompts and confirm forced-native routing. |
| Attachments-present turns | Not observed in Window A/B (`guard_reason=attachments_present` count 0). | Run attachment-bearing probe turns and confirm forced-native routing. |
| Tool-adjacent prompts | Verified in controlled probes: `pi_no_tools_mode` guard hit = 1. | Keep this in regression probes for future backend iterations. |
| Capability-query prompts | Verified in controlled probes: `capability_query` guard hit = 1. | Keep this in regression probes for future backend iterations. |
| Attachments-present turns | Verified in controlled probes: `attachments_present` guard hit = 1. | Keep this in regression probes for future backend iterations. |
## Decision Record
- Decision date: February 24, 2026
- Decision: `hold` (no cohort expansion yet)
- Rationale: Window A fails 3/4 numeric gates (p50 delta, p95 delta, fallback rate) with only 10 total routed turns, including two concrete fallback failure modes:
and Window C pre-probe baseline confirms missing guard-coverage evidence (`pi_no_tools_mode`, `capability_query`, `attachments_present` all at 0).
- Decision: `rollback` (end canary routing for now)
- Rationale: Window A fails core numeric gates (p50 delta, p95 delta, fallback rate) with two concrete fallback failure modes:
- `pi_module_interface`
- `empty_assistant_text`
Window B shows fallback recovery (0%) in a post-fallback slice but fails minimum sample thresholds and has no native baseline routes for delta-gate evaluation.
- Next cohort/config delta: none until a baseline-balanced window meets minimum sample thresholds and guardrail coverage probes are completed.
Window B shows fallback recovery (0%) but fails minimum sample/baseline gates, so it does not overturn Window A.
Window D confirms guardrail compatibility behavior in controlled probes, but guard compatibility alone is insufficient to justify expansion.
- Next cohort/config delta: route canary users back to the native path (remove or disable `pi_embedded` canary routing) until the latency and fallback defects are remediated and a fresh canary is re-approved.
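
One way to picture the rollback is a routing decision with a canary switch that, when off, sends every session to native. This is a minimal sketch under stated assumptions: `CanaryConfig`, `selectBackend`, the `enabled` flag, and the session allowlist are hypothetical names, not the router's actual configuration surface. Only the backend names `native` and `pi_embedded` come from this document.

```typescript
type Backend = "native" | "pi_embedded";

// Hypothetical config shape, assumed for illustration.
interface CanaryConfig {
  enabled: boolean; // assumed kill switch for canary routing
  sessions: string[]; // assumed allowlist of canary session ids
}

// With the canary disabled, every session takes the native path.
function selectBackend(sessionId: string, config: CanaryConfig): Backend {
  if (!config.enabled) return "native";
  return config.sessions.indexOf(sessionId) >= 0 ? "pi_embedded" : "native";
}

// Rolled-back state: the cohort list may remain, but routing ignores it.
const rolledBack: CanaryConfig = { enabled: false, sessions: ["telegram:example"] };
console.log(selectBackend("telegram:example", rolledBack)); // prints "native"
```

Keeping the cohort list while flipping a single switch would make a later re-approved canary a one-line change, though removing the routing entirely is equally consistent with the decision above.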
## Diagram/Protocol Impact Review