docs(eval): close pi canary phase with rollback decision and probe evidence
@@ -1,6 +1,6 @@
# Pi Embedded Canary Evaluation (Phase 2)

Status: in progress
Status: completed
Owner: Flynn maintainers
Started: 2026-02-24

@@ -93,6 +93,26 @@ pnpm audit:backend-canary \
  --format markdown
```

Generate a local router guard-probe log (synthetic inbound turns through `createMessageRouter`):

```bash
pnpm audit:backend-canary:probes
```

Then evaluate guard coverage against the probe log:

```bash
pnpm audit:backend-canary \
  --audit docs/plans/artifacts/pi_embedded_eval_window_c_guard_probes.jsonl \
  --backend pi_embedded \
  --baseline native \
  --session telegram:8367012007 \
  --gate-min-guard-pi-no-tools-count 1 \
  --gate-min-guard-capability-query-count 1 \
  --gate-min-guard-attachments-present-count 1 \
  --format markdown
```
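
As a rough illustration of what the three `--gate-min-guard-*-count` flags check, the sketch below counts guard reasons in a JSONL probe log and compares each count with a required minimum. The record shape (an optional `guard_reason` field) and the names `countGuardReasons` and `failedMinCountGates` are assumptions made for this example; they are not taken from the `audit:backend-canary` implementation, whose log schema may differ.

```typescript
// Hypothetical record shape: one JSON object per line with an optional
// `guard_reason` field. The real audit log schema may differ.
interface ProbeRecord {
  guard_reason?: string;
}

// Hypothetical gate description: a guard reason and its minimum hit count.
interface MinCountGate {
  reason: string;
  min: number;
}

// Count how often each guard reason appears in a JSONL string.
function countGuardReasons(jsonl: string): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const line of jsonl.split("\n")) {
    const trimmed = line.trim();
    if (trimmed === "") continue; // skip blank lines
    const record = JSON.parse(trimmed) as ProbeRecord;
    const reason = record.guard_reason;
    if (typeof reason === "string") {
      counts[reason] = (counts[reason] ?? 0) + 1;
    }
  }
  return counts;
}

// Return the guard reasons whose observed count is below the minimum.
function failedMinCountGates(
  counts: Record<string, number>,
  gates: MinCountGate[],
): string[] {
  return gates
    .filter((gate) => (counts[gate.reason] ?? 0) < gate.min)
    .map((gate) => gate.reason);
}

// Made-up probe records for illustration only (not real log data).
const sampleLog = [
  '{"guard_reason":"pi_no_tools_mode"}',
  '{"guard_reason":"capability_query"}',
  '{"guard_reason":"attachments_present"}',
  "{}",
].join("\n");

// Empty when every guard reason meets its minimum count.
const failures = failedMinCountGates(countGuardReasons(sampleLog), [
  { reason: "pi_no_tools_mode", min: 1 },
  { reason: "capability_query", min: 1 },
  { reason: "attachments_present", min: 1 },
]);
```

With a minimum of 1 per guard, a single probe turn of each kind is enough to pass, so a pass here shows that the guard fired at least once, not how often it fires on real traffic.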

## Evaluation Log

### Window A
@@ -154,26 +174,41 @@ pnpm audit:backend-canary \
| P95 latency delta | +5695ms (fail) | gate <= +700ms |
| Fallback rate | 25.00% (fail) | 2 fallbacks / 8 attempts; gate <= 5.00% |
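
Both rows above reduce to a simple "observed value must not exceed the limit" check. A minimal sketch using the numbers from the table; the function names are hypothetical and not part of the audit tooling:

```typescript
// Fallback rate as a percentage of attempts.
function fallbackRatePct(fallbacks: number, attempts: number): number {
  if (attempts <= 0) throw new Error("attempts must be positive");
  return (fallbacks / attempts) * 100;
}

// A maximum gate passes when the observed value does not exceed the limit.
function passesMaxGate(observed: number, limit: number): boolean {
  return observed <= limit;
}

// 2 fallbacks out of 8 attempts, checked against the 5.00% gate.
const rate = fallbackRatePct(2, 8);
const fallbackOk = passesMaxGate(rate, 5);

// P95 latency delta of +5695ms, checked against the +700ms gate.
const p95Ok = passesMaxGate(5695, 700);
```

Here `rate` is 25, and both `fallbackOk` and `p95Ok` are `false`, matching the two `fail` results in the table.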

### Window D (Guard Coverage Controlled Probes)

- Dates: February 24, 2026 (local synthetic probe run)
- Route volume: 4 total routes (`pi_embedded`: 1, `native`: 3)
- Summary artifacts:
  - `docs/plans/artifacts/pi_embedded_eval_window_c_guard_probes.jsonl`
  - `docs/plans/artifacts/pi_embedded_eval_window_c_2026-02-24_guard_postprobe.md`
  - `docs/plans/artifacts/pi_embedded_eval_window_c_2026-02-24_guard_postprobe.json`

| Check | Result | Notes |
| --- | --- | --- |
| Minimum `pi_no_tools_mode` guard hits | 1 (pass) | gate >= 1 |
| Minimum `capability_query` guard hits | 1 (pass) | gate >= 1 |
| Minimum `attachments_present` guard hits | 1 (pass) | gate >= 1 |

## Tool Compatibility Findings

Track all tool-adjacent/risky prompts that were force-routed to native (`no_tools_mode`) and any misses.

| Class | Observed behavior | Action |
| --- | --- | --- |
| Tool-adjacent prompts | Not observed in Window A/B (`forced_native_guard` count 0). | Run controlled probe turns that should trigger `pi_no_tools_mode`. |
| Capability-query prompts | Not observed in Window A/B (`guard_reason=capability_query` count 0). | Run explicit capability-query probe prompts and confirm forced-native routing. |
| Attachments-present turns | Not observed in Window A/B (`guard_reason=attachments_present` count 0). | Run attachment-bearing probe turns and confirm forced-native routing. |
| Tool-adjacent prompts | Verified in controlled probes: `pi_no_tools_mode` guard hit = 1. | Keep this in regression probes for future backend iterations. |
| Capability-query prompts | Verified in controlled probes: `capability_query` guard hit = 1. | Keep this in regression probes for future backend iterations. |
| Attachments-present turns | Verified in controlled probes: `attachments_present` guard hit = 1. | Keep this in regression probes for future backend iterations. |

## Decision Record

- Decision date: February 24, 2026
- Decision: `hold` (no cohort expansion yet)
- Rationale: Window A fails 3/4 numeric gates (p50 delta, p95 delta, fallback rate) with only 10 total routed turns, including two concrete fallback failure modes:
and Window C pre-probe baseline confirms missing guard-coverage evidence (`pi_no_tools_mode`, `capability_query`, `attachments_present` all at 0).
- Decision: `rollback` (end canary routing for now)
- Rationale: Window A fails core numeric gates (p50 delta, p95 delta, fallback rate) with two concrete fallback failure modes:
- `pi_module_interface`
- `empty_assistant_text`
Window B shows fallback recovery (0%) in a post-fallback slice but fails minimum sample thresholds and has no native baseline routes for delta-gate evaluation.
- Next cohort/config delta: none until a baseline-balanced window meets minimum sample thresholds and guardrail coverage probes are completed.
Window B shows fallback recovery (0%) but fails minimum sample/baseline gates, so it does not overturn Window A.
Window D confirms guardrail compatibility behavior in controlled probes, but guard compatibility alone is insufficient to justify expansion.
- Next cohort/config delta: route canary users back to native path (remove/disable `pi_embedded` canary routing) until latency/fallback defects are remediated and a fresh canary is re-approved.

## Diagram/Protocol Impact Review