docs(eval): record pi canary window A results and hold decision

William Valentin
2026-02-23 22:29:08 -08:00
parent 4f88e047fd
commit 9156adb2a8
4 changed files with 200 additions and 18 deletions
+19 -15
@@ -65,17 +65,19 @@ pnpm audit:backend-canary \
### Window A
-- Dates: _TBD_
-- Route volume: _TBD_
-- Summary artifact: _TBD_
+- Dates: February 24, 2026 (05:29:49Z to 06:26:20Z)
+- Route volume: 10 total routes (`pi_embedded`: 8, `native`: 2)
+- Summary artifacts:
+  - `docs/plans/artifacts/pi_embedded_eval_window_a_2026-02-24.md`
+  - `docs/plans/artifacts/pi_embedded_eval_window_a_2026-02-24.json`
| Check | Result | Notes |
| --- | --- | --- |
-| Completion rate delta | _TBD_ | |
-| P50 latency delta | _TBD_ | |
-| P95 latency delta | _TBD_ | |
-| Fallback rate | _TBD_ | |
-| Guardrail escapes | _TBD_ | |
+| Completion rate delta | 0.00pp (pass) | target 100.00% vs baseline 100.00% |
+| P50 latency delta | +259ms (fail) | gate <= +250ms |
+| P95 latency delta | +5695ms (fail) | gate <= +700ms |
+| Fallback rate | 25.00% (fail) | 2 fallbacks / 8 attempts; gate <= 5.00% |
+| Guardrail escapes | none observed (provisional pass) | no `forced_native_guard` events in this window |
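
The pass/fail results in the Window A table can be re-derived from the recorded numbers. The following is a minimal sketch, not the project's audit tooling: the `Gate` class and the gate names are hypothetical, and only the observed values and thresholds are taken from the table. The completion-rate check is left out because its numeric threshold is not stated here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gate:
    """Hypothetical helper: a gate passes when observed <= limit."""
    name: str
    observed: float
    limit: float

    @property
    def passed(self) -> bool:
        return self.observed <= self.limit


# Fallback rate is derived from the recorded counts (2 fallbacks / 8 attempts).
fallbacks, attempts = 2, 8
fallback_rate_pct = 100.0 * fallbacks / attempts

# Observed values and limits copied from the Window A table.
gates = [
    Gate("p50_latency_delta_ms", 259, 250),
    Gate("p95_latency_delta_ms", 5695, 700),
    Gate("fallback_rate_pct", fallback_rate_pct, 5.0),
]

failed = [g.name for g in gates if not g.passed]
print(f"fallback rate: {fallback_rate_pct:.2f}%")
print(f"failed gates: {failed}")
```

Under these assumptions all three thresholded gates fail, which matches the table: the P50 delta misses its gate by 9ms, while the P95 delta and fallback rate miss by a wide margin.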
### Window B
@@ -97,16 +99,18 @@ Track all tool-adjacent/risky prompts that were force-routed to native (`no_tool
| Class | Observed behavior | Action |
| --- | --- | --- |
-| Tool-adjacent prompts | _TBD_ | |
-| Capability-query prompts | _TBD_ | |
-| Attachments-present turns | _TBD_ | |
+| Tool-adjacent prompts | Not observed in Window A (`forced_native_guard` count 0). | Collect dedicated tool-adjacent prompts in Window B to validate `no_tools_mode` behavior. |
+| Capability-query prompts | Not observed in Window A (`guard_reason=capability_query` count 0). | Add explicit capability-query probes in Window B. |
+| Attachments-present turns | Not observed in Window A (`guard_reason=attachments_present` count 0). | Add attachment turns in Window B. |
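
The per-class counts in this table could be produced by tallying guard events by reason. The sketch below assumes a hypothetical event shape (a dict with an `event` key and an optional `guard_reason` key); only the names `forced_native_guard`, `capability_query`, and `attachments_present` come from this document, and the real log schema may differ.

```python
from collections import Counter


def tally_guard_events(events):
    """Count forced-native guard events, grouped by guard_reason.

    `events` is an iterable of dicts in an assumed (hypothetical) shape:
    {"event": "...", "guard_reason": "..."}.
    """
    forced = [e for e in events if e.get("event") == "forced_native_guard"]
    by_reason = Counter(e.get("guard_reason", "unknown") for e in forced)
    return len(forced), by_reason


# Hypothetical sample resembling Window A: routed turns, no guard events.
window_a_like = [{"event": "route"} for _ in range(10)]
total, by_reason = tally_guard_events(window_a_like)
print(total, by_reason["capability_query"], by_reason["attachments_present"])

# Hypothetical sample with the probes planned for Window B.
window_b_like = [
    {"event": "forced_native_guard", "guard_reason": "capability_query"},
    {"event": "forced_native_guard", "guard_reason": "attachments_present"},
    {"event": "route"},
]
total_b, by_reason_b = tally_guard_events(window_b_like)
print(total_b, dict(by_reason_b))
```

A `Counter` returns 0 for reasons that never occurred, so a window with no guard events reports zero for every class rather than raising an error.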
## Decision Record
-- Decision date: _TBD_
-- Decision: _expand | hold | rollback_
-- Rationale: _TBD_
-- Next cohort/config delta: _TBD_
+- Decision date: February 24, 2026
+- Decision: `hold` (no cohort expansion yet)
+- Rationale: Window A fails 3 of 4 numeric gates (P50 latency delta, P95 latency delta, fallback rate) on only 10 total routed turns. The fallbacks exposed two concrete failure modes:
+  - module session factory mismatch
+  - no assistant text returned from Pi runtime
+- Next cohort/config delta: none until Window B passes the gates and the fallback causes are remediated.
## Diagram/Protocol Impact Review