docs(eval): add window B telemetry slice and maintain hold decision
@@ -81,17 +81,19 @@ pnpm audit:backend-canary \
 
 ### Window B
 
-- Dates: _TBD_
-- Route volume: _TBD_
-- Summary artifact: _TBD_
+- Dates: February 24, 2026 (since 06:14:00Z; post-initial-fallback slice)
+- Route volume: 6 total routes (`pi_embedded`: 6, `native`: 0)
+- Summary artifacts:
+  - `docs/plans/artifacts/pi_embedded_eval_window_b_2026-02-24_post_fallbacks.md`
+  - `docs/plans/artifacts/pi_embedded_eval_window_b_2026-02-24_post_fallbacks.json`
 
 | Check | Result | Notes |
 | --- | --- | --- |
-| Completion rate delta | _TBD_ | |
-| P50 latency delta | _TBD_ | |
-| P95 latency delta | _TBD_ | |
-| Fallback rate | _TBD_ | |
-| Guardrail escapes | _TBD_ | |
+| Completion rate delta | n/a (insufficient baseline) | no native-routed turns in this slice |
+| P50 latency delta | n/a (insufficient baseline) | no native-routed turns in this slice |
+| P95 latency delta | n/a (insufficient baseline) | no native-routed turns in this slice |
+| Fallback rate | 0.00% (pass) | 0 fallbacks / 6 attempts |
+| Guardrail escapes | none observed (provisional pass) | no `forced_native_guard` events in this window |
 
 ## Tool Compatibility Findings
 
@@ -110,7 +112,8 @@ Track all tool-adjacent/risky prompts that were force-routed to native (`no_tool
 - Rationale: Window A fails 3/4 numeric gates (p50 delta, p95 delta, fallback rate) with only 10 total routed turns, including two concrete fallback failure modes:
   - module session factory mismatch
   - no assistant text returned from Pi runtime
-- Next cohort/config delta: none until Window B confirms gate pass and fallback causes are remediated.
+Window B shows fallback recovery (0%) in a post-fallback slice but cannot evaluate delta gates because it contains no baseline native routes.
+- Next cohort/config delta: none until an additional baseline-balanced window confirms delta gates and guardrail coverage probes are completed.
 
 ## Diagram/Protocol Impact Review
 
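The Window B results in the table can be reproduced from the route counts alone. The sketch below is illustrative, not the project's audit script: the `WindowCounts` shape, the function names, and the `maxFallbackRate` threshold are assumptions introduced here, since the diff does not state the actual gate value. It shows why the fallback gate is evaluable with 6 `pi_embedded` attempts while the delta gates are reported as n/a when there are no native-routed turns to compare against.

```typescript
// Hypothetical shape for one evaluation window; field names are illustrative.
interface WindowCounts {
  piEmbeddedRoutes: number; // turns routed to pi_embedded
  nativeRoutes: number;     // turns routed to native (the baseline)
  fallbacks: number;        // pi_embedded attempts that fell back
}

interface GateResult {
  status: "pass" | "fail" | "n/a";
  detail: string;
}

// Fallback rate needs only pi_embedded attempts, so it can be evaluated
// without a native baseline. maxFallbackRate is an assumed parameter.
function fallbackGate(w: WindowCounts, maxFallbackRate: number): GateResult {
  if (w.piEmbeddedRoutes === 0) {
    return { status: "n/a", detail: "no pi_embedded attempts" };
  }
  const rate = w.fallbacks / w.piEmbeddedRoutes;
  return {
    status: rate <= maxFallbackRate ? "pass" : "fail",
    detail: `${(rate * 100).toFixed(2)}% (${w.fallbacks} fallbacks / ${w.piEmbeddedRoutes} attempts)`,
  };
}

// Delta gates (completion rate, p50, p95) compare pi_embedded against native,
// so they are not evaluable when either side has no routed turns.
function deltaGatesEvaluable(w: WindowCounts): GateResult {
  if (w.nativeRoutes === 0) {
    return { status: "n/a", detail: "insufficient baseline: no native-routed turns" };
  }
  if (w.piEmbeddedRoutes === 0) {
    return { status: "n/a", detail: "no pi_embedded turns to compare" };
  }
  return { status: "pass", detail: "both cohorts present; deltas can be computed" };
}

// Window B counts as recorded in the diff above.
const windowB: WindowCounts = { piEmbeddedRoutes: 6, nativeRoutes: 0, fallbacks: 0 };

console.log(fallbackGate(windowB, 0.05).detail); // 0.00% (0 fallbacks / 6 attempts)
console.log(deltaGatesEvaluable(windowB).detail); // insufficient baseline: no native-routed turns
```

Keeping "n/a" as a distinct status, rather than folding it into pass or fail, matches the hold decision in the diff: a window with no baseline neither confirms nor refutes the delta gates.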