docs(eval): record pi canary window A results and hold decision
@@ -0,0 +1,50 @@
# Pi Embedded Canary Summary

- Target backend: `pi_embedded`
- Baseline backend: `native`
- Routes analyzed: 10

## Route Distribution

| Backend | Routes |
| --- | ---: |
| pi_embedded | 8 |
| native | 2 |

## Reliability

| Metric | Target | Baseline | Delta |
| --- | ---: | ---: | ---: |
| Turn completion rate | 100.00% | 100.00% | 0.00pp |
| External success rate | 75.00% | n/a | n/a |
| External attempts | 8 | n/a | n/a |
| External fallbacks | 2 | n/a | n/a |

## Latency

- Target end-to-end: count=8, avg=4615ms, p50=3240ms, p95=8776ms, min=1859ms, max=9381ms
- Baseline end-to-end: count=2, avg=2981ms, p50=2981ms, p95=3081ms, min=2870ms, max=3092ms
- P50 delta (target - baseline): 259ms
- P95 delta (target - baseline): 5695ms
- Target external attempt: count=8, avg=3961ms, p50=2636ms, p95=8766ms, min=135ms, max=9371ms
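
The baseline figures above are consistent with a linear-interpolation percentile over the two baseline samples. With count=2, those samples must be the reported min and max. The sketch below reproduces the baseline avg, p50, and p95 and the two deltas under that assumption. The `percentile` helper is hypothetical and is not taken from the tool that generated this report, and the raw target samples are not part of the report, so the target percentiles are used as given.

```python
from statistics import mean


def percentile(samples, q):
    """Linear-interpolation percentile for q in [0, 1] (assumed method)."""
    ordered = sorted(samples)
    if not ordered:
        raise ValueError("no samples")
    rank = q * (len(ordered) - 1)
    lo = int(rank)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (rank - lo) * (ordered[hi] - ordered[lo])


# With only two baseline samples, they equal the reported min and max.
baseline_ms = [2870, 3092]

baseline_avg = round(mean(baseline_ms))
baseline_p50 = round(percentile(baseline_ms, 0.50))
baseline_p95 = round(percentile(baseline_ms, 0.95))

# Target percentiles copied from the report; raw samples are unavailable.
target_p50, target_p95 = 3240, 8776

p50_delta = target_p50 - baseline_p50
p95_delta = target_p95 - baseline_p95

print(baseline_avg, baseline_p50, baseline_p95, p50_delta, p95_delta)
```

With only two baseline routes, the baseline p95 is nearly the baseline max, so the P95 delta compares eight target samples against what is effectively a single baseline observation.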

## Fallback Taxonomy

| Category | Count | Percent |
| --- | ---: | ---: |
| loaded pi module does not expose a supported session factory (expected one of: c | 1 | 50.00% |
| pi agent runtime produced no assistant text | 1 | 50.00% |

## Top Fallback Reasons

- Loaded Pi module does not expose a supported session factory (expected one of: createAgentSession, createSession, createPiSession, createAge (1)
- Pi Agent runtime produced no assistant text (1)

## Gate Evaluation

- Gate result: HOLD
- [x] Completion rate delta (target - baseline): actual=0.00pp, threshold=>= -2.00pp
- [ ] P50 latency delta (target - baseline): actual=259ms, threshold=<= 250ms
- [ ] P95 latency delta (target - baseline): actual=5695ms, threshold=<= 700ms
- [ ] Fallback rate (target external attempts): actual=25.00%, threshold=<= 5.00%
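
The checklist above can be read as four threshold comparisons that must all pass. The sketch below is a minimal illustration of that reading, not the evaluator that produced this report. The `Gate` class is hypothetical, and the `PASS` label is an assumption, since the report only shows the `HOLD` outcome. The fallback rate is derived from the Reliability table (2 fallbacks over 8 external attempts).

```python
import operator
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Gate:
    """One threshold check; hypothetical structure for illustration."""
    name: str
    actual: float
    threshold: float
    compare: Callable[[float, float], bool]

    def ok(self) -> bool:
        return self.compare(self.actual, self.threshold)


# Fallback rate as a percentage of target external attempts.
external_attempts = 8
external_fallbacks = 2
fallback_rate_pct = 100.0 * external_fallbacks / external_attempts

gates = [
    Gate("Completion rate delta (pp)", 0.00, -2.00, operator.ge),
    Gate("P50 latency delta (ms)", 259, 250, operator.le),
    Gate("P95 latency delta (ms)", 5695, 700, operator.le),
    Gate("Fallback rate (%)", fallback_rate_pct, 5.00, operator.le),
]

failed = [g.name for g in gates if not g.ok()]
# "PASS" is an assumed label for the all-green case.
gate_result = "PASS" if not failed else "HOLD"

print(gate_result, failed)
```

Under this reading, only the completion-rate gate passes. The P50 gate misses by 9ms, while the P95 and fallback-rate gates miss by a wide margin.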