Pi Embedded Canary Evaluation (Phase 2)

Status: in progress
Owner: Flynn maintainers
Started: 2026-02-24

Goal

Close the canary spike with a formal, repeatable evaluation and an explicit rollout decision (expand, hold, or rollback).

Scope

Target backend: pi_embedded
Target cohort (current): telegram:8367012007
Baseline backend: native
Data source: audit events (backend.route, backend.success, backend.fallback, session.message)

Pass/Fail Gate

Use the same thresholds for every evaluation window.

Metric	Gate
Minimum target routes	>= 8
Minimum baseline routes	>= 2
Minimum target external attempts	>= 8
Minimum guard coverage (probe window)	`pi_no_tools_mode >= 1`, `capability_query >= 1`, `attachments_present >= 1`
Completion rate delta (target - baseline)	>= -2.00pp
P50 latency delta (target - baseline)	<= +250ms
P95 latency delta (target - baseline)	<= +700ms
External fallback rate (`pi_embedded`)	<= 5.00%
Guardrail escapes	0 unresolved
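
The gate table can be read as a list of threshold checks. The following is a minimal Python sketch of that reading, not the audit tool's actual implementation: the `Gate` type, the metric key names, and `evaluate_gates` are hypothetical, and guard coverage and guardrail escapes are left out because they are not simple numeric comparisons. The sample values are the Window A figures from the Evaluation Log.

```python
import operator
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass(frozen=True)
class Gate:
    metric: str                               # hypothetical metric key
    compare: Callable[[float, float], bool]   # value-vs-threshold comparison
    threshold: float


# Thresholds copied from the gate table; key names are assumptions.
GATES = [
    Gate("target_routes", operator.ge, 8),
    Gate("baseline_routes", operator.ge, 2),
    Gate("target_attempts", operator.ge, 8),
    Gate("completion_delta_pp", operator.ge, -2.00),
    Gate("p50_delta_ms", operator.le, 250),
    Gate("p95_delta_ms", operator.le, 700),
    Gate("fallback_rate_pct", operator.le, 5.00),
]


def evaluate_gates(metrics: Dict[str, float]) -> Dict[str, Optional[bool]]:
    """Return True/False per gate, or None when the metric is unavailable."""
    results: Dict[str, Optional[bool]] = {}
    for gate in GATES:
        value = metrics.get(gate.metric)
        results[gate.metric] = (
            None if value is None else gate.compare(value, gate.threshold)
        )
    return results


# Window A values as recorded in the Evaluation Log.
window_a = {
    "target_routes": 8,
    "baseline_routes": 2,
    "target_attempts": 8,
    "completion_delta_pp": 0.00,
    "p50_delta_ms": 259,
    "p95_delta_ms": 5695,
    "fallback_rate_pct": 25.00,
}

results = evaluate_gates(window_a)
failed = [name for name, ok in results.items() if ok is False]
print(failed)  # ['p50_delta_ms', 'p95_delta_ms', 'fallback_rate_pct']
```

Returning None for a missing metric keeps "not evaluated" distinct from "failed", which matches how the log reports n/a for delta gates when there is no baseline.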

Notes:

Completion rate and latency are computed from route-to-assistant turn timings.
Fallback rate is computed from backend.success + backend.fallback attempt outcomes.
Guardrail escapes are reviewed using backend.route events where source == forced_native_guard, together with operator incident review.
Guard-coverage minimums are enforced for controlled probe windows, not passive traffic slices.
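
As an illustration of the notes above, here is a rough Python sketch of how the fallback rate and a latency percentile delta could be derived. It is written under assumptions: the function names are hypothetical, latencies are taken to be in milliseconds, and the nearest-rank percentile method is a choice made for this sketch (the document does not say which method the tool uses).

```python
import math
from typing import Sequence


def fallback_rate_pct(successes: int, fallbacks: int) -> float:
    """Fallbacks as a percentage of all attempt outcomes (success + fallback)."""
    attempts = successes + fallbacks
    if attempts == 0:
        raise ValueError("no attempts recorded")
    return 100.0 * fallbacks / attempts


def percentile_nearest_rank(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile; an assumed method, not necessarily the tool's."""
    if not values:
        raise ValueError("no samples")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(pct / 100.0 * len(ordered))  # 1-based rank
    return ordered[rank - 1]


def latency_delta_ms(
    target: Sequence[float], baseline: Sequence[float], pct: float
) -> float:
    """Target percentile minus baseline percentile, in milliseconds."""
    return percentile_nearest_rank(target, pct) - percentile_nearest_rank(
        baseline, pct
    )


# 2 fallbacks out of 8 attempts, i.e. 6 successes if every attempt ends in
# exactly one of the two outcomes.
print(fallback_rate_pct(successes=6, fallbacks=2))  # 25.0
```

A positive `latency_delta_ms` result means the target backend is slower than the baseline at that percentile, which is the direction the P50 and P95 gates cap.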

How To Run

Run a canary summary for the current cohort:

pnpm audit:backend-canary \
  --audit ~/.local/share/flynn/audit.log \
  --backend pi_embedded \
  --baseline native \
  --session telegram:8367012007 \
  --gate-min-target-routes 8 \
  --gate-min-baseline-routes 2 \
  --gate-min-target-attempts 8 \
  --format markdown

Run with gate evaluation and emit a JSON artifact:

pnpm audit:backend-canary \
  --audit ~/.local/share/flynn/audit.log \
  --backend pi_embedded \
  --baseline native \
  --session telegram:8367012007 \
  --format json \
  --out docs/plans/artifacts/pi_embedded_eval_latest.json \
  --gate-min-target-routes 8 \
  --gate-min-baseline-routes 2 \
  --gate-min-target-attempts 8 \
  --gate-max-completion-drop-pp 2 \
  --gate-max-p50-latency-increase-ms 250 \
  --gate-max-p95-latency-increase-ms 700 \
  --gate-max-fallback-rate-pct 5

Run controlled probe-window evaluation (guard coverage required):

pnpm audit:backend-canary \
  --audit ~/.local/share/flynn/audit.log \
  --backend pi_embedded \
  --baseline native \
  --session telegram:8367012007 \
  --gate-min-target-routes 8 \
  --gate-min-baseline-routes 2 \
  --gate-min-target-attempts 8 \
  --gate-min-guard-pi-no-tools-count 1 \
  --gate-min-guard-capability-query-count 1 \
  --gate-min-guard-attachments-present-count 1 \
  --gate-max-completion-drop-pp 2 \
  --gate-max-p50-latency-increase-ms 250 \
  --gate-max-p95-latency-increase-ms 700 \
  --gate-max-fallback-rate-pct 5 \
  --format markdown
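
Because the three invocations above share most of their flags, one way to keep thresholds identical across windows is to assemble the command from a single table. The Python sketch below is hypothetical tooling, not part of the repository: `build_canary_command` and the dictionary names are invented for illustration. It uses only the flags and values shown in the commands above, and it assumes nothing about the tool's exit code or output.

```python
import os
from typing import List, Optional

# Flags and values copied from the commands above.
COMMON_FLAGS = {
    "--backend": "pi_embedded",
    "--baseline": "native",
    "--session": "telegram:8367012007",
}
SAMPLE_GATES = {
    "--gate-min-target-routes": "8",
    "--gate-min-baseline-routes": "2",
    "--gate-min-target-attempts": "8",
}
GUARD_GATES = {
    "--gate-min-guard-pi-no-tools-count": "1",
    "--gate-min-guard-capability-query-count": "1",
    "--gate-min-guard-attachments-present-count": "1",
}
DELTA_GATES = {
    "--gate-max-completion-drop-pp": "2",
    "--gate-max-p50-latency-increase-ms": "250",
    "--gate-max-p95-latency-increase-ms": "700",
    "--gate-max-fallback-rate-pct": "5",
}


def build_canary_command(
    audit_path: str,
    *,
    with_delta_gates: bool = False,
    with_guard_gates: bool = False,
    fmt: str = "markdown",
    out: Optional[str] = None,
) -> List[str]:
    """Assemble the argv list for one evaluation run."""
    # "~" is normally expanded by the shell, so expand it explicitly here.
    argv = ["pnpm", "audit:backend-canary", "--audit", os.path.expanduser(audit_path)]
    flags = {**COMMON_FLAGS, **SAMPLE_GATES}
    if with_guard_gates:
        flags.update(GUARD_GATES)
    if with_delta_gates:
        flags.update(DELTA_GATES)
    flags["--format"] = fmt
    if out is not None:
        flags["--out"] = out
    for name, value in flags.items():
        argv += [name, value]
    return argv


# Equivalent of the controlled probe-window command.
probe_cmd = build_canary_command(
    "~/.local/share/flynn/audit.log",
    with_delta_gates=True,
    with_guard_gates=True,
)
print(" ".join(probe_cmd))
```

The resulting list can be passed to `subprocess.run(probe_cmd)` from the repository root; how a gate failure is reported by the tool is not covered here.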

Evaluation Log

Window A

Dates: February 24, 2026 (05:29:49Z to 06:26:20Z)
Route volume: 10 total routes (pi_embedded: 8, native: 2)
Summary artifacts:
- docs/plans/artifacts/pi_embedded_eval_window_a_2026-02-24.md
- docs/plans/artifacts/pi_embedded_eval_window_a_2026-02-24.json

Check	Result	Notes
Minimum target routes	8 (pass)	gate >= 8
Minimum baseline routes	2 (pass)	gate >= 2
Minimum target external attempts	8 (pass)	gate >= 8
Completion rate delta	0.00pp (pass)	target 100.00% vs baseline 100.00%
P50 latency delta	+259ms (fail)	gate <= +250ms
P95 latency delta	+5695ms (fail)	gate <= +700ms
Fallback rate	25.00% (fail)	2 fallbacks / 8 attempts; gate <= 5.00%
Guardrail escapes	none observed (provisional pass)	no `forced_native_guard` events in this window

Window B

Dates: February 24, 2026 (since 06:14:00Z; post-initial-fallback slice)
Route volume: 6 total routes (pi_embedded: 6, native: 0)
Summary artifacts:
- docs/plans/artifacts/pi_embedded_eval_window_b_2026-02-24_post_fallbacks.md
- docs/plans/artifacts/pi_embedded_eval_window_b_2026-02-24_post_fallbacks.json

Check	Result	Notes
Minimum target routes	6 (fail)	gate >= 8
Minimum baseline routes	0 (fail)	gate >= 2
Minimum target external attempts	6 (fail)	gate >= 8
Completion rate delta	n/a (insufficient baseline)	no native-routed turns in this slice
P50 latency delta	n/a (insufficient baseline)	no native-routed turns in this slice
P95 latency delta	n/a (insufficient baseline)	no native-routed turns in this slice
Fallback rate	0.00% (pass)	0 fallbacks / 6 attempts
Guardrail escapes	none observed (provisional pass)	no `forced_native_guard` events in this window
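
Window B illustrates why delta gates are reported as n/a rather than pass or fail: with zero baseline routes there is no baseline rate to subtract. The Python sketch below shows that behavior under stated assumptions. The function names are hypothetical, and whether the tool also reports n/a for a non-zero baseline below the minimum is not stated in this document.

```python
from typing import Optional


def completion_rate_pct(completed: int, routed: int) -> Optional[float]:
    """Completion rate in percent, or None when there are no routed turns."""
    if routed == 0:
        return None
    return 100.0 * completed / routed


def completion_delta_pp(
    target_completed: int,
    target_routed: int,
    baseline_completed: int,
    baseline_routed: int,
) -> Optional[float]:
    """Target minus baseline completion rate, in percentage points."""
    target = completion_rate_pct(target_completed, target_routed)
    baseline = completion_rate_pct(baseline_completed, baseline_routed)
    if target is None or baseline is None:
        return None  # cannot compute a delta without both sides
    return target - baseline


def format_delta(delta: Optional[float]) -> str:
    """Render a delta the way the result tables do."""
    if delta is None:
        return "n/a (insufficient baseline)"
    return f"{delta:.2f}pp"


# Window A: 8 target and 2 baseline routes, both at 100% completion.
print(format_delta(completion_delta_pp(8, 8, 2, 2)))  # 0.00pp

# Window B: no native routes. The target completion count here is a
# placeholder; the result does not depend on it.
print(format_delta(completion_delta_pp(6, 6, 0, 0)))  # n/a (insufficient baseline)
```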

Window C (Guard Coverage Pre-Probe Baseline)

Dates: February 24, 2026 (same full Window A slice; guard-coverage gates enabled)
Route volume: 10 total routes (pi_embedded: 8, native: 2)
Summary artifacts:
- docs/plans/artifacts/pi_embedded_eval_window_c_2026-02-24_guard_preprobe.md
- docs/plans/artifacts/pi_embedded_eval_window_c_2026-02-24_guard_preprobe.json

Check	Result	Notes
Minimum target routes	8 (pass)	gate >= 8
Minimum baseline routes	2 (pass)	gate >= 2
Minimum target external attempts	8 (pass)	gate >= 8
Minimum `pi_no_tools_mode` guard hits	0 (fail)	gate >= 1
Minimum `capability_query` guard hits	0 (fail)	gate >= 1
Minimum `attachments_present` guard hits	0 (fail)	gate >= 1
Completion rate delta	0.00pp (pass)	target 100.00% vs baseline 100.00%
P50 latency delta	+259ms (fail)	gate <= +250ms
P95 latency delta	+5695ms (fail)	gate <= +700ms
Fallback rate	25.00% (fail)	2 fallbacks / 8 attempts; gate <= 5.00%

Tool Compatibility Findings

Track all tool-adjacent/risky prompts that were force-routed to native (no_tools_mode) and any misses.

Class	Observed behavior	Action
Tool-adjacent prompts	Not observed in Window A/B (`forced_native_guard` count 0).	Run controlled probe turns that should trigger `pi_no_tools_mode`.
Capability-query prompts	Not observed in Window A/B (`guard_reason=capability_query` count 0).	Run explicit capability-query probe prompts and confirm forced-native routing.
Attachments-present turns	Not observed in Window A/B (`guard_reason=attachments_present` count 0).	Run attachment-bearing probe turns and confirm forced-native routing.
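
The guard counts in the table above could be tallied from route events along the following lines. This Python sketch rests on assumptions about the audit event shape: the `event` key is hypothetical, and `source` and `guard_reason` are taken to be fields on backend.route events because this document refers to them that way. The sample events are synthetic, not taken from the audit log.

```python
from collections import Counter
from typing import Iterable, List, Mapping

# Minimums from the probe-window gate.
REQUIRED_GUARDS = {
    "pi_no_tools_mode": 1,
    "capability_query": 1,
    "attachments_present": 1,
}


def count_guard_hits(events: Iterable[Mapping[str, str]]) -> Counter:
    """Count forced-native route events per guard reason."""
    hits: Counter = Counter()
    for event in events:
        if (
            event.get("event") == "backend.route"          # assumed key name
            and event.get("source") == "forced_native_guard"
        ):
            hits[event.get("guard_reason", "unknown")] += 1
    return hits


def missing_guard_coverage(hits: Mapping[str, int]) -> List[str]:
    """Guard reasons whose hit count is below the required minimum."""
    return sorted(
        reason
        for reason, minimum in REQUIRED_GUARDS.items()
        if hits.get(reason, 0) < minimum
    )


# Synthetic events for illustration only.
sample_events = [
    {"event": "backend.route", "source": "forced_native_guard",
     "guard_reason": "capability_query"},
    {"event": "backend.route", "source": "cohort"},
    {"event": "backend.success"},
]

hits = count_guard_hits(sample_events)
print(missing_guard_coverage(hits))  # ['attachments_present', 'pi_no_tools_mode']
```

With no forced-native events at all, as in Windows A and B, every required guard reason would be reported as missing.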

Decision Record

Decision date: February 24, 2026
Decision: hold (no cohort expansion yet)
Rationale: Window A fails 3/4 numeric gates (p50 delta, p95 delta, fallback rate) with only 10 total routed turns, including two concrete fallback failure modes:
- pi_module_interface
- empty_assistant_text
Window B shows fallback recovery (0%) in a post-fallback slice but fails minimum sample thresholds and has no native baseline routes for delta-gate evaluation, and Window C pre-probe baseline confirms missing guard-coverage evidence (pi_no_tools_mode, capability_query, attachments_present all at 0).
Next cohort/config delta: none until a baseline-balanced window meets minimum sample thresholds and guardrail coverage probes are completed.

Diagram/Protocol Impact Review

Reviewed: docs/architecture/AGENT_DIAGRAM.md, docs/architecture/GATEWAY_SESSIONS_AND_QUEUE.md, docs/api/PROTOCOL.md
Result: no runtime message-flow or protocol-shape changes; no Mermaid topology update required for this evaluation-tooling phase.
