flynn/docs/plans/pi_embedded_evaluation.md
2026-02-23 22:48:32 -08:00

Pi Embedded Canary Evaluation (Phase 2)

Status: completed
Owner: Flynn maintainers
Started: 2026-02-24

Goal

Close the canary spike with a formal, repeatable evaluation and an explicit rollout decision (expand, hold, or rollback).

Scope

  • Target backend: pi_embedded
  • Target cohort (current): telegram:8367012007
  • Baseline backend: native
  • Data source: audit events (backend.route, backend.success, backend.fallback, session.message)

Pass/Fail Gate

Use the same thresholds for every evaluation window.

| Metric | Gate |
| --- | --- |
| Minimum target routes | >= 8 |
| Minimum baseline routes | >= 2 |
| Minimum target external attempts | >= 8 |
| Minimum guard coverage (probe window) | pi_no_tools_mode >= 1, capability_query >= 1, attachments_present >= 1 |
| Completion rate delta (target - baseline) | >= -2.00pp |
| P50 latency delta (target - baseline) | <= +250ms |
| P95 latency delta (target - baseline) | <= +700ms |
| External fallback rate (pi_embedded) | <= 5.00% |
| Guardrail escapes | 0 unresolved |

Notes:

  • Completion rate and latency are computed from route-to-assistant turn timings.
  • Fallback rate is computed from backend.success + backend.fallback attempt outcomes.
  • Guardrail escapes are reviewed using backend.route events with source == forced_native_guard, combined with operator incident review.
  • Guard-coverage minimums are enforced for controlled probe windows, not passive traffic slices.
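
The notes above describe how each metric is derived. The following is a minimal sketch of those computations, not the actual audit:backend-canary implementation. It assumes that an external attempt is either a backend.success or a backend.fallback outcome, and it uses a nearest-rank percentile; the function names and the percentile method are illustrative assumptions.

```python
import math
from typing import Optional, Sequence


def fallback_rate_pct(successes: int, fallbacks: int) -> Optional[float]:
    """Fallbacks as a percentage of all attempts (assumed: success + fallback)."""
    attempts = successes + fallbacks
    if attempts == 0:
        return None  # no attempts, so the rate is undefined
    return 100.0 * fallbacks / attempts


def completion_delta_pp(target_done: int, target_routes: int,
                        baseline_done: int, baseline_routes: int) -> Optional[float]:
    """Completion rate delta (target - baseline) in percentage points."""
    if target_routes == 0 or baseline_routes == 0:
        return None
    return 100.0 * target_done / target_routes - 100.0 * baseline_done / baseline_routes


def percentile(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile over a non-empty sequence (assumed method)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]


def latency_delta_ms(target: Sequence[float], baseline: Sequence[float],
                     pct: float) -> float:
    """Percentile latency delta (target - baseline) in milliseconds."""
    return percentile(target, pct) - percentile(baseline, pct)


# 2 fallbacks out of 8 attempts, as in Window A.
print(fallback_rate_pct(successes=6, fallbacks=2))  # 25.0

# Illustrative latencies only; these are not values from the audit log.
print(latency_delta_ms([900, 1100, 1300, 5000], [800, 1000], 50))  # 300
```

A single slow turn in a small sample dominates the P95 figure, which is one reason the gate also requires minimum route counts.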

How To Run

Run a canary summary for the current cohort:

pnpm audit:backend-canary \
  --audit ~/.local/share/flynn/audit.log \
  --backend pi_embedded \
  --baseline native \
  --session telegram:8367012007 \
  --gate-min-target-routes 8 \
  --gate-min-baseline-routes 2 \
  --gate-min-target-attempts 8 \
  --format markdown

Run with gate evaluation and emit JSON artifact:

pnpm audit:backend-canary \
  --audit ~/.local/share/flynn/audit.log \
  --backend pi_embedded \
  --baseline native \
  --session telegram:8367012007 \
  --format json \
  --out docs/plans/artifacts/pi_embedded_eval_latest.json \
  --gate-min-target-routes 8 \
  --gate-min-baseline-routes 2 \
  --gate-min-target-attempts 8 \
  --gate-max-completion-drop-pp 2 \
  --gate-max-p50-latency-increase-ms 250 \
  --gate-max-p95-latency-increase-ms 700 \
  --gate-max-fallback-rate-pct 5

Run controlled probe-window evaluation (guard coverage required):

pnpm audit:backend-canary \
  --audit ~/.local/share/flynn/audit.log \
  --backend pi_embedded \
  --baseline native \
  --session telegram:8367012007 \
  --gate-min-target-routes 8 \
  --gate-min-baseline-routes 2 \
  --gate-min-target-attempts 8 \
  --gate-min-guard-pi-no-tools-count 1 \
  --gate-min-guard-capability-query-count 1 \
  --gate-min-guard-attachments-present-count 1 \
  --gate-max-completion-drop-pp 2 \
  --gate-max-p50-latency-increase-ms 250 \
  --gate-max-p95-latency-increase-ms 700 \
  --gate-max-fallback-rate-pct 5 \
  --format markdown
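
Because the same thresholds must apply to every window, it can help to generate the command line from one shared table rather than retyping flags. The sketch below only assembles and prints the argument list; the flag names are taken from the commands above, while the helper name and dictionary layout are hypothetical.

```python
import os
import shlex
from typing import Dict, List

# Thresholds shared by every evaluation window (flags as used above).
SHARED_GATES: Dict[str, int] = {
    "--gate-min-target-routes": 8,
    "--gate-min-baseline-routes": 2,
    "--gate-min-target-attempts": 8,
    "--gate-max-completion-drop-pp": 2,
    "--gate-max-p50-latency-increase-ms": 250,
    "--gate-max-p95-latency-increase-ms": 700,
    "--gate-max-fallback-rate-pct": 5,
}

# Extra gates that only apply to controlled probe windows.
PROBE_GUARD_GATES: Dict[str, int] = {
    "--gate-min-guard-pi-no-tools-count": 1,
    "--gate-min-guard-capability-query-count": 1,
    "--gate-min-guard-attachments-present-count": 1,
}


def build_canary_command(audit_path: str, session: str,
                         gates: Dict[str, int],
                         output_format: str = "markdown") -> List[str]:
    """Assemble the argv for one canary run (hypothetical helper)."""
    argv = [
        "pnpm", "audit:backend-canary",
        "--audit", os.path.expanduser(audit_path),
        "--backend", "pi_embedded",
        "--baseline", "native",
        "--session", session,
    ]
    for flag, value in gates.items():
        argv += [flag, str(value)]
    argv += ["--format", output_format]
    return argv


probe_cmd = build_canary_command(
    "~/.local/share/flynn/audit.log",
    "telegram:8367012007",
    {**SHARED_GATES, **PROBE_GUARD_GATES},
)
print(shlex.join(probe_cmd))
```

The printed line can be pasted into a shell, or the list can be passed to subprocess.run without a shell.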

Generate a local router guard-probe log (synthetic inbound turns through createMessageRouter):

pnpm audit:backend-canary:probes

Then evaluate guard coverage against the probe log:

pnpm audit:backend-canary \
  --audit docs/plans/artifacts/pi_embedded_eval_window_c_guard_probes.jsonl \
  --backend pi_embedded \
  --baseline native \
  --session telegram:8367012007 \
  --gate-min-guard-pi-no-tools-count 1 \
  --gate-min-guard-capability-query-count 1 \
  --gate-min-guard-attachments-present-count 1 \
  --format markdown
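
As a rough illustration of what the guard-coverage gate checks, the sketch below counts forced-native guard hits in a JSONL audit stream. The audit event schema is not documented here, so the field names "type", "source", and "guard" are assumptions, and the sample lines are invented for illustration rather than copied from the probe log.

```python
import json
from collections import Counter
from typing import Iterable

GUARD_REASONS = ("pi_no_tools_mode", "capability_query", "attachments_present")


def count_guard_hits(lines: Iterable[str]) -> Counter:
    """Count forced_native_guard route events per guard reason."""
    hits: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        # Assumed schema: {"type": ..., "source": ..., "guard": ...}
        if event.get("type") != "backend.route":
            continue
        if event.get("source") != "forced_native_guard":
            continue
        reason = event.get("guard")
        if reason in GUARD_REASONS:
            hits[reason] += 1
    return hits


def guard_coverage_ok(hits: Counter, minimum: int = 1) -> bool:
    """True when every guard reason meets the minimum hit count."""
    return all(hits[reason] >= minimum for reason in GUARD_REASONS)


# Invented sample events, one per guard plus one ordinary route.
sample = [
    '{"type": "backend.route", "source": "forced_native_guard", "guard": "pi_no_tools_mode"}',
    '{"type": "backend.route", "source": "forced_native_guard", "guard": "capability_query"}',
    '{"type": "backend.route", "source": "forced_native_guard", "guard": "attachments_present"}',
    '{"type": "backend.route", "source": "canary", "backend": "pi_embedded"}',
]
hits = count_guard_hits(sample)
print(dict(hits), guard_coverage_ok(hits))
```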

Evaluation Log

Window A

  • Dates: February 24, 2026 (05:29:49Z to 06:26:20Z)
  • Route volume: 10 total routes (pi_embedded: 8, native: 2)
  • Summary artifacts:
    • docs/plans/artifacts/pi_embedded_eval_window_a_2026-02-24.md
    • docs/plans/artifacts/pi_embedded_eval_window_a_2026-02-24.json

| Check | Result | Notes |
| --- | --- | --- |
| Minimum target routes | 8 (pass) | gate >= 8 |
| Minimum baseline routes | 2 (pass) | gate >= 2 |
| Minimum target external attempts | 8 (pass) | gate >= 8 |
| Completion rate delta | 0.00pp (pass) | target 100.00% vs baseline 100.00% |
| P50 latency delta | +259ms (fail) | gate <= +250ms |
| P95 latency delta | +5695ms (fail) | gate <= +700ms |
| Fallback rate | 25.00% (fail) | 2 fallbacks / 8 attempts; gate <= 5.00% |
| Guardrail escapes | none observed (provisional pass) | no forced_native_guard events in this window |
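
The Window A results can be re-checked against the gate thresholds with a few lines of code. This is a sketch under stated assumptions: the WindowSummary shape and check names are hypothetical, and only the thresholds and the Window A figures come from this document.

```python
from dataclasses import dataclass
from typing import Dict

# Thresholds from the Pass/Fail Gate section.
GATES = {
    "min_target_routes": 8,
    "min_baseline_routes": 2,
    "min_target_attempts": 8,
    "max_completion_drop_pp": 2.0,
    "max_p50_latency_increase_ms": 250.0,
    "max_p95_latency_increase_ms": 700.0,
    "max_fallback_rate_pct": 5.0,
}


@dataclass
class WindowSummary:
    """Hypothetical summary of one evaluation window."""
    target_routes: int
    baseline_routes: int
    target_attempts: int
    completion_delta_pp: float
    p50_delta_ms: float
    p95_delta_ms: float
    fallback_rate_pct: float


def evaluate_gates(s: WindowSummary, gates: Dict[str, float] = GATES) -> Dict[str, bool]:
    """Return pass (True) or fail (False) for each numeric check."""
    return {
        "target_routes": s.target_routes >= gates["min_target_routes"],
        "baseline_routes": s.baseline_routes >= gates["min_baseline_routes"],
        "target_attempts": s.target_attempts >= gates["min_target_attempts"],
        "completion_delta": s.completion_delta_pp >= -gates["max_completion_drop_pp"],
        "p50_latency_delta": s.p50_delta_ms <= gates["max_p50_latency_increase_ms"],
        "p95_latency_delta": s.p95_delta_ms <= gates["max_p95_latency_increase_ms"],
        "fallback_rate": s.fallback_rate_pct <= gates["max_fallback_rate_pct"],
    }


# Figures reported for Window A.
window_a = WindowSummary(
    target_routes=8, baseline_routes=2, target_attempts=8,
    completion_delta_pp=0.0, p50_delta_ms=259.0, p95_delta_ms=5695.0,
    fallback_rate_pct=25.0,
)
results = evaluate_gates(window_a)
failed = sorted(name for name, ok in results.items() if not ok)
print(failed)  # ['fallback_rate', 'p50_latency_delta', 'p95_latency_delta']
```

The P50 check misses its gate by only 9ms, while the P95 and fallback checks miss by a wide margin.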

Window B

  • Dates: February 24, 2026 (since 06:14:00Z; post-initial-fallback slice)
  • Route volume: 6 total routes (pi_embedded: 6, native: 0)
  • Summary artifacts:
    • docs/plans/artifacts/pi_embedded_eval_window_b_2026-02-24_post_fallbacks.md
    • docs/plans/artifacts/pi_embedded_eval_window_b_2026-02-24_post_fallbacks.json

| Check | Result | Notes |
| --- | --- | --- |
| Minimum target routes | 6 (fail) | gate >= 8 |
| Minimum baseline routes | 0 (fail) | gate >= 2 |
| Minimum target external attempts | 6 (fail) | gate >= 8 |
| Completion rate delta | n/a (insufficient baseline) | no native-routed turns in this slice |
| P50 latency delta | n/a (insufficient baseline) | no native-routed turns in this slice |
| P95 latency delta | n/a (insufficient baseline) | no native-routed turns in this slice |
| Fallback rate | 0.00% (pass) | 0 fallbacks / 6 attempts |
| Guardrail escapes | none observed (provisional pass) | no forced_native_guard events in this window |
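
Window B shows why delta metrics must be able to report "n/a": with no native-routed turns there is nothing to subtract. One way to model this, sketched below with hypothetical helper names and invented latencies, is to return None when either side has no samples and to render that explicitly instead of as zero.

```python
import statistics
from typing import Optional, Sequence


def p50_delta_ms(target_ms: Sequence[float],
                 baseline_ms: Sequence[float]) -> Optional[float]:
    """Median latency delta, or None when either side has no samples."""
    if not target_ms or not baseline_ms:
        return None
    return statistics.median(target_ms) - statistics.median(baseline_ms)


def format_delta(delta: Optional[float], unit: str = "ms") -> str:
    """Render a delta with an explicit sign, or mark it as not computable."""
    if delta is None:
        return "n/a (insufficient baseline)"
    return f"{delta:+.0f}{unit}"


# Invented latencies; the empty baseline mirrors Window B (native: 0 routes).
print(format_delta(p50_delta_ms([1200.0, 1500.0, 1800.0], [])))
print(format_delta(p50_delta_ms([1200.0, 1500.0, 1800.0], [1000.0, 1400.0])))
```

Treating a missing baseline as None rather than 0 keeps an empty slice from reading as a pass on the latency gates.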

Window C (Guard Coverage Pre-Probe Baseline)

  • Dates: February 24, 2026 (same full Window A slice; guard-coverage gates enabled)
  • Route volume: 10 total routes (pi_embedded: 8, native: 2)
  • Summary artifacts:
    • docs/plans/artifacts/pi_embedded_eval_window_c_2026-02-24_guard_preprobe.md
    • docs/plans/artifacts/pi_embedded_eval_window_c_2026-02-24_guard_preprobe.json

| Check | Result | Notes |
| --- | --- | --- |
| Minimum target routes | 8 (pass) | gate >= 8 |
| Minimum baseline routes | 2 (pass) | gate >= 2 |
| Minimum target external attempts | 8 (pass) | gate >= 8 |
| Minimum pi_no_tools_mode guard hits | 0 (fail) | gate >= 1 |
| Minimum capability_query guard hits | 0 (fail) | gate >= 1 |
| Minimum attachments_present guard hits | 0 (fail) | gate >= 1 |
| Completion rate delta | 0.00pp (pass) | target 100.00% vs baseline 100.00% |
| P50 latency delta | +259ms (fail) | gate <= +250ms |
| P95 latency delta | +5695ms (fail) | gate <= +700ms |
| Fallback rate | 25.00% (fail) | 2 fallbacks / 8 attempts; gate <= 5.00% |

Window D (Guard Coverage Controlled Probes)

  • Dates: February 24, 2026 (local synthetic probe run)
  • Route volume: 4 total routes (pi_embedded: 1, native: 3)
  • Summary artifacts:
    • docs/plans/artifacts/pi_embedded_eval_window_c_guard_probes.jsonl
    • docs/plans/artifacts/pi_embedded_eval_window_c_2026-02-24_guard_postprobe.md
    • docs/plans/artifacts/pi_embedded_eval_window_c_2026-02-24_guard_postprobe.json

| Check | Result | Notes |
| --- | --- | --- |
| Minimum pi_no_tools_mode guard hits | 1 (pass) | gate >= 1 |
| Minimum capability_query guard hits | 1 (pass) | gate >= 1 |
| Minimum attachments_present guard hits | 1 (pass) | gate >= 1 |

Tool Compatibility Findings

Track all tool-adjacent/risky prompts that were force-routed to native (no_tools_mode) and any misses.

| Class | Observed behavior | Action |
| --- | --- | --- |
| Tool-adjacent prompts | Verified in controlled probes: pi_no_tools_mode guard hit = 1. | Keep this in regression probes for future backend iterations. |
| Capability-query prompts | Verified in controlled probes: capability_query guard hit = 1. | Keep this in regression probes for future backend iterations. |
| Attachments-present turns | Verified in controlled probes: attachments_present guard hit = 1. | Keep this in regression probes for future backend iterations. |

Decision Record

  • Decision date: February 24, 2026
  • Decision: rollback (end canary routing for now)
  • Rationale: Window A fails core numeric gates (p50 delta, p95 delta, fallback rate) with two concrete fallback failure modes:
    • pi_module_interface
    • empty_assistant_text

    Window B shows fallback recovery (0%) but fails minimum sample/baseline gates, so it does not overturn Window A. Window D confirms guardrail compatibility behavior in controlled probes, but guard compatibility alone is insufficient to justify expansion.
  • Next cohort/config delta: route canary users back to the native path until the latency and fallback defects are remediated and a fresh canary is re-approved. The operational rollback has been applied in the runtime config: ~/.config/flynn/config.yaml now sets agent_configs.pi_canary.backend: native (backup created as ~/.config/flynn/config.yaml.bak-rollback-20260223-224801).

Diagram/Protocol Impact Review

  • Reviewed: docs/architecture/AGENT_DIAGRAM.md, docs/architecture/GATEWAY_SESSIONS_AND_QUEUE.md, docs/api/PROTOCOL.md
  • Result: no runtime message-flow or protocol-shape changes; no Mermaid topology update required for this evaluation-tooling phase.