Pi Embedded Canary Evaluation (Phase 2)

Status: in progress
Owner: Flynn maintainers
Started: 2026-02-24

Goal

Close the canary spike with a formal, repeatable evaluation and an explicit rollout decision (expand, hold, or rollback).

Scope

  • Target backend: pi_embedded
  • Target cohort (current): telegram:8367012007
  • Baseline backend: native
  • Data source: audit events (backend.route, backend.success, backend.fallback, session.message)

Pass/Fail Gate

Use the same thresholds for every evaluation window.

| Metric | Gate |
| --- | --- |
| Completion rate delta (target - baseline) | >= -2.00pp |
| P50 latency delta (target - baseline) | <= +250ms |
| P95 latency delta (target - baseline) | <= +700ms |
| External fallback rate (pi_embedded) | <= 5.00% |
| Guardrail escapes | 0 unresolved |

Notes:

  • Completion rate and latency are computed from route-to-assistant turn timings.
  • Fallback rate is computed from backend.success + backend.fallback attempt outcomes.
  • Guardrail escapes are reviewed using backend.route.source == forced_native_guard events together with operator incident review.
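
The metric definitions in the notes above can be sketched roughly as follows. This is a minimal illustration, not the implementation behind `audit:backend-canary`: the `TurnSample` and `Outcome` shapes, the nearest-rank percentile, and the fallback-rate formula (fallbacks divided by all success + fallback outcomes) are assumptions made for the example, and the real audit-event schema may differ.

```typescript
// Hypothetical, simplified shapes; the real audit-event schema may differ.
type Outcome = "success" | "fallback";

interface TurnSample {
  completed: boolean; // did the routed turn end with an assistant message?
  latencyMs: number;  // route-to-assistant time (only meaningful when completed)
}

interface Summary {
  completionPct: number;
  p50Ms: number;
  p95Ms: number;
}

// Nearest-rank percentile: one common definition, assumed here.
function percentile(values: number[], p: number): number {
  if (values.length === 0) return NaN;
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1];
}

// Share of routed turns that completed, in percent.
function completionRatePct(turns: TurnSample[]): number {
  if (turns.length === 0) return NaN;
  return (100 * turns.filter((t) => t.completed).length) / turns.length;
}

// Assumed formula: fallbacks / (successes + fallbacks), in percent.
function fallbackRatePct(outcomes: Outcome[]): number {
  if (outcomes.length === 0) return NaN;
  return (100 * outcomes.filter((o) => o === "fallback").length) / outcomes.length;
}

// Per-backend summary; latency percentiles use completed turns only.
function summarize(turns: TurnSample[]): Summary {
  const latencies = turns.filter((t) => t.completed).map((t) => t.latencyMs);
  return {
    completionPct: completionRatePct(turns),
    p50Ms: percentile(latencies, 50),
    p95Ms: percentile(latencies, 95),
  };
}

// Deltas are always target minus baseline, matching the gate table.
function deltas(target: Summary, baseline: Summary) {
  return {
    completionDeltaPp: target.completionPct - baseline.completionPct,
    p50LatencyDeltaMs: target.p50Ms - baseline.p50Ms,
    p95LatencyDeltaMs: target.p95Ms - baseline.p95Ms,
  };
}
```

Keeping the deltas signed as target minus baseline means a negative completion delta is a regression while a positive latency delta is a regression, which is why the gate uses `>=` for the first and `<=` for the other two.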

How To Run

Run a canary summary for the current cohort:

pnpm audit:backend-canary \
  --audit ~/.local/share/flynn/audit.log \
  --backend pi_embedded \
  --baseline native \
  --session telegram:8367012007 \
  --format markdown

Run with gate evaluation and emit a JSON artifact:

pnpm audit:backend-canary \
  --audit ~/.local/share/flynn/audit.log \
  --backend pi_embedded \
  --baseline native \
  --session telegram:8367012007 \
  --format json \
  --out docs/plans/artifacts/pi_embedded_eval_latest.json \
  --gate-max-completion-drop-pp 2 \
  --gate-max-p50-latency-increase-ms 250 \
  --gate-max-p95-latency-increase-ms 700 \
  --gate-max-fallback-rate-pct 5
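
The gate flags in the command above mirror the thresholds in the Pass/Fail Gate table. As a rough sketch of the check they imply, the snippet below evaluates a set of measured deltas against those thresholds. All type, field, and function names here are hypothetical and chosen for illustration; they are not the tool's actual code or the shape of its JSON artifact.

```typescript
// Hypothetical names; thresholds copied from the Pass/Fail Gate table.
interface GateThresholds {
  maxCompletionDropPp: number;
  maxP50LatencyIncreaseMs: number;
  maxP95LatencyIncreaseMs: number;
  maxFallbackRatePct: number;
}

interface CanaryMeasurements {
  completionDeltaPp: number; // target - baseline
  p50LatencyDeltaMs: number; // target - baseline
  p95LatencyDeltaMs: number; // target - baseline
  fallbackRatePct: number;   // pi_embedded only
  unresolvedGuardrailEscapes: number;
}

const DEFAULT_GATE: GateThresholds = {
  maxCompletionDropPp: 2,
  maxP50LatencyIncreaseMs: 250,
  maxP95LatencyIncreaseMs: 700,
  maxFallbackRatePct: 5,
};

function evaluateGate(
  m: CanaryMeasurements,
  g: GateThresholds = DEFAULT_GATE,
): { pass: boolean; failures: string[] } {
  const failures: string[] = [];
  // Each check is written as !(ok) so that a NaN measurement fails the gate.
  if (!(m.completionDeltaPp >= -g.maxCompletionDropPp)) failures.push("completion_rate_delta");
  if (!(m.p50LatencyDeltaMs <= g.maxP50LatencyIncreaseMs)) failures.push("p50_latency_delta");
  if (!(m.p95LatencyDeltaMs <= g.maxP95LatencyIncreaseMs)) failures.push("p95_latency_delta");
  if (!(m.fallbackRatePct <= g.maxFallbackRatePct)) failures.push("fallback_rate");
  if (!(m.unresolvedGuardrailEscapes === 0)) failures.push("guardrail_escapes");
  return { pass: failures.length === 0, failures };
}

// Example: a window that regresses P95 latency beyond +700ms.
const result = evaluateGate({
  completionDeltaPp: -1.5,
  p50LatencyDeltaMs: 200,
  p95LatencyDeltaMs: 800,
  fallbackRatePct: 4,
  unresolvedGuardrailEscapes: 0,
});
console.log(result.pass, result.failures);
```

Returning the list of failed checks rather than a bare boolean makes it easier to fill in the Notes column of each evaluation window.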

Evaluation Log

Window A

  • Dates: TBD
  • Route volume: TBD
  • Summary artifact: TBD

| Check | Result | Notes |
| --- | --- | --- |
| Completion rate delta | TBD | |
| P50 latency delta | TBD | |
| P95 latency delta | TBD | |
| Fallback rate | TBD | |
| Guardrail escapes | TBD | |

Window B

  • Dates: TBD
  • Route volume: TBD
  • Summary artifact: TBD

| Check | Result | Notes |
| --- | --- | --- |
| Completion rate delta | TBD | |
| P50 latency delta | TBD | |
| P95 latency delta | TBD | |
| Fallback rate | TBD | |
| Guardrail escapes | TBD | |

Tool Compatibility Findings

Track all tool-adjacent/risky prompts that were force-routed to native (no_tools_mode) and any misses.

| Class | Observed behavior | Action |
| --- | --- | --- |
| Tool-adjacent prompts | TBD | |
| Capability-query prompts | TBD | |
| Attachments-present turns | TBD | |

Decision Record

  • Decision date: TBD
  • Decision: expand | hold | rollback
  • Rationale: TBD
  • Next cohort/config delta: TBD

Diagram/Protocol Impact Review

  • Reviewed: docs/architecture/AGENT_DIAGRAM.md, docs/architecture/GATEWAY_SESSIONS_AND_QUEUE.md, docs/api/PROTOCOL.md
  • Result: no runtime message-flow or protocol-shape changes; no Mermaid topology update required for this evaluation-tooling phase.