Files
flynn/docs/plans/2026-02-25-phase4-rollout-operator-readiness.md
T
2026-02-25 11:23:47 -08:00

3.4 KiB

Phase 4 Rollout + Operator Readiness (Deeper Surfaces)

Date: 2026-02-25

Summary

This document provides the rollout plan, rollback playbook, and operator readiness checklist for the deeper end-user surfaces + integrated behavior stack workstreams (run-control, reactions v2, companion/canvas/voice).

Canary Rollout Plan

Guarded Rollout Steps

  1. Run-control semantics (Phase 1) Toggle: server.queue.mode: interrupt only for canary sessions via server.queue.overrides.sessions. Gate: cancel-to-ack p95 <= 500ms, zero duplicate final responses in integration tests. Observe: run_state events (start, cancel_requested, cancelled, complete, error) in gateway UI + audit logs.

  2. Reactions v2 (Phase 2) Toggle: restrict automation.reactions list to canary rules + scoped triggers. Gate: reaction false-positive rate <= 3% in audit logs (reactionMatch, reactionSkip). Observe: system.metrics reaction counters + recursion guard skip reasons.

  3. Companion + Canvas (Phase 3) Toggle: server.nodes.enabled: true for companion canary nodes, enable server.nodes.feature_gates.ui.canvas. Gate: companion reconnect success >= 99% in soak; canvas artifacts survive restart in integration runs. Observe: node registration + capability logs; canvas list/get/put success in gateway UI.

  4. Voice Continuity (Phase 3) Toggle: tts.enabled: true and tts.enabled_channels for canary channels; audio.enabled: true for inbound voice. Gate: no dropped responses when TTS fails; text-only fallback confirmed in tests. Observe: warning logs for TTS failures, reply delivery counts.

Rollout Cadence

  1. Week 1: enable canary on a single internal channel + 1-2 sessions.
  2. Week 2: expand to 5-10% sessions/channels after gates hold.
  3. Week 3: expand to 25-50% after second gate review.
  4. Week 4: default-on unless gates fail; keep toggles for rollback.

Rollback Playbook

  1. Run-control rollback Set server.queue.mode: collect globally. Remove canary overrides in server.queue.overrides.sessions.

  2. Reactions rollback Set automation.reactions: [] or remove canary rules. Verify reactionMatch count drops to zero.

  3. Companion rollback Set server.nodes.enabled: false (or restrict allowed_roles to none). Clear companion node registrations by restarting gateway.

  4. Canvas rollback Disable ui.canvas in server.nodes.feature_gates. Optional: archive/remove dataDir/canvas after capture if needed.

  5. Voice rollback Set tts.enabled: false and/or remove tts.enabled_channels. Set audio.enabled: false to stop inbound voice processing.

Operator Readiness Checklist

Confirm protocol and architecture docs are synchronized (docs/api/PROTOCOL.md, docs/architecture/AGENT_DIAGRAM.md, docs/architecture/GATEWAY_SESSIONS_AND_QUEUE.md). Verify audit logs and system.metrics are capturing run_state transitions, cancel latency buckets, and reaction match/skip reasons. Validate canary tests: run-control queue preemption + cancel, reaction priority/cooldown, companion reconnect + re-register, canvas persistence across restart, TTS failure fallback. Capture a before/after snapshot of error rate, cancellation latency, reaction false positives, companion reconnect success.

Owner + Comms

  • Primary owner: Flynn core team
  • Canary checkpoint cadence: weekly
  • Escalation: revert via rollback playbook within 1 hour of gate breach