Document browser reliability layer and roadmap progress

2026-02-26 14:06:53 -08:00
parent 7c904ef0fd
commit e9873ad22b
6 changed files with 52 additions and 8 deletions
@@ -951,12 +951,18 @@ Flynn ships these browser tools:
 - `browser.click`
 - `browser.type`
 - `browser.content`
+- `browser.wait_for`
+- `browser.assert`
+- `browser.extract`
+- `browser.checkpoint.save`
+- `browser.checkpoint.resume`
 - `browser.eval`
 - `browser.evaluate` (alias of `browser.eval`)

 These tools are backed by a Puppeteer/CDP browser manager and are only registered when `browser.enabled: true`.
 They can still be filtered out by tool policy (`tools.profile`, `tools.allow`, `tools.deny`).
 At startup, Flynn logs the browser tools that remain available after policy filtering.
+Browser runtime guardrails support domain allowlists, explicit high-risk-domain confirmation, retry controls, and a bounded workflow step budget.
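
As a rough illustration of how a domain allowlist and high-risk confirmation could combine, here is a minimal TypeScript sketch. The names `checkDomain`, `matchesDomain`, and `DomainDecision` are hypothetical and not taken from Flynn's source. The wildcard semantics (`*.example.com` matches subdomains but not the bare apex) and the treatment of an empty allowlist as "no restriction" are assumptions.

```typescript
// Hypothetical sketch: illustrative names, not Flynn's actual implementation.
interface DomainGuardrailConfig {
  allowed_domains: string[];
  high_risk_domains: string[];
  require_confirmation_for_high_risk: boolean;
}

type DomainDecision = 'allow' | 'deny' | 'needs_confirmation';

// Assumed semantics: "*.example.com" matches any subdomain of example.com,
// but not the apex "example.com"; any other pattern must match exactly.
function matchesDomain(pattern: string, hostname: string): boolean {
  const p = pattern.toLowerCase();
  const h = hostname.toLowerCase();
  if (p.startsWith('*.')) {
    const suffix = p.slice(1); // ".example.com"
    return h.length > suffix.length && h.endsWith(suffix);
  }
  return h === p;
}

function checkDomain(
  hostname: string,
  config: DomainGuardrailConfig,
  confirmHighRisk = false,
): DomainDecision {
  // Assumption: an empty allowlist means "no domain restriction".
  const allowed =
    config.allowed_domains.length === 0 ||
    config.allowed_domains.some((p) => matchesDomain(p, hostname));
  if (!allowed) return 'deny';

  const highRisk = config.high_risk_domains.some((p) => matchesDomain(p, hostname));
  if (highRisk && config.require_confirmation_for_high_risk && !confirmHighRisk) {
    return 'needs_confirmation';
  }
  return 'allow';
}
```

Under these assumptions, `bank.example.com` passes the allowlist but is held for confirmation until the caller opts in explicitly.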

 ```yaml
 browser:
@@ -964,6 +970,13 @@ browser:
  headless: true
  max_pages: 5
  default_timeout: 30000
+  allowed_domains: ["*.example.com"]
+  high_risk_domains: ["bank.example.com"]
+  require_confirmation_for_high_risk: true
+  max_workflow_steps: 120
+  default_retry_attempts: 1
+  max_retry_attempts: 5
+  retry_delay_ms: 250
  # executable_path: /usr/bin/google-chrome
  # ws_endpoint: ws://127.0.0.1:9222/devtools/browser/<id>

@@ -1302,6 +1302,10 @@ Set callback for tool use events (for confirmation UI).

 List available tools.

+When browser automation is enabled, `tools.list` may include workflow-reliability helpers such as:
+`browser.wait_for`, `browser.assert`, `browser.extract`, `browser.checkpoint.save`, and `browser.checkpoint.resume`
+in addition to baseline navigation/click/type/content/eval tools.
+
 **Request:**
 ```json
 {
@@ -1338,6 +1342,9 @@ List available tools.

 Execute a tool directly (bypass agent).

+Browser workflow tools enforce runtime guardrails configured in `browser.*`:
+domain allowlists, high-risk-domain confirmation (`confirm_high_risk=true`), retry bounds, and step-budget limits.
+
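The retry bounds and step budget named above might be enforced along these lines. This is only a sketch under stated assumptions: `resolveRetryAttempts` and `StepBudget` are hypothetical names, and the clamping behavior (a requested retry count is forced into the range 0 to `max_retry_attempts`) is assumed rather than confirmed by the source.

```typescript
// Hypothetical sketch: illustrative names, not Flynn's actual implementation.
interface WorkflowLimits {
  default_retry_attempts: number;
  max_retry_attempts: number;
  max_workflow_steps: number;
}

// Clamp a caller-requested retry count into [0, max_retry_attempts],
// falling back to the configured default when none is given.
function resolveRetryAttempts(requested: number | undefined, limits: WorkflowLimits): number {
  const wanted = requested ?? limits.default_retry_attempts;
  return Math.min(Math.max(0, Math.floor(wanted)), limits.max_retry_attempts);
}

// Tracks how many workflow steps have run against a fixed ceiling.
class StepBudget {
  private used = 0;

  constructor(private readonly limit: number) {}

  // Record one step; throws once the budget is already exhausted.
  consume(): void {
    if (this.used >= this.limit) {
      throw new Error(`workflow step budget exhausted (${this.limit} steps)`);
    }
    this.used += 1;
  }

  get remaining(): number {
    return this.limit - this.used;
  }
}
```

A tool wrapper could call `budget.consume()` before each browser action and use `resolveRetryAttempts` to decide how many times to re-run a failed step. Whether each retry also counts against the step budget is a design choice the source does not specify.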
 **Request:**
 ```json
 {
@@ -25,7 +25,7 @@ Tools are executable capabilities that the AI agent can call to perform actions
 - **File System**: `file.read`, `file.write`, `file.edit`, `file.list`
 - **Shell/Process**: `shell.exec`, `process.start`, `process.kill`
 - **Web**: `web.fetch`, `web.search`
- **Browser**: `browser.navigate`, `browser.screenshot`, `browser.click`, `browser.type`, `browser.content`, `browser.eval`, `browser.evaluate` (alias of `browser.eval`)
+- **Browser**: `browser.navigate`, `browser.screenshot`, `browser.click`, `browser.type`, `browser.content`, `browser.wait_for`, `browser.assert`, `browser.extract`, `browser.checkpoint.save`, `browser.checkpoint.resume`, `browser.eval`, `browser.evaluate` (alias of `browser.eval`)
 - **Memory**: `memory.read`, `memory.write`, `memory.search`
 - **MinIO**: `minio.share`, `minio.ingest`, `minio.sync`
 - **Kubernetes**: `k8s.pods`, `k8s.deployments`, `k8s.logs`
@@ -330,7 +330,7 @@ Use for tools that share a common dependency or manager.
 import type { Tool, ToolResult } from '../../types.js';
 import type { BrowserManager } from './manager.js';

-export function createBrowserTools(manager: BrowserManager): Tool[] {
+export function createBrowserTools(manager: BrowserManager, options?: BrowserToolsOptions): Tool[] {
  return [
    {
      name: 'browser.navigate',
@@ -266,6 +266,7 @@ Flynn treats content provenance as part of the control boundary:
 - `web.fetch`, `web.search`, and `browser.content` outputs are treated as untrusted "fetched_content".
 - Tool results are wrapped in provenance markers inside the tool loop.
 - Once untrusted content is seen, ToolExecutor applies stricter gating (blocks obvious injection patterns for high-risk tools).
+- Browser workflow tools add execution guardrails in the tool layer: `allowed_domains`, explicit high-risk confirmations, bounded retry policies, and step-budget enforcement.

 Key files:

@@ -18,6 +18,7 @@ If you only want the protocol surface, see `docs/api/PROTOCOL.md`.
 - Run lifecycle/cancel intent and reaction decisions are emitted to audit logs, and aggregated into `system.metrics` counters (runStates, cancelLatencyMs, reactions) for dashboards.
 - Reaction matching is deterministic (priority + cooldown + recursion guard) before intent/agent routing.
 - `subagent.*` tools create child orchestrators scoped to the parent conversation (`subagent:<parentSessionId>:<childId>`) with idle TTL cleanup, per-child queue mode (`followup|interrupt`), and session budgets (turn/token/timeout); this is tool-loop behavior, not a separate gateway RPC session lane.
+- Browser workflow reliability primitives (`browser.wait_for/assert/extract/checkpoint.*`) execute in the same queued session lane and apply browser-config guardrails (domain allowlist/high-risk confirmation, bounded retries, workflow step budget).
 - Companion `node.*` registration is per WebSocket connection; reconnects must re-register capabilities before invoking node RPC methods.
 - Canvas artifacts are persisted per session under the gateway data directory for UI recovery across restarts.
 - TTS output is best-effort; synthesis failures fall back to text-only responses.
@@ -6786,15 +6786,37 @@
      "test_status": "docs only"
    },
    "personal-assistant-productization-plan-2026-02-26": {
-      "status": "proposed",
+      "status": "in_progress",
      "date": "2026-02-26",
      "updated": "2026-02-26",
-      "summary": "Rebaselined Flynn's OpenClaw-style personal-assistant gaps and defined an execution-ready 8-10 week productization roadmap focused on shipped companion apps, voice daily-driver reliability, browser workflow reliability, and onboarding first-success funnel metrics.",
+      "summary": "Rebaselined Flynn's OpenClaw-style personal-assistant gaps and defined an execution-ready 8-10 week roadmap. Phase 3 browser reliability work is now shipped (workflow primitives, retry/budget/guardrails, checkpoints), with companion/voice/onboarding phases remaining.",
      "files_modified": [
        "docs/plans/2026-02-26-personal-assistant-productization-plan.md",
        "docs/plans/state.json"
      ],
-      "test_status": "planning/docs update only; no runtime code changes"
+      "test_status": "roadmap status updated; implementation tracked in phase-specific entries"
+    },
+    "personal-assistant-productization-phase3-browser-reliability": {
+      "status": "completed",
+      "date": "2026-02-26",
+      "updated": "2026-02-26",
+      "summary": "Implemented Phase 3 browser workflow reliability layer: added `browser.wait_for`, `browser.assert`, `browser.extract`, checkpoint save/resume tools, retry wrappers, domain allowlist + high-risk confirmation guardrails, and bounded workflow-step budgets wired through config and daemon registration.",
+      "files_modified": [
+        "src/tools/builtin/browser/tools.ts",
+        "src/tools/builtin/browser/tools.test.ts",
+        "src/daemon/tools.ts",
+        "src/tools/policy.ts",
+        "src/config/schema.ts",
+        "src/config/schema.test.ts",
+        "config/default.yaml",
+        "README.md",
+        "docs/api/TOOLS.md",
+        "docs/api/PROTOCOL.md",
+        "docs/architecture/AGENT_DIAGRAM.md",
+        "docs/architecture/GATEWAY_SESSIONS_AND_QUEUE.md",
+        "docs/plans/state.json"
+      ],
+      "test_status": "pnpm test:run src/tools/builtin/browser/tools.test.ts src/config/schema.test.ts src/tools/policy.test.ts + pnpm typecheck passing"
    },
    "subagents-support-phase1": {
      "status": "completed",
@@ -6830,7 +6852,7 @@
    }
  },
  "overall_progress": {
-    "total_test_count": 2534,
+    "total_test_count": 2544,
    "all_tests_passing": true,
    "p0_completion": "3/3 (100%)",
    "p1_completion": "4/4 (100%)",
@@ -6845,7 +6867,7 @@
    "tier2_completion": "4/4 (100%) \u2014 inbound webhooks, vector memory search, Dockerfile, heartbeat monitor",
    "tier3_completion": "5/5 (100%) \u2014 lane queue, credential redaction, web UI token dashboard, xAI (Grok) provider, Voyage AI embeddings",
    "tier4_completion": "4/4 (100%) \u2014 gateway lock, shell completion, Tailscale Serve/Funnel, DM pairing codes",
-    "feature_gap_scorecard": "rebaselined 2026-02-26 — channel breadth, setup wizard, baseline browser automation, and full subagent support (`subagent.*` + queue modes + budgets + trace/audit + `/subagents` inspection) are implemented; remaining high-impact personal-assistant gaps center on shipped companion apps (desktop/mobile), voice UX polish, browser workflow reliability primitives, and first-success onboarding funnel optimization.",
+    "feature_gap_scorecard": "rebaselined 2026-02-26 and updated 2026-02-26 (phase 3) — channel breadth, setup wizard, baseline browser automation, subagent controls, and browser workflow reliability primitives (wait/assert/extract/retries/checkpoints/guardrails/budgets) are implemented; remaining high-impact personal-assistant gaps center on shipped companion apps (desktop/mobile), voice UX polish, and first-success onboarding funnel optimization.",
    "operator_dx_milestone": "Phase 3 (Live Ops Dashboard): 2/2 plans complete \u2014 milestone done",
    "dashboard_observability": "completed \u2014 service health graphs + core service log viewer added to web UI via observability RPCs and bounded backend sampling",
    "gmail_auth_cli": "flynn gmail-auth command implemented with OAuth2 flow, doctor check, config routed to Telegram",
@@ -6878,7 +6900,7 @@
    "deeper_surfaces_phase3_companion_canvas_voice": "completed \u2014 companion reconnect resilience (auto-reconnect with backoff, pending-wait cancellation on disconnect), canvas artifact persistence (SQLite-backed store, daemon-restart durability), voice TTS fallback coverage (text-only reply on TTS failure, no dropped responses)",
    "deeper_surfaces_phase4_rollout": "completed \u2014 phase 4 rollout and operator readiness plan documented: canary rollout plan by feature flag/surface, explicit rollback playbook, operator docs and architecture/protocol docs synchronized",
    "post_phase_test_fixes": "completed \u2014 fixed 4 test failures introduced by phases 1-3: iOS/Android push listNodes (missing publishHeartbeat before platform-filtered query), server.test agent.send (run_state events now precede done; added sendAndWaitForDone helper), httpBody 413 (req.destroy() closed socket before response could be sent; replaced with Connection: close header on 413 responses)",
-    "personal_assistant_productization_plan": "proposed \u2014 8-10 week phased roadmap defined (companion MVP surfaces, voice reliability hardening, browser workflow reliability layer, onboarding 2.0 first-success funnel) with measurable exit gates.",
+    "personal_assistant_productization_plan": "in_progress \u2014 8-10 week phased roadmap active; Phase 3 browser workflow reliability layer shipped (wait/assert/extract/checkpoints + guardrails/retries/budgets). Remaining phases: companion MVP surfaces, voice reliability hardening, and onboarding 2.0 first-success funnel.",
    "subagents_support": "completed \u2014 subagent phases 1-3 shipped with `subagent.spawn/send/list/cancel/delete/summary`, per-child queue mode (`followup|interrupt`), budgets (`max_turns`, `max_total_tokens`, `turn_timeout_ms`), tool-profile overrides, trace-linked audit events, `/subagents` inspection commands, and focused regression tests."
  },
  "soul_md_and_cron_create": {