Document browser reliability layer and roadmap progress

This commit is contained in:
William Valentin
2026-02-26 14:06:53 -08:00
parent 7c904ef0fd
commit e9873ad22b
6 changed files with 52 additions and 8 deletions
+13
View File
@@ -951,12 +951,18 @@ Flynn ships these browser tools:
- `browser.click` - `browser.click`
- `browser.type` - `browser.type`
- `browser.content` - `browser.content`
- `browser.wait_for`
- `browser.assert`
- `browser.extract`
- `browser.checkpoint.save`
- `browser.checkpoint.resume`
- `browser.eval` - `browser.eval`
- `browser.evaluate` (alias of `browser.eval`) - `browser.evaluate` (alias of `browser.eval`)
These tools are backed by a Puppeteer/CDP browser manager and are only registered when `browser.enabled: true`. These tools are backed by a Puppeteer/CDP browser manager and are only registered when `browser.enabled: true`.
They can still be filtered out by tool policy (`tools.profile`, `tools.allow`, `tools.deny`). They can still be filtered out by tool policy (`tools.profile`, `tools.allow`, `tools.deny`).
At startup, Flynn logs the browser tools that remain available after policy filtering. At startup, Flynn logs the browser tools that remain available after policy filtering.
Browser runtime guardrails support domain allowlists, explicit high-risk-domain confirmation, retry controls, and a bounded workflow step budget.
```yaml ```yaml
browser: browser:
@@ -964,6 +970,13 @@ browser:
headless: true headless: true
max_pages: 5 max_pages: 5
default_timeout: 30000 default_timeout: 30000
allowed_domains: ["*.example.com"]
high_risk_domains: ["bank.example.com"]
require_confirmation_for_high_risk: true
max_workflow_steps: 120
default_retry_attempts: 1
max_retry_attempts: 5
retry_delay_ms: 250
# executable_path: /usr/bin/google-chrome # executable_path: /usr/bin/google-chrome
# ws_endpoint: ws://127.0.0.1:9222/devtools/browser/<id> # ws_endpoint: ws://127.0.0.1:9222/devtools/browser/<id>
+7
View File
@@ -1302,6 +1302,10 @@ Set callback for tool use events (for confirmation UI).
List available tools. List available tools.
When browser automation is enabled, `tools.list` may include workflow-reliability helpers such as:
`browser.wait_for`, `browser.assert`, `browser.extract`, `browser.checkpoint.save`, and `browser.checkpoint.resume`
in addition to baseline navigation/click/type/content/eval tools.
**Request:** **Request:**
```json ```json
{ {
@@ -1338,6 +1342,9 @@ List available tools.
Execute a tool directly (bypass agent). Execute a tool directly (bypass agent).
Browser workflow tools enforce runtime guardrails configured in `browser.*`:
domain allowlists, high-risk-domain confirmation (`confirm_high_risk=true`), retry bounds, and step-budget limits.
**Request:** **Request:**
```json ```json
{ {
+2 -2
View File
@@ -25,7 +25,7 @@ Tools are executable capabilities that the AI agent can call to perform actions
- **File System**: `file.read`, `file.write`, `file.edit`, `file.list` - **File System**: `file.read`, `file.write`, `file.edit`, `file.list`
- **Shell/Process**: `shell.exec`, `process.start`, `process.kill` - **Shell/Process**: `shell.exec`, `process.start`, `process.kill`
- **Web**: `web.fetch`, `web.search` - **Web**: `web.fetch`, `web.search`
- **Browser**: `browser.navigate`, `browser.screenshot`, `browser.click`, `browser.type`, `browser.content`, `browser.eval`, `browser.evaluate` (alias of `browser.eval`) - **Browser**: `browser.navigate`, `browser.screenshot`, `browser.click`, `browser.type`, `browser.content`, `browser.wait_for`, `browser.assert`, `browser.extract`, `browser.checkpoint.save`, `browser.checkpoint.resume`, `browser.eval`, `browser.evaluate` (alias of `browser.eval`)
- **Memory**: `memory.read`, `memory.write`, `memory.search` - **Memory**: `memory.read`, `memory.write`, `memory.search`
- **MinIO**: `minio.share`, `minio.ingest`, `minio.sync` - **MinIO**: `minio.share`, `minio.ingest`, `minio.sync`
- **Kubernetes**: `k8s.pods`, `k8s.deployments`, `k8s.logs` - **Kubernetes**: `k8s.pods`, `k8s.deployments`, `k8s.logs`
@@ -330,7 +330,7 @@ Use for tools that share a common dependency or manager.
import type { Tool, ToolResult } from '../../types.js'; import type { Tool, ToolResult } from '../../types.js';
import type { BrowserManager } from './manager.js'; import type { BrowserManager } from './manager.js';
export function createBrowserTools(manager: BrowserManager): Tool[] { export function createBrowserTools(manager: BrowserManager, options?: BrowserToolsOptions): Tool[] {
return [ return [
{ {
name: 'browser.navigate', name: 'browser.navigate',
+1
View File
@@ -266,6 +266,7 @@ Flynn treats content provenance as part of the control boundary:
- `web.fetch`, `web.search`, and `browser.content` outputs are treated as untrusted "fetched_content". - `web.fetch`, `web.search`, and `browser.content` outputs are treated as untrusted "fetched_content".
- Tool results are wrapped in provenance markers inside the tool loop. - Tool results are wrapped in provenance markers inside the tool loop.
- Once untrusted content is seen, ToolExecutor applies stricter gating (blocks obvious injection patterns for high-risk tools). - Once untrusted content is seen, ToolExecutor applies stricter gating (blocks obvious injection patterns for high-risk tools).
- Browser workflow tools add execution guardrails in the tool layer: `allowed_domains`, explicit high-risk confirmations, bounded retry policies, and step-budget enforcement.
Key files: Key files:
@@ -18,6 +18,7 @@ If you only want the protocol surface, see `docs/api/PROTOCOL.md`.
- Run lifecycle/cancel intent and reaction decisions are emitted to audit logs, and aggregated into `system.metrics` counters (runStates, cancelLatencyMs, reactions) for dashboards. - Run lifecycle/cancel intent and reaction decisions are emitted to audit logs, and aggregated into `system.metrics` counters (runStates, cancelLatencyMs, reactions) for dashboards.
- Reaction matching is deterministic (priority + cooldown + recursion guard) before intent/agent routing. - Reaction matching is deterministic (priority + cooldown + recursion guard) before intent/agent routing.
- `subagent.*` tools create child orchestrators scoped to the parent conversation (`subagent:<parentSessionId>:<childId>`) with idle TTL cleanup, per-child queue mode (`followup|interrupt`), and session budgets (turn/token/timeout); this is tool-loop behavior, not a separate gateway RPC session lane. - `subagent.*` tools create child orchestrators scoped to the parent conversation (`subagent:<parentSessionId>:<childId>`) with idle TTL cleanup, per-child queue mode (`followup|interrupt`), and session budgets (turn/token/timeout); this is tool-loop behavior, not a separate gateway RPC session lane.
- Browser workflow reliability primitives (`browser.wait_for/assert/extract/checkpoint.*`) execute in the same queued session lane and apply browser-config guardrails (domain allowlist/high-risk confirmation, bounded retries, workflow step budget).
- Companion `node.*` registration is per WebSocket connection; reconnects must re-register capabilities before invoking node RPC methods. - Companion `node.*` registration is per WebSocket connection; reconnects must re-register capabilities before invoking node RPC methods.
- Canvas artifacts are persisted per session under the gateway data directory for UI recovery across restarts. - Canvas artifacts are persisted per session under the gateway data directory for UI recovery across restarts.
- TTS output is best-effort; synthesis failures fall back to text-only responses. - TTS output is best-effort; synthesis failures fall back to text-only responses.
+28 -6
View File
@@ -6786,15 +6786,37 @@
"test_status": "docs only" "test_status": "docs only"
}, },
"personal-assistant-productization-plan-2026-02-26": { "personal-assistant-productization-plan-2026-02-26": {
"status": "proposed", "status": "in_progress",
"date": "2026-02-26", "date": "2026-02-26",
"updated": "2026-02-26", "updated": "2026-02-26",
"summary": "Rebaselined Flynn's OpenClaw-style personal-assistant gaps and defined an execution-ready 8-10 week productization roadmap focused on shipped companion apps, voice daily-driver reliability, browser workflow reliability, and onboarding first-success funnel metrics.", "summary": "Rebaselined Flynn's OpenClaw-style personal-assistant gaps and defined an execution-ready 8-10 week roadmap. Phase 3 browser reliability work is now shipped (workflow primitives, retry/budget/guardrails, checkpoints), with companion/voice/onboarding phases remaining.",
"files_modified": [ "files_modified": [
"docs/plans/2026-02-26-personal-assistant-productization-plan.md", "docs/plans/2026-02-26-personal-assistant-productization-plan.md",
"docs/plans/state.json" "docs/plans/state.json"
], ],
"test_status": "planning/docs update only; no runtime code changes" "test_status": "roadmap status updated; implementation tracked in phase-specific entries"
},
"personal-assistant-productization-phase3-browser-reliability": {
"status": "completed",
"date": "2026-02-26",
"updated": "2026-02-26",
"summary": "Implemented Phase 3 browser workflow reliability layer: added `browser.wait_for`, `browser.assert`, `browser.extract`, checkpoint save/resume tools, retry wrappers, domain allowlist + high-risk confirmation guardrails, and bounded workflow-step budgets wired through config and daemon registration.",
"files_modified": [
"src/tools/builtin/browser/tools.ts",
"src/tools/builtin/browser/tools.test.ts",
"src/daemon/tools.ts",
"src/tools/policy.ts",
"src/config/schema.ts",
"src/config/schema.test.ts",
"config/default.yaml",
"README.md",
"docs/api/TOOLS.md",
"docs/api/PROTOCOL.md",
"docs/architecture/AGENT_DIAGRAM.md",
"docs/architecture/GATEWAY_SESSIONS_AND_QUEUE.md",
"docs/plans/state.json"
],
"test_status": "pnpm test:run src/tools/builtin/browser/tools.test.ts src/config/schema.test.ts src/tools/policy.test.ts + pnpm typecheck passing"
}, },
"subagents-support-phase1": { "subagents-support-phase1": {
"status": "completed", "status": "completed",
@@ -6830,7 +6852,7 @@
} }
}, },
"overall_progress": { "overall_progress": {
"total_test_count": 2534, "total_test_count": 2544,
"all_tests_passing": true, "all_tests_passing": true,
"p0_completion": "3/3 (100%)", "p0_completion": "3/3 (100%)",
"p1_completion": "4/4 (100%)", "p1_completion": "4/4 (100%)",
@@ -6845,7 +6867,7 @@
"tier2_completion": "4/4 (100%) \u2014 inbound webhooks, vector memory search, Dockerfile, heartbeat monitor", "tier2_completion": "4/4 (100%) \u2014 inbound webhooks, vector memory search, Dockerfile, heartbeat monitor",
"tier3_completion": "5/5 (100%) \u2014 lane queue, credential redaction, web UI token dashboard, xAI (Grok) provider, Voyage AI embeddings", "tier3_completion": "5/5 (100%) \u2014 lane queue, credential redaction, web UI token dashboard, xAI (Grok) provider, Voyage AI embeddings",
"tier4_completion": "4/4 (100%) \u2014 gateway lock, shell completion, Tailscale Serve/Funnel, DM pairing codes", "tier4_completion": "4/4 (100%) \u2014 gateway lock, shell completion, Tailscale Serve/Funnel, DM pairing codes",
"feature_gap_scorecard": "rebaselined 2026-02-26 — channel breadth, setup wizard, baseline browser automation, and full subagent support (`subagent.*` + queue modes + budgets + trace/audit + `/subagents` inspection) are implemented; remaining high-impact personal-assistant gaps center on shipped companion apps (desktop/mobile), voice UX polish, browser workflow reliability primitives, and first-success onboarding funnel optimization.", "feature_gap_scorecard": "rebaselined 2026-02-26 and updated 2026-02-26 (phase 3) — channel breadth, setup wizard, baseline browser automation, subagent controls, and browser workflow reliability primitives (wait/assert/extract/retries/checkpoints/guardrails/budgets) are implemented; remaining high-impact personal-assistant gaps center on shipped companion apps (desktop/mobile), voice UX polish, and first-success onboarding funnel optimization.",
"operator_dx_milestone": "Phase 3 (Live Ops Dashboard): 2/2 plans complete \u2014 milestone done", "operator_dx_milestone": "Phase 3 (Live Ops Dashboard): 2/2 plans complete \u2014 milestone done",
"dashboard_observability": "completed \u2014 service health graphs + core service log viewer added to web UI via observability RPCs and bounded backend sampling", "dashboard_observability": "completed \u2014 service health graphs + core service log viewer added to web UI via observability RPCs and bounded backend sampling",
"gmail_auth_cli": "flynn gmail-auth command implemented with OAuth2 flow, doctor check, config routed to Telegram", "gmail_auth_cli": "flynn gmail-auth command implemented with OAuth2 flow, doctor check, config routed to Telegram",
@@ -6878,7 +6900,7 @@
"deeper_surfaces_phase3_companion_canvas_voice": "completed \u2014 companion reconnect resilience (auto-reconnect with backoff, pending-wait cancellation on disconnect), canvas artifact persistence (SQLite-backed store, daemon-restart durability), voice TTS fallback coverage (text-only reply on TTS failure, no dropped responses)", "deeper_surfaces_phase3_companion_canvas_voice": "completed \u2014 companion reconnect resilience (auto-reconnect with backoff, pending-wait cancellation on disconnect), canvas artifact persistence (SQLite-backed store, daemon-restart durability), voice TTS fallback coverage (text-only reply on TTS failure, no dropped responses)",
"deeper_surfaces_phase4_rollout": "completed \u2014 phase 4 rollout and operator readiness plan documented: canary rollout plan by feature flag/surface, explicit rollback playbook, operator docs and architecture/protocol docs synchronized", "deeper_surfaces_phase4_rollout": "completed \u2014 phase 4 rollout and operator readiness plan documented: canary rollout plan by feature flag/surface, explicit rollback playbook, operator docs and architecture/protocol docs synchronized",
"post_phase_test_fixes": "completed \u2014 fixed 4 test failures introduced by phases 1-3: iOS/Android push listNodes (missing publishHeartbeat before platform-filtered query), server.test agent.send (run_state events now precede done; added sendAndWaitForDone helper), httpBody 413 (req.destroy() closed socket before response could be sent; replaced with Connection: close header on 413 responses)", "post_phase_test_fixes": "completed \u2014 fixed 4 test failures introduced by phases 1-3: iOS/Android push listNodes (missing publishHeartbeat before platform-filtered query), server.test agent.send (run_state events now precede done; added sendAndWaitForDone helper), httpBody 413 (req.destroy() closed socket before response could be sent; replaced with Connection: close header on 413 responses)",
"personal_assistant_productization_plan": "proposed \u2014 8-10 week phased roadmap defined (companion MVP surfaces, voice reliability hardening, browser workflow reliability layer, onboarding 2.0 first-success funnel) with measurable exit gates.", "personal_assistant_productization_plan": "in_progress \u2014 8-10 week phased roadmap active; Phase 3 browser workflow reliability layer shipped (wait/assert/extract/checkpoints + guardrails/retries/budgets). Remaining phases: companion MVP surfaces, voice reliability hardening, and onboarding 2.0 first-success funnel.",
"subagents_support": "completed \u2014 subagent phases 1-3 shipped with `subagent.spawn/send/list/cancel/delete/summary`, per-child queue mode (`followup|interrupt`), budgets (`max_turns`, `max_total_tokens`, `turn_timeout_ms`), tool-profile overrides, trace-linked audit events, `/subagents` inspection commands, and focused regression tests." "subagents_support": "completed \u2014 subagent phases 1-3 shipped with `subagent.spawn/send/list/cancel/delete/summary`, per-child queue mode (`followup|interrupt`), budgets (`max_turns`, `max_total_tokens`, `turn_timeout_ms`), tool-profile overrides, trace-linked audit events, `/subagents` inspection commands, and focused regression tests."
}, },
"soul_md_and_cron_create": { "soul_md_and_cron_create": {