Document browser reliability layer and roadmap progress

William Valentin
2026-02-26 14:06:53 -08:00
parent 7c904ef0fd
commit e9873ad22b
6 changed files with 52 additions and 8 deletions
+13
@@ -951,12 +951,18 @@ Flynn ships these browser tools:
- `browser.click`
- `browser.type`
- `browser.content`
- `browser.wait_for`
- `browser.assert`
- `browser.extract`
- `browser.checkpoint.save`
- `browser.checkpoint.resume`
- `browser.eval`
- `browser.evaluate` (alias of `browser.eval`)
These tools are backed by a Puppeteer/CDP browser manager and are only registered when `browser.enabled: true`.
They can still be filtered out by tool policy (`tools.profile`, `tools.allow`, `tools.deny`).
At startup, Flynn logs the browser tools that remain available after policy filtering.
Browser runtime guardrails support domain allowlists, explicit high-risk-domain confirmation, retry controls, and a bounded workflow step budget.
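As a rough illustration of the domain guardrails described above, the following is a minimal sketch, not Flynn's actual implementation. The names `BrowserGuardrailConfig`, `matchesDomain`, and `checkNavigation` are hypothetical. Only the config key names and the `confirm_high_risk` flag come from the documentation. The wildcard semantics and the "empty allowlist means unrestricted" rule are assumptions.

```typescript
// Hypothetical shape: only the key names mirror the documented `browser.*` settings.
interface BrowserGuardrailConfig {
  allowed_domains: string[];
  high_risk_domains: string[];
  require_confirmation_for_high_risk: boolean;
}

type GuardrailDecision =
  | { allowed: true }
  | { allowed: false; reason: string };

// Assumed semantics: "*.example.com" matches any subdomain of example.com
// (but not example.com itself); any other pattern must match exactly.
function matchesDomain(hostname: string, pattern: string): boolean {
  const host = hostname.toLowerCase();
  const pat = pattern.toLowerCase();
  if (pat.startsWith('*.')) {
    return host.endsWith(pat.slice(1)); // compares against ".example.com"
  }
  return host === pat;
}

// Decide whether a URL may be visited. `confirmHighRisk` stands in for the
// `confirm_high_risk=true` argument mentioned in the protocol docs.
function checkNavigation(
  url: string,
  config: BrowserGuardrailConfig,
  confirmHighRisk = false,
): GuardrailDecision {
  const { hostname } = new URL(url); // throws on malformed URLs

  // Assumption: an empty allowlist means "no domain restriction".
  const restricted = config.allowed_domains.length > 0;
  if (restricted && !config.allowed_domains.some((p) => matchesDomain(hostname, p))) {
    return { allowed: false, reason: `domain not in allowed_domains: ${hostname}` };
  }

  const highRisk = config.high_risk_domains.some((p) => matchesDomain(hostname, p));
  if (highRisk && config.require_confirmation_for_high_risk && !confirmHighRisk) {
    return {
      allowed: false,
      reason: `high-risk domain requires confirm_high_risk=true: ${hostname}`,
    };
  }
  return { allowed: true };
}
```

In this sketch the allowlist is checked before the high-risk list, so a high-risk domain that is not allowlisted is rejected outright rather than offered for confirmation.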
```yaml
browser:
@@ -964,6 +970,13 @@ browser:
headless: true
max_pages: 5
default_timeout: 30000
allowed_domains: ["*.example.com"]
high_risk_domains: ["bank.example.com"]
require_confirmation_for_high_risk: true
max_workflow_steps: 120
default_retry_attempts: 1
max_retry_attempts: 5
retry_delay_ms: 250
# executable_path: /usr/bin/google-chrome
# ws_endpoint: ws://127.0.0.1:9222/devtools/browser/<id>
+7
@@ -1302,6 +1302,10 @@ Set callback for tool use events (for confirmation UI).
List available tools.
When browser automation is enabled, `tools.list` may include workflow-reliability helpers such as:
`browser.wait_for`, `browser.assert`, `browser.extract`, `browser.checkpoint.save`, and `browser.checkpoint.resume`
in addition to baseline navigation/click/type/content/eval tools.
**Request:**
```json
{
@@ -1338,6 +1342,9 @@ List available tools.
Execute a tool directly (bypass agent).
Browser workflow tools enforce runtime guardrails configured in `browser.*`:
domain allowlists, high-risk-domain confirmation (`confirm_high_risk=true`), retry bounds, and step-budget limits.
**Request:**
```json
{
+2 -2
@@ -25,7 +25,7 @@ Tools are executable capabilities that the AI agent can call to perform actions
- **File System**: `file.read`, `file.write`, `file.edit`, `file.list`
- **Shell/Process**: `shell.exec`, `process.start`, `process.kill`
- **Web**: `web.fetch`, `web.search`
- **Browser**: `browser.navigate`, `browser.screenshot`, `browser.click`, `browser.type`, `browser.content`, `browser.eval`, `browser.evaluate` (alias of `browser.eval`)
- **Browser**: `browser.navigate`, `browser.screenshot`, `browser.click`, `browser.type`, `browser.content`, `browser.wait_for`, `browser.assert`, `browser.extract`, `browser.checkpoint.save`, `browser.checkpoint.resume`, `browser.eval`, `browser.evaluate` (alias of `browser.eval`)
- **Memory**: `memory.read`, `memory.write`, `memory.search`
- **MinIO**: `minio.share`, `minio.ingest`, `minio.sync`
- **Kubernetes**: `k8s.pods`, `k8s.deployments`, `k8s.logs`
@@ -330,7 +330,7 @@ Use for tools that share a common dependency or manager.
import type { Tool, ToolResult } from '../../types.js';
import type { BrowserManager } from './manager.js';
export function createBrowserTools(manager: BrowserManager): Tool[] {
export function createBrowserTools(manager: BrowserManager, options?: BrowserToolsOptions): Tool[] {
return [
{
name: 'browser.navigate',
+1
@@ -266,6 +266,7 @@ Flynn treats content provenance as part of the control boundary:
- `web.fetch`, `web.search`, and `browser.content` outputs are treated as untrusted "fetched_content".
- Tool results are wrapped in provenance markers inside the tool loop.
- Once untrusted content is seen, ToolExecutor applies stricter gating (blocks obvious injection patterns for high-risk tools).
- Browser workflow tools add execution guardrails in the tool layer: `allowed_domains`, explicit high-risk confirmations, bounded retry policies, and step-budget enforcement.
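
The bounded retry policy and step-budget enforcement named in the last bullet could look roughly like the sketch below. It is illustrative only: `RetryConfig`, `StepBudget`, `resolveRetries`, and `runStep` are hypothetical names, and only the config keys (`default_retry_attempts`, `max_retry_attempts`, `retry_delay_ms`, `max_workflow_steps`) come from the documented `browser.*` block. The sketch assumes that "retry attempts" counts retries after the first try and that each workflow step consumes one unit of budget regardless of retries.

```typescript
// Hypothetical shape: only the key names mirror the documented `browser.*` settings.
interface RetryConfig {
  default_retry_attempts: number;
  max_retry_attempts: number;
  retry_delay_ms: number;
}

// Tracks how many workflow steps have run against `max_workflow_steps`.
class StepBudget {
  private used = 0;
  private readonly maxSteps: number;

  constructor(maxSteps: number) {
    this.maxSteps = maxSteps;
  }

  // Consume one step, or throw once the budget is exhausted.
  consume(): void {
    if (this.used >= this.maxSteps) {
      throw new Error(`workflow step budget exhausted (${this.maxSteps})`);
    }
    this.used += 1;
  }

  get remaining(): number {
    return this.maxSteps - this.used;
  }
}

// Clamp a caller-requested retry count into [0, max_retry_attempts],
// falling back to the configured default when none is given.
function resolveRetries(requested: number | undefined, config: RetryConfig): number {
  const wanted = requested ?? config.default_retry_attempts;
  return Math.min(Math.max(0, Math.floor(wanted)), config.max_retry_attempts);
}

// Run one workflow step with bounded retries and a fixed delay between tries.
async function runStep<T>(
  step: () => Promise<T>,
  budget: StepBudget,
  config: RetryConfig,
  requestedRetries?: number,
): Promise<T> {
  budget.consume(); // assumption: one budget unit per step, not per attempt
  const retries = resolveRetries(requestedRetries, config);
  let lastError: unknown;
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await step();
    } catch (err) {
      lastError = err;
      if (attempt < retries) {
        await new Promise<void>((resolve) => setTimeout(resolve, config.retry_delay_ms));
      }
    }
  }
  throw lastError; // every attempt failed: surface the last error
}
```

Clamping the requested count rather than rejecting it keeps a tool call from exceeding the configured ceiling while still letting it proceed, which is one plausible reading of "bounded retry policies".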
Key files:
@@ -18,6 +18,7 @@ If you only want the protocol surface, see `docs/api/PROTOCOL.md`.
- Run lifecycle/cancel intent and reaction decisions are emitted to audit logs, and aggregated into `system.metrics` counters (runStates, cancelLatencyMs, reactions) for dashboards.
- Reaction matching is deterministic (priority + cooldown + recursion guard) before intent/agent routing.
- `subagent.*` tools create child orchestrators scoped to the parent conversation (`subagent:<parentSessionId>:<childId>`) with idle TTL cleanup, per-child queue mode (`followup|interrupt`), and session budgets (turn/token/timeout); this is tool-loop behavior, not a separate gateway RPC session lane.
- Browser workflow reliability primitives (`browser.wait_for/assert/extract/checkpoint.*`) execute in the same queued session lane and apply browser-config guardrails (domain allowlist/high-risk confirmation, bounded retries, workflow step budget).
- Companion `node.*` registration is per WebSocket connection; reconnects must re-register capabilities before invoking node RPC methods.
- Canvas artifacts are persisted per session under the gateway data directory for UI recovery across restarts.
- TTS output is best-effort; synthesis failures fall back to text-only responses.
+28 -6
@@ -6786,15 +6786,37 @@
"test_status": "docs only"
},
"personal-assistant-productization-plan-2026-02-26": {
"status": "proposed",
"status": "in_progress",
"date": "2026-02-26",
"updated": "2026-02-26",
"summary": "Rebaselined Flynn's OpenClaw-style personal-assistant gaps and defined an execution-ready 8-10 week productization roadmap focused on shipped companion apps, voice daily-driver reliability, browser workflow reliability, and onboarding first-success funnel metrics.",
"summary": "Rebaselined Flynn's OpenClaw-style personal-assistant gaps and defined an execution-ready 8-10 week roadmap. Phase 3 browser reliability work is now shipped (workflow primitives, retry/budget/guardrails, checkpoints), with companion/voice/onboarding phases remaining.",
"files_modified": [
"docs/plans/2026-02-26-personal-assistant-productization-plan.md",
"docs/plans/state.json"
],
"test_status": "planning/docs update only; no runtime code changes"
"test_status": "roadmap status updated; implementation tracked in phase-specific entries"
},
"personal-assistant-productization-phase3-browser-reliability": {
"status": "completed",
"date": "2026-02-26",
"updated": "2026-02-26",
"summary": "Implemented Phase 3 browser workflow reliability layer: added `browser.wait_for`, `browser.assert`, `browser.extract`, checkpoint save/resume tools, retry wrappers, domain allowlist + high-risk confirmation guardrails, and bounded workflow-step budgets wired through config and daemon registration.",
"files_modified": [
"src/tools/builtin/browser/tools.ts",
"src/tools/builtin/browser/tools.test.ts",
"src/daemon/tools.ts",
"src/tools/policy.ts",
"src/config/schema.ts",
"src/config/schema.test.ts",
"config/default.yaml",
"README.md",
"docs/api/TOOLS.md",
"docs/api/PROTOCOL.md",
"docs/architecture/AGENT_DIAGRAM.md",
"docs/architecture/GATEWAY_SESSIONS_AND_QUEUE.md",
"docs/plans/state.json"
],
"test_status": "pnpm test:run src/tools/builtin/browser/tools.test.ts src/config/schema.test.ts src/tools/policy.test.ts + pnpm typecheck passing"
},
"subagents-support-phase1": {
"status": "completed",
@@ -6830,7 +6852,7 @@
}
},
"overall_progress": {
"total_test_count": 2534,
"total_test_count": 2544,
"all_tests_passing": true,
"p0_completion": "3/3 (100%)",
"p1_completion": "4/4 (100%)",
@@ -6845,7 +6867,7 @@
"tier2_completion": "4/4 (100%) \u2014 inbound webhooks, vector memory search, Dockerfile, heartbeat monitor",
"tier3_completion": "5/5 (100%) \u2014 lane queue, credential redaction, web UI token dashboard, xAI (Grok) provider, Voyage AI embeddings",
"tier4_completion": "4/4 (100%) \u2014 gateway lock, shell completion, Tailscale Serve/Funnel, DM pairing codes",
"feature_gap_scorecard": "rebaselined 2026-02-26 — channel breadth, setup wizard, baseline browser automation, and full subagent support (`subagent.*` + queue modes + budgets + trace/audit + `/subagents` inspection) are implemented; remaining high-impact personal-assistant gaps center on shipped companion apps (desktop/mobile), voice UX polish, browser workflow reliability primitives, and first-success onboarding funnel optimization.",
"feature_gap_scorecard": "rebaselined 2026-02-26 and updated 2026-02-26 (phase 3) — channel breadth, setup wizard, baseline browser automation, subagent controls, and browser workflow reliability primitives (wait/assert/extract/retries/checkpoints/guardrails/budgets) are implemented; remaining high-impact personal-assistant gaps center on shipped companion apps (desktop/mobile), voice UX polish, and first-success onboarding funnel optimization.",
"operator_dx_milestone": "Phase 3 (Live Ops Dashboard): 2/2 plans complete \u2014 milestone done",
"dashboard_observability": "completed \u2014 service health graphs + core service log viewer added to web UI via observability RPCs and bounded backend sampling",
"gmail_auth_cli": "flynn gmail-auth command implemented with OAuth2 flow, doctor check, config routed to Telegram",
@@ -6878,7 +6900,7 @@
"deeper_surfaces_phase3_companion_canvas_voice": "completed \u2014 companion reconnect resilience (auto-reconnect with backoff, pending-wait cancellation on disconnect), canvas artifact persistence (SQLite-backed store, daemon-restart durability), voice TTS fallback coverage (text-only reply on TTS failure, no dropped responses)",
"deeper_surfaces_phase4_rollout": "completed \u2014 phase 4 rollout and operator readiness plan documented: canary rollout plan by feature flag/surface, explicit rollback playbook, operator docs and architecture/protocol docs synchronized",
"post_phase_test_fixes": "completed \u2014 fixed 4 test failures introduced by phases 1-3: iOS/Android push listNodes (missing publishHeartbeat before platform-filtered query), server.test agent.send (run_state events now precede done; added sendAndWaitForDone helper), httpBody 413 (req.destroy() closed socket before response could be sent; replaced with Connection: close header on 413 responses)",
"personal_assistant_productization_plan": "proposed \u2014 8-10 week phased roadmap defined (companion MVP surfaces, voice reliability hardening, browser workflow reliability layer, onboarding 2.0 first-success funnel) with measurable exit gates.",
"personal_assistant_productization_plan": "in_progress \u2014 8-10 week phased roadmap active; Phase 3 browser workflow reliability layer shipped (wait/assert/extract/checkpoints + guardrails/retries/budgets). Remaining phases: companion MVP surfaces, voice reliability hardening, and onboarding 2.0 first-success funnel.",
"subagents_support": "completed \u2014 subagent phases 1-3 shipped with `subagent.spawn/send/list/cancel/delete/summary`, per-child queue mode (`followup|interrupt`), budgets (`max_turns`, `max_total_tokens`, `turn_timeout_ms`), tool-profile overrides, trace-linked audit events, `/subagents` inspection commands, and focused regression tests."
},
"soul_md_and_cron_create": {