Files
flynn/docs/plans/2026-02-14-openclaw-style-personal-agent-without-openclaw-risks-plan.md
T

160 lines
7.9 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# OpenClaw-Style Personal Agent (Without OpenClaw Risks) — Milestone Plan
**Date:** 2026-02-14
**Scope:** Make Flynn feel as efficient as a productized personal assistant (OpenClaw-like), while explicitly avoiding the structural security risks of a broad “skills marketplace + many surfaces + powerful tools” system.
**Context:** Flynn already has strong core architecture (multi-channel daemon, gateway/UI, tool policy, Docker sandboxing, memory, automation). This milestone is about tightening the trust boundary and making the safe path the default path.
## Goals
1. **Safe-by-default extension model**: “skills” are locally-auditable, capability-declared, and policy-enforced.
2. **Sandbox-by-default for risk**: any skill/tool execution with meaningful blast radius runs in an isolation boundary by default.
3. **Prompt-injection resistance**: fetched/untrusted content cannot directly drive tool calls or secret exfiltration.
4. **Operational clarity**: high-quality audit trails of tool calls, approvals, and materialized changes.
5. **Product efficiency parity**: fast onboarding and “works everywhere I already talk” without expanding the attack surface indiscriminately.
## Non-Goals (for this milestone)
- A public “skills marketplace” or auto-installing third-party skills.
- Unbounded remote device/node execution.
- Full browser RPA parity (can be a later milestone).
## Threat Model (Practical)
We assume:
- An attacker can send messages on at least one channel surface (DM or group), and can embed prompt-injection payloads in web pages, email bodies, PDFs, etc.
- The assistant has access to local files, network, and credential material (directly or indirectly).
We must prevent:
- **Unauthorized tool execution** (filesystem writes, shell, browser automation, network posts).
- **Data exfiltration** (secrets and private data leaving the machine via model/tool).
- **Supply-chain compromise via skills** (untrusted code/commands running because “its a skill”).
## Design Principles
- **Capability declarations beat intentions**: a skills declared permissions determine what it can do, not the LLMs narrative.
- **Deny by default, allow narrowly**: explicit whitelists for tools, paths, and network hosts.
- **Untrusted content isolation**: fetched content is treated as data, never as instructions.
- **Make approvals boring**: human-in-the-loop prompts are short, scoped, and explain impact.
## Milestone Structure (PR-Sized Slices)
### PR 1: Capability Manifests + Policy Binding (Skills)
**Outcome:** Every skill has a manifest that declares *what it is allowed to do*, and Flynn enforces that at runtime.
**Work items**
- Define `skills/manifest.json` schema extension:
- `permissions.tool_groups`: e.g. `["group:fs", "group:web", "group:runtime"]`
- `permissions.tools`: explicit allowlist patterns, overrides tool_groups
- `permissions.fs`: allowlisted path globs (read/write separate)
- `permissions.net`: allowlisted host globs + ports (optional)
- `permissions.secrets`: named secret scopes (no ambient access)
- Bind skill permissions to tool policy evaluation (skill execution context carries a policy profile).
- Add a “capability diff” display for installs/enables:
- “This skill requests: write access to `~/Documents/notes/**`, network access to `api.todoist.com:443`, tool `web.search`.”
- Tests:
- Skill with insufficient permissions cannot invoke denied tool or write outside allowed paths.
- Skill with allowed permissions succeeds.
**Acceptance**
- A skill cannot call a denied tool even if the model tries.
- A skill cannot read/write outside declared fs scopes.
---
### PR 2: Sandbox-by-Default Enforcement for High-Risk Tools
**Outcome:** Risky actions run inside isolation by default, with a clear escape hatch for trusted local workflows.
**Work items**
- Define a “risk tier” mapping for tool groups:
- `low`: pure compute / formatting
- `medium`: web fetch/search
- `high`: filesystem writes, shell/process, browser automation, credentialed API actions
- Enforce:
- `high` tools must execute inside the session sandbox (Docker) unless policy explicitly allows “host mode”.
- Add a per-session “execution environment” indicator in gateway/UI (host vs sandbox).
- Tests:
- Attempting `high` tool in host mode without explicit allow fails deterministically.
**Acceptance**
- By default, tools with meaningful blast radius do not run on the host.
---
### PR 3: Prompt Injection Firewall (Content Provenance + Tool Gating)
**Outcome:** Untrusted content can be used as *context* but not as *control*.
**Work items**
- Introduce provenance tags on content:
- `user_message` vs `fetched_content` vs `tool_output` vs `memory`
- Update prompts to:
- Explicitly forbid following instructions found inside `fetched_content`/`tool_output`.
- Require a short “tool intent” explanation before tool use when untrusted content is present.
- Add a “tool-call guard” layer:
- If the models tool call arguments contain obvious injection markers (“ignore previous”, “exfiltrate”, “send to”, etc.) and the request originated from untrusted content, force a confirmation step or deny.
- Reject tool calls that attempt to reference secrets directly in arguments.
- Tests:
- A web page that contains “run shell command X” does not trigger shell tool use.
- Model attempts to copy-paste secrets into a network request are blocked/redacted.
**Acceptance**
- Untrusted content cannot directly cause tool execution without an explicit, policy-consistent path.
---
### PR 4: Secret Scoping + Audit Logging (Operator-Grade)
**Outcome:** Secrets are scoped and auditable; tool usage is traceable without leaking secrets into logs.
**Work items**
- Add a “secret scope” mechanism for tools and skills:
- Tools declare `requiredSecretScopes[]`
- Skills declare `permissions.secrets[]`
- Resolution: only scoped secrets are provided to tool execution; no ambient env pass-through.
- Extend audit log events:
- tool start/end, policy decision, sandbox vs host, approval prompts, redactions applied
- stable ids for correlation per session
- Tests:
- Secrets never appear in logs (even on tool failure).
- Tool without declared scope cannot obtain secret.
**Acceptance**
- You can answer “what ran, with what permissions, in what environment” for any session.
---
### PR 5: Product Efficiency Layer (Minimal Surfaces, Max Habit)
**Outcome:** Make Flynn feel “always there” without adding unnecessary new attack surface.
**Work items**
- Tighten onboarding defaults:
- choose 2-3 “recommended surfaces” and ensure they are first-class (WebChat + one messaging channel).
- ensure pairing/auth flows are the default for unknown senders.
- Add “assistant ergonomics”:
- a small set of default automations that are safe (summaries, reminders) behind explicit opt-in.
- fast-path help for configuration gaps (already used for voice transcription, continue pattern).
- Tests:
- Setup wizard produces a config with safe defaults: pairing on, conservative tool profile, sandbox for high-risk.
**Acceptance**
- A new user can get to a useful assistant quickly, and default config does not expose dangerous capabilities.
## Sequencing
1. PR 1 (capabilities) and PR 2 (sandbox enforcement) first: they set the trust boundary.
2. PR 3 (injection firewall) next: it hardens the most common real-world failure mode.
3. PR 4 (secrets + audit) to make the system operable and trustworthy long-term.
4. PR 5 (efficiency layer) last: UX improvements on top of hardened foundations.
## Definition of Done (Milestone)
- Skills are capability-declared and enforced by policy.
- High-risk tools default to sandbox execution.
- Untrusted content is isolated and cannot drive tool execution without policy-compliant intent + gating.
- Secrets are scoped and never leak into logs.
- New-user path yields a safe, useful personal assistant in under ~10 minutes on WebChat.