docs: add safety docs and OpenClaw gap roadmap

2026-02-15 10:17:07 -08:00
parent 28304ac397
commit f2cdd1abd2
14 changed files with 3869 additions and 40 deletions
@@ -0,0 +1,240 @@
+# Safe-By-Default Personal Agent
+
+This document describes Flynn's "OpenClaw-style" safety boundary: how skills declare capabilities, how those capabilities are enforced at runtime, how high-risk execution is sandboxed by default, how prompt injection is mitigated, and what gets logged (without leaking secrets).
+
+If you're looking for API-level tool contracts, see `docs/api/TOOLS.md`.
+
+## Overview
+
+Flynn is built around a strict separation of:
+
+- **Conversation** (LLM output)
+- **Capabilities** (tools)
+- **Policy** (what tools are allowed, under what conditions)
+
+This milestone adds a skill capability layer and hardens the tool loop.
+
+Core principles:
+
+- Capability declarations beat intentions: skills get only what they declare.
+- Deny by default: a skill without a `permissions` manifest has no tool access.
+- Treat fetched/tool content as untrusted data, not instructions.
+- Never leak secrets into audit logs.
+
+## Skills: Capability Manifests
+
+Each skill lives in a directory with:
+
+- `SKILL.md` (instructions injected into the system prompt)
+- `manifest.json` (metadata + optional capabilities)
+
+The capability declaration is `manifest.json.permissions`.
+
+See: `src/skills/types.ts`.
+
+### `permissions` Schema (manifest.json)
+
+```json
+{
+  "permissions": {
+    "tool_groups": ["group:web", "group:memory"],
+    "tools": ["web.fetch", "web.search"],
+    "fs": {
+      "read": ["/home/will/Documents/**"],
+      "write": ["/home/will/Documents/notes/**"]
+    },
+    "net": [
+      { "host": "api.todoist.com", "ports": [443] },
+      { "host": "*.github.com", "ports": [443] }
+    ],
+    "secrets": ["gmail", "web_search"],
+    "execution_environment": "sandbox"
+  }
+}
+```
+
+Fields:
+
+- `tool_groups`: tool-group allowlist using names from `src/tools/policy.ts` (`group:web`, `group:fs`, etc.)
+- `tools`: explicit tool-name/pattern allowlist (glob). If present, it overrides `tool_groups`.
+- `fs.read` / `fs.write`: allowed path globs (checked for `file.*` tools).
+- `net`: allowed hosts (glob) and optional port list (best-effort enforcement for `web.fetch`).
+- `secrets`: secret scopes allowed for this skill (used to gate credentialed tools).
+- `execution_environment`: `sandbox` (default) or `host` (escape hatch for high-risk operations).
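
As a reading aid, the fields above can be written down as a TypeScript shape. This is only a sketch: the authoritative definitions live in `src/skills/types.ts`, and the names `SkillPermissions`, `NetRule`, and `withDefaults` are assumptions made for illustration.

```typescript
// Hypothetical types mirroring the documented fields; not copied from src/skills/types.ts.
type ExecutionEnvironment = "sandbox" | "host";

interface NetRule {
  host: string; // host glob, e.g. "*.github.com"
  ports?: number[]; // optional port allowlist
}

interface SkillPermissions {
  tool_groups?: string[];
  tools?: string[];
  fs?: { read?: string[]; write?: string[] };
  net?: NetRule[];
  secrets?: string[];
  execution_environment?: ExecutionEnvironment;
}

// Apply the documented default: an omitted execution_environment means "sandbox".
function withDefaults(
  p: SkillPermissions,
): SkillPermissions & { execution_environment: ExecutionEnvironment } {
  return { ...p, execution_environment: p.execution_environment ?? "sandbox" };
}
```

Every field is optional in this sketch, so an empty `permissions` object is valid and grants nothing beyond the sandbox default.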
+
+### Backward Compatibility
+
+Skills without `permissions` still load, but:
+
+- If a skill is activated (via routing) and it has no `permissions` block, **it has no tool access**.
+- This is deliberate: skills should be auditable capability packages.
+
+## Runtime Enforcement
+
+Enforcement happens in two places:
+
+1. **Tool listing / exposure** (ToolPolicy)
+2. **Tool execution** (ToolExecutor), as defense in depth
+
+### ToolPolicy: Restricting Available Tools
+
+When a skill context is active, the tool allow set is intersected with the skill's declared allowlist.
+
+See: `src/tools/policy.ts`.
+
+Important behaviors:
+
+- If `skillName` is set but `skillPermissions` is missing, ToolPolicy returns an empty allowed set.
+- If `permissions.tools` is present, it overrides `permissions.tool_groups`.
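
The two behaviors above could be sketched roughly as follows. The function names, the `groups` map, and its contents are hypothetical. The real logic is in `src/tools/policy.ts` and may differ in detail.

```typescript
interface SkillPermissions {
  tool_groups?: string[];
  tools?: string[];
}

// Convert a simple glob such as "web.*" into an anchored RegExp; only "*" is special.
function globToRegExp(glob: string): RegExp {
  const escaped = glob.replace(/[.+?^${}()|[\]\\]/g, "\\$&").replace(/\*/g, ".*");
  return new RegExp(`^${escaped}$`);
}

// Intersect the base allow set with what the active skill declares.
function allowedTools(
  baseAllowed: string[],
  groups: Record<string, string[]>, // hypothetical: group name -> tool-name patterns
  skillName?: string,
  perms?: SkillPermissions,
): string[] {
  if (skillName === undefined) return baseAllowed; // no skill context: nothing to intersect
  if (perms === undefined) return []; // skill without permissions: deny by default

  let patterns: string[] = [];
  if (perms.tools !== undefined) {
    patterns = perms.tools; // explicit tools override tool_groups
  } else {
    for (const group of perms.tool_groups ?? []) {
      patterns = patterns.concat(groups[group] ?? []);
    }
  }
  const regexes = patterns.map(globToRegExp);
  return baseAllowed.filter((tool) => regexes.some((re) => re.test(tool)));
}
```

Because the result is always a subset of `baseAllowed`, a skill manifest can narrow the tool set but never widen it.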
+
+### ToolExecutor: Enforcing Paths, Network, Secrets, and Injection Guards
+
+See: `src/tools/executor.ts`.
+
+When a skill context is active (`ToolPolicyContext.skillName`):
+
+- Filesystem writes are blocked outside `permissions.fs.write`.
+- Filesystem reads are blocked outside `permissions.fs.read` (for `file.read`/`file.list`).
+- Credentialed tools require every scope in their `requiredSecretScopes` to be present in the skill's allowed scopes.
+- If untrusted content has been seen, high-risk tool calls whose arguments contain obviously malicious markers are blocked.
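
A minimal sketch of the path check described above, assuming absolute POSIX paths and a glob dialect where `**` crosses directory boundaries and `*` stays within one segment. The helper names are hypothetical, symlinks are not resolved here, and the actual matcher in `src/tools/executor.ts` may behave differently.

```typescript
// Resolve "." and ".." segments so "/a/b/../c" is checked as "/a/c".
function normalizePosix(p: string): string {
  const out: string[] = [];
  for (const seg of p.split("/")) {
    if (seg === "" || seg === ".") continue;
    if (seg === "..") out.pop();
    else out.push(seg);
  }
  return "/" + out.join("/");
}

// "**" matches across directories; "*" matches within a single path segment.
function pathGlobToRegExp(glob: string): RegExp {
  const escaped = glob
    .replace(/[.+?^${}()|[\]\\]/g, "\\$&")
    .replace(/\*\*/g, "\u0000") // placeholder so the next step does not touch "**"
    .replace(/\*/g, "[^/]*")
    .replace(/\u0000/g, ".*");
  return new RegExp(`^${escaped}$`);
}

// Deny unless the normalized path matches at least one declared glob.
function isPathAllowed(target: string, globs: string[] | undefined): boolean {
  if (!globs || globs.length === 0) return false;
  const normalized = normalizePosix(target);
  return globs.some((g) => pathGlobToRegExp(g).test(normalized));
}
```

Normalizing before matching matters: without it, a path such as `/home/will/Documents/notes/../../.ssh/id_rsa` would textually match the `notes/**` glob from the manifest example.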
+
+## Skill Routing (Intents)
+
+Skills can be activated via intent rules.
+
+See:
+
+- Config schema: `src/config/schema.ts` (`intents.rules[].target.type = 'skill'`)
+- Routing: `src/daemon/routing.ts`
+
+Example config:
+
+```yaml
+intents:
+  enabled: true
+  match_threshold: 0.7
+  rules:
+    - name: "web-research"
+      patterns: ["research *", "look up *"]
+      target: { type: skill, name: my-web-skill }
+      enabled: true
+```
+
+When an intent routes to a skill:
+
+- `toolPolicyContext.skillName` and `toolPolicyContext.skillPermissions` are set
+- High-risk execution defaults to sandbox (when available)
+
+## Sandbox-By-Default (High-Risk Tools)
+
+In skill context, high-risk tools are not allowed to run on the host unless the skill explicitly opts in.
+
+High-risk tools include:
+
+- `shell.exec`
+- `process.start`
+- `process.kill`
+- `file.write`, `file.edit`, `file.patch`
+- all `browser.*`
+
+Behavior:
+
+- Default (`execution_environment` omitted or `sandbox`):
+  - If the Docker sandbox is enabled and available, `shell.exec` and `process.start` run inside the per-session sandbox container.
+  - If the sandbox is not available, host execution of high-risk tools is denied in skill contexts.
+- Escape hatch (`execution_environment: host`): high-risk tools are permitted to run on the host (still subject to tool policy + hooks/autonomy).
+
+Note: today, only `shell.exec` and `process.start` are replaced with sandboxed implementations. Other high-risk tools are blocked by default in skill contexts unless host mode is explicitly allowed.
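
The rules in this section reduce to a small decision table. The sketch below models skill contexts only. `decideExecution` and `isHighRisk` are hypothetical names, and treating non-high-risk tools as plain host calls is an assumption, since this document only specifies the high-risk path.

```typescript
type Decision = "host" | "sandbox" | "deny";

// Tools that currently have sandboxed implementations, per the note above.
const SANDBOXED_TOOLS = ["shell.exec", "process.start"];
// Remaining high-risk tools from the list above (all browser.* handled separately).
const OTHER_HIGH_RISK = ["process.kill", "file.write", "file.edit", "file.patch"];

function isHighRisk(tool: string): boolean {
  return (
    SANDBOXED_TOOLS.indexOf(tool) !== -1 ||
    OTHER_HIGH_RISK.indexOf(tool) !== -1 ||
    tool.indexOf("browser.") === 0
  );
}

// Decide where a tool call made in a skill context may run.
function decideExecution(
  tool: string,
  executionEnvironment: "sandbox" | "host" | undefined,
  sandboxAvailable: boolean,
): Decision {
  if (!isHighRisk(tool)) return "host"; // assumption: sandbox rules do not apply
  if (executionEnvironment === "host") return "host"; // explicit escape hatch
  if (sandboxAvailable && SANDBOXED_TOOLS.indexOf(tool) !== -1) return "sandbox";
  return "deny"; // no sandboxed implementation, or no sandbox available
}
```

Note that a `"host"` result here still leaves the call subject to tool policy and hooks; the function only answers where the call would run.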
+
+## Prompt Injection Mitigation
+
+Flynn uses a practical defense-in-depth approach:
+
+1. System prompt guidance: fetched/tool content is treated as untrusted data.
+2. Provenance tagging: tool results are wrapped in provenance markers.
+3. Tool-call guard: when untrusted content has been observed, tool calls with obvious injection markers are blocked.
+
+### Provenance Wrapping
+
+Tool results returned to the model are wrapped like:
+
+```text
+[provenance=fetched_content tool=web.fetch untrusted=true]
+...tool output...
+[/provenance]
+```
+
+See: `src/backends/native/agent.ts`.
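
A wrapper producing the format above might look like the sketch below. `wrapWithProvenance` is a hypothetical name, and the step that defuses an embedded closing marker is an added assumption that is not described in this document. The actual implementation is in `src/backends/native/agent.ts`.

```typescript
// Wrap tool output in provenance markers before it is returned to the model.
function wrapWithProvenance(
  tool: string,
  provenance: string, // e.g. "fetched_content"
  output: string,
  untrusted: boolean,
): string {
  // Assumption: neutralize a literal closing marker inside the payload so
  // untrusted content cannot terminate the wrapper early.
  const safeOutput = output.split("[/provenance]").join("[/ provenance]");
  return [
    `[provenance=${provenance} tool=${tool} untrusted=${untrusted}]`,
    safeOutput,
    "[/provenance]",
  ].join("\n");
}
```

Calling `wrapWithProvenance("web.fetch", "fetched_content", body, true)` yields the three-part layout shown in the example block, with `body` in the middle.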
+
+### Tool-Call Guard
+
+When `ToolPolicyContext.untrustedContent` is true:
+
+- High-risk tool calls whose args contain obvious markers (e.g. `rm -rf`, `ignore previous`, `exfiltrate`) are blocked.
+- Network tools (`web.fetch`, `web.search`) refuse arguments containing secret-like fields.
+
+See: `src/tools/executor.ts`.
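
A rough sketch of such a guard follows. The marker list is taken from the examples above. The `SECRET_LIKE_KEY` pattern, the `guardToolCall` name, and the choice to inspect only top-level argument keys are assumptions for illustration, not the contents of `src/tools/executor.ts`.

```typescript
const INJECTION_MARKERS = ["rm -rf", "ignore previous", "exfiltrate"];
const NETWORK_TOOLS = ["web.fetch", "web.search"];
// Assumption: what counts as a "secret-like" field name.
const SECRET_LIKE_KEY = /token|secret|password|api[_-]?key/i;

interface GuardContext {
  untrustedContent: boolean; // mirrors ToolPolicyContext.untrustedContent
  highRisk: boolean; // whether the tool is in the high-risk set
}

// Returns a human-readable block reason, or null when the call may proceed.
function guardToolCall(
  tool: string,
  args: Record<string, unknown>,
  ctx: GuardContext,
): string | null {
  if (!ctx.untrustedContent) return null; // guard only applies after untrusted input

  if (ctx.highRisk) {
    const haystack = JSON.stringify(args).toLowerCase();
    for (const marker of INJECTION_MARKERS) {
      if (haystack.indexOf(marker) !== -1) {
        return `blocked: argument contains marker "${marker}"`;
      }
    }
  }
  if (NETWORK_TOOLS.indexOf(tool) !== -1) {
    const badKey = Object.keys(args).filter((k) => SECRET_LIKE_KEY.test(k))[0];
    if (badKey !== undefined) return `blocked: secret-like field "${badKey}"`;
  }
  return null;
}
```

Substring matching like this is deliberately crude: it catches only obvious cases and is meant to complement the policy and sandbox layers rather than replace them.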
+
+## Secret Scopes
+
+Tools can declare which secret scopes they require:
+
+- `Tool.requiredSecretScopes?: string[]`
+
+Skills declare which scopes they are allowed to use:
+
+- `manifest.json.permissions.secrets?: string[]`
+
+Enforcement:
+
+- In skill context, if a tool requires scopes not allowed by the skill, ToolExecutor denies the tool.
+- Outside skill context, secrets are treated as "ambient" (allowed) to preserve backward compatibility.
+
+See:
+
+- `src/tools/types.ts`
+- `src/tools/executor.ts`
+- Examples: `src/tools/builtin/gmail.ts`, `src/tools/builtin/gcal.ts`, `src/tools/builtin/web-search.ts`
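
The scope check amounts to a set difference. In the sketch below, `missingSecretScopes` and the minimal `Tool` shape are hypothetical stand-ins for the real types in `src/tools/types.ts`.

```typescript
interface Tool {
  name: string;
  requiredSecretScopes?: string[];
}

// Return the scopes a tool needs that the active skill has not been granted.
// An empty result means the call is allowed as far as secrets are concerned.
function missingSecretScopes(
  tool: Tool,
  skillSecrets: string[] | undefined,
  inSkillContext: boolean,
): string[] {
  if (!inSkillContext) return []; // secrets are ambient outside skill context
  const required = tool.requiredSecretScopes ?? [];
  const allowed = skillSecrets ?? [];
  return required.filter((scope) => allowed.indexOf(scope) === -1);
}
```

Returning the missing scopes rather than a boolean makes it straightforward to include them in a denial message or audit event.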
+
+## Audit Logging (Without Secret Leaks)
+
+Tool execution is audited, but sensitive values are redacted before writing to disk.
+
+See:
+
+- `src/audit/logger.ts`
+- `src/audit/types.ts`
+- `src/audit/redact.ts`
+
+Notable fields:
+
+- `execution_id`: a per-tool-call UUID for correlation
+- `execution_environment`: `host` or `sandbox`
+- `skill_name`: active skill (if any)
+- `redactions_applied`: count of redaction operations
+- `tool.approval`: emitted when a confirm hook is resolved
+
+Example tool start event (JSONL):
+
+```json
+{
+  "timestamp": 0,
+  "level": "debug",
+  "event_type": "tool.start",
+  "event": {
+    "tool_name": "shell.exec",
+    "execution_id": "...",
+    "execution_environment": "sandbox",
+    "skill_name": "my-web-skill",
+    "redactions_applied": 1,
+    "tool_args": { "command": "echo [REDACTED_TOKEN]" }
+  }
+}
+```
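
To illustrate how `redactions_applied` could be derived, here is a sketch of a recursive redactor that replaces token-like substrings and counts the replacements. The `TOKEN_PATTERN` regex is a made-up example of a secret format, and `redact` is a hypothetical name. The real rules live in `src/audit/redact.ts`.

```typescript
type Json = string | number | boolean | null | Json[] | { [key: string]: Json };

// Assumption: a token looks like "sk_" or "ghp_" followed by 16+ alphanumerics.
const TOKEN_PATTERN = /\b(?:sk|ghp)_[A-Za-z0-9]{16,}\b/g;

// Walk a JSON value, replace token-like substrings, and count the replacements.
function redact(value: Json): { value: Json; redactions: number } {
  if (typeof value === "string") {
    let count = 0;
    const replaced = value.replace(TOKEN_PATTERN, () => {
      count += 1;
      return "[REDACTED_TOKEN]";
    });
    return { value: replaced, redactions: count };
  }
  if (Array.isArray(value)) {
    let total = 0;
    const items = value.map((item) => {
      const r = redact(item);
      total += r.redactions;
      return r.value;
    });
    return { value: items, redactions: total };
  }
  if (value !== null && typeof value === "object") {
    let total = 0;
    const out: { [key: string]: Json } = {};
    for (const key of Object.keys(value)) {
      const r = redact(value[key]);
      total += r.redactions;
      out[key] = r.value;
    }
    return { value: out, redactions: total };
  }
  return { value, redactions: 0 }; // numbers, booleans, null pass through
}
```

The function returns a new value instead of mutating its input, so the unredacted arguments can still be passed to the tool while only the redacted copy is written to the audit log.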
+
+## Recommended Operator Defaults
+
+- Enable Docker sandboxing (`sandbox.enabled: true`).
+- Enable DM pairing (`pairing.enabled: true`) on any messaging surface.
+- Use a conservative tool profile for general chat (`tools.profile: messaging`).
+- Use skill intent routing for specialized workflows and keep skill permissions narrow.