feat(skills): add llm-tool-best-practices reference from Anthropic + OpenAI official docs

2026-03-05 20:44:00 +00:00
parent 23782735a1
commit a34826908d
2 changed files with 229 additions and 0 deletions
@@ -0,0 +1,201 @@
+# LLM Tool / Skill Best Practices
+**Sources:** Anthropic official docs + engineering blog · OpenAI official docs (March 2026)
+**Purpose:** Reference for building, reviewing, or importing skills — especially when the target LLM is Claude (Anthropic) or GPT (OpenAI).
+
+---
+
+## 1. Universal Principles (apply to both)
+
+### Tool design philosophy
+- **Fewer, richer tools beat many narrow ones.** Don't wrap every API endpoint 1:1. Group related operations into a single tool with an `action` parameter (e.g. one `github_pr` tool whose `action` is `create`, `review`, or `merge`, rather than three separate tools).
+- **Design for agents, not developers.** Agents have limited context; they don't iterate cheaply over lists. Build tools that do the heavy lifting (filter, search, compile) so the agent only sees relevant results.
+- **Each tool needs a clear, distinct purpose.** Overlapping tools confuse model selection and waste context.
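A minimal Python sketch of the consolidated-tool idea above. The `github_pr` handler, its argument names, and the in-memory `PULL_REQUESTS` store are hypothetical stand-ins for illustration, not a real GitHub client.

```python
from typing import Any, Callable, Dict

# Hypothetical in-memory store standing in for a real GitHub backend.
PULL_REQUESTS: Dict[int, Dict[str, Any]] = {}


def _create(args: Dict[str, Any]) -> str:
    number = len(PULL_REQUESTS) + 1
    PULL_REQUESTS[number] = {"title": args["title"], "state": "open", "reviews": []}
    return f"Created PR #{number}: {args['title']}"


def _review(args: Dict[str, Any]) -> str:
    pr = PULL_REQUESTS[args["number"]]
    pr["reviews"].append(args["verdict"])
    return f"Recorded review '{args['verdict']}' on PR #{args['number']}"


def _merge(args: Dict[str, Any]) -> str:
    pr = PULL_REQUESTS[args["number"]]
    if "approve" not in pr["reviews"]:
        # Tell the agent what to do next instead of failing silently.
        return f"PR #{args['number']} has no approving review; call action='review' first."
    pr["state"] = "merged"
    return f"Merged PR #{args['number']}"


ACTIONS: Dict[str, Callable[[Dict[str, Any]], str]] = {
    "create": _create,
    "review": _review,
    "merge": _merge,
}


def github_pr(action: str, **args: Any) -> str:
    """Single tool entry point: one schema and description, three operations."""
    handler = ACTIONS.get(action)
    if handler is None:
        return f"Unknown action '{action}'. Valid actions: {', '.join(sorted(ACTIONS))}."
    return handler(args)
```

The model sees one tool definition whose `action` parameter can be constrained with an `enum`, instead of three near-identical definitions competing for context.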
+
+### Naming
+- Use `service_resource_verb` namespacing: `github_issues_search`, `slack_channel_send`.
+- Whether the namespace goes first (prefix) or last (suffix) can measurably affect tool-selection accuracy; test both with your target model.
+
+### Descriptions (most impactful factor)
+- Write **3–4+ sentences** per tool. Include:
+  - What it does
+  - When to use it (and when NOT to)
+  - What each parameter means + valid values
+  - What the tool returns / does NOT return
+  - Edge cases and caveats
+- Bad: `"Gets the stock price for a ticker."`
+- Good: `"Retrieves the current stock price for a given ticker symbol. The ticker must be a valid symbol for a publicly traded company on a major US stock exchange (NYSE or NASDAQ). Returns the latest trade price in USD. Use when the user asks about the current or most recent price of a specific stock. Does not provide historical data, company info, or forecasts."`
+
+### Input schema
+- Use JSON Schema with typed, described properties.
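+- Mark truly required params as `required`; express optional fields as a union with `null`. (Under OpenAI strict mode every field must appear in `required`, so the `null` union is the only way to make a field optional; see section 3.)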
+- Use `enum` to constrain values where possible — makes invalid states unrepresentable.
+- **Pass the intern test:** could a smart intern use the function correctly with only what you've given the model? If not, add the missing context.
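The schema guidance above can be sketched as follows. `SEARCH_ISSUES_SCHEMA` and `validate_args` are hypothetical names, and the validator covers only the small subset of JSON Schema used here; a real project would more likely rely on a full JSON Schema validator.

```python
from typing import Any, Dict, List

# Hypothetical input schema: typed, described properties, an enum, and an
# optional field expressed as a null union.
SEARCH_ISSUES_SCHEMA: Dict[str, Any] = {
    "type": "object",
    "properties": {
        "query": {"type": "string", "description": "Free-text search terms."},
        "state": {
            "type": "string",
            "enum": ["open", "closed", "all"],
            "description": "Which issues to include.",
        },
        "assignee": {
            "type": ["string", "null"],
            "description": "Login of the assignee, or null for any assignee.",
        },
    },
    "required": ["query", "state"],
}

# Only the two JSON types used by the schema above are supported.
_PY_TYPES = {"string": str, "null": type(None)}


def validate_args(schema: Dict[str, Any], args: Dict[str, Any]) -> List[str]:
    """Check args against the schema subset above; return a list of problems."""
    problems: List[str] = []
    props = schema["properties"]
    for name in schema.get("required", []):
        if name not in args:
            problems.append(f"Missing required parameter '{name}'.")
    for name, value in args.items():
        if name not in props:
            problems.append(f"Unknown parameter '{name}'.")
            continue
        spec = props[name]
        allowed = spec["type"] if isinstance(spec["type"], list) else [spec["type"]]
        if not any(isinstance(value, _PY_TYPES[t]) for t in allowed):
            problems.append(f"'{name}' must be of type {' or '.join(allowed)}.")
        elif "enum" in spec and value not in spec["enum"]:
            problems.append(f"'{name}' must be one of {spec['enum']}.")
    return problems
```

Returning plain-language problems rather than raising lets the same strings be handed back to the model as an actionable error.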
+
+### Tool responses
+- Return **only high-signal information.** Strip opaque internal IDs, blob URLs, mime types, etc.
+- Prefer semantic identifiers (names, slugs) over UUIDs where possible; if you need both, expose a `response_format: "concise" | "detailed"` param.
+- Implement pagination/filtering/truncation for responses that could bloat context.
+- On errors: return **actionable, plain-language error messages** — not raw stack traces or error codes.
+- Choose response format (JSON, XML, Markdown) based on your own evals; no universal winner.
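One way the response guidance above might look in code. The record shape in `RAW_RESULTS`, the `format_results` helper, and the `max_items` default are assumptions made for this sketch.

```python
from typing import Any, Dict, List

# Hypothetical raw records as an internal API might return them.
RAW_RESULTS: List[Dict[str, Any]] = [
    {"uuid": f"0000-{i}", "name": f"report-{i}", "mime": "text/plain", "summary": f"Summary {i}"}
    for i in range(1, 8)
]


def format_results(
    records: List[Dict[str, Any]],
    response_format: str = "concise",
    max_items: int = 5,
) -> str:
    """Return high-signal text for the model; drop noise and bound the size."""
    if response_format not in ("concise", "detailed"):
        return "Invalid response_format. Use 'concise' or 'detailed'."
    lines = []
    for rec in records[:max_items]:
        # Semantic identifier first; the mime type is never exposed.
        line = f"{rec['name']}: {rec['summary']}"
        if response_format == "detailed":
            # Opaque IDs are only included when the caller asks for them.
            line += f" (id={rec['uuid']})"
        lines.append(line)
    omitted = len(records) - max_items
    if omitted > 0:
        # Actionable truncation notice: say what to do next.
        lines.append(f"[{omitted} more results omitted. Use a more specific query to narrow results.]")
    return "\n".join(lines)
```

Calling `format_results(RAW_RESULTS, "detailed")` adds the IDs needed for follow-up calls, while the default keeps the context footprint small.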
+
+### Safety fundamentals
+- Never give tools more permissions than needed (least privilege).
+- Treat all tool input as untrusted; validate before acting.
+- For destructive/irreversible actions: require explicit confirmation or human-in-the-loop.
+- Log tool calls for auditability.
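A rough sketch of a dispatcher that applies the confirmation and logging points above. The `DESTRUCTIVE_TOOLS` set, the `call_tool` signature, and the `confirmed` flag are illustrative assumptions; how confirmation is actually collected from a human is left out.

```python
import json
import logging
from typing import Any, Callable, Dict, Set

logger = logging.getLogger("tool_audit")

# Hypothetical tool names that must never run without explicit confirmation.
DESTRUCTIVE_TOOLS: Set[str] = {"delete_file", "send_email"}


def call_tool(
    name: str,
    args: Dict[str, Any],
    registry: Dict[str, Callable[..., str]],
    confirmed: bool = False,
) -> str:
    """Dispatch a tool call with an audit log entry and a confirmation gate."""
    # Audit trail. A real system would redact secrets before logging args.
    logger.info(
        "tool_call %s",
        json.dumps({"tool": name, "args": args, "confirmed": confirmed}, default=str),
    )
    if name not in registry:
        return f"Unknown tool '{name}'. Available tools: {', '.join(sorted(registry))}."
    if name in DESTRUCTIVE_TOOLS and not confirmed:
        # Refuse irreversible actions until a human has approved them.
        return f"'{name}' is irreversible. Ask the user to confirm, then retry with confirmed=True."
    return registry[name](**args)
```

Keeping the gate in the dispatcher rather than in each tool means a newly added destructive tool only needs an entry in one set.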
+
+---
+
+## 2. Anthropic / Claude-specific
+
+**Sources:**
+- https://platform.claude.com/docs/en/agents-and-tools/tool-use/implement-tool-use
+- https://www.anthropic.com/engineering/writing-tools-for-agents
+
+### Tool definition format
+```json
+{
+  "name": "get_stock_price",
+  "description": "Retrieves the current stock price for a given ticker symbol. Must be a valid NYSE or NASDAQ ticker. Returns latest trade price in USD. Use when the user asks about current price of a specific stock. Does not return historical data or other company information.",
+  "input_schema": {
+    "type": "object",
+    "properties": {
+      "ticker": {
+        "type": "string",
+        "description": "The stock ticker symbol, e.g. AAPL for Apple Inc."
+      }
+    },
+    "required": ["ticker"]
+  },
+  "input_examples": [
+    { "ticker": "AAPL" },
+    { "ticker": "MSFT" }
+  ]
+}
+```
+
+### Model selection guidance
+| Scenario | Recommended model |
+|---|---|
+| Complex multi-tool tasks, ambiguous queries | Claude Opus (latest) |
+| Simple, well-defined single tools | Claude Haiku (note: may infer missing params) |
+| Extended thinking + tools | See extended thinking guide — special rules apply |
+
+### Claude-specific tips
+- Use `input_examples` field for tools with nested objects, optional params, or format-sensitive inputs — helps Claude call them correctly.
+- Use `strict: true` in tool definitions for production agents (guarantees schema conformance via structured outputs).
+- **Interleaved thinking:** Enable for evals to understand *why* Claude picks (or skips) certain tools.
+- Consolidate tools that are always called in sequence: if `get_weather` is always followed by `format_weather_card`, merge them.
+- Keep tool response size bounded. Claude Code caps tool responses at 25,000 tokens by default.
+- Truncation messages should be actionable: tell Claude what to do next (e.g., "Use a more specific query to narrow results").
+- Claude Haiku will sometimes infer/hallucinate missing required parameters — use more explicit descriptions or upgrade to Opus for higher-stakes flows.
+
+### Evaluation approach (Anthropic recommended)
+1. Generate real-world tasks (not toy examples) — include multi-step tasks requiring 10+ tool calls.
+2. Run eval agents with reasoning/CoT blocks enabled.
+3. Collect metrics: accuracy, total tool calls, errors, token usage.
+4. Feed transcripts back to Claude Code for automated tool improvement.
+5. Use held-out test sets to avoid overfitting.
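Steps 3 and 5 above could be sketched like this. The `EvalRun` record shape, the metric names, and the 20% held-out fraction are assumptions for illustration, not values prescribed by Anthropic.

```python
import random
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class EvalRun:
    """One eval task outcome (hypothetical record shape)."""
    task_id: str
    correct: bool
    tool_calls: int
    tool_errors: int
    tokens: int


def summarize(runs: List[EvalRun]) -> Dict[str, float]:
    """Aggregate accuracy, tool calls, errors, and token usage across runs."""
    n = len(runs)
    if n == 0:
        raise ValueError("no runs to summarize")
    return {
        "accuracy": sum(r.correct for r in runs) / n,
        "avg_tool_calls": sum(r.tool_calls for r in runs) / n,
        "total_errors": float(sum(r.tool_errors for r in runs)),
        "avg_tokens": sum(r.tokens for r in runs) / n,
    }


def split_tasks(
    task_ids: List[str], held_out_fraction: float = 0.2, seed: int = 0
) -> Tuple[List[str], List[str]]:
    """Deterministically split tasks into (tuning set, held-out set)."""
    shuffled = sorted(task_ids)
    random.Random(seed).shuffle(shuffled)
    cut = max(1, int(len(shuffled) * held_out_fraction))
    return shuffled[cut:], shuffled[:cut]
```

Tool descriptions would be iterated against the tuning set only, with the held-out set scored once at the end to check that the improvements generalize.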
+
+---
+
+## 3. OpenAI / GPT-specific
+
+**Sources:**
+- https://platform.openai.com/docs/guides/function-calling
+- https://platform.openai.com/docs/guides/safety-best-practices
+
+### Tool definition format
+```json
+{
+  "type": "function",
+  "name": "get_weather",
+  "description": "Retrieves current weather for the given location.",
+  "strict": true,
+  "parameters": {
+    "type": "object",
+    "properties": {
+      "location": {
+        "type": "string",
+        "description": "City and country e.g. Bogotá, Colombia"
+      },
+      "units": {
+        "type": ["string", "null"],
+        "enum": ["celsius", "fahrenheit"],
+        "description": "Units the temperature will be returned in."
+      }
+    },
+    "required": ["location", "units"],
+    "additionalProperties": false
+  }
+}
+```
+
+### OpenAI-specific tips
+- **Always enable `strict: true`** — guarantees function calls match schema exactly (uses structured outputs under the hood). Requires `additionalProperties: false` and all fields in `required` (use `null` type for optional fields).
+- **Keep initially available functions under ~20** — performance degrades with too many tools in context. Use [tool search](https://platform.openai.com/docs/guides/tools-tool-search) for large tool libraries.
+- **`parallel_tool_calls`:** enabled by default. Disable it if your tools have ordering dependencies, or if you use `gpt-4.1-nano` (which can call the same tool more than once in a turn).
+- **`tool_choice`:** Default is `auto`. Use `required` to force a call, or named function to force a specific one. Use `allowed_tools` to restrict available tools per turn without breaking prompt cache.
+- **Tool search (`defer_loading: true`):** For large ecosystems, defer rarely-used tools. Let the model search and load them on demand — keeps initial context small and accurate.
+- **Namespaces:** Group related tools using the `namespace` type with a `description`. Helps model choose between overlapping domains (e.g., `crm` vs `billing`).
+- **Reasoning models (o4-mini, GPT-5, etc.):** Do NOT add examples in descriptions — this can hurt reasoning model performance. Pass reasoning items back with tool call outputs (required).
+- **Token awareness:** Function definitions are injected into the system prompt and billed as input tokens. Shorten descriptions where possible; use tool search to defer heavy schemas.
+- **`additionalProperties: false`** is required for strict mode — always add it.
+- For code generation or high-stakes outputs: keep a human-in-the-loop review step.
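The strict-mode requirements above (every field in `required`, optional fields as `null` unions, `additionalProperties: false`) can be applied mechanically. The helper below is a hypothetical sketch for flat object schemas only; nested objects are not handled.

```python
import copy
from typing import Any, Dict


def to_strict_parameters(schema: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of a flat object schema rewritten for strict mode.

    Every property is listed in `required`; properties that were optional get
    "null" added to their type; `additionalProperties` is set to false.
    """
    strict = copy.deepcopy(schema)
    originally_required = set(strict.get("required", []))
    for name, spec in strict["properties"].items():
        if name in originally_required:
            continue
        types = spec["type"] if isinstance(spec["type"], list) else [spec["type"]]
        if "null" not in types:
            types = types + ["null"]
        spec["type"] = types
    strict["required"] = list(strict["properties"])
    strict["additionalProperties"] = False
    return strict
```

Applied to a schema where only `location` is required and `units` is a plain string enum, this yields the same shape as the `get_weather` definition shown earlier in this section.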
+
+### Safety (OpenAI-specific)
+- Use the **Moderation API** (free) to filter unsafe content in tool outputs or user inputs.
+- Pass `safety_identifier` (hashed user ID) with API requests to help OpenAI detect abuse patterns.
+- **Adversarial testing / red-teaming:** test prompt injection vectors (e.g., tool output containing instructions like "ignore previous instructions").
+- Constrain user input length to limit prompt injection surface.
+- Implement **HITL (human in the loop)** for high-stakes or irreversible tool actions.
+
+---
+
+## 4. Differences Summary
+
+| Concern | Anthropic/Claude | OpenAI/GPT |
+|---|---|---|
+| Schema key | `input_schema` | `parameters` |
+| Strict mode | `strict: true` (structured outputs) | `strict: true` + `additionalProperties: false` required |
+| Optional fields | Omit from `required` (standard JSON Schema) | Add `null` to type union + keep in `required` array |
+| Examples in tool def | `input_examples` array (recommended) | Inline in description (avoid for reasoning models) |
+| Tool response format | Flexible; test JSON/XML/Markdown | String preferred; JSON/error codes acceptable |
+| Max tools in context | No hard limit; keep focused | Soft limit ~20; use tool search beyond that |
+| Best model for complex tools | Claude Opus | GPT-4.1 / GPT-5 (not nano for parallel calls) |
+| Safety tooling | Built-in thinking/CoT, strict tool use | Moderation API, safety_identifier, HITL |
+| Parallel calls | Default on; disable with `disable_parallel_tool_use: true` in `tool_choice` | Default on; disable with `parallel_tool_calls: false` |
+| Deferred tools | Tool search tool + `defer_loading: true` (beta) | `defer_loading: true` + tool search |
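To make the schema-key row of the table concrete, here is a small hypothetical converter from the Anthropic tool shape to the OpenAI function shape used in this document. It only renames keys; it does not enable `strict`, because strict mode needs the schema changes described in section 3.

```python
from typing import Any, Dict


def anthropic_to_openai(tool: Dict[str, Any]) -> Dict[str, Any]:
    """Map an Anthropic-style tool definition onto the OpenAI function shape.

    `input_schema` becomes `parameters`. `input_examples` is dropped because
    the OpenAI format shown in this document has no equivalent field.
    """
    return {
        "type": "function",
        "name": tool["name"],
        "description": tool["description"],
        "parameters": dict(tool["input_schema"]),
    }
```

Descriptions carry over unchanged, which is one reason to write them in plain, model-neutral language.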
+
+---
+
+## 5. Skill Safety Checklist (for ClawHub or custom skills)
+
+Before using any skill/tool — especially downloaded ones:
+
+- [ ] **Least privilege:** Does the skill need the permissions it requests? (file system, network, exec, credentials)
+- [ ] **No remote instruction execution:** Does the skill fetch remote content and act on it as instructions? Red flag.
+- [ ] **Input validation:** Are tool inputs validated before any destructive or external action?
+- [ ] **Reversibility:** Are destructive operations guarded? (prefer `trash` over `rm`, confirmations before sends)
+- [ ] **Output scope:** Does the tool return only what's needed, or does it leak sensitive data?
+- [ ] **Credential handling:** Are secrets passed as env vars, not hardcoded or logged?
+- [ ] **Description quality:** Is the description detailed enough that the model won't misuse it?
+- [ ] **Source trust:** Is the skill from a known, reviewed source? (see TOOLS.md notes on ClawHub flagged skills)
+- [ ] **Error handling:** Does the skill return helpful errors, or raw stack traces that could leak internals?
+
+---
+
+## 6. OpenClaw Skill Authoring Notes
+
+When writing new SKILL.md files or tools for zap:
+
+- **Model hint in SKILL.md:** If a skill is optimized for a specific model or has model-specific behavior, document the target model (or model family) and why.
+- **Anthropic skills:** Provide `input_examples` for complex tools; use detailed multi-sentence descriptions.
+- **OpenAI skills:** Enable `strict: true` + `additionalProperties: false`; keep tool count small; no examples in descriptions for reasoning models (o4-mini, GPT-5, etc.).
+- **Cross-model skills:** Write descriptions that work for both: thorough, plain language, no model-specific tricks. Test on both before shipping.
+- **Token hygiene:** Keep responses bounded. Document expected response size in the skill.
+
+---
+
+*Last updated: 2026-03-05 | Sources: Anthropic Docs, Anthropic Engineering Blog, OpenAI Docs*
@@ -0,0 +1,28 @@
+# SKILL.md — llm-tool-best-practices
+
+## Purpose
+Reference guide synthesizing official Anthropic and OpenAI best practices for building, reviewing, and importing agent skills/tools. Use this when:
+- Writing a new skill (SKILL.md or tool definition)
+- Reviewing a downloaded ClawHub skill for safety
+- Debugging why a tool is being misused or ignored by the model
+- Optimizing a skill for a specific LLM (Claude vs. GPT)
+
+## How to use
+Read `LLM_TOOL_BEST_PRACTICES.md` in this directory for the full reference.
+
+Key sections:
+1. **Universal principles** — applies to all LLMs
+2. **Anthropic/Claude-specific** — schema format, model selection, `input_examples`, eval approach
+3. **OpenAI/GPT-specific** — `strict` mode, tool count limits, reasoning model caveats, safety APIs
+4. **Differences summary** — quick comparison table
+5. **Skill safety checklist** — use before activating any downloaded skill
+6. **OpenClaw authoring notes** — how to write SKILL.md files for this workspace
+
+## Sources
+- https://platform.claude.com/docs/en/agents-and-tools/tool-use/implement-tool-use
+- https://www.anthropic.com/engineering/writing-tools-for-agents
+- https://platform.openai.com/docs/guides/function-calling
+- https://platform.openai.com/docs/guides/safety-best-practices
+
+## Last updated
+2026-03-05