From a34826908df3018a3846d14227580669bba58bd8 Mon Sep 17 00:00:00 2001 From: zap Date: Thu, 5 Mar 2026 20:44:00 +0000 Subject: [PATCH] feat(skills): add llm-tool-best-practices reference from Anthropic + OpenAI official docs --- .../LLM_TOOL_BEST_PRACTICES.md | 201 ++++++++++++++++++ skills/llm-tool-best-practices/SKILL.md | 28 +++ 2 files changed, 229 insertions(+) create mode 100644 skills/llm-tool-best-practices/LLM_TOOL_BEST_PRACTICES.md create mode 100644 skills/llm-tool-best-practices/SKILL.md diff --git a/skills/llm-tool-best-practices/LLM_TOOL_BEST_PRACTICES.md b/skills/llm-tool-best-practices/LLM_TOOL_BEST_PRACTICES.md new file mode 100644 index 0000000..86acf6c --- /dev/null +++ b/skills/llm-tool-best-practices/LLM_TOOL_BEST_PRACTICES.md @@ -0,0 +1,201 @@ +# LLM Tool / Skill Best Practices +**Sources:** Anthropic official docs + engineering blog · OpenAI official docs (March 2026) +**Purpose:** Reference for building, reviewing, or importing skills — especially when the target LLM is Claude (Anthropic) or GPT (OpenAI). + +--- + +## 1. Universal Principles (apply to both) + +### Tool design philosophy +- **Fewer, richer tools beat many narrow ones.** Don't wrap every API endpoint 1:1. Group related operations into a single tool with an `action` parameter (e.g. one `github_pr` tool with `create/review/merge` rather than three separate tools). +- **Design for agents, not developers.** Agents have limited context; they don't iterate cheaply over lists. Build tools that do the heavy lifting (filter, search, compile) so the agent only sees relevant results. +- **Each tool needs a clear, distinct purpose.** Overlapping tools confuse model selection and waste context. + +### Naming +- Use `service_resource_verb` namespacing: `github_issues_search`, `slack_channel_send`. +- Prefix-based vs. suffix-based namespacing can have measurable accuracy differences — test with your target model. 
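The consolidation and naming guidance above can be sketched as a single tool definition. This is a hypothetical fleshed-out version of the `github_pr` example mentioned earlier — the field names (`action`, `repo`, `pr_number`) are illustrative, and the schema key varies by provider (`input_schema` for Anthropic, `parameters` for OpenAI; see sections 2–4):

```json
{
  "name": "github_pr",
  "description": "Manages GitHub pull requests in a single tool. Supports creating a PR, requesting a review, and merging, selected via the 'action' parameter. Use for pull request workflows only; does not handle issues, discussions, or repository settings.",
  "parameters": {
    "type": "object",
    "properties": {
      "action": {
        "type": "string",
        "enum": ["create", "review", "merge"],
        "description": "Which pull request operation to perform."
      },
      "repo": {
        "type": "string",
        "description": "Repository in 'owner/name' form, e.g. 'octocat/hello-world'."
      },
      "pr_number": {
        "type": ["integer", "null"],
        "description": "Pull request number. Required for 'review' and 'merge'; null for 'create'."
      }
    },
    "required": ["action", "repo", "pr_number"],
    "additionalProperties": false
  }
}
```

The `enum` on `action` makes invalid operations unrepresentable, and the single consolidated tool replaces three near-duplicate definitions that would otherwise compete for the model's attention during tool selection.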
+ +### Descriptions (most impactful factor) +- Write **3–4+ sentences** per tool. Include: + - What it does + - When to use it (and when NOT to) + - What each parameter means + valid values + - What the tool returns / does NOT return + - Edge cases and caveats +- Bad: `"Gets the stock price for a ticker."` +- Good: `"Retrieves the current stock price for a given ticker symbol. The ticker must be a valid symbol for a publicly traded company on a major US stock exchange (NYSE or NASDAQ). Returns the latest trade price in USD. Use when the user asks about the current or most recent price of a specific stock. Does not provide historical data, company info, or forecasts."` + +### Input schema +- Use JSON Schema with typed, described properties. +- Mark truly required params as `required`; use `null` union types for optional fields. +- Use `enum` to constrain values where possible — makes invalid states unrepresentable. +- **Pass the intern test:** could a smart intern use the function correctly with only what you've given the model? If not, add the missing context. + +### Tool responses +- Return **only high-signal information.** Strip opaque internal IDs, blob URLs, mime types, etc. +- Prefer semantic identifiers (names, slugs) over UUIDs where possible; if you need both, expose a `response_format: "concise" | "detailed"` param. +- Implement pagination/filtering/truncation for responses that could bloat context. +- On errors: return **actionable, plain-language error messages** — not raw stack traces or error codes. +- Choose response format (JSON, XML, Markdown) based on your own evals; no universal winner. + +### Safety fundamentals +- Never give tools more permissions than needed (least privilege). +- Treat all tool input as untrusted; validate before acting. +- For destructive/irreversible actions: require explicit confirmation or human-in-the-loop. +- Log tool calls for auditability. + +--- + +## 2. 
Anthropic / Claude-specific + +**Sources:** +- https://platform.claude.com/docs/en/agents-and-tools/tool-use/implement-tool-use +- https://www.anthropic.com/engineering/writing-tools-for-agents + +### Tool definition format +```json +{ + "name": "get_stock_price", + "description": "Retrieves the current stock price for a given ticker symbol. Must be a valid NYSE or NASDAQ ticker. Returns latest trade price in USD. Use when the user asks about current price of a specific stock. Does not return historical data or other company information.", + "input_schema": { + "type": "object", + "properties": { + "ticker": { + "type": "string", + "description": "The stock ticker symbol, e.g. AAPL for Apple Inc." + } + }, + "required": ["ticker"] + }, + "input_examples": [ + { "ticker": "AAPL" }, + { "ticker": "MSFT" } + ] +} +``` + +### Model selection guidance +| Scenario | Recommended model | +|---|---| +| Complex multi-tool tasks, ambiguous queries | Claude Opus (latest) | +| Simple, well-defined single tools | Claude Haiku (note: may infer missing params) | +| Extended thinking + tools | See extended thinking guide — special rules apply | + +### Claude-specific tips +- Use `input_examples` field for tools with nested objects, optional params, or format-sensitive inputs — helps Claude call them correctly. +- Use `strict: true` in tool definitions for production agents (guarantees schema conformance via structured outputs). +- **Interleaved thinking:** Enable for evals to understand *why* Claude picks (or skips) certain tools. +- Consolidate tools that are always called in sequence: if `get_weather` is always followed by `format_weather_card`, merge them. +- Keep tool response size bounded — Claude Code defaults to 25,000 token cap on responses. +- Truncation messages should be actionable: tell Claude what to do next (e.g., "Use a more specific query to narrow results"). 
+- Claude Haiku will sometimes infer/hallucinate missing required parameters — use more explicit descriptions or upgrade to Opus for higher-stakes flows. + +### Evaluation approach (Anthropic recommended) +1. Generate real-world tasks (not toy examples) — include multi-step tasks requiring 10+ tool calls. +2. Run eval agents with reasoning/CoT blocks enabled. +3. Collect metrics: accuracy, total tool calls, errors, token usage. +4. Feed transcripts back to Claude Code for automated tool improvement. +5. Use held-out test sets to avoid overfitting. + +--- + +## 3. OpenAI / GPT-specific + +**Sources:** +- https://platform.openai.com/docs/guides/function-calling +- https://platform.openai.com/docs/guides/safety-best-practices + +### Tool definition format +```json +{ + "type": "function", + "name": "get_weather", + "description": "Retrieves current weather for the given location.", + "strict": true, + "parameters": { + "type": "object", + "properties": { + "location": { + "type": "string", + "description": "City and country e.g. Bogotá, Colombia" + }, + "units": { + "type": ["string", "null"], + "enum": ["celsius", "fahrenheit"], + "description": "Units the temperature will be returned in." + } + }, + "required": ["location", "units"], + "additionalProperties": false + } +} +``` + +### OpenAI-specific tips +- **Always enable `strict: true`** — guarantees function calls match schema exactly (uses structured outputs under the hood). Requires `additionalProperties: false` and all fields in `required` (use `null` type for optional fields). +- **Keep initially available functions under ~20** — performance degrades with too many tools in context. Use [tool search](https://platform.openai.com/docs/guides/tools-tool-search) for large tool libraries. +- **`parallel_tool_calls`:** enabled by default. Disable if your tools have ordering dependencies, or if using `gpt-4.1-nano` (known to double-call same tool). +- **`tool_choice`:** Default is `auto`. 
Use `required` to force a call, or named function to force a specific one. Use `allowed_tools` to restrict available tools per turn without breaking prompt cache. +- **Tool search (`defer_loading: true`):** For large ecosystems, defer rarely-used tools. Let the model search and load them on demand — keeps initial context small and accurate. +- **Namespaces:** Group related tools using the `namespace` type with a `description`. Helps model choose between overlapping domains (e.g., `crm` vs `billing`). +- **Reasoning models (o4-mini, GPT-5, etc.):** Do NOT add examples in descriptions — this can hurt reasoning model performance. Pass reasoning items back with tool call outputs (required). +- **Token awareness:** Function definitions are injected into the system prompt and billed as input tokens. Shorten descriptions where possible; use tool search to defer heavy schemas. +- **`additionalProperties: false`** is required for strict mode — always add it. +- For code generation or high-stakes outputs: keep a human-in-the-loop review step. + +### Safety (OpenAI-specific) +- Use the **Moderation API** (free) to filter unsafe content in tool outputs or user inputs. +- Pass `safety_identifier` (hashed user ID) with API requests to help OpenAI detect abuse patterns. +- **Adversarial testing / red-teaming:** test prompt injection vectors (e.g., tool output containing instructions like "ignore previous instructions"). +- Constrain user input length to limit prompt injection surface. +- Implement **HITL (human in the loop)** for high-stakes or irreversible tool actions. + +--- + +## 4. 
Differences Summary + +| Concern | Anthropic/Claude | OpenAI/GPT | +|---|---|---| +| Schema key | `input_schema` | `parameters` | +| Strict mode | `strict: true` (structured outputs) | `strict: true` + `additionalProperties: false` required | +| Optional fields | Standard JSON Schema nullable | Add `null` to type union + keep in `required` array | +| Examples in tool def | `input_examples` array (recommended) | Inline in description (avoid for reasoning models) | +| Tool response format | Flexible; test JSON/XML/Markdown | String preferred; JSON/error codes acceptable | +| Max tools in context | No hard limit; keep focused | Soft limit ~20; use tool search beyond that | +| Best model for complex tools | Claude Opus | GPT-4.1 / GPT-5 (not nano for parallel calls) | +| Safety tooling | Built-in thinking/CoT, strict tool use | Moderation API, safety_identifier, HITL | +| Parallel calls | Default on; managed by model | Default on; disable with `parallel_tool_calls: false` | +| Deferred tools | N/A (use fewer tools instead) | `defer_loading: true` + tool search | + +--- + +## 5. Skill Safety Checklist (for ClawHub or custom skills) + +Before using any skill/tool — especially downloaded ones: + +- [ ] **Least privilege:** Does the skill need the permissions it requests? (file system, network, exec, credentials) +- [ ] **No remote instruction execution:** Does the skill fetch remote content and act on it as instructions? Red flag. +- [ ] **Input validation:** Are tool inputs validated before any destructive or external action? +- [ ] **Reversibility:** Are destructive operations guarded? (prefer `trash` over `rm`, confirmations before sends) +- [ ] **Output scope:** Does the tool return only what's needed, or does it leak sensitive data? +- [ ] **Credential handling:** Are secrets passed as env vars, not hardcoded or logged? +- [ ] **Description quality:** Is the description detailed enough that the model won't misuse it? 
+- [ ] **Source trust:** Is the skill from a known, reviewed source? (see TOOLS.md notes on ClawHub flagged skills) +- [ ] **Error handling:** Does the skill return helpful errors, or raw stack traces that could leak internals? + +--- + +## 6. OpenClaw Skill Authoring Notes + +When writing new SKILL.md files or tools for zap: + +- **Model hint in SKILL.md:** If a skill is optimized for a specific model, document which model and why. +- **Anthropic skills:** Provide `input_examples` for complex tools; use detailed multi-sentence descriptions. +- **OpenAI skills:** Enable `strict: true` + `additionalProperties: false`; keep tool count small; no examples for o-series reasoning models. +- **Cross-model skills:** Write descriptions that work for both — thorough, plain language, no model-specific tricks. Test on both before shipping. +- **Token hygiene:** Keep responses bounded. Document expected response size in the skill. +- **SKILL.md structure:** Document the target model family if the skill has model-specific behavior. + +--- + +*Last updated: 2026-03-05 | Sources: Anthropic Docs, Anthropic Engineering Blog, OpenAI Docs* diff --git a/skills/llm-tool-best-practices/SKILL.md b/skills/llm-tool-best-practices/SKILL.md new file mode 100644 index 0000000..acefe0f --- /dev/null +++ b/skills/llm-tool-best-practices/SKILL.md @@ -0,0 +1,28 @@ +# SKILL.md — llm-tool-best-practices + +## Purpose +Reference guide synthesizing official Anthropic and OpenAI best practices for building, reviewing, and importing agent skills/tools. Use this when: +- Writing a new skill (SKILL.md or tool definition) +- Reviewing a downloaded ClawHub skill for safety +- Debugging why a tool is being misused or ignored by the model +- Optimizing a skill for a specific LLM (Claude vs. GPT) + +## How to use +Read `LLM_TOOL_BEST_PRACTICES.md` in this directory for the full reference. + +Key sections: +1. **Universal principles** — applies to all LLMs +2. 
**Anthropic/Claude-specific** — schema format, model selection, `input_examples`, eval approach +3. **OpenAI/GPT-specific** — `strict` mode, tool count limits, reasoning model caveats, safety APIs +4. **Differences summary** — quick comparison table +5. **Skill safety checklist** — use before activating any downloaded skill +6. **OpenClaw authoring notes** — how to write SKILL.md files for this workspace + +## Sources +- https://platform.claude.com/docs/en/agents-and-tools/tool-use/implement-tool-use +- https://www.anthropic.com/engineering/writing-tools-for-agents +- https://platform.openai.com/docs/guides/function-calling +- https://platform.openai.com/docs/guides/safety-best-practices + +## Last updated +2026-03-05