# LLM Tool / Skill Best Practices

**Sources:** Anthropic official docs + engineering blog · OpenAI official docs (March 2026)
**Purpose:** Reference for building, reviewing, or importing skills — especially when the target LLM is Claude (Anthropic) or GPT (OpenAI).

---

## 1. Universal Principles (apply to both)

### Tool design philosophy

- **Fewer, richer tools beat many narrow ones.** Don't wrap every API endpoint 1:1. Group related operations into a single tool with an `action` parameter (e.g. one `github_pr` tool with `create/review/merge` rather than three separate tools).
- **Design for agents, not developers.** Agents have limited context; they don't iterate cheaply over lists. Build tools that do the heavy lifting (filter, search, compile) so the agent only sees relevant results.
- **Each tool needs a clear, distinct purpose.** Overlapping tools confuse model selection and waste context.

### Naming

- Use `service_resource_verb` namespacing: `github_issues_search`, `slack_channel_send`.
- Prefix-based vs. suffix-based namespacing can have measurable accuracy differences — test with your target model.

### Descriptions (most impactful factor)

- Write **3–4+ sentences** per tool. Include:
  - What it does
  - When to use it (and when NOT to)
  - What each parameter means + valid values
  - What the tool returns / does NOT return
  - Edge cases and caveats
- Bad: `"Gets the stock price for a ticker."`
- Good: `"Retrieves the current stock price for a given ticker symbol. The ticker must be a valid symbol for a publicly traded company on a major US stock exchange (NYSE or NASDAQ). Returns the latest trade price in USD. Use when the user asks about the current or most recent price of a specific stock. Does not provide historical data, company info, or forecasts."`

### Input schema

- Use JSON Schema with typed, described properties.
- Mark truly required params as `required`; use `null` union types for optional fields.
- Use `enum` to constrain values where possible — it makes invalid states unrepresentable.
- **Pass the intern test:** could a smart intern use the function correctly with only what you've given the model? If not, add the missing context.

### Tool responses

- Return **only high-signal information.** Strip opaque internal IDs, blob URLs, MIME types, etc.
- Prefer semantic identifiers (names, slugs) over UUIDs where possible; if you need both, expose a `response_format: "concise" | "detailed"` param.
- Implement pagination/filtering/truncation for responses that could bloat context.
- On errors: return **actionable, plain-language error messages** — not raw stack traces or error codes.
- Choose response format (JSON, XML, Markdown) based on your own evals; there is no universal winner.

### Safety fundamentals

- Never give tools more permissions than needed (least privilege).
- Treat all tool input as untrusted; validate before acting.
- For destructive/irreversible actions: require explicit confirmation or human-in-the-loop.
- Log tool calls for auditability.

---

## 2. Anthropic / Claude-specific

**Sources:**

- https://platform.claude.com/docs/en/agents-and-tools/tool-use/implement-tool-use
- https://www.anthropic.com/engineering/writing-tools-for-agents

### Tool definition format

```json
{
  "name": "get_stock_price",
  "description": "Retrieves the current stock price for a given ticker symbol. Must be a valid NYSE or NASDAQ ticker. Returns latest trade price in USD. Use when the user asks about current price of a specific stock. Does not return historical data or other company information.",
  "input_schema": {
    "type": "object",
    "properties": {
      "ticker": {
        "type": "string",
        "description": "The stock ticker symbol, e.g. AAPL for Apple Inc."
      }
    },
    "required": ["ticker"]
  },
  "input_examples": [
    { "ticker": "AAPL" },
    { "ticker": "MSFT" }
  ]
}
```

### Model selection guidance

| Scenario | Recommended model |
|---|---|
| Complex multi-tool tasks, ambiguous queries | Claude Opus (latest) |
| Simple, well-defined single tools | Claude Haiku (note: may infer missing params) |
| Extended thinking + tools | See extended thinking guide — special rules apply |

### Claude-specific tips

- Use the `input_examples` field for tools with nested objects, optional params, or format-sensitive inputs — it helps Claude call them correctly.
- Use `strict: true` in tool definitions for production agents (guarantees schema conformance via structured outputs).
- **Interleaved thinking:** enable it for evals to understand *why* Claude picks (or skips) certain tools.
- Consolidate tools that are always called in sequence: if `get_weather` is always followed by `format_weather_card`, merge them.
- Keep tool response size bounded — Claude Code defaults to a 25,000-token cap on responses.
- Truncation messages should be actionable: tell Claude what to do next (e.g. "Use a more specific query to narrow results").
- Claude Haiku will sometimes infer/hallucinate missing required parameters — use more explicit descriptions or upgrade to Opus for higher-stakes flows.

### Evaluation approach (Anthropic recommended)

1. Generate real-world tasks (not toy examples) — include multi-step tasks requiring 10+ tool calls.
2. Run eval agents with reasoning/CoT blocks enabled.
3. Collect metrics: accuracy, total tool calls, errors, token usage.
4. Feed transcripts back to Claude Code for automated tool improvement.
5. Use held-out test sets to avoid overfitting.

---

## 3. OpenAI / GPT-specific

**Sources:**

- https://platform.openai.com/docs/guides/function-calling
- https://platform.openai.com/docs/guides/safety-best-practices

### Tool definition format

```json
{
  "type": "function",
  "name": "get_weather",
  "description": "Retrieves current weather for the given location.",
  "strict": true,
  "parameters": {
    "type": "object",
    "properties": {
      "location": {
        "type": "string",
        "description": "City and country e.g. Bogotá, Colombia"
      },
      "units": {
        "type": ["string", "null"],
        "enum": ["celsius", "fahrenheit"],
        "description": "Units the temperature will be returned in."
      }
    },
    "required": ["location", "units"],
    "additionalProperties": false
  }
}
```

### OpenAI-specific tips

- **Always enable `strict: true`** — it guarantees function calls match the schema exactly (uses structured outputs under the hood). Requires `additionalProperties: false` and all fields listed in `required` (use a `null` type for optional fields).
- **Keep initially available functions under ~20** — performance degrades with too many tools in context. Use [tool search](https://platform.openai.com/docs/guides/tools-tool-search) for large tool libraries.
- **`parallel_tool_calls`:** enabled by default. Disable it if your tools have ordering dependencies, or if using `gpt-4.1-nano` (known to double-call the same tool).
- **`tool_choice`:** default is `auto`. Use `required` to force a call, or a named function to force a specific one. Use `allowed_tools` to restrict available tools per turn without breaking the prompt cache.
- **Tool search (`defer_loading: true`):** for large ecosystems, defer rarely-used tools. Let the model search and load them on demand — this keeps the initial context small and accurate.
- **Namespaces:** group related tools using the `namespace` type with a `description`. This helps the model choose between overlapping domains (e.g. `crm` vs `billing`).
- **Reasoning models (o-series):** out of scope — not targeted by this workspace.
- **Token awareness:** Function definitions are injected into the system prompt and billed as input tokens. Shorten descriptions where possible; use tool search to defer heavy schemas.
- **`additionalProperties: false`** is required for strict mode — always add it.
- For code generation or high-stakes outputs: keep a human-in-the-loop review step.

### Safety (OpenAI-specific)

- Use the **Moderation API** (free) to filter unsafe content in tool outputs or user inputs.
- Pass `safety_identifier` (hashed user ID) with API requests to help OpenAI detect abuse patterns.
- **Adversarial testing / red-teaming:** test prompt injection vectors (e.g. tool output containing instructions like "ignore previous instructions").
- Constrain user input length to limit the prompt injection surface.
- Implement **HITL (human in the loop)** for high-stakes or irreversible tool actions.

---

## 4. Differences Summary

| Concern | Anthropic/Claude | OpenAI/GPT |
|---|---|---|
| Schema key | `input_schema` | `parameters` |
| Strict mode | `strict: true` (structured outputs) | `strict: true` + `additionalProperties: false` required |
| Optional fields | Standard JSON Schema nullable | Add `null` to type union + keep in `required` array |
| Examples in tool def | `input_examples` array (recommended) | Inline in description (avoid for reasoning models) |
| Tool response format | Flexible; test JSON/XML/Markdown | String preferred; JSON/error codes acceptable |
| Max tools in context | No hard limit; keep focused | Soft limit ~20; use tool search beyond that |
| Best model for complex tools | Claude Opus | GPT-4.1 / GPT-5 |
| Safety tooling | Built-in thinking/CoT, strict tool use | Moderation API, `safety_identifier`, HITL |
| Parallel calls | Default on; managed by model | Default on; disable with `parallel_tool_calls: false` |
| Deferred tools | N/A (use fewer tools instead) | `defer_loading: true` + tool search |

---

## 5. Skill Safety Checklist (for ClawHub or custom skills)

Before using any skill/tool — especially downloaded ones:

- [ ] **Least privilege:** Does the skill need the permissions it requests? (file system, network, exec, credentials)
- [ ] **No remote instruction execution:** Does the skill fetch remote content and act on it as instructions? Red flag.
- [ ] **Input validation:** Are tool inputs validated before any destructive or external action?
- [ ] **Reversibility:** Are destructive operations guarded? (prefer `trash` over `rm`, confirmations before sends)
- [ ] **Output scope:** Does the tool return only what's needed, or does it leak sensitive data?
- [ ] **Credential handling:** Are secrets passed as env vars, not hardcoded or logged?
- [ ] **Description quality:** Is the description detailed enough that the model won't misuse it?
- [ ] **Source trust:** Is the skill from a known, reviewed source? (see TOOLS.md notes on ClawHub flagged skills)
- [ ] **Error handling:** Does the skill return helpful errors, or raw stack traces that could leak internals?

---

## 6. OpenClaw Skill Authoring Notes

When writing new SKILL.md files or tools for zap:

- **Model hint in SKILL.md:** If a skill is optimized for a specific model, document which model and why.
- **Anthropic skills:** Provide `input_examples` for complex tools; use detailed multi-sentence descriptions.
- **OpenAI skills:** Enable `strict: true` + `additionalProperties: false`; keep tool count small; no examples for o-series reasoning models.
- **Cross-model skills:** Write descriptions that work for both — thorough, plain language, no model-specific tricks. Test on both before shipping.
- **Token hygiene:** Keep responses bounded. Document expected response size in the skill.
- **SKILL.md structure:** Document the target model family if the skill has model-specific behavior.

---

*Last updated: 2026-03-05 | Sources: Anthropic Docs, Anthropic Engineering Blog, OpenAI Docs*
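---

The strict-schema rules described in Sections 1 and 3 (every property listed in `required`, `null`-union types for optional fields, `enum` constraints, `additionalProperties: false`) can be illustrated with a small stdlib-only checker. This is a sketch of the contract for review purposes, not OpenAI's actual structured-outputs implementation, and `check_strict_args` is an illustrative name; it checks only required keys, unknown keys, null unions, and enums, not full JSON Schema type validation.

```python
def check_strict_args(schema: dict, args: dict) -> list[str]:
    """Return a list of violations of the strict-mode contract (sketch only)."""
    errors = []
    props = schema.get("properties", {})
    # Strict mode: every property must appear in `required`, so a missing
    # optional field is expressed as an explicit null, never by omission.
    missing = set(schema.get("required", [])) - args.keys()
    if missing:
        errors.append(f"missing required keys: {sorted(missing)}")
    # `additionalProperties: false` forbids keys outside the schema.
    unknown = args.keys() - props.keys()
    if unknown:
        errors.append(f"unknown keys: {sorted(unknown)}")
    for key, spec in props.items():
        if key not in args:
            continue
        value = args[key]
        # Optional fields carry "null" in their type union.
        allowed = spec["type"] if isinstance(spec["type"], list) else [spec["type"]]
        if value is None and "null" not in allowed:
            errors.append(f"{key}: null not allowed")
        # Enums make invalid states unrepresentable.
        if value is not None and "enum" in spec and value not in spec["enum"]:
            errors.append(f"{key}: {value!r} not in enum {spec['enum']}")
    return errors


# The weather schema from Section 3, reused as test data.
weather_params = {
    "type": "object",
    "properties": {
        "location": {"type": "string", "description": "City and country"},
        "units": {"type": ["string", "null"], "enum": ["celsius", "fahrenheit"]},
    },
    "required": ["location", "units"],
    "additionalProperties": False,
}
```

A call like `{"location": "Bogotá, Colombia", "units": null}` passes (the optional field is present but null), while omitting `units` or passing `"kelvin"` is flagged.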
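---

The response-bounding and actionable-truncation guidance (Sections 1 and 2) can be sketched as follows. The character budget and function name are hypothetical stand-ins; real limits, such as Claude Code's 25,000-token response cap, are token-based rather than character-based.

```python
MAX_CHARS = 4000  # stand-in budget; pick real limits from your own evals


def bounded_response(items: list[str], query: str, max_chars: int = MAX_CHARS) -> str:
    """Join result items until the budget is hit, then append actionable guidance."""
    out: list[str] = []
    used = 0
    for item in items:
        if used + len(item) + 1 > max_chars:
            # Tell the model what to do next, not just that data was cut.
            out.append(
                f"[Truncated: showing {len(out)} of {len(items)} results for "
                f"'{query}'. Use a more specific query to narrow results.]"
            )
            break
        out.append(item)
        used += len(item) + 1  # +1 for the joining newline
    return "\n".join(out)
```

The key design choice is that the truncation notice is part of the tool response itself, so the agent sees both the partial results and a concrete next step.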
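---

The "fewer, richer tools" principle from Section 1 (one `github_pr` tool with an `action` enum instead of three separate tools) might be dispatched like this; all handler names and return strings are hypothetical, and a real tool would call the GitHub API instead.

```python
from typing import Callable


def create_pr(args: dict) -> str:
    return f"created PR '{args['title']}'"


def review_pr(args: dict) -> str:
    return f"reviewed PR #{args['number']}"


def merge_pr(args: dict) -> str:
    return f"merged PR #{args['number']}"


# The tool's schema would expose `action` as an enum of these keys,
# making invalid states unrepresentable.
ACTIONS: dict[str, Callable[[dict], str]] = {
    "create": create_pr,
    "review": review_pr,
    "merge": merge_pr,
}


def github_pr(action: str, **kwargs) -> str:
    """Single entry point for the consolidated tool."""
    handler = ACTIONS.get(action)
    if handler is None:
        # Actionable, plain-language error rather than a stack trace.
        return f"Unknown action '{action}'. Valid actions: {sorted(ACTIONS)}."
    return handler(kwargs)
```

One tool with an action enum keeps the model's tool list short, and the unknown-action branch doubles as an example of the plain-language error style recommended above.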