LLM Tool / Skill Best Practices
Sources: Anthropic official docs + engineering blog · OpenAI official docs (March 2026)
Purpose: Reference for building, reviewing, or importing skills — especially when the target LLM is Claude (Anthropic) or GPT (OpenAI).
1. Universal Principles (apply to both)
Tool design philosophy
- Fewer, richer tools beat many narrow ones. Don't wrap every API endpoint 1:1. Group related operations into a single tool with an `action` parameter (e.g. one `github_pr` tool with `create`/`review`/`merge` rather than three separate tools).
- Design for agents, not developers. Agents have limited context; they don't iterate cheaply over lists. Build tools that do the heavy lifting (filter, search, compile) so the agent only sees relevant results.
- Each tool needs a clear, distinct purpose. Overlapping tools confuse model selection and waste context.
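To make the consolidation pattern concrete, here is a minimal sketch of a single `github_pr` tool with an `action` parameter. The handler bodies and return shapes are hypothetical stand-ins, not from any real GitHub integration:

```python
# Sketch: one consolidated tool replacing three narrow ones.
# All handler logic below is a hypothetical stand-in.

def github_pr(action: str, **kwargs) -> dict:
    """Single PR tool; the `action` parameter selects the operation."""
    handlers = {
        "create": lambda: {"status": "created", "title": kwargs.get("title")},
        "review": lambda: {"status": "reviewed", "pr": kwargs.get("pr_number")},
        "merge":  lambda: {"status": "merged", "pr": kwargs.get("pr_number")},
    }
    if action not in handlers:
        # Actionable, plain-language error rather than a raw code.
        return {"error": f"Unknown action '{action}'. Valid actions: create, review, merge."}
    return handlers[action]()
```

The agent sees one tool definition instead of three, and the `action` enum in the schema documents the valid operations in one place.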
Naming
- Use `service_resource_verb` namespacing: `github_issues_search`, `slack_channel_send`.
- Prefix-based vs. suffix-based namespacing can have measurable accuracy differences — test with your target model.
Descriptions (most impactful factor)
- Write 3–4+ sentences per tool. Include:
- What it does
- When to use it (and when NOT to)
- What each parameter means + valid values
- What the tool returns / does NOT return
- Edge cases and caveats
- Bad: "Gets the stock price for a ticker."
- Good: "Retrieves the current stock price for a given ticker symbol. The ticker must be a valid symbol for a publicly traded company on a major US stock exchange (NYSE or NASDAQ). Returns the latest trade price in USD. Use when the user asks about the current or most recent price of a specific stock. Does not provide historical data, company info, or forecasts."
Input schema
- Use JSON Schema with typed, described properties.
- Mark truly required params as `required`; use `null` union types for optional fields.
- Use `enum` to constrain values where possible — makes invalid states unrepresentable.
- Pass the intern test: could a smart intern use the function correctly with only what you've given the model? If not, add the missing context.
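As a sketch of how `enum`, `null` unions, and `required` work together, the following is a minimal hand-rolled check (not a full JSON Schema validator; the `ticker`/`exchange` schema is illustrative):

```python
# Illustrative check for the schema patterns above: `enum` constrains
# values, optional fields use a ["string", "null"] type union, and
# `required` lists only truly required params. Not a full validator.

SCHEMA = {
    "type": "object",
    "properties": {
        "ticker": {"type": "string"},
        "exchange": {"type": ["string", "null"], "enum": ["NYSE", "NASDAQ"]},
    },
    "required": ["ticker"],
}

def check(args: dict, schema: dict = SCHEMA) -> list[str]:
    errors = []
    for name in schema["required"]:
        if name not in args:
            errors.append(f"missing required parameter: {name}")
    for name, spec in schema["properties"].items():
        if name not in args or args[name] is None:
            continue  # optional/null fields pass
        allowed = spec.get("enum")
        if allowed and args[name] not in allowed:
            errors.append(f"{name} must be one of {allowed}")
    return errors
```

In production, prefer a real JSON Schema library; the point here is that the `enum` makes an invalid exchange unrepresentable rather than something the model has to guess about.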
Tool responses
- Return only high-signal information. Strip opaque internal IDs, blob URLs, mime types, etc.
- Prefer semantic identifiers (names, slugs) over UUIDs where possible; if you need both, expose a `response_format: "concise" | "detailed"` param.
- Implement pagination/filtering/truncation for responses that could bloat context.
- On errors: return actionable, plain-language error messages — not raw stack traces or error codes.
- Choose response format (JSON, XML, Markdown) based on your own evals; no universal winner.
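A minimal sketch of the `response_format` idea, assuming a hypothetical user record (field names are illustrative):

```python
# Sketch: one response payload, two verbosity levels. The default
# "concise" view strips low-signal internals (UUIDs, blob URLs) and
# keeps semantic identifiers the model can actually use.

def format_user(record: dict, response_format: str = "concise") -> dict:
    if response_format == "detailed":
        return record  # everything, including internal IDs
    return {k: v for k, v in record.items() if k in ("name", "slug", "email")}

record = {
    "uuid": "9f1c-hypothetical-id",      # opaque internal ID
    "blob_url": "https://storage.example/avatar",
    "name": "Ada Lovelace",
    "slug": "ada-lovelace",
    "email": "ada@example.com",
}
```

The concise default keeps tokens low; the agent can re-call with `response_format="detailed"` only when it actually needs the internals.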
Safety fundamentals
- Never give tools more permissions than needed (least privilege).
- Treat all tool input as untrusted; validate before acting.
- For destructive/irreversible actions: require explicit confirmation or human-in-the-loop.
- Log tool calls for auditability.
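The confirmation and audit-logging points can be combined in one small wrapper. A sketch with illustrative names (the decorator and `confirmed` flag are assumptions, not a real framework):

```python
# Sketch: require explicit confirmation before an irreversible action,
# and record every call for auditability. Names are illustrative.

audit_log = []

def guarded(tool_name: str, destructive: bool = False):
    def decorator(fn):
        def wrapper(*args, confirmed: bool = False, **kwargs):
            audit_log.append((tool_name, args, kwargs))
            if destructive and not confirmed:
                return {"error": f"{tool_name} is destructive. "
                                 "Re-call with confirmed=True to proceed."}
            return fn(*args, **kwargs)
        return wrapper
    return decorator

@guarded("delete_repo", destructive=True)
def delete_repo(name: str) -> dict:
    return {"status": "deleted", "name": name}
```

The refusal message is itself actionable: it tells the agent exactly how to proceed once a human (or policy) has approved the action.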
2. Anthropic / Claude-specific
Sources:
- https://platform.claude.com/docs/en/agents-and-tools/tool-use/implement-tool-use
- https://www.anthropic.com/engineering/writing-tools-for-agents
Tool definition format
```json
{
  "name": "get_stock_price",
  "description": "Retrieves the current stock price for a given ticker symbol. Must be a valid NYSE or NASDAQ ticker. Returns latest trade price in USD. Use when the user asks about current price of a specific stock. Does not return historical data or other company information.",
  "input_schema": {
    "type": "object",
    "properties": {
      "ticker": {
        "type": "string",
        "description": "The stock ticker symbol, e.g. AAPL for Apple Inc."
      }
    },
    "required": ["ticker"]
  },
  "input_examples": [
    { "ticker": "AAPL" },
    { "ticker": "MSFT" }
  ]
}
```
Model selection guidance
| Scenario | Recommended model |
|---|---|
| Complex multi-tool tasks, ambiguous queries | Claude Opus (latest) |
| Simple, well-defined single tools | Claude Haiku (note: may infer missing params) |
| Extended thinking + tools | See extended thinking guide — special rules apply |
Claude-specific tips
- Use the `input_examples` field for tools with nested objects, optional params, or format-sensitive inputs — helps Claude call them correctly.
- Use `strict: true` in tool definitions for production agents (guarantees schema conformance via structured outputs).
- Interleaved thinking: Enable for evals to understand why Claude picks (or skips) certain tools.
- Consolidate tools that are always called in sequence: if `get_weather` is always followed by `format_weather_card`, merge them.
- Keep tool response size bounded — Claude Code defaults to a 25,000-token cap on responses.
- Truncation messages should be actionable: tell Claude what to do next (e.g., "Use a more specific query to narrow results").
- Claude Haiku will sometimes infer/hallucinate missing required parameters — use more explicit descriptions or upgrade to Opus for higher-stakes flows.
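A sketch of bounded responses with an actionable truncation message. The 25,000 figure mirrors the Claude Code default mentioned above; the 4-characters-per-token ratio is a rough heuristic, not a real tokenizer:

```python
# Sketch: bound tool-response size and make the truncation message
# actionable. The chars/4 token estimate is a crude approximation;
# use a real tokenizer for accurate budgeting.

MAX_TOKENS = 25_000

def bound_response(text: str, max_tokens: int = MAX_TOKENS) -> str:
    max_chars = max_tokens * 4  # rough token estimate
    if len(text) <= max_chars:
        return text
    return (text[:max_chars]
            + "\n[Truncated. Use a more specific query to narrow results.]")
```

The trailing message tells the model what to do next instead of leaving it to guess why the output stopped.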
Evaluation approach (Anthropic recommended)
- Generate real-world tasks (not toy examples) — include multi-step tasks requiring 10+ tool calls.
- Run eval agents with reasoning/CoT blocks enabled.
- Collect metrics: accuracy, total tool calls, errors, token usage.
- Feed transcripts back to Claude Code for automated tool improvement.
- Use held-out test sets to avoid overfitting.
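The metrics in the list above can be aggregated from eval transcripts with a small accumulator. The transcript structure here is hypothetical, not a real Anthropic format:

```python
# Sketch: collect accuracy, total tool calls, errors, and token usage
# from eval transcripts. The transcript dict layout is an assumption.
from dataclasses import dataclass

@dataclass
class EvalMetrics:
    tasks: int = 0
    solved: int = 0
    tool_calls: int = 0
    errors: int = 0
    tokens: int = 0

    @property
    def accuracy(self) -> float:
        return self.solved / self.tasks if self.tasks else 0.0

def aggregate(transcripts: list[dict]) -> EvalMetrics:
    m = EvalMetrics()
    for t in transcripts:
        m.tasks += 1
        m.solved += t["solved"]
        m.tool_calls += len(t["tool_calls"])
        m.errors += sum(1 for c in t["tool_calls"] if c.get("error"))
        m.tokens += t["tokens"]
    return m
```

Tracking total tool calls alongside accuracy matters: two tools with equal accuracy can differ sharply in how many calls (and tokens) they cost per solved task.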
3. OpenAI / GPT-specific
Sources:
- https://platform.openai.com/docs/guides/function-calling
- https://platform.openai.com/docs/guides/safety-best-practices
Tool definition format
```json
{
  "type": "function",
  "name": "get_weather",
  "description": "Retrieves current weather for the given location.",
  "strict": true,
  "parameters": {
    "type": "object",
    "properties": {
      "location": {
        "type": "string",
        "description": "City and country e.g. Bogotá, Colombia"
      },
      "units": {
        "type": ["string", "null"],
        "enum": ["celsius", "fahrenheit"],
        "description": "Units the temperature will be returned in."
      }
    },
    "required": ["location", "units"],
    "additionalProperties": false
  }
}
```
OpenAI-specific tips
- Always enable `strict: true` — guarantees function calls match the schema exactly (uses structured outputs under the hood). Requires `additionalProperties: false` and all fields in `required` (use the `null` type for optional fields).
- Keep initially available functions under ~20 — performance degrades with too many tools in context. Use tool search for large tool libraries.
- `parallel_tool_calls`: enabled by default. Disable if your tools have ordering dependencies, or if using `gpt-4.1-nano` (known to double-call the same tool).
- `tool_choice`: Default is `auto`. Use `required` to force a call, or a named function to force a specific one. Use `allowed_tools` to restrict available tools per turn without breaking the prompt cache.
- Tool search (`defer_loading: true`): For large ecosystems, defer rarely-used tools. Let the model search and load them on demand — keeps initial context small and accurate.
- Namespaces: Group related tools using the `namespace` type with a `description`. Helps the model choose between overlapping domains (e.g., `crm` vs `billing`).
- Reasoning models (o-series): Out of scope — not targeted by this workspace.
- Token awareness: Function definitions are injected into the system prompt and billed as input tokens. Shorten descriptions where possible; use tool search to defer heavy schemas.
- `additionalProperties: false` is required for strict mode — always add it.
- For code generation or high-stakes outputs: keep a human-in-the-loop review step.
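The strict-mode requirements (all fields in `required`, optional fields as `null` unions, `additionalProperties: false`) are mechanical enough to apply with a helper. A sketch, not an official OpenAI utility:

```python
# Sketch: rewrite a plain JSON Schema object into OpenAI strict-mode
# shape. Optional fields (not originally in `required`) get a null
# type union, every property is listed in `required`, and
# additionalProperties is forced to false. Original schema untouched.
import copy

def to_strict(schema: dict) -> dict:
    s = copy.deepcopy(schema)
    originally_required = set(s.get("required", []))
    for name, spec in s["properties"].items():
        if name not in originally_required:
            t = spec.get("type")
            types = t if isinstance(t, list) else [t]
            if "null" not in types:
                spec["type"] = types + ["null"]
    s["required"] = list(s["properties"].keys())
    s["additionalProperties"] = False
    return s
```

Applied to the weather example above, `units` would gain the `["string", "null"]` union automatically while `location` stays a plain string.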
Safety (OpenAI-specific)
- Use the Moderation API (free) to filter unsafe content in tool outputs or user inputs.
- Pass `safety_identifier` (hashed user ID) with API requests to help OpenAI detect abuse patterns.
- Adversarial testing / red-teaming: test prompt injection vectors (e.g., tool output containing instructions like "ignore previous instructions").
- Constrain user input length to limit prompt injection surface.
- Implement HITL (human in the loop) for high-stakes or irreversible tool actions.
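A naive sketch of the input-length and injection-screening points combined. The phrase list is illustrative only; pattern matching is a weak first layer, not a real defense against prompt injection:

```python
# Sketch: cap input length and flag obvious injection phrases before
# text reaches a tool. Deliberately naive; real defenses need layered
# controls (moderation, privilege separation, HITL).

MAX_INPUT_CHARS = 4_000
SUSPICIOUS = ("ignore previous instructions", "disregard your system prompt")

def screen_input(text: str) -> dict:
    flags = [p for p in SUSPICIOUS if p in text.lower()]
    return {
        "text": text[:MAX_INPUT_CHARS],
        "truncated": len(text) > MAX_INPUT_CHARS,
        "injection_flags": flags,
    }
```

Flagged inputs should route to stricter handling (e.g., HITL review) rather than being silently rejected, since false positives are common with phrase lists.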
4. Differences Summary
| Concern | Anthropic/Claude | OpenAI/GPT |
|---|---|---|
| Schema key | `input_schema` | `parameters` |
| Strict mode | `strict: true` (structured outputs) | `strict: true` + `additionalProperties: false` required |
| Optional fields | Standard JSON Schema nullable | Add `null` to type union + keep in `required` array |
| Examples in tool def | `input_examples` array (recommended) | Inline in description (avoid for reasoning models) |
| Tool response format | Flexible; test JSON/XML/Markdown | String preferred; JSON/error codes acceptable |
| Max tools in context | No hard limit; keep focused | Soft limit ~20; use tool search beyond that |
| Best model for complex tools | Claude Opus | GPT-4.1 / GPT-5 |
| Safety tooling | Built-in thinking/CoT, strict tool use | Moderation API, `safety_identifier`, HITL |
| Parallel calls | Default on; managed by model | Default on; disable with `parallel_tool_calls: false` |
| Deferred tools | N/A (use fewer tools instead) | `defer_loading: true` + tool search |
5. Skill Safety Checklist (for ClawHub or custom skills)
Before using any skill/tool — especially downloaded ones:
- Least privilege: Does the skill need the permissions it requests? (file system, network, exec, credentials)
- No remote instruction execution: Does the skill fetch remote content and act on it as instructions? Red flag.
- Input validation: Are tool inputs validated before any destructive or external action?
- Reversibility: Are destructive operations guarded? (prefer `trash` over `rm`, confirmations before sends)
- Output scope: Does the tool return only what's needed, or does it leak sensitive data?
- Credential handling: Are secrets passed as env vars, not hardcoded or logged?
- Description quality: Is the description detailed enough that the model won't misuse it?
- Source trust: Is the skill from a known, reviewed source? (see TOOLS.md notes on ClawHub flagged skills)
- Error handling: Does the skill return helpful errors, or raw stack traces that could leak internals?
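Several checklist items can be run mechanically against a skill manifest before review. A sketch only: the manifest fields below are a hypothetical shape, not a real ClawHub schema:

```python
# Sketch: mechanical first pass over a skill manifest for the
# checklist above (least privilege, remote instructions, description
# quality). The manifest format is a hypothetical assumption.

RISKY_PERMISSIONS = {"exec", "credentials", "network"}

def lint_skill(manifest: dict) -> list[str]:
    warnings = []
    for perm in manifest.get("permissions", []):
        if perm in RISKY_PERMISSIONS:
            warnings.append(f"requests risky permission: {perm}")
    if manifest.get("fetches_remote_instructions"):
        warnings.append("acts on remote content as instructions (red flag)")
    if len(manifest.get("description", "")) < 80:
        warnings.append("description too short for the model to use safely")
    return warnings
```

A linter like this catches the obvious cases; human review of the skill's actual code remains the real gate, especially for downloaded skills.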
6. OpenClaw Skill Authoring Notes
When writing new SKILL.md files or tools for zap:
- Model hint in SKILL.md: If a skill is optimized for a specific model, document which model and why.
- Anthropic skills: Provide `input_examples` for complex tools; use detailed multi-sentence descriptions.
- OpenAI skills: Enable `strict: true` + `additionalProperties: false`; keep tool count small; no examples for o-series reasoning models.
- Cross-model skills: Write descriptions that work for both — thorough, plain language, no model-specific tricks. Test on both before shipping.
- Token hygiene: Keep responses bounded. Document expected response size in the skill.
- SKILL.md structure: Document the target model family if the skill has model-specific behavior.
Last updated: 2026-03-05 | Sources: Anthropic Docs, Anthropic Engineering Blog, OpenAI Docs