swarm-zap/skills/llm-tool-best-practices/LLM_TOOL_BEST_PRACTICES.md

LLM Tool / Skill Best Practices

Sources: Anthropic official docs + engineering blog · OpenAI official docs (March 2026)
Purpose: Reference for building, reviewing, or importing skills — especially when the target LLM is Claude (Anthropic) or GPT (OpenAI).


1. Universal Principles (apply to both)

Tool design philosophy

  • Fewer, richer tools beat many narrow ones. Don't wrap every API endpoint 1:1. Group related operations into a single tool with an action parameter (e.g. one github_pr tool with create/review/merge rather than three separate tools).
  • Design for agents, not developers. Agents have limited context; they don't iterate cheaply over lists. Build tools that do the heavy lifting (filter, search, compile) so the agent only sees relevant results.
  • Each tool needs a clear, distinct purpose. Overlapping tools confuse model selection and waste context.
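The consolidation advice above can be sketched as a single tool definition with an action parameter. The github_pr name, its actions, and all field names here are illustrative assumptions, not a real API:

```python
# Hypothetical consolidated tool: one github_pr tool instead of three
# separate create/review/merge tools. Schema follows JSON Schema.
github_pr_tool = {
    "name": "github_pr",
    "description": (
        "Manage GitHub pull requests. Use action='create' to open a PR, "
        "'review' to leave a review, 'merge' to merge an approved PR. "
        "Does not modify issues or branches directly."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "action": {
                "type": "string",
                "enum": ["create", "review", "merge"],
                "description": "Which PR operation to perform.",
            },
            "repo": {
                "type": "string",
                "description": "Repository in owner/name form, e.g. octocat/hello-world.",
            },
            "pr_number": {
                "type": ["integer", "null"],
                "description": "PR number; required for review/merge, null for create.",
            },
        },
        "required": ["action", "repo", "pr_number"],
    },
}
```

The enum on action is what makes the consolidation safe: the model cannot invent a fourth operation.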

Naming

  • Use service_resource_verb namespacing: github_issues_search, slack_channel_send.
  • Prefix-based vs. suffix-based namespacing can have measurable accuracy differences — test with your target model.

Descriptions (most impactful factor)

  • Write 3–4+ sentences per tool. Include:
    • What it does
    • When to use it (and when NOT to)
    • What each parameter means + valid values
    • What the tool returns / does NOT return
    • Edge cases and caveats
  • Bad: "Gets the stock price for a ticker."
  • Good: "Retrieves the current stock price for a given ticker symbol. The ticker must be a valid symbol for a publicly traded company on a major US stock exchange (NYSE or NASDAQ). Returns the latest trade price in USD. Use when the user asks about the current or most recent price of a specific stock. Does not provide historical data, company info, or forecasts."

Input schema

  • Use JSON Schema with typed, described properties.
  • Mark truly required params as required; use null union types for optional fields.
  • Use enum to constrain values where possible — makes invalid states unrepresentable.
  • Pass the intern test: could a smart intern use the function correctly with only what you've given the model? If not, add the missing context.
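Parts of the intern test can be automated. A heuristic lint sketch for the rules above (not a full JSON Schema validator; the flag wordings are illustrative):

```python
def lint_input_schema(schema: dict) -> list[str]:
    """Flag common schema smells: missing descriptions, untyped
    properties, and optional fields not modeled as null unions."""
    problems = []
    props = schema.get("properties", {})
    required = set(schema.get("required", []))
    for name, spec in props.items():
        if "type" not in spec:
            problems.append(f"{name}: missing type")
        if not spec.get("description"):
            problems.append(f"{name}: missing description")
        if name not in required and "enum" not in spec:
            # Optional fields should carry an explicit null union.
            t = spec.get("type")
            if not (isinstance(t, list) and "null" in t):
                problems.append(f"{name}: optional field should allow null")
    return problems
```

Running this over every tool definition in CI catches the most common omissions before a model ever sees them.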

Tool responses

  • Return only high-signal information. Strip opaque internal IDs, blob URLs, mime types, etc.
  • Prefer semantic identifiers (names, slugs) over UUIDs where possible; if you need both, expose a response_format: "concise" | "detailed" param.
  • Implement pagination/filtering/truncation for responses that could bloat context.
  • On errors: return actionable, plain-language error messages — not raw stack traces or error codes.
  • Choose response format (JSON, XML, Markdown) based on your own evals; no universal winner.
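A minimal response-shaping sketch combining the points above; the field names (items, truncated, hint) and the stripped internal keys are illustrative assumptions:

```python
MAX_ITEMS = 20  # bound response size to protect the context window

def shape_response(items: list[dict], query: str) -> dict:
    """Return a bounded, high-signal tool response with an
    actionable message on empty results and on truncation."""
    if not items:
        return {
            "items": [],
            "error": f"No results for '{query}'. Try broader terms or check spelling.",
        }
    # Strip low-signal internal fields before returning to the model.
    cleaned = [
        {k: v for k, v in item.items() if k not in {"uuid", "blob_url", "mime_type"}}
        for item in items
    ]
    response = {"items": cleaned[:MAX_ITEMS], "truncated": len(cleaned) > MAX_ITEMS}
    if response["truncated"]:
        # Actionable truncation message: tell the model what to do next.
        response["hint"] = (
            f"Showing {MAX_ITEMS} of {len(cleaned)} results. "
            "Use a more specific query to narrow results."
        )
    return response
```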

Safety fundamentals

  • Never give tools more permissions than needed (least privilege).
  • Treat all tool input as untrusted; validate before acting.
  • For destructive/irreversible actions: require explicit confirmation or human-in-the-loop.
  • Log tool calls for auditability.
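A dispatch-gate sketch for these fundamentals, assuming a simple name-to-callable registry; the tool names and return shapes are hypothetical:

```python
import logging

logger = logging.getLogger("tool_audit")
DESTRUCTIVE = {"delete_repo", "send_email"}  # illustrative tool names

def call_tool(name: str, args: dict, registry: dict, *, confirmed: bool = False):
    """Validate, audit-log, and require explicit confirmation for
    destructive tools before dispatching to the registered callable."""
    if name not in registry:
        raise ValueError(f"Unknown tool: {name}")  # treat input as untrusted
    if name in DESTRUCTIVE and not confirmed:
        # Human-in-the-loop gate for irreversible actions.
        return {"error": f"'{name}' is destructive; re-call with confirmed=True "
                         "after human approval."}
    logger.info("tool_call name=%s args=%s", name, args)  # auditability
    return registry[name](**args)
```

Least privilege lives in what the registry exposes: an agent handed only a read-only registry cannot reach the destructive tools at all.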

2. Anthropic / Claude-specific

Sources:

Tool definition format

```json
{
  "name": "get_stock_price",
  "description": "Retrieves the current stock price for a given ticker symbol. Must be a valid NYSE or NASDAQ ticker. Returns latest trade price in USD. Use when the user asks about current price of a specific stock. Does not return historical data or other company information.",
  "input_schema": {
    "type": "object",
    "properties": {
      "ticker": {
        "type": "string",
        "description": "The stock ticker symbol, e.g. AAPL for Apple Inc."
      }
    },
    "required": ["ticker"]
  },
  "input_examples": [
    { "ticker": "AAPL" },
    { "ticker": "MSFT" }
  ]
}
```

Model selection guidance

| Scenario | Recommended model |
| --- | --- |
| Complex multi-tool tasks, ambiguous queries | Claude Opus (latest) |
| Simple, well-defined single tools | Claude Haiku (note: may infer missing params) |
| Extended thinking + tools | See extended thinking guide — special rules apply |

Claude-specific tips

  • Use input_examples field for tools with nested objects, optional params, or format-sensitive inputs — helps Claude call them correctly.
  • Use strict: true in tool definitions for production agents (guarantees schema conformance via structured outputs).
  • Interleaved thinking: Enable for evals to understand why Claude picks (or skips) certain tools.
  • Consolidate tools that are always called in sequence: if get_weather is always followed by format_weather_card, merge them.
  • Keep tool response size bounded — Claude Code defaults to 25,000 token cap on responses.
  • Truncation messages should be actionable: tell Claude what to do next (e.g., "Use a more specific query to narrow results").
  • Claude Haiku will sometimes infer/hallucinate missing required parameters — use more explicit descriptions or upgrade to Opus for higher-stakes flows.
Evaluating tools

  1. Generate real-world tasks (not toy examples) — include multi-step tasks requiring 10+ tool calls.
  2. Run eval agents with reasoning/CoT blocks enabled.
  3. Collect metrics: accuracy, total tool calls, errors, token usage.
  4. Feed transcripts back to Claude Code for automated tool improvement.
  5. Use held-out test sets to avoid overfitting.
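The metrics step (step 3) could collect per-task records like this; the field and summary names are illustrative, not a standard schema:

```python
from dataclasses import dataclass

@dataclass
class EvalRecord:
    """Per-task metrics from one eval-agent run."""
    task_id: str
    success: bool
    tool_calls: int
    errors: int
    tokens: int

def summarize(records: list[EvalRecord]) -> dict:
    """Aggregate accuracy, call counts, error rate, and token usage."""
    n = len(records)
    total_calls = sum(r.tool_calls for r in records)
    return {
        "accuracy": sum(r.success for r in records) / n,
        "avg_tool_calls": total_calls / n,
        "error_rate": sum(r.errors for r in records) / total_calls,
        "total_tokens": sum(r.tokens for r in records),
    }
```

Comparing these summaries before and after a tool-description change is what turns step 4 into a measurable loop rather than guesswork.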

3. OpenAI / GPT-specific

Sources:

Tool definition format

```json
{
  "type": "function",
  "name": "get_weather",
  "description": "Retrieves current weather for the given location.",
  "strict": true,
  "parameters": {
    "type": "object",
    "properties": {
      "location": {
        "type": "string",
        "description": "City and country e.g. Bogotá, Colombia"
      },
      "units": {
        "type": ["string", "null"],
        "enum": ["celsius", "fahrenheit"],
        "description": "Units the temperature will be returned in."
      }
    },
    "required": ["location", "units"],
    "additionalProperties": false
  }
}
```

OpenAI-specific tips

  • Always enable strict: true — guarantees function calls match schema exactly (uses structured outputs under the hood). Requires additionalProperties: false and all fields in required (use null type for optional fields).
  • Keep initially available functions under ~20 — performance degrades with too many tools in context. Use tool search for large tool libraries.
  • parallel_tool_calls: enabled by default. Disable if your tools have ordering dependencies, or if using gpt-4.1-nano (known to double-call same tool).
  • tool_choice: Default is auto. Use required to force a call, or named function to force a specific one. Use allowed_tools to restrict available tools per turn without breaking prompt cache.
  • Tool search (defer_loading: true): For large ecosystems, defer rarely-used tools. Let the model search and load them on demand — keeps initial context small and accurate.
  • Namespaces: Group related tools using the namespace type with a description. Helps model choose between overlapping domains (e.g., crm vs billing).
  • Reasoning models (o-series): Out of scope — not targeted by this workspace.
  • Token awareness: Function definitions are injected into the system prompt and billed as input tokens. Shorten descriptions where possible; use tool search to defer heavy schemas.
  • additionalProperties: false is required for strict mode — always add it.
  • For code generation or high-stakes outputs: keep a human-in-the-loop review step.
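The strict-mode requirements above can be linted mechanically. A heuristic sketch that checks only the top-level object of a function definition:

```python
def check_strict_ready(fn: dict) -> list[str]:
    """Check an OpenAI-style function definition for strict-mode
    requirements: strict flag set, additionalProperties: false,
    and every property listed in required."""
    problems = []
    params = fn.get("parameters", {})
    if not fn.get("strict"):
        problems.append("strict is not true")
    if params.get("additionalProperties") is not False:
        problems.append("additionalProperties must be false")
    missing = set(params.get("properties", {})) - set(params.get("required", []))
    if missing:
        # Optional fields belong in required with a null type union.
        problems.append(f"properties missing from required: {sorted(missing)}")
    return problems
```

A full checker would recurse into nested objects, which strict mode constrains the same way.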

Safety (OpenAI-specific)

  • Use the Moderation API (free) to filter unsafe content in tool outputs or user inputs.
  • Pass safety_identifier (hashed user ID) with API requests to help OpenAI detect abuse patterns.
  • Adversarial testing / red-teaming: test prompt injection vectors (e.g., tool output containing instructions like "ignore previous instructions").
  • Constrain user input length to limit prompt injection surface.
  • Implement HITL (human in the loop) for high-stakes or irreversible tool actions.
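A naive sketch combining the last two points (injection screening plus an input-length cap); real deployments should rely on the Moderation API and proper classifiers rather than regex heuristics like these:

```python
import re

# Naive injection heuristics; patterns and the flag message are illustrative.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all )?previous instructions", re.I),
    re.compile(r"you are now", re.I),
    re.compile(r"system prompt", re.I),
]

MAX_INPUT_CHARS = 4_000  # constrain input length to limit injection surface

def screen_untrusted_text(text: str) -> str:
    """Truncate and flag instruction-like content in untrusted text
    (user input or tool output) before it reaches the model."""
    text = text[:MAX_INPUT_CHARS]
    if any(p.search(text) for p in INJECTION_PATTERNS):
        return "[flagged: possible prompt injection removed]"
    return text
```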

4. Differences Summary

| Concern | Anthropic/Claude | OpenAI/GPT |
| --- | --- | --- |
| Schema key | input_schema | parameters |
| Strict mode | strict: true (structured outputs) | strict: true; requires additionalProperties: false |
| Optional fields | Standard JSON Schema nullable | Add null to type union + keep in required array |
| Examples in tool def | input_examples array (recommended) | Inline in description (avoid for reasoning models) |
| Tool response format | Flexible; test JSON/XML/Markdown | String preferred; JSON/error codes acceptable |
| Max tools in context | No hard limit; keep focused | Soft limit ~20; use tool search beyond that |
| Best model for complex tools | Claude Opus | GPT-4.1 / GPT-5 |
| Safety tooling | Built-in thinking/CoT, strict tool use | Moderation API, safety_identifier, HITL |
| Parallel calls | Default on; managed by model | Default on; disable with parallel_tool_calls: false |
| Deferred tools | N/A (use fewer tools instead) | defer_loading: true + tool search |
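The schema-key and strict-mode differences above imply a mechanical conversion between formats. A sketch handling only the top-level differences (a full converter would also rewrite optional fields as null type unions):

```python
def to_openai(tool: dict) -> dict:
    """Convert an Anthropic-style tool definition (input_schema key)
    to an OpenAI Responses-style function definition with strict mode
    prerequisites applied at the top level."""
    schema = dict(tool["input_schema"])
    schema["additionalProperties"] = False
    # OpenAI strict mode wants every property in required; optionality
    # is modeled with null unions instead of omission from required.
    schema["required"] = sorted(schema.get("properties", {}))
    return {
        "type": "function",
        "name": tool["name"],
        "description": tool["description"],
        "strict": True,
        "parameters": schema,  # Anthropic calls this input_schema
    }
```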

5. Skill Safety Checklist (for ClawHub or custom skills)

Before using any skill/tool — especially downloaded ones:

  • Least privilege: Does the skill need the permissions it requests? (file system, network, exec, credentials)
  • No remote instruction execution: Does the skill fetch remote content and act on it as instructions? Red flag.
  • Input validation: Are tool inputs validated before any destructive or external action?
  • Reversibility: Are destructive operations guarded? (prefer trash over rm, confirmations before sends)
  • Output scope: Does the tool return only what's needed, or does it leak sensitive data?
  • Credential handling: Are secrets passed as env vars, not hardcoded or logged?
  • Description quality: Is the description detailed enough that the model won't misuse it?
  • Source trust: Is the skill from a known, reviewed source? (see TOOLS.md notes on ClawHub flagged skills)
  • Error handling: Does the skill return helpful errors, or raw stack traces that could leak internals?

6. OpenClaw Skill Authoring Notes

When writing new SKILL.md files or tools for zap:

  • Model hint in SKILL.md: If a skill is optimized for a specific model, document which model and why.
  • Anthropic skills: Provide input_examples for complex tools; use detailed multi-sentence descriptions.
  • OpenAI skills: Enable strict: true + additionalProperties: false; keep tool count small; no examples for o-series reasoning models.
  • Cross-model skills: Write descriptions that work for both — thorough, plain language, no model-specific tricks. Test on both before shipping.
  • Token hygiene: Keep responses bounded. Document expected response size in the skill.
  • SKILL.md structure: Document the target model family if the skill has model-specific behavior.

Last updated: 2026-03-05 | Sources: Anthropic Docs, Anthropic Engineering Blog, OpenAI Docs