swarm-zap/skills/llm-tool-best-practices/LLM_TOOL_BEST_PRACTICES.md

LLM Tool / Skill Best Practices

Sources: Anthropic official docs + engineering blog · OpenAI official docs (March 2026)
Purpose: Reference for building, reviewing, or importing skills — especially when the target LLM is Claude (Anthropic) or GPT (OpenAI).


1. Universal Principles (apply to both)

Tool design philosophy

  • Fewer, richer tools beat many narrow ones. Don't wrap every API endpoint 1:1. Group related operations into a single tool with an action parameter (e.g. one github_pr tool with create/review/merge rather than three separate tools).
  • Design for agents, not developers. Agents have limited context; they don't iterate cheaply over lists. Build tools that do the heavy lifting (filter, search, compile) so the agent only sees relevant results.
  • Each tool needs a clear, distinct purpose. Overlapping tools confuse model selection and waste context.
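The consolidation advice above can be sketched as a single tool definition with an action parameter. The github_pr name, its actions, and all field names here are illustrative assumptions, not a real API:

```python
# Hypothetical consolidated tool: one github_pr tool instead of three
# separate create/review/merge tools. Schema follows JSON Schema.
github_pr_tool = {
    "name": "github_pr",
    "description": (
        "Manage GitHub pull requests. Use action='create' to open a PR, "
        "'review' to leave a review, 'merge' to merge an approved PR. "
        "Does not modify issues or branches directly."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "action": {
                "type": "string",
                "enum": ["create", "review", "merge"],
                "description": "Which PR operation to perform.",
            },
            "repo": {
                "type": "string",
                "description": "Repository in owner/name form, e.g. octocat/hello-world.",
            },
            "pr_number": {
                "type": ["integer", "null"],
                "description": "PR number; required for review/merge, null for create.",
            },
        },
        "required": ["action", "repo", "pr_number"],
    },
}
```

The enum on action is what makes the consolidation safe: the model cannot invent a fourth operation.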

Naming

  • Use service_resource_verb namespacing: github_issues_search, slack_channel_send.
  • Prefix-based vs. suffix-based namespacing can have measurable accuracy differences — test with your target model.

Descriptions (most impactful factor)

  • Write 3–4+ sentences per tool. Include:
    • What it does
    • When to use it (and when NOT to)
    • What each parameter means + valid values
    • What the tool returns / does NOT return
    • Edge cases and caveats
  • Bad: "Gets the stock price for a ticker."
  • Good: "Retrieves the current stock price for a given ticker symbol. The ticker must be a valid symbol for a publicly traded company on a major US stock exchange (NYSE or NASDAQ). Returns the latest trade price in USD. Use when the user asks about the current or most recent price of a specific stock. Does not provide historical data, company info, or forecasts."

Input schema

  • Use JSON Schema with typed, described properties.
  • Mark truly required params as required; use null union types for optional fields.
  • Use enum to constrain values where possible — makes invalid states unrepresentable.
  • Pass the intern test: could a smart intern use the function correctly with only what you've given the model? If not, add the missing context.
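Parts of the intern test can be automated. A heuristic lint sketch for the rules above (not a full JSON Schema validator; the flag wordings are illustrative):

```python
def lint_input_schema(schema: dict) -> list[str]:
    """Flag common schema smells: missing descriptions, untyped
    properties, and optional fields not modeled as null unions."""
    problems = []
    props = schema.get("properties", {})
    required = set(schema.get("required", []))
    for name, spec in props.items():
        if "type" not in spec:
            problems.append(f"{name}: missing type")
        if not spec.get("description"):
            problems.append(f"{name}: missing description")
        if name not in required and "enum" not in spec:
            # Optional fields should carry an explicit null union.
            t = spec.get("type")
            if not (isinstance(t, list) and "null" in t):
                problems.append(f"{name}: optional field should allow null")
    return problems
```

Running this over every tool definition in CI catches the most common omissions before a model ever sees them.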

Tool responses

  • Return only high-signal information. Strip opaque internal IDs, blob URLs, mime types, etc.
  • Prefer semantic identifiers (names, slugs) over UUIDs where possible; if you need both, expose a response_format: "concise" | "detailed" param.
  • Implement pagination/filtering/truncation for responses that could bloat context.
  • On errors: return actionable, plain-language error messages — not raw stack traces or error codes.
  • Choose response format (JSON, XML, Markdown) based on your own evals; no universal winner.
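A minimal response-shaping sketch combining the points above; the field names (items, truncated, hint) and the stripped internal keys are illustrative assumptions:

```python
MAX_ITEMS = 20  # bound response size to protect the context window

def shape_response(items: list[dict], query: str) -> dict:
    """Return a bounded, high-signal tool response with an
    actionable message on empty results and on truncation."""
    if not items:
        return {
            "items": [],
            "error": f"No results for '{query}'. Try broader terms or check spelling.",
        }
    # Strip low-signal internal fields before returning to the model.
    cleaned = [
        {k: v for k, v in item.items() if k not in {"uuid", "blob_url", "mime_type"}}
        for item in items
    ]
    response = {"items": cleaned[:MAX_ITEMS], "truncated": len(cleaned) > MAX_ITEMS}
    if response["truncated"]:
        # Actionable truncation message: tell the model what to do next.
        response["hint"] = (
            f"Showing {MAX_ITEMS} of {len(cleaned)} results. "
            "Use a more specific query to narrow results."
        )
    return response
```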

Safety fundamentals

  • Never give tools more permissions than needed (least privilege).
  • Treat all tool input as untrusted; validate before acting.
  • For destructive/irreversible actions: require explicit confirmation or human-in-the-loop.
  • Log tool calls for auditability.
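A dispatch-gate sketch for these fundamentals, assuming a simple name-to-callable registry; the tool names and return shapes are hypothetical:

```python
import logging

logger = logging.getLogger("tool_audit")
DESTRUCTIVE = {"delete_repo", "send_email"}  # illustrative tool names

def call_tool(name: str, args: dict, registry: dict, *, confirmed: bool = False):
    """Validate, audit-log, and require explicit confirmation for
    destructive tools before dispatching to the registered callable."""
    if name not in registry:
        raise ValueError(f"Unknown tool: {name}")  # treat input as untrusted
    if name in DESTRUCTIVE and not confirmed:
        # Human-in-the-loop gate for irreversible actions.
        return {"error": f"'{name}' is destructive; re-call with confirmed=True "
                         "after human approval."}
    logger.info("tool_call name=%s args=%s", name, args)  # auditability
    return registry[name](**args)
```

Least privilege lives in what the registry exposes: an agent handed only a read-only registry cannot reach the destructive tools at all.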

2. Anthropic / Claude-specific

Sources:

Tool definition format

```json
{
  "name": "get_stock_price",
  "description": "Retrieves the current stock price for a given ticker symbol. Must be a valid NYSE or NASDAQ ticker. Returns latest trade price in USD. Use when the user asks about current price of a specific stock. Does not return historical data or other company information.",
  "input_schema": {
    "type": "object",
    "properties": {
      "ticker": {
        "type": "string",
        "description": "The stock ticker symbol, e.g. AAPL for Apple Inc."
      }
    },
    "required": ["ticker"]
  },
  "input_examples": [
    { "ticker": "AAPL" },
    { "ticker": "MSFT" }
  ]
}
```

Model selection guidance

| Scenario | Recommended model |
| --- | --- |
| Complex multi-tool tasks, ambiguous queries | Claude Opus (latest) |
| Simple, well-defined single tools | Claude Haiku (note: may infer missing params) |
| Extended thinking + tools | See extended thinking guide — special rules apply |

Claude-specific tips

  • Use input_examples field for tools with nested objects, optional params, or format-sensitive inputs — helps Claude call them correctly.
  • Use strict: true in tool definitions for production agents (guarantees schema conformance via structured outputs).
  • Interleaved thinking: Enable for evals to understand why Claude picks (or skips) certain tools.
  • Consolidate tools that are always called in sequence: if get_weather is always followed by format_weather_card, merge them.
  • Keep tool response size bounded — Claude Code defaults to 25,000 token cap on responses.
  • Truncation messages should be actionable: tell Claude what to do next (e.g., "Use a more specific query to narrow results").
  • Claude Haiku will sometimes infer/hallucinate missing required parameters — use more explicit descriptions or upgrade to Opus for higher-stakes flows.
Evaluating tools

  1. Generate real-world tasks (not toy examples) — include multi-step tasks requiring 10+ tool calls.
  2. Run eval agents with reasoning/CoT blocks enabled.
  3. Collect metrics: accuracy, total tool calls, errors, token usage.
  4. Feed transcripts back to Claude Code for automated tool improvement.
  5. Use held-out test sets to avoid overfitting.
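The metrics step (step 3) could collect per-task records like this; the field and summary names are illustrative, not a standard schema:

```python
from dataclasses import dataclass

@dataclass
class EvalRecord:
    """Per-task metrics from one eval-agent run."""
    task_id: str
    success: bool
    tool_calls: int
    errors: int
    tokens: int

def summarize(records: list[EvalRecord]) -> dict:
    """Aggregate accuracy, call counts, error rate, and token usage."""
    n = len(records)
    total_calls = sum(r.tool_calls for r in records)
    return {
        "accuracy": sum(r.success for r in records) / n,
        "avg_tool_calls": total_calls / n,
        "error_rate": sum(r.errors for r in records) / total_calls,
        "total_tokens": sum(r.tokens for r in records),
    }
```

Comparing these summaries before and after a tool-description change is what turns step 4 into a measurable loop rather than guesswork.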

3. OpenAI / GPT-specific

Sources:

Tool definition format

```json
{
  "type": "function",
  "name": "get_weather",
  "description": "Retrieves current weather for the given location.",
  "strict": true,
  "parameters": {
    "type": "object",
    "properties": {
      "location": {
        "type": "string",
        "description": "City and country e.g. Bogotá, Colombia"
      },
      "units": {
        "type": ["string", "null"],
        "enum": ["celsius", "fahrenheit"],
        "description": "Units the temperature will be returned in."
      }
    },
    "required": ["location", "units"],
    "additionalProperties": false
  }
}
```

OpenAI-specific tips

  • Always enable strict: true — guarantees function calls match schema exactly (uses structured outputs under the hood). Requires additionalProperties: false and all fields in required (use null type for optional fields).
  • Keep initially available functions under ~20 — performance degrades with too many tools in context. Use tool search for large tool libraries.
  • parallel_tool_calls: enabled by default. Disable if your tools have ordering dependencies, or if using gpt-4.1-nano (known to double-call same tool).
  • tool_choice: Default is auto. Use required to force a call, or named function to force a specific one. Use allowed_tools to restrict available tools per turn without breaking prompt cache.
  • Tool search (defer_loading: true): For large ecosystems, defer rarely-used tools. Let the model search and load them on demand — keeps initial context small and accurate.
  • Namespaces: Group related tools using the namespace type with a description. Helps model choose between overlapping domains (e.g., crm vs billing).
  • Reasoning models (o-series): Out of scope — not targeted by this workspace.
  • Token awareness: Function definitions are injected into the system prompt and billed as input tokens. Shorten descriptions where possible; use tool search to defer heavy schemas.
  • additionalProperties: false is required for strict mode — always add it.
  • For code generation or high-stakes outputs: keep a human-in-the-loop review step.
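The strict-mode requirements above can be linted mechanically. A heuristic sketch that checks only the top-level object of a function definition:

```python
def check_strict_ready(fn: dict) -> list[str]:
    """Check an OpenAI-style function definition for strict-mode
    requirements: strict flag set, additionalProperties: false,
    and every property listed in required."""
    problems = []
    params = fn.get("parameters", {})
    if not fn.get("strict"):
        problems.append("strict is not true")
    if params.get("additionalProperties") is not False:
        problems.append("additionalProperties must be false")
    missing = set(params.get("properties", {})) - set(params.get("required", []))
    if missing:
        # Optional fields belong in required with a null type union.
        problems.append(f"properties missing from required: {sorted(missing)}")
    return problems
```

A full checker would recurse into nested objects, which strict mode constrains the same way.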

Safety (OpenAI-specific)

  • Use the Moderation API (free) to filter unsafe content in tool outputs or user inputs.
  • Pass safety_identifier (hashed user ID) with API requests to help OpenAI detect abuse patterns.
  • Adversarial testing / red-teaming: test prompt injection vectors (e.g., tool output containing instructions like "ignore previous instructions").
  • Constrain user input length to limit prompt injection surface.
  • Implement HITL (human in the loop) for high-stakes or irreversible tool actions.
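A naive sketch combining the last two points (injection screening plus an input-length cap); real deployments should rely on the Moderation API and proper classifiers rather than regex heuristics like these:

```python
import re

# Naive injection heuristics; patterns and the flag message are illustrative.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all )?previous instructions", re.I),
    re.compile(r"you are now", re.I),
    re.compile(r"system prompt", re.I),
]

MAX_INPUT_CHARS = 4_000  # constrain input length to limit injection surface

def screen_untrusted_text(text: str) -> str:
    """Truncate and flag instruction-like content in untrusted text
    (user input or tool output) before it reaches the model."""
    text = text[:MAX_INPUT_CHARS]
    if any(p.search(text) for p in INJECTION_PATTERNS):
        return "[flagged: possible prompt injection removed]"
    return text
```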

4. Differences Summary

| Concern | Anthropic/Claude | OpenAI/GPT |
| --- | --- | --- |
| Schema key | input_schema | parameters |
| Strict mode | strict: true (structured outputs) | strict: true; requires additionalProperties: false |
| Optional fields | Standard JSON Schema nullable | Add null to type union + keep in required array |
| Examples in tool def | input_examples array (recommended) | Inline in description (avoid for reasoning models) |
| Tool response format | Flexible; test JSON/XML/Markdown | String preferred; JSON/error codes acceptable |
| Max tools in context | No hard limit; keep focused | Soft limit ~20; use tool search beyond that |
| Best model for complex tools | Claude Opus | GPT-4.1 / GPT-5 |
| Safety tooling | Built-in thinking/CoT, strict tool use | Moderation API, safety_identifier, HITL |
| Parallel calls | Default on; managed by model | Default on; disable with parallel_tool_calls: false |
| Deferred tools | N/A (use fewer tools instead) | defer_loading: true + tool search |
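The schema-key and strict-mode differences above imply a mechanical conversion between formats. A sketch handling only the top-level differences (a full converter would also rewrite optional fields as null type unions):

```python
def to_openai(tool: dict) -> dict:
    """Convert an Anthropic-style tool definition (input_schema key)
    to an OpenAI Responses-style function definition with strict mode
    prerequisites applied at the top level."""
    schema = dict(tool["input_schema"])
    schema["additionalProperties"] = False
    # OpenAI strict mode wants every property in required; optionality
    # is modeled with null unions instead of omission from required.
    schema["required"] = sorted(schema.get("properties", {}))
    return {
        "type": "function",
        "name": tool["name"],
        "description": tool["description"],
        "strict": True,
        "parameters": schema,  # Anthropic calls this input_schema
    }
```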

5. Skill Safety Checklist (for ClawHub or custom skills)

Before using any skill/tool — especially downloaded ones:

  • Least privilege: Does the skill need the permissions it requests? (file system, network, exec, credentials)
  • No remote instruction execution: Does the skill fetch remote content and act on it as instructions? Red flag.
  • Input validation: Are tool inputs validated before any destructive or external action?
  • Reversibility: Are destructive operations guarded? (prefer trash over rm, confirmations before sends)
  • Output scope: Does the tool return only what's needed, or does it leak sensitive data?
  • Credential handling: Are secrets passed as env vars, not hardcoded or logged?
  • Description quality: Is the description detailed enough that the model won't misuse it?
  • Source trust: Is the skill from a known, reviewed source? (see TOOLS.md notes on ClawHub flagged skills)
  • Error handling: Does the skill return helpful errors, or raw stack traces that could leak internals?

6. OpenClaw Skill Authoring Notes

When writing new SKILL.md files or tools for zap:

  • Model hint in SKILL.md: If a skill is optimized for a specific model, document which model and why.
  • Anthropic skills: Provide input_examples for complex tools; use detailed multi-sentence descriptions.
  • OpenAI skills: Enable strict: true + additionalProperties: false; keep tool count small; no examples for o-series reasoning models.
  • Cross-model skills: Write descriptions that work for both — thorough, plain language, no model-specific tricks. Test on both before shipping.
  • Token hygiene: Keep responses bounded. Document expected response size in the skill.
  • SKILL.md structure: Document the target model family if the skill has model-specific behavior.

Last updated: 2026-03-05 | Sources: Anthropic Docs, Anthropic Engineering Blog, OpenAI Docs