# LLM Tool / Skill Best Practices
**Sources:** Anthropic official docs + engineering blog · OpenAI official docs (March 2026)
**Purpose:** Reference for building, reviewing, or importing skills — especially when the target LLM is Claude (Anthropic) or GPT (OpenAI).

---
## 1. Universal Principles (apply to both)
### Tool design philosophy
- **Fewer, richer tools beat many narrow ones.** Don't wrap every API endpoint 1:1. Group related operations into a single tool with an `action` parameter (e.g. one `github_pr` tool with `create/review/merge` rather than three separate tools).
- **Design for agents, not developers.** Agents have limited context; they don't iterate cheaply over lists. Build tools that do the heavy lifting (filter, search, compile) so the agent only sees relevant results.
- **Each tool needs a clear, distinct purpose.** Overlapping tools confuse model selection and waste context.
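As a sketch of the consolidation principle, a single `github_pr` tool with an `action` enum could replace three separate create/review/merge tools. The tool name and fields below are illustrative, not a real API:

```python
# Hypothetical consolidated tool definition: one github_pr tool with an
# "action" enum instead of three narrow create/review/merge tools.
GITHUB_PR_TOOL = {
    "name": "github_pr",
    "description": (
        "Create, review, or merge a GitHub pull request. "
        "Use 'create' to open a new PR, 'review' to leave a review, "
        "and 'merge' to merge an approved PR. Does not close or reopen PRs."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "action": {
                "type": "string",
                "enum": ["create", "review", "merge"],
                "description": "Which PR operation to perform.",
            },
            "repo": {
                "type": "string",
                "description": "Repository in 'owner/name' form, e.g. 'octocat/hello-world'.",
            },
        },
        "required": ["action", "repo"],
    },
}
```

The `enum` keeps the surface area of one tool as constrained as three separate ones, while costing only one description in context.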
### Naming
- Use `service_resource_verb` namespacing: `github_issues_search`, `slack_channel_send`.
- Prefix-based vs. suffix-based namespacing can have measurable accuracy differences — test with your target model.
### Descriptions (most impactful factor)
- Write **3–4+ sentences** per tool. Include:
  - What it does
  - When to use it (and when NOT to)
  - What each parameter means + valid values
  - What the tool returns / does NOT return
  - Edge cases and caveats
- Bad: `"Gets the stock price for a ticker."`
- Good: `"Retrieves the current stock price for a given ticker symbol. The ticker must be a valid symbol for a publicly traded company on a major US stock exchange (NYSE or NASDAQ). Returns the latest trade price in USD. Use when the user asks about the current or most recent price of a specific stock. Does not provide historical data, company info, or forecasts."`
### Input schema
- Use JSON Schema with typed, described properties.
- Mark truly required params as `required`; use `null` union types for optional fields.
- Use `enum` to constrain values where possible — makes invalid states unrepresentable.
- **Pass the intern test:** could a smart intern use the function correctly with only what you've given the model? If not, add the missing context.
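A minimal sketch of enforcing the schema server-side before acting — checking `required` and `enum` so invalid states are rejected early (not a full JSON Schema validator; the function name is ours):

```python
def check_input(schema: dict, payload: dict) -> list[str]:
    """Minimal check of a tool call payload against a JSON-Schema-like
    tool schema: required params present, enum values in range.
    A sketch, not a complete validator."""
    errors = []
    props = schema.get("properties", {})
    for name in schema.get("required", []):
        if name not in payload:
            errors.append(f"missing required parameter: {name}")
    for name, value in payload.items():
        spec = props.get(name)
        if spec is None:
            errors.append(f"unexpected parameter: {name}")
        elif "enum" in spec and value not in spec["enum"]:
            errors.append(f"{name} must be one of {spec['enum']}, got {value!r}")
    return errors
```

Returning the errors as plain-language strings means they can be sent straight back to the model as an actionable tool result.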
### Tool responses
- Return **only high-signal information.** Strip opaque internal IDs, blob URLs, mime types, etc.
- Prefer semantic identifiers (names, slugs) over UUIDs where possible; if you need both, expose a `response_format: "concise" | "detailed"` param.
- Implement pagination/filtering/truncation for responses that could bloat context.
- On errors: return **actionable, plain-language error messages** — not raw stack traces or error codes.
- Choose response format (JSON, XML, Markdown) based on your own evals; no universal winner.
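One way to implement the `response_format` idea above is a small shaping step that strips low-signal fields unless the caller asks for the detailed form. The field names (`uuid`, `blob_url`) are placeholders for whatever your backend returns:

```python
def shape_response(rows: list[dict], response_format: str = "concise") -> list[dict]:
    """Strip opaque internal fields (UUIDs, blob URLs) from tool output
    unless the model explicitly requested the detailed form.
    A sketch -- adapt the keep-list to your own payloads."""
    if response_format == "detailed":
        return rows
    keep = ("name", "slug", "status")  # semantic identifiers only
    return [{k: r[k] for k in keep if k in r} for r in rows]
```

Defaulting to `"concise"` keeps the common case cheap in tokens while leaving the detailed form one parameter away.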
### Safety fundamentals
- Never give tools more permissions than needed (least privilege).
- Treat all tool input as untrusted; validate before acting.
- For destructive/irreversible actions: require explicit confirmation or human-in-the-loop.
- Log tool calls for auditability.
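The last three bullets compose naturally into one dispatch gate: log every call, and route destructive tools through a confirmation callback. Tool names and the `confirm` signature here are hypothetical:

```python
import logging

DESTRUCTIVE = {"delete_repo", "send_email"}  # hypothetical tool names

def run_tool(name: str, args: dict, handlers: dict, confirm) -> object:
    """Log every tool call and gate destructive ones behind a
    human-in-the-loop confirmation callback. Sketch only."""
    logging.info("tool call: %s %r", name, args)
    if name in DESTRUCTIVE and not confirm(name, args):
        return {"error": "Action cancelled: human confirmation was declined."}
    return handlers[name](**args)
```

The error string is deliberately plain language so the model can relay the outcome instead of retrying blindly.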
---
## 2. Anthropic / Claude-specific
**Sources:**
- https://platform.claude.com/docs/en/agents-and-tools/tool-use/implement-tool-use
- https://www.anthropic.com/engineering/writing-tools-for-agents
### Tool definition format
```json
{
  "name": "get_stock_price",
  "description": "Retrieves the current stock price for a given ticker symbol. Must be a valid NYSE or NASDAQ ticker. Returns latest trade price in USD. Use when the user asks about current price of a specific stock. Does not return historical data or other company information.",
  "input_schema": {
    "type": "object",
    "properties": {
      "ticker": {
        "type": "string",
        "description": "The stock ticker symbol, e.g. AAPL for Apple Inc."
      }
    },
    "required": ["ticker"]
  },
  "input_examples": [
    { "ticker": "AAPL" },
    { "ticker": "MSFT" }
  ]
}
```
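On the response side, Claude emits `tool_use` content blocks, and you reply with a user message containing matching `tool_result` blocks (this block shape is from Anthropic's tool-use docs; the dispatcher function and handler names are our sketch):

```python
def handle_tool_use(content_blocks: list[dict], handlers: dict) -> dict:
    """Turn tool_use blocks from a Claude response into the tool_result
    message to send back. `handlers` maps tool name -> Python callable.
    A minimal sketch with no error handling."""
    results = []
    for block in content_blocks:
        if block.get("type") != "tool_use":
            continue  # skip text/thinking blocks
        fn = handlers[block["name"]]
        results.append({
            "type": "tool_result",
            "tool_use_id": block["id"],   # must echo the id Claude generated
            "content": str(fn(**block["input"])),
        })
    return {"role": "user", "content": results}
```

Echoing `tool_use_id` exactly is what lets Claude pair each result with the call it made, including under parallel tool use.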
### Model selection guidance
| Scenario | Recommended model |
|---|---|
| Complex multi-tool tasks, ambiguous queries | Claude Opus (latest) |
| Simple, well-defined single tools | Claude Haiku (note: may infer missing params) |
| Extended thinking + tools | See extended thinking guide — special rules apply |
### Claude-specific tips
- Use `input_examples` field for tools with nested objects, optional params, or format-sensitive inputs — helps Claude call them correctly.
- Use `strict: true` in tool definitions for production agents (guarantees schema conformance via structured outputs).
- **Interleaved thinking:** Enable for evals to understand *why* Claude picks (or skips) certain tools.
- Consolidate tools that are always called in sequence: if `get_weather` is always followed by `format_weather_card`, merge them.
- Keep tool response size bounded — Claude Code defaults to 25,000 token cap on responses.
- Truncation messages should be actionable: tell Claude what to do next (e.g., "Use a more specific query to narrow results").
- Claude Haiku will sometimes infer/hallucinate missing required parameters — use more explicit descriptions or upgrade to Opus for higher-stakes flows.
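The bounded-response and actionable-truncation tips can be one helper. Note the source cap is 25,000 *tokens*; this sketch uses a character cap as a cheap stand-in:

```python
def truncate_response(text: str, cap: int = 25_000) -> str:
    """Bound a tool response and make the truncation notice actionable.
    Uses a character cap as a rough stand-in for a token cap."""
    if len(text) <= cap:
        return text
    return text[:cap] + "\n[Truncated. Use a more specific query to narrow results.]"
```

The trailing notice tells Claude what to do next instead of leaving it to guess why the output stopped mid-record.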
### Evaluation approach (Anthropic recommended)
1. Generate real-world tasks (not toy examples) — include multi-step tasks requiring 10+ tool calls.
2. Run eval agents with reasoning/CoT blocks enabled.
3. Collect metrics: accuracy, total tool calls, errors, token usage.
4. Feed transcripts back to Claude Code for automated tool improvement.
5. Use held-out test sets to avoid overfitting.
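The metrics in step 3 aggregate straightforwardly. A sketch, assuming each eval transcript has been reduced to a flat dict of per-task results (our field names, not an Anthropic format):

```python
def summarize_transcripts(transcripts: list[dict]) -> dict:
    """Aggregate eval metrics over transcripts, each assumed to look like
    {"correct": bool, "tool_calls": int, "errors": int, "tokens": int}."""
    n = len(transcripts)
    return {
        "accuracy": sum(t["correct"] for t in transcripts) / n,
        "avg_tool_calls": sum(t["tool_calls"] for t in transcripts) / n,
        "total_errors": sum(t["errors"] for t in transcripts),
        "total_tokens": sum(t["tokens"] for t in transcripts),
    }
```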
---
## 3. OpenAI / GPT-specific
**Sources:**
- https://platform.openai.com/docs/guides/function-calling
- https://platform.openai.com/docs/guides/safety-best-practices
### Tool definition format
```json
{
  "type": "function",
  "name": "get_weather",
  "description": "Retrieves current weather for the given location.",
  "strict": true,
  "parameters": {
    "type": "object",
    "properties": {
      "location": {
        "type": "string",
        "description": "City and country e.g. Bogotá, Colombia"
      },
      "units": {
        "type": ["string", "null"],
        "enum": ["celsius", "fahrenheit"],
        "description": "Units the temperature will be returned in."
      }
    },
    "required": ["location", "units"],
    "additionalProperties": false
  }
}
```
### OpenAI-specific tips
- **Always enable `strict: true`** — guarantees function calls match schema exactly (uses structured outputs under the hood). Requires `additionalProperties: false` and all fields in `required` (use `null` type for optional fields).
- **Keep initially available functions under ~20** — performance degrades with too many tools in context. Use [tool search](https://platform.openai.com/docs/guides/tools-tool-search) for large tool libraries.
- **`parallel_tool_calls`:** enabled by default. Disable if your tools have ordering dependencies, or if using `gpt-4.1-nano` (known to double-call same tool).
- **`tool_choice`:** Default is `auto`. Use `required` to force a call, or named function to force a specific one. Use `allowed_tools` to restrict available tools per turn without breaking prompt cache.
- **Tool search (`defer_loading: true`):** For large ecosystems, defer rarely-used tools. Let the model search and load them on demand — keeps initial context small and accurate.
- **Namespaces:** Group related tools using the `namespace` type with a `description`. Helps model choose between overlapping domains (e.g., `crm` vs `billing`).
- **Reasoning models (o-series):** Out of scope — not targeted by this workspace.
- **Token awareness:** Function definitions are injected into the system prompt and billed as input tokens. Shorten descriptions where possible; use tool search to defer heavy schemas.
- **`additionalProperties: false`** is required for strict mode — always add it.
- For code generation or high-stakes outputs: keep a human-in-the-loop review step.
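The strict-mode rules above (every field listed in `required`, optional fields made nullable, `additionalProperties: false`) are mechanical enough to automate. A sketch of a hypothetical helper, not an OpenAI SDK function:

```python
def strictify(properties: dict, optional: set[str]) -> dict:
    """Build a strict-mode `parameters` schema: list every property in
    `required`, add `null` to the type of optional ones, and forbid
    additional properties. Hypothetical helper, not part of any SDK."""
    props = {}
    for name, spec in properties.items():
        spec = dict(spec)  # copy; don't mutate caller's schema
        if name in optional:
            t = spec.get("type", "string")
            types = t if isinstance(t, list) else [t]
            if "null" not in types:
                types.append("null")
            spec["type"] = types
        props[name] = spec
    return {
        "type": "object",
        "properties": props,
        "required": list(props),
        "additionalProperties": False,
    }
```

Applied to the `get_weather` example above, it turns a plain `{"location": ..., "units": ...}` property map into the strict-mode form with `units` nullable.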
### Safety (OpenAI-specific)
- Use the **Moderation API** (free) to filter unsafe content in tool outputs or user inputs.
- Pass `safety_identifier` (hashed user ID) with API requests to help OpenAI detect abuse patterns.
- **Adversarial testing / red-teaming:** test prompt injection vectors (e.g., tool output containing instructions like "ignore previous instructions").
- Constrain user input length to limit prompt injection surface.
- Implement **HITL (human in the loop)** for high-stakes or irreversible tool actions.
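A crude pre-filter for tool output re-entering the prompt, combining the length constraint with a phrase-based injection check. This heuristic is our sketch and is no substitute for the Moderation API or real red-teaming:

```python
SUSPICIOUS = ("ignore previous instructions", "disregard the above", "you are now")

def screen_tool_output(text: str, max_len: int = 4000) -> str:
    """Clamp tool output length and flag common prompt-injection phrasings
    before the text re-enters the model's context. Heuristic sketch only."""
    text = text[:max_len]
    lowered = text.lower()
    if any(phrase in lowered for phrase in SUSPICIOUS):
        return "[tool output withheld: possible prompt-injection content]"
    return text
```

Phrase lists are trivially evaded, so treat this as defense-in-depth alongside moderation and HITL, not as the defense.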
---
## 4. Differences Summary
| Concern | Anthropic/Claude | OpenAI/GPT |
|---|---|---|
| Schema key | `input_schema` | `parameters` |
| Strict mode | `strict: true` (structured outputs) | `strict: true` + `additionalProperties: false` required |
| Optional fields | Standard JSON Schema nullable | Add `null` to type union + keep in `required` array |
| Examples in tool def | `input_examples` array (recommended) | Inline in description (avoid for reasoning models) |
| Tool response format | Flexible; test JSON/XML/Markdown | String preferred; JSON/error codes acceptable |
| Max tools in context | No hard limit; keep focused | Soft limit ~20; use tool search beyond that |
| Best model for complex tools | Claude Opus | GPT-4.1 / GPT-5 |
| Safety tooling | Built-in thinking/CoT, strict tool use | Moderation API, safety_identifier, HITL |
| Parallel calls | Default on; managed by model | Default on; disable with `parallel_tool_calls: false` |
| Deferred tools | N/A (use fewer tools instead) | `defer_loading: true` + tool search |
---
## 5. Skill Safety Checklist (for ClawHub or custom skills)
Before using any skill/tool — especially downloaded ones:
- [ ] **Least privilege:** Does the skill need the permissions it requests? (file system, network, exec, credentials)
- [ ] **No remote instruction execution:** Does the skill fetch remote content and act on it as instructions? Red flag.
- [ ] **Input validation:** Are tool inputs validated before any destructive or external action?
- [ ] **Reversibility:** Are destructive operations guarded? (prefer `trash` over `rm`, confirmations before sends)
- [ ] **Output scope:** Does the tool return only what's needed, or does it leak sensitive data?
- [ ] **Credential handling:** Are secrets passed as env vars, not hardcoded or logged?
- [ ] **Description quality:** Is the description detailed enough that the model won't misuse it?
- [ ] **Source trust:** Is the skill from a known, reviewed source? (see TOOLS.md notes on ClawHub flagged skills)
- [ ] **Error handling:** Does the skill return helpful errors, or raw stack traces that could leak internals?
---
## 6. OpenClaw Skill Authoring Notes
When writing new SKILL.md files or tools for zap:
- **Model hint in SKILL.md:** If a skill is optimized for a specific model, document which model and why.
- **Anthropic skills:** Provide `input_examples` for complex tools; use detailed multi-sentence descriptions.
- **OpenAI skills:** Enable `strict: true` + `additionalProperties: false`; keep tool count small; no examples for o-series reasoning models.
- **Cross-model skills:** Write descriptions that work for both — thorough, plain language, no model-specific tricks. Test on both before shipping.
- **Token hygiene:** Keep responses bounded. Document expected response size in the skill.
- **SKILL.md structure:** Document the target model family if the skill has model-specific behavior.
---
*Last updated: 2026-03-05 | Sources: Anthropic Docs, Anthropic Engineering Blog, OpenAI Docs*