feat(skills): add llm-tool-best-practices reference from Anthropic + OpenAI official docs
This commit is contained in:
201
skills/llm-tool-best-practices/LLM_TOOL_BEST_PRACTICES.md
Normal file
201
skills/llm-tool-best-practices/LLM_TOOL_BEST_PRACTICES.md
Normal file
@@ -0,0 +1,201 @@
|
||||
# LLM Tool / Skill Best Practices
|
||||
**Sources:** Anthropic official docs + engineering blog · OpenAI official docs (March 2026)
|
||||
**Purpose:** Reference for building, reviewing, or importing skills — especially when the target LLM is Claude (Anthropic) or GPT (OpenAI).
|
||||
|
||||
---
|
||||
|
||||
## 1. Universal Principles (apply to both)
|
||||
|
||||
### Tool design philosophy
|
||||
- **Fewer, richer tools beat many narrow ones.** Don't wrap every API endpoint 1:1. Group related operations into a single tool with an `action` parameter (e.g. one `github_pr` tool with `create/review/merge` rather than three separate tools).
|
||||
- **Design for agents, not developers.** Agents have limited context; they don't iterate cheaply over lists. Build tools that do the heavy lifting (filter, search, compile) so the agent only sees relevant results.
|
||||
- **Each tool needs a clear, distinct purpose.** Overlapping tools confuse model selection and waste context.
|
||||
|
||||
### Naming
|
||||
- Use `service_resource_verb` namespacing: `github_issues_search`, `slack_channel_send`.
|
||||
- Prefix-based vs. suffix-based namespacing can have measurable accuracy differences — test with your target model.
|
||||
|
||||
### Descriptions (most impactful factor)
|
||||
- Write **3–4+ sentences** per tool. Include:
|
||||
- What it does
|
||||
- When to use it (and when NOT to)
|
||||
- What each parameter means + valid values
|
||||
- What the tool returns / does NOT return
|
||||
- Edge cases and caveats
|
||||
- Bad: `"Gets the stock price for a ticker."`
|
||||
- Good: `"Retrieves the current stock price for a given ticker symbol. The ticker must be a valid symbol for a publicly traded company on a major US stock exchange (NYSE or NASDAQ). Returns the latest trade price in USD. Use when the user asks about the current or most recent price of a specific stock. Does not provide historical data, company info, or forecasts."`
|
||||
|
||||
### Input schema
|
||||
- Use JSON Schema with typed, described properties.
|
||||
- Mark truly required params as `required`; use `null` union types for optional fields.
|
||||
- Use `enum` to constrain values where possible — makes invalid states unrepresentable.
|
||||
- **Pass the intern test:** could a smart intern use the function correctly with only what you've given the model? If not, add the missing context.
|
||||
|
||||
### Tool responses
|
||||
- Return **only high-signal information.** Strip opaque internal IDs, blob URLs, mime types, etc.
|
||||
- Prefer semantic identifiers (names, slugs) over UUIDs where possible; if you need both, expose a `response_format: "concise" | "detailed"` param.
|
||||
- Implement pagination/filtering/truncation for responses that could bloat context.
|
||||
- On errors: return **actionable, plain-language error messages** — not raw stack traces or error codes.
|
||||
- Choose response format (JSON, XML, Markdown) based on your own evals; no universal winner.
|
||||
|
||||
### Safety fundamentals
|
||||
- Never give tools more permissions than needed (least privilege).
|
||||
- Treat all tool input as untrusted; validate before acting.
|
||||
- For destructive/irreversible actions: require explicit confirmation or human-in-the-loop.
|
||||
- Log tool calls for auditability.
|
||||
|
||||
---
|
||||
|
||||
## 2. Anthropic / Claude-specific
|
||||
|
||||
**Sources:**
|
||||
- https://platform.claude.com/docs/en/agents-and-tools/tool-use/implement-tool-use
|
||||
- https://www.anthropic.com/engineering/writing-tools-for-agents
|
||||
|
||||
### Tool definition format
|
||||
```json
|
||||
{
|
||||
"name": "get_stock_price",
|
||||
"description": "Retrieves the current stock price for a given ticker symbol. Must be a valid NYSE or NASDAQ ticker. Returns latest trade price in USD. Use when the user asks about current price of a specific stock. Does not return historical data or other company information.",
|
||||
"input_schema": {
|
||||
"type": "object",
|
||||
"properties": {
|
||||
"ticker": {
|
||||
"type": "string",
|
||||
"description": "The stock ticker symbol, e.g. AAPL for Apple Inc."
|
||||
}
|
||||
},
|
||||
"required": ["ticker"]
|
||||
},
|
||||
"input_examples": [
|
||||
{ "ticker": "AAPL" },
|
||||
{ "ticker": "MSFT" }
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||
### Model selection guidance
|
||||
| Scenario | Recommended model |
|
||||
|---|---|
|
||||
| Complex multi-tool tasks, ambiguous queries | Claude Opus (latest) |
|
||||
| Simple, well-defined single tools | Claude Haiku (note: may infer missing params) |
|
||||
| Extended thinking + tools | See extended thinking guide — special rules apply |
|
||||
|
||||
### Claude-specific tips
|
||||
- Use `input_examples` field for tools with nested objects, optional params, or format-sensitive inputs — helps Claude call them correctly.
|
||||
- Use `strict: true` in tool definitions for production agents (guarantees schema conformance via structured outputs).
|
||||
- **Interleaved thinking:** Enable for evals to understand *why* Claude picks (or skips) certain tools.
|
||||
- Consolidate tools that are always called in sequence: if `get_weather` is always followed by `format_weather_card`, merge them.
|
||||
- Keep tool response size bounded — Claude Code defaults to 25,000 token cap on responses.
|
||||
- Truncation messages should be actionable: tell Claude what to do next (e.g., "Use a more specific query to narrow results").
|
||||
- Claude Haiku will sometimes infer/hallucinate missing required parameters — use more explicit descriptions or upgrade to Opus for higher-stakes flows.
|
||||
|
||||
### Evaluation approach (Anthropic recommended)
|
||||
1. Generate real-world tasks (not toy examples) — include multi-step tasks requiring 10+ tool calls.
|
||||
2. Run eval agents with reasoning/CoT blocks enabled.
|
||||
3. Collect metrics: accuracy, total tool calls, errors, token usage.
|
||||
4. Feed transcripts back to Claude Code for automated tool improvement.
|
||||
5. Use held-out test sets to avoid overfitting.
|
||||
|
||||
---
|
||||
|
||||
## 3. OpenAI / GPT-specific
|
||||
|
||||
**Sources:**
|
||||
- https://platform.openai.com/docs/guides/function-calling
|
||||
- https://platform.openai.com/docs/guides/safety-best-practices
|
||||
|
||||
### Tool definition format
|
||||
```json
|
||||
{
|
||||
"type": "function",
|
||||
"name": "get_weather",
|
||||
"description": "Retrieves current weather for the given location.",
|
||||
"strict": true,
|
||||
"parameters": {
|
||||
"type": "object",
|
||||
"properties": {
|
||||
"location": {
|
||||
"type": "string",
|
||||
"description": "City and country e.g. Bogotá, Colombia"
|
||||
},
|
||||
"units": {
|
||||
"type": ["string", "null"],
|
||||
"enum": ["celsius", "fahrenheit"],
|
||||
"description": "Units the temperature will be returned in."
|
||||
}
|
||||
},
|
||||
"required": ["location", "units"],
|
||||
"additionalProperties": false
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
### OpenAI-specific tips
|
||||
- **Always enable `strict: true`** — guarantees function calls match schema exactly (uses structured outputs under the hood). Requires `additionalProperties: false` and all fields in `required` (use `null` type for optional fields).
|
||||
- **Keep initially available functions under ~20** — performance degrades with too many tools in context. Use [tool search](https://platform.openai.com/docs/guides/tools-tool-search) for large tool libraries.
|
||||
- **`parallel_tool_calls`:** enabled by default. Disable if your tools have ordering dependencies, or if using `gpt-4.1-nano` (known to double-call same tool).
|
||||
- **`tool_choice`:** Default is `auto`. Use `required` to force a call, or named function to force a specific one. Use `allowed_tools` to restrict available tools per turn without breaking prompt cache.
|
||||
- **Tool search (`defer_loading: true`):** For large ecosystems, defer rarely-used tools. Let the model search and load them on demand — keeps initial context small and accurate.
|
||||
- **Namespaces:** Group related tools using the `namespace` type with a `description`. Helps model choose between overlapping domains (e.g., `crm` vs `billing`).
|
||||
- **Reasoning models (o4-mini, GPT-5, etc.):** Do NOT add examples in descriptions — this can hurt reasoning model performance. Pass reasoning items back with tool call outputs (required).
|
||||
- **Token awareness:** Function definitions are injected into the system prompt and billed as input tokens. Shorten descriptions where possible; use tool search to defer heavy schemas.
|
||||
- **`additionalProperties: false`** is required for strict mode — always add it.
|
||||
- For code generation or high-stakes outputs: keep a human-in-the-loop review step.
|
||||
|
||||
### Safety (OpenAI-specific)
|
||||
- Use the **Moderation API** (free) to filter unsafe content in tool outputs or user inputs.
|
||||
- Pass `safety_identifier` (hashed user ID) with API requests to help OpenAI detect abuse patterns.
|
||||
- **Adversarial testing / red-teaming:** test prompt injection vectors (e.g., tool output containing instructions like "ignore previous instructions").
|
||||
- Constrain user input length to limit prompt injection surface.
|
||||
- Implement **HITL (human in the loop)** for high-stakes or irreversible tool actions.
|
||||
|
||||
---
|
||||
|
||||
## 4. Differences Summary
|
||||
|
||||
| Concern | Anthropic/Claude | OpenAI/GPT |
|
||||
|---|---|---|
|
||||
| Schema key | `input_schema` | `parameters` |
|
||||
| Strict mode | `strict: true` (structured outputs) | `strict: true` + `additionalProperties: false` required |
|
||||
| Optional fields | Standard JSON Schema nullable | Add `null` to type union + keep in `required` array |
|
||||
| Examples in tool def | `input_examples` array (recommended) | Inline in description (avoid for reasoning models) |
|
||||
| Tool response format | Flexible; test JSON/XML/Markdown | String preferred; JSON/error codes acceptable |
|
||||
| Max tools in context | No hard limit; keep focused | Soft limit ~20; use tool search beyond that |
|
||||
| Best model for complex tools | Claude Opus | GPT-4.1 / GPT-5 (not nano for parallel calls) |
|
||||
| Safety tooling | Built-in thinking/CoT, strict tool use | Moderation API, safety_identifier, HITL |
|
||||
| Parallel calls | Default on; managed by model | Default on; disable with `parallel_tool_calls: false` |
|
||||
| Deferred tools | N/A (use fewer tools instead) | `defer_loading: true` + tool search |
|
||||
|
||||
---
|
||||
|
||||
## 5. Skill Safety Checklist (for ClawHub or custom skills)
|
||||
|
||||
Before using any skill/tool — especially downloaded ones:
|
||||
|
||||
- [ ] **Least privilege:** Does the skill need the permissions it requests? (file system, network, exec, credentials)
|
||||
- [ ] **No remote instruction execution:** Does the skill fetch remote content and act on it as instructions? Red flag.
|
||||
- [ ] **Input validation:** Are tool inputs validated before any destructive or external action?
|
||||
- [ ] **Reversibility:** Are destructive operations guarded? (prefer `trash` over `rm`, confirmations before sends)
|
||||
- [ ] **Output scope:** Does the tool return only what's needed, or does it leak sensitive data?
|
||||
- [ ] **Credential handling:** Are secrets passed as env vars, not hardcoded or logged?
|
||||
- [ ] **Description quality:** Is the description detailed enough that the model won't misuse it?
|
||||
- [ ] **Source trust:** Is the skill from a known, reviewed source? (see TOOLS.md notes on ClawHub flagged skills)
|
||||
- [ ] **Error handling:** Does the skill return helpful errors, or raw stack traces that could leak internals?
|
||||
|
||||
---
|
||||
|
||||
## 6. OpenClaw Skill Authoring Notes
|
||||
|
||||
When writing new SKILL.md files or tools for zap:
|
||||
|
||||
- **Model hint in SKILL.md:** If a skill is optimized for a specific model, document which model and why.
|
||||
- **Anthropic skills:** Provide `input_examples` for complex tools; use detailed multi-sentence descriptions.
|
||||
- **OpenAI skills:** Enable `strict: true` + `additionalProperties: false`; keep tool count small; no examples for o-series reasoning models.
|
||||
- **Cross-model skills:** Write descriptions that work for both — thorough, plain language, no model-specific tricks. Test on both before shipping.
|
||||
- **Token hygiene:** Keep responses bounded. Document expected response size in the skill.
|
||||
- **SKILL.md structure:** Document the target model family if the skill has model-specific behavior.
|
||||
|
||||
---
|
||||
|
||||
*Last updated: 2026-03-05 | Sources: Anthropic Docs, Anthropic Engineering Blog, OpenAI Docs*
|
||||
28
skills/llm-tool-best-practices/SKILL.md
Normal file
28
skills/llm-tool-best-practices/SKILL.md
Normal file
@@ -0,0 +1,28 @@
|
||||
# SKILL.md — llm-tool-best-practices
|
||||
|
||||
## Purpose
|
||||
Reference guide synthesizing official Anthropic and OpenAI best practices for building, reviewing, and importing agent skills/tools. Use this when:
|
||||
- Writing a new skill (SKILL.md or tool definition)
|
||||
- Reviewing a downloaded ClawHub skill for safety
|
||||
- Debugging why a tool is being misused or ignored by the model
|
||||
- Optimizing a skill for a specific LLM (Claude vs. GPT)
|
||||
|
||||
## How to use
|
||||
Read `LLM_TOOL_BEST_PRACTICES.md` in this directory for the full reference.
|
||||
|
||||
Key sections:
|
||||
1. **Universal principles** — applies to all LLMs
|
||||
2. **Anthropic/Claude-specific** — schema format, model selection, `input_examples`, eval approach
|
||||
3. **OpenAI/GPT-specific** — `strict` mode, tool count limits, reasoning model caveats, safety APIs
|
||||
4. **Differences summary** — quick comparison table
|
||||
5. **Skill safety checklist** — use before activating any downloaded skill
|
||||
6. **OpenClaw authoring notes** — how to write SKILL.md files for this workspace
|
||||
|
||||
## Sources
|
||||
- https://platform.claude.com/docs/en/agents-and-tools/tool-use/implement-tool-use
|
||||
- https://www.anthropic.com/engineering/writing-tools-for-agents
|
||||
- https://platform.openai.com/docs/guides/function-calling
|
||||
- https://platform.openai.com/docs/guides/safety-best-practices
|
||||
|
||||
## Last updated
|
||||
2026-03-05
|
||||
Reference in New Issue
Block a user