# LLM Tool / Skill Best Practices
**Sources:** Anthropic official docs + engineering blog · OpenAI official docs (March 2026)
**Purpose:** Reference for building, reviewing, or importing skills — especially when the target LLM is Claude (Anthropic) or GPT (OpenAI).

---
## 1. Universal Principles (apply to both)
### Tool design philosophy
- **Fewer, richer tools beat many narrow ones.** Don't wrap every API endpoint 1:1. Group related operations into a single tool with an `action` parameter (e.g. one `github_pr` tool with `create/review/merge` rather than three separate tools).
- **Design for agents, not developers.** Agents have limited context; they don't iterate cheaply over lists. Build tools that do the heavy lifting (filter, search, compile) so the agent only sees relevant results.
- **Each tool needs a clear, distinct purpose.** Overlapping tools confuse model selection and waste context.
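As a sketch of the consolidation principle, a single `github_pr` tool with an `action` enum could replace three separate create/review/merge tools. The tool name and fields below are illustrative, not a real API:

```python
# Hypothetical consolidated tool definition: one github_pr tool with an
# "action" enum instead of three narrow create/review/merge tools.
GITHUB_PR_TOOL = {
    "name": "github_pr",
    "description": (
        "Create, review, or merge a GitHub pull request. "
        "Use 'create' to open a new PR, 'review' to leave a review, "
        "and 'merge' to merge an approved PR. Does not close or reopen PRs."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "action": {
                "type": "string",
                "enum": ["create", "review", "merge"],
                "description": "Which PR operation to perform.",
            },
            "repo": {
                "type": "string",
                "description": "Repository in 'owner/name' form, e.g. 'octocat/hello-world'.",
            },
        },
        "required": ["action", "repo"],
    },
}
```

The `enum` keeps the surface area of one tool as constrained as three separate ones, while costing only one description in context.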
### Naming
- Use `service_resource_verb` namespacing: `github_issues_search`, `slack_channel_send`.
- Prefix-based vs. suffix-based namespacing can have measurable accuracy differences — test with your target model.
### Descriptions (most impactful factor)
- Write **3–4+ sentences** per tool. Include:
  - What it does
  - When to use it (and when NOT to)
  - What each parameter means + valid values
  - What the tool returns / does NOT return
  - Edge cases and caveats
- Bad: `"Gets the stock price for a ticker."`
- Good: `"Retrieves the current stock price for a given ticker symbol. The ticker must be a valid symbol for a publicly traded company on a major US stock exchange (NYSE or NASDAQ). Returns the latest trade price in USD. Use when the user asks about the current or most recent price of a specific stock. Does not provide historical data, company info, or forecasts."`
### Input schema
- Use JSON Schema with typed, described properties.
- Mark truly required params as `required`; use `null` union types for optional fields.
- Use `enum` to constrain values where possible — makes invalid states unrepresentable.
- **Pass the intern test:** could a smart intern use the function correctly with only what you've given the model? If not, add the missing context.
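A minimal sketch of enforcing the schema server-side before acting — checking `required` and `enum` so invalid states are rejected early (not a full JSON Schema validator; the function name is ours):

```python
def check_input(schema: dict, payload: dict) -> list[str]:
    """Minimal check of a tool call payload against a JSON-Schema-like
    tool schema: required params present, enum values in range.
    A sketch, not a complete validator."""
    errors = []
    props = schema.get("properties", {})
    for name in schema.get("required", []):
        if name not in payload:
            errors.append(f"missing required parameter: {name}")
    for name, value in payload.items():
        spec = props.get(name)
        if spec is None:
            errors.append(f"unexpected parameter: {name}")
        elif "enum" in spec and value not in spec["enum"]:
            errors.append(f"{name} must be one of {spec['enum']}, got {value!r}")
    return errors
```

Returning the errors as plain-language strings means they can be sent straight back to the model as an actionable tool result.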
### Tool responses
- Return **only high-signal information.** Strip opaque internal IDs, blob URLs, mime types, etc.
- Prefer semantic identifiers (names, slugs) over UUIDs where possible; if you need both, expose a `response_format: "concise" | "detailed"` param.
- Implement pagination/filtering/truncation for responses that could bloat context.
- On errors: return **actionable, plain-language error messages** — not raw stack traces or error codes.
- Choose response format (JSON, XML, Markdown) based on your own evals; no universal winner.
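One way to implement the `response_format` idea above is a small shaping step that strips low-signal fields unless the caller asks for the detailed form. The field names (`uuid`, `blob_url`) are placeholders for whatever your backend returns:

```python
def shape_response(rows: list[dict], response_format: str = "concise") -> list[dict]:
    """Strip opaque internal fields (UUIDs, blob URLs) from tool output
    unless the model explicitly requested the detailed form.
    A sketch -- adapt the keep-list to your own payloads."""
    if response_format == "detailed":
        return rows
    keep = ("name", "slug", "status")  # semantic identifiers only
    return [{k: r[k] for k in keep if k in r} for r in rows]
```

Defaulting to `"concise"` keeps the common case cheap in tokens while leaving the detailed form one parameter away.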
### Safety fundamentals
- Never give tools more permissions than needed (least privilege).
- Treat all tool input as untrusted; validate before acting.
- For destructive/irreversible actions: require explicit confirmation or human-in-the-loop.
- Log tool calls for auditability.
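The last three bullets compose naturally into one dispatch gate: log every call, and route destructive tools through a confirmation callback. Tool names and the `confirm` signature here are hypothetical:

```python
import logging

DESTRUCTIVE = {"delete_repo", "send_email"}  # hypothetical tool names

def run_tool(name: str, args: dict, handlers: dict, confirm) -> object:
    """Log every tool call and gate destructive ones behind a
    human-in-the-loop confirmation callback. Sketch only."""
    logging.info("tool call: %s %r", name, args)
    if name in DESTRUCTIVE and not confirm(name, args):
        return {"error": "Action cancelled: human confirmation was declined."}
    return handlers[name](**args)
```

The error string is deliberately plain language so the model can relay the outcome instead of retrying blindly.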
---
## 2. Anthropic / Claude-specific
**Sources:**
- https://platform.claude.com/docs/en/agents-and-tools/tool-use/implement-tool-use
- https://www.anthropic.com/engineering/writing-tools-for-agents
### Tool definition format
```json
{
  "name": "get_stock_price",
  "description": "Retrieves the current stock price for a given ticker symbol. Must be a valid NYSE or NASDAQ ticker. Returns latest trade price in USD. Use when the user asks about current price of a specific stock. Does not return historical data or other company information.",
  "input_schema": {
    "type": "object",
    "properties": {
      "ticker": {
        "type": "string",
        "description": "The stock ticker symbol, e.g. AAPL for Apple Inc."
      }
    },
    "required": ["ticker"]
  },
  "input_examples": [
    { "ticker": "AAPL" },
    { "ticker": "MSFT" }
  ]
}
```
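On the response side, Claude emits `tool_use` content blocks, and you reply with a user message containing matching `tool_result` blocks (this block shape is from Anthropic's tool-use docs; the dispatcher function and handler names are our sketch):

```python
def handle_tool_use(content_blocks: list[dict], handlers: dict) -> dict:
    """Turn tool_use blocks from a Claude response into the tool_result
    message to send back. `handlers` maps tool name -> Python callable.
    A minimal sketch with no error handling."""
    results = []
    for block in content_blocks:
        if block.get("type") != "tool_use":
            continue  # skip text/thinking blocks
        fn = handlers[block["name"]]
        results.append({
            "type": "tool_result",
            "tool_use_id": block["id"],   # must echo the id Claude generated
            "content": str(fn(**block["input"])),
        })
    return {"role": "user", "content": results}
```

Echoing `tool_use_id` exactly is what lets Claude pair each result with the call it made, including under parallel tool use.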
### Model selection guidance
| Scenario | Recommended model |
|---|---|
| Complex multi-tool tasks, ambiguous queries | Claude Opus (latest) |
| Simple, well-defined single tools | Claude Haiku (note: may infer missing params) |
| Extended thinking + tools | See extended thinking guide — special rules apply |
### Claude-specific tips
- Use `input_examples` field for tools with nested objects, optional params, or format-sensitive inputs — helps Claude call them correctly.
- Use `strict: true` in tool definitions for production agents (guarantees schema conformance via structured outputs).
- **Interleaved thinking:** Enable for evals to understand *why* Claude picks (or skips) certain tools.
- Consolidate tools that are always called in sequence: if `get_weather` is always followed by `format_weather_card`, merge them.
- Keep tool response size bounded — Claude Code defaults to 25,000 token cap on responses.
- Truncation messages should be actionable: tell Claude what to do next (e.g., "Use a more specific query to narrow results").
- Claude Haiku will sometimes infer/hallucinate missing required parameters — use more explicit descriptions or upgrade to Opus for higher-stakes flows.
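The bounded-response and actionable-truncation tips can be one helper. Note the source cap is 25,000 *tokens*; this sketch uses a character cap as a cheap stand-in:

```python
def truncate_response(text: str, cap: int = 25_000) -> str:
    """Bound a tool response and make the truncation notice actionable.
    Uses a character cap as a rough stand-in for a token cap."""
    if len(text) <= cap:
        return text
    return text[:cap] + "\n[Truncated. Use a more specific query to narrow results.]"
```

The trailing notice tells Claude what to do next instead of leaving it to guess why the output stopped mid-record.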
### Evaluation approach (Anthropic recommended)
1. Generate real-world tasks (not toy examples) — include multi-step tasks requiring 10+ tool calls.
2. Run eval agents with reasoning/CoT blocks enabled.
3. Collect metrics: accuracy, total tool calls, errors, token usage.
4. Feed transcripts back to Claude Code for automated tool improvement.
5. Use held-out test sets to avoid overfitting.
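The metrics in step 3 aggregate straightforwardly. A sketch, assuming each eval transcript has been reduced to a flat dict of per-task results (our field names, not an Anthropic format):

```python
def summarize_transcripts(transcripts: list[dict]) -> dict:
    """Aggregate eval metrics over transcripts, each assumed to look like
    {"correct": bool, "tool_calls": int, "errors": int, "tokens": int}."""
    n = len(transcripts)
    return {
        "accuracy": sum(t["correct"] for t in transcripts) / n,
        "avg_tool_calls": sum(t["tool_calls"] for t in transcripts) / n,
        "total_errors": sum(t["errors"] for t in transcripts),
        "total_tokens": sum(t["tokens"] for t in transcripts),
    }
```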
---
## 3. OpenAI / GPT-specific
**Sources:**
- https://platform.openai.com/docs/guides/function-calling
- https://platform.openai.com/docs/guides/safety-best-practices
### Tool definition format
```json
{
  "type": "function",
  "name": "get_weather",
  "description": "Retrieves current weather for the given location.",
  "strict": true,
  "parameters": {
    "type": "object",
    "properties": {
      "location": {
        "type": "string",
        "description": "City and country e.g. Bogotá, Colombia"
      },
      "units": {
        "type": ["string", "null"],
        "enum": ["celsius", "fahrenheit"],
        "description": "Units the temperature will be returned in."
      }
    },
    "required": ["location", "units"],
    "additionalProperties": false
  }
}
```
### OpenAI-specific tips
- **Always enable `strict: true`** — guarantees function calls match schema exactly (uses structured outputs under the hood). Requires `additionalProperties: false` and all fields in `required` (use `null` type for optional fields).
- **Keep initially available functions under ~20** — performance degrades with too many tools in context. Use [tool search](https://platform.openai.com/docs/guides/tools-tool-search) for large tool libraries.
- **`parallel_tool_calls`:** enabled by default. Disable if your tools have ordering dependencies, or if using `gpt-4.1-nano` (known to double-call same tool).
- **`tool_choice`:** Default is `auto`. Use `required` to force a call, or named function to force a specific one. Use `allowed_tools` to restrict available tools per turn without breaking prompt cache.
- **Tool search (`defer_loading: true`):** For large ecosystems, defer rarely-used tools. Let the model search and load them on demand — keeps initial context small and accurate.
- **Namespaces:** Group related tools using the `namespace` type with a `description`. Helps model choose between overlapping domains (e.g., `crm` vs `billing`).
- **Reasoning models (o-series):** Out of scope — not targeted by this workspace.
- **Token awareness:** Function definitions are injected into the system prompt and billed as input tokens. Shorten descriptions where possible; use tool search to defer heavy schemas.
- **`additionalProperties: false`** is required for strict mode — always add it.
- For code generation or high-stakes outputs: keep a human-in-the-loop review step.
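The strict-mode rules above (every field listed in `required`, optional fields made nullable, `additionalProperties: false`) are mechanical enough to automate. A sketch of a hypothetical helper, not an OpenAI SDK function:

```python
def strictify(properties: dict, optional: set[str]) -> dict:
    """Build a strict-mode `parameters` schema: list every property in
    `required`, add `null` to the type of optional ones, and forbid
    additional properties. Hypothetical helper, not part of any SDK."""
    props = {}
    for name, spec in properties.items():
        spec = dict(spec)  # copy; don't mutate caller's schema
        if name in optional:
            t = spec.get("type", "string")
            types = t if isinstance(t, list) else [t]
            if "null" not in types:
                types.append("null")
            spec["type"] = types
        props[name] = spec
    return {
        "type": "object",
        "properties": props,
        "required": list(props),
        "additionalProperties": False,
    }
```

Applied to the `get_weather` example above, it turns a plain `{"location": ..., "units": ...}` property map into the strict-mode form with `units` nullable.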
### Safety (OpenAI-specific)
- Use the **Moderation API** (free) to filter unsafe content in tool outputs or user inputs.
- Pass `safety_identifier` (hashed user ID) with API requests to help OpenAI detect abuse patterns.
- **Adversarial testing / red-teaming:** test prompt injection vectors (e.g., tool output containing instructions like "ignore previous instructions").
- Constrain user input length to limit prompt injection surface.
- Implement **HITL (human in the loop)** for high-stakes or irreversible tool actions.
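A crude pre-filter for tool output re-entering the prompt, combining the length constraint with a phrase-based injection check. This heuristic is our sketch and is no substitute for the Moderation API or real red-teaming:

```python
SUSPICIOUS = ("ignore previous instructions", "disregard the above", "you are now")

def screen_tool_output(text: str, max_len: int = 4000) -> str:
    """Clamp tool output length and flag common prompt-injection phrasings
    before the text re-enters the model's context. Heuristic sketch only."""
    text = text[:max_len]
    lowered = text.lower()
    if any(phrase in lowered for phrase in SUSPICIOUS):
        return "[tool output withheld: possible prompt-injection content]"
    return text
```

Phrase lists are trivially evaded, so treat this as defense-in-depth alongside moderation and HITL, not as the defense.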
---
## 4. Differences Summary
| Concern | Anthropic/Claude | OpenAI/GPT |
|---|---|---|
| Schema key | `input_schema` | `parameters` |
| Strict mode | `strict: true` (structured outputs) | `strict: true` + `additionalProperties: false` required |
| Optional fields | Standard JSON Schema nullable | Add `null` to type union + keep in `required` array |
| Examples in tool def | `input_examples` array (recommended) | Inline in description (avoid for reasoning models) |
| Tool response format | Flexible; test JSON/XML/Markdown | String preferred; JSON/error codes acceptable |
| Max tools in context | No hard limit; keep focused | Soft limit ~20; use tool search beyond that |
| Best model for complex tools | Claude Opus | GPT-4.1 / GPT-5 |
| Safety tooling | Built-in thinking/CoT, strict tool use | Moderation API, safety_identifier, HITL |
| Parallel calls | Default on; managed by model | Default on; disable with `parallel_tool_calls: false` |
| Deferred tools | N/A (use fewer tools instead) | `defer_loading: true` + tool search |
---
## 5. Skill Safety Checklist (for ClawHub or custom skills)
Before using any skill/tool — especially downloaded ones:
- [ ] **Least privilege:** Does the skill need the permissions it requests? (file system, network, exec, credentials)
- [ ] **No remote instruction execution:** Does the skill fetch remote content and act on it as instructions? Red flag.
- [ ] **Input validation:** Are tool inputs validated before any destructive or external action?
- [ ] **Reversibility:** Are destructive operations guarded? (prefer `trash` over `rm`, confirmations before sends)
- [ ] **Output scope:** Does the tool return only what's needed, or does it leak sensitive data?
- [ ] **Credential handling:** Are secrets passed as env vars, not hardcoded or logged?
- [ ] **Description quality:** Is the description detailed enough that the model won't misuse it?
- [ ] **Source trust:** Is the skill from a known, reviewed source? (see TOOLS.md notes on ClawHub flagged skills)
- [ ] **Error handling:** Does the skill return helpful errors, or raw stack traces that could leak internals?
---
## 6. OpenClaw Skill Authoring Notes
When writing new SKILL.md files or tools for zap:
- **Model hint in SKILL.md:** If a skill is optimized for a specific model, document which model and why.
- **Anthropic skills:** Provide `input_examples` for complex tools; use detailed multi-sentence descriptions.
- **OpenAI skills:** Enable `strict: true` + `additionalProperties: false`; keep tool count small; no examples for o-series reasoning models.
- **Cross-model skills:** Write descriptions that work for both — thorough, plain language, no model-specific tricks. Test on both before shipping.
- **Token hygiene:** Keep responses bounded. Document expected response size in the skill.
- **SKILL.md structure:** Document the target model family if the skill has model-specific behavior.
---
*Last updated: 2026-03-05 | Sources: Anthropic Docs, Anthropic Engineering Blog, OpenAI Docs*