Companion docs in this set:
- anthropic-prompt-caching.md: KV cache mechanics, TTLs, pricing, auto vs explicit
- openai-prompt-caching.md: automatic caching, in-memory vs 24h retention, prompt_cache_key
- anthropic-prompting-best-practices.md: clear instructions, XML tags, few-shot, model-specific notes
- openai-prompting-best-practices.md: message roles, optimization framework, structured outputs, model selection

Key findings:
- Anthropic caching: Claude models only; 5m default TTL, 1h optional; reads cost 10% of base input
- OpenAI caching: automatic/free; 5-10 min default, 24h extended retention for GPT-5+
- GLM/ZAI models: neither caching mechanism applies
- Subagent model routing table added to openai-prompting-best-practices.md
Anthropic — Prompt Caching Best Practices
Source: https://docs.anthropic.com/en/docs/build-with-claude/prompt-caching Fetched: 2026-03-05
How It Works
- On first request: system processes full prompt and caches the prefix once the response begins.
- On subsequent requests with the same prefix: the cached version is reused (much cheaper and faster).
- Cache is checked against a cryptographic hash of the prefix content.
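Conceptually, the lookup works like hashing the serialized prefix and comparing keys. The sketch below illustrates that idea only; Anthropic's actual hashing scheme is internal and not specified in the docs, and `prefix_key` is a made-up helper name.

```python
import hashlib
import json

def prefix_key(tools, system, messages_prefix):
    """Illustrative only: derive a deterministic key from the cacheable
    prefix (tools -> system -> messages). Not Anthropic's real scheme."""
    canonical = json.dumps(
        {"tools": tools, "system": system, "messages": messages_prefix},
        sort_keys=True,
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Identical prefixes map to the same key (cache hit); any byte-level change misses.
k1 = prefix_key([], "You are a helpful assistant.", [{"role": "user", "content": "hi"}])
k2 = prefix_key([], "You are a helpful assistant.", [{"role": "user", "content": "hi"}])
k3 = prefix_key([], "You are a helpful assistant!", [{"role": "user", "content": "hi"}])
```

This is why "put static content first" matters: any edit early in the prefix changes the key for everything after it.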
Two Caching Modes
Automatic Caching (recommended for multi-turn)
Add `cache_control: {"type": "ephemeral"}` at the top level of the request body.
- System automatically caches all content up to the last cacheable block.
- Moves cache breakpoint forward as conversation grows.
- Best for multi-turn conversations.
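A minimal sketch of the request body shape described above. Field placement follows these notes (not verified against the current SDK), and the model id string is a hypothetical example.

```python
# Messages API body with top-level automatic caching, as described above.
# Verify field placement against the current Anthropic API reference.
body = {
    "model": "claude-sonnet-4-6",            # hypothetical model id string
    "max_tokens": 1024,
    "cache_control": {"type": "ephemeral"},  # auto-cache up to the last cacheable block
    "system": "Large, mostly static system prompt ...",
    "messages": [
        {"role": "user", "content": "First turn"},
    ],
}
```

On each later turn you resend the same prefix plus the new messages; the system moves the breakpoint forward for you.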
Explicit Cache Breakpoints
Place `cache_control` directly on individual content blocks.
- Finer control over exactly what gets cached.
- Use when you want to cache specific blocks (e.g., a large document) but not others.
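For example, caching a large reference document while leaving the question after it dynamic. A sketch with illustrative values (the model id is a hypothetical placeholder):

```python
# Explicit breakpoint: cache_control on one content block only.
# The large document is cached; the trailing question stays dynamic.
body = {
    "model": "claude-sonnet-4-6",  # hypothetical model id
    "max_tokens": 1024,
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "<large reference document here>",
                    "cache_control": {"type": "ephemeral"},  # breakpoint
                },
                {"type": "text", "text": "What does section 3 say?"},
            ],
        }
    ],
}
```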
Cache Lifetimes
| Duration | Cost | Availability |
|---|---|---|
| 5 minutes (default) | 1.25x base input price for write | All models |
| 1 hour | 2x base input price for write | Available at additional cost |
- Cache reads cost 0.1x (10%) of base input price.
- Cache is refreshed for no additional cost each time cached content is used.
- Default TTL: 5 minutes (refreshed on each use within TTL).
Pricing Per Million Tokens (relevant models)
| Model | Base Input | 5m Write | 1h Write | Cache Read | Output |
|---|---|---|---|---|---|
| Claude Opus 4.6 | $5 | $6.25 | $10 | $0.50 | $25 |
| Claude Sonnet 4.6 | $3 | $3.75 | $6 | $0.30 | $15 |
| Claude Haiku 4.5 | $1 | $1.25 | $2 | $0.10 | $5 |
Note: We use a Copilot subscription (flat rate), so per-token cost doesn't apply directly. But quota burn follows similar relative proportions: caching still saves quota by reducing re-processing of identical prefixes.
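To make the relative savings concrete, here is the arithmetic for a 20k-token prefix on the Sonnet numbers from the table above, reused across 10 turns within the 5-minute TTL:

```python
# Sonnet pricing from the table: base $3/M, 5m write $3.75/M, read $0.30/M.
PREFIX_MTOK = 20_000 / 1_000_000
BASE, WRITE_5M, READ = 3.00, 3.75, 0.30

no_cache = 10 * PREFIX_MTOK * BASE                             # reprocess prefix every turn
with_cache = PREFIX_MTOK * WRITE_5M + 9 * PREFIX_MTOK * READ   # 1 write + 9 reads

print(f"no cache:   ${no_cache:.3f}")    # $0.600
print(f"with cache: ${with_cache:.3f}")  # $0.129 (~78% less)
```

The write premium (1.25x) is paid once; every subsequent hit within the TTL costs 0.1x, so caching pays for itself after the first reuse.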
Supported Models
- Claude Opus 4.6, 4.5, 4.1, 4
- Claude Sonnet 4.6, 4.5, 4
- Claude Haiku 4.5, 3.5
Not supported: Non-Claude models (GPT, GLM, Gemini) — caching is Anthropic-only.
What Gets Cached
Prefix order: tools → system → messages (up to the cache breakpoint).
The full prefix is cached — all of tools, system, and messages up to and including the marked block.
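A sketch of that ordering: a breakpoint on the system block caches the tools and system prompt together, leaving only the message turns dynamic. Values and the model id are illustrative, not a verified SDK call.

```python
# Prefix order: tools -> system -> messages, cached up to the breakpoint.
body = {
    "model": "claude-sonnet-4-6",  # hypothetical model id
    "max_tokens": 1024,
    "tools": [
        {"name": "search", "description": "...", "input_schema": {"type": "object"}}
    ],
    "system": [
        {
            "type": "text",
            "text": "Static instructions ...",
            "cache_control": {"type": "ephemeral"},  # tools + system cached together
        }
    ],
    "messages": [{"role": "user", "content": "Dynamic user turn"}],
}
```

Because tools and system sit before the breakpoint, changing a tool definition invalidates the cached system prompt as well.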
Key Best Practices
- Put static content first: Instructions, system prompts, and background context should come before dynamic/user content.
- Use 1-hour cache for long sessions: Default 5-minute TTL means cache expires between turns if idle > 5 min. Use 1h for agents with longer gaps.
- Automatic caching for multi-turn: Simplest approach, handles the growing message history automatically.
- Minimum size: caching only activates above a model-specific token threshold (on the order of 1,024 tokens on current Claude models); typical system prompts qualify easily.
- Privacy: Cache stores KV representations and cryptographic hashes, NOT raw text. ZDR-compatible.
For Our Setup (OpenClaw)
- Main session system prompt is large (~15-20k tokens) and mostly static → ideal caching candidate.
- Heartbeat turns are the same every 25-30min → if using 1h cache, heartbeats keep cache warm for free.
- OpenClaw's `cacheRetention` config likely maps to this `cache_control` setting.
- Applies to `litellm/copilot-claude-*` models only. Does NOT apply to GLM, GPT-4o, Gemini.
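The heartbeat point above can be checked with a small model of refresh-on-use: an entry lives TTL minutes past its last use, and each use (hit or rewrite) resets the clock. This simulates the behavior described in these notes, not any API.

```python
def hits(use_times_min, ttl_min):
    """Return, for each use after the first, whether it hit a still-live entry."""
    last, result = use_times_min[0], []
    for t in use_times_min[1:]:
        result.append((t - last) <= ttl_min)
        last = t  # a miss rewrites the cache, a hit refreshes it; both reset the TTL
    return result

heartbeats = [0, 28, 56, 84]          # minutes; one heartbeat roughly every 28 min
print(hits(heartbeats, ttl_min=5))    # cold on every turn
print(hits(heartbeats, ttl_min=60))   # kept warm for free
```

With the 5-minute default TTL, 25-30 minute heartbeats always arrive cold; with the 1-hour TTL, each heartbeat refreshes the entry before it expires.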