- anthropic-prompt-caching.md: KV cache mechanics, TTLs, pricing, auto vs explicit
- openai-prompt-caching.md: automatic caching, in-memory vs 24h retention, prompt_cache_key
- anthropic-prompting-best-practices.md: clear instructions, XML tags, few-shot, model-specific notes
- openai-prompting-best-practices.md: message roles, optimization framework, structured outputs, model selection

Key findings:
- Anthropic caching: only for Claude models, 5m default TTL, 1h optional, 10% cost for reads
- OpenAI caching: automatic/free, 5-10min default, 24h extended for GPT-5+
- GLM/ZAI models: neither caching mechanism applies
- Subagent model routing table added to openai-prompting-best-practices.md
OpenAI — Prompt Caching Best Practices
Source: https://platform.openai.com/docs/guides/prompt-caching (fetched 2026-03-05)
How It Works
- Caching is automatic — no code changes required, no extra fees.
- Enabled for all prompts ≥ 1024 tokens.
- Routes requests to servers that recently processed the same prompt prefix.
- Cache hit: significantly reduced latency + lower cost.
- Cache miss: full processing, prefix cached for future requests.
Cache Retention Policies
In-memory (default)
- Available for ALL models supporting prompt caching (gpt-4o and newer).
- Cached prefixes are typically evicted after 5-10 minutes of inactivity, and always within 1 hour of last use.
- Held in volatile GPU memory.
Extended (24h)
- Available for: gpt-5.4, gpt-5.2, gpt-5.1, gpt-5.1-codex, gpt-5.1-codex-mini, gpt-5.1-chat-latest, gpt-5, gpt-5-codex, gpt-4.1
- Keeps cached prefixes active up to 24 hours.
- Offloads KV tensors to GPU-local storage when memory is full.
- Opt in per request: `"prompt_cache_retention": "24h"`.
- NOT zero-data-retention eligible (unlike in-memory).
What Can Be Cached
- Messages array (system, user, assistant)
- Images in user messages (must be identical, same `detail` parameter)
- Tool definitions
- Structured output schemas
Best Practices
- Static content first, dynamic content last: Put system prompts, instructions, examples at beginning. Variable/user content at end.
- Use `prompt_cache_key`: Group requests that share common prefixes under the same key to improve routing and hit rates (see the sketch after this list).
- Stay under 15 req/min per prefix+key: Above this rate, overflow requests go to new machines and miss cache.
- Maintain a steady request stream: The cache evicts entries after inactivity; regular requests keep it warm.
- Monitor `cached_tokens` in `usage.prompt_tokens_details`: Track cache hit rates.
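
A sketch of the ordering and grouping recommendations, assuming the Python `openai` SDK. The system prompt, tool definition, tenant id, and `ask` helper are illustrative, and `prompt_cache_key` is passed via `extra_body` in case the installed SDK version lacks it as a named argument:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Static, reusable prefix: system prompt, instructions, examples, tool definitions.
# Keep this byte-identical across requests so the cached prefix can match.
SYSTEM_PROMPT = "You are a support assistant. ..."  # placeholder; >= 1024 tokens in practice
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "lookup_order",
            "description": "Look up an order by id.",
            "parameters": {
                "type": "object",
                "properties": {"order_id": {"type": "string"}},
                "required": ["order_id"],
            },
        },
    }
]

def ask(tenant_id: str, question: str):
    return client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},  # static content first
            {"role": "user", "content": question},          # dynamic content last
        ],
        tools=TOOLS,
        # Group requests sharing this prefix so they route to the same cache.
        extra_body={"prompt_cache_key": f"support-bot-{tenant_id}"},
    )
```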
Pricing
- Cache writes: same as regular input tokens (no extra cost).
- Cache reads: discounted (typically 50% of input price, varies by model).
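
A rough worked example with hypothetical numbers (assuming $2.50 per 1M input tokens and a 50% cache-read discount): a request with a 10,000-token cached prefix plus 500 new tokens costs about (10,000 × 1.25 + 500 × 2.50) / 1,000,000 ≈ $0.0138, versus roughly $0.0263 with no cache hit.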
Verification
Check `usage.prompt_tokens_details.cached_tokens` in responses to confirm the cache is working.
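
A minimal check, assuming the Python `openai` SDK: send the same prompt prefix twice and compare `cached_tokens` (the system prompt here is a placeholder and must push the total prompt past 1024 tokens for caching to kick in):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

messages = [
    {"role": "system", "content": "..."},  # placeholder: long static prefix, >= 1024 tokens
    {"role": "user", "content": "Summarize our refund policy."},
]

for attempt in (1, 2):
    resp = client.chat.completions.create(model="gpt-4o", messages=messages)
    details = resp.usage.prompt_tokens_details
    cached = details.cached_tokens if details else 0
    print(f"attempt {attempt}: prompt_tokens={resp.usage.prompt_tokens} cached_tokens={cached}")
# On the second attempt, cached_tokens should be > 0 if the prefix was served from cache.
```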
For Our Setup (OpenClaw)
- Applies to: `litellm/copilot-gpt-*`, `litellm/gpt-*`, `litellm/o*` models.
- Automatic — no OpenClaw config needed for basic caching on GPT models.
- For 24h extended retention: need to pass `prompt_cache_retention: "24h"` in model params.
- Minimum prompt size: 1024 tokens (our system prompt easily exceeds this).
- Does NOT apply to Claude models (those use Anthropic's mechanism).