# OpenAI — Prompt Caching Best Practices **Source**: https://platform.openai.com/docs/guides/prompt-caching **Fetched**: 2026-03-05 --- ## How It Works - Caching is **automatic** — no code changes required, no extra fees. - Enabled for all prompts ≥ 1024 tokens. - Routes requests to servers that recently processed the same prompt prefix. - Cache hit: significantly reduced latency + lower cost. - Cache miss: full processing, prefix cached for future requests. ## Cache Retention Policies ### In-memory (default) - Available for ALL models supporting prompt caching (gpt-4o and newer). - Cached prefixes stay active for **5-10 minutes** of inactivity, up to **1 hour max**. - Held in volatile GPU memory. ### Extended (24h) - Available for: gpt-5.4, gpt-5.2, gpt-5.1, gpt-5.1-codex, gpt-5.1-codex-mini, gpt-5.1-chat-latest, gpt-5, gpt-5-codex, gpt-4.1 - Keeps cached prefixes active up to **24 hours**. - Offloads KV tensors to GPU-local storage when memory is full. - Opt in per request: `"prompt_cache_retention": "24h"`. - NOT zero-data-retention eligible (unlike in-memory). ## What Can Be Cached - Messages array (system, user, assistant) - Images in user messages (must be identical, same `detail` parameter) - Tool definitions - Structured output schemas ## Best Practices 1. **Static content first, dynamic content last**: Put system prompts, instructions, examples at beginning. Variable/user content at end. 2. **Use `prompt_cache_key`**: Group requests that share common prefixes under the same key to improve routing and hit rates. 3. **Stay under 15 req/min per prefix+key**: Above this rate, overflow requests go to new machines and miss cache. 4. **Maintain steady request stream**: Cache evicts after inactivity. Regular requests keep cache warm. 5. **Monitor `cached_tokens`** in `usage.prompt_tokens_details`: Track cache hit rates. ## Pricing - Cache writes: same as regular input tokens (no extra cost). - Cache reads: discounted (typically 50% of input price, varies by model). ## Verification Check `usage.prompt_tokens_details.cached_tokens` in responses to confirm cache is working. ## For Our Setup (OpenClaw) - Applies to: `litellm/copilot-gpt-*`, `litellm/gpt-*`, `litellm/o*` models. - Automatic — no OpenClaw config needed for basic caching on GPT models. - For 24h extended retention: need to pass `prompt_cache_retention: "24h"` in model params. - Minimum prompt size: 1024 tokens (our system prompt easily exceeds this). - Does NOT apply to Claude models (those use Anthropic's mechanism).