docs(references): add Anthropic + OpenAI official best practices

- anthropic-prompt-caching.md: KV cache mechanics, TTLs, pricing, auto vs explicit - openai-prompt-caching.md: automatic caching, in-memory vs 24h retention, prompt_cache_key - anthropic-prompting-best-practices.md: clear instructions, XML tags, few-shot, model-specific notes - openai-prompting-best-practices.md: message roles, optimization framework, structured outputs, model selection Key findings: - Anthropic caching: only for Claude models, 5m default TTL, 1h optional, 10% cost for reads - OpenAI caching: automatic/free, 5-10min default, 24h extended for GPT-5+ - GLM/ZAI models: neither caching mechanism applies - Subagent model routing table added to openai-prompting-best-practices.md
2026-03-05 20:34:38 +00:00
parent c2fe8155e3
commit 79e61f4528
4 changed files with 408 additions and 0 deletions
@@ -0,0 +1,56 @@
+# OpenAI — Prompt Caching Best Practices
+
+**Source**: https://platform.openai.com/docs/guides/prompt-caching
+**Fetched**: 2026-03-05
+
+---
+
+## How It Works
+
+- Caching is **automatic** — no code changes required, no extra fees.
+- Enabled for all prompts ≥ 1024 tokens.
+- Routes requests to servers that recently processed the same prompt prefix.
+- Cache hit: significantly reduced latency + lower cost.
+- Cache miss: full processing, prefix cached for future requests.
+
+## Cache Retention Policies
+
+### In-memory (default)
+- Available for ALL models supporting prompt caching (gpt-4o and newer).
+- Cached prefixes stay active for **5-10 minutes** of inactivity, up to **1 hour max**.
+- Held in volatile GPU memory.
+
+### Extended (24h)
+- Available for: gpt-5.4, gpt-5.2, gpt-5.1, gpt-5.1-codex, gpt-5.1-codex-mini, gpt-5.1-chat-latest, gpt-5, gpt-5-codex, gpt-4.1
+- Keeps cached prefixes active up to **24 hours**.
+- Offloads KV tensors to GPU-local storage when memory is full.
+- Opt in per request: `"prompt_cache_retention": "24h"`.
+- NOT zero-data-retention eligible (unlike in-memory).
+
+## What Can Be Cached
+- Messages array (system, user, assistant)
+- Images in user messages (must be identical, same `detail` parameter)
+- Tool definitions
+- Structured output schemas
+
+## Best Practices
+
+1. **Static content first, dynamic content last**: Put system prompts, instructions, examples at beginning. Variable/user content at end.
+2. **Use `prompt_cache_key`**: Group requests that share common prefixes under the same key to improve routing and hit rates.
+3. **Stay under 15 req/min per prefix+key**: Above this rate, overflow requests go to new machines and miss cache.
+4. **Maintain steady request stream**: Cache evicts after inactivity. Regular requests keep cache warm.
+5. **Monitor `cached_tokens`** in `usage.prompt_tokens_details`: Track cache hit rates.
+
+## Pricing
+- Cache writes: same as regular input tokens (no extra cost).
+- Cache reads: discounted (typically 50% of input price, varies by model).
+
+## Verification
+Check `usage.prompt_tokens_details.cached_tokens` in responses to confirm cache is working.
+
+## For Our Setup (OpenClaw)
+- Applies to: `litellm/copilot-gpt-*`, `litellm/gpt-*`, `litellm/o*` models.
+- Automatic — no OpenClaw config needed for basic caching on GPT models.
+- For 24h extended retention: need to pass `prompt_cache_retention: "24h"` in model params.
+- Minimum prompt size: 1024 tokens (our system prompt easily exceeds this).
+- Does NOT apply to Claude models (those use Anthropic's mechanism).