Files

zap 79e61f4528 docs(references): add Anthropic + OpenAI official best practices

- anthropic-prompt-caching.md: KV cache mechanics, TTLs, pricing, auto vs explicit
- openai-prompt-caching.md: automatic caching, in-memory vs 24h retention, prompt_cache_key
- anthropic-prompting-best-practices.md: clear instructions, XML tags, few-shot, model-specific notes
- openai-prompting-best-practices.md: message roles, optimization framework, structured outputs, model selection

Key findings:
- Anthropic caching: only for Claude models, 5m default TTL, 1h optional, 10% cost for reads
- OpenAI caching: automatic/free, 5-10min default, 24h extended for GPT-5+
- GLM/ZAI models: neither caching mechanism applies
- Subagent model routing table added to openai-prompting-best-practices.md

2026-03-05 20:34:38 +00:00

2.5 KiB

Raw Blame History

OpenAI — Prompt Caching Best Practices

Source: https://platform.openai.com/docs/guides/prompt-caching Fetched: 2026-03-05

How It Works

Caching is automatic — no code changes required, no extra fees.
Enabled for all prompts ≥ 1024 tokens.
Routes requests to servers that recently processed the same prompt prefix.
Cache hit: significantly reduced latency + lower cost.
Cache miss: full processing, prefix cached for future requests.

Cache Retention Policies

In-memory (default)

Available for ALL models supporting prompt caching (gpt-4o and newer).
Cached prefixes stay active for 5-10 minutes of inactivity, up to 1 hour max.
Held in volatile GPU memory.

Extended (24h)

Available for: gpt-5.4, gpt-5.2, gpt-5.1, gpt-5.1-codex, gpt-5.1-codex-mini, gpt-5.1-chat-latest, gpt-5, gpt-5-codex, gpt-4.1
Keeps cached prefixes active up to 24 hours.
Offloads KV tensors to GPU-local storage when memory is full.
Opt in per request: "prompt_cache_retention": "24h".
NOT zero-data-retention eligible (unlike in-memory).

What Can Be Cached

Messages array (system, user, assistant)
Images in user messages (must be identical, same detail parameter)
Tool definitions
Structured output schemas

Best Practices

Static content first, dynamic content last: Put system prompts, instructions, examples at beginning. Variable/user content at end.
Use prompt_cache_key: Group requests that share common prefixes under the same key to improve routing and hit rates.
Stay under 15 req/min per prefix+key: Above this rate, overflow requests go to new machines and miss cache.
Maintain steady request stream: Cache evicts after inactivity. Regular requests keep cache warm.
Monitor cached_tokens in usage.prompt_tokens_details: Track cache hit rates.

Pricing

Cache writes: same as regular input tokens (no extra cost).
Cache reads: discounted (typically 50% of input price, varies by model).

Verification

Check usage.prompt_tokens_details.cached_tokens in responses to confirm cache is working.

For Our Setup (OpenClaw)

Applies to: litellm/copilot-gpt-*, litellm/gpt-*, litellm/o* models.
Automatic — no OpenClaw config needed for basic caching on GPT models.
For 24h extended retention: need to pass prompt_cache_retention: "24h" in model params.
Minimum prompt size: 1024 tokens (our system prompt easily exceeds this).
Does NOT apply to Claude models (those use Anthropic's mechanism).

2.5 KiB Raw Blame History

OpenAI — Prompt Caching Best Practices

How It Works

Cache Retention Policies

In-memory (default)

Extended (24h)

What Can Be Cached

Best Practices

Pricing

Verification

For Our Setup (OpenClaw)

2.5 KiB

Raw Blame History