Companion docs in this set:
- anthropic-prompt-caching.md: KV cache mechanics, TTLs, pricing, auto vs explicit
- openai-prompt-caching.md: automatic caching, in-memory vs 24h retention, prompt_cache_key
- anthropic-prompting-best-practices.md: clear instructions, XML tags, few-shot, model-specific notes
- openai-prompting-best-practices.md: message roles, optimization framework, structured outputs, model selection

Key findings:
- Anthropic caching: Claude models only; 5m default TTL, 1h optional; reads cost 10% of base input
- OpenAI caching: automatic/free; 5-10 min default, 24h extended retention for GPT-5+
- GLM/ZAI models: neither caching mechanism applies
- Subagent model routing table added to openai-prompting-best-practices.md
Anthropic — Prompt Caching Best Practices
Source: https://docs.anthropic.com/en/docs/build-with-claude/prompt-caching Fetched: 2026-03-05
How It Works
- On first request: system processes full prompt and caches the prefix once the response begins.
- On subsequent requests with the same prefix: the cached version is reused (much cheaper and faster).
- Cache is checked against a cryptographic hash of the prefix content.
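Conceptually, the lookup works like hashing the serialized prefix and comparing keys. The sketch below illustrates that idea only; Anthropic's actual hashing scheme is internal and not specified in the docs, and `prefix_key` is a made-up helper name.

```python
import hashlib
import json

def prefix_key(tools, system, messages_prefix):
    """Illustrative only: derive a deterministic key from the cacheable
    prefix (tools -> system -> messages). Not Anthropic's real scheme."""
    canonical = json.dumps(
        {"tools": tools, "system": system, "messages": messages_prefix},
        sort_keys=True,
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Identical prefixes map to the same key (cache hit); any byte-level change misses.
k1 = prefix_key([], "You are a helpful assistant.", [{"role": "user", "content": "hi"}])
k2 = prefix_key([], "You are a helpful assistant.", [{"role": "user", "content": "hi"}])
k3 = prefix_key([], "You are a helpful assistant!", [{"role": "user", "content": "hi"}])
```

This is why "put static content first" matters: any edit early in the prefix changes the key for everything after it.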
Two Caching Modes
Automatic Caching (recommended for multi-turn)
Add `cache_control: {"type": "ephemeral"}` at the top level of the request body.
- System automatically caches all content up to the last cacheable block.
- Moves cache breakpoint forward as conversation grows.
- Best for multi-turn conversations.
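A minimal sketch of the request body shape described above. Field placement follows these notes (not verified against the current SDK), and the model id string is a hypothetical example.

```python
# Messages API body with top-level automatic caching, as described above.
# Verify field placement against the current Anthropic API reference.
body = {
    "model": "claude-sonnet-4-6",            # hypothetical model id string
    "max_tokens": 1024,
    "cache_control": {"type": "ephemeral"},  # auto-cache up to the last cacheable block
    "system": "Large, mostly static system prompt ...",
    "messages": [
        {"role": "user", "content": "First turn"},
    ],
}
```

On each later turn you resend the same prefix plus the new messages; the system moves the breakpoint forward for you.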
Explicit Cache Breakpoints
Place `cache_control` directly on individual content blocks.
- Finer control over exactly what gets cached.
- Use when you want to cache specific blocks (e.g., a large document) but not others.
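For example, caching a large reference document while leaving the question after it dynamic. A sketch with illustrative values (the model id is a hypothetical placeholder):

```python
# Explicit breakpoint: cache_control on one content block only.
# The large document is cached; the trailing question stays dynamic.
body = {
    "model": "claude-sonnet-4-6",  # hypothetical model id
    "max_tokens": 1024,
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "<large reference document here>",
                    "cache_control": {"type": "ephemeral"},  # breakpoint
                },
                {"type": "text", "text": "What does section 3 say?"},
            ],
        }
    ],
}
```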
Cache Lifetimes
| Duration | Cost | Availability |
|---|---|---|
| 5 minutes (default) | 1.25x base input price for write | All models |
| 1 hour | 2x base input price for write | Available at additional cost |
- Cache reads cost 0.1x (10%) of base input price.
- Cache is refreshed for no additional cost each time cached content is used.
- Default TTL: 5 minutes (refreshed on each use within TTL).
Pricing Per Million Tokens (relevant models)
| Model | Base Input | 5m Write | 1h Write | Cache Read | Output |
|---|---|---|---|---|---|
| Claude Opus 4.6 | $5 | $6.25 | $10 | $0.50 | $25 |
| Claude Sonnet 4.6 | $3 | $3.75 | $6 | $0.30 | $15 |
| Claude Haiku 4.5 | $1 | $1.25 | $2 | $0.10 | $5 |
Note: We use a Copilot subscription (flat rate), so per-token cost doesn't apply directly. But quota burn follows similar relative proportions: caching still saves quota by reducing re-processing of identical prefixes.
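To make the relative savings concrete, here is the arithmetic for a 20k-token prefix on the Sonnet numbers from the table above, reused across 10 turns within the 5-minute TTL:

```python
# Sonnet pricing from the table: base $3/M, 5m write $3.75/M, read $0.30/M.
PREFIX_MTOK = 20_000 / 1_000_000
BASE, WRITE_5M, READ = 3.00, 3.75, 0.30

no_cache = 10 * PREFIX_MTOK * BASE                             # reprocess prefix every turn
with_cache = PREFIX_MTOK * WRITE_5M + 9 * PREFIX_MTOK * READ   # 1 write + 9 reads

print(f"no cache:   ${no_cache:.3f}")    # $0.600
print(f"with cache: ${with_cache:.3f}")  # $0.129 (~78% less)
```

The write premium (1.25x) is paid once; every subsequent hit within the TTL costs 0.1x, so caching pays for itself after the first reuse.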
Supported Models
- Claude Opus 4.6, 4.5, 4.1, 4
- Claude Sonnet 4.6, 4.5, 4
- Claude Haiku 4.5, 3.5
Not supported: Non-Claude models (GPT, GLM, Gemini) — caching is Anthropic-only.
What Gets Cached
Prefix order: tools → system → messages (up to the cache breakpoint).
The full prefix is cached — all of tools, system, and messages up to and including the marked block.
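A sketch of that ordering: a breakpoint on the system block caches the tools and system prompt together, leaving only the message turns dynamic. Values and the model id are illustrative, not a verified SDK call.

```python
# Prefix order: tools -> system -> messages, cached up to the breakpoint.
body = {
    "model": "claude-sonnet-4-6",  # hypothetical model id
    "max_tokens": 1024,
    "tools": [
        {"name": "search", "description": "...", "input_schema": {"type": "object"}}
    ],
    "system": [
        {
            "type": "text",
            "text": "Static instructions ...",
            "cache_control": {"type": "ephemeral"},  # tools + system cached together
        }
    ],
    "messages": [{"role": "user", "content": "Dynamic user turn"}],
}
```

Because tools and system sit before the breakpoint, changing a tool definition invalidates the cached system prompt as well.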
Key Best Practices
- Put static content first: Instructions, system prompts, and background context should come before dynamic/user content.
- Use 1-hour cache for long sessions: Default 5-minute TTL means cache expires between turns if idle > 5 min. Use 1h for agents with longer gaps.
- Automatic caching for multi-turn: Simplest approach, handles the growing message history automatically.
- Minimum size: caching only activates above a model-specific token threshold (on the order of 1,024 tokens on current Claude models); typical system prompts qualify easily.
- Privacy: Cache stores KV representations and cryptographic hashes, NOT raw text. ZDR-compatible.
For Our Setup (OpenClaw)
- Main session system prompt is large (~15-20k tokens) and mostly static → ideal caching candidate.
- Heartbeat turns are the same every 25-30min → if using 1h cache, heartbeats keep cache warm for free.
- OpenClaw's `cacheRetention` config likely maps to this `cache_control` setting.
- Applies to `litellm/copilot-claude-*` models only. Does NOT apply to GLM, GPT-4o, Gemini.
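The heartbeat point above can be checked with a small model of refresh-on-use: an entry lives TTL minutes past its last use, and each use (hit or rewrite) resets the clock. This simulates the behavior described in these notes, not any API.

```python
def hits(use_times_min, ttl_min):
    """Return, for each use after the first, whether it hit a still-live entry."""
    last, result = use_times_min[0], []
    for t in use_times_min[1:]:
        result.append((t - last) <= ttl_min)
        last = t  # a miss rewrites the cache, a hit refreshes it; both reset the TTL
    return result

heartbeats = [0, 28, 56, 84]          # minutes; one heartbeat roughly every 28 min
print(hits(heartbeats, ttl_min=5))    # cold on every turn
print(hits(heartbeats, ttl_min=60))   # kept warm for free
```

With the 5-minute default TTL, 25-30 minute heartbeats always arrive cold; with the 1-hour TTL, each heartbeat refreshes the entry before it expires.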