Files
swarm-master/swarm-common/obsidian-vault/zap/memory/references/anthropic-prompt-caching.md
William Valentin 4b1afb1073 feat: add swarm-common obsidian vault
Add Obsidian vault to the swarm-common virtiofs share for access
from zap VM and other VMs. Contains agent memory, notes, and
infrastructure documentation.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-19 15:36:02 -07:00

3.4 KiB

Anthropic — Prompt Caching Best Practices

Source: https://docs.anthropic.com/en/docs/build-with-claude/prompt-caching Fetched: 2026-03-05


How It Works

  1. On first request: system processes full prompt and caches the prefix once the response begins.
  2. On subsequent requests with same prefix: uses cached version (much cheaper + faster).
  3. Cache is checked against a cryptographic hash of the prefix content.

Two Caching Modes

Add cache_control: {"type": "ephemeral"} at the top level of the request body.

  • System automatically caches all content up to the last cacheable block.
  • Moves cache breakpoint forward as conversation grows.
  • Best for multi-turn conversations.

Explicit Cache Breakpoints

Place cache_control directly on individual content blocks.

  • Finer control over exactly what gets cached.
  • Use when you want to cache specific blocks (e.g., a large document) but not others.

Cache Lifetimes

Duration Cost Availability
5 minutes (default) 1.25x base input price for write All models
1 hour 2x base input price for write Available at additional cost
  • Cache reads cost 0.1x (10%) of base input price.
  • Cache is refreshed for no additional cost each time cached content is used.
  • Default TTL: 5 minutes (refreshed on each use within TTL).

Pricing Per Million Tokens (relevant models)

Model Base Input 5m Write 1h Write Cache Read Output
Claude Opus 4.6 $5 $6.25 $10 $0.50 $25
Claude Sonnet 4.6 $3 $3.75 $6 $0.30 $15
Claude Haiku 4.5 $1 $1.25 $2 $0.10 $5

Note: We use Copilot subscription (flat rate), so per-token cost doesn't apply directly. But quota burn follows similar relative proportions — caching still saves quota by reducing re-processing of identical prefixes.

Supported Models

  • Claude Opus 4.6, 4.5, 4.1, 4
  • Claude Sonnet 4.6, 4.5, 4
  • Claude Haiku 4.5, 3.5

Not supported: Non-Claude models (GPT, GLM, Gemini) — caching is Anthropic-only.

What Gets Cached

Prefix order: toolssystemmessages (up to the cache breakpoint).

The full prefix is cached — all of tools, system, and messages up to and including the marked block.

Key Best Practices

  1. Put static content first: Instructions, system prompts, and background context should come before dynamic/user content.
  2. Use 1-hour cache for long sessions: Default 5-minute TTL means cache expires between turns if idle > 5 min. Use 1h for agents with longer gaps.
  3. Automatic caching for multi-turn: Simplest approach, handles the growing message history automatically.
  4. Minimum size: Cache only activates for content > a certain token threshold (details not specified, but system prompts qualify easily).
  5. Privacy: Cache stores KV representations and cryptographic hashes, NOT raw text. ZDR-compatible.

For Our Setup (OpenClaw)

  • Main session system prompt is large (~15-20k tokens) and mostly static → ideal caching candidate.
  • Heartbeat turns are the same every 25-30min → if using 1h cache, heartbeats keep cache warm for free.
  • OpenClaw's cacheRetention config likely maps to this cache_control setting.
  • Applies to: litellm/copilot-claude-* models only. Does NOT apply to GLM, GPT-4o, Gemini.