swarm-master/swarm-common/obsidian-vault/zap/memory/references/openai-prompt-caching.md
William Valentin 4b1afb1073 feat: add swarm-common obsidian vault
Add Obsidian vault to the swarm-common virtiofs share for access
from zap VM and other VMs. Contains agent memory, notes, and
infrastructure documentation.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-19 15:36:02 -07:00


OpenAI — Prompt Caching Best Practices

Source: https://platform.openai.com/docs/guides/prompt-caching (fetched 2026-03-05)


How It Works

  • Caching is automatic — no code changes required, no extra fees.
  • Enabled for all prompts ≥ 1024 tokens.
  • Routes requests to servers that recently processed the same prompt prefix.
  • Cache hit: significantly reduced latency + lower cost.
  • Cache miss: full processing, prefix cached for future requests.
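
Since caching keys on the prompt prefix, the practical move is to keep everything static at the front and everything per-request at the end. A minimal sketch of a request payload ordered this way (`STATIC_SYSTEM` and `build_payload` are hypothetical placeholders, not part of the API):

```python
# Sketch: build a Chat Completions payload ordered for prefix caching.
# STATIC_SYSTEM stands in for a long (>= 1024-token) instruction block
# that is byte-identical across requests.
STATIC_SYSTEM = "You are a support assistant. <...long, unchanging instructions...>"

def build_payload(user_message: str) -> dict:
    """Static content first (cacheable prefix), per-request content last."""
    return {
        "model": "gpt-4o",
        "messages": [
            {"role": "system", "content": STATIC_SYSTEM},  # identical every call
            {"role": "user", "content": user_message},     # varies per call
        ],
    }

payload = build_payload("How do I reset my password?")
```

Any change to the static prefix (even whitespace) produces a different prefix and a cache miss, so treat it as immutable.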

Cache Retention Policies

In-memory (default)

  • Available for ALL models supporting prompt caching (gpt-4o and newer).
  • Cached prefixes stay active for 5-10 minutes of inactivity, up to 1 hour max.
  • Held in volatile GPU memory.

Extended (24h)

  • Available for: gpt-5.4, gpt-5.2, gpt-5.1, gpt-5.1-codex, gpt-5.1-codex-mini, gpt-5.1-chat-latest, gpt-5, gpt-5-codex, gpt-4.1
  • Keeps cached prefixes active up to 24 hours.
  • Offloads KV tensors to GPU-local storage when memory is full.
  • Opt in per request: "prompt_cache_retention": "24h".
  • NOT zero-data-retention eligible (unlike in-memory).
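
Opting in is a per-request flag. A sketch of the request body with the `prompt_cache_retention` field from the source (if your SDK version does not expose the parameter directly, the official Python SDK's `extra_body` argument can carry it):

```python
# Sketch: opt a single request into extended (24h) cache retention.
payload = {
    "model": "gpt-5.1",
    "messages": [{"role": "user", "content": "ping"}],
    "prompt_cache_retention": "24h",  # extended retention; omit for in-memory default
}
```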

What Can Be Cached

  • Messages array (system, user, assistant)
  • Images in user messages (must be identical, same detail parameter)
  • Tool definitions
  • Structured output schemas
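
Tool definitions and output schemas are part of the cached prefix, so they should be defined once and reused byte-identically across requests. A sketch (the names `get_weather` and `answer` are hypothetical):

```python
# Sketch: stable tool definitions and a structured-output schema, kept as
# module-level constants so every request sends the exact same bytes.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

RESPONSE_FORMAT = {
    "type": "json_schema",
    "json_schema": {
        "name": "answer",
        "schema": {"type": "object", "properties": {"text": {"type": "string"}}},
    },
}

def build_payload(user_message: str) -> dict:
    # Reusing the same constants keeps tools/schema cacheable with the prefix.
    return {
        "model": "gpt-4o",
        "tools": TOOLS,
        "response_format": RESPONSE_FORMAT,
        "messages": [{"role": "user", "content": user_message}],
    }
```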

Best Practices

  1. Static content first, dynamic content last: Put system prompts, instructions, examples at beginning. Variable/user content at end.
  2. Use prompt_cache_key: Group requests that share common prefixes under the same key to improve routing and hit rates.
  3. Stay under 15 req/min per prefix+key: Above this rate, overflow requests go to new machines and miss cache.
  4. Maintain steady request stream: Cache evicts after inactivity. Regular requests keep cache warm.
  5. Monitor cached_tokens in usage.prompt_tokens_details: Track cache hit rates.
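
Practices 1 and 2 combine naturally: requests that share a prefix get the same `prompt_cache_key`. A sketch, assuming a hypothetical per-tenant prefix (the `tenant` naming is illustrative, not from the source):

```python
# Sketch: group requests that share a prefix under one prompt_cache_key,
# so the router can send them to the machine holding the warm cache.
def build_payload(tenant_id: str, user_message: str) -> dict:
    return {
        "model": "gpt-4o",
        "prompt_cache_key": f"tenant-{tenant_id}",  # same key => same prefix group
        "messages": [
            # Static per-tenant instructions first, per-request content last.
            {"role": "system", "content": f"Rules for tenant {tenant_id} ..."},
            {"role": "user", "content": user_message},
        ],
    }

a = build_payload("acme", "first question")
b = build_payload("acme", "second question")
# Same tenant => same cache key, so both requests target one warm prefix.
```

If one prefix+key pair exceeds ~15 req/min, shard it into multiple keys rather than letting overflow traffic miss the cache.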

Pricing

  • Cache writes: same as regular input tokens (no extra cost).
  • Cache reads: discounted (typically 50% of input price, varies by model).

Verification

Check usage.prompt_tokens_details.cached_tokens in responses to confirm cache is working.
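
A small sketch of turning that field into a hit-rate metric (field paths follow the API's `usage.prompt_tokens_details.cached_tokens`; the numbers are illustrative):

```python
# Sketch: compute a cache hit rate from a response's usage block.
def cache_hit_rate(usage: dict) -> float:
    prompt = usage["prompt_tokens"]
    cached = usage.get("prompt_tokens_details", {}).get("cached_tokens", 0)
    return cached / prompt if prompt else 0.0

# Example usage dict shaped like an API response (illustrative numbers).
usage = {"prompt_tokens": 2048, "prompt_tokens_details": {"cached_tokens": 1024}}
rate = cache_hit_rate(usage)  # 0.5
```

A persistently zero `cached_tokens` usually means the prefix is changing between requests or the prompt is under the 1024-token minimum.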

For Our Setup (OpenClaw)

  • Applies to: litellm/copilot-gpt-*, litellm/gpt-*, litellm/o* models.
  • Automatic — no OpenClaw config needed for basic caching on GPT models.
  • For 24h extended retention: need to pass prompt_cache_retention: "24h" in model params.
  • Minimum prompt size: 1024 tokens (our system prompt easily exceeds this).
  • Does NOT apply to Claude models (those use Anthropic's mechanism).
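
A hedged sketch of what the 24h opt-in might look like in a LiteLLM proxy config, assuming LiteLLM forwards unrecognized `litellm_params` through to the provider (verify against the LiteLLM docs before relying on this; the model name is illustrative):

```yaml
# Sketch only — assumes litellm_params passthrough to the OpenAI API.
model_list:
  - model_name: gpt-5.1
    litellm_params:
      model: openai/gpt-5.1
      prompt_cache_retention: "24h"   # extended retention opt-in per request
```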