Files
swarm-master/swarm-common/obsidian-vault/zap/memory/references/openai-prompt-caching.md
William Valentin 4b1afb1073 feat: add swarm-common obsidian vault
Add Obsidian vault to the swarm-common virtiofs share for access
from zap VM and other VMs. Contains agent memory, notes, and
infrastructure documentation.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-19 15:36:02 -07:00

57 lines
2.5 KiB
Markdown

# OpenAI — Prompt Caching Best Practices
**Source**: https://platform.openai.com/docs/guides/prompt-caching
**Fetched**: 2026-03-05
---
## How It Works
- Caching is **automatic** — no code changes required, no extra fees.
- Enabled for all prompts ≥ 1024 tokens.
- Routes requests to servers that recently processed the same prompt prefix.
- Cache hit: significantly reduced latency + lower cost.
- Cache miss: full processing, prefix cached for future requests.
## Cache Retention Policies
### In-memory (default)
- Available for ALL models supporting prompt caching (gpt-4o and newer).
- Cached prefixes stay active for **5-10 minutes** of inactivity, up to **1 hour max**.
- Held in volatile GPU memory.
### Extended (24h)
- Available for: gpt-5.4, gpt-5.2, gpt-5.1, gpt-5.1-codex, gpt-5.1-codex-mini, gpt-5.1-chat-latest, gpt-5, gpt-5-codex, gpt-4.1
- Keeps cached prefixes active up to **24 hours**.
- Offloads KV tensors to GPU-local storage when memory is full.
- Opt in per request: `"prompt_cache_retention": "24h"`.
- NOT zero-data-retention eligible (unlike in-memory).
## What Can Be Cached
- Messages array (system, user, assistant)
- Images in user messages (must be identical, same `detail` parameter)
- Tool definitions
- Structured output schemas
## Best Practices
1. **Static content first, dynamic content last**: Put system prompts, instructions, examples at beginning. Variable/user content at end.
2. **Use `prompt_cache_key`**: Group requests that share common prefixes under the same key to improve routing and hit rates.
3. **Stay under 15 req/min per prefix+key**: Above this rate, overflow requests go to new machines and miss cache.
4. **Maintain steady request stream**: Cache evicts after inactivity. Regular requests keep cache warm.
5. **Monitor `cached_tokens`** in `usage.prompt_tokens_details`: Track cache hit rates.
## Pricing
- Cache writes: same as regular input tokens (no extra cost).
- Cache reads: discounted (typically 50% of input price, varies by model).
## Verification
Check `usage.prompt_tokens_details.cached_tokens` in responses to confirm cache is working.
## For Our Setup (OpenClaw)
- Applies to: `litellm/copilot-gpt-*`, `litellm/gpt-*`, `litellm/o*` models.
- Automatic — no OpenClaw config needed for basic caching on GPT models.
- For 24h extended retention: need to pass `prompt_cache_retention: "24h"` in model params.
- Minimum prompt size: 1024 tokens (our system prompt easily exceeds this).
- Does NOT apply to Claude models (those use Anthropic's mechanism).