swarm-master/swarm-common/obsidian-vault/zap/memory/references/openai-prompt-caching.md
William Valentin 4b1afb1073 feat: add swarm-common obsidian vault
Add Obsidian vault to the swarm-common virtiofs share for access
from zap VM and other VMs. Contains agent memory, notes, and
infrastructure documentation.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-19 15:36:02 -07:00


OpenAI — Prompt Caching Best Practices

Source: https://platform.openai.com/docs/guides/prompt-caching (fetched 2026-03-05)


How It Works

  • Caching is automatic — no code changes required, no extra fees.
  • Enabled for all prompts ≥ 1024 tokens.
  • Routes requests to servers that recently processed the same prompt prefix.
  • Cache hit: significantly reduced latency + lower cost.
  • Cache miss: full processing, prefix cached for future requests.
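
Since caching keys on the prompt prefix, the practical move is to keep everything static at the front and everything per-request at the end. A minimal sketch of a request payload ordered this way (`STATIC_SYSTEM` and `build_payload` are hypothetical placeholders, not part of the API):

```python
# Sketch: build a Chat Completions payload ordered for prefix caching.
# STATIC_SYSTEM stands in for a long (>= 1024-token) instruction block
# that is byte-identical across requests.
STATIC_SYSTEM = "You are a support assistant. <...long, unchanging instructions...>"

def build_payload(user_message: str) -> dict:
    """Static content first (cacheable prefix), per-request content last."""
    return {
        "model": "gpt-4o",
        "messages": [
            {"role": "system", "content": STATIC_SYSTEM},  # identical every call
            {"role": "user", "content": user_message},     # varies per call
        ],
    }

payload = build_payload("How do I reset my password?")
```

Any change to the static prefix (even whitespace) produces a different prefix and a cache miss, so treat it as immutable.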

Cache Retention Policies

In-memory (default)

  • Available for ALL models supporting prompt caching (gpt-4o and newer).
  • Cached prefixes stay active for 5-10 minutes of inactivity, up to 1 hour max.
  • Held in volatile GPU memory.

Extended (24h)

  • Available for: gpt-5.4, gpt-5.2, gpt-5.1, gpt-5.1-codex, gpt-5.1-codex-mini, gpt-5.1-chat-latest, gpt-5, gpt-5-codex, gpt-4.1
  • Keeps cached prefixes active up to 24 hours.
  • Offloads KV tensors to GPU-local storage when memory is full.
  • Opt in per request: "prompt_cache_retention": "24h".
  • NOT zero-data-retention eligible (unlike in-memory).
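
Opting in is a per-request flag. A sketch of the request body with the `prompt_cache_retention` field from the source (if your SDK version does not expose the parameter directly, the official Python SDK's `extra_body` argument can carry it):

```python
# Sketch: opt a single request into extended (24h) cache retention.
payload = {
    "model": "gpt-5.1",
    "messages": [{"role": "user", "content": "ping"}],
    "prompt_cache_retention": "24h",  # extended retention; omit for in-memory default
}
```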

What Can Be Cached

  • Messages array (system, user, assistant)
  • Images in user messages (must be identical, same detail parameter)
  • Tool definitions
  • Structured output schemas
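
Tool definitions and output schemas are part of the cached prefix, so they should be defined once and reused byte-identically across requests. A sketch (the names `get_weather` and `answer` are hypothetical):

```python
# Sketch: stable tool definitions and a structured-output schema, kept as
# module-level constants so every request sends the exact same bytes.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

RESPONSE_FORMAT = {
    "type": "json_schema",
    "json_schema": {
        "name": "answer",
        "schema": {"type": "object", "properties": {"text": {"type": "string"}}},
    },
}

def build_payload(user_message: str) -> dict:
    # Reusing the same constants keeps tools/schema cacheable with the prefix.
    return {
        "model": "gpt-4o",
        "tools": TOOLS,
        "response_format": RESPONSE_FORMAT,
        "messages": [{"role": "user", "content": user_message}],
    }
```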

Best Practices

  1. Static content first, dynamic content last: Put system prompts, instructions, examples at beginning. Variable/user content at end.
  2. Use prompt_cache_key: Group requests that share common prefixes under the same key to improve routing and hit rates.
  3. Stay under 15 req/min per prefix+key: Above this rate, overflow requests go to new machines and miss cache.
  4. Maintain steady request stream: Cache evicts after inactivity. Regular requests keep cache warm.
  5. Monitor cached_tokens in usage.prompt_tokens_details: Track cache hit rates.
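
Practices 1 and 2 combine naturally: requests that share a prefix get the same `prompt_cache_key`. A sketch, assuming a hypothetical per-tenant prefix (the `tenant` naming is illustrative, not from the source):

```python
# Sketch: group requests that share a prefix under one prompt_cache_key,
# so the router can send them to the machine holding the warm cache.
def build_payload(tenant_id: str, user_message: str) -> dict:
    return {
        "model": "gpt-4o",
        "prompt_cache_key": f"tenant-{tenant_id}",  # same key => same prefix group
        "messages": [
            # Static per-tenant instructions first, per-request content last.
            {"role": "system", "content": f"Rules for tenant {tenant_id} ..."},
            {"role": "user", "content": user_message},
        ],
    }

a = build_payload("acme", "first question")
b = build_payload("acme", "second question")
# Same tenant => same cache key, so both requests target one warm prefix.
```

If one prefix+key pair exceeds ~15 req/min, shard it into multiple keys rather than letting overflow traffic miss the cache.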

Pricing

  • Cache writes: same as regular input tokens (no extra cost).
  • Cache reads: discounted (typically 50% of input price, varies by model).

Verification

Check usage.prompt_tokens_details.cached_tokens in responses to confirm cache is working.
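
A small sketch of turning that field into a hit-rate metric (field paths follow the API's `usage.prompt_tokens_details.cached_tokens`; the numbers are illustrative):

```python
# Sketch: compute a cache hit rate from a response's usage block.
def cache_hit_rate(usage: dict) -> float:
    prompt = usage["prompt_tokens"]
    cached = usage.get("prompt_tokens_details", {}).get("cached_tokens", 0)
    return cached / prompt if prompt else 0.0

# Example usage dict shaped like an API response (illustrative numbers).
usage = {"prompt_tokens": 2048, "prompt_tokens_details": {"cached_tokens": 1024}}
rate = cache_hit_rate(usage)  # 0.5
```

A persistently zero `cached_tokens` usually means the prefix is changing between requests or the prompt is under the 1024-token minimum.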

For Our Setup (OpenClaw)

  • Applies to: litellm/copilot-gpt-*, litellm/gpt-*, litellm/o* models.
  • Automatic — no OpenClaw config needed for basic caching on GPT models.
  • For 24h extended retention: need to pass prompt_cache_retention: "24h" in model params.
  • Minimum prompt size: 1024 tokens (our system prompt easily exceeds this).
  • Does NOT apply to Claude models (those use Anthropic's mechanism).
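
A hedged sketch of what the 24h opt-in might look like in a LiteLLM proxy config, assuming LiteLLM forwards unrecognized `litellm_params` through to the provider (verify against the LiteLLM docs before relying on this; the model name is illustrative):

```yaml
# Sketch only — assumes litellm_params passthrough to the OpenAI API.
model_list:
  - model_name: gpt-5.1
    litellm_params:
      model: openai/gpt-5.1
      prompt_cache_retention: "24h"   # extended retention opt-in per request
```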