From 79e61f45289efba989677212f3ff85c452b4aa23 Mon Sep 17 00:00:00 2001 From: zap Date: Thu, 5 Mar 2026 20:34:38 +0000 Subject: [PATCH] docs(references): add Anthropic + OpenAI official best practices - anthropic-prompt-caching.md: KV cache mechanics, TTLs, pricing, auto vs explicit - openai-prompt-caching.md: automatic caching, in-memory vs 24h retention, prompt_cache_key - anthropic-prompting-best-practices.md: clear instructions, XML tags, few-shot, model-specific notes - openai-prompting-best-practices.md: message roles, optimization framework, structured outputs, model selection Key findings: - Anthropic caching: only for Claude models, 5m default TTL, 1h optional, 10% cost for reads - OpenAI caching: automatic/free, 5-10min default, 24h extended for GPT-5+ - GLM/ZAI models: neither caching mechanism applies - Subagent model routing table added to openai-prompting-best-practices.md --- memory/references/anthropic-prompt-caching.md | 72 ++++++++ .../anthropic-prompting-best-practices.md | 124 ++++++++++++++ memory/references/openai-prompt-caching.md | 56 +++++++ .../openai-prompting-best-practices.md | 156 ++++++++++++++++++ 4 files changed, 408 insertions(+) create mode 100644 memory/references/anthropic-prompt-caching.md create mode 100644 memory/references/anthropic-prompting-best-practices.md create mode 100644 memory/references/openai-prompt-caching.md create mode 100644 memory/references/openai-prompting-best-practices.md diff --git a/memory/references/anthropic-prompt-caching.md b/memory/references/anthropic-prompt-caching.md new file mode 100644 index 0000000..5403a6d --- /dev/null +++ b/memory/references/anthropic-prompt-caching.md @@ -0,0 +1,72 @@ +# Anthropic — Prompt Caching Best Practices + +**Source**: https://docs.anthropic.com/en/docs/build-with-claude/prompt-caching +**Fetched**: 2026-03-05 + +--- + +## How It Works + +1. On first request: system processes full prompt and caches the prefix once the response begins. +2. On subsequent requests with same prefix: uses cached version (much cheaper + faster). +3. Cache is checked against a cryptographic hash of the prefix content. + +## Two Caching Modes + +### Automatic Caching (recommended for multi-turn) +Add `cache_control: {"type": "ephemeral"}` at the **top level** of the request body. +- System automatically caches all content up to the last cacheable block. +- Moves cache breakpoint forward as conversation grows. +- Best for multi-turn conversations. + +### Explicit Cache Breakpoints +Place `cache_control` directly on individual content blocks. +- Finer control over exactly what gets cached. +- Use when you want to cache specific blocks (e.g., a large document) but not others. + +## Cache Lifetimes + +| Duration | Cost | Availability | +|----------|------|-------------| +| 5 minutes (default) | 1.25x base input price for write | All models | +| 1 hour | 2x base input price for write | Available at additional cost | + +- Cache **reads** cost 0.1x (10%) of base input price. +- Cache is refreshed for **no additional cost** each time cached content is used. +- Default TTL: **5 minutes** (refreshed on each use within TTL). + +## Pricing Per Million Tokens (relevant models) + +| Model | Base Input | 5m Write | 1h Write | Cache Read | Output | +|-------|-----------|----------|----------|------------|--------| +| Claude Opus 4.6 | $5 | $6.25 | $10 | $0.50 | $25 | +| Claude Sonnet 4.6 | $3 | $3.75 | $6 | $0.30 | $15 | +| Claude Haiku 4.5 | $1 | $1.25 | $2 | $0.10 | $5 | + +> Note: We use Copilot subscription (flat rate), so per-token cost doesn't apply directly. But quota burn follows similar relative proportions — caching still saves quota by reducing re-processing of identical prefixes. + +## Supported Models +- Claude Opus 4.6, 4.5, 4.1, 4 +- Claude Sonnet 4.6, 4.5, 4 +- Claude Haiku 4.5, 3.5 + +**Not supported**: Non-Claude models (GPT, GLM, Gemini) — caching is Anthropic-only. + +## What Gets Cached +Prefix order: `tools` → `system` → `messages` (up to the cache breakpoint). + +The full prefix is cached — all of tools, system, and messages up to and including the marked block. + +## Key Best Practices + +1. **Put static content first**: Instructions, system prompts, and background context should come before dynamic/user content. +2. **Use 1-hour cache for long sessions**: Default 5-minute TTL means cache expires between turns if idle > 5 min. Use 1h for agents with longer gaps. +3. **Automatic caching for multi-turn**: Simplest approach, handles the growing message history automatically. +4. **Minimum size**: Cache only activates for content > a certain token threshold (details not specified, but system prompts qualify easily). +5. **Privacy**: Cache stores KV representations and cryptographic hashes, NOT raw text. ZDR-compatible. + +## For Our Setup (OpenClaw) +- Main session system prompt is large (~15-20k tokens) and mostly static → ideal caching candidate. +- Heartbeat turns are the same every 25-30min → if using 1h cache, heartbeats keep cache warm for free. +- OpenClaw's `cacheRetention` config likely maps to this `cache_control` setting. +- Applies to: `litellm/copilot-claude-*` models only. Does NOT apply to GLM, GPT-4o, Gemini. diff --git a/memory/references/anthropic-prompting-best-practices.md b/memory/references/anthropic-prompting-best-practices.md new file mode 100644 index 0000000..8819e45 --- /dev/null +++ b/memory/references/anthropic-prompting-best-practices.md @@ -0,0 +1,124 @@ +# Anthropic — Prompting Best Practices (Claude-specific) + +**Source**: https://docs.anthropic.com/en/docs/build-with-claude/prompt-engineering/claude-prompting-best-practices +**Fetched**: 2026-03-05 +**Applies to**: Claude Opus 4.6, Sonnet 4.6, Haiku 4.5 + +--- + +## General Principles + +### Be Clear and Direct +- Claude responds well to clear, explicit instructions. +- Think of Claude as a brilliant but new employee — explain your norms explicitly. +- **Golden rule**: If a colleague with minimal context would be confused by the prompt, Claude will be too. +- Be specific about output format and constraints. +- Use numbered lists or bullets when order/completeness matters. + +### Add Context and Motivation +- Explaining *why* an instruction exists helps Claude generalize correctly. + - Bad: `NEVER use ellipses` + - Better: `Your response will be read aloud by TTS, so never use ellipses since TTS won't know how to pronounce them.` + +### Use Examples (Few-shot / Multishot) +- Examples are one of the most reliable ways to steer format, tone, and structure. +- 3-5 examples for best results. +- Wrap examples in `` / `` tags. +- Make examples relevant, diverse (cover edge cases), and clearly structured. + +### Structure Prompts with XML Tags +- Use XML tags to separate instructions, context, examples, and variable inputs. +- Reduces misinterpretation in complex prompts. +- Consistent, descriptive tag names (e.g., ``, ``, ``). +- Nest tags for hierarchy (e.g., `...`). + +### Give Claude a Role +- Setting a role in the system prompt focuses behavior and tone. +- Even one sentence: `You are a helpful coding assistant specializing in Python.` + +### Long Context Prompting (20K+ tokens) +- **Put longform data at the top**: Documents and inputs above queries/instructions. Can improve quality up to 30%. +- **Use XML for document metadata**: Wrap each doc in `` tags with `` and ``. +- **Ground responses in quotes**: Ask Claude to quote relevant sections before analyzing — cuts through noise. + +--- + +## Output and Formatting + +### Communication Style (Latest Models — Opus 4.6, Sonnet 4.6) +- More direct and concise than older models. +- May skip detailed summaries after tool calls (jumps directly to next action). +- If you want visibility: `After completing a task with tool use, provide a quick summary of the work.` + +### Control Response Format +1. **Tell Claude what to do, not what not to do** + - Not: `Do not use markdown` → Better: `Your response should be flowing prose paragraphs.` +2. **Use XML format indicators** + - `Write the prose sections in tags.` +3. **Match prompt style to desired output style** + - If you want no markdown, use no markdown in your prompt. +4. **For code generation output**: Ask for specific structure, include "Go beyond the basics to create a fully-featured implementation." + +### Minimize Markdown (when needed) +Put in system prompt: +```xml + +Write in clear, flowing prose using complete paragraphs. Reserve markdown for inline code, code blocks, and simple headings (##, ###). Avoid **bold** and *italics*. Do NOT use ordered/unordered lists unless presenting truly discrete items or the user explicitly asks. Incorporate items naturally into sentences instead. + +``` + +### LaTeX +Claude Opus 4.6 defaults to LaTeX for math. To disable: +``` +Format in plain text only. Do not use LaTeX, MathJax, or any markup. Write math with standard text (/, *, ^). +``` + +--- + +## Tool Use and Agentic Systems + +### Tool Definition Best Practices +- Provide clear, detailed descriptions for each tool and parameter. +- Specify exactly when the tool should and should not be used. +- Clarify which parameters are required vs. optional. +- Give examples of correct tool call patterns. + +### Agentic System Prompts +- Include explicit instructions for error handling, ambiguity, and when to ask for clarification. +- Specify how to handle tool call failures. +- Define scope boundaries — what the agent should and should not attempt. + +--- + +## Model-Specific Notes + +### Claude Opus 4.6 +- Best for: Complex multi-step reasoning, nuanced analysis, sophisticated code generation. +- Defaults to LaTeX for math (disable if needed). +- Most verbose by default — may need to prompt for conciseness. + +### Claude Sonnet 4.6 +- Best for: General-purpose tasks, coding, analysis with good speed/quality balance. +- More concise than Opus by default. +- Strong instruction following. + +### Claude Haiku 4.5 +- Best for: Simple tasks, quick responses, high-volume low-stakes workloads. +- Fastest and cheapest Claude model. +- May need tighter prompts for complex formatting. +- Use `` tags more liberally to guide behavior. + +--- + +## Relevance for Our Subagent Prompts + +### For Claude models (Haiku/Sonnet as advisors) +- Always include a role statement. +- Use XML tags to separate role, instructions, context, and topic. +- Use examples wrapped in `` tags for structured output formats. +- Static instructions → early in prompt. Variable topic → at the end. + +### For cheaper models (GLM-4.7, see separate reference) +- Need even tighter prompting — be more explicit. +- Structured output schemas > open-ended generation. +- Constrained output length. diff --git a/memory/references/openai-prompt-caching.md b/memory/references/openai-prompt-caching.md new file mode 100644 index 0000000..b4f3ce0 --- /dev/null +++ b/memory/references/openai-prompt-caching.md @@ -0,0 +1,56 @@ +# OpenAI — Prompt Caching Best Practices + +**Source**: https://platform.openai.com/docs/guides/prompt-caching +**Fetched**: 2026-03-05 + +--- + +## How It Works + +- Caching is **automatic** — no code changes required, no extra fees. +- Enabled for all prompts ≥ 1024 tokens. +- Routes requests to servers that recently processed the same prompt prefix. +- Cache hit: significantly reduced latency + lower cost. +- Cache miss: full processing, prefix cached for future requests. + +## Cache Retention Policies + +### In-memory (default) +- Available for ALL models supporting prompt caching (gpt-4o and newer). +- Cached prefixes stay active for **5-10 minutes** of inactivity, up to **1 hour max**. +- Held in volatile GPU memory. + +### Extended (24h) +- Available for: gpt-5.4, gpt-5.2, gpt-5.1, gpt-5.1-codex, gpt-5.1-codex-mini, gpt-5.1-chat-latest, gpt-5, gpt-5-codex, gpt-4.1 +- Keeps cached prefixes active up to **24 hours**. +- Offloads KV tensors to GPU-local storage when memory is full. +- Opt in per request: `"prompt_cache_retention": "24h"`. +- NOT zero-data-retention eligible (unlike in-memory). + +## What Can Be Cached +- Messages array (system, user, assistant) +- Images in user messages (must be identical, same `detail` parameter) +- Tool definitions +- Structured output schemas + +## Best Practices + +1. **Static content first, dynamic content last**: Put system prompts, instructions, examples at beginning. Variable/user content at end. +2. **Use `prompt_cache_key`**: Group requests that share common prefixes under the same key to improve routing and hit rates. +3. **Stay under 15 req/min per prefix+key**: Above this rate, overflow requests go to new machines and miss cache. +4. **Maintain steady request stream**: Cache evicts after inactivity. Regular requests keep cache warm. +5. **Monitor `cached_tokens`** in `usage.prompt_tokens_details`: Track cache hit rates. + +## Pricing +- Cache writes: same as regular input tokens (no extra cost). +- Cache reads: discounted (typically 50% of input price, varies by model). + +## Verification +Check `usage.prompt_tokens_details.cached_tokens` in responses to confirm cache is working. + +## For Our Setup (OpenClaw) +- Applies to: `litellm/copilot-gpt-*`, `litellm/gpt-*`, `litellm/o*` models. +- Automatic — no OpenClaw config needed for basic caching on GPT models. +- For 24h extended retention: need to pass `prompt_cache_retention: "24h"` in model params. +- Minimum prompt size: 1024 tokens (our system prompt easily exceeds this). +- Does NOT apply to Claude models (those use Anthropic's mechanism). diff --git a/memory/references/openai-prompting-best-practices.md b/memory/references/openai-prompting-best-practices.md new file mode 100644 index 0000000..970b7c4 --- /dev/null +++ b/memory/references/openai-prompting-best-practices.md @@ -0,0 +1,156 @@ +# OpenAI — Prompt Engineering Best Practices + +**Source**: https://platform.openai.com/docs/guides/prompt-engineering +**Source**: https://platform.openai.com/docs/guides/optimizing-llm-accuracy +**Fetched**: 2026-03-05 +**Applies to**: GPT-4.1, GPT-5, GPT-5 mini, o-series (reasoning models) + +--- + +## Model Types and When to Use Each + +| Model Type | Speed | Cost | Best For | Prompting Style | +|------------|-------|------|----------|-----------------| +| Reasoning (o3, o4-mini) | Slow | High | Complex multi-step, math, planning | Less instruction-heavy — model reasons internally | +| Large GPT (gpt-5.2, gpt-4.1) | Medium | Medium | General tasks, coding, analysis | Explicit instructions work well | +| Small GPT (gpt-5-mini, gpt-4.1-nano) | Fast | Low | Simple tasks, formatting, classification | More explicit instructions needed | + +**When in doubt**: gpt-4.1 is the recommended balance of intelligence, speed, and cost. + +**Important**: Reasoning models and GPT models need to be prompted differently: +- Reasoning models: Don't over-specify step-by-step reasoning — model handles this internally. +- GPT models: Benefit from explicit step-by-step instructions ("think through this step by step"). + +--- + +## Message Roles and Priority + +| Role | Priority | Purpose | +|------|----------|---------| +| `developer` | Highest | System rules, business logic, application-level instructions | +| `user` | Medium | End-user inputs and requests | +| `assistant` | — | Model-generated responses | + +Note: `instructions` parameter in Responses API = top-level developer message, takes priority over `input`. + +Important: `instructions` is per-request only — not carried over in conversation continuations (use message array for persistent instructions in multi-turn). + +--- + +## Core Prompt Engineering Techniques + +### 1. Write Clear Instructions +- Be explicit about desired format, length, tone, and constraints. +- Provide context — WHY the instruction matters. +- Specify what to do rather than only what not to do. +- Use numbered steps when sequence matters. + +### 2. Split Complex Tasks into Subtasks +- Complex tasks are error-prone as single prompts. +- Chain simpler prompts: classification → generation → verification. +- Intent classification → routing to specialized prompts. +- Summarize long conversations before sending to model. + +### 3. Give the Model Time to "Think" (GPT models) +- Ask the model to reason before answering: "Before answering, think through the problem step by step." +- Ask the model to check its own reasoning: "Review your answer and identify any errors." +- Ask for a chain of thought in a scratchpad before final output. + +### 4. Provide Reference Text +- Include documents, examples, or facts the model should use. +- Instruct the model to answer ONLY based on provided context. +- Ask it to quote from reference material when answering. + +### 5. Use External Tools +- Retrieval (RAG): when model lacks current or proprietary knowledge. +- Code execution: for precise math, data analysis. +- Function calling: for structured external actions. + +### 6. Test Changes Systematically +- Define eval criteria before changing prompts. +- Test on diverse samples including edge cases. +- Track performance metrics, don't rely on vibes. +- Pin to specific model snapshots (e.g., `gpt-4.1-2025-04-14`) for production. + +--- + +## Prompt Structure Best Practices + +Recommended order in `developer` message: +1. **Identity**: Purpose, communication style, high-level goals. +2. **Instructions**: Rules, what to do and not do, output format. +3. **Examples**: Few-shot examples (in `` blocks or as messages). +4. **Context/documents**: Reference material (with XML tags for clarity). +5. **Delimiters**: Use markdown headers AND XML tags to delineate sections. + +Use XML tags to separate document content from instructions: +```xml + + filename.txt + + ... + + +``` + +--- + +## LLM Optimization Framework (from Optimizing LLM Accuracy guide) + +### Two Axes of Optimization + +**Context optimization** (right information in context): +- Model lacks factual/domain knowledge → add RAG +- Knowledge is outdated → use retrieval +- Needs proprietary data → inject context + +**LLM optimization** (consistent behavior): +- Inconsistent output format → add examples (few-shot) +- Wrong tone/style → adjust system prompt +- Reasoning not followed → fine-tune + +### Optimization Ladder +1. **Start**: Simple prompt + evaluation set +2. **Add static few-shot examples** → improves consistency +3. **Add dynamic few-shot (RAG)** → improves accuracy for diverse inputs +4. **Fine-tuning** → for high-volume tasks needing consistent style/format +5. **Fact-checking step** → for accuracy on high-stakes tasks + +### Evaluation Best Practices +- Build eval set of 20+ Q&A pairs before advanced optimization. +- Metrics: ROUGE (quick), BERTScore (semantic similarity), GPT-4 as evaluator (human-like judgment). +- Separate evaluation on high-stakes "tail" queries from aggregate metrics. +- Use evals to monitor prompt performance across model upgrades. + +--- + +## Structured Outputs +- Use `response_format: json_schema` to enforce JSON output schemas. +- Eliminates format retries entirely. +- Reduces output tokens (structured output is more concise than prose). +- Works with: GPT-4.1+, GPT-5, GPT-5 mini, o-series. + +--- + +## Relevance for Our Subagent Prompts + +### For GPT models (copilot-gpt-* subagents) +- Use `developer` role for system/role instructions. +- Include few-shot examples for structured output tasks. +- Use `response_format: json_schema` for any scored/structured council output. +- For simple advisory tasks: gpt-5-mini or gpt-4.1 is appropriate. +- Reserve gpt-5.2+ for complex reasoning tasks. + +### For reasoning models (o3, o4-mini) +- Don't over-specify reasoning steps — model handles internally. +- Use for tasks requiring deep analysis or multi-step planning. +- Much slower and more expensive — use sparingly. + +### Subagent model selection cheat sheet +| Task | Recommended Model | +|------|------------------| +| Council advisors (opinion/brainstorm) | zai/glm-4.7 (free) or copilot-gpt-5-mini | +| Council referee / synthesis | copilot-claude-sonnet-4.6 | +| Code generation / review | copilot-claude-sonnet-4.6 or copilot-gpt-5.2 | +| Simple formatting / classification | zai/glm-4.7-flash or copilot-gpt-5-nano | +| Deep reasoning / architecture review | copilot-claude-opus-4.6 or o3 |