docs(references): add Anthropic + OpenAI official best practices
- anthropic-prompt-caching.md: KV cache mechanics, TTLs, pricing, auto vs explicit - openai-prompt-caching.md: automatic caching, in-memory vs 24h retention, prompt_cache_key - anthropic-prompting-best-practices.md: clear instructions, XML tags, few-shot, model-specific notes - openai-prompting-best-practices.md: message roles, optimization framework, structured outputs, model selection Key findings: - Anthropic caching: only for Claude models, 5m default TTL, 1h optional, 10% cost for reads - OpenAI caching: automatic/free, 5-10min default, 24h extended for GPT-5+ - GLM/ZAI models: neither caching mechanism applies - Subagent model routing table added to openai-prompting-best-practices.md
This commit is contained in:
72
memory/references/anthropic-prompt-caching.md
Normal file
72
memory/references/anthropic-prompt-caching.md
Normal file
@@ -0,0 +1,72 @@
|
||||
# Anthropic — Prompt Caching Best Practices
|
||||
|
||||
**Source**: https://docs.anthropic.com/en/docs/build-with-claude/prompt-caching
|
||||
**Fetched**: 2026-03-05
|
||||
|
||||
---
|
||||
|
||||
## How It Works
|
||||
|
||||
1. On first request: system processes full prompt and caches the prefix once the response begins.
|
||||
2. On subsequent requests with same prefix: uses cached version (much cheaper + faster).
|
||||
3. Cache is checked against a cryptographic hash of the prefix content.
|
||||
|
||||
## Two Caching Modes
|
||||
|
||||
### Automatic Caching (recommended for multi-turn)
|
||||
Add `cache_control: {"type": "ephemeral"}` at the **top level** of the request body.
|
||||
- System automatically caches all content up to the last cacheable block.
|
||||
- Moves cache breakpoint forward as conversation grows.
|
||||
- Best for multi-turn conversations.
|
||||
|
||||
### Explicit Cache Breakpoints
|
||||
Place `cache_control` directly on individual content blocks.
|
||||
- Finer control over exactly what gets cached.
|
||||
- Use when you want to cache specific blocks (e.g., a large document) but not others.
|
||||
|
||||
## Cache Lifetimes
|
||||
|
||||
| Duration | Cost | Availability |
|
||||
|----------|------|-------------|
|
||||
| 5 minutes (default) | 1.25x base input price for write | All models |
|
||||
| 1 hour | 2x base input price for write | Available at additional cost |
|
||||
|
||||
- Cache **reads** cost 0.1x (10%) of base input price.
|
||||
- Cache is refreshed for **no additional cost** each time cached content is used.
|
||||
- Default TTL: **5 minutes** (refreshed on each use within TTL).
|
||||
|
||||
## Pricing Per Million Tokens (relevant models)
|
||||
|
||||
| Model | Base Input | 5m Write | 1h Write | Cache Read | Output |
|
||||
|-------|-----------|----------|----------|------------|--------|
|
||||
| Claude Opus 4.6 | $5 | $6.25 | $10 | $0.50 | $25 |
|
||||
| Claude Sonnet 4.6 | $3 | $3.75 | $6 | $0.30 | $15 |
|
||||
| Claude Haiku 4.5 | $1 | $1.25 | $2 | $0.10 | $5 |
|
||||
|
||||
> Note: We use Copilot subscription (flat rate), so per-token cost doesn't apply directly. But quota burn follows similar relative proportions — caching still saves quota by reducing re-processing of identical prefixes.
|
||||
|
||||
## Supported Models
|
||||
- Claude Opus 4.6, 4.5, 4.1, 4
|
||||
- Claude Sonnet 4.6, 4.5, 4
|
||||
- Claude Haiku 4.5, 3.5
|
||||
|
||||
**Not supported**: Non-Claude models (GPT, GLM, Gemini) — caching is Anthropic-only.
|
||||
|
||||
## What Gets Cached
|
||||
Prefix order: `tools` → `system` → `messages` (up to the cache breakpoint).
|
||||
|
||||
The full prefix is cached — all of tools, system, and messages up to and including the marked block.
|
||||
|
||||
## Key Best Practices
|
||||
|
||||
1. **Put static content first**: Instructions, system prompts, and background context should come before dynamic/user content.
|
||||
2. **Use 1-hour cache for long sessions**: Default 5-minute TTL means cache expires between turns if idle > 5 min. Use 1h for agents with longer gaps.
|
||||
3. **Automatic caching for multi-turn**: Simplest approach, handles the growing message history automatically.
|
||||
4. **Minimum size**: Cache only activates for content > a certain token threshold (details not specified, but system prompts qualify easily).
|
||||
5. **Privacy**: Cache stores KV representations and cryptographic hashes, NOT raw text. ZDR-compatible.
|
||||
|
||||
## For Our Setup (OpenClaw)
|
||||
- Main session system prompt is large (~15-20k tokens) and mostly static → ideal caching candidate.
|
||||
- Heartbeat turns are the same every 25-30min → if using 1h cache, heartbeats keep cache warm for free.
|
||||
- OpenClaw's `cacheRetention` config likely maps to this `cache_control` setting.
|
||||
- Applies to: `litellm/copilot-claude-*` models only. Does NOT apply to GLM, GPT-4o, Gemini.
|
||||
124
memory/references/anthropic-prompting-best-practices.md
Normal file
124
memory/references/anthropic-prompting-best-practices.md
Normal file
@@ -0,0 +1,124 @@
|
||||
# Anthropic — Prompting Best Practices (Claude-specific)
|
||||
|
||||
**Source**: https://docs.anthropic.com/en/docs/build-with-claude/prompt-engineering/claude-prompting-best-practices
|
||||
**Fetched**: 2026-03-05
|
||||
**Applies to**: Claude Opus 4.6, Sonnet 4.6, Haiku 4.5
|
||||
|
||||
---
|
||||
|
||||
## General Principles
|
||||
|
||||
### Be Clear and Direct
|
||||
- Claude responds well to clear, explicit instructions.
|
||||
- Think of Claude as a brilliant but new employee — explain your norms explicitly.
|
||||
- **Golden rule**: If a colleague with minimal context would be confused by the prompt, Claude will be too.
|
||||
- Be specific about output format and constraints.
|
||||
- Use numbered lists or bullets when order/completeness matters.
|
||||
|
||||
### Add Context and Motivation
|
||||
- Explaining *why* an instruction exists helps Claude generalize correctly.
|
||||
- Bad: `NEVER use ellipses`
|
||||
- Better: `Your response will be read aloud by TTS, so never use ellipses since TTS won't know how to pronounce them.`
|
||||
|
||||
### Use Examples (Few-shot / Multishot)
|
||||
- Examples are one of the most reliable ways to steer format, tone, and structure.
|
||||
- 3-5 examples for best results.
|
||||
- Wrap examples in `<example>` / `<examples>` tags.
|
||||
- Make examples relevant, diverse (cover edge cases), and clearly structured.
|
||||
|
||||
### Structure Prompts with XML Tags
|
||||
- Use XML tags to separate instructions, context, examples, and variable inputs.
|
||||
- Reduces misinterpretation in complex prompts.
|
||||
- Consistent, descriptive tag names (e.g., `<instructions>`, `<context>`, `<input>`).
|
||||
- Nest tags for hierarchy (e.g., `<documents><document index="1">...</document></documents>`).
|
||||
|
||||
### Give Claude a Role
|
||||
- Setting a role in the system prompt focuses behavior and tone.
|
||||
- Even one sentence: `You are a helpful coding assistant specializing in Python.`
|
||||
|
||||
### Long Context Prompting (20K+ tokens)
|
||||
- **Put longform data at the top**: Documents and inputs above queries/instructions. Can improve quality up to 30%.
|
||||
- **Use XML for document metadata**: Wrap each doc in `<document>` tags with `<source>` and `<document_content>`.
|
||||
- **Ground responses in quotes**: Ask Claude to quote relevant sections before analyzing — cuts through noise.
|
||||
|
||||
---
|
||||
|
||||
## Output and Formatting
|
||||
|
||||
### Communication Style (Latest Models — Opus 4.6, Sonnet 4.6)
|
||||
- More direct and concise than older models.
|
||||
- May skip detailed summaries after tool calls (jumps directly to next action).
|
||||
- If you want visibility: `After completing a task with tool use, provide a quick summary of the work.`
|
||||
|
||||
### Control Response Format
|
||||
1. **Tell Claude what to do, not what not to do**
|
||||
- Not: `Do not use markdown` → Better: `Your response should be flowing prose paragraphs.`
|
||||
2. **Use XML format indicators**
|
||||
- `Write the prose sections in <smoothly_flowing_prose_paragraphs> tags.`
|
||||
3. **Match prompt style to desired output style**
|
||||
- If you want no markdown, use no markdown in your prompt.
|
||||
4. **For code generation output**: Ask for specific structure, include "Go beyond the basics to create a fully-featured implementation."
|
||||
|
||||
### Minimize Markdown (when needed)
|
||||
Put in system prompt:
|
||||
```xml
|
||||
<avoid_excessive_markdown>
|
||||
Write in clear, flowing prose using complete paragraphs. Reserve markdown for inline code, code blocks, and simple headings (##, ###). Avoid **bold** and *italics*. Do NOT use ordered/unordered lists unless presenting truly discrete items or the user explicitly asks. Incorporate items naturally into sentences instead.
|
||||
</avoid_excessive_markdown>
|
||||
```
|
||||
|
||||
### LaTeX
|
||||
Claude Opus 4.6 defaults to LaTeX for math. To disable:
|
||||
```
|
||||
Format in plain text only. Do not use LaTeX, MathJax, or any markup. Write math with standard text (/, *, ^).
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Tool Use and Agentic Systems
|
||||
|
||||
### Tool Definition Best Practices
|
||||
- Provide clear, detailed descriptions for each tool and parameter.
|
||||
- Specify exactly when the tool should and should not be used.
|
||||
- Clarify which parameters are required vs. optional.
|
||||
- Give examples of correct tool call patterns.
|
||||
|
||||
### Agentic System Prompts
|
||||
- Include explicit instructions for error handling, ambiguity, and when to ask for clarification.
|
||||
- Specify how to handle tool call failures.
|
||||
- Define scope boundaries — what the agent should and should not attempt.
|
||||
|
||||
---
|
||||
|
||||
## Model-Specific Notes
|
||||
|
||||
### Claude Opus 4.6
|
||||
- Best for: Complex multi-step reasoning, nuanced analysis, sophisticated code generation.
|
||||
- Defaults to LaTeX for math (disable if needed).
|
||||
- Most verbose by default — may need to prompt for conciseness.
|
||||
|
||||
### Claude Sonnet 4.6
|
||||
- Best for: General-purpose tasks, coding, analysis with good speed/quality balance.
|
||||
- More concise than Opus by default.
|
||||
- Strong instruction following.
|
||||
|
||||
### Claude Haiku 4.5
|
||||
- Best for: Simple tasks, quick responses, high-volume low-stakes workloads.
|
||||
- Fastest and cheapest Claude model.
|
||||
- May need tighter prompts for complex formatting.
|
||||
- Use `<example>` tags more liberally to guide behavior.
|
||||
|
||||
---
|
||||
|
||||
## Relevance for Our Subagent Prompts
|
||||
|
||||
### For Claude models (Haiku/Sonnet as advisors)
|
||||
- Always include a role statement.
|
||||
- Use XML tags to separate role, instructions, context, and topic.
|
||||
- Use examples wrapped in `<example>` tags for structured output formats.
|
||||
- Static instructions → early in prompt. Variable topic → at the end.
|
||||
|
||||
### For cheaper models (GLM-4.7, see separate reference)
|
||||
- Need even tighter prompting — be more explicit.
|
||||
- Structured output schemas > open-ended generation.
|
||||
- Constrained output length.
|
||||
56
memory/references/openai-prompt-caching.md
Normal file
56
memory/references/openai-prompt-caching.md
Normal file
@@ -0,0 +1,56 @@
|
||||
# OpenAI — Prompt Caching Best Practices
|
||||
|
||||
**Source**: https://platform.openai.com/docs/guides/prompt-caching
|
||||
**Fetched**: 2026-03-05
|
||||
|
||||
---
|
||||
|
||||
## How It Works
|
||||
|
||||
- Caching is **automatic** — no code changes required, no extra fees.
|
||||
- Enabled for all prompts ≥ 1024 tokens.
|
||||
- Routes requests to servers that recently processed the same prompt prefix.
|
||||
- Cache hit: significantly reduced latency + lower cost.
|
||||
- Cache miss: full processing, prefix cached for future requests.
|
||||
|
||||
## Cache Retention Policies
|
||||
|
||||
### In-memory (default)
|
||||
- Available for ALL models supporting prompt caching (gpt-4o and newer).
|
||||
- Cached prefixes stay active for **5-10 minutes** of inactivity, up to **1 hour max**.
|
||||
- Held in volatile GPU memory.
|
||||
|
||||
### Extended (24h)
|
||||
- Available for: gpt-5.4, gpt-5.2, gpt-5.1, gpt-5.1-codex, gpt-5.1-codex-mini, gpt-5.1-chat-latest, gpt-5, gpt-5-codex, gpt-4.1
|
||||
- Keeps cached prefixes active up to **24 hours**.
|
||||
- Offloads KV tensors to GPU-local storage when memory is full.
|
||||
- Opt in per request: `"prompt_cache_retention": "24h"`.
|
||||
- NOT zero-data-retention eligible (unlike in-memory).
|
||||
|
||||
## What Can Be Cached
|
||||
- Messages array (system, user, assistant)
|
||||
- Images in user messages (must be identical, same `detail` parameter)
|
||||
- Tool definitions
|
||||
- Structured output schemas
|
||||
|
||||
## Best Practices
|
||||
|
||||
1. **Static content first, dynamic content last**: Put system prompts, instructions, examples at beginning. Variable/user content at end.
|
||||
2. **Use `prompt_cache_key`**: Group requests that share common prefixes under the same key to improve routing and hit rates.
|
||||
3. **Stay under 15 req/min per prefix+key**: Above this rate, overflow requests go to new machines and miss cache.
|
||||
4. **Maintain steady request stream**: Cache evicts after inactivity. Regular requests keep cache warm.
|
||||
5. **Monitor `cached_tokens`** in `usage.prompt_tokens_details`: Track cache hit rates.
|
||||
|
||||
## Pricing
|
||||
- Cache writes: same as regular input tokens (no extra cost).
|
||||
- Cache reads: discounted (typically 50% of input price, varies by model).
|
||||
|
||||
## Verification
|
||||
Check `usage.prompt_tokens_details.cached_tokens` in responses to confirm cache is working.
|
||||
|
||||
## For Our Setup (OpenClaw)
|
||||
- Applies to: `litellm/copilot-gpt-*`, `litellm/gpt-*`, `litellm/o*` models.
|
||||
- Automatic — no OpenClaw config needed for basic caching on GPT models.
|
||||
- For 24h extended retention: need to pass `prompt_cache_retention: "24h"` in model params.
|
||||
- Minimum prompt size: 1024 tokens (our system prompt easily exceeds this).
|
||||
- Does NOT apply to Claude models (those use Anthropic's mechanism).
|
||||
156
memory/references/openai-prompting-best-practices.md
Normal file
156
memory/references/openai-prompting-best-practices.md
Normal file
@@ -0,0 +1,156 @@
|
||||
# OpenAI — Prompt Engineering Best Practices
|
||||
|
||||
**Source**: https://platform.openai.com/docs/guides/prompt-engineering
|
||||
**Source**: https://platform.openai.com/docs/guides/optimizing-llm-accuracy
|
||||
**Fetched**: 2026-03-05
|
||||
**Applies to**: GPT-4.1, GPT-5, GPT-5 mini, o-series (reasoning models)
|
||||
|
||||
---
|
||||
|
||||
## Model Types and When to Use Each
|
||||
|
||||
| Model Type | Speed | Cost | Best For | Prompting Style |
|
||||
|------------|-------|------|----------|-----------------|
|
||||
| Reasoning (o3, o4-mini) | Slow | High | Complex multi-step, math, planning | Less instruction-heavy — model reasons internally |
|
||||
| Large GPT (gpt-5.2, gpt-4.1) | Medium | Medium | General tasks, coding, analysis | Explicit instructions work well |
|
||||
| Small GPT (gpt-5-mini, gpt-4.1-nano) | Fast | Low | Simple tasks, formatting, classification | More explicit instructions needed |
|
||||
|
||||
**When in doubt**: gpt-4.1 is the recommended balance of intelligence, speed, and cost.
|
||||
|
||||
**Important**: Reasoning models and GPT models need to be prompted differently:
|
||||
- Reasoning models: Don't over-specify step-by-step reasoning — model handles this internally.
|
||||
- GPT models: Benefit from explicit step-by-step instructions ("think through this step by step").
|
||||
|
||||
---
|
||||
|
||||
## Message Roles and Priority
|
||||
|
||||
| Role | Priority | Purpose |
|
||||
|------|----------|---------|
|
||||
| `developer` | Highest | System rules, business logic, application-level instructions |
|
||||
| `user` | Medium | End-user inputs and requests |
|
||||
| `assistant` | — | Model-generated responses |
|
||||
|
||||
Note: `instructions` parameter in Responses API = top-level developer message, takes priority over `input`.
|
||||
|
||||
Important: `instructions` is per-request only — not carried over in conversation continuations (use message array for persistent instructions in multi-turn).
|
||||
|
||||
---
|
||||
|
||||
## Core Prompt Engineering Techniques
|
||||
|
||||
### 1. Write Clear Instructions
|
||||
- Be explicit about desired format, length, tone, and constraints.
|
||||
- Provide context — WHY the instruction matters.
|
||||
- Specify what to do rather than only what not to do.
|
||||
- Use numbered steps when sequence matters.
|
||||
|
||||
### 2. Split Complex Tasks into Subtasks
|
||||
- Complex tasks are error-prone as single prompts.
|
||||
- Chain simpler prompts: classification → generation → verification.
|
||||
- Intent classification → routing to specialized prompts.
|
||||
- Summarize long conversations before sending to model.
|
||||
|
||||
### 3. Give the Model Time to "Think" (GPT models)
|
||||
- Ask the model to reason before answering: "Before answering, think through the problem step by step."
|
||||
- Ask the model to check its own reasoning: "Review your answer and identify any errors."
|
||||
- Ask for a chain of thought in a scratchpad before final output.
|
||||
|
||||
### 4. Provide Reference Text
|
||||
- Include documents, examples, or facts the model should use.
|
||||
- Instruct the model to answer ONLY based on provided context.
|
||||
- Ask it to quote from reference material when answering.
|
||||
|
||||
### 5. Use External Tools
|
||||
- Retrieval (RAG): when model lacks current or proprietary knowledge.
|
||||
- Code execution: for precise math, data analysis.
|
||||
- Function calling: for structured external actions.
|
||||
|
||||
### 6. Test Changes Systematically
|
||||
- Define eval criteria before changing prompts.
|
||||
- Test on diverse samples including edge cases.
|
||||
- Track performance metrics, don't rely on vibes.
|
||||
- Pin to specific model snapshots (e.g., `gpt-4.1-2025-04-14`) for production.
|
||||
|
||||
---
|
||||
|
||||
## Prompt Structure Best Practices
|
||||
|
||||
Recommended order in `developer` message:
|
||||
1. **Identity**: Purpose, communication style, high-level goals.
|
||||
2. **Instructions**: Rules, what to do and not do, output format.
|
||||
3. **Examples**: Few-shot examples (in `<example>` blocks or as messages).
|
||||
4. **Context/documents**: Reference material (with XML tags for clarity).
|
||||
5. **Delimiters**: Use markdown headers AND XML tags to delineate sections.
|
||||
|
||||
Use XML tags to separate document content from instructions:
|
||||
```xml
|
||||
<document>
|
||||
<source>filename.txt</source>
|
||||
<content>
|
||||
...
|
||||
</content>
|
||||
</document>
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## LLM Optimization Framework (from Optimizing LLM Accuracy guide)
|
||||
|
||||
### Two Axes of Optimization
|
||||
|
||||
**Context optimization** (right information in context):
|
||||
- Model lacks factual/domain knowledge → add RAG
|
||||
- Knowledge is outdated → use retrieval
|
||||
- Needs proprietary data → inject context
|
||||
|
||||
**LLM optimization** (consistent behavior):
|
||||
- Inconsistent output format → add examples (few-shot)
|
||||
- Wrong tone/style → adjust system prompt
|
||||
- Reasoning not followed → fine-tune
|
||||
|
||||
### Optimization Ladder
|
||||
1. **Start**: Simple prompt + evaluation set
|
||||
2. **Add static few-shot examples** → improves consistency
|
||||
3. **Add dynamic few-shot (RAG)** → improves accuracy for diverse inputs
|
||||
4. **Fine-tuning** → for high-volume tasks needing consistent style/format
|
||||
5. **Fact-checking step** → for accuracy on high-stakes tasks
|
||||
|
||||
### Evaluation Best Practices
|
||||
- Build eval set of 20+ Q&A pairs before advanced optimization.
|
||||
- Metrics: ROUGE (quick), BERTScore (semantic similarity), GPT-4 as evaluator (human-like judgment).
|
||||
- Separate evaluation on high-stakes "tail" queries from aggregate metrics.
|
||||
- Use evals to monitor prompt performance across model upgrades.
|
||||
|
||||
---
|
||||
|
||||
## Structured Outputs
|
||||
- Use `response_format: json_schema` to enforce JSON output schemas.
|
||||
- Eliminates format retries entirely.
|
||||
- Reduces output tokens (structured output is more concise than prose).
|
||||
- Works with: GPT-4.1+, GPT-5, GPT-5 mini, o-series.
|
||||
|
||||
---
|
||||
|
||||
## Relevance for Our Subagent Prompts
|
||||
|
||||
### For GPT models (copilot-gpt-* subagents)
|
||||
- Use `developer` role for system/role instructions.
|
||||
- Include few-shot examples for structured output tasks.
|
||||
- Use `response_format: json_schema` for any scored/structured council output.
|
||||
- For simple advisory tasks: gpt-5-mini or gpt-4.1 is appropriate.
|
||||
- Reserve gpt-5.2+ for complex reasoning tasks.
|
||||
|
||||
### For reasoning models (o3, o4-mini)
|
||||
- Don't over-specify reasoning steps — model handles internally.
|
||||
- Use for tasks requiring deep analysis or multi-step planning.
|
||||
- Much slower and more expensive — use sparingly.
|
||||
|
||||
### Subagent model selection cheat sheet
|
||||
| Task | Recommended Model |
|
||||
|------|------------------|
|
||||
| Council advisors (opinion/brainstorm) | zai/glm-4.7 (free) or copilot-gpt-5-mini |
|
||||
| Council referee / synthesis | copilot-claude-sonnet-4.6 |
|
||||
| Code generation / review | copilot-claude-sonnet-4.6 or copilot-gpt-5.2 |
|
||||
| Simple formatting / classification | zai/glm-4.7-flash or copilot-gpt-5-nano |
|
||||
| Deep reasoning / architecture review | copilot-claude-opus-4.6 or o3 |
|
||||
Reference in New Issue
Block a user