- anthropic-prompt-caching.md: KV cache mechanics, TTLs, pricing, auto vs explicit - openai-prompt-caching.md: automatic caching, in-memory vs 24h retention, prompt_cache_key - anthropic-prompting-best-practices.md: clear instructions, XML tags, few-shot, model-specific notes - openai-prompting-best-practices.md: message roles, optimization framework, structured outputs, model selection Key findings: - Anthropic caching: only for Claude models, 5m default TTL, 1h optional, 10% cost for reads - OpenAI caching: automatic/free, 5-10min default, 24h extended for GPT-5+ - GLM/ZAI models: neither caching mechanism applies - Subagent model routing table added to openai-prompting-best-practices.md
157 lines
6.3 KiB
Markdown
157 lines
6.3 KiB
Markdown
# OpenAI — Prompt Engineering Best Practices
|
|
|
|
**Source**: https://platform.openai.com/docs/guides/prompt-engineering
|
|
**Source**: https://platform.openai.com/docs/guides/optimizing-llm-accuracy
|
|
**Fetched**: 2026-03-05
|
|
**Applies to**: GPT-4.1, GPT-5, GPT-5 mini, o-series (reasoning models)
|
|
|
|
---
|
|
|
|
## Model Types and When to Use Each
|
|
|
|
| Model Type | Speed | Cost | Best For | Prompting Style |
|
|
|------------|-------|------|----------|-----------------|
|
|
| Reasoning (o3, o4-mini) | Slow | High | Complex multi-step, math, planning | Less instruction-heavy — model reasons internally |
|
|
| Large GPT (gpt-5.2, gpt-4.1) | Medium | Medium | General tasks, coding, analysis | Explicit instructions work well |
|
|
| Small GPT (gpt-5-mini, gpt-4.1-nano) | Fast | Low | Simple tasks, formatting, classification | More explicit instructions needed |
|
|
|
|
**When in doubt**: gpt-4.1 is the recommended balance of intelligence, speed, and cost.
|
|
|
|
**Important**: Reasoning models and GPT models need to be prompted differently:
|
|
- Reasoning models: Don't over-specify step-by-step reasoning — model handles this internally.
|
|
- GPT models: Benefit from explicit step-by-step instructions ("think through this step by step").
|
|
|
|
---
|
|
|
|
## Message Roles and Priority
|
|
|
|
| Role | Priority | Purpose |
|
|
|------|----------|---------|
|
|
| `developer` | Highest | System rules, business logic, application-level instructions |
|
|
| `user` | Medium | End-user inputs and requests |
|
|
| `assistant` | — | Model-generated responses |
|
|
|
|
Note: `instructions` parameter in Responses API = top-level developer message, takes priority over `input`.
|
|
|
|
Important: `instructions` is per-request only — not carried over in conversation continuations (use message array for persistent instructions in multi-turn).
|
|
|
|
---
|
|
|
|
## Core Prompt Engineering Techniques
|
|
|
|
### 1. Write Clear Instructions
|
|
- Be explicit about desired format, length, tone, and constraints.
|
|
- Provide context — WHY the instruction matters.
|
|
- Specify what to do rather than only what not to do.
|
|
- Use numbered steps when sequence matters.
|
|
|
|
### 2. Split Complex Tasks into Subtasks
|
|
- Complex tasks are error-prone as single prompts.
|
|
- Chain simpler prompts: classification → generation → verification.
|
|
- Intent classification → routing to specialized prompts.
|
|
- Summarize long conversations before sending to model.
|
|
|
|
### 3. Give the Model Time to "Think" (GPT models)
|
|
- Ask the model to reason before answering: "Before answering, think through the problem step by step."
|
|
- Ask the model to check its own reasoning: "Review your answer and identify any errors."
|
|
- Ask for a chain of thought in a scratchpad before final output.
|
|
|
|
### 4. Provide Reference Text
|
|
- Include documents, examples, or facts the model should use.
|
|
- Instruct the model to answer ONLY based on provided context.
|
|
- Ask it to quote from reference material when answering.
|
|
|
|
### 5. Use External Tools
|
|
- Retrieval (RAG): when model lacks current or proprietary knowledge.
|
|
- Code execution: for precise math, data analysis.
|
|
- Function calling: for structured external actions.
|
|
|
|
### 6. Test Changes Systematically
|
|
- Define eval criteria before changing prompts.
|
|
- Test on diverse samples including edge cases.
|
|
- Track performance metrics, don't rely on vibes.
|
|
- Pin to specific model snapshots (e.g., `gpt-4.1-2025-04-14`) for production.
|
|
|
|
---
|
|
|
|
## Prompt Structure Best Practices
|
|
|
|
Recommended order in `developer` message:
|
|
1. **Identity**: Purpose, communication style, high-level goals.
|
|
2. **Instructions**: Rules, what to do and not do, output format.
|
|
3. **Examples**: Few-shot examples (in `<example>` blocks or as messages).
|
|
4. **Context/documents**: Reference material (with XML tags for clarity).
|
|
5. **Delimiters**: Use markdown headers AND XML tags to delineate sections.
|
|
|
|
Use XML tags to separate document content from instructions:
|
|
```xml
|
|
<document>
|
|
<source>filename.txt</source>
|
|
<content>
|
|
...
|
|
</content>
|
|
</document>
|
|
```
|
|
|
|
---
|
|
|
|
## LLM Optimization Framework (from Optimizing LLM Accuracy guide)
|
|
|
|
### Two Axes of Optimization
|
|
|
|
**Context optimization** (right information in context):
|
|
- Model lacks factual/domain knowledge → add RAG
|
|
- Knowledge is outdated → use retrieval
|
|
- Needs proprietary data → inject context
|
|
|
|
**LLM optimization** (consistent behavior):
|
|
- Inconsistent output format → add examples (few-shot)
|
|
- Wrong tone/style → adjust system prompt
|
|
- Reasoning not followed → fine-tune
|
|
|
|
### Optimization Ladder
|
|
1. **Start**: Simple prompt + evaluation set
|
|
2. **Add static few-shot examples** → improves consistency
|
|
3. **Add dynamic few-shot (RAG)** → improves accuracy for diverse inputs
|
|
4. **Fine-tuning** → for high-volume tasks needing consistent style/format
|
|
5. **Fact-checking step** → for accuracy on high-stakes tasks
|
|
|
|
### Evaluation Best Practices
|
|
- Build eval set of 20+ Q&A pairs before advanced optimization.
|
|
- Metrics: ROUGE (quick), BERTScore (semantic similarity), GPT-4 as evaluator (human-like judgment).
|
|
- Separate evaluation on high-stakes "tail" queries from aggregate metrics.
|
|
- Use evals to monitor prompt performance across model upgrades.
|
|
|
|
---
|
|
|
|
## Structured Outputs
|
|
- Use `response_format: json_schema` to enforce JSON output schemas.
|
|
- Eliminates format retries entirely.
|
|
- Reduces output tokens (structured output is more concise than prose).
|
|
- Works with: GPT-4.1+, GPT-5, GPT-5 mini, o-series.
|
|
|
|
---
|
|
|
|
## Relevance for Our Subagent Prompts
|
|
|
|
### For GPT models (copilot-gpt-* subagents)
|
|
- Use `developer` role for system/role instructions.
|
|
- Include few-shot examples for structured output tasks.
|
|
- Use `response_format: json_schema` for any scored/structured council output.
|
|
- For simple advisory tasks: gpt-5-mini or gpt-4.1 is appropriate.
|
|
- Reserve gpt-5.2+ for complex reasoning tasks.
|
|
|
|
### For reasoning models (o3, o4-mini)
|
|
- Don't over-specify reasoning steps — model handles internally.
|
|
- Use for tasks requiring deep analysis or multi-step planning.
|
|
- Much slower and more expensive — use sparingly.
|
|
|
|
### Subagent model selection cheat sheet
|
|
| Task | Recommended Model |
|
|
|------|------------------|
|
|
| Council advisors (opinion/brainstorm) | zai/glm-4.7 (free) or copilot-gpt-5-mini |
|
|
| Council referee / synthesis | copilot-claude-sonnet-4.6 |
|
|
| Code generation / review | copilot-claude-sonnet-4.6 or copilot-gpt-5.2 |
|
|
| Simple formatting / classification | zai/glm-4.7-flash or copilot-gpt-5-nano |
|
|
| Deep reasoning / architecture review | copilot-claude-opus-4.6 or o3 |
|