# OpenAI — Prompt Engineering Best Practices **Source**: https://platform.openai.com/docs/guides/prompt-engineering **Source**: https://platform.openai.com/docs/guides/optimizing-llm-accuracy **Fetched**: 2026-03-05 **Applies to**: GPT-4.1, GPT-5, GPT-5 mini, o-series (reasoning models) --- ## Model Types and When to Use Each | Model Type | Speed | Cost | Best For | Prompting Style | |------------|-------|------|----------|-----------------| | Reasoning (o3, o4-mini) | Slow | High | Complex multi-step, math, planning | Less instruction-heavy — model reasons internally | | Large GPT (gpt-5.2, gpt-4.1) | Medium | Medium | General tasks, coding, analysis | Explicit instructions work well | | Small GPT (gpt-5-mini, gpt-4.1-nano) | Fast | Low | Simple tasks, formatting, classification | More explicit instructions needed | **When in doubt**: gpt-4.1 is the recommended balance of intelligence, speed, and cost. **Important**: Reasoning models and GPT models need to be prompted differently: - Reasoning models: Don't over-specify step-by-step reasoning — model handles this internally. - GPT models: Benefit from explicit step-by-step instructions ("think through this step by step"). --- ## Message Roles and Priority | Role | Priority | Purpose | |------|----------|---------| | `developer` | Highest | System rules, business logic, application-level instructions | | `user` | Medium | End-user inputs and requests | | `assistant` | — | Model-generated responses | Note: `instructions` parameter in Responses API = top-level developer message, takes priority over `input`. Important: `instructions` is per-request only — not carried over in conversation continuations (use message array for persistent instructions in multi-turn). --- ## Core Prompt Engineering Techniques ### 1. Write Clear Instructions - Be explicit about desired format, length, tone, and constraints. - Provide context — WHY the instruction matters. - Specify what to do rather than only what not to do. - Use numbered steps when sequence matters. ### 2. Split Complex Tasks into Subtasks - Complex tasks are error-prone as single prompts. - Chain simpler prompts: classification → generation → verification. - Intent classification → routing to specialized prompts. - Summarize long conversations before sending to model. ### 3. Give the Model Time to "Think" (GPT models) - Ask the model to reason before answering: "Before answering, think through the problem step by step." - Ask the model to check its own reasoning: "Review your answer and identify any errors." - Ask for a chain of thought in a scratchpad before final output. ### 4. Provide Reference Text - Include documents, examples, or facts the model should use. - Instruct the model to answer ONLY based on provided context. - Ask it to quote from reference material when answering. ### 5. Use External Tools - Retrieval (RAG): when model lacks current or proprietary knowledge. - Code execution: for precise math, data analysis. - Function calling: for structured external actions. ### 6. Test Changes Systematically - Define eval criteria before changing prompts. - Test on diverse samples including edge cases. - Track performance metrics, don't rely on vibes. - Pin to specific model snapshots (e.g., `gpt-4.1-2025-04-14`) for production. --- ## Prompt Structure Best Practices Recommended order in `developer` message: 1. **Identity**: Purpose, communication style, high-level goals. 2. **Instructions**: Rules, what to do and not do, output format. 3. **Examples**: Few-shot examples (in `` blocks or as messages). 4. **Context/documents**: Reference material (with XML tags for clarity). 5. **Delimiters**: Use markdown headers AND XML tags to delineate sections. Use XML tags to separate document content from instructions: ```xml filename.txt ... ``` --- ## LLM Optimization Framework (from Optimizing LLM Accuracy guide) ### Two Axes of Optimization **Context optimization** (right information in context): - Model lacks factual/domain knowledge → add RAG - Knowledge is outdated → use retrieval - Needs proprietary data → inject context **LLM optimization** (consistent behavior): - Inconsistent output format → add examples (few-shot) - Wrong tone/style → adjust system prompt - Reasoning not followed → fine-tune ### Optimization Ladder 1. **Start**: Simple prompt + evaluation set 2. **Add static few-shot examples** → improves consistency 3. **Add dynamic few-shot (RAG)** → improves accuracy for diverse inputs 4. **Fine-tuning** → for high-volume tasks needing consistent style/format 5. **Fact-checking step** → for accuracy on high-stakes tasks ### Evaluation Best Practices - Build eval set of 20+ Q&A pairs before advanced optimization. - Metrics: ROUGE (quick), BERTScore (semantic similarity), GPT-4 as evaluator (human-like judgment). - Separate evaluation on high-stakes "tail" queries from aggregate metrics. - Use evals to monitor prompt performance across model upgrades. --- ## Structured Outputs - Use `response_format: json_schema` to enforce JSON output schemas. - Eliminates format retries entirely. - Reduces output tokens (structured output is more concise than prose). - Works with: GPT-4.1+, GPT-5, GPT-5 mini, o-series. --- ## Relevance for Our Subagent Prompts ### For GPT models (copilot-gpt-* subagents) - Use `developer` role for system/role instructions. - Include few-shot examples for structured output tasks. - Use `response_format: json_schema` for any scored/structured council output. - For simple advisory tasks: gpt-5-mini or gpt-4.1 is appropriate. - Reserve gpt-5.2+ for complex reasoning tasks. ### For reasoning models (o3, o4-mini) - Don't over-specify reasoning steps — model handles internally. - Use for tasks requiring deep analysis or multi-step planning. - Much slower and more expensive — use sparingly. ### Subagent model selection cheat sheet | Task | Recommended Model | |------|------------------| | Council advisors (opinion/brainstorm) | zai/glm-4.7 (free) or copilot-gpt-5-mini | | Council referee / synthesis | copilot-claude-sonnet-4.6 | | Code generation / review | copilot-claude-sonnet-4.6 or copilot-gpt-5.2 | | Simple formatting / classification | zai/glm-4.7-flash or copilot-gpt-5-nano | | Deep reasoning / architecture review | copilot-claude-opus-4.6 or o3 |