Files
swarm-zap/memory/references/openai-prompting-best-practices.md
zap 79e61f4528 docs(references): add Anthropic + OpenAI official best practices
- anthropic-prompt-caching.md: KV cache mechanics, TTLs, pricing, auto vs explicit
- openai-prompt-caching.md: automatic caching, in-memory vs 24h retention, prompt_cache_key
- anthropic-prompting-best-practices.md: clear instructions, XML tags, few-shot, model-specific notes
- openai-prompting-best-practices.md: message roles, optimization framework, structured outputs, model selection

Key findings:
- Anthropic caching: only for Claude models, 5m default TTL, 1h optional, 10% cost for reads
- OpenAI caching: automatic/free, 5-10min default, 24h extended for GPT-5+
- GLM/ZAI models: neither caching mechanism applies
- Subagent model routing table added to openai-prompting-best-practices.md
2026-03-05 20:34:38 +00:00

6.3 KiB

OpenAI — Prompt Engineering Best Practices

Source: https://platform.openai.com/docs/guides/prompt-engineering Source: https://platform.openai.com/docs/guides/optimizing-llm-accuracy Fetched: 2026-03-05 Applies to: GPT-4.1, GPT-5, GPT-5 mini, o-series (reasoning models)


Model Types and When to Use Each

Model Type Speed Cost Best For Prompting Style
Reasoning (o3, o4-mini) Slow High Complex multi-step, math, planning Less instruction-heavy — model reasons internally
Large GPT (gpt-5.2, gpt-4.1) Medium Medium General tasks, coding, analysis Explicit instructions work well
Small GPT (gpt-5-mini, gpt-4.1-nano) Fast Low Simple tasks, formatting, classification More explicit instructions needed

When in doubt: gpt-4.1 is the recommended balance of intelligence, speed, and cost.

Important: Reasoning models and GPT models need to be prompted differently:

  • Reasoning models: Don't over-specify step-by-step reasoning — model handles this internally.
  • GPT models: Benefit from explicit step-by-step instructions ("think through this step by step").

Message Roles and Priority

Role Priority Purpose
developer Highest System rules, business logic, application-level instructions
user Medium End-user inputs and requests
assistant Model-generated responses

Note: instructions parameter in Responses API = top-level developer message, takes priority over input.

Important: instructions is per-request only — not carried over in conversation continuations (use message array for persistent instructions in multi-turn).


Core Prompt Engineering Techniques

1. Write Clear Instructions

  • Be explicit about desired format, length, tone, and constraints.
  • Provide context — WHY the instruction matters.
  • Specify what to do rather than only what not to do.
  • Use numbered steps when sequence matters.

2. Split Complex Tasks into Subtasks

  • Complex tasks are error-prone as single prompts.
  • Chain simpler prompts: classification → generation → verification.
  • Intent classification → routing to specialized prompts.
  • Summarize long conversations before sending to model.

3. Give the Model Time to "Think" (GPT models)

  • Ask the model to reason before answering: "Before answering, think through the problem step by step."
  • Ask the model to check its own reasoning: "Review your answer and identify any errors."
  • Ask for a chain of thought in a scratchpad before final output.

4. Provide Reference Text

  • Include documents, examples, or facts the model should use.
  • Instruct the model to answer ONLY based on provided context.
  • Ask it to quote from reference material when answering.

5. Use External Tools

  • Retrieval (RAG): when model lacks current or proprietary knowledge.
  • Code execution: for precise math, data analysis.
  • Function calling: for structured external actions.

6. Test Changes Systematically

  • Define eval criteria before changing prompts.
  • Test on diverse samples including edge cases.
  • Track performance metrics, don't rely on vibes.
  • Pin to specific model snapshots (e.g., gpt-4.1-2025-04-14) for production.

Prompt Structure Best Practices

Recommended order in developer message:

  1. Identity: Purpose, communication style, high-level goals.
  2. Instructions: Rules, what to do and not do, output format.
  3. Examples: Few-shot examples (in <example> blocks or as messages).
  4. Context/documents: Reference material (with XML tags for clarity).
  5. Delimiters: Use markdown headers AND XML tags to delineate sections.

Use XML tags to separate document content from instructions:

<document>
  <source>filename.txt</source>
  <content>
    ...
  </content>
</document>

LLM Optimization Framework (from Optimizing LLM Accuracy guide)

Two Axes of Optimization

Context optimization (right information in context):

  • Model lacks factual/domain knowledge → add RAG
  • Knowledge is outdated → use retrieval
  • Needs proprietary data → inject context

LLM optimization (consistent behavior):

  • Inconsistent output format → add examples (few-shot)
  • Wrong tone/style → adjust system prompt
  • Reasoning not followed → fine-tune

Optimization Ladder

  1. Start: Simple prompt + evaluation set
  2. Add static few-shot examples → improves consistency
  3. Add dynamic few-shot (RAG) → improves accuracy for diverse inputs
  4. Fine-tuning → for high-volume tasks needing consistent style/format
  5. Fact-checking step → for accuracy on high-stakes tasks

Evaluation Best Practices

  • Build eval set of 20+ Q&A pairs before advanced optimization.
  • Metrics: ROUGE (quick), BERTScore (semantic similarity), GPT-4 as evaluator (human-like judgment).
  • Separate evaluation on high-stakes "tail" queries from aggregate metrics.
  • Use evals to monitor prompt performance across model upgrades.

Structured Outputs

  • Use response_format: json_schema to enforce JSON output schemas.
  • Eliminates format retries entirely.
  • Reduces output tokens (structured output is more concise than prose).
  • Works with: GPT-4.1+, GPT-5, GPT-5 mini, o-series.

Relevance for Our Subagent Prompts

For GPT models (copilot-gpt-* subagents)

  • Use developer role for system/role instructions.
  • Include few-shot examples for structured output tasks.
  • Use response_format: json_schema for any scored/structured council output.
  • For simple advisory tasks: gpt-5-mini or gpt-4.1 is appropriate.
  • Reserve gpt-5.2+ for complex reasoning tasks.

For reasoning models (o3, o4-mini)

  • Don't over-specify reasoning steps — model handles internally.
  • Use for tasks requiring deep analysis or multi-step planning.
  • Much slower and more expensive — use sparingly.

Subagent model selection cheat sheet

Task Recommended Model
Council advisors (opinion/brainstorm) zai/glm-4.7 (free) or copilot-gpt-5-mini
Council referee / synthesis copilot-claude-sonnet-4.6
Code generation / review copilot-claude-sonnet-4.6 or copilot-gpt-5.2
Simple formatting / classification zai/glm-4.7-flash or copilot-gpt-5-nano
Deep reasoning / architecture review copilot-claude-opus-4.6 or o3