Files

zap 79e61f4528 docs(references): add Anthropic + OpenAI official best practices

- anthropic-prompt-caching.md: KV cache mechanics, TTLs, pricing, auto vs explicit
- openai-prompt-caching.md: automatic caching, in-memory vs 24h retention, prompt_cache_key
- anthropic-prompting-best-practices.md: clear instructions, XML tags, few-shot, model-specific notes
- openai-prompting-best-practices.md: message roles, optimization framework, structured outputs, model selection

Key findings:
- Anthropic caching: only for Claude models, 5m default TTL, 1h optional, 10% cost for reads
- OpenAI caching: automatic/free, 5-10min default, 24h extended for GPT-5+
- GLM/ZAI models: neither caching mechanism applies
- Subagent model routing table added to openai-prompting-best-practices.md

2026-03-05 20:34:38 +00:00

6.3 KiB

Raw Blame History

OpenAI — Prompt Engineering Best Practices

Source: https://platform.openai.com/docs/guides/prompt-engineering Source: https://platform.openai.com/docs/guides/optimizing-llm-accuracy Fetched: 2026-03-05 Applies to: GPT-4.1, GPT-5, GPT-5 mini, o-series (reasoning models)

Model Types and When to Use Each

Model Type	Speed	Cost	Best For	Prompting Style
Reasoning (o3, o4-mini)	Slow	High	Complex multi-step, math, planning	Less instruction-heavy — model reasons internally
Large GPT (gpt-5.2, gpt-4.1)	Medium	Medium	General tasks, coding, analysis	Explicit instructions work well
Small GPT (gpt-5-mini, gpt-4.1-nano)	Fast	Low	Simple tasks, formatting, classification	More explicit instructions needed

When in doubt: gpt-4.1 is the recommended balance of intelligence, speed, and cost.

Important: Reasoning models and GPT models need to be prompted differently:

Reasoning models: Don't over-specify step-by-step reasoning — model handles this internally.
GPT models: Benefit from explicit step-by-step instructions ("think through this step by step").

Message Roles and Priority

Role	Priority	Purpose
`developer`	Highest	System rules, business logic, application-level instructions
`user`	Medium	End-user inputs and requests
`assistant`	—	Model-generated responses

Note: instructions parameter in Responses API = top-level developer message, takes priority over input.

Important: instructions is per-request only — not carried over in conversation continuations (use message array for persistent instructions in multi-turn).

Core Prompt Engineering Techniques

1. Write Clear Instructions

Be explicit about desired format, length, tone, and constraints.
Provide context — WHY the instruction matters.
Specify what to do rather than only what not to do.
Use numbered steps when sequence matters.

2. Split Complex Tasks into Subtasks

Complex tasks are error-prone as single prompts.
Chain simpler prompts: classification → generation → verification.
Intent classification → routing to specialized prompts.
Summarize long conversations before sending to model.

3. Give the Model Time to "Think" (GPT models)

Ask the model to reason before answering: "Before answering, think through the problem step by step."
Ask the model to check its own reasoning: "Review your answer and identify any errors."
Ask for a chain of thought in a scratchpad before final output.

4. Provide Reference Text

Include documents, examples, or facts the model should use.
Instruct the model to answer ONLY based on provided context.
Ask it to quote from reference material when answering.

5. Use External Tools

Retrieval (RAG): when model lacks current or proprietary knowledge.
Code execution: for precise math, data analysis.
Function calling: for structured external actions.

6. Test Changes Systematically

Define eval criteria before changing prompts.
Test on diverse samples including edge cases.
Track performance metrics, don't rely on vibes.
Pin to specific model snapshots (e.g., gpt-4.1-2025-04-14) for production.

Prompt Structure Best Practices

Recommended order in developer message:

Identity: Purpose, communication style, high-level goals.
Instructions: Rules, what to do and not do, output format.
Examples: Few-shot examples (in <example> blocks or as messages).
Context/documents: Reference material (with XML tags for clarity).
Delimiters: Use markdown headers AND XML tags to delineate sections.

Use XML tags to separate document content from instructions:

<document>
  <source>filename.txt</source>
  <content>
    ...
  </content>
</document>

LLM Optimization Framework (from Optimizing LLM Accuracy guide)

Two Axes of Optimization

Context optimization (right information in context):

Model lacks factual/domain knowledge → add RAG
Knowledge is outdated → use retrieval
Needs proprietary data → inject context

LLM optimization (consistent behavior):

Inconsistent output format → add examples (few-shot)
Wrong tone/style → adjust system prompt
Reasoning not followed → fine-tune

Optimization Ladder

Start: Simple prompt + evaluation set
Add static few-shot examples → improves consistency
Add dynamic few-shot (RAG) → improves accuracy for diverse inputs
Fine-tuning → for high-volume tasks needing consistent style/format
Fact-checking step → for accuracy on high-stakes tasks

Evaluation Best Practices

Build eval set of 20+ Q&A pairs before advanced optimization.
Metrics: ROUGE (quick), BERTScore (semantic similarity), GPT-4 as evaluator (human-like judgment).
Separate evaluation on high-stakes "tail" queries from aggregate metrics.
Use evals to monitor prompt performance across model upgrades.

Structured Outputs

Use response_format: json_schema to enforce JSON output schemas.
Eliminates format retries entirely.
Reduces output tokens (structured output is more concise than prose).
Works with: GPT-4.1+, GPT-5, GPT-5 mini, o-series.

Relevance for Our Subagent Prompts

For GPT models (copilot-gpt-* subagents)

Use developer role for system/role instructions.
Include few-shot examples for structured output tasks.
Use response_format: json_schema for any scored/structured council output.
For simple advisory tasks: gpt-5-mini or gpt-4.1 is appropriate.
Reserve gpt-5.2+ for complex reasoning tasks.

For reasoning models (o3, o4-mini)

Don't over-specify reasoning steps — model handles internally.
Use for tasks requiring deep analysis or multi-step planning.
Much slower and more expensive — use sparingly.

Subagent model selection cheat sheet

Task	Recommended Model
Council advisors (opinion/brainstorm)	zai/glm-4.7 (free) or copilot-gpt-5-mini
Council referee / synthesis	copilot-claude-sonnet-4.6
Code generation / review	copilot-claude-sonnet-4.6 or copilot-gpt-5.2
Simple formatting / classification	zai/glm-4.7-flash or copilot-gpt-5-nano
Deep reasoning / architecture review	copilot-claude-opus-4.6 or o3

6.3 KiB Raw Blame History