docs(cost): update plan with corrections from official docs

- Phase 1: clarify cacheRetention only applies to Claude models; GPT auto-caches; GLM has none
- Phase 1: add TTL reality check (short=5min, long=1h) and implications for heartbeat timing
- Phase 2: explain why long TTL + 25m heartbeat is the right combo
- Phase 4: replace generic prompt tips with model-specific guidance from official Anthropic/OpenAI docs
- Added prompt structure notes for cache efficiency, GLM-4.7 tighter prompting requirements
- References: memory/references/*.md
This commit is contained in:
zap
2026-03-05 20:37:32 +00:00
parent 79e61f4528
commit 23782735a1


@@ -33,6 +33,20 @@
**Why**: Our system prompt (AGENTS.md + SOUL.md + USER.md + TOOLS.md + IDENTITY.md + HEARTBEAT.md + skills list) is large and mostly static. Without caching, every turn reprocesses ~15-20k tokens of identical prefix. With caching, subsequent turns pay ~10% for cached tokens (Anthropic pricing).
**Applies only to Claude-backed models.** GPT, GLM, and Gemini models do NOT support Anthropic's `cacheRetention` mechanism. OpenAI caching is automatic (no config needed). ZAI/GLM has no caching mechanism.
**Cache TTL reality check** (from official Anthropic docs):
- `short` = **5-minute TTL** (the `cacheRetention` default; refreshed on each use within the window)
- `long` = **1-hour TTL** (at higher write cost: 2x base input price vs 1.25x for short)
- Cache reads cost 0.1x (10%) of base input — so a cache hit on a 15k-token system prompt costs 90% less
- First turn writes cache (slightly more expensive), subsequent turns read it (very cheap)
**Implication**: With 5-minute default TTL and a 30-minute heartbeat, the cache expires between every heartbeat. Either:
1. Use `long` (1h TTL) and keep the heartbeat under 1h (e.g. 55m at most) so the cache stays warm — best for cost savings
2. Use `short` (5m TTL) with no heartbeat adjustment — cache only helps within active bursts
**Recommendation**: Use `long` on main session Claude models + 25m heartbeat (well within 1h). Cache writes are slightly more expensive but the read savings dominate for any active session.
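The arithmetic behind that recommendation can be sketched quickly. A minimal cost model, assuming an illustrative $3/MTok base input price (not an official rate) and the multipliers cited above (2x for 1h cache writes, 0.1x for reads):

```python
# Illustrative cost sketch for a 15k-token static prefix (assumed $3/MTok base price).
base = 3.00 / 1_000_000   # assumed base input price per token
prefix = 15_000           # static system-prompt tokens

no_cache = prefix * base            # reprocess full prefix every turn
write_long = prefix * base * 2.0    # first turn: 1h-cache write at 2x
read_hit = prefix * base * 0.1      # later turns: cache read at 0.1x

# Break-even: how many turns before 'long' caching beats no caching.
turns = 2
while write_long + (turns - 1) * read_hit >= turns * no_cache:
    turns += 1
print(f"no-cache/turn=${no_cache:.4f}, read-hit/turn=${read_hit:.4f}, "
      f"break-even at {turns} turns")
```

With these assumed numbers, the 2x write surcharge is recovered by the third turn, after which every turn costs ~10% of the uncached price — which is why the read savings dominate for any active session.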
**Config change** (`~/.openclaw/openclaw.json`):
```json
{
@@ -70,6 +84,8 @@
}
```
Note: No config needed for GPT models — OpenAI caches automatically on prompts ≥1024 tokens (no surcharge to enable; cached tokens are billed at a discount). No config available for ZAI/GLM models (no caching support).
**Verification**:
1. After applying, check `/status` or `/usage full` for `cacheRead` vs `cacheWrite` tokens.
2. Enable cache trace diagnostics temporarily:
@@ -87,9 +103,11 @@
## Phase 2: Heartbeat Cache Warming
**What**: Align heartbeat interval to keep the prompt cache warm across idle gaps.
**What**: Align heartbeat interval to keep the 1-hour prompt cache warm across idle gaps.
**Why**: Anthropic's `long` cache retention is ~1 hour TTL. Our current heartbeat is 30m, which is already well under the TTL — good. But we should ensure the heartbeat is a lightweight keep-warm that doesn't generate expensive cache writes.
**Why**: With `cacheRetention: "long"` (1h TTL), the cache expires after 1 hour of no activity. A heartbeat just under 1h ensures the cache is touched before it expires, so the next real interaction reads from cache instead of rewriting it. Our current 30m heartbeat already works, but 25m gives a safety margin.
**Important**: Heartbeat keep-warm only applies to **Claude models**. GPT/GLM models don't benefit — their caching is either automatic (OpenAI) or non-existent (ZAI).
**Config change** (`~/.openclaw/openclaw.json`):
```json
@@ -223,7 +241,20 @@ Encode these in AGENTS.md or a workspace convention file so all future subagent
- Final synthesis / referee / meta-arbiter roles
- Tasks where the user explicitly asked for highest quality
#### 4c. Quality Verification Strategy
### Subagent Model Notes
- When spawning subagents for Claude tasks, caching applies if the subagent model is also a Claude model. But subagents are typically short-lived (single-turn `mode=run`), so caching benefit is minimal — they don't accumulate conversation history.
- The main caching win is in the **main session**, which has a large, growing context across many turns.
- For GLM-4.7 subagents: no caching benefit, but no cost either ($0 model). Prompts must be self-contained and tightly framed.
- For GPT subagents: OpenAI caches automatically if prompt ≥1024 tokens, no action needed.
### Prompt Structure for Maximum Cache Efficiency (Claude models)
Per official Anthropic best practices:
- **Static first, dynamic last**: System prompt, role, instructions → then topic/task (dynamic part).
- This structure is what OpenClaw already does (system prompt built once, user message varies).
- OpenClaw handles the `cache_control` injection automatically via `cacheRetention` config.
- Our prompts already follow this structure correctly.
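The static-first rule can be illustrated with a minimal sketch (a hypothetical helper, not OpenClaw's actual internals): the cacheable prefix must be byte-identical across turns, so everything that varies goes after the static block.

```python
# Hypothetical static-first message assembly (not OpenClaw code).
STATIC_SYSTEM = "AGENTS + SOUL + USER + TOOLS ..."  # built once, identical every turn

def build_messages(user_turn: str) -> list[dict]:
    # Static prefix first: every turn shares the same cacheable prefix;
    # only the trailing user message varies.
    return [
        {"role": "system", "content": STATIC_SYSTEM},
        {"role": "user", "content": user_turn},
    ]

m1 = build_messages("topic A")
m2 = build_messages("topic B")
assert m1[0] == m2[0]  # the shared prefix is what the cache keys on
```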
Before switching council and subagents to GLM-4.7, run a quality comparison:
@@ -236,15 +267,32 @@ Before switching council and subagents to GLM-4.7, run a quality comparison:
- Does it reference specific aspects of the topic? (1-5)
4. **Minimum quality bar**: If GLM-4.7 scores ≥3.5/5 average on the rubric, it's good enough for advisor roles. Referee always stays on Sonnet+.
#### 4d. Prompt Engineering for Cheaper Models
#### 4d. Prompt Engineering Per Model — Official Best Practices
Cheaper models need tighter prompts to maintain quality. Key techniques:
From official Anthropic and OpenAI docs (see `memory/references/`):
- **Be more explicit about output format**: Include examples, not just descriptions
- **Constrain output length more tightly**: "Respond in exactly 3 paragraphs" vs "200-400 words"
- **Use structured output requests**: Ask for numbered lists, specific headers
- **Front-load the most important instruction**: Put the role and constraint first, context second
- **Include a quality check instruction**: "Before responding, verify your output matches the requested format"
**For Claude models (Haiku, Sonnet — subagent advisors)**
- Give an explicit role in the system prompt: `You are the Skeptic advisor on a council...`
- Use XML tags to separate role, instructions, context, topic: `<instructions>`, `<context>`, `<topic>`
- Put static instructions first, variable topic at the end (maximizes cache hit rate on repeated spawns)
- 3-5 `<example>` tags for structured output formats
- Use "tell what to do" not "don't do X": `Write in flowing prose` not `Don't use bullet points`
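Putting the Claude guidance together, a sketch of what such an advisor prompt template might look like (the role text and tag names here are illustrative, not a prescribed format): static instructions first, variable `<topic>` last, so repeated spawns share a cacheable prefix.

```python
# Hypothetical Claude subagent prompt template following the guidance above.
ADVISOR_PROMPT = """\
You are the Skeptic advisor on a council.

<instructions>
Challenge the strongest claims. Write in flowing prose.
Respond in exactly 3 paragraphs.
</instructions>

<context>
{context}
</context>

<topic>
{topic}
</topic>
"""

def render(context: str, topic: str) -> str:
    # Only the trailing <context>/<topic> sections vary between spawns.
    return ADVISOR_PROMPT.format(context=context, topic=topic)
```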
**For GLM-4.7 (free tier subagents)**
- Be MORE explicit than with Claude — GLM needs tighter constraints
- Constrain output length tightly: "Respond in exactly 3 paragraphs" not "200-400 words"
- Use numbered lists or explicit section headers in the prompt
- Front-load the most critical instruction (role + constraint first, context second)
- Include a format-check reminder: "Before responding, verify your output matches the format above"
- Request structured output over open-ended generation when possible
- Avoid complex multi-step reasoning chains — GLM handles simpler, well-defined tasks best
**For GPT models (gpt-5-mini, gpt-4.1 subagents)**
- Include explicit step-by-step instructions (GPT benefits from "think step by step" guidance)
- Use `response_format: json_schema` for any scored/structured output — eliminates format retries entirely
- Use `developer` role for system/role instructions (higher priority than `user`)
- Don't over-specify for reasoning models (o3, o4-mini) — they reason internally
- Pin to specific model snapshots if quality consistency matters (`gpt-4.1-2025-04-14`)
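The GPT points above can be combined into one request shape. A sketch of a scored-output subagent payload (field names follow OpenAI's Chat Completions structured-output format; the schema name and reviewer wording are made up for illustration, and this builds the dict only, no API call):

```python
# Hypothetical scored-review request for a GPT subagent.
def scored_review_request(topic: str) -> dict:
    return {
        "model": "gpt-4.1-2025-04-14",  # pinned snapshot for consistency
        "messages": [
            # 'developer' role carries the instructions at higher priority than 'user'
            {"role": "developer", "content": "You are a strict reviewer. Score 1-5."},
            {"role": "user", "content": topic},
        ],
        "response_format": {
            "type": "json_schema",
            "json_schema": {
                "name": "review_score",
                "strict": True,
                "schema": {
                    "type": "object",
                    "properties": {
                        "score": {"type": "integer", "minimum": 1, "maximum": 5},
                        "rationale": {"type": "string"},
                    },
                    "required": ["score", "rationale"],
                    "additionalProperties": False,
                },
            },
        },
    }
```

With `strict: True` the model's output is constrained to the schema, which is what eliminates format-retry loops for scored outputs.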
---