# Inference Cost Optimization Plan
**Goal:** Reduce LLM inference costs without quality loss using OpenClaw's built-in configuration knobs + smarter subagent model selection. No code changes to OpenClaw — config-only, fully upstream-compatible.

**Date:** 2026-03-05 | **Status:** Planning
## Current State
| Item | Value |
|---|---|
| Main session model | litellm/copilot-claude-opus-4.6 (via GitHub Copilot) |
| Default agent model | litellm/copilot-claude-sonnet-4.6 |
| Prompt caching | NOT SET (no cacheRetention configured) |
| Context pruning | NOT SET (no contextPruning configured) |
| Heartbeat | 30m (main agent only) |
| Subagent model | Inherits session model (expensive!) |
| Free models available | zai/glm-4.7, zai/glm-4.7-flash, zai/glm-4.7-flashx, zai/glm-5 (all $0) |
| Copilot models | Flat-rate via GitHub Copilot subscription (effectively $0 marginal cost per token) |
## Cost Structure
- **Copilot models** (`litellm/copilot-*`): Covered by the GitHub Copilot subscription: no per-token cost, but subject to rate limits and quotas. Using Opus when Sonnet suffices wastes quota.
- **ZAI models** (`zai/glm-*`): Free tier, no per-token cost. Quality varies by task type.
- The real "cost" is (a) Copilot quota burn on expensive models, (b) latency, and (c) quality risk on cheaper models.
## Phase 1: Enable Prompt Caching
**What:** Configure `cacheRetention` on Anthropic-backed models so repeated system prompts and stable context get cached by the provider.

**Why:** Our system prompt (AGENTS.md + SOUL.md + USER.md + TOOLS.md + IDENTITY.md + HEARTBEAT.md + skills list) is large and mostly static. Without caching, every turn reprocesses ~15-20k tokens of identical prefix. With caching, subsequent turns pay ~10% of the base input price for cached tokens (Anthropic pricing).

**Scope:** Applies only to Claude-backed models. GPT, GLM, and Gemini models do NOT support Anthropic's `cacheRetention` mechanism. OpenAI caching is automatic (no config needed); ZAI/GLM has no caching mechanism.
Cache TTL reality check (from official Anthropic docs):
- `short` (`cacheRetention`) = 5-minute TTL (default; refreshed on each use within the window)
- `long` = 1-hour TTL (at higher write cost: 2x base input price vs 1.25x for `short`)
- Cache reads cost 0.1x (10%) of base input, so a cache hit on a 15k-token system prompt costs 90% less
- First turn writes the cache (slightly more expensive); subsequent turns read it (very cheap)
**Implication:** With the 5-minute default TTL and a 30-minute heartbeat, the cache expires between every heartbeat. Either:
- Use `long` (1h TTL) and set the heartbeat to 55m to keep it warm (best for cost savings)
- Use `short` (5m TTL) with no heartbeat adjustment (cache only helps within active bursts)
**Recommendation:** Use `long` on main-session Claude models + a 25m heartbeat (well within 1h). Cache writes are slightly more expensive, but the read savings dominate for any active session.
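The trade-off can be sanity-checked with rough arithmetic. The multipliers below come from the Anthropic pricing notes above (2x write for `long`, 0.1x reads); the 15k-token prefix and 10-turn session are illustrative assumptions, and the helper is a hypothetical sketch, not part of OpenClaw:

```python
# Rough cost model for the static prompt prefix over one session.
# Multipliers per the Anthropic docs cited above: cache read = 0.1x base
# input price, 1-hour ("long") cache write = 2x. Token counts are assumed.

def session_input_cost(turns: int, prefix_tokens: int = 15_000,
                       write_mult: float = 2.0, read_mult: float = 0.1) -> float:
    """Relative input cost for the prefix: turn 1 writes the cache,
    later turns read it (assumes the cache never expires mid-session).
    Unit: base-price token-equivalents."""
    if turns < 1:
        return 0.0
    return prefix_tokens * write_mult + (turns - 1) * prefix_tokens * read_mult

uncached = 10 * 15_000              # 10 turns, full prefix reprocessed each turn
long_ttl = session_input_cost(10)   # 2x write once, then cheap 0.1x reads
print(f"uncached={uncached} long_ttl={long_ttl}")  # long TTL ≈ 71% cheaper
```

Even with the 2x write premium, the cached session costs roughly 43.5k token-equivalents against 150k uncached, which is where the 40-60% estimate below comes from.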
Config change (`~/.openclaw/openclaw.json`):

```json
{
  "agents": {
    "defaults": {
      "models": {
        "litellm/copilot-claude-opus-4.6": {
          "params": { "cacheRetention": "long" }
        },
        "litellm/copilot-claude-sonnet-4.6": {
          "params": { "cacheRetention": "long" }
        },
        "litellm/copilot-claude-opus-4.5": {
          "params": { "cacheRetention": "long" }
        },
        "litellm/copilot-claude-sonnet-4.5": {
          "params": { "cacheRetention": "long" }
        },
        "litellm/copilot-claude-haiku-4.5": {
          "params": { "cacheRetention": "short" }
        }
      }
    }
  }
}
```
**Note:** No config is needed for GPT models (OpenAI caches automatically for free on prompts ≥1024 tokens), and no config is available for ZAI/GLM models (no caching support).
Verification:
- After applying, check `/status` or `/usage full` for `cacheRead` vs `cacheWrite` tokens.
- Enable cache trace diagnostics temporarily: `{ "diagnostics": { "cacheTrace": { "enabled": true } } }`
- The first turn will show high `cacheWrite` (populating the cache). Subsequent turns should show high `cacheRead` with much lower `cacheWrite`.
- Target: >60% cache hit rate within 2-3 turns of a session.
**Risk:** Zero. Caching doesn't change outputs; it's purely a provider-side optimization.

**Expected impact:** 40-60% reduction in input-token processing cost for sessions with multiple turns.
## Phase 2: Heartbeat Cache Warming
**What:** Align the heartbeat interval to keep the 1-hour prompt cache warm across idle gaps.

**Why:** With `cacheRetention: "long"` (1h TTL), the cache expires after 1 hour of no activity. A heartbeat just under 1h ensures the cache is touched before it expires, so the next real interaction reads from cache instead of rewriting it. Our current 30m heartbeat already works, but 25m gives a safety margin.

**Important:** Heartbeat keep-warm only applies to Claude models. GPT/GLM models don't benefit; their caching is either automatic (OpenAI) or non-existent (ZAI).
Config change (`~/.openclaw/openclaw.json`):

```json
{
  "agents": {
    "defaults": {
      "heartbeat": { "every": "55m" }
    },
    "list": [
      {
        "id": "main",
        "heartbeat": { "every": "25m" }
      }
    ]
  }
}
```
Rationale:
- Main agent: keep at 25m (well within 1h TTL, ensures cache stays warm during active use)
- Other agents (claude, codex, copilot, opencode): 55m default (just under 1h TTL, minimal quota burn when idle)
- If an agent is rarely used, its heartbeat won't fire (disabled agents skip heartbeat)
Verification:
- After a 30-minute idle gap, check that the next interaction shows `cacheRead` (not all `cacheWrite`).
- Monitor heartbeat token cost via `/usage full` on a heartbeat response.
**Risk:** Low. A slightly more frequent heartbeat means slightly more baseline token usage, but the cache savings on real interactions outweigh it.

**Expected impact:** Maintains the Phase 1 cache savings across idle periods instead of losing them after TTL expiry.
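The timing constraint behind the 25m/55m choices reduces to a one-line check. The 60-minute TTL follows the Anthropic docs cited in Phase 1; the 5-minute safety margin and the helper name are assumptions for illustration:

```python
# Does a heartbeat interval keep a cache with the given TTL warm?
# TTL of 60 min matches Anthropic's "long" retention; the 5-minute
# safety margin for scheduling jitter is an assumption.
def keeps_cache_warm(heartbeat_min: int, ttl_min: int = 60,
                     margin_min: int = 5) -> bool:
    """True if the heartbeat fires before the cache TTL would expire."""
    return heartbeat_min + margin_min <= ttl_min

print(keeps_cache_warm(25))   # True: main agent, comfortably inside the TTL
print(keeps_cache_warm(55))   # True: default agents, just inside the margin
print(keeps_cache_warm(60))   # False: cache would expire between beats
```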
## Phase 3: Context Pruning
**What:** Enable `cache-ttl` context pruning so old tool results and conversation history get pruned after the cache window expires.

**Why:** Long sessions accumulate tool results, file reads, and old conversation turns that bloat the context. Without pruning, post-idle requests re-cache the entire oversized history. Cache-TTL pruning trims stale context so re-caching after idle is smaller and cheaper.
Config change (`~/.openclaw/openclaw.json`):

```json
{
  "agents": {
    "defaults": {
      "contextPruning": {
        "mode": "cache-ttl",
        "ttl": "1h"
      }
    }
  }
}
```
Rationale:
- `cache-ttl` mode: prunes old tool-result context after the cache TTL expires
- `ttl: "1h"`: matches Anthropic's `long` cache retention window
- After 1h of no interaction, old tool results and conversation history are pruned, so the next request re-caches a smaller context
Verification:
- Use `/context list` or `/context detail` to check context size before and after pruning.
- After a >1h idle gap, verify the context window is smaller than before the gap.
- Ensure no critical context is lost; compaction summaries should preserve key information.
**Risk:** Low-medium. Pruning removes old tool results, which means the model can't reference exact earlier tool outputs after pruning. Compaction summaries mitigate this. Test by asking about earlier conversation after a pruning event.

**Expected impact:** 20-30% reduction in context size for long sessions, which both reduces input-token cost and improves response quality (less noise in context).
## Phase 4: Cheaper Models for Subagents
**What:** Route subagent tasks to cheaper models based on task complexity, with quality verification.

**Why:** Currently ALL subagents inherit the session model (Opus 4.6, or whatever the session is on). Most subagent tasks (council advisors, research queries, simple generation) don't need frontier-model quality. ZAI GLM-4.7 is free and handles many tasks well. Copilot Sonnet/Haiku are much cheaper quota-wise than Opus.
### Model Tier Strategy
| Tier | Model | Use Case | Cost |
|---|---|---|---|
| Free | `zai/glm-4.7` | Bulk subagent work: council advisors, brainstorming, summarization, classification | $0 |
| Free-fast | `zai/glm-4.7-flash` | Simple/short subagent tasks: acknowledgments, formatting, quick lookups | $0 |
| Cheap | `litellm/copilot-claude-haiku-4.5` | Tasks needing Claude quality but not heavy reasoning | Low quota |
| Standard | `litellm/copilot-claude-sonnet-4.6` | Tasks needing strong reasoning, code generation, analysis | Medium quota |
| Frontier | `litellm/copilot-claude-opus-4.6` | Only for: main session, referee/meta-arbiter, critical decisions | High quota |
### Implementation
#### 4a. Council Skill — Default to GLM-4.7
Update council skill to use cheaper models by default:
| Council Role | Default Model | Override for `tier=heavy` |
|---|---|---|
| Personality advisors | `zai/glm-4.7` | `litellm/copilot-claude-sonnet-4.6` |
| D/P Freethinkers | `zai/glm-4.7` | `litellm/copilot-claude-sonnet-4.6` |
| D/P Arbiters | `zai/glm-4.7` | `litellm/copilot-claude-sonnet-4.6` |
| Referee / Meta-Arbiter | `litellm/copilot-claude-sonnet-4.6` | `litellm/copilot-claude-opus-4.6` |
When spawning subagents via `sessions_spawn`, pass the `model` parameter:

```json
{
  "task": "...",
  "mode": "run",
  "label": "council-pragmatist",
  "model": "zai/glm-4.7"
}
```
#### 4b. General Subagent Routing Guidelines
Encode these in AGENTS.md or a workspace convention file so all future subagent spawns follow the pattern:
**Use `zai/glm-4.7` (free) when:**
- Task is well-defined with clear constraints
- Output format is specified in the prompt
- Task is one of: summarization, brainstorming, classification, translation, formatting, simple Q&A
- Task doesn't require tool use or complex multi-step reasoning
**Use `litellm/copilot-claude-sonnet-4.6` (standard) when:**
- Task requires nuanced reasoning or analysis
- Task involves code generation or review
- Output quality is user-facing and high-stakes
- Task requires understanding subtle context
**Use `litellm/copilot-claude-opus-4.6` (frontier) when:**
- Main interactive session only
- Final synthesis / referee / meta-arbiter roles
- Tasks where the user explicitly asked for highest quality
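The guidelines above amount to a small decision table, which a convention file or helper script could encode roughly like this. The function name, task-category strings, and flags are hypothetical; the model IDs mirror the tiers above:

```python
# Hypothetical routing sketch of the subagent guidelines above; not an
# OpenClaw API. Task-category names are illustrative assumptions.
FREE_TASKS = {"summarization", "brainstorming", "classification",
              "translation", "formatting", "simple-qa"}

def pick_subagent_model(task_type: str, *, user_facing: bool = False,
                        needs_tools: bool = False) -> str:
    """Map a task to a model tier. Opus is deliberately never returned:
    it is reserved for the main session and referee roles."""
    if task_type in FREE_TASKS and not user_facing and not needs_tools:
        return "zai/glm-4.7"                      # free tier: bulk, well-defined work
    return "litellm/copilot-claude-sonnet-4.6"    # standard tier for everything else

print(pick_subagent_model("summarization"))                  # free model
print(pick_subagent_model("code-review", user_facing=True))  # standard model
```

A real implementation would likely add a tier override (the `tier=heavy` column in the council table), but the core routing logic stays this simple.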
### Subagent Model Notes
- When spawning subagents for Claude tasks, caching applies if the subagent model is also a Claude model. But subagents are typically short-lived (single-turn `mode=run`), so the caching benefit is minimal: they don't accumulate conversation history.
- The main caching win is in the main session, which has a large, growing context across many turns.
- For GLM-4.7 subagents: no caching benefit, but no cost either ($0 model). Prompts must be self-contained and tightly framed.
- For GPT subagents: OpenAI caches automatically if the prompt is ≥1024 tokens; no action needed.
### Prompt Structure for Maximum Cache Efficiency (Claude models)

Per official Anthropic best practices:
- Static first, dynamic last: system prompt, role, and instructions first; then the topic/task (the dynamic part).
- This structure is what OpenClaw already does (system prompt built once, user message varies).
- OpenClaw handles the `cache_control` injection automatically via the `cacheRetention` config.
- Our prompts already follow this structure correctly.
#### 4c. Quality Verification

Before switching council and subagents to GLM-4.7, run a quality comparison:
- Same-topic test: Run the personality council on a topic we've already tested with Sonnet, but using GLM-4.7 for advisors. Compare output quality side by side.
- Structured-output test: Verify GLM-4.7 follows prompt templates correctly (word-count guidance, section headers, staying in role).
- Scoring rubric:
  - Does the advisor stay in character? (yes/no)
  - Is the output substantive (not generic platitudes)? (1-5)
  - Does it follow word-count guidance? (within 50% of target)
  - Does it reference specific aspects of the topic? (1-5)
- Minimum quality bar: If GLM-4.7 scores ≥3.5/5 average on the rubric, it's good enough for advisor roles. The referee always stays on Sonnet or above.
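One way to operationalize the rubric is to treat the yes/no items as hard gates and average the two 1-5 scores against the 3.5 bar. That interpretation, and the function/field names, are assumptions; adjust if the rubric is meant to be weighted differently:

```python
# Minimal scorer for the rubric above. Treats the yes/no checks as hard
# gates and averages the two 1-5 items; this weighting is an assumption.
def advisor_passes(in_character: bool, substantive: int,
                   within_wordcount: bool, topic_specific: int,
                   bar: float = 3.5) -> bool:
    """True if the advisor output clears the minimum quality bar."""
    if not in_character:        # staying in character is non-negotiable
        return False
    avg = (substantive + topic_specific) / 2
    return within_wordcount and avg >= bar

print(advisor_passes(True, 4, True, 4))   # True: avg 4.0 clears the 3.5 bar
print(advisor_passes(True, 3, True, 3))   # False: avg 3.0 falls short
```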
#### 4d. Prompt Engineering Per Model — Official Best Practices
From official Anthropic and OpenAI docs (see `memory/references/`):
##### For Claude models (Haiku, Sonnet — subagent advisors)
- Give an explicit role in the system prompt: `You are the Skeptic advisor on a council...`
- Use XML tags to separate role, instructions, context, and topic: `<instructions>`, `<context>`, `<topic>`
- Put static instructions first and the variable topic at the end (maximizes cache hit rate on repeated spawns)
- Use 3-5 `<example>` tags for structured output formats
- Tell the model what to do rather than what not to do: `Write in flowing prose`, not `Don't use bullet points`
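Putting those rules together, a council-advisor spawn might assemble its prompt like this. The helper and the exact wording are hypothetical sketches (the tag names follow the Anthropic convention noted above); the point is the static-prefix-first layout:

```python
# Sketch of the static-first prompt layout for repeated advisor spawns.
# The role text and helper are illustrative assumptions, not OpenClaw code.
STATIC_HEADER = (
    "You are the Skeptic advisor on a council of advisors.\n"
    "<instructions>Respond in flowing prose, 2-3 paragraphs, "
    "staying strictly in character.</instructions>\n"
)

def build_advisor_prompt(topic: str, context: str = "") -> str:
    """Static role + instructions first, dynamic topic last, so repeated
    spawns share the longest possible cacheable prefix."""
    parts = [STATIC_HEADER]
    if context:
        parts.append(f"<context>{context}</context>\n")
    parts.append(f"<topic>{topic}</topic>")
    return "".join(parts)

prompt = build_advisor_prompt("Should we enable context pruning?")
print(prompt.startswith(STATIC_HEADER))   # True: the cacheable prefix is stable
```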
##### For GLM-4.7 (free-tier subagents)
- Be MORE explicit than with Claude — GLM needs tighter constraints
- Constrain output length tightly: "Respond in exactly 3 paragraphs" not "200-400 words"
- Use numbered lists or explicit section headers in the prompt
- Front-load the most critical instruction (role + constraint first, context second)
- Include a format-check reminder: "Before responding, verify your output matches the format above"
- Request structured output over open-ended generation when possible
- Avoid complex multi-step reasoning chains — GLM handles simpler, well-defined tasks best
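Applied together, the GLM rules yield a much more rigid template than the Claude one. The template text and helper below are hypothetical examples of that style: hard constraint first, numbered structure, format-check reminder last:

```python
# Hypothetical GLM-4.7 prompt template applying the rules above:
# front-loaded role + length constraint, explicit numbered structure,
# and a trailing format-check reminder. Wording is an assumption.
GLM_TEMPLATE = """You are the {role} advisor. Respond in exactly 3 paragraphs.

1. Your position on the topic.
2. The strongest objection to that position.
3. Your final recommendation, in one sentence.

Topic: {topic}

Before responding, verify your output matches the format above."""

def build_glm_prompt(role: str, topic: str) -> str:
    """Fill the tightly constrained GLM template."""
    return GLM_TEMPLATE.format(role=role, topic=topic)

print(build_glm_prompt("Pragmatist", "Enable context pruning?"))
```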
##### For GPT models (gpt-5-mini, gpt-4.1 subagents)
- Include explicit step-by-step instructions (GPT benefits from "think step by step" guidance)
- Use `response_format: json_schema` for any scored/structured output; it eliminates format retries entirely
- Use the `developer` role for system/role instructions (higher priority than `user`)
- Don't over-specify for reasoning models (o3, o4-mini); they reason internally
- Pin to specific model snapshots if quality consistency matters (`gpt-4.1-2025-04-14`)
## Implementation Order
### Step 1: Config changes (Phases 1-3) — do together, single commit
Apply all three config changes to `~/.openclaw/openclaw.json`:
- `cacheRetention: "long"` on Claude models
- `heartbeat.every: "25m"` for main, `"55m"` default
- `contextPruning.mode: "cache-ttl"` with `ttl: "1h"`

Restart the gateway: `openclaw gateway restart`

Verify with `/status` and `/usage full` over the next few interactions.
### Step 2: Quality-test GLM-4.7 for subagent work
Run a single council advisor (e.g., Pragmatist) on a known topic using `model: "zai/glm-4.7"` in `sessions_spawn`. Compare output quality against the Sonnet run we already have saved.
### Step 3: Update council skill for model tiers
If GLM-4.7 passes the quality bar, update `skills/council/SKILL.md` and `scripts/council.sh` with the model-tier routing table. Update `references/prompts.md` with tighter prompt variants for cheaper models if needed.
### Step 4: Update AGENTS.md with subagent routing guidelines
Add a section documenting when to use which model tier for subagents, so the convention is followed consistently.
### Step 5: Monitor and tune
- Track cache hit rates over 1-2 days
- Monitor if context pruning causes any information loss
- Adjust heartbeat timing if cache misses are too frequent
- Tune GLM-4.7 prompts based on observed output quality
## Upstream Safety Rules
These are hard constraints. Any implementation that violates them is out of scope.
### ❌ Never do
- Edit files under `~/.npm-global/lib/node_modules/openclaw/` directly (dist, src, docs)
- Patch or monkey-patch OpenClaw's runtime code, even for emergencies (exception: the existing TUI patch has a tracked upstream PR; document any new ones immediately)
- Add config keys not documented in OpenClaw's own docs (guessing at undocumented keys can silently break on upgrade)
- Modify `~/.openclaw/openclaw.json` in a way that would be overwritten or invalidated by `openclaw update`
- Introduce any middleware, proxy, or hook that intercepts OpenClaw's internal request path
### ✅ Safe to do
- Edit `~/.openclaw/openclaw.json` using documented config knobs (agents, models, diagnostics, contextPruning, etc.)
- Add/edit workspace files (`~/.openclaw/workspace/`) freely; these are never touched by OpenClaw updates
- Install/update skills via `clawhub`; skills are workspace-local
- Run `openclaw gateway restart` after config changes
- Use `openclaw update status` / `scripts/openclaw-update-safe.sh` to check for upstream updates
### Checking before applying
Before implementing any config change:
- Verify the key exists in `/home/openclaw/.npm-global/lib/node_modules/openclaw/docs/` or at `https://docs.openclaw.ai`
- If undocumented: skip it or open a question/issue; don't guess
- After `openclaw update`, re-verify config keys still work (check gateway logs for config parse errors)
### Update workflow
```bash
# Before updating OpenClaw
openclaw update status     # check what version is available
# Review changelog for breaking config changes
openclaw update            # update (safe scripts handle local compat)
openclaw gateway restart   # restart to pick up new version
# Verify gateway health + session model still resolves correctly
```
## What This Does NOT Change
- **No OpenClaw code changes:** Everything is config-only in `openclaw.json`
- **No upstream divergence:** All settings use documented OpenClaw config knobs
- **No new infrastructure:** No proxy servers, routers, or middleware
- **Main session stays on Opus:** Only subagents move to cheaper models
- **Fully reversible:** Remove the config keys to revert to current behavior
## Expected Combined Impact
| Optimization | Estimated Savings | Confidence |
|---|---|---|
| Prompt caching | 40-60% input token reduction | High |
| Cache warming via heartbeat | Maintains cache savings across idle | High |
| Context pruning | 20-30% context size reduction for long sessions | Medium |
| Subagent model routing | 60-80% subagent cost reduction (free model for bulk work) | Medium (pending quality test) |
**Combined:** Significant reduction in Copilot quota burn. Main session quality unchanged. Subagent quality maintained through tighter prompts + quality verification.