Phase 1: Enable prompt caching (cacheRetention: long on Claude models)
Phase 2: Heartbeat cache warming (25m main, 55m default)
Phase 3: Context pruning (cache-ttl mode, 1h TTL)
Phase 4: Cheaper models for subagents (GLM-4.7 free tier for bulk work)
All config-only, no OpenClaw code changes, fully reversible.
Inference Cost Optimization Plan
Goal: Reduce LLM inference costs without quality loss using OpenClaw's built-in configuration knobs + smarter subagent model selection. No code changes to OpenClaw — config-only, fully upstream-compatible.
Date: 2026-03-05
Status: Planning
Current State
| Item | Value |
|---|---|
| Main session model | litellm/copilot-claude-opus-4.6 (via GitHub Copilot) |
| Default agent model | litellm/copilot-claude-sonnet-4.6 |
| Prompt caching | NOT SET (no cacheRetention configured) |
| Context pruning | NOT SET (no contextPruning configured) |
| Heartbeat | 30m (main agent only) |
| Subagent model | Inherits session model (expensive!) |
| Free models available | zai/glm-4.7, zai/glm-4.7-flash, zai/glm-4.7-flashx, zai/glm-5 (all $0) |
| Copilot models | Flat-rate via GitHub Copilot subscription (effectively $0 marginal cost per token) |
Cost Structure
- Copilot models (litellm/copilot-*): Covered by GitHub Copilot subscription — no per-token cost, but subject to rate limits and quotas. Using Opus when Sonnet suffices wastes quota.
- ZAI models (zai/glm-*): Free tier, no per-token cost. Quality varies by task type.
- The real "cost" is: (a) Copilot quota burn on expensive models, (b) latency, (c) quality risk on cheaper models.
Phase 1: Enable Prompt Caching
What: Configure cacheRetention on Anthropic-backed models so repeated system prompts and stable context get cached by the provider.
Why: Our system prompt (AGENTS.md + SOUL.md + USER.md + TOOLS.md + IDENTITY.md + HEARTBEAT.md + skills list) is large and mostly static. Without caching, every turn reprocesses ~15-20k tokens of identical prefix. With caching, subsequent turns pay ~10% for cached tokens (Anthropic pricing).
Config change (~/.openclaw/openclaw.json):
    {
      "agents": {
        "defaults": {
          "models": {
            "litellm/copilot-claude-opus-4.6": {
              "params": {
                "cacheRetention": "long"
              }
            },
            "litellm/copilot-claude-sonnet-4.6": {
              "params": {
                "cacheRetention": "long"
              }
            },
            "litellm/copilot-claude-opus-4.5": {
              "params": {
                "cacheRetention": "long"
              }
            },
            "litellm/copilot-claude-sonnet-4.5": {
              "params": {
                "cacheRetention": "long"
              }
            },
            "litellm/copilot-claude-haiku-4.5": {
              "params": {
                "cacheRetention": "short"
              }
            }
          }
        }
      }
    }
Verification:
- After applying, check /status or /usage full for cacheRead vs cacheWrite tokens.
- Enable cache trace diagnostics temporarily:
  { "diagnostics": { "cacheTrace": { "enabled": true } } }
- First turn will show high cacheWrite (populating cache). Subsequent turns should show high cacheRead with much lower cacheWrite.
- Target: >60% cache hit rate within 2-3 turns of a session.
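The hit-rate target can be checked with a small calculation once the token counts are read off /usage full. The sketch below assumes that "cache hit rate" means cacheRead / (cacheRead + cacheWrite); the plan does not define the term, and the function name and sample numbers are hypothetical.

```python
def cache_hit_rate(cache_read: int, cache_write: int) -> float:
    """Share of cache-eligible input tokens served from cache (assumed definition)."""
    total = cache_read + cache_write
    if total == 0:
        return 0.0  # no cache activity reported for this turn
    return cache_read / total


# Hypothetical token counts for a turn after the cache has been populated.
rate = cache_hit_rate(cache_read=17_000, cache_write=1_500)
meets_target = rate > 0.60  # the >60% target from the verification list
```

If the usage output reports uncached input tokens separately, they could be added to the denominator for a stricter measure.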
Risk: Zero. Caching doesn't change outputs — it's purely a provider-side optimization.
Expected impact: 40-60% reduction in input token processing cost for sessions with multiple turns.
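As a rough sanity check on the 40-60% figure, the input cost of a session can be modelled with and without caching. This is a simplified sketch: it assumes a fixed static prefix, ignores conversation history growth, uses the ~10% cached-read rate quoted above, and assumes cache writes cost 2x the base input price (an assumption based on Anthropic's published pricing for 1-hour cache writes; Copilot quota accounting may differ). The function name and parameters are illustrative.

```python
def relative_input_cost(
    prefix_tokens: int,
    fresh_tokens_per_turn: int,
    turns: int,
    read_rate: float = 0.10,   # cached reads, from the plan's ~10% figure
    write_rate: float = 2.0,   # ASSUMPTION: premium for writing the 1h cache
) -> float:
    """Return cached input cost divided by uncached input cost for one session."""
    uncached = turns * (prefix_tokens + fresh_tokens_per_turn)
    cached = (
        prefix_tokens * write_rate                       # first turn writes the cache
        + (turns - 1) * prefix_tokens * read_rate        # later turns read it
        + turns * fresh_tokens_per_turn                  # new tokens are never cached here
    )
    return cached / uncached


ratio = relative_input_cost(prefix_tokens=18_000, fresh_tokens_per_turn=2_000, turns=5)
savings = 1.0 - ratio
```

With an 18k-token prefix, 2k fresh tokens per turn, and 5 turns, the ratio comes out to 0.532, roughly a 47% reduction under these assumptions. A single-turn session would cost more than the uncached baseline because of the write premium.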
Phase 2: Heartbeat Cache Warming
What: Align heartbeat interval to keep the prompt cache warm across idle gaps.
Why: Anthropic's long cache retention has a TTL of about 1 hour. Our current 30m heartbeat is already well under that, but we should also ensure the heartbeat is a lightweight keep-warm that doesn't generate expensive cache writes.
Config change (~/.openclaw/openclaw.json):
    {
      "agents": {
        "defaults": {
          "heartbeat": {
            "every": "55m"
          }
        },
        "list": [
          {
            "id": "main",
            "heartbeat": {
              "every": "25m"
            }
          }
        ]
      }
    }
Rationale:
- Main agent: keep at 25m (well within 1h TTL, ensures cache stays warm during active use)
- Other agents (claude, codex, copilot, opencode): 55m default (just under 1h TTL, minimal quota burn when idle)
- Disabled agents skip the heartbeat, so a rarely used agent that is turned off adds no heartbeat cost
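The interval choices above can be checked mechanically against the cache TTL. The following is a minimal sketch; the duration format (an integer followed by s, m, or h) is assumed from the values used in this plan, and the helper names are hypothetical rather than part of OpenClaw.

```python
import re

_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}


def parse_duration(text: str) -> int:
    """Convert strings like "25m" or "1h" to seconds (assumed format)."""
    match = re.fullmatch(r"(\d+)([smh])", text.strip())
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    return int(match.group(1)) * _UNIT_SECONDS[match.group(2)]


def keeps_cache_warm(heartbeat: str, cache_ttl: str = "1h") -> bool:
    """True if the heartbeat fires before the cache TTL elapses."""
    return parse_duration(heartbeat) < parse_duration(cache_ttl)


# Intervals proposed in this phase.
intervals = {"main": "25m", "default": "55m"}
report = {name: keeps_cache_warm(every) for name, every in intervals.items()}
```

The 55m default leaves only a five-minute margin, so any scheduling jitter in the heartbeat could let the cache expire; the check above does not model that.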
Verification:
- After a 30-minute idle gap, check that the next interaction shows cacheRead (not all cacheWrite).
- Monitor heartbeat token cost via /usage full on a heartbeat response.
Risk: Low. Slightly more frequent heartbeat = slightly more baseline token usage, but the cache savings on real interactions outweigh this.
Expected impact: Maintains the Phase 1 cache savings across idle periods instead of losing them after TTL expiry.
Phase 3: Context Pruning
What: Enable cache-ttl context pruning so old tool results and conversation history get pruned after the cache window expires.
Why: Long sessions accumulate tool results, file reads, and old conversation turns that bloat the context. Without pruning, post-idle requests re-cache the entire oversized history. Cache-TTL pruning trims stale context so re-caching after idle is smaller and cheaper.
Config change (~/.openclaw/openclaw.json):
    {
      "agents": {
        "defaults": {
          "contextPruning": {
            "mode": "cache-ttl",
            "ttl": "1h"
          }
        }
      }
    }
Rationale:
- cache-ttl mode: prunes old tool-result context after the cache TTL expires
- ttl: "1h": matches Anthropic's long cache retention window
- After 1h of no interaction, old tool results and conversation history are pruned, so the next request re-caches a smaller context
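The idea behind TTL-based pruning can be illustrated with a toy model. This is a conceptual sketch only, not OpenClaw's actual pruning algorithm: the Entry type, the field names, and the rule that only tool results are dropped are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    kind: str         # "user", "assistant", or "tool_result" (hypothetical labels)
    timestamp: float  # seconds since epoch
    text: str


def prune_after_idle(history: list[Entry], now: float, ttl_seconds: float = 3600.0) -> list[Entry]:
    """Drop tool results once the session has been idle longer than the TTL."""
    if not history:
        return []
    last_activity = max(entry.timestamp for entry in history)
    if now - last_activity <= ttl_seconds:
        return list(history)  # within the TTL: keep everything so the cached prefix still matches
    # Past the TTL the cache is gone anyway, so re-cache a smaller context.
    return [entry for entry in history if entry.kind != "tool_result"]


history = [
    Entry("user", 0.0, "read the config"),
    Entry("tool_result", 5.0, "<large file dump>"),
    Entry("assistant", 10.0, "The config sets X."),
]
after_short_gap = prune_after_idle(history, now=600.0)
after_long_gap = prune_after_idle(history, now=4000.0)
```

The key design point is that pruning waits until the cache has expired: trimming earlier would change the prefix and invalidate a cache that is still warm.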
Verification:
- Use /context list or /context detail to check context size before and after pruning.
- After a >1h idle gap, verify the context window is smaller than before the gap.
- Ensure no critical context is lost — compaction summaries should preserve key information.
Risk: Low-medium. Pruning removes old tool results, which means the model can't reference exact earlier tool outputs after pruning. Compaction summaries mitigate this. Test by asking about earlier conversation after a pruning event.
Expected impact: 20-30% reduction in context size for long sessions, which both reduces input token cost and improves response quality (less noise in context).
Phase 4: Cheaper Models for Subagents
What: Route subagent tasks to cheaper models based on task complexity, with quality verification.
Why: Currently ALL subagents inherit the session model (Opus 4.6 or whatever the session is on). Most subagent tasks (council advisors, research queries, simple generation) don't need frontier-model quality. ZAI GLM-4.7 is free and handles many tasks well. Copilot Sonnet/Haiku are much cheaper quota-wise than Opus.
Model Tier Strategy
| Tier | Model | Use Case | Cost |
|---|---|---|---|
| Free | zai/glm-4.7 | Bulk subagent work: council advisors, brainstorming, summarization, classification | $0 |
| Free-fast | zai/glm-4.7-flash | Simple/short subagent tasks: acknowledgments, formatting, quick lookups | $0 |
| Cheap | litellm/copilot-claude-haiku-4.5 | Tasks needing Claude quality but not heavy reasoning | Low quota |
| Standard | litellm/copilot-claude-sonnet-4.6 | Tasks needing strong reasoning, code generation, analysis | Medium quota |
| Frontier | litellm/copilot-claude-opus-4.6 | Only for: main session, referee/meta-arbiter, critical decisions | High quota |
Implementation
4a. Council Skill — Default to GLM-4.7
Update council skill to use cheaper models by default:
| Council Role | Default Model | Override for tier=heavy |
|---|---|---|
| Personality advisors | zai/glm-4.7 | litellm/copilot-claude-sonnet-4.6 |
| D/P Freethinkers | zai/glm-4.7 | litellm/copilot-claude-sonnet-4.6 |
| D/P Arbiters | zai/glm-4.7 | litellm/copilot-claude-sonnet-4.6 |
| Referee / Meta-Arbiter | litellm/copilot-claude-sonnet-4.6 | litellm/copilot-claude-opus-4.6 |
When spawning subagents via sessions_spawn, pass the model parameter:
    {
      "task": "...",
      "mode": "run",
      "label": "council-pragmatist",
      "model": "zai/glm-4.7"
    }
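To show how the council role table and the spawn payload fit together, here is a minimal sketch that builds the request for a given role and tier. The payload fields (task, mode, label, model) come from the example above; the role keys, the function name, and the "default"/"heavy" tier strings are hypothetical, and nothing here calls sessions_spawn itself.

```python
SONNET = "litellm/copilot-claude-sonnet-4.6"
OPUS = "litellm/copilot-claude-opus-4.6"
GLM = "zai/glm-4.7"

# (default model, model for tier=heavy), mirroring the council role table.
COUNCIL_MODELS = {
    "personality-advisor": (GLM, SONNET),
    "freethinker": (GLM, SONNET),
    "arbiter": (GLM, SONNET),
    "referee": (SONNET, OPUS),
}


def build_spawn_request(role: str, label: str, task: str, tier: str = "default") -> dict:
    """Return a payload shaped like the sessions_spawn example in this plan."""
    default_model, heavy_model = COUNCIL_MODELS[role]
    model = heavy_model if tier == "heavy" else default_model
    return {"task": task, "mode": "run", "label": label, "model": model}


request = build_spawn_request("personality-advisor", "council-pragmatist", "...")
```

Keeping the table in one place means a failed quality test only requires changing the default column, not every spawn call.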
4b. General Subagent Routing Guidelines
Encode these in AGENTS.md or a workspace convention file so all future subagent spawns follow the pattern:
Use zai/glm-4.7 (free) when:
- Task is well-defined with clear constraints
- Output format is specified in the prompt
- Task is one of: summarization, brainstorming, classification, translation, formatting, simple Q&A
- Task doesn't require tool use or complex multi-step reasoning
Use litellm/copilot-claude-sonnet-4.6 (standard) when:
- Task requires nuanced reasoning or analysis
- Task involves code generation or review
- Output quality is user-facing and high-stakes
- Task requires understanding subtle context
Use litellm/copilot-claude-opus-4.6 (frontier) when:
- Main interactive session only
- Final synthesis / referee / meta-arbiter roles
- Tasks where the user explicitly asked for highest quality
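The guidelines above amount to a small decision procedure. The sketch below is one possible encoding, assuming hypothetical task and role labels; the real convention lives in prose in AGENTS.md, and this function is illustrative rather than an OpenClaw feature.

```python
GLM = "zai/glm-4.7"
SONNET = "litellm/copilot-claude-sonnet-4.6"
OPUS = "litellm/copilot-claude-opus-4.6"

# Hypothetical labels for the task types listed under the free tier.
FREE_TASKS = {"summarization", "brainstorming", "classification",
              "translation", "formatting", "simple-qa"}
# Hypothetical labels for roles that stay on the frontier model.
FRONTIER_ROLES = {"main", "final-synthesis", "referee", "meta-arbiter"}


def pick_model(task_kind: str, role: str = "subagent",
               needs_tools_or_multistep: bool = False,
               user_wants_best: bool = False) -> str:
    """Choose a model tier following the routing guidelines in section 4b."""
    if role in FRONTIER_ROLES or user_wants_best:
        return OPUS
    if task_kind in FREE_TASKS and not needs_tools_or_multistep:
        return GLM
    return SONNET  # default for nuanced reasoning, code, or anything unclassified


choice = pick_model("summarization")
```

Falling back to Sonnet for unclassified tasks errs on the side of quality; a stricter quota policy could fall back to the free tier instead.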
4c. Quality Verification Strategy
Before switching council and subagents to GLM-4.7, run a quality comparison:
- Same-topic test: Run the personality council on a topic we've already tested with Sonnet, but using GLM-4.7 for advisors. Compare output quality side by side.
- Structured output test: Verify GLM-4.7 follows prompt templates correctly (word count guidance, section headers, staying in role).
- Scoring rubric:
- Does the advisor stay in character? (yes/no)
- Is the output substantive (not generic platitudes)? (1-5)
- Does it follow word count guidance? (within 50% of target)
- Does it reference specific aspects of the topic? (1-5)
- Minimum quality bar: If GLM-4.7 scores ≥3.5/5 average on the rubric, it's good enough for advisor roles. Referee always stays on Sonnet+.
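The rubric mixes yes/no checks with 1-5 scores, so the 3.5/5 bar needs an aggregation rule. The sketch below assumes one reasonable reading: the two 1-5 items are averaged, while staying in character and word count act as pass/fail gates. That interpretation, along with the class and field names, is an assumption and not something the plan specifies.

```python
from dataclasses import dataclass


@dataclass
class AdvisorScore:
    in_character: bool  # does the advisor stay in character?
    substance: int      # 1-5, substantive rather than generic
    specificity: int    # 1-5, references specific aspects of the topic
    word_count: int     # actual length of the output
    target_words: int   # length requested in the prompt


def passes_quality_bar(score: AdvisorScore, threshold: float = 3.5) -> bool:
    """Apply the rubric: gates on character and length, average on the 1-5 items."""
    within_length = abs(score.word_count - score.target_words) <= 0.5 * score.target_words
    average = (score.substance + score.specificity) / 2
    return score.in_character and within_length and average >= threshold


# Hypothetical scores for one GLM-4.7 advisor run.
sample = AdvisorScore(in_character=True, substance=4, specificity=3,
                      word_count=320, target_words=300)
result = passes_quality_bar(sample)
```

Scoring several advisor outputs per model and comparing pass rates would give a steadier signal than a single run.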
4d. Prompt Engineering for Cheaper Models
Cheaper models need tighter prompts to maintain quality. Key techniques:
- Be more explicit about output format: Include examples, not just descriptions
- Constrain output length more tightly: "Respond in exactly 3 paragraphs" vs "200-400 words"
- Use structured output requests: Ask for numbered lists, specific headers
- Front-load the most important instruction: Put the role and constraint first, context second
- Include a quality check instruction: "Before responding, verify your output matches the requested format"
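These techniques can be combined in a simple prompt builder. The following is a sketch under stated assumptions: the function name, argument names, and exact wording are hypothetical, and the actual council prompts in references/prompts.md may be structured differently.

```python
def build_advisor_prompt(role: str, constraint: str, example: str, context: str) -> str:
    """Assemble a prompt with role and constraint first, context second."""
    sections = [
        f"You are the {role}.",                              # front-load the role
        f"Constraint: {constraint}",                         # tight length/format rule
        f"Example of the expected format:\n{example}",       # show, don't just describe
        f"Context:\n{context}",                              # background comes after the rules
        "Before responding, verify your output matches the requested format.",
    ]
    return "\n\n".join(sections)


prompt = build_advisor_prompt(
    role="Pragmatist",
    constraint="Respond in exactly 3 paragraphs.",
    example="1. Position\n2. Main risk\n3. Recommendation",
    context="Topic under discussion goes here.",
)
```

Keeping the builder in one function makes it easy to A/B the tighter variant against the existing prompts during the quality test.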
Implementation Order
Step 1: Config changes (Phases 1-3) — Do together, single commit
Apply all three config changes to ~/.openclaw/openclaw.json:
- cacheRetention: "long" on Claude models
- heartbeat.every: "25m" for main, "55m" default
- contextPruning.mode: "cache-ttl" with ttl: "1h"
Restart gateway: openclaw gateway restart
Verify with /status and /usage full over next few interactions.
Step 2: Quality test GLM-4.7 for subagent work
Run a single council advisor (e.g., Pragmatist) on a known topic using model: "zai/glm-4.7" in sessions_spawn. Compare output quality against the Sonnet run we already have saved.
Step 3: Update council skill for model tiers
If GLM-4.7 passes quality bar, update skills/council/SKILL.md and scripts/council.sh with the model tier routing table. Update references/prompts.md with tighter prompt variants for cheaper models if needed.
Step 4: Update AGENTS.md with subagent routing guidelines
Add a section documenting when to use which model tier for subagents, so the convention is followed consistently.
Step 5: Monitor and tune
- Track cache hit rates over 1-2 days
- Monitor if context pruning causes any information loss
- Adjust heartbeat timing if cache misses are too frequent
- Tune GLM-4.7 prompts based on observed output quality
What This Does NOT Change
- No OpenClaw code changes: Everything is config-only in openclaw.json
- No upstream divergence: All settings use documented OpenClaw config knobs
- No new infrastructure: No proxy servers, routers, or middleware
- Main session stays on Opus: Only subagents move to cheaper models
- Fully reversible: Remove the config keys to revert to current behavior
Expected Combined Impact
| Optimization | Estimated Savings | Confidence |
|---|---|---|
| Prompt caching | 40-60% input token reduction | High |
| Cache warming via heartbeat | Maintains cache savings across idle | High |
| Context pruning | 20-30% context size reduction for long sessions | Medium |
| Subagent model routing | 60-80% reduction in subagent cost (free model for bulk work) | Medium (pending quality test) |
Combined: Significant reduction in Copilot quota burn. Main session quality unchanged. Subagent quality maintained through tighter prompts + quality verification.