swarm-zap/memory/plans/inference-cost-optimization.md
zap 23782735a1 docs(cost): update plan with corrections from official docs
- Phase 1: clarify cacheRetention only applies to Claude models; GPT auto-caches; GLM has none
- Phase 1: add TTL reality check (short=5min, long=1h) and implications for heartbeat timing
- Phase 2: explain why long TTL + 25m heartbeat is the right combo
- Phase 4: replace generic prompt tips with model-specific guidance from official Anthropic/OpenAI docs
- Added prompt structure notes for cache efficiency, GLM-4.7 tighter prompting requirements
- References: memory/references/*.md
2026-03-05 20:37:32 +00:00


Inference Cost Optimization Plan

Goal: Reduce LLM inference costs without quality loss using OpenClaw's built-in configuration knobs + smarter subagent model selection. No code changes to OpenClaw — config-only, fully upstream-compatible.

Date: 2026-03-05
Status: Planning


Current State

| Item | Value |
|---|---|
| Main session model | litellm/copilot-claude-opus-4.6 (via GitHub Copilot) |
| Default agent model | litellm/copilot-claude-sonnet-4.6 |
| Prompt caching | NOT SET (no cacheRetention configured) |
| Context pruning | NOT SET (no contextPruning configured) |
| Heartbeat | 30m (main agent only) |
| Subagent model | Inherits session model (expensive!) |
| Free models available | zai/glm-4.7, zai/glm-4.7-flash, zai/glm-4.7-flashx, zai/glm-5 (all $0) |
| Copilot models | Flat-rate via GitHub Copilot subscription (effectively $0 marginal cost per token) |

Cost Structure

  • Copilot models (litellm/copilot-*): Covered by GitHub Copilot subscription — no per-token cost, but subject to rate limits and quotas. Using Opus when Sonnet suffices wastes quota.
  • ZAI models (zai/glm-*): Free tier, no per-token cost. Quality varies by task type.
  • The real "cost" is: (a) Copilot quota burn on expensive models, (b) latency, (c) quality risk on cheaper models.

Phase 1: Enable Prompt Caching

What: Configure cacheRetention on Anthropic-backed models so repeated system prompts and stable context get cached by the provider.

Why: Our system prompt (AGENTS.md + SOUL.md + USER.md + TOOLS.md + IDENTITY.md + HEARTBEAT.md + skills list) is large and mostly static. Without caching, every turn reprocesses ~15-20k tokens of identical prefix. With caching, subsequent turns pay ~10% for cached tokens (Anthropic pricing).

Applies only to Claude-backed models. GPT, GLM, and Gemini models do NOT support Anthropic's cacheRetention mechanism. OpenAI caching is automatic (no config needed). ZAI/GLM has no caching mechanism.

Cache TTL reality check (from official Anthropic docs):

  • short = 5-minute TTL (the default; refreshed on each use within the window)
  • long = 1-hour TTL (at higher write cost: 2x base input price vs 1.25x for short)
  • Cache reads cost 0.1x (10%) of base input — so a cache hit on a 15k-token system prompt costs 90% less
  • First turn writes cache (slightly more expensive), subsequent turns read it (very cheap)
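
To sanity-check those multipliers, a rough sketch (the 15k-token prefix size is the estimate from the Why section above; multipliers are from the Anthropic pricing quoted here):

```python
# Relative input cost of the static prefix over a session.
# Assumptions: hypothetical 15k-token static prefix; multipliers from
# the Anthropic docs quoted above.
PROMPT_TOKENS = 15_000
WRITE_LONG = 2.0    # 1h-TTL cache write: 2x base input price
WRITE_SHORT = 1.25  # 5m-TTL cache write: 1.25x base input price
READ = 0.1          # cache read: 0.1x base input price

def session_cost(turns: int, write_mult: float) -> float:
    """First turn writes the cache; remaining turns read it."""
    return PROMPT_TOKENS * (write_mult + (turns - 1) * READ)

def no_cache(turns: int) -> float:
    return PROMPT_TOKENS * turns

for turns in (2, 3, 5, 10):
    saving = 1 - session_cost(turns, WRITE_LONG) / no_cache(turns)
    print(f"{turns} turns: {saving:+.0%}")
```

Long caching is actually a small loss on a 2-turn session (the 2x write isn't amortized yet), breaks even around turn 3, and passes 50% savings by turn 5, consistent with the 40-60% estimate at the end of this phase.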

Implication: With 5-minute default TTL and a 30-minute heartbeat, the cache expires between every heartbeat. Either:

  1. Use long (1h TTL) and set heartbeat to 55m to keep warm — best for cost savings
  2. Use short (5m TTL) with no heartbeat adjustment — cache only helps within active bursts

Recommendation: Use long on main session Claude models + 25m heartbeat (well within 1h). Cache writes are slightly more expensive but the read savings dominate for any active session.

Config change (~/.openclaw/openclaw.json):

{
  "agents": {
    "defaults": {
      "models": {
        "litellm/copilot-claude-opus-4.6": {
          "params": {
            "cacheRetention": "long"
          }
        },
        "litellm/copilot-claude-sonnet-4.6": {
          "params": {
            "cacheRetention": "long"
          }
        },
        "litellm/copilot-claude-opus-4.5": {
          "params": {
            "cacheRetention": "long"
          }
        },
        "litellm/copilot-claude-sonnet-4.5": {
          "params": {
            "cacheRetention": "long"
          }
        },
        "litellm/copilot-claude-haiku-4.5": {
          "params": {
            "cacheRetention": "short"
          }
        }
      }
    }
  }
}

Note: No config needed for GPT models — OpenAI caches automatically for free on prompts ≥1024 tokens. No config available for ZAI/GLM models (no caching support).

Verification:

  1. After applying, check /status or /usage full for cacheRead vs cacheWrite tokens.
  2. Enable cache trace diagnostics temporarily:
    { "diagnostics": { "cacheTrace": { "enabled": true } } }
    
  3. First turn will show high cacheWrite (populating cache). Subsequent turns should show high cacheRead with much lower cacheWrite.
  4. Target: >60% cache hit rate within 2-3 turns of a session.

Risk: Zero. Caching doesn't change outputs — it's purely a provider-side optimization.

Expected impact: 40-60% reduction in input token processing cost for sessions with multiple turns.


Phase 2: Heartbeat Cache Warming

What: Align heartbeat interval to keep the 1-hour prompt cache warm across idle gaps.

Why: With cacheRetention: "long" (1h TTL), the cache expires after 1 hour of no activity. A heartbeat just under 1h ensures the cache is touched before it expires, so the next real interaction reads from cache instead of rewriting it. Our current 30m heartbeat already works, but 25m gives a safety margin.

Important: Heartbeat keep-warm only applies to Claude models. GPT/GLM models don't benefit — their caching is either automatic (OpenAI) or non-existent (ZAI).

Config change (~/.openclaw/openclaw.json):

{
  "agents": {
    "defaults": {
      "heartbeat": {
        "every": "55m"
      }
    },
    "list": [
      {
        "id": "main",
        "heartbeat": {
          "every": "25m"
        }
      }
    ]
  }
}

Rationale:

  • Main agent: keep at 25m (well within 1h TTL, ensures cache stays warm during active use)
  • Other agents (claude, codex, copilot, opencode): 55m default (just under 1h TTL, minimal quota burn when idle)
  • If an agent is rarely used, its heartbeat won't fire (disabled agents skip heartbeat)

Verification:

  1. After a 30-minute idle gap, check that the next interaction shows cacheRead (not all cacheWrite).
  2. Monitor heartbeat token cost via /usage full on a heartbeat response.

Risk: Low. Slightly more frequent heartbeat = slightly more baseline token usage, but the cache savings on real interactions outweigh this.

Expected impact: Maintains the Phase 1 cache savings across idle periods instead of losing them after TTL expiry.


Phase 3: Context Pruning

What: Enable cache-ttl context pruning so old tool results and conversation history get pruned after the cache window expires.

Why: Long sessions accumulate tool results, file reads, and old conversation turns that bloat the context. Without pruning, post-idle requests re-cache the entire oversized history. Cache-TTL pruning trims stale context so re-caching after idle is smaller and cheaper.

Config change (~/.openclaw/openclaw.json):

{
  "agents": {
    "defaults": {
      "contextPruning": {
        "mode": "cache-ttl",
        "ttl": "1h"
      }
    }
  }
}

Rationale:

  • cache-ttl mode: prunes old tool-result context after the cache TTL expires
  • ttl: "1h": matches Anthropic's long cache retention window
  • After 1h of no interaction, old tool results and conversation history are pruned, so the next request re-caches a smaller context
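
For intuition, a conceptual sketch of what cache-ttl pruning does (illustrative only; the message shape and logic here are not OpenClaw's actual implementation):

```python
# Conceptual sketch: drop tool results older than the cache TTL, keep
# everything else. Message shape is illustrative, not OpenClaw's.
import time

TTL_SECONDS = 3600  # matches "ttl": "1h"

def prune(history: list[dict], now: float) -> list[dict]:
    return [
        msg for msg in history
        if msg["role"] != "tool" or now - msg["ts"] <= TTL_SECONDS
    ]

now = time.time()
history = [
    {"role": "user", "ts": now - 7200, "content": "read the config"},
    {"role": "tool", "ts": now - 7200, "content": "(5k-token file dump)"},
    {"role": "assistant", "ts": now - 7100, "content": "config summary"},
    {"role": "tool", "ts": now - 60, "content": "fresh tool output"},
]
print([m["role"] for m in prune(history, now)])  # stale tool dump dropped
```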

Verification:

  1. Use /context list or /context detail to check context size before and after pruning.
  2. After a >1h idle gap, verify the context window is smaller than before the gap.
  3. Ensure no critical context is lost — compaction summaries should preserve key information.

Risk: Low-medium. Pruning removes old tool results, which means the model can't reference exact earlier tool outputs after pruning. Compaction summaries mitigate this. Test by asking about earlier conversation after a pruning event.

Expected impact: 20-30% reduction in context size for long sessions, which both reduces input token cost and improves response quality (less noise in context).


Phase 4: Cheaper Models for Subagents

What: Route subagent tasks to cheaper models based on task complexity, with quality verification.

Why: Currently ALL subagents inherit the session model (Opus 4.6 or whatever the session is on). Most subagent tasks (council advisors, research queries, simple generation) don't need frontier-model quality. ZAI GLM-4.7 is free and handles many tasks well. Copilot Sonnet/Haiku are much cheaper quota-wise than Opus.

Model Tier Strategy

| Tier | Model | Use Case | Cost |
|---|---|---|---|
| Free | zai/glm-4.7 | Bulk subagent work: council advisors, brainstorming, summarization, classification | $0 |
| Free-fast | zai/glm-4.7-flash | Simple/short subagent tasks: acknowledgments, formatting, quick lookups | $0 |
| Cheap | litellm/copilot-claude-haiku-4.5 | Tasks needing Claude quality but not heavy reasoning | Low quota |
| Standard | litellm/copilot-claude-sonnet-4.6 | Tasks needing strong reasoning, code generation, analysis | Medium quota |
| Frontier | litellm/copilot-claude-opus-4.6 | Only for: main session, referee/meta-arbiter, critical decisions | High quota |

Implementation

4a. Council Skill — Default to GLM-4.7

Update council skill to use cheaper models by default:

| Council Role | Default Model | Override for tier=heavy |
|---|---|---|
| Personality advisors | zai/glm-4.7 | litellm/copilot-claude-sonnet-4.6 |
| D/P Freethinkers | zai/glm-4.7 | litellm/copilot-claude-sonnet-4.6 |
| D/P Arbiters | zai/glm-4.7 | litellm/copilot-claude-sonnet-4.6 |
| Referee / Meta-Arbiter | litellm/copilot-claude-sonnet-4.6 | litellm/copilot-claude-opus-4.6 |

When spawning subagents via sessions_spawn, pass the model parameter:

{
  "task": "...",
  "mode": "run",
  "label": "council-pragmatist",
  "model": "zai/glm-4.7"
}

4b. General Subagent Routing Guidelines

Encode these in AGENTS.md or a workspace convention file so all future subagent spawns follow the pattern:

Use zai/glm-4.7 (free) when:

  • Task is well-defined with clear constraints
  • Output format is specified in the prompt
  • Task is one of: summarization, brainstorming, classification, translation, formatting, simple Q&A
  • Task doesn't require tool use or complex multi-step reasoning

Use litellm/copilot-claude-sonnet-4.6 (standard) when:

  • Task requires nuanced reasoning or analysis
  • Task involves code generation or review
  • Output quality is user-facing and high-stakes
  • Task requires understanding subtle context

Use litellm/copilot-claude-opus-4.6 (frontier) when:

  • Main interactive session only
  • Final synthesis / referee / meta-arbiter roles
  • Tasks where the user explicitly asked for highest quality
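
The routing guidelines above, sketched as a decision function (the attribute names are illustrative conventions, not an OpenClaw API):

```python
# Model tier routing per the guidelines above. Attribute names are
# illustrative, not an OpenClaw API.
SIMPLE_KINDS = {"summarization", "brainstorming", "classification",
                "translation", "formatting", "simple-qa"}

def choose_subagent_model(kind: str, needs_tools: bool = False,
                          user_facing: bool = False,
                          referee: bool = False) -> str:
    if referee:
        # Final synthesis / referee / meta-arbiter stays on frontier
        return "litellm/copilot-claude-opus-4.6"
    if kind in SIMPLE_KINDS and not needs_tools and not user_facing:
        return "zai/glm-4.7"  # free tier handles well-defined bulk work
    return "litellm/copilot-claude-sonnet-4.6"  # standard default

print(choose_subagent_model("summarization"))
print(choose_subagent_model("code-review", user_facing=True))
```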

Subagent Model Notes

  • When spawning subagents for Claude tasks, caching applies if the subagent model is also a Claude model. But subagents are typically short-lived (single-turn mode=run), so caching benefit is minimal — they don't accumulate conversation history.
  • The main caching win is in the main session, which has a large, growing context across many turns.
  • For GLM-4.7 subagents: no caching benefit, but no cost either ($0 model). Prompts must be self-contained and tightly framed.
  • For GPT subagents: OpenAI caches automatically if prompt ≥1024 tokens, no action needed.

Prompt Structure for Maximum Cache Efficiency (Claude models)

Per official Anthropic best practices:

  • Static first, dynamic last: System prompt, role, instructions → then topic/task (dynamic part).
  • This structure is what OpenClaw already does (system prompt built once, user message varies).
  • OpenClaw handles the cache_control injection automatically via cacheRetention config.
  • Our prompts already follow this structure correctly.
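
A minimal illustration of static-first ordering for repeated council spawns (the advisor text is hypothetical; the XML tags follow Anthropic's prompting guidance):

```python
# Static-first, dynamic-last: everything before <topic> is byte-identical
# across spawns, so it is cacheable; only the topic varies.
STATIC_PREFIX = (
    "You are the Pragmatist advisor on a council.\n"
    "<instructions>\n"
    "Respond in 3 paragraphs of flowing prose.\n"
    "Stay in character; reference specific aspects of the topic.\n"
    "</instructions>\n"
)

def build_prompt(topic: str) -> str:
    return STATIC_PREFIX + f"<topic>\n{topic}\n</topic>\n"

p1 = build_prompt("Should we enable context pruning?")
p2 = build_prompt("Which models for subagents?")
print(p1.startswith(STATIC_PREFIX) and p2.startswith(STATIC_PREFIX))  # True
```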

4c. GLM-4.7 Quality Verification

Before switching council and subagents to GLM-4.7, run a quality comparison:

  1. Same-topic test: Run the personality council on a topic we've already tested with Sonnet, but using GLM-4.7 for advisors. Compare output quality side by side.
  2. Structured output test: Verify GLM-4.7 follows prompt templates correctly (word-count guidance, section headers, staying in role).
  3. Scoring rubric:
    • Does the advisor stay in character? (yes/no)
    • Is the output substantive (not generic platitudes)? (1-5)
    • Does it follow word count guidance? (within 50% of target)
    • Does it reference specific aspects of the topic? (1-5)
  4. Minimum quality bar: If GLM-4.7 scores ≥3.5/5 average on the rubric, it's good enough for advisor roles. Referee always stays on Sonnet+.
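
One way to turn the rubric into a single number (mapping the two yes/no checks to 5/0 points is our assumption; the plan only fixes the 3.5/5 bar):

```python
# Rubric scoring sketch. The 5/0 mapping for boolean checks is an
# assumption, not specified by the plan.
def rubric_score(in_character: bool, substantive: int,
                 word_count_ok: bool, topic_specific: int) -> float:
    items = [5 if in_character else 0, substantive,
             5 if word_count_ok else 0, topic_specific]
    return sum(items) / len(items)

def passes_bar(score: float) -> bool:
    return score >= 3.5

print(passes_bar(rubric_score(True, 4, True, 3)))   # 4.25 -> True
print(passes_bar(rubric_score(False, 3, True, 3)))  # 2.75 -> False
```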

4d. Prompt Engineering Per Model — Official Best Practices

From official Anthropic and OpenAI docs (see memory/references/):

For Claude models (Haiku, Sonnet — subagent advisors)

  • Give an explicit role in the system prompt: You are the Skeptic advisor on a council...
  • Use XML tags to separate role, instructions, context, topic: <instructions>, <context>, <topic>
  • Put static instructions first, variable topic at the end (maximizes cache hit rate on repeated spawns)
  • 3-5 <example> tags for structured output formats
  • Phrase instructions positively ("Write in flowing prose") rather than negatively ("Don't use bullet points")

For GLM-4.7 (free tier subagents)

  • Be MORE explicit than with Claude — GLM needs tighter constraints
  • Constrain output length tightly: "Respond in exactly 3 paragraphs" not "200-400 words"
  • Use numbered lists or explicit section headers in the prompt
  • Front-load the most critical instruction (role + constraint first, context second)
  • Include a format-check reminder: "Before responding, verify your output matches the format above"
  • Request structured output over open-ended generation when possible
  • Avoid complex multi-step reasoning chains — GLM handles simpler, well-defined tasks best

For GPT models (gpt-5-mini, gpt-4.1 subagents)

  • Include explicit step-by-step instructions (GPT benefits from "think step by step" guidance)
  • Use response_format: json_schema for any scored/structured output — eliminates format retries entirely
  • Use developer role for system/role instructions (higher priority than user)
  • Don't over-specify for reasoning models (o3, o4-mini) — they reason internally
  • Pin to specific model snapshots if quality consistency matters (gpt-4.1-2025-04-14)
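
For the json_schema point, a minimal response_format fragment in the shape OpenAI's structured-outputs API expects (the schema fields here are illustrative):

```json
{
  "response_format": {
    "type": "json_schema",
    "json_schema": {
      "name": "advisor_score",
      "strict": true,
      "schema": {
        "type": "object",
        "properties": {
          "in_character": { "type": "boolean" },
          "substantive": { "type": "integer" }
        },
        "required": ["in_character", "substantive"],
        "additionalProperties": false
      }
    }
  }
}
```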

Implementation Order

Step 1: Config changes (Phases 1-3) — Do together, single commit

Apply all three config changes to ~/.openclaw/openclaw.json:

  • cacheRetention: "long" on Claude models
  • heartbeat.every: "25m" for main, "55m" default
  • contextPruning.mode: "cache-ttl" with ttl: "1h"
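
Merged, Phases 1-3 are a single config delta (model list abridged to two entries here; Phase 1 lists the full set):

```json
{
  "agents": {
    "defaults": {
      "models": {
        "litellm/copilot-claude-opus-4.6": { "params": { "cacheRetention": "long" } },
        "litellm/copilot-claude-sonnet-4.6": { "params": { "cacheRetention": "long" } }
      },
      "heartbeat": { "every": "55m" },
      "contextPruning": { "mode": "cache-ttl", "ttl": "1h" }
    },
    "list": [
      { "id": "main", "heartbeat": { "every": "25m" } }
    ]
  }
}
```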

Restart gateway: openclaw gateway restart

Verify with /status and /usage full over next few interactions.

Step 2: Quality test GLM-4.7 for subagent work

Run a single council advisor (e.g., Pragmatist) on a known topic using model: "zai/glm-4.7" in sessions_spawn. Compare output quality against the Sonnet run we already have saved.

Step 3: Update council skill for model tiers

If GLM-4.7 passes quality bar, update skills/council/SKILL.md and scripts/council.sh with the model tier routing table. Update references/prompts.md with tighter prompt variants for cheaper models if needed.

Step 4: Update AGENTS.md with subagent routing guidelines

Add a section documenting when to use which model tier for subagents, so the convention is followed consistently.

Step 5: Monitor and tune

  • Track cache hit rates over 1-2 days
  • Monitor if context pruning causes any information loss
  • Adjust heartbeat timing if cache misses are too frequent
  • Tune GLM-4.7 prompts based on observed output quality

Upstream Safety Rules

These are hard constraints. Any implementation that violates them is out of scope.

Never do

  • Edit files under ~/.npm-global/lib/node_modules/openclaw/ directly (dist, src, docs)
  • Patch or monkey-patch OpenClaw's runtime code, even for emergencies (exception: the existing TUI patch has a tracked upstream PR — document any new ones immediately)
  • Add config keys not documented in OpenClaw's own docs (guessing at undocumented keys can silently break on upgrade)
  • Modify ~/.openclaw/openclaw.json in a way that would be overwritten or invalidated by openclaw update
  • Introduce any middleware, proxy, or hook that intercepts OpenClaw's internal request path

Safe to do

  • Edit ~/.openclaw/openclaw.json using documented config knobs (agents, models, diagnostics, contextPruning, etc.)
  • Add/edit workspace files (~/.openclaw/workspace/) freely — these are never touched by OpenClaw updates
  • Install/update skills via clawhub — skills are workspace-local
  • Run openclaw gateway restart after config changes
  • Use openclaw update status / scripts/openclaw-update-safe.sh to check for upstream updates

Checking before applying

Before implementing any config change:

  1. Verify the key exists in /home/openclaw/.npm-global/lib/node_modules/openclaw/docs/ or https://docs.openclaw.ai
  2. If undocumented: skip it or open a question/issue — don't guess
  3. After openclaw update, re-verify config keys still work (check gateway logs for config parse errors)

Update workflow

# Before updating OpenClaw
openclaw update status           # check what version is available
# Review changelog for breaking config changes
openclaw update                  # update (safe scripts handle local compat)
openclaw gateway restart         # restart to pick up new version
# Verify gateway health + session model still resolves correctly

What This Does NOT Change

  • No OpenClaw code changes: Everything is config-only in openclaw.json
  • No upstream divergence: All settings use documented OpenClaw config knobs
  • No new infrastructure: No proxy servers, routers, or middleware
  • Main session stays on Opus: Only subagents move to cheaper models
  • Fully reversible: Remove the config keys to revert to current behavior

Expected Combined Impact

| Optimization | Estimated Savings | Confidence |
|---|---|---|
| Prompt caching | 40-60% input token reduction | High |
| Cache warming via heartbeat | Maintains cache savings across idle | High |
| Context pruning | 20-30% context size reduction for long sessions | Medium |
| Subagent model routing | 60-80% subagent cost reduction (free model for bulk work) | Medium (pending quality test) |

Combined: Significant reduction in Copilot quota burn. Main session quality unchanged. Subagent quality maintained through tighter prompts + quality verification.