
zap 6642964ae6 docs(cost): add inference cost optimization plan — 4 phases

Phase 1: Enable prompt caching (cacheRetention: long on Claude models)
Phase 2: Heartbeat cache warming (25m main, 55m default)
Phase 3: Context pruning (cache-ttl mode, 1h TTL)
Phase 4: Cheaper models for subagents (GLM-4.7 free tier for bulk work)

All config-only, no OpenClaw code changes, fully reversible.

2026-03-05 20:20:03 +00:00


Inference Cost Optimization Plan

Goal: Reduce LLM inference costs without quality loss using OpenClaw's built-in configuration knobs + smarter subagent model selection. No code changes to OpenClaw — config-only, fully upstream-compatible.

Date: 2026-03-05 Status: Planning

Current State

Item	Value
Main session model	`litellm/copilot-claude-opus-4.6` (via GitHub Copilot)
Default agent model	`litellm/copilot-claude-sonnet-4.6`
Prompt caching	NOT SET (no `cacheRetention` configured)
Context pruning	NOT SET (no `contextPruning` configured)
Heartbeat	30m (main agent only)
Subagent model	Inherits session model (expensive!)
Free models available	`zai/glm-4.7`, `zai/glm-4.7-flash`, `zai/glm-4.7-flashx`, `zai/glm-5` (all $0)
Copilot models	Flat-rate via GitHub Copilot subscription (effectively $0 marginal cost per token)

Cost Structure

Copilot models (litellm/copilot-*): Covered by GitHub Copilot subscription — no per-token cost, but subject to rate limits and quotas. Using Opus when Sonnet suffices wastes quota.
ZAI models (zai/glm-*): Free tier, no per-token cost. Quality varies by task type.
The real "cost" is: (a) Copilot quota burn on expensive models, (b) latency, (c) quality risk on cheaper models.

Phase 1: Enable Prompt Caching

What: Configure cacheRetention on Anthropic-backed models so repeated system prompts and stable context get cached by the provider.

Why: Our system prompt (AGENTS.md + SOUL.md + USER.md + TOOLS.md + IDENTITY.md + HEARTBEAT.md + skills list) is large and mostly static. Without caching, every turn reprocesses ~15-20k tokens of identical prefix. With caching, subsequent turns pay ~10% for cached tokens (Anthropic pricing).

Config change (~/.openclaw/openclaw.json):

{
  "agents": {
    "defaults": {
      "models": {
        "litellm/copilot-claude-opus-4.6": {
          "params": {
            "cacheRetention": "long"
          }
        },
        "litellm/copilot-claude-sonnet-4.6": {
          "params": {
            "cacheRetention": "long"
          }
        },
        "litellm/copilot-claude-opus-4.5": {
          "params": {
            "cacheRetention": "long"
          }
        },
        "litellm/copilot-claude-sonnet-4.5": {
          "params": {
            "cacheRetention": "long"
          }
        },
        "litellm/copilot-claude-haiku-4.5": {
          "params": {
            "cacheRetention": "short"
          }
        }
      }
    }
  }
}

Verification:

After applying, check /status or /usage full for cacheRead vs cacheWrite tokens.

Enable cache trace diagnostics temporarily:

{ "diagnostics": { "cacheTrace": { "enabled": true } } }

First turn will show high cacheWrite (populating cache). Subsequent turns should show high cacheRead with much lower cacheWrite.
Target: >60% cache hit rate within 2-3 turns of a session.
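The hit-rate target above can be checked with a few lines of arithmetic. This is a minimal sketch: the definition of "hit rate" (cached reads divided by all prompt tokens) is an assumption, as are the `input` field name and the sample counters, which stand in for numbers copied by hand from /usage full.

```python
def cache_hit_rate(cache_read: int, cache_write: int, uncached_input: int = 0) -> float:
    """Fraction of prompt tokens served from cache (assumed definition)."""
    total = cache_read + cache_write + uncached_input
    if total == 0:
        return 0.0
    return cache_read / total


# Hypothetical per-turn counters; only cacheRead/cacheWrite are named in the plan.
turns = [
    {"cacheRead": 0, "cacheWrite": 18_000, "input": 400},      # turn 1 populates the cache
    {"cacheRead": 18_000, "cacheWrite": 900, "input": 350},    # turn 2 reuses the prefix
    {"cacheRead": 18_900, "cacheWrite": 1_200, "input": 500},  # turn 3
]

session_rate = cache_hit_rate(
    sum(t["cacheRead"] for t in turns),
    sum(t["cacheWrite"] for t in turns),
    sum(t["input"] for t in turns),
)
print(f"session cache hit rate: {session_rate:.0%}")
print("meets >60% target:", session_rate > 0.60)
```

Computing the rate over the whole session rather than per turn keeps the first, write-heavy turn from looking like a failure on its own.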

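Risk: Effectively zero for output quality. Caching doesn't change outputs; it is purely a provider-side optimization. The one cost caveat is that cache writes are priced above uncached input on Anthropic's price list, so a single-turn session gains nothing from caching.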

Expected impact: 40-60% reduction in input token processing cost for sessions with multiple turns.
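To see how an estimate in this range can arise, here is a rough cost model. It is a sketch under stated assumptions, not a measurement: the 10% read multiplier comes from the figure quoted above, while the 2x write multiplier, the 18k-token prefix, and the fixed 12k tokens of uncached content per turn are illustrative values chosen for the example.

```python
def uncached_cost(prefix_tokens: int, dynamic_tokens: int, turns: int) -> float:
    """Input cost in base-price token units when nothing is cached."""
    return float(turns * (prefix_tokens + dynamic_tokens))


def cached_cost(prefix_tokens: int, dynamic_tokens: int, turns: int,
                read_mult: float = 0.10, write_mult: float = 2.0) -> float:
    """Input cost when the static prefix is written once and then read from cache.

    read_mult and write_mult are assumptions; check the provider's current pricing.
    """
    if turns < 1:
        raise ValueError("turns must be at least 1")
    first_turn = prefix_tokens * write_mult + dynamic_tokens
    later_turns = (turns - 1) * (prefix_tokens * read_mult + dynamic_tokens)
    return first_turn + later_turns


def savings_fraction(prefix_tokens: int, dynamic_tokens: int, turns: int) -> float:
    """Share of input cost saved by caching (negative means caching cost more)."""
    baseline = uncached_cost(prefix_tokens, dynamic_tokens, turns)
    return 1.0 - cached_cost(prefix_tokens, dynamic_tokens, turns) / baseline


for n in (1, 3, 10):
    print(f"{n:>2} turns: {savings_fraction(18_000, 12_000, n):+.1%}")
```

Under these assumptions a ten-turn session saves a little over 40%, while a one-turn session costs more than it would without caching, which is why the estimate is scoped to sessions with multiple turns.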

Phase 2: Heartbeat Cache Warming

What: Align heartbeat interval to keep the prompt cache warm across idle gaps.

Why: Anthropic's long cache retention is ~1 hour TTL. Our current heartbeat is 30m, which is already well under the TTL — good. But we should ensure the heartbeat is a lightweight keep-warm that doesn't generate expensive cache writes.

Config change (~/.openclaw/openclaw.json):

{
  "agents": {
    "defaults": {
      "heartbeat": {
        "every": "55m"
      }
    },
    "list": [
      {
        "id": "main",
        "heartbeat": {
          "every": "25m"
        }
      }
    ]
  }
}

Rationale:

Main agent: keep at 25m (well within 1h TTL, ensures cache stays warm during active use)
Other agents (claude, codex, copilot, opencode): 55m default (just under 1h TTL, minimal quota burn when idle)
Disabled agents skip the heartbeat entirely, so an agent that is switched off adds no heartbeat cost
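The interval choices above reduce to a comparison between the heartbeat period and the cache TTL. The helper names below are hypothetical, and the parser only handles the simple `25m` / `1h` style strings used in this plan; OpenClaw's own duration parsing may accept more.

```python
import re

_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}


def parse_duration(text: str) -> int:
    """Convert strings such as '25m' or '1h' to seconds (simple forms only)."""
    match = re.fullmatch(r"(\d+)([smh])", text.strip())
    if not match:
        raise ValueError(f"unsupported duration: {text!r}")
    return int(match.group(1)) * _UNIT_SECONDS[match.group(2)]


def keeps_cache_warm(every: str, ttl: str) -> bool:
    """True when each heartbeat lands before the cache entry would expire."""
    return parse_duration(every) < parse_duration(ttl)


def heartbeats_per_day(every: str) -> float:
    """Baseline number of heartbeat turns per day for a given interval."""
    return 86_400 / parse_duration(every)


for interval in ("25m", "30m", "55m"):
    print(interval, keeps_cache_warm(interval, "1h"), round(heartbeats_per_day(interval), 1))
```

Moving the main agent from 30m to 25m raises the baseline from 48 to about 57.6 heartbeats per day. The same check also shows that a model on a much shorter retention window (for example a few minutes, if that is what `short` means) would not be kept warm by either interval.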

Verification:

After a 30-minute idle gap, check that the next interaction shows cacheRead (not all cacheWrite).
Monitor heartbeat token cost via /usage full on a heartbeat response.

Risk: Low. Slightly more frequent heartbeat = slightly more baseline token usage, but the cache savings on real interactions outweigh this.

Expected impact: Maintains the Phase 1 cache savings across idle periods instead of losing them after TTL expiry.

Phase 3: Context Pruning

What: Enable cache-ttl context pruning so old tool results and conversation history get pruned after the cache window expires.

Why: Long sessions accumulate tool results, file reads, and old conversation turns that bloat the context. Without pruning, post-idle requests re-cache the entire oversized history. Cache-TTL pruning trims stale context so re-caching after idle is smaller and cheaper.

Config change (~/.openclaw/openclaw.json):

{
  "agents": {
    "defaults": {
      "contextPruning": {
        "mode": "cache-ttl",
        "ttl": "1h"
      }
    }
  }
}

Rationale:

cache-ttl mode: prunes old tool-result context after the cache TTL expires
ttl: "1h": matches Anthropic's long cache retention window
After 1h of no interaction, old tool results and conversation history are pruned, so the next request re-caches a smaller context
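The idea behind the rationale above can be illustrated with a toy model. This is not OpenClaw's implementation: the `Message` type, the role labels, and the placeholder text are all hypothetical, and the sketch only touches tool results (it does not model any trimming of conversation turns).

```python
from dataclasses import dataclass


@dataclass
class Message:
    role: str         # "user", "assistant", or "tool_result" (hypothetical labels)
    text: str
    timestamp: float  # seconds since some reference point


def prune_stale_tool_results(messages, now, ttl_seconds=3600.0,
                             placeholder="[tool result pruned]"):
    """Return a copy of the history with tool results blanked after a long idle gap.

    If the last activity is within the TTL, the cache is assumed to be warm and
    the history is returned unchanged so the cached prefix still matches.
    """
    if not messages:
        return []
    last_activity = max(m.timestamp for m in messages)
    if now - last_activity <= ttl_seconds:
        return list(messages)
    return [
        Message(m.role, placeholder, m.timestamp) if m.role == "tool_result" else m
        for m in messages
    ]


history = [
    Message("user", "read the gateway config", 0.0),
    Message("tool_result", "<several thousand tokens of file content>", 10.0),
    Message("assistant", "The heartbeat is set to 30m.", 20.0),
]
after_idle = prune_stale_tool_results(history, now=20.0 + 2 * 3600)
print([m.text for m in after_idle])
```

Pruning only once the TTL has lapsed matters: rewriting history while the cache is still warm would invalidate the cached prefix and force a fresh cache write.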

Verification:

Use /context list or /context detail to check context size before and after pruning.
After a >1h idle gap, verify the context window is smaller than before the gap.
Ensure no critical context is lost — compaction summaries should preserve key information.

Risk: Low-medium. Pruning removes old tool results, which means the model can't reference exact earlier tool outputs after pruning. Compaction summaries mitigate this. Test by asking about earlier conversation after a pruning event.

Expected impact: 20-30% reduction in context size for long sessions, which both reduces input token cost and improves response quality (less noise in context).

Phase 4: Cheaper Models for Subagents

What: Route subagent tasks to cheaper models based on task complexity, with quality verification.

Why: Currently ALL subagents inherit the session model (Opus 4.6 or whatever the session is on). Most subagent tasks (council advisors, research queries, simple generation) don't need frontier-model quality. ZAI GLM-4.7 is free and handles many tasks well. Copilot Sonnet/Haiku are much cheaper quota-wise than Opus.

Model Tier Strategy

Tier	Model	Use Case	Cost
Free	`zai/glm-4.7`	Bulk subagent work: council advisors, brainstorming, summarization, classification	$0
Free-fast	`zai/glm-4.7-flash`	Simple/short subagent tasks: acknowledgments, formatting, quick lookups	$0
Cheap	`litellm/copilot-claude-haiku-4.5`	Tasks needing Claude quality but not heavy reasoning	Low quota
Standard	`litellm/copilot-claude-sonnet-4.6`	Tasks needing strong reasoning, code generation, analysis	Medium quota
Frontier	`litellm/copilot-claude-opus-4.6`	Only for: main session, referee/meta-arbiter, critical decisions	High quota

Implementation

4a. Council Skill — Default to GLM-4.7

Update council skill to use cheaper models by default:

Council Role	Default Model	Override for `tier=heavy`
Personality advisors	`zai/glm-4.7`	`litellm/copilot-claude-sonnet-4.6`
D/P Freethinkers	`zai/glm-4.7`	`litellm/copilot-claude-sonnet-4.6`
D/P Arbiters	`zai/glm-4.7`	`litellm/copilot-claude-sonnet-4.6`
Referee / Meta-Arbiter	`litellm/copilot-claude-sonnet-4.6`	`litellm/copilot-claude-opus-4.6`

When spawning subagents via sessions_spawn, pass the model parameter:

{
  "task": "...",
  "mode": "run",
  "label": "council-pragmatist",
  "model": "zai/glm-4.7"
}
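The council table and the payload shape above can be combined into a small lookup. This is a sketch only: the role keys and the `spawn_payload` helper are hypothetical names, and the payload fields are the four shown in the example (task, mode, label, model), not a complete description of sessions_spawn.

```python
import json

SONNET = "litellm/copilot-claude-sonnet-4.6"
OPUS = "litellm/copilot-claude-opus-4.6"
GLM = "zai/glm-4.7"

# role -> (default model, model used when tier == "heavy"); role keys are hypothetical.
COUNCIL_MODELS = {
    "personality-advisor": (GLM, SONNET),
    "dp-freethinker": (GLM, SONNET),
    "dp-arbiter": (GLM, SONNET),
    "referee": (SONNET, OPUS),
}


def spawn_payload(role: str, task: str, label: str, tier: str = "default") -> dict:
    """Build the arguments for one subagent spawn, picking the model by role and tier."""
    default_model, heavy_model = COUNCIL_MODELS[role]
    return {
        "task": task,
        "mode": "run",
        "label": label,
        "model": heavy_model if tier == "heavy" else default_model,
    }


print(json.dumps(
    spawn_payload("personality-advisor", "...", "council-pragmatist"), indent=2))
```

Keeping the mapping in one table means a failed quality test can be handled by editing a single entry instead of every spawn call.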

4b. General Subagent Routing Guidelines

Encode these in AGENTS.md or a workspace convention file so all future subagent spawns follow the pattern:

Use zai/glm-4.7 (free) when:

Task is well-defined with clear constraints
Output format is specified in the prompt
Task is one of: summarization, brainstorming, classification, translation, formatting, simple Q&A
Task doesn't require tool use or complex multi-step reasoning

Use litellm/copilot-claude-sonnet-4.6 (standard) when:

Task requires nuanced reasoning or analysis
Task involves code generation or review
Output quality is user-facing and high-stakes
Task requires understanding subtle context

Use litellm/copilot-claude-opus-4.6 (frontier) when:

Main interactive session only
Final synthesis / referee / meta-arbiter roles
Tasks where the user explicitly asked for highest quality
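One way to make these guidelines checkable is to write them as a decision function. The sketch below is an interpretation, not an existing OpenClaw feature: the `SubagentTask` fields are hypothetical, and falling back to Sonnet whenever a task does not clearly qualify for the free tier is an assumption made here.

```python
from dataclasses import dataclass

GLM = "zai/glm-4.7"
SONNET = "litellm/copilot-claude-sonnet-4.6"
OPUS = "litellm/copilot-claude-opus-4.6"

SIMPLE_TASK_TYPES = {
    "summarization", "brainstorming", "classification",
    "translation", "formatting", "simple-qa",
}


@dataclass
class SubagentTask:
    task_type: str
    output_format_specified: bool = False
    needs_tools: bool = False
    needs_multistep_reasoning: bool = False
    user_facing_high_stakes: bool = False
    is_final_synthesis: bool = False          # referee / meta-arbiter roles
    user_requested_highest_quality: bool = False


def choose_model(task: SubagentTask) -> str:
    """Map a subagent task to a model tier following the guidelines above."""
    if task.is_final_synthesis or task.user_requested_highest_quality:
        return OPUS
    qualifies_for_free_tier = (
        task.task_type in SIMPLE_TASK_TYPES
        and task.output_format_specified
        and not task.needs_tools
        and not task.needs_multistep_reasoning
        and not task.user_facing_high_stakes
    )
    return GLM if qualifies_for_free_tier else SONNET


print(choose_model(SubagentTask("summarization", output_format_specified=True)))
print(choose_model(SubagentTask("code-review", needs_multistep_reasoning=True)))
```

The free tier is opt-in here: a task has to satisfy every condition to be routed to GLM-4.7, so an unclassified task degrades toward higher quality rather than lower.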

4c. Quality Verification Strategy

Before switching council and subagents to GLM-4.7, run a quality comparison:

Same-topic test: Run the personality council on a topic we've already tested with Sonnet, but using GLM-4.7 for advisors. Compare output quality side by side.
Structured output test: Verify GLM-4.7 follows prompt templates correctly (word count guidance, section headers, staying in role).
Scoring rubric:
- Does the advisor stay in character? (yes/no)
- Is the output substantive (not generic platitudes)? (1-5)
- Does it follow word count guidance? (within 50% of target)
- Does it reference specific aspects of the topic? (1-5)
Minimum quality bar: If GLM-4.7 scores ≥3.5/5 average on the rubric, it's good enough for advisor roles. Referee always stays on Sonnet+.
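The rubric mixes yes/no checks with 1-5 ratings, so the "average" needs a definition before it can be applied. The sketch below assumes one reading: the two 1-5 items are averaged, and the character and word-count checks act as hard gates. The type and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class AdvisorScore:
    stays_in_character: bool
    substantive: int         # 1-5
    topic_specificity: int   # 1-5
    word_count: int
    target_words: int


def within_length(score: AdvisorScore, tolerance: float = 0.5) -> bool:
    """True when the output length is within 50% of the target word count."""
    return abs(score.word_count - score.target_words) <= tolerance * score.target_words


def passes_quality_bar(scores: list[AdvisorScore], minimum: float = 3.5) -> bool:
    """Apply the gates to every run, then compare the mean rating to the bar."""
    if not scores:
        return False
    gates_ok = all(s.stays_in_character and within_length(s) for s in scores)
    ratings = [(s.substantive + s.topic_specificity) / 2 for s in scores]
    return gates_ok and sum(ratings) / len(ratings) >= minimum


runs = [
    AdvisorScore(True, substantive=4, topic_specificity=4, word_count=320, target_words=300),
    AdvisorScore(True, substantive=3, topic_specificity=4, word_count=280, target_words=300),
]
print("GLM-4.7 acceptable for advisor roles:", passes_quality_bar(runs))
```

The sample scores are made up for illustration; real values would come from the side-by-side comparison described above.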

4d. Prompt Engineering for Cheaper Models

Cheaper models need tighter prompts to maintain quality. Key techniques:

Be more explicit about output format: Include examples, not just descriptions
Constrain output length more tightly: "Respond in exactly 3 paragraphs" vs "200-400 words"
Use structured output requests: Ask for numbered lists, specific headers
Front-load the most important instruction: Put the role and constraint first, context second
Include a quality check instruction: "Before responding, verify your output matches the requested format"
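These techniques can be folded into a single prompt template. The function below is a hypothetical helper, not part of the council skill; it only fixes the ordering described above (role and constraint first, format and example next, context after that, quality check last).

```python
def build_tight_prompt(role: str, constraint: str, output_format: str,
                       example: str, context: str) -> str:
    """Assemble a prompt for a cheaper model with the key instruction up front."""
    sections = [
        f"You are {role}. {constraint}",
        f"Output format: {output_format}",
        f"Example of the expected shape:\n{example}",
        f"Context:\n{context}",
        "Before responding, verify your output matches the requested format.",
    ]
    return "\n\n".join(sections)


prompt = build_tight_prompt(
    role="the Pragmatist advisor",
    constraint="Respond in exactly 3 paragraphs.",
    output_format="three paragraphs under the headers Position, Risks, Recommendation",
    example="Position: ...\n\nRisks: ...\n\nRecommendation: ...",
    context="Topic under discussion: ...",
)
print(prompt)
```

Passing the example as its own argument makes it hard to omit, which matters because the first technique asks for examples rather than descriptions.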

Implementation Order

Step 1: Config changes (Phases 1-3) — Do together, single commit

Apply all three config changes to ~/.openclaw/openclaw.json:

cacheRetention: "long" on Claude models
heartbeat.every: "25m" for main, "55m" default
contextPruning.mode: "cache-ttl" with ttl: "1h"

Restart gateway: openclaw gateway restart

Verify with /status and /usage full over next few interactions.
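Since the three changes land in the same file, it can help to preview the merged result before editing it. The sketch below works on in-memory dictionaries only and does not read or write ~/.openclaw/openclaw.json; `deep_merge` is a hypothetical helper, and only two of the five model entries are repeated to keep it short.

```python
import json
from copy import deepcopy


def deep_merge(base: dict, patch: dict) -> dict:
    """Recursively merge patch into a copy of base. Lists are replaced, not merged."""
    merged = deepcopy(base)
    for key, value in patch.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


CACHING = {"agents": {"defaults": {"models": {
    "litellm/copilot-claude-opus-4.6": {"params": {"cacheRetention": "long"}},
    "litellm/copilot-claude-sonnet-4.6": {"params": {"cacheRetention": "long"}},
}}}}
HEARTBEAT = {"agents": {
    "defaults": {"heartbeat": {"every": "55m"}},
    "list": [{"id": "main", "heartbeat": {"every": "25m"}}],
}}
PRUNING = {"agents": {"defaults": {"contextPruning": {"mode": "cache-ttl", "ttl": "1h"}}}}

combined: dict = {}
for patch in (CACHING, HEARTBEAT, PRUNING):
    combined = deep_merge(combined, patch)

print(json.dumps(combined, indent=2))
```

Because lists are replaced wholesale, applying this to a real config that already has entries under agents.list would drop them; in that case the main agent's heartbeat is safer to edit by hand.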

Step 2: Quality test GLM-4.7 for subagent work

Run a single council advisor (e.g., Pragmatist) on a known topic using model: "zai/glm-4.7" in sessions_spawn. Compare output quality against the Sonnet run we already have saved.

Step 3: Update council skill for model tiers

If GLM-4.7 passes quality bar, update skills/council/SKILL.md and scripts/council.sh with the model tier routing table. Update references/prompts.md with tighter prompt variants for cheaper models if needed.

Step 4: Update AGENTS.md with subagent routing guidelines

Add a section documenting when to use which model tier for subagents, so the convention is followed consistently.

Step 5: Monitor and tune

Track cache hit rates over 1-2 days
Monitor if context pruning causes any information loss
Adjust heartbeat timing if cache misses are too frequent
Tune GLM-4.7 prompts based on observed output quality

What This Does NOT Change

No OpenClaw code changes: Phases 1-3 are config-only in openclaw.json; Phase 4 touches only workspace files (the council skill and AGENTS.md)
No upstream divergence: All settings use documented OpenClaw config knobs
No new infrastructure: No proxy servers, routers, or middleware
Main session stays on Opus: Only subagents move to cheaper models
Fully reversible: Remove the config keys and revert the council skill and AGENTS.md edits to return to current behavior

Expected Combined Impact

Optimization	Estimated Savings	Confidence
Prompt caching	40-60% reduction in input token processing cost	High
Cache warming via heartbeat	Maintains cache savings across idle	High
Context pruning	20-30% context size reduction for long sessions	Medium
Subagent model routing	60-80% reduction in subagent cost (free model for bulk work)	Medium (pending quality test)

Combined: Significant reduction in Copilot quota burn. Main session quality unchanged. Subagent quality maintained through tighter prompts + quality verification.
