swarm-zap/memory/plans/inference-cost-optimization.md
zap 23782735a1 docs(cost): update plan with corrections from official docs
- Phase 1: clarify cacheRetention only applies to Claude models; GPT auto-caches; GLM has none
- Phase 1: add TTL reality check (short=5min, long=1h) and implications for heartbeat timing
- Phase 2: explain why long TTL + 25m heartbeat is the right combo
- Phase 4: replace generic prompt tips with model-specific guidance from official Anthropic/OpenAI docs
- Added prompt structure notes for cache efficiency, GLM-4.7 tighter prompting requirements
- References: memory/references/*.md
2026-03-05 20:37:32 +00:00


Inference Cost Optimization Plan

Goal: Reduce LLM inference costs without quality loss using OpenClaw's built-in configuration knobs + smarter subagent model selection. No code changes to OpenClaw — config-only, fully upstream-compatible.

Date: 2026-03-05
Status: Planning


Current State

| Item | Value |
|---|---|
| Main session model | litellm/copilot-claude-opus-4.6 (via GitHub Copilot) |
| Default agent model | litellm/copilot-claude-sonnet-4.6 |
| Prompt caching | NOT SET (no cacheRetention configured) |
| Context pruning | NOT SET (no contextPruning configured) |
| Heartbeat | 30m (main agent only) |
| Subagent model | Inherits session model (expensive!) |
| Free models available | zai/glm-4.7, zai/glm-4.7-flash, zai/glm-4.7-flashx, zai/glm-5 (all $0) |
| Copilot models | Flat-rate via GitHub Copilot subscription (effectively $0 marginal cost per token) |

Cost Structure

  • Copilot models (litellm/copilot-*): Covered by GitHub Copilot subscription — no per-token cost, but subject to rate limits and quotas. Using Opus when Sonnet suffices wastes quota.
  • ZAI models (zai/glm-*): Free tier, no per-token cost. Quality varies by task type.
  • The real "cost" is: (a) Copilot quota burn on expensive models, (b) latency, (c) quality risk on cheaper models.

Phase 1: Enable Prompt Caching

What: Configure cacheRetention on Anthropic-backed models so repeated system prompts and stable context get cached by the provider.

Why: Our system prompt (AGENTS.md + SOUL.md + USER.md + TOOLS.md + IDENTITY.md + HEARTBEAT.md + skills list) is large and mostly static. Without caching, every turn reprocesses ~15-20k tokens of identical prefix. With caching, subsequent turns pay ~10% for cached tokens (Anthropic pricing).

Applies only to Claude-backed models. GPT, GLM, and Gemini models do NOT support Anthropic's cacheRetention mechanism. OpenAI caching is automatic (no config needed). ZAI/GLM has no caching mechanism.

Cache TTL reality check (from official Anthropic docs):

  • short = 5-minute TTL (the default; refreshed on each use within the window)
  • long = 1-hour TTL (at higher write cost: 2x base input price vs 1.25x for short)
  • Cache reads cost 0.1x (10%) of base input — so a cache hit on a 15k-token system prompt costs 90% less
  • First turn writes cache (slightly more expensive), subsequent turns read it (very cheap)
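
To sanity-check those multipliers, a rough sketch (the 15k-token prefix size is the estimate from the Why section above; multipliers are from the Anthropic pricing quoted here):

```python
# Relative input cost of the static prefix over a session.
# Assumptions: hypothetical 15k-token static prefix; multipliers from
# the Anthropic docs quoted above.
PROMPT_TOKENS = 15_000
WRITE_LONG = 2.0    # 1h-TTL cache write: 2x base input price
WRITE_SHORT = 1.25  # 5m-TTL cache write: 1.25x base input price
READ = 0.1          # cache read: 0.1x base input price

def session_cost(turns: int, write_mult: float) -> float:
    """First turn writes the cache; remaining turns read it."""
    return PROMPT_TOKENS * (write_mult + (turns - 1) * READ)

def no_cache(turns: int) -> float:
    return PROMPT_TOKENS * turns

for turns in (2, 3, 5, 10):
    saving = 1 - session_cost(turns, WRITE_LONG) / no_cache(turns)
    print(f"{turns} turns: {saving:+.0%}")
```

Long caching is actually a small loss on a 2-turn session (the 2x write isn't amortized yet), breaks even around turn 3, and passes 50% savings by turn 5, consistent with the 40-60% estimate at the end of this phase.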

Implication: With 5-minute default TTL and a 30-minute heartbeat, the cache expires between every heartbeat. Either:

  1. Use long (1h TTL) and set heartbeat to 55m to keep warm — best for cost savings
  2. Use short (5m TTL) with no heartbeat adjustment — cache only helps within active bursts

Recommendation: Use long on main session Claude models + 25m heartbeat (well within 1h). Cache writes are slightly more expensive but the read savings dominate for any active session.

Config change (~/.openclaw/openclaw.json):

{
  "agents": {
    "defaults": {
      "models": {
        "litellm/copilot-claude-opus-4.6": {
          "params": {
            "cacheRetention": "long"
          }
        },
        "litellm/copilot-claude-sonnet-4.6": {
          "params": {
            "cacheRetention": "long"
          }
        },
        "litellm/copilot-claude-opus-4.5": {
          "params": {
            "cacheRetention": "long"
          }
        },
        "litellm/copilot-claude-sonnet-4.5": {
          "params": {
            "cacheRetention": "long"
          }
        },
        "litellm/copilot-claude-haiku-4.5": {
          "params": {
            "cacheRetention": "short"
          }
        }
      }
    }
  }
}

Note: No config needed for GPT models — OpenAI caches automatically for free on prompts ≥1024 tokens. No config available for ZAI/GLM models (no caching support).

Verification:

  1. After applying, check /status or /usage full for cacheRead vs cacheWrite tokens.
  2. Enable cache trace diagnostics temporarily:
    { "diagnostics": { "cacheTrace": { "enabled": true } } }
    
  3. First turn will show high cacheWrite (populating cache). Subsequent turns should show high cacheRead with much lower cacheWrite.
  4. Target: >60% cache hit rate within 2-3 turns of a session.

Risk: Zero. Caching doesn't change outputs — it's purely a provider-side optimization.

Expected impact: 40-60% reduction in input token processing cost for sessions with multiple turns.


Phase 2: Heartbeat Cache Warming

What: Align heartbeat interval to keep the 1-hour prompt cache warm across idle gaps.

Why: With cacheRetention: "long" (1h TTL), the cache expires after 1 hour of no activity. A heartbeat just under 1h ensures the cache is touched before it expires, so the next real interaction reads from cache instead of rewriting it. Our current 30m heartbeat already works, but 25m gives a safety margin.

Important: Heartbeat keep-warm only applies to Claude models. GPT/GLM models don't benefit — their caching is either automatic (OpenAI) or non-existent (ZAI).

Config change (~/.openclaw/openclaw.json):

{
  "agents": {
    "defaults": {
      "heartbeat": {
        "every": "55m"
      }
    },
    "list": [
      {
        "id": "main",
        "heartbeat": {
          "every": "25m"
        }
      }
    ]
  }
}

Rationale:

  • Main agent: keep at 25m (well within 1h TTL, ensures cache stays warm during active use)
  • Other agents (claude, codex, copilot, opencode): 55m default (just under 1h TTL, minimal quota burn when idle)
  • If an agent is rarely used, its heartbeat won't fire (disabled agents skip heartbeat)

Verification:

  1. After a 30-minute idle gap, check that the next interaction shows cacheRead (not all cacheWrite).
  2. Monitor heartbeat token cost via /usage full on a heartbeat response.

Risk: Low. Slightly more frequent heartbeat = slightly more baseline token usage, but the cache savings on real interactions outweigh this.

Expected impact: Maintains the Phase 1 cache savings across idle periods instead of losing them after TTL expiry.


Phase 3: Context Pruning

What: Enable cache-ttl context pruning so old tool results and conversation history get pruned after the cache window expires.

Why: Long sessions accumulate tool results, file reads, and old conversation turns that bloat the context. Without pruning, post-idle requests re-cache the entire oversized history. Cache-TTL pruning trims stale context so re-caching after idle is smaller and cheaper.

Config change (~/.openclaw/openclaw.json):

{
  "agents": {
    "defaults": {
      "contextPruning": {
        "mode": "cache-ttl",
        "ttl": "1h"
      }
    }
  }
}

Rationale:

  • cache-ttl mode: prunes old tool-result context after the cache TTL expires
  • ttl: "1h": matches Anthropic's long cache retention window
  • After 1h of no interaction, old tool results and conversation history are pruned, so the next request re-caches a smaller context
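
For intuition, a conceptual sketch of what cache-ttl pruning does (illustrative only; the message shape and logic here are not OpenClaw's actual implementation):

```python
# Conceptual sketch: drop tool results older than the cache TTL, keep
# everything else. Message shape is illustrative, not OpenClaw's.
import time

TTL_SECONDS = 3600  # matches "ttl": "1h"

def prune(history: list[dict], now: float) -> list[dict]:
    return [
        msg for msg in history
        if msg["role"] != "tool" or now - msg["ts"] <= TTL_SECONDS
    ]

now = time.time()
history = [
    {"role": "user", "ts": now - 7200, "content": "read the config"},
    {"role": "tool", "ts": now - 7200, "content": "(5k-token file dump)"},
    {"role": "assistant", "ts": now - 7100, "content": "config summary"},
    {"role": "tool", "ts": now - 60, "content": "fresh tool output"},
]
print([m["role"] for m in prune(history, now)])  # stale tool dump dropped
```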

Verification:

  1. Use /context list or /context detail to check context size before and after pruning.
  2. After a >1h idle gap, verify the context window is smaller than before the gap.
  3. Ensure no critical context is lost — compaction summaries should preserve key information.

Risk: Low-medium. Pruning removes old tool results, which means the model can't reference exact earlier tool outputs after pruning. Compaction summaries mitigate this. Test by asking about earlier conversation after a pruning event.

Expected impact: 20-30% reduction in context size for long sessions, which both reduces input token cost and improves response quality (less noise in context).


Phase 4: Cheaper Models for Subagents

What: Route subagent tasks to cheaper models based on task complexity, with quality verification.

Why: Currently ALL subagents inherit the session model (Opus 4.6 or whatever the session is on). Most subagent tasks (council advisors, research queries, simple generation) don't need frontier-model quality. ZAI GLM-4.7 is free and handles many tasks well. Copilot Sonnet/Haiku are much cheaper quota-wise than Opus.

Model Tier Strategy

| Tier | Model | Use Case | Cost |
|---|---|---|---|
| Free | zai/glm-4.7 | Bulk subagent work: council advisors, brainstorming, summarization, classification | $0 |
| Free-fast | zai/glm-4.7-flash | Simple/short subagent tasks: acknowledgments, formatting, quick lookups | $0 |
| Cheap | litellm/copilot-claude-haiku-4.5 | Tasks needing Claude quality but not heavy reasoning | Low quota |
| Standard | litellm/copilot-claude-sonnet-4.6 | Tasks needing strong reasoning, code generation, analysis | Medium quota |
| Frontier | litellm/copilot-claude-opus-4.6 | Only for: main session, referee/meta-arbiter, critical decisions | High quota |

Implementation

4a. Council Skill — Default to GLM-4.7

Update council skill to use cheaper models by default:

| Council Role | Default Model | Override for tier=heavy |
|---|---|---|
| Personality advisors | zai/glm-4.7 | litellm/copilot-claude-sonnet-4.6 |
| D/P Freethinkers | zai/glm-4.7 | litellm/copilot-claude-sonnet-4.6 |
| D/P Arbiters | zai/glm-4.7 | litellm/copilot-claude-sonnet-4.6 |
| Referee / Meta-Arbiter | litellm/copilot-claude-sonnet-4.6 | litellm/copilot-claude-opus-4.6 |

When spawning subagents via sessions_spawn, pass the model parameter:

{
  "task": "...",
  "mode": "run",
  "label": "council-pragmatist",
  "model": "zai/glm-4.7"
}

4b. General Subagent Routing Guidelines

Encode these in AGENTS.md or a workspace convention file so all future subagent spawns follow the pattern:

Use zai/glm-4.7 (free) when:

  • Task is well-defined with clear constraints
  • Output format is specified in the prompt
  • Task is one of: summarization, brainstorming, classification, translation, formatting, simple Q&A
  • Task doesn't require tool use or complex multi-step reasoning

Use litellm/copilot-claude-sonnet-4.6 (standard) when:

  • Task requires nuanced reasoning or analysis
  • Task involves code generation or review
  • Output quality is user-facing and high-stakes
  • Task requires understanding subtle context

Use litellm/copilot-claude-opus-4.6 (frontier) when:

  • Main interactive session only
  • Final synthesis / referee / meta-arbiter roles
  • Tasks where the user explicitly asked for highest quality
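
The routing guidelines above, sketched as a decision function (the attribute names are illustrative conventions, not an OpenClaw API):

```python
# Model tier routing per the guidelines above. Attribute names are
# illustrative, not an OpenClaw API.
SIMPLE_KINDS = {"summarization", "brainstorming", "classification",
                "translation", "formatting", "simple-qa"}

def choose_subagent_model(kind: str, needs_tools: bool = False,
                          user_facing: bool = False,
                          referee: bool = False) -> str:
    if referee:
        # Final synthesis / referee / meta-arbiter stays on frontier
        return "litellm/copilot-claude-opus-4.6"
    if kind in SIMPLE_KINDS and not needs_tools and not user_facing:
        return "zai/glm-4.7"  # free tier handles well-defined bulk work
    return "litellm/copilot-claude-sonnet-4.6"  # standard default

print(choose_subagent_model("summarization"))
print(choose_subagent_model("code-review", user_facing=True))
```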

Subagent Model Notes

  • When spawning subagents for Claude tasks, caching applies if the subagent model is also a Claude model. But subagents are typically short-lived (single-turn mode=run), so caching benefit is minimal — they don't accumulate conversation history.
  • The main caching win is in the main session, which has a large, growing context across many turns.
  • For GLM-4.7 subagents: no caching benefit, but no cost either ($0 model). Prompts must be self-contained and tightly framed.
  • For GPT subagents: OpenAI caches automatically if prompt ≥1024 tokens, no action needed.

Prompt Structure for Maximum Cache Efficiency (Claude models)

Per official Anthropic best practices:

  • Static first, dynamic last: System prompt, role, instructions → then topic/task (dynamic part).
  • This structure is what OpenClaw already does (system prompt built once, user message varies).
  • OpenClaw handles the cache_control injection automatically via cacheRetention config.
  • Our prompts already follow this structure correctly.
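
A minimal illustration of static-first ordering for repeated council spawns (the advisor text is hypothetical; the XML tags follow Anthropic's prompting guidance):

```python
# Static-first, dynamic-last: everything before <topic> is byte-identical
# across spawns, so it is cacheable; only the topic varies.
STATIC_PREFIX = (
    "You are the Pragmatist advisor on a council.\n"
    "<instructions>\n"
    "Respond in 3 paragraphs of flowing prose.\n"
    "Stay in character; reference specific aspects of the topic.\n"
    "</instructions>\n"
)

def build_prompt(topic: str) -> str:
    return STATIC_PREFIX + f"<topic>\n{topic}\n</topic>\n"

p1 = build_prompt("Should we enable context pruning?")
p2 = build_prompt("Which models for subagents?")
print(p1.startswith(STATIC_PREFIX) and p2.startswith(STATIC_PREFIX))  # True
```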

4c. GLM-4.7 Quality Verification

Before switching council and subagents to GLM-4.7, run a quality comparison:

  1. Same-topic test: Run the personality council on a topic we've already tested with Sonnet, but using GLM-4.7 for advisors. Compare output quality side by side.
  2. Structured output test: Verify GLM-4.7 follows prompt templates correctly (word-count guidance, section headers, staying in role).
  3. Scoring rubric:
    • Does the advisor stay in character? (yes/no)
    • Is the output substantive (not generic platitudes)? (1-5)
    • Does it follow word count guidance? (within 50% of target)
    • Does it reference specific aspects of the topic? (1-5)
  4. Minimum quality bar: If GLM-4.7 scores ≥3.5/5 average on the rubric, it's good enough for advisor roles. Referee always stays on Sonnet+.
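
One way to turn the rubric into a single number (mapping the two yes/no checks to 5/0 points is our assumption; the plan only fixes the 3.5/5 bar):

```python
# Rubric scoring sketch. The 5/0 mapping for boolean checks is an
# assumption, not specified by the plan.
def rubric_score(in_character: bool, substantive: int,
                 word_count_ok: bool, topic_specific: int) -> float:
    items = [5 if in_character else 0, substantive,
             5 if word_count_ok else 0, topic_specific]
    return sum(items) / len(items)

def passes_bar(score: float) -> bool:
    return score >= 3.5

print(passes_bar(rubric_score(True, 4, True, 3)))   # 4.25 -> True
print(passes_bar(rubric_score(False, 3, True, 3)))  # 2.75 -> False
```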

4d. Prompt Engineering Per Model — Official Best Practices

From official Anthropic and OpenAI docs (see memory/references/):

For Claude models (Haiku, Sonnet — subagent advisors)

  • Give an explicit role in the system prompt: You are the Skeptic advisor on a council...
  • Use XML tags to separate role, instructions, context, topic: <instructions>, <context>, <topic>
  • Put static instructions first, variable topic at the end (maximizes cache hit rate on repeated spawns)
  • 3-5 <example> tags for structured output formats
  • Phrase instructions positively ("Write in flowing prose") rather than negatively ("Don't use bullet points")

For GLM-4.7 (free tier subagents)

  • Be MORE explicit than with Claude — GLM needs tighter constraints
  • Constrain output length tightly: "Respond in exactly 3 paragraphs" not "200-400 words"
  • Use numbered lists or explicit section headers in the prompt
  • Front-load the most critical instruction (role + constraint first, context second)
  • Include a format-check reminder: "Before responding, verify your output matches the format above"
  • Request structured output over open-ended generation when possible
  • Avoid complex multi-step reasoning chains — GLM handles simpler, well-defined tasks best

For GPT models (gpt-5-mini, gpt-4.1 subagents)

  • Include explicit step-by-step instructions (GPT benefits from "think step by step" guidance)
  • Use response_format: json_schema for any scored/structured output — eliminates format retries entirely
  • Use developer role for system/role instructions (higher priority than user)
  • Don't over-specify for reasoning models (o3, o4-mini) — they reason internally
  • Pin to specific model snapshots if quality consistency matters (gpt-4.1-2025-04-14)
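
For the json_schema point, a minimal response_format fragment in the shape OpenAI's structured-outputs API expects (the schema fields here are illustrative):

```json
{
  "response_format": {
    "type": "json_schema",
    "json_schema": {
      "name": "advisor_score",
      "strict": true,
      "schema": {
        "type": "object",
        "properties": {
          "in_character": { "type": "boolean" },
          "substantive": { "type": "integer" }
        },
        "required": ["in_character", "substantive"],
        "additionalProperties": false
      }
    }
  }
}
```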

Implementation Order

Step 1: Config changes (Phases 1-3) — Do together, single commit

Apply all three config changes to ~/.openclaw/openclaw.json:

  • cacheRetention: "long" on Claude models
  • heartbeat.every: "25m" for main, "55m" default
  • contextPruning.mode: "cache-ttl" with ttl: "1h"
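
Merged, Phases 1-3 are a single config delta (model list abridged to two entries here; Phase 1 lists the full set):

```json
{
  "agents": {
    "defaults": {
      "models": {
        "litellm/copilot-claude-opus-4.6": { "params": { "cacheRetention": "long" } },
        "litellm/copilot-claude-sonnet-4.6": { "params": { "cacheRetention": "long" } }
      },
      "heartbeat": { "every": "55m" },
      "contextPruning": { "mode": "cache-ttl", "ttl": "1h" }
    },
    "list": [
      { "id": "main", "heartbeat": { "every": "25m" } }
    ]
  }
}
```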

Restart gateway: openclaw gateway restart

Verify with /status and /usage full over next few interactions.

Step 2: Quality test GLM-4.7 for subagent work

Run a single council advisor (e.g., Pragmatist) on a known topic using model: "zai/glm-4.7" in sessions_spawn. Compare output quality against the Sonnet run we already have saved.

Step 3: Update council skill for model tiers

If GLM-4.7 passes quality bar, update skills/council/SKILL.md and scripts/council.sh with the model tier routing table. Update references/prompts.md with tighter prompt variants for cheaper models if needed.

Step 4: Update AGENTS.md with subagent routing guidelines

Add a section documenting when to use which model tier for subagents, so the convention is followed consistently.

Step 5: Monitor and tune

  • Track cache hit rates over 1-2 days
  • Monitor if context pruning causes any information loss
  • Adjust heartbeat timing if cache misses are too frequent
  • Tune GLM-4.7 prompts based on observed output quality

Upstream Safety Rules

These are hard constraints. Any implementation that violates them is out of scope.

Never do

  • Edit files under ~/.npm-global/lib/node_modules/openclaw/ directly (dist, src, docs)
  • Patch or monkey-patch OpenClaw's runtime code, even for emergencies (exception: the existing TUI patch has a tracked upstream PR — document any new ones immediately)
  • Add config keys not documented in OpenClaw's own docs (guessing at undocumented keys can silently break on upgrade)
  • Modify ~/.openclaw/openclaw.json in a way that would be overwritten or invalidated by openclaw update
  • Introduce any middleware, proxy, or hook that intercepts OpenClaw's internal request path

Safe to do

  • Edit ~/.openclaw/openclaw.json using documented config knobs (agents, models, diagnostics, contextPruning, etc.)
  • Add/edit workspace files (~/.openclaw/workspace/) freely — these are never touched by OpenClaw updates
  • Install/update skills via clawhub — skills are workspace-local
  • Run openclaw gateway restart after config changes
  • Use openclaw update status / scripts/openclaw-update-safe.sh to check for upstream updates

Checking before applying

Before implementing any config change:

  1. Verify the key exists in /home/openclaw/.npm-global/lib/node_modules/openclaw/docs/ or https://docs.openclaw.ai
  2. If undocumented: skip it or open a question/issue — don't guess
  3. After openclaw update, re-verify config keys still work (check gateway logs for config parse errors)

Update workflow

# Before updating OpenClaw
openclaw update status           # check what version is available
# Review changelog for breaking config changes
openclaw update                  # update (safe scripts handle local compat)
openclaw gateway restart         # restart to pick up new version
# Verify gateway health + session model still resolves correctly

What This Does NOT Change

  • No OpenClaw code changes: Everything is config-only in openclaw.json
  • No upstream divergence: All settings use documented OpenClaw config knobs
  • No new infrastructure: No proxy servers, routers, or middleware
  • Main session stays on Opus: Only subagents move to cheaper models
  • Fully reversible: Remove the config keys to revert to current behavior

Expected Combined Impact

| Optimization | Estimated Savings | Confidence |
|---|---|---|
| Prompt caching | 40-60% input token reduction | High |
| Cache warming via heartbeat | Maintains cache savings across idle | High |
| Context pruning | 20-30% context size reduction for long sessions | Medium |
| Subagent model routing | 60-80% subagent cost reduction (free model for bulk work) | Medium (pending quality test) |

Combined: Significant reduction in Copilot quota burn. Main session quality unchanged. Subagent quality maintained through tighter prompts + quality verification.