swarm-zap/memory/plans/inference-cost-optimization.md
zap 6642964ae6 docs(cost): add inference cost optimization plan — 4 phases
Phase 1: Enable prompt caching (cacheRetention: long on Claude models)
Phase 2: Heartbeat cache warming (25m main, 55m default)
Phase 3: Context pruning (cache-ttl mode, 1h TTL)
Phase 4: Cheaper models for subagents (GLM-4.7 free tier for bulk work)

All config-only, no OpenClaw code changes, fully reversible.
2026-03-05 20:20:03 +00:00


Inference Cost Optimization Plan

Goal: Reduce LLM inference costs without quality loss, using OpenClaw's built-in configuration knobs and smarter subagent model selection. No code changes to OpenClaw: everything is config-only and fully upstream-compatible.

Date: 2026-03-05
Status: Planning


Current State

| Item | Value |
| --- | --- |
| Main session model | litellm/copilot-claude-opus-4.6 (via GitHub Copilot) |
| Default agent model | litellm/copilot-claude-sonnet-4.6 |
| Prompt caching | NOT SET (no cacheRetention configured) |
| Context pruning | NOT SET (no contextPruning configured) |
| Heartbeat | 30m (main agent only) |
| Subagent model | Inherits session model (expensive!) |
| Free models available | zai/glm-4.7, zai/glm-4.7-flash, zai/glm-4.7-flashx, zai/glm-5 (all $0) |
| Copilot models | Flat-rate via GitHub Copilot subscription (effectively $0 marginal cost per token) |

Cost Structure

  • Copilot models (litellm/copilot-*): Covered by GitHub Copilot subscription — no per-token cost, but subject to rate limits and quotas. Using Opus when Sonnet suffices wastes quota.
  • ZAI models (zai/glm-*): Free tier, no per-token cost. Quality varies by task type.
  • The real "cost" is: (a) Copilot quota burn on expensive models, (b) latency, (c) quality risk on cheaper models.

Phase 1: Enable Prompt Caching

What: Configure cacheRetention on Anthropic-backed models so repeated system prompts and stable context get cached by the provider.

Why: Our system prompt (AGENTS.md + SOUL.md + USER.md + TOOLS.md + IDENTITY.md + HEARTBEAT.md + skills list) is large and mostly static. Without caching, every turn reprocesses ~15-20k tokens of identical prefix. With caching, the cached prefix tokens on subsequent turns are billed at ~10% of the normal input rate (Anthropic pricing).

Config change (~/.openclaw/openclaw.json):

{
  "agents": {
    "defaults": {
      "models": {
        "litellm/copilot-claude-opus-4.6": {
          "params": {
            "cacheRetention": "long"
          }
        },
        "litellm/copilot-claude-sonnet-4.6": {
          "params": {
            "cacheRetention": "long"
          }
        },
        "litellm/copilot-claude-opus-4.5": {
          "params": {
            "cacheRetention": "long"
          }
        },
        "litellm/copilot-claude-sonnet-4.5": {
          "params": {
            "cacheRetention": "long"
          }
        },
        "litellm/copilot-claude-haiku-4.5": {
          "params": {
            "cacheRetention": "short"
          }
        }
      }
    }
  }
}

Verification:

  1. After applying, check /status or /usage full for cacheRead vs cacheWrite tokens.
  2. Enable cache trace diagnostics temporarily:
    { "diagnostics": { "cacheTrace": { "enabled": true } } }
    
  3. First turn will show high cacheWrite (populating cache). Subsequent turns should show high cacheRead with much lower cacheWrite.
  4. Target: >60% cache hit rate within 2-3 turns of a session.
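The plan does not define "cache hit rate" precisely. A minimal sketch, assuming it means the share of a turn's input tokens that were read from cache, is shown below. The token counts are hypothetical and the field names only mirror the cacheRead / cacheWrite counters mentioned above; they are not a documented OpenClaw output format.

```python
def cache_hit_rate(cache_read: int, cache_write: int, uncached_input: int = 0) -> float:
    """Share of one turn's input tokens that were served from the provider cache."""
    total = cache_read + cache_write + uncached_input
    return cache_read / total if total else 0.0


# Hypothetical per-turn token counts, for illustration only.
turns = [
    {"cache_read": 0, "cache_write": 18_000, "uncached_input": 500},       # cache is populated
    {"cache_read": 18_000, "cache_write": 600, "uncached_input": 400},     # prefix is reused
    {"cache_read": 18_600, "cache_write": 700, "uncached_input": 300},
]

rates = [cache_hit_rate(**turn) for turn in turns]
meets_target = [rate > 0.60 for rate in rates]  # the >60% target from the list above
```

Under this definition the first turn always scores 0, which is why the target is stated for turns 2-3 rather than for the session as a whole.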

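Risk: Minimal. Caching doesn't change outputs; it is purely a provider-side optimization. The one caveat is that cache writes are billed at a premium over uncached input under Anthropic pricing, so very short sessions see little benefit.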

Expected impact: 40-60% reduction in input token processing cost for sessions with multiple turns.
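To see why the savings depend on session length, the estimate can be sketched as simple arithmetic. This is a rough model under stated assumptions, not a measurement: the 10% read multiplier comes from the figure above, the write multiplier is an assumed premium that should be checked against current provider pricing, and the model ignores the growing conversation history, so it overstates savings for real sessions.

```python
def relative_input_cost(prefix_tokens: int, fresh_tokens: int, turns: int,
                        read_multiplier: float = 0.10,
                        write_multiplier: float = 1.25) -> float:
    """Input cost with caching divided by input cost without caching.

    prefix_tokens: static system-prompt prefix that can be cached
    fresh_tokens:  new, uncached input tokens per turn (assumed constant)
    write_multiplier is an ASSUMED cache-write premium, not a confirmed price.
    """
    if turns < 1:
        raise ValueError("turns must be at least 1")
    uncached = turns * (prefix_tokens + fresh_tokens)
    first_turn = prefix_tokens * write_multiplier + fresh_tokens      # cache write
    later_turns = (turns - 1) * (prefix_tokens * read_multiplier + fresh_tokens)
    return (first_turn + later_turns) / uncached


# Hypothetical session shapes, for illustration only.
single = relative_input_cost(18_000, 2_000, turns=1)    # one-shot session
session = relative_input_cost(18_000, 2_000, turns=10)  # multi-turn session
```

A ratio above 1.0 means caching costs more than it saves, which is what the one-turn case produces here; the benefit only appears once the cached prefix is read back on later turns.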


Phase 2: Heartbeat Cache Warming

What: Align heartbeat interval to keep the prompt cache warm across idle gaps.

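Why: Anthropic's long cache retention has a TTL of ~1 hour. Our current 30m heartbeat on the main agent is already well under that, so the cache stays warm today. The change below tightens main to 25m for extra margin and gives the other agents a 55m default. In both cases the heartbeat should remain a lightweight keep-warm that doesn't generate expensive cache writes.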

Config change (~/.openclaw/openclaw.json):

{
  "agents": {
    "defaults": {
      "heartbeat": {
        "every": "55m"
      }
    },
    "list": [
      {
        "id": "main",
        "heartbeat": {
          "every": "25m"
        }
      }
    ]
  }
}

Rationale:

  • Main agent: set to 25m, down from the current 30m (well within 1h TTL, ensures cache stays warm during active use)
  • Other agents (claude, codex, copilot, opencode): 55m default (just under 1h TTL, minimal quota burn when idle)
  • Disabled agents skip heartbeat, so an agent that is switched off adds no keep-warm cost
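The interval choices above can be sanity-checked with a few lines of code. This is a small sketch, assuming durations are written as a number followed by s, m, or h (the forms used in this plan); OpenClaw's own duration parser may accept other formats.

```python
import re

_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}


def parse_duration(text: str) -> int:
    """Convert strings such as '25m' or '1h' to seconds (assumed grammar: <int><s|m|h>)."""
    match = re.fullmatch(r"(\d+)([smh])", text.strip())
    if not match:
        raise ValueError(f"unsupported duration: {text!r}")
    return int(match.group(1)) * _UNIT_SECONDS[match.group(2)]


def keeps_cache_warm(heartbeat: str, cache_ttl: str = "1h") -> bool:
    """True when the heartbeat fires strictly before the cache TTL would expire."""
    return parse_duration(heartbeat) < parse_duration(cache_ttl)


checks = {interval: keeps_cache_warm(interval) for interval in ("25m", "55m", "90m")}
```

Both planned intervals pass, while a 90m heartbeat would let the cache expire between beats.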

Verification:

  1. After a 30-minute idle gap, check that the next interaction shows cacheRead (not all cacheWrite).
  2. Monitor heartbeat token cost via /usage full on a heartbeat response.

Risk: Low. Slightly more frequent heartbeat = slightly more baseline token usage, but the cache savings on real interactions outweigh this.

Expected impact: Maintains the Phase 1 cache savings across idle periods instead of losing them after TTL expiry.


Phase 3: Context Pruning

What: Enable cache-ttl context pruning so old tool results and conversation history get pruned after the cache window expires.

Why: Long sessions accumulate tool results, file reads, and old conversation turns that bloat the context. Without pruning, post-idle requests re-cache the entire oversized history. Cache-TTL pruning trims stale context so re-caching after idle is smaller and cheaper.

Config change (~/.openclaw/openclaw.json):

{
  "agents": {
    "defaults": {
      "contextPruning": {
        "mode": "cache-ttl",
        "ttl": "1h"
      }
    }
  }
}

Rationale:

  • cache-ttl mode: prunes old tool-result context after the cache TTL expires
  • ttl: "1h": matches Anthropic's long cache retention window
  • After 1h of no interaction, old tool results and conversation history are pruned, so the next request re-caches a smaller context
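The idea behind cache-ttl pruning can be illustrated with a toy model. This is not OpenClaw's implementation: the Message type, the keep_recent parameter, and the rule of dropping only tool results are assumptions made purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class Message:
    role: str         # "user", "assistant", or "tool"
    content: str
    timestamp: float  # seconds since epoch


def prune_after_idle(history: list[Message], now: float,
                     ttl_seconds: int = 3600, keep_recent: int = 2) -> list[Message]:
    """Drop old tool results once the session has been idle for at least the TTL.

    While the cache is still warm, the history is returned unchanged so the
    cached prefix stays valid. The last keep_recent messages are always kept.
    """
    if not history:
        return []
    idle = now - history[-1].timestamp
    if idle < ttl_seconds:
        return list(history)
    cutoff = len(history) - keep_recent
    return [m for i, m in enumerate(history) if m.role != "tool" or i >= cutoff]


# Hypothetical session, for illustration only.
history = [
    Message("user", "read the config", 0),
    Message("tool", "<large file contents>", 10),
    Message("assistant", "summary of the config", 20),
    Message("tool", "<second tool result>", 30),
    Message("assistant", "done", 40),
]
warm = prune_after_idle(history, now=100)          # idle 60s: nothing pruned
cold = prune_after_idle(history, now=40 + 3600)    # idle 1h: old tool result dropped
```

The key design point is that pruning is deferred until the cache has expired anyway, so trimming the history never invalidates a cache entry that is still being paid back.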

Verification:

  1. Use /context list or /context detail to check context size before and after pruning.
  2. After a >1h idle gap, verify the context window is smaller than before the gap.
  3. Ensure no critical context is lost — compaction summaries should preserve key information.

Risk: Low-medium. Pruning removes old tool results, which means the model can't reference exact earlier tool outputs after pruning. Compaction summaries mitigate this. Test by asking about earlier conversation after a pruning event.

Expected impact: 20-30% reduction in context size for long sessions, which both reduces input token cost and improves response quality (less noise in context).


Phase 4: Cheaper Models for Subagents

What: Route subagent tasks to cheaper models based on task complexity, with quality verification.

Why: Currently ALL subagents inherit the session model (Opus 4.6 or whatever the session is on). Most subagent tasks (council advisors, research queries, simple generation) don't need frontier-model quality. ZAI GLM-4.7 is free and handles many tasks well. Copilot Sonnet/Haiku are much cheaper quota-wise than Opus.

Model Tier Strategy

| Tier | Model | Use Case | Cost |
| --- | --- | --- | --- |
| Free | zai/glm-4.7 | Bulk subagent work: council advisors, brainstorming, summarization, classification | $0 |
| Free-fast | zai/glm-4.7-flash | Simple/short subagent tasks: acknowledgments, formatting, quick lookups | $0 |
| Cheap | litellm/copilot-claude-haiku-4.5 | Tasks needing Claude quality but not heavy reasoning | Low quota |
| Standard | litellm/copilot-claude-sonnet-4.6 | Tasks needing strong reasoning, code generation, analysis | Medium quota |
| Frontier | litellm/copilot-claude-opus-4.6 | Only for: main session, referee/meta-arbiter, critical decisions | High quota |

Implementation

4a. Council Skill — Default to GLM-4.7

Update council skill to use cheaper models by default:

| Council Role | Default Model | Override for tier=heavy |
| --- | --- | --- |
| Personality advisors | zai/glm-4.7 | litellm/copilot-claude-sonnet-4.6 |
| D/P Freethinkers | zai/glm-4.7 | litellm/copilot-claude-sonnet-4.6 |
| D/P Arbiters | zai/glm-4.7 | litellm/copilot-claude-sonnet-4.6 |
| Referee / Meta-Arbiter | litellm/copilot-claude-sonnet-4.6 | litellm/copilot-claude-opus-4.6 |

When spawning subagents via sessions_spawn, pass the model parameter:

{
  "task": "...",
  "mode": "run",
  "label": "council-pragmatist",
  "model": "zai/glm-4.7"
}
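One way to keep the role-to-model mapping in a single place is a small lookup that builds the spawn payload. This is a sketch only: the role keys and the spawn_payload helper are hypothetical names, and the payload fields simply mirror the example above rather than a verified sessions_spawn schema.

```python
SONNET = "litellm/copilot-claude-sonnet-4.6"
OPUS = "litellm/copilot-claude-opus-4.6"
GLM = "zai/glm-4.7"

# (default model, model for tier=heavy); role keys are hypothetical identifiers.
COUNCIL_MODELS = {
    "personality-advisor": (GLM, SONNET),
    "dp-freethinker": (GLM, SONNET),
    "dp-arbiter": (GLM, SONNET),
    "referee": (SONNET, OPUS),
}


def spawn_payload(task: str, label: str, role: str, tier: str = "default") -> dict:
    """Build a sessions_spawn-style payload with the model chosen by council role and tier."""
    default_model, heavy_model = COUNCIL_MODELS[role]
    model = heavy_model if tier == "heavy" else default_model
    return {"task": task, "mode": "run", "label": label, "model": model}


payload = spawn_payload("Assess the proposal", "council-pragmatist", "personality-advisor")
```

An unknown role raises KeyError instead of silently falling back to an expensive model, which seems the safer default for a cost-control measure.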

4b. General Subagent Routing Guidelines

Encode these in AGENTS.md or a workspace convention file so all future subagent spawns follow the pattern:

Use zai/glm-4.7 (free) when:

  • Task is well-defined with clear constraints
  • Output format is specified in the prompt
  • Task is one of: summarization, brainstorming, classification, translation, formatting, simple Q&A
  • Task doesn't require tool use or complex multi-step reasoning

Use litellm/copilot-claude-sonnet-4.6 (standard) when:

  • Task requires nuanced reasoning or analysis
  • Task involves code generation or review
  • Output quality is user-facing and high-stakes
  • Task requires understanding subtle context

Use litellm/copilot-claude-opus-4.6 (frontier) when:

  • Main interactive session only
  • Final synthesis / referee / meta-arbiter roles
  • Tasks where the user explicitly asked for highest quality
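The guidelines above can also be expressed as a small decision function, which may be easier to keep consistent than prose. This is an illustrative sketch: the task-type labels and flag names are hypothetical, and the precedence between rules (frontier first, then free, then standard) is an assumption about how conflicts should resolve.

```python
GLM = "zai/glm-4.7"
SONNET = "litellm/copilot-claude-sonnet-4.6"
OPUS = "litellm/copilot-claude-opus-4.6"

# Task types the guidelines list as suitable for the free tier (labels are hypothetical).
FREE_TIER_TASKS = {
    "summarization", "brainstorming", "classification",
    "translation", "formatting", "simple-qa",
}


def pick_model(task_type: str, *, needs_tools: bool = False, high_stakes: bool = False,
               is_main_or_referee: bool = False, user_wants_best: bool = False) -> str:
    """Choose a model tier for a task following the routing guidelines."""
    if is_main_or_referee or user_wants_best:
        return OPUS
    if task_type in FREE_TIER_TASKS and not needs_tools and not high_stakes:
        return GLM
    # Everything else (code, nuanced analysis, tool use, unknown task types) goes to Sonnet.
    return SONNET


choices = {
    "summary": pick_model("summarization"),
    "code": pick_model("code-review"),
    "referee": pick_model("synthesis", is_main_or_referee=True),
}
```

Unknown task types fall through to Sonnet rather than the free tier, so a mislabeled task costs quota instead of quality.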

4c. Quality Verification Strategy

Before switching council and subagents to GLM-4.7, run a quality comparison:

  1. Same-topic test: Run the personality council on a topic we've already tested with Sonnet, but using GLM-4.7 for advisors. Compare output quality side by side.
  2. Structured output test: Verify GLM-4.7 follows prompt templates correctly (word count guidance, section headers, staying in role).
  3. Scoring rubric:
    • Does the advisor stay in character? (yes/no)
    • Is the output substantive (not generic platitudes)? (1-5)
    • Does it follow word count guidance? (within ±50% of target)
    • Does it reference specific aspects of the topic? (1-5)
  4. Minimum quality bar: If GLM-4.7 averages ≥3.5/5 on the 1-5 rubric items and passes the yes/no and word-count checks, it's good enough for advisor roles. Referee always stays on Sonnet or higher.
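The rubric mixes yes/no checks with 1-5 scores, so the pass/fail rule needs an interpretation. The sketch below assumes the yes/no and word-count items act as gates and that the 3.5 threshold applies to the mean of the two 1-5 items; the RubricScore type and its field names are hypothetical.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class RubricScore:
    stays_in_character: bool
    substantive: int      # 1-5
    word_count: int
    target_words: int
    topic_specific: int   # 1-5

    def within_word_budget(self) -> bool:
        """Word count is within 50% of the target, above or below."""
        return abs(self.word_count - self.target_words) <= 0.5 * self.target_words


def passes_quality_bar(scores: list[RubricScore], minimum: float = 3.5) -> bool:
    """Gate on the yes/no checks, then compare the mean 1-5 score to the minimum."""
    if not scores:
        return False
    gates_ok = all(s.stays_in_character and s.within_word_budget() for s in scores)
    average = mean((s.substantive + s.topic_specific) / 2 for s in scores)
    return gates_ok and average >= minimum


# Hypothetical scores for two advisor outputs, for illustration only.
sample = [RubricScore(True, 4, 300, 300, 4), RubricScore(True, 3, 380, 300, 4)]
result = passes_quality_bar(sample)
```

Treating character and word count as gates means a single out-of-character response fails the run, however strong the other scores are.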

4d. Prompt Engineering for Cheaper Models

Cheaper models need tighter prompts to maintain quality. Key techniques:

  • Be more explicit about output format: Include examples, not just descriptions
  • Constrain output length more tightly: "Respond in exactly 3 paragraphs" vs "200-400 words"
  • Use structured output requests: Ask for numbered lists, specific headers
  • Front-load the most important instruction: Put the role and constraint first, context second
  • Include a quality check instruction: "Before responding, verify your output matches the requested format"
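These techniques can be combined in a small prompt template. The sketch below is one possible layout, not the council skill's actual prompt: build_advisor_prompt and its parameters are hypothetical, and the wording of each section is illustrative.

```python
def build_advisor_prompt(role: str, constraints: list[str],
                         example_output: str, context: str) -> str:
    """Assemble a prompt with role and constraints first, an example, then context."""
    lines = [f"You are the {role}.", "", "Constraints:"]
    lines += [f"{i}. {c}" for i, c in enumerate(constraints, start=1)]  # numbered, explicit
    lines += [
        "",
        "Example of the expected format:",
        example_output,
        "",
        "Context:",
        context,
        "",
        "Before responding, verify your output matches the requested format.",
    ]
    return "\n".join(lines)


prompt = build_advisor_prompt(
    role="Pragmatist advisor",
    constraints=["Respond in exactly 3 paragraphs", "Use the headers Risks and Next steps"],
    example_output="Risks: ...\nNext steps: ...",
    context="Topic: inference cost optimization",
)
```

Placing the context last keeps the role and constraints at the start of the prompt, which is the front-loading technique from the list.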

Implementation Order

Step 1: Config changes (Phases 1-3) — Do together, single commit

Apply all three config changes to ~/.openclaw/openclaw.json:

  • cacheRetention: "long" on Claude Opus and Sonnet models, "short" on Haiku
  • heartbeat.every: "25m" for main, "55m" default
  • contextPruning.mode: "cache-ttl" with ttl: "1h"

Restart gateway: openclaw gateway restart

Verify with /status and /usage full over next few interactions.
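Because all three fragments live under the same agents key, they need to be merged rather than pasted one after another. A minimal sketch of that merge is below, with the model list abbreviated to two entries. It is illustrative only: lists are replaced rather than merged, so an existing agents.list in the real file would be overwritten and should be combined by hand.

```python
import json
from copy import deepcopy


def deep_merge(base: dict, overlay: dict) -> dict:
    """Recursively merge overlay into a copy of base; non-dict values (lists included) are replaced."""
    merged = deepcopy(base)
    for key, value in overlay.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


caching = {"agents": {"defaults": {"models": {
    "litellm/copilot-claude-opus-4.6": {"params": {"cacheRetention": "long"}},
    "litellm/copilot-claude-sonnet-4.6": {"params": {"cacheRetention": "long"}},
}}}}
heartbeat = {"agents": {
    "defaults": {"heartbeat": {"every": "55m"}},
    "list": [{"id": "main", "heartbeat": {"every": "25m"}}],
}}
pruning = {"agents": {"defaults": {"contextPruning": {"mode": "cache-ttl", "ttl": "1h"}}}}

combined = deep_merge(deep_merge(caching, heartbeat), pruning)
print(json.dumps(combined, indent=2))
```

Reviewing the printed result before writing it to ~/.openclaw/openclaw.json keeps the change to a single, inspectable diff.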

Step 2: Quality test GLM-4.7 for subagent work

Run a single council advisor (e.g., Pragmatist) on a known topic using model: "zai/glm-4.7" in sessions_spawn. Compare output quality against the Sonnet run we already have saved.

Step 3: Update council skill for model tiers

If GLM-4.7 passes quality bar, update skills/council/SKILL.md and scripts/council.sh with the model tier routing table. Update references/prompts.md with tighter prompt variants for cheaper models if needed.

Step 4: Update AGENTS.md with subagent routing guidelines

Add a section documenting when to use which model tier for subagents, so the convention is followed consistently.

Step 5: Monitor and tune

  • Track cache hit rates over 1-2 days
  • Monitor if context pruning causes any information loss
  • Adjust heartbeat timing if cache misses are too frequent
  • Tune GLM-4.7 prompts based on observed output quality

What This Does NOT Change

  • No OpenClaw code changes: Phases 1-3 are config-only in openclaw.json; Phase 4 edits only workspace files (council skill, AGENTS.md)
  • No upstream divergence: All settings use documented OpenClaw config knobs
  • No new infrastructure: No proxy servers, routers, or middleware
  • Main session stays on Opus: Only subagents move to cheaper models
  • Fully reversible: Remove the config keys to revert to current behavior

Expected Combined Impact

| Optimization | Estimated Savings | Confidence |
| --- | --- | --- |
| Prompt caching | 40-60% input token cost reduction | High |
| Cache warming via heartbeat | Maintains cache savings across idle periods | High |
| Context pruning | 20-30% context size reduction for long sessions | Medium |
| Subagent model routing | 60-80% subagent cost reduction (free model for bulk work) | Medium (pending quality test) |

Combined: Significant reduction in Copilot quota burn. Main session quality unchanged. Subagent quality maintained through tighter prompts + quality verification.