From 6642964ae69fa68d3a877f63643cadda27f5b73c Mon Sep 17 00:00:00 2001
From: zap
Date: Thu, 5 Mar 2026 20:20:03 +0000
Subject: [PATCH] =?UTF-8?q?docs(cost):=20add=20inference=20cost=20optimiza?=
 =?UTF-8?q?tion=20plan=20=E2=80=94=204=20phases?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Phase 1: Enable prompt caching (cacheRetention: long on Claude models)
Phase 2: Heartbeat cache warming (25m main, 55m default)
Phase 3: Context pruning (cache-ttl mode, 1h TTL)
Phase 4: Cheaper models for subagents (GLM-4.7 free tier for bulk work)

All config-only, no OpenClaw code changes, fully reversible.
---
 memory/plans/inference-cost-optimization.md | 304 ++++++++++++++++++++
 1 file changed, 304 insertions(+)
 create mode 100644 memory/plans/inference-cost-optimization.md

diff --git a/memory/plans/inference-cost-optimization.md b/memory/plans/inference-cost-optimization.md
new file mode 100644
index 0000000..6b0970c
--- /dev/null
+++ b/memory/plans/inference-cost-optimization.md
@@ -0,0 +1,304 @@
+# Inference Cost Optimization Plan
+
+**Goal**: Reduce LLM inference costs without quality loss using OpenClaw's built-in configuration knobs and smarter subagent model selection. No code changes to OpenClaw: config-only, fully upstream-compatible.
+
+**Date**: 2026-03-05
+**Status**: Planning
+
+---
+
+## Current State
+
+| Item | Value |
+|------|-------|
+| Main session model | `litellm/copilot-claude-opus-4.6` (via GitHub Copilot) |
+| Default agent model | `litellm/copilot-claude-sonnet-4.6` |
+| Prompt caching | **NOT SET** (no `cacheRetention` configured) |
+| Context pruning | **NOT SET** (no `contextPruning` configured) |
+| Heartbeat | 30m (main agent only) |
+| Subagent model | Inherits session model (expensive!) |
+| Free models available | `zai/glm-4.7`, `zai/glm-4.7-flash`, `zai/glm-4.7-flashx`, `zai/glm-5` (all $0) |
+| Copilot models | Flat-rate via GitHub Copilot subscription (effectively $0 marginal cost per token) |
+
+### Cost Structure
+- **Copilot models** (litellm/copilot-*): Covered by the GitHub Copilot subscription, so there is no per-token cost, but usage is subject to rate limits and quotas. Using Opus when Sonnet suffices wastes quota.
+- **ZAI models** (zai/glm-*): Free tier, no per-token cost. Quality varies by task type.
+- The real "cost" is: (a) Copilot quota burn on expensive models, (b) latency, (c) quality risk on cheaper models.
+
+---
+
+## Phase 1: Enable Prompt Caching
+
+**What**: Configure `cacheRetention` on Anthropic-backed models so repeated system prompts and stable context get cached by the provider.
+
+**Why**: Our system prompt (AGENTS.md + SOUL.md + USER.md + TOOLS.md + IDENTITY.md + HEARTBEAT.md + skills list) is large and mostly static. Without caching, every turn reprocesses ~15-20k tokens of identical prefix. With caching, subsequent turns pay ~10% of the base input price for cached tokens (Anthropic pricing).
+
+**Config change** (`~/.openclaw/openclaw.json`):
+```json
+{
+  "agents": {
+    "defaults": {
+      "models": {
+        "litellm/copilot-claude-opus-4.6": {
+          "params": {
+            "cacheRetention": "long"
+          }
+        },
+        "litellm/copilot-claude-sonnet-4.6": {
+          "params": {
+            "cacheRetention": "long"
+          }
+        },
+        "litellm/copilot-claude-opus-4.5": {
+          "params": {
+            "cacheRetention": "long"
+          }
+        },
+        "litellm/copilot-claude-sonnet-4.5": {
+          "params": {
+            "cacheRetention": "long"
+          }
+        },
+        "litellm/copilot-claude-haiku-4.5": {
+          "params": {
+            "cacheRetention": "short"
+          }
+        }
+      }
+    }
+  }
+}
+```
+
+**Verification**:
+1. After applying, check `/status` or `/usage full` for `cacheRead` vs `cacheWrite` tokens.
+2. Enable cache trace diagnostics temporarily:
+   ```json
+   { "diagnostics": { "cacheTrace": { "enabled": true } } }
+   ```
+3. First turn will show high `cacheWrite` (populating cache). Subsequent turns should show high `cacheRead` with much lower `cacheWrite`.
+4. Target: >60% cache hit rate within 2-3 turns of a session.
+
+**Risk**: Zero. Caching doesn't change outputs; it's purely a provider-side optimization.
+
+**Expected impact**: 40-60% reduction in input token processing cost for sessions with multiple turns.
+
+---
+
+## Phase 2: Heartbeat Cache Warming
+
+**What**: Align the heartbeat interval to keep the prompt cache warm across idle gaps.
+
+**Why**: Anthropic's `long` cache retention has a ~1 hour TTL. Our current heartbeat is 30m, which is already well under the TTL. But we should ensure the heartbeat is a lightweight keep-warm that doesn't generate expensive cache writes.
+
+**Config change** (`~/.openclaw/openclaw.json`):
+```json
+{
+  "agents": {
+    "defaults": {
+      "heartbeat": {
+        "every": "55m"
+      }
+    },
+    "list": [
+      {
+        "id": "main",
+        "heartbeat": {
+          "every": "25m"
+        }
+      }
+    ]
+  }
+}
+```
+
+**Rationale**:
+- Main agent: set to 25m (well within the 1h TTL, ensures the cache stays warm during active use)
+- Other agents (claude, codex, copilot, opencode): 55m default (just under the 1h TTL, minimal quota burn when idle)
+- If an agent is rarely used, its heartbeat won't fire (disabled agents skip heartbeat)
+
+**Verification**:
+1. After a 30-minute idle gap, check that the next interaction shows `cacheRead` (not all `cacheWrite`).
+2. Monitor heartbeat token cost via `/usage full` on a heartbeat response.
+
+**Risk**: Low. A slightly more frequent heartbeat means slightly more baseline token usage, but the cache savings on real interactions outweigh this.
+
+**Expected impact**: Maintains the Phase 1 cache savings across idle periods instead of losing them after TTL expiry.
+
+---
+
+## Phase 3: Context Pruning
+
+**What**: Enable `cache-ttl` context pruning so old tool results and conversation history get pruned after the cache window expires.
+
+**Why**: Long sessions accumulate tool results, file reads, and old conversation turns that bloat the context. Without pruning, post-idle requests re-cache the entire oversized history. Cache-TTL pruning trims stale context so re-caching after idle is smaller and cheaper.
+
+**Config change** (`~/.openclaw/openclaw.json`):
+```json
+{
+  "agents": {
+    "defaults": {
+      "contextPruning": {
+        "mode": "cache-ttl",
+        "ttl": "1h"
+      }
+    }
+  }
+}
+```
+
+**Rationale**:
+- `cache-ttl` mode: prunes old tool-result context after the cache TTL expires
+- `ttl: "1h"`: matches Anthropic's `long` cache retention window
+- After 1h of no interaction, old tool results and conversation history are pruned, so the next request re-caches a smaller context
+
+**Verification**:
+1. Use `/context list` or `/context detail` to check context size before and after pruning.
+2. After a >1h idle gap, verify the context window is smaller than before the gap.
+3. Ensure no critical context is lost; compaction summaries should preserve key information.
+
+**Risk**: Low-medium. Pruning removes old tool results, which means the model can't reference exact earlier tool outputs after pruning. Compaction summaries mitigate this. Test by asking about earlier conversation after a pruning event.
+
+**Expected impact**: 20-30% reduction in context size for long sessions, which both reduces input token cost and improves response quality (less noise in context).
+
+---
+
+## Phase 4: Cheaper Models for Subagents
+
+**What**: Route subagent tasks to cheaper models based on task complexity, with quality verification.
+
+**Why**: Currently ALL subagents inherit the session model (Opus 4.6 or whatever the session is on). Most subagent tasks (council advisors, research queries, simple generation) don't need frontier-model quality. ZAI GLM-4.7 is free and handles many tasks well. Copilot Sonnet/Haiku are much cheaper quota-wise than Opus.
+
+### Model Tier Strategy
+
+| Tier | Model | Use Case | Cost |
+|------|-------|----------|------|
+| **Free** | `zai/glm-4.7` | Bulk subagent work: council advisors, brainstorming, summarization, classification | $0 |
+| **Free-fast** | `zai/glm-4.7-flash` | Simple/short subagent tasks: acknowledgments, formatting, quick lookups | $0 |
+| **Cheap** | `litellm/copilot-claude-haiku-4.5` | Tasks needing Claude quality but not heavy reasoning | Low quota |
+| **Standard** | `litellm/copilot-claude-sonnet-4.6` | Tasks needing strong reasoning, code generation, analysis | Medium quota |
+| **Frontier** | `litellm/copilot-claude-opus-4.6` | Only for: main session, referee/meta-arbiter, critical decisions | High quota |
+
+### Implementation
+
+#### 4a. Council Skill: Default to GLM-4.7
+
+Update the council skill to use cheaper models by default:
+
+| Council Role | Default Model | Override for `tier=heavy` |
+|-------------|---------------|--------------------------|
+| Personality advisors | `zai/glm-4.7` | `litellm/copilot-claude-sonnet-4.6` |
+| D/P Freethinkers | `zai/glm-4.7` | `litellm/copilot-claude-sonnet-4.6` |
+| D/P Arbiters | `zai/glm-4.7` | `litellm/copilot-claude-sonnet-4.6` |
+| Referee / Meta-Arbiter | `litellm/copilot-claude-sonnet-4.6` | `litellm/copilot-claude-opus-4.6` |
+
+When spawning subagents via `sessions_spawn`, pass the `model` parameter:
+```json
+{
+  "task": "...",
+  "mode": "run",
+  "label": "council-pragmatist",
+  "model": "zai/glm-4.7"
+}
+```
+
+#### 4b. General Subagent Routing Guidelines
+
+Encode these in AGENTS.md or a workspace convention file so all future subagent spawns follow the pattern:
+
+**Use `zai/glm-4.7` (free) when**:
+- Task is well-defined with clear constraints
+- Output format is specified in the prompt
+- Task is one of: summarization, brainstorming, classification, translation, formatting, simple Q&A
+- Task doesn't require tool use or complex multi-step reasoning
+
+**Use `litellm/copilot-claude-sonnet-4.6` (standard) when**:
+- Task requires nuanced reasoning or analysis
+- Task involves code generation or review
+- Output quality is user-facing and high-stakes
+- Task requires understanding subtle context
+
+**Use `litellm/copilot-claude-opus-4.6` (frontier) when**:
+- Main interactive session only
+- Final synthesis / referee / meta-arbiter roles
+- Tasks where the user explicitly asked for highest quality
+
+#### 4c. Quality Verification Strategy
+
+Before switching council and subagents to GLM-4.7, run a quality comparison:
+
+1. **Same-topic test**: Run the personality council on a topic we've already tested with Sonnet, but using GLM-4.7 for advisors. Compare output quality side by side.
+2. **Structured output test**: Verify GLM-4.7 follows prompt templates correctly (word count guidance, section headers, staying in role).
+3. **Scoring rubric**:
+   - Does the advisor stay in character? (yes/no)
+   - Is the output substantive (not generic platitudes)? (1-5)
+   - Does it follow word count guidance? (within 50% of target)
+   - Does it reference specific aspects of the topic? (1-5)
+4. **Minimum quality bar**: If GLM-4.7 scores ≥3.5/5 average on the rubric, it's good enough for advisor roles. Referee always stays on Sonnet+.
+
+#### 4d. Prompt Engineering for Cheaper Models
+
+Cheaper models need tighter prompts to maintain quality. Key techniques:
+
+- **Be more explicit about output format**: Include examples, not just descriptions
+- **Constrain output length more tightly**: "Respond in exactly 3 paragraphs" vs "200-400 words"
+- **Use structured output requests**: Ask for numbered lists, specific headers
+- **Front-load the most important instruction**: Put the role and constraint first, context second
+- **Include a quality check instruction**: "Before responding, verify your output matches the requested format"
+
+---
+
+## Implementation Order
+
+### Step 1: Config changes (Phases 1-3), applied together in a single commit
+
+Apply all three config changes to `~/.openclaw/openclaw.json`:
+- `cacheRetention: "long"` on Claude models
+- `heartbeat.every: "25m"` for main, `"55m"` default
+- `contextPruning.mode: "cache-ttl"` with `ttl: "1h"`
+
+Restart gateway: `openclaw gateway restart`
+
+Verify with `/status` and `/usage full` over the next few interactions.
+
+### Step 2: Quality test GLM-4.7 for subagent work
+
+Run a single council advisor (e.g., Pragmatist) on a known topic using `model: "zai/glm-4.7"` in `sessions_spawn`. Compare output quality against the Sonnet run we already have saved.
+
+### Step 3: Update council skill for model tiers
+
+If GLM-4.7 passes the quality bar, update `skills/council/SKILL.md` and `scripts/council.sh` with the model tier routing table. Update `references/prompts.md` with tighter prompt variants for cheaper models if needed.
+
+### Step 4: Update AGENTS.md with subagent routing guidelines
+
+Add a section documenting when to use which model tier for subagents, so the convention is followed consistently.
+
+### Step 5: Monitor and tune
+
+- Track cache hit rates over 1-2 days
+- Monitor if context pruning causes any information loss
+- Adjust heartbeat timing if cache misses are too frequent
+- Tune GLM-4.7 prompts based on observed output quality
+
+---
+
+## What This Does NOT Change
+
+- **No OpenClaw code changes**: Everything is config-only in `openclaw.json`
+- **No upstream divergence**: All settings use documented OpenClaw config knobs
+- **No new infrastructure**: No proxy servers, routers, or middleware
+- **Main session stays on Opus**: Only subagents move to cheaper models
+- **Fully reversible**: Remove the config keys to revert to current behavior
+
+---
+
+## Expected Combined Impact
+
+| Optimization | Estimated Savings | Confidence |
+|-------------|-------------------|------------|
+| Prompt caching | 40-60% input token reduction | High |
+| Cache warming via heartbeat | Maintains cache savings across idle | High |
+| Context pruning | 20-30% context size reduction for long sessions | Medium |
+| Subagent model routing | 60-80% subagent cost reduction (free model for bulk work) | Medium (pending quality test) |
+
+**Combined**: Significant reduction in Copilot quota burn. Main session quality unchanged. Subagent quality maintained through tighter prompts and quality verification.
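
The prompt-caching estimate in the impact table can be sanity-checked with rough arithmetic. The sketch below is illustrative only and is not part of OpenClaw. The function names are hypothetical. The 2.0 write multiplier is an assumption based on Anthropic's list pricing for one-hour cache writes, and the 0.1 read multiplier matches the ~10% figure in Phase 1. It also assumes that only the static prefix is cached and that everything else is billed at full price. Under the Copilot flat-rate subscription there is no per-token bill, so the result is best read as a proxy for processing and quota load.

```python
"""Back-of-envelope estimate of prompt-caching savings (illustrative only)."""


def relative_input_cost(prefix_tokens, fresh_tokens_per_turn, turns,
                        write_multiplier=2.0, read_multiplier=0.1):
    """Return (uncached, cached) input cost in base-price token equivalents.

    Simplified model (an assumption, not OpenClaw or provider behavior):
    - the static prefix is written to the cache on turn 1 and read on later turns;
    - fresh tokens (new messages, tool results) cost full price either way.
    """
    if turns < 1:
        raise ValueError("turns must be >= 1")
    uncached = turns * (prefix_tokens + fresh_tokens_per_turn)
    cached = (
        prefix_tokens * write_multiplier                  # one cache write
        + (turns - 1) * prefix_tokens * read_multiplier   # later cache reads
        + turns * fresh_tokens_per_turn                   # uncached remainder
    )
    return uncached, cached


def savings_fraction(uncached, cached):
    """Fraction of input cost saved by caching (negative if caching costs more)."""
    return 1.0 - cached / uncached


if __name__ == "__main__":
    # Hypothetical session: 18k-token static prefix, 12k fresh tokens per turn.
    uncached, cached = relative_input_cost(18_000, 12_000, turns=10)
    print(f"uncached={uncached:.0f} cached={cached:.0f} "
          f"savings={savings_fraction(uncached, cached):.0%}")
```

With these hypothetical inputs the saving comes out at roughly 43%, inside the 40-60% range above. With `turns=1` the cached cost is higher than the uncached one, which is why the estimate applies to multi-turn sessions only.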