# Inference Cost Optimization Plan

**Goal**: Reduce LLM inference costs without quality loss using OpenClaw's built-in configuration knobs plus smarter subagent model selection. No code changes to OpenClaw: config-only, fully upstream-compatible.

**Date**: 2026-03-05
**Status**: Planning

---

## Current State

| Item | Value |
|------|-------|
| Main session model | `litellm/copilot-claude-opus-4.6` (via GitHub Copilot) |
| Default agent model | `litellm/copilot-claude-sonnet-4.6` |
| Prompt caching | **NOT SET** (no `cacheRetention` configured) |
| Context pruning | **NOT SET** (no `contextPruning` configured) |
| Heartbeat | 30m (main agent only) |
| Subagent model | Inherits session model (expensive!) |
| Free models available | `zai/glm-4.7`, `zai/glm-4.7-flash`, `zai/glm-4.7-flashx`, `zai/glm-5` (all $0) |
| Copilot models | Flat-rate via GitHub Copilot subscription (effectively $0 marginal cost per token) |

### Cost Structure
- **Copilot models** (`litellm/copilot-*`): Covered by the GitHub Copilot subscription. No per-token cost, but subject to rate limits and quotas. Using Opus when Sonnet suffices wastes quota.
- **ZAI models** (`zai/glm-*`): Free tier, no per-token cost. Quality varies by task type.
- The real "cost" is: (a) Copilot quota burn on expensive models, (b) latency, (c) quality risk on cheaper models.

---

## Phase 1: Enable Prompt Caching

**What**: Configure `cacheRetention` on Anthropic-backed models so repeated system prompts and stable context get cached by the provider.

**Why**: Our system prompt (AGENTS.md + SOUL.md + USER.md + TOOLS.md + IDENTITY.md + HEARTBEAT.md + skills list) is large and mostly static. Without caching, every turn reprocesses ~15-20k tokens of identical prefix. With caching, subsequent turns pay ~10% of the base input price for cached tokens (Anthropic pricing).

**Config change** (`~/.openclaw/openclaw.json`):
```json
{
  "agents": {
    "defaults": {
      "models": {
        "litellm/copilot-claude-opus-4.6": {
          "params": {
            "cacheRetention": "long"
          }
        },
        "litellm/copilot-claude-sonnet-4.6": {
          "params": {
            "cacheRetention": "long"
          }
        },
        "litellm/copilot-claude-opus-4.5": {
          "params": {
            "cacheRetention": "long"
          }
        },
        "litellm/copilot-claude-sonnet-4.5": {
          "params": {
            "cacheRetention": "long"
          }
        },
        "litellm/copilot-claude-haiku-4.5": {
          "params": {
            "cacheRetention": "short"
          }
        }
      }
    }
  }
}
```

**Verification**:
1. After applying, check `/status` or `/usage full` for `cacheRead` vs `cacheWrite` tokens.
2. Enable cache trace diagnostics temporarily:
   ```json
   { "diagnostics": { "cacheTrace": { "enabled": true } } }
   ```
3. The first turn will show high `cacheWrite` (populating the cache). Subsequent turns should show high `cacheRead` with much lower `cacheWrite`.
4. Target: >60% cache hit rate within 2-3 turns of a session.

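
The >60% target needs a definition of "hit rate", which the plan does not spell out. A minimal sketch, assuming the rate is cached-read tokens divided by all input tokens for a turn. The `input` field name and the sample numbers are hypothetical; only `cacheRead` and `cacheWrite` come from the steps above.

```python
def cache_hit_rate(cache_read: int, cache_write: int, uncached_input: int = 0) -> float:
    """Fraction of a turn's input tokens that were served from the provider cache."""
    total = cache_read + cache_write + uncached_input
    if total == 0:
        return 0.0
    return cache_read / total


# Hypothetical per-turn counters, copied by hand from `/usage full`.
turns = [
    {"cacheRead": 0, "cacheWrite": 18000, "input": 400},      # first turn populates the cache
    {"cacheRead": 18000, "cacheWrite": 900, "input": 300},
    {"cacheRead": 18900, "cacheWrite": 700, "input": 350},
]

rates = [cache_hit_rate(t["cacheRead"], t["cacheWrite"], t["input"]) for t in turns]
meets_target = rates[-1] > 0.60  # target from the verification list
```

If the real counters are reported per session rather than per turn, the same function can be applied to the running totals instead.
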
**Risk**: Zero. Caching doesn't change outputs; it's purely a provider-side optimization.

**Expected impact**: 40-60% reduction in input token processing cost for sessions with multiple turns.

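
The expected-impact range can be sanity-checked with a little arithmetic. The sketch below models only the static prompt prefix and assumes a cache-read multiplier of 0.10 (the ~10% figure above) and a cache-write multiplier of 2.0 for the long retention window. The write multiplier is an assumption to confirm against current provider pricing, and on flat-rate Copilot models the result is better read as relative quota pressure than as dollars.

```python
def relative_prefix_cost(turns: int, read_mult: float = 0.10, write_mult: float = 2.0) -> float:
    """Cost of the static prefix with caching, relative to no caching.

    Without caching, every turn pays full price for the prefix.
    With caching, turn 1 pays the write multiplier and later turns pay the read multiplier.
    """
    if turns < 1:
        raise ValueError("turns must be >= 1")
    uncached = float(turns)                        # 1.0 per turn
    cached = write_mult + read_mult * (turns - 1)
    return cached / uncached


# Relative cost after 1, 3, 5 and 10 turns (below 1.0 means caching is cheaper).
table = {n: relative_prefix_cost(n) for n in (1, 3, 5, 10)}
```

Under these assumptions a single-turn session is more expensive with caching, break-even arrives at the third turn, and a five-turn session pays roughly half for the prefix, which is consistent with the 40-60% estimate for multi-turn sessions.
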

---

## Phase 2: Heartbeat Cache Warming

**What**: Align the heartbeat interval to keep the prompt cache warm across idle gaps.

**Why**: Anthropic's cache under `long` retention has a ~1 hour TTL. Our current heartbeat is 30m, which is already well under the TTL. We should still ensure the heartbeat is a lightweight keep-warm that doesn't generate expensive cache writes.

**Config change** (`~/.openclaw/openclaw.json`):
```json
{
  "agents": {
    "defaults": {
      "heartbeat": {
        "every": "55m"
      }
    },
    "list": [
      {
        "id": "main",
        "heartbeat": {
          "every": "25m"
        }
      }
    ]
  }
}
```

**Rationale**:
- Main agent: set to 25m (well within the 1h TTL, so the cache stays warm during active use)
- Other agents (claude, codex, copilot, opencode): 55m default (just under the 1h TTL, minimal quota burn when idle)
- Disabled agents skip heartbeats, so an agent that is switched off adds no heartbeat cost

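
The intervals above only help if they stay below the cache TTL. A small sketch of that check, assuming durations are written as an integer followed by `s`, `m` or `h` (the form used in this plan; OpenClaw's own parser may accept more). The function names are illustrative, not part of OpenClaw.

```python
import re

_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}


def parse_duration(text: str) -> int:
    """Convert strings such as "25m" or "1h" to seconds."""
    match = re.fullmatch(r"(\d+)([smh])", text.strip())
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    return int(match.group(1)) * _UNIT_SECONDS[match.group(2)]


def keeps_cache_warm(heartbeat: str, cache_ttl: str = "1h") -> bool:
    """True when the heartbeat fires before the cache entry would expire."""
    return parse_duration(heartbeat) < parse_duration(cache_ttl)


intervals = {"main": "25m", "defaults": "55m"}
warm = {name: keeps_cache_warm(every) for name, every in intervals.items()}
```

The 55m default leaves only a five-minute margin, so any scheduling jitter in the heartbeat could let the cache lapse; that is worth watching during verification.
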
**Verification**:
1. After a 30-minute idle gap, check that the next interaction shows `cacheRead` (not all `cacheWrite`).
2. Monitor heartbeat token cost via `/usage full` on a heartbeat response.

**Risk**: Low. A slightly more frequent heartbeat means slightly more baseline token usage, but the cache savings on real interactions outweigh this.

**Expected impact**: Maintains the Phase 1 cache savings across idle periods instead of losing them after TTL expiry.

---

## Phase 3: Context Pruning

**What**: Enable `cache-ttl` context pruning so old tool results and conversation history get pruned after the cache window expires.

**Why**: Long sessions accumulate tool results, file reads, and old conversation turns that bloat the context. Without pruning, post-idle requests re-cache the entire oversized history. Cache-TTL pruning trims stale context so re-caching after idle is smaller and cheaper.

**Config change** (`~/.openclaw/openclaw.json`):
```json
{
  "agents": {
    "defaults": {
      "contextPruning": {
        "mode": "cache-ttl",
        "ttl": "1h"
      }
    }
  }
}
```

**Rationale**:
- `cache-ttl` mode: prunes old tool-result context after the cache TTL expires
- `ttl: "1h"`: matches Anthropic's `long` cache retention window
- After 1h of no interaction, old tool results and conversation history are pruned, so the next request re-caches a smaller context

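
To make the idea concrete, here is a toy model of TTL-gated pruning. It is a conceptual illustration only and not OpenClaw's actual algorithm: it assumes that once the session has been idle for at least the TTL, tool results are dropped and everything else is kept. The `Entry` type and token counts are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Entry:
    kind: str    # "user", "assistant", or "tool_result"
    tokens: int


def prune_after_idle(entries: List[Entry], idle_seconds: int, ttl_seconds: int = 3600) -> List[Entry]:
    """Drop tool results once the session has idled past the cache TTL."""
    if idle_seconds < ttl_seconds:
        return list(entries)  # cache is still warm, so leave the context untouched
    return [e for e in entries if e.kind != "tool_result"]


history = [
    Entry("user", 200),
    Entry("tool_result", 6000),
    Entry("assistant", 500),
    Entry("tool_result", 1500),
]

before = sum(e.tokens for e in history)
after = sum(e.tokens for e in prune_after_idle(history, idle_seconds=4000))
```

The point of gating on idle time is that pruning while the cache is warm would change the cached prefix and force a fresh cache write, whereas pruning after expiry costs nothing extra because the context has to be re-cached anyway.
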
**Verification**:
1. Use `/context list` or `/context detail` to check context size before and after pruning.
2. After a >1h idle gap, verify the context window is smaller than before the gap.
3. Ensure no critical context is lost. Compaction summaries should preserve key information.

**Risk**: Low-medium. Pruning removes old tool results, which means the model can't reference exact earlier tool outputs after pruning. Compaction summaries mitigate this. Test by asking about earlier conversation after a pruning event.

**Expected impact**: 20-30% reduction in context size for long sessions, which both reduces input token cost and improves response quality (less noise in context).

---

## Phase 4: Cheaper Models for Subagents

**What**: Route subagent tasks to cheaper models based on task complexity, with quality verification.

**Why**: Currently ALL subagents inherit the session model (Opus 4.6 or whatever the session is on). Most subagent tasks (council advisors, research queries, simple generation) don't need frontier-model quality. ZAI GLM-4.7 is free and handles many tasks well. Copilot Sonnet/Haiku are much cheaper quota-wise than Opus.

### Model Tier Strategy

| Tier | Model | Use Case | Cost |
|------|-------|----------|------|
| **Free** | `zai/glm-4.7` | Bulk subagent work: council advisors, brainstorming, summarization, classification | $0 |
| **Free-fast** | `zai/glm-4.7-flash` | Simple/short subagent tasks: acknowledgments, formatting, quick lookups | $0 |
| **Cheap** | `litellm/copilot-claude-haiku-4.5` | Tasks needing Claude quality but not heavy reasoning | Low quota |
| **Standard** | `litellm/copilot-claude-sonnet-4.6` | Tasks needing strong reasoning, code generation, analysis | Medium quota |
| **Frontier** | `litellm/copilot-claude-opus-4.6` | Only for: main session, referee/meta-arbiter, critical decisions | High quota |

### Implementation

#### 4a. Council Skill: Default to GLM-4.7

Update the council skill to use cheaper models by default:

| Council Role | Default Model | Override for `tier=heavy` |
|-------------|---------------|--------------------------|
| Personality advisors | `zai/glm-4.7` | `litellm/copilot-claude-sonnet-4.6` |
| D/P Freethinkers | `zai/glm-4.7` | `litellm/copilot-claude-sonnet-4.6` |
| D/P Arbiters | `zai/glm-4.7` | `litellm/copilot-claude-sonnet-4.6` |
| Referee / Meta-Arbiter | `litellm/copilot-claude-sonnet-4.6` | `litellm/copilot-claude-opus-4.6` |

When spawning subagents via `sessions_spawn`, pass the `model` parameter:
```json
{
  "task": "...",
  "mode": "run",
  "label": "council-pragmatist",
  "model": "zai/glm-4.7"
}
```

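The role-to-model table and the spawn payload can be tied together with a small lookup. This is a sketch under assumptions: the role keys (`personality-advisor`, `dp-freethinker`, and so on) and the `spawn_request` helper are hypothetical names chosen for illustration, while the payload fields and model IDs are the ones shown above.

```python
GLM = "zai/glm-4.7"
SONNET = "litellm/copilot-claude-sonnet-4.6"
OPUS = "litellm/copilot-claude-opus-4.6"

# role -> (default model, model for tier=heavy); keys are illustrative names.
COUNCIL_MODELS = {
    "personality-advisor": (GLM, SONNET),
    "dp-freethinker": (GLM, SONNET),
    "dp-arbiter": (GLM, SONNET),
    "referee": (SONNET, OPUS),
}


def spawn_request(role: str, task: str, label: str, tier: str = "default") -> dict:
    """Build the argument object for a `sessions_spawn` call."""
    default_model, heavy_model = COUNCIL_MODELS[role]
    return {
        "task": task,
        "mode": "run",
        "label": label,
        "model": heavy_model if tier == "heavy" else default_model,
    }


request = spawn_request("personality-advisor", "...", "council-pragmatist")
```

Keeping the table in one place means a failed quality test only requires changing the default column, not every spawn site.
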
#### 4b. General Subagent Routing Guidelines

Encode these in AGENTS.md or a workspace convention file so all future subagent spawns follow the pattern:

**Use `zai/glm-4.7` (free) when**:
- Task is well-defined with clear constraints
- Output format is specified in the prompt
- Task is one of: summarization, brainstorming, classification, translation, formatting, simple Q&A
- Task doesn't require tool use or complex multi-step reasoning

**Use `litellm/copilot-claude-sonnet-4.6` (standard) when**:
- Task requires nuanced reasoning or analysis
- Task involves code generation or review
- Output quality is user-facing and high-stakes
- Task requires understanding subtle context

**Use `litellm/copilot-claude-opus-4.6` (frontier) when**:
- Main interactive session only
- Final synthesis / referee / meta-arbiter roles
- Tasks where the user explicitly asked for highest quality

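
These guidelines are meant to be followed by an agent reading AGENTS.md, but writing them as a decision function makes the precedence explicit. A minimal sketch, assuming frontier conditions win first, the free tier applies only when every free-tier condition holds, and Sonnet is the fallback. The `Task` fields are hypothetical.

```python
from dataclasses import dataclass

GLM = "zai/glm-4.7"
SONNET = "litellm/copilot-claude-sonnet-4.6"
OPUS = "litellm/copilot-claude-opus-4.6"

SIMPLE_KINDS = {
    "summarization", "brainstorming", "classification",
    "translation", "formatting", "simple-qa",
}


@dataclass
class Task:
    kind: str
    needs_tools: bool = False
    multi_step: bool = False
    high_stakes: bool = False
    is_final_synthesis: bool = False    # referee / meta-arbiter roles
    user_wants_best: bool = False


def choose_model(task: Task) -> str:
    """Pick a subagent model following the routing guidelines."""
    if task.is_final_synthesis or task.user_wants_best:
        return OPUS
    simple = task.kind in SIMPLE_KINDS
    if simple and not (task.needs_tools or task.multi_step or task.high_stakes):
        return GLM
    return SONNET
```

Defaulting to Sonnet when a task does not clearly fit the free tier trades some quota for a lower risk of silent quality loss.
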
#### 4c. Quality Verification Strategy

Before switching council and subagents to GLM-4.7, run a quality comparison:

1. **Same-topic test**: Run the personality council on a topic we've already tested with Sonnet, but using GLM-4.7 for advisors. Compare output quality side by side.
2. **Structured output test**: Verify GLM-4.7 follows prompt templates correctly (word count guidance, section headers, staying in role).
3. **Scoring rubric**:
   - Does the advisor stay in character? (yes/no)
   - Is the output substantive (not generic platitudes)? (1-5)
   - Does it follow word count guidance? (within 50% of target)
   - Does it reference specific aspects of the topic? (1-5)
4. **Minimum quality bar**: If GLM-4.7 averages ≥3.5/5 on the rubric, it's good enough for advisor roles. The referee always stays on Sonnet or better.

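
The rubric mixes yes/no checks with 1-5 scores, so the ≥3.5 average needs an interpretation. The sketch below assumes one reasonable reading: the yes/no items (in character, word count within 50% of target) are hard gates, and the average is taken over the two 1-5 items across all scored outputs. The type and function names are illustrative.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RubricScore:
    in_character: bool
    substantive: int        # 1-5
    topic_specific: int     # 1-5
    actual_words: int
    target_words: int


def passes_quality_bar(scores: List[RubricScore], threshold: float = 3.5) -> bool:
    """Apply the yes/no gates, then compare the mean 1-5 score to the threshold."""
    if not scores:
        return False
    for s in scores:
        if not s.in_character:
            return False
        if abs(s.actual_words - s.target_words) > 0.5 * s.target_words:
            return False
    graded = [value for s in scores for value in (s.substantive, s.topic_specific)]
    return sum(graded) / len(graded) >= threshold


sample = [
    RubricScore(True, substantive=4, topic_specific=4, actual_words=280, target_words=300),
    RubricScore(True, substantive=3, topic_specific=4, actual_words=350, target_words=300),
]
verdict = passes_quality_bar(sample)  # mean of 4, 4, 3, 4 is 3.75
```

If the gates prove too strict in practice, they could be converted to score penalties instead, but that would change the meaning of the 3.5 threshold.
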
#### 4d. Prompt Engineering for Cheaper Models

Cheaper models need tighter prompts to maintain quality. Key techniques:

- **Be more explicit about output format**: Include examples, not just descriptions
- **Constrain output length more tightly**: "Respond in exactly 3 paragraphs" rather than "200-400 words"
- **Use structured output requests**: Ask for numbered lists, specific headers
- **Front-load the most important instruction**: Put the role and constraint first, context second
- **Include a quality check instruction**: "Before responding, verify your output matches the requested format"

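
One way to apply these techniques consistently is to assemble advisor prompts from a fixed template. The sketch below is illustrative only: `build_advisor_prompt` and its parameters are hypothetical, and the wording of each section would need tuning against real GLM-4.7 output.

```python
def build_advisor_prompt(role: str, constraint: str, topic: str,
                         example: str, paragraphs: int = 3) -> str:
    """Assemble a prompt with role and constraint first, context second."""
    sections = [
        # Front-load the role and the most important constraint.
        f"You are the {role}. {constraint}",
        # Tight, countable length and structure instead of a word range.
        f"Respond in exactly {paragraphs} paragraphs, each under its own numbered header.",
        # Show the format rather than only describing it.
        f"Example of the expected format:\n{example}",
        # Context comes after the instructions.
        f"Topic:\n{topic}",
        # Closing self-check instruction.
        "Before responding, verify your output matches the requested format.",
    ]
    return "\n\n".join(sections)


prompt = build_advisor_prompt(
    role="Pragmatist",
    constraint="Argue only from practical cost and effort.",
    topic="Should we move council advisors to a free model?",
    example="1. Cost\n<one paragraph>",
)
```

A template like this also makes the Sonnet comparison fairer, since both models can be given the same tightened prompt.
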

---

## Implementation Order

### Step 1: Config changes (Phases 1-3), done together in a single commit

Apply all three config changes to `~/.openclaw/openclaw.json`:
- `cacheRetention: "long"` on Claude models
- `heartbeat.every: "25m"` for main, `"55m"` default
- `contextPruning.mode: "cache-ttl"` with `ttl: "1h"`

Restart gateway: `openclaw gateway restart`

Verify with `/status` and `/usage full` over the next few interactions.

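
Because all three fragments live under `agents.defaults`, they have to be merged rather than pasted one after another. A minimal sketch of that merge, with the caching fragment shortened to one model for brevity. Note the stated limitation: this helper replaces lists wholesale, so an existing `agents.list` in the real file would need to be merged by hand.

```python
def deep_merge(base: dict, patch: dict) -> dict:
    """Recursively merge `patch` into a copy of `base`; non-dict values are replaced."""
    merged = dict(base)
    for key, value in patch.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value  # includes lists, which are NOT merged element-wise
    return merged


caching = {"agents": {"defaults": {"models": {
    "litellm/copilot-claude-opus-4.6": {"params": {"cacheRetention": "long"}},
}}}}
heartbeat = {"agents": {
    "defaults": {"heartbeat": {"every": "55m"}},
    "list": [{"id": "main", "heartbeat": {"every": "25m"}}],
}}
pruning = {"agents": {"defaults": {"contextPruning": {"mode": "cache-ttl", "ttl": "1h"}}}}

combined: dict = {}
for fragment in (caching, heartbeat, pruning):
    combined = deep_merge(combined, fragment)
```

The resulting `combined["agents"]["defaults"]` holds `models`, `heartbeat` and `contextPruning` side by side, which is the shape the single commit should produce.
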
### Step 2: Quality test GLM-4.7 for subagent work

Run a single council advisor (e.g., Pragmatist) on a known topic using `model: "zai/glm-4.7"` in `sessions_spawn`. Compare output quality against the Sonnet run we already have saved.

### Step 3: Update council skill for model tiers

If GLM-4.7 passes the quality bar, update `skills/council/SKILL.md` and `scripts/council.sh` with the model tier routing table. Update `references/prompts.md` with tighter prompt variants for cheaper models if needed.

### Step 4: Update AGENTS.md with subagent routing guidelines

Add a section documenting when to use which model tier for subagents, so the convention is followed consistently.

### Step 5: Monitor and tune

- Track cache hit rates over 1-2 days
- Monitor if context pruning causes any information loss
- Adjust heartbeat timing if cache misses are too frequent
- Tune GLM-4.7 prompts based on observed output quality

---

## Upstream Safety Rules

These are hard constraints. Any implementation that violates them is out of scope.

### ❌ Never do
- Edit files under `~/.npm-global/lib/node_modules/openclaw/` directly (dist, src, docs)
- Patch or monkey-patch OpenClaw's runtime code, even for emergencies (exception: the existing TUI patch has a tracked upstream PR; document any new ones immediately)
- Add config keys not documented in OpenClaw's own docs (guessing at undocumented keys can silently break on upgrade)
- Modify `~/.openclaw/openclaw.json` in a way that would be overwritten or invalidated by `openclaw update`
- Introduce any middleware, proxy, or hook that intercepts OpenClaw's internal request path

### ✅ Safe to do
- Edit `~/.openclaw/openclaw.json` using documented config knobs (agents, models, diagnostics, contextPruning, etc.)
- Add/edit workspace files (`~/.openclaw/workspace/`) freely; these are never touched by OpenClaw updates
- Install/update skills via `clawhub`; skills are workspace-local
- Run `openclaw gateway restart` after config changes
- Use `openclaw update status` / `scripts/openclaw-update-safe.sh` to check for upstream updates

### Checking before applying
Before implementing any config change:
1. Verify the key exists in `/home/openclaw/.npm-global/lib/node_modules/openclaw/docs/` or `https://docs.openclaw.ai`
2. If undocumented: skip it or open a question/issue; don't guess
3. After `openclaw update`, re-verify config keys still work (check gateway logs for config parse errors)

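
Step 1 of this checklist can be partly automated by searching the installed docs for the key name. A minimal sketch, assuming the docs directory contains Markdown files (`*.md`), which has not been verified here. A hit only shows the key is mentioned somewhere, so the matching page still needs to be read.

```python
from pathlib import Path
from typing import List


def documented_in(docs_dir: str, key: str) -> List[Path]:
    """Return the Markdown files under `docs_dir` that mention `key` verbatim."""
    hits = []
    for path in sorted(Path(docs_dir).rglob("*.md")):
        text = path.read_text(encoding="utf-8", errors="ignore")
        if key in text:
            hits.append(path)
    return hits
```

For example, `documented_in("/home/openclaw/.npm-global/lib/node_modules/openclaw/docs", "cacheRetention")` would list the pages to review; an empty list means the key should be treated as undocumented and skipped.
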
### Update workflow
```bash
# Before updating OpenClaw
openclaw update status     # check what version is available
# Review changelog for breaking config changes
openclaw update            # update (safe scripts handle local compat)
openclaw gateway restart   # restart to pick up new version
# Verify gateway health + session model still resolves correctly
```

## What This Does NOT Change

- **No OpenClaw code changes**: Everything is config in `openclaw.json` plus workspace files
- **No upstream divergence**: All settings use documented OpenClaw config knobs
- **No new infrastructure**: No proxy servers, routers, or middleware
- **Main session stays on Opus**: Only subagents move to cheaper models
- **Fully reversible**: Remove the config keys to revert to current behavior

---

## Expected Combined Impact

| Optimization | Estimated Savings | Confidence |
|-------------|-------------------|------------|
| Prompt caching | 40-60% input token reduction | High |
| Cache warming via heartbeat | Maintains cache savings across idle | High |
| Context pruning | 20-30% context size reduction for long sessions | Medium |
| Subagent model routing | 60-80% reduction in subagent cost (free model for bulk work) | Medium (pending quality test) |

**Combined**: Significant reduction in Copilot quota burn. Main session quality unchanged. Subagent quality maintained through tighter prompts + quality verification.