Rewrite LLM routing with local-first principles

- Privacy/confidentiality: local LLMs only for sensitive data
- Check availability before using local (may not be running)
- Long-running tasks: local (no API costs/limits)
- Multi-agent parallelism for speed
- Fallback patterns when local is down
# LLM Routing Guide
Use the right model for the job. **Local first** when possible.
## Core Principles
1. **Privacy/Confidentiality** → Local LLMs (data never leaves the machine)
2. **Long-running tasks** → Local LLMs (no API costs, no rate limits)
3. **Parallelism** → Multi-agent with local LLMs (spawn multiple workers)
4. **Check availability** → Local LLMs may not always be running
5. **Cost efficiency** → Local → Copilot → Cloud APIs
## Available Resources
### Local (llama-swap @ :8080)
```bash
# Check if running
curl -sf http://127.0.0.1:8080/health >/dev/null && echo "UP" || echo "DOWN"
# List loaded models
curl -s http://127.0.0.1:8080/v1/models | jq -r '.data[].id'
```
| Alias | Model | Best For |
|-------|-------|----------|
| `gemma` | Gemma-3-12B | Fast, balanced, fits fully in VRAM |
| `qwen3` | Qwen3-30B-A3B | General purpose, quality |
| `coder` | Qwen3-Coder-30B | Code specialist |
| `glm` | GLM-4.7-Flash | Fast reasoning |
| `reasoning` | Ministral-3-14B | Reasoning tasks |
| `gpt-oss` | GPT-OSS-20B | Experimental |
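
A minimal call sketch, assuming llama-swap fronts the usual OpenAI-compatible `/v1/chat/completions` route (the same endpoint used in the examples below) and picks the backing model from the `model` field; the prompt and `jq` path are illustrative:

```bash
# Ask the local coder alias a question; llama-swap swaps the underlying model in on demand.
curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "coder", "messages": [{"role": "user", "content": "Write a bash one-liner that counts TODO comments"}]}' \
  | jq -r '.choices[0].message.content'
```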
### Homelab (Ollama @ 100.85.116.57:11434)
- Larger models, more capacity
- Still private (your network)
- Check: `curl -s http://100.85.116.57:11434/api/tags | jq '.models[].name'`
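
A hedged example against Ollama's standard `/api/chat` endpoint; the model name below is a placeholder, so substitute whatever the tags check above reports:

```bash
# Non-streaming chat with the homelab node; "llama3.1:70b" is illustrative only.
curl -s http://100.85.116.57:11434/api/chat \
  -H 'Content-Type: application/json' \
  -d '{"model": "llama3.1:70b", "messages": [{"role": "user", "content": "Compare ZFS and btrfs snapshots in three bullets"}], "stream": false}' \
  | jq -r '.message.content'
```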
### GitHub Copilot (via opencode)
- "Free" with subscription
- Good for one-shot tasks
- Has internet access
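
One-shot example, reusing the `github-copilot/...` model naming that appears elsewhere in this guide; the prompt is illustrative:

```bash
# Internet-connected one-shot; fine for non-sensitive questions.
opencode run -m github-copilot/claude-haiku-4.5 "explain the difference between 'docker compose up' and 'docker compose up -d'"
```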
### Cloud APIs (Clawdbot/me)
- Most capable (opus, sonnet)
- Best tool integration
- Paid per token
## When to Use What
### 🔒 Sensitive/Private Data → **LOCAL ONLY**
```bash
# Credentials, personal info, proprietary code
curl http://127.0.0.1:8080/v1/chat/completions \
-d '{"model": "qwen3", "messages": [{"role": "user", "content": "Review this private config..."}]}'
```
**Never send sensitive data to cloud APIs.**
### ⏱️ Long-Running Tasks → **LOCAL**
```bash
# Analysis that takes minutes, batch processing
curl http://127.0.0.1:8080/v1/chat/completions \
-d '{"model": "coder", "messages": [...], "max_tokens": 4096}'
```
- No API timeouts
- No rate limits
- No cost accumulation
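
A batch-processing sketch under the same assumptions (the `coder` alias from the table above; `jq -n` is only there to JSON-escape file contents; paths are illustrative):

```bash
# Review every file in src/ with the local coder model; runtime is free, so let it churn.
mkdir -p reviews
for f in src/*.py; do
  jq -n --arg code "$(cat "$f")" \
     '{model: "coder", max_tokens: 4096,
       messages: [{role: "user", content: ("Review this file for bugs:\n\n" + $code)}]}' |
    curl -s http://127.0.0.1:8080/v1/chat/completions \
      -H 'Content-Type: application/json' -d @- |
    jq -r '.choices[0].message.content' > "reviews/$(basename "$f").md"
done
```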
### 🚀 Parallel Work → **MULTI-AGENT LOCAL**
When speed matters, spawn multiple workers:
```bash
# Flynn can spawn sub-agents hitting local LLMs
# Each agent works independently, results merge
```
- Use for: bulk analysis, multi-file processing, research tasks
- Coordinate via sessions_spawn with local model routing
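
Outside of `sessions_spawn`, the same fan-out can be sketched with plain background jobs against the local endpoint (prompts and output files below are illustrative):

```bash
# Two independent analyses running concurrently; collect and merge once both return.
curl -s http://127.0.0.1:8080/v1/chat/completions -H 'Content-Type: application/json' \
  -d '{"model": "gemma", "messages": [{"role": "user", "content": "List the risks in rollout plan A"}]}' > a.json &
curl -s http://127.0.0.1:8080/v1/chat/completions -H 'Content-Type: application/json' \
  -d '{"model": "gemma", "messages": [{"role": "user", "content": "List the risks in rollout plan B"}]}' > b.json &
wait
jq -r '.choices[0].message.content' a.json b.json
```

How much real concurrency this buys depends on how many parallel slots the local server is configured for; worst case the requests just queue, which still costs nothing per token.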
### ⚡ Quick One-Shot → **COPILOT or LOCAL**
```bash
# If local is up, prefer it
curl -sf http://127.0.0.1:8080/health && \
curl http://127.0.0.1:8080/v1/chat/completions -d '{"model": "gemma", ...}' || \
opencode run -m github-copilot/claude-haiku-4.5 "quick question"
```
### 🧠 Complex Reasoning → **CLOUD (opus)**
- Multi-step orchestration
- Tool use (browser, APIs, messaging)
- When quality > cost
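
Example invocation (the prompt is illustrative):

```bash
claude -p --model opus "given these constraints, design the rollout plan and call out the tradeoffs"
```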
### 📚 Massive Context (>32k) → **GEMINI**
```bash
cat huge_file.md | gemini -m gemini-2.5-pro "analyze"
```
## Availability Check Pattern
Before using local LLMs, verify they're up:
```bash
# Quick check
llama_up() { curl -sf http://127.0.0.1:8080/health >/dev/null; }
# Use with fallback
if llama_up; then
  curl http://127.0.0.1:8080/v1/chat/completions -d '{"model": "gemma", ...}'
else
  opencode run -m github-copilot/claude-haiku-4.5 "..."
fi
```
For Flynn: check `curl -sf http://127.0.0.1:8080/health` before routing to local.
## Service Management
```bash
# Start local LLMs
systemctl --user start llama-swap
# Stop (save GPU for gaming)
systemctl --user stop llama-swap
# Check status
systemctl --user status llama-swap
```
## Model Selection Matrix
| Scenario | First Choice | Fallback |
|----------|--------------|----------|
| Private data | `qwen3` (local) | — (no fallback) |
| Long task | `coder` (local) | Homelab Ollama |
| Quick question | `gemma` (local) | `haiku` (Copilot) |
| Code review | `coder` (local) | `sonnet` (Copilot) |
| Complex reasoning | `opus` (cloud) | `qwen3` (local) |
| Bulk processing | Multi-agent local | — |
| 100k+ context | `gemini-2.5-pro` | — |
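
The matrix can be folded into a small dispatcher. This is a hypothetical sketch (`route` and `local_chat` are illustrative names, not real tooling here), reusing the health check, model aliases, and fallback CLIs listed above:

```bash
# Hypothetical router over the matrix; prompts must be JSON-safe (no embedded quotes) in this naive sketch.
local_chat() {  # local_chat MODEL PROMPT; fails (non-zero) if llama-swap is down
  curl -sf http://127.0.0.1:8080/health >/dev/null || return 1
  curl -s http://127.0.0.1:8080/v1/chat/completions -H 'Content-Type: application/json' \
    -d "{\"model\": \"$1\", \"messages\": [{\"role\": \"user\", \"content\": \"$2\"}]}" |
    jq -r '.choices[0].message.content'
}
route() {
  case "$1" in
    private)   local_chat qwen3 "$2" ;;   # never fall back to cloud
    code)      local_chat coder "$2" || opencode run -m github-copilot/claude-sonnet-4.5 "$2" ;;
    quick)     local_chat gemma "$2" || opencode run -m github-copilot/claude-haiku-4.5 "$2" ;;
    reasoning) claude -p --model opus "$2" ;;
    *)         local_chat qwen3 "$2" ;;
  esac
}
```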
## For Flynn
### Before using local LLMs:
```bash
curl -sf http://127.0.0.1:8080/health
```
### For parallel work:
- Spawn sub-agents with `sessions_spawn`
- Each can hit local endpoint independently
- Coordinate results in main session
### Privacy rule:
If the data is sensitive (credentials, personal info, proprietary), **only use local**.
Never send to cloud APIs without explicit permission.
---
*Principle: Local first. Check availability. Parallelize when beneficial.*