Rewrite LLM routing with local-first principles
- Privacy/confidentiality: local LLMs only for sensitive data
- Check availability before using local (may not be running)
- Long-running tasks: local (no API costs/limits)
- Multi-agent parallelism for speed
- Fallback patterns when local is down

LLM-ROUTING.md

# LLM Routing Guide

Use the right model for the job. **Local first** when possible.

## Core Principles

1. **Privacy/Confidentiality** → Local LLMs (data never leaves the machine)
2. **Long-running tasks** → Local LLMs (no API costs, no rate limits)
3. **Parallelism** → Multi-agent with local LLMs (spawn multiple workers)
4. **Check availability** → Local LLMs may not always be running
5. **Cost efficiency** → Local → Copilot → Cloud APIs

## Available Resources

### Local (llama-swap @ :8080)

```bash
# Check if running (-f so an HTTP error counts as DOWN)
curl -sf http://127.0.0.1:8080/health && echo "UP" || echo "DOWN"

# List loaded models
curl -s http://127.0.0.1:8080/v1/models | jq -r '.data[].id'
```

| Alias | Model | Best For |
|-------|-------|----------|
| `gemma` | Gemma-3-12B | Fast, balanced, fits fully in VRAM |
| `qwen3` | Qwen3-30B-A3B | General purpose, quality |
| `coder` | Qwen3-Coder-30B | Code specialist |
| `glm` | GLM-4.7-Flash | Fast reasoning |
| `reasoning` | Ministral-3-14B | Reasoning tasks |
| `gpt-oss` | GPT-OSS-20B | Experimental |
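
The alias is what goes in the `model` field of an OpenAI-compatible request; llama-swap loads the matching model on demand. A minimal sketch (the prompt and the `jq` extraction are illustrative):

```bash
# One-off question to the local coder model. llama-swap swaps the
# right model in when it sees the alias in the "model" field.
curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "coder",
    "messages": [{"role": "user", "content": "Write a bash one-liner to count unique IPs in access.log"}]
  }' | jq -r '.choices[0].message.content'
```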

### Homelab (Ollama @ 100.85.116.57:11434)

- Larger models, more capacity
- Still private (your network)
- Check: `curl -s http://100.85.116.57:11434/api/tags | jq '.models[].name'`
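
A quick smoke test against the homelab box, assuming Ollama's standard generate endpoint (the model name here is a placeholder; use whatever `/api/tags` reports as pulled):

```bash
# One-shot generation on the homelab Ollama; non-streaming, so the
# reply comes back as a single JSON object.
curl -s http://100.85.116.57:11434/api/generate \
  -d '{"model": "llama3.1", "prompt": "Summarize: ...", "stream": false}' \
  | jq -r '.response'
```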

### GitHub Copilot (via opencode)

- "Free" with subscription
- Good for one-shot tasks
- Has internet access

### Cloud APIs (Clawdbot/me)

- Most capable (opus, sonnet)
- Best tool integration
- Paid per token

## When to Use What

### 🔒 Sensitive/Private Data → **LOCAL ONLY**

```bash
# Credentials, personal info, proprietary code
curl http://127.0.0.1:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "qwen3", "messages": [{"role": "user", "content": "Review this private config..."}]}'
```

**Never send sensitive data to cloud APIs.**
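
Hand-writing JSON breaks as soon as the prompt contains quotes or newlines, which private configs usually do. A sketch of a safer wrapper (the `local_chat` name and the config path are made up here) that builds the body with `jq`:

```bash
# local_chat MODEL PROMPT: jq builds the JSON body, so quotes and
# newlines in the prompt cannot break the request.
local_chat() {
  local model="$1" prompt="$2"
  jq -n --arg m "$model" --arg p "$prompt" \
    '{model: $m, messages: [{role: "user", content: $p}]}' \
    | curl -s http://127.0.0.1:8080/v1/chat/completions \
        -H 'Content-Type: application/json' -d @- \
    | jq -r '.choices[0].message.content'
}

local_chat qwen3 "Review this private config: $(cat ~/.config/example.toml)"
```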

### ⏱️ Long-Running Tasks → **LOCAL**

```bash
# Analysis that takes minutes, batch processing
curl http://127.0.0.1:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "coder", "messages": [...], "max_tokens": 4096}'
```

- No API timeouts
- No rate limits
- No cost accumulation
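
For batch work a plain loop is usually enough, since there is no rate limit to respect. A sketch reusing the illustrative `local_chat` helper from above (the glob and output layout are examples):

```bash
# Summarize every markdown file under docs/, one request at a time.
mkdir -p summaries
for f in docs/*.md; do
  local_chat coder "Summarize in 3 bullets: $(cat "$f")" \
    > "summaries/$(basename "$f" .md).txt"
done
```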

### 🚀 Parallel Work → **MULTI-AGENT LOCAL**

When speed matters, spawn multiple workers:

```bash
# Flynn can spawn sub-agents hitting local LLMs
# Each agent works independently, results merge
```

- Use for: bulk analysis, multi-file processing, research tasks
- Coordinate via `sessions_spawn` with local model routing (see the shell sketch below)
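
From the shell, the same fan-out can be done with background jobs; a sketch using the illustrative `local_chat` helper (how much actually runs in parallel depends on how many slots the local server is configured to serve; otherwise requests queue):

```bash
# One background worker per log file; wait collects them all.
mkdir -p out
for f in logs/*.log; do
  local_chat gemma "Extract error messages: $(cat "$f")" \
    > "out/$(basename "$f").errors" &
done
wait
echo "all workers done"
```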

### ⚡ Quick One-Shot → **COPILOT or LOCAL**

```bash
# If local is up, prefer it
curl -sf http://127.0.0.1:8080/health && \
  curl http://127.0.0.1:8080/v1/chat/completions -d '{"model": "gemma", ...}' || \
  opencode run -m github-copilot/claude-haiku-4.5 "quick question"
```

### 🧠 Complex Reasoning → **CLOUD (opus)**

- Multi-step orchestration
- Tool use (browser, APIs, messaging)
- When quality > cost

### 📚 Massive Context (>32k) → **GEMINI**

```bash
cat huge_file.md | gemini -m gemini-2.5-pro "analyze"
```

## Availability Check Pattern

Before using local LLMs, verify they're up:

```bash
# Quick check
llama_up() { curl -sf http://127.0.0.1:8080/health >/dev/null; }

# Use with fallback
if llama_up; then
  curl http://127.0.0.1:8080/v1/chat/completions -d '{"model": "gemma", ...}'
else
  opencode run -m github-copilot/claude-haiku-4.5 "..."
fi
```
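
A stricter probe treats "up" as health answering *and* at least one model listed, since whether `/health` implies a loaded model depends on the server (an assumption worth verifying locally):

```bash
# Readiness = health endpoint OK and a non-empty model list.
llama_ready() {
  curl -sf http://127.0.0.1:8080/health >/dev/null &&
    curl -sf http://127.0.0.1:8080/v1/models \
      | jq -e '.data | length > 0' >/dev/null
}

llama_ready && echo "ready" || echo "not ready"
```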

For Flynn: check `curl -sf http://127.0.0.1:8080/health` before routing to local.

## Service Management

```bash
# Start local LLMs
systemctl --user start llama-swap

# Stop (save GPU for gaming)
systemctl --user stop llama-swap

# Check status
systemctl --user status llama-swap
```
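
These pieces compose into an "ensure up" helper; a sketch assuming the user unit is named `llama-swap` as above and that the proxy needs a few seconds to come up:

```bash
# Start llama-swap if the health check fails, then poll for up to ~15s.
ensure_llama() {
  curl -sf http://127.0.0.1:8080/health >/dev/null && return 0
  systemctl --user start llama-swap
  for _ in $(seq 1 15); do
    sleep 1
    curl -sf http://127.0.0.1:8080/health >/dev/null && return 0
  done
  echo "llama-swap did not come up" >&2
  return 1
}
```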

## Model Selection Matrix

| Scenario | First Choice | Fallback |
|----------|--------------|----------|
| Private data | `qwen3` (local) | — (no fallback) |
| Long task | `coder` (local) | Homelab Ollama |
| Quick question | `gemma` (local) | `haiku` (Copilot) |
| Code review | `coder` (local) | `sonnet` (Copilot) |
| Complex reasoning | `opus` (cloud) | `qwen3` (local) |
| Bulk processing | Multi-agent local | — |
| 100k+ context | `gemini-2.5-pro` | — |
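
The matrix mechanizes into a small router; a sketch (the scenario labels are invented here) that encodes the first-choice column and degrades per the fallback column:

```bash
# route SCENARIO: print "<tier> <model>" per the matrix, refusing to
# route private data anywhere but local.
route() {
  local up=""
  curl -sf http://127.0.0.1:8080/health >/dev/null && up=1
  case "$1" in
    private)
      [ -n "$up" ] && echo "local qwen3" \
        || { echo "no cloud fallback for private data" >&2; return 1; } ;;
    quick)  [ -n "$up" ] && echo "local gemma" || echo "copilot claude-haiku-4.5" ;;
    review) [ -n "$up" ] && echo "local coder" || echo "copilot claude-sonnet-4.5" ;;
    reason) echo "cloud opus" ;;
    *) echo "unknown scenario: $1" >&2; return 1 ;;
  esac
}
```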

## For Flynn

### Before using local LLMs:

```bash
curl -sf http://127.0.0.1:8080/health
```

### For parallel work:

- Spawn sub-agents with `sessions_spawn`
- Each can hit the local endpoint independently
- Coordinate results in the main session

### Privacy rule:

If the data is sensitive (credentials, personal info, proprietary), **only use local**.
Never send it to cloud APIs without explicit permission.

---

*Principle: Local first. Check availability. Parallelize when beneficial.*