# LLM Routing Guide
Use the right model for the job. **Local first** when possible.
## Core Principles
1. **Privacy/Confidentiality** → Local LLMs (data never leaves the machine)
2. **Long-running tasks** → Local LLMs (no API costs, no rate limits)
3. **Parallelism** → Multi-agent with local LLMs (spawn multiple workers)
4. **Check availability** → Local LLMs may not always be running
5. **Cost efficiency** → Local → Copilot → Cloud APIs
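Principle 5's tiering can be scripted as a fallback chain; a minimal sketch, assuming the endpoints and aliases described below, with `$PROMPT` as a placeholder (the final cloud tier is left as a manual escalation since no cloud CLI is defined here):
```bash
PROMPT="Summarize the latest deploy log"   # placeholder prompt
if curl -sf http://127.0.0.1:8080/health >/dev/null; then
  # Tier 1: local llama-swap
  jq -n --arg p "$PROMPT" '{model: "gemma", messages: [{role: "user", content: $p}]}' |
    curl -s http://127.0.0.1:8080/v1/chat/completions -H 'Content-Type: application/json' -d @-
elif command -v opencode >/dev/null; then
  # Tier 2: GitHub Copilot via opencode
  opencode run -m github-copilot/claude-haiku-4.5 "$PROMPT"
else
  # Tier 3: escalate to cloud APIs (Clawdbot) manually
  echo "No local or Copilot route available; escalate to cloud" >&2
fi
```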
## Available Resources
### Local (llama-swap @ :8080)
```bash
# Check if running
curl -sf http://127.0.0.1:8080/health >/dev/null && echo "UP" || echo "DOWN"
# List loaded models
curl -s http://127.0.0.1:8080/v1/models | jq -r '.data[].id'
```
| Alias | Model | Best For |
|-------|-------|----------|
| `gemma` | Gemma-3-12B | Fast, balanced, fits fully in VRAM |
| `qwen3` | Qwen3-30B-A3B | General purpose, quality |
| `coder` | Qwen3-Coder-30B | Code specialist |
| `glm` | GLM-4.7-Flash | Fast reasoning |
| `reasoning` | Ministral-3-14B | Reasoning tasks |
| `gpt-oss` | GPT-OSS-20B | Experimental |
### Homelab (Ollama @ 100.85.116.57:11434)
- Larger models, more capacity
- Still private (your network)
- Check: `curl -s http://100.85.116.57:11434/api/tags | jq '.models[].name'`
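Chat requests go through Ollama's native `/api/chat` endpoint; a minimal sketch, where the model name is a placeholder, so pick one returned by the tags call above:
```bash
# Non-streaming chat request against the homelab Ollama instance
curl -s http://100.85.116.57:11434/api/chat \
  -d '{"model": "llama3.1:70b", "messages": [{"role": "user", "content": "Summarize this design doc..."}], "stream": false}'
```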
### GitHub Copilot (via opencode)
- "Free" with subscription
- Good for one-shot tasks
- Has internet access
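A typical one-shot invocation (the same form used in the fallback examples further down):
```bash
opencode run -m github-copilot/claude-haiku-4.5 "What does 'set -euo pipefail' do?"
```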
### Cloud APIs (Clawdbot/me)
- Most capable (opus, sonnet)
- Best tool integration
- Paid per token
## When to Use What
### 🔒 Sensitive/Private Data → **LOCAL ONLY**
```bash
# Credentials, personal info, proprietary code
curl http://127.0.0.1:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "qwen3", "messages": [{"role": "user", "content": "Review this private config..."}]}'
```
**Never send sensitive data to cloud APIs.**
### ⏱️ Long-Running Tasks → **LOCAL**
```bash
# Analysis that takes minutes, batch processing
curl http://127.0.0.1:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "coder", "messages": [...], "max_tokens": 4096}'
```
- No API timeouts
- No rate limits
- No cost accumulation
### 🚀 Parallel Work → **MULTI-AGENT**
When speed matters, spawn multiple workers:
```bash
# Flynn can spawn sub-agents targeting any LLM
# Each agent works independently, results merge
```
- Use for: bulk analysis, multi-file processing, research tasks (see the fan-out sketch after this list)
- Coordinate via `sessions_spawn` with model param
- **Local:** best for privacy + no rate limits
- **Cloud:** best for complex tasks needing quality
- Mix and match based on task requirements
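Even without sub-agents, the local endpoint can be fanned out from plain shell; a minimal sketch, with the file names as placeholders:
```bash
# Review several files in parallel against the local coder model:
# one background request per file, then wait for all to finish.
for f in a.py b.py c.py; do
  jq -n --arg code "$(cat "$f")" \
    '{model: "coder", messages: [{role: "user", content: ("Review this file:\n" + $code)}]}' |
    curl -s http://127.0.0.1:8080/v1/chat/completions \
      -H 'Content-Type: application/json' -d @- > "review_$f.json" &
done
wait
```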
### ⚡ Quick One-Shot → **COPILOT or LOCAL**
```bash
# If local is up, prefer it
curl -sf http://127.0.0.1:8080/health >/dev/null && \
curl http://127.0.0.1:8080/v1/chat/completions -d '{"model": "gemma", ...}' || \
opencode run -m github-copilot/claude-haiku-4.5 "quick question"
```
### 🧠 Complex Reasoning → **CLOUD (opus)**
- Multi-step orchestration
- Tool use (browser, APIs, messaging)
- When quality > cost
### 📚 Massive Context (>32k) → **GEMINI**
```bash
cat huge_file.md | gemini -m gemini-2.5-pro "analyze"
```
## Availability Check Pattern
Before using local LLMs, verify they're up:
```bash
# Quick check
llama_up() { curl -sf http://127.0.0.1:8080/health >/dev/null; }
# Use with fallback
if llama_up; then
  curl http://127.0.0.1:8080/v1/chat/completions -d '{"model": "gemma", ...}'
else
  opencode run -m github-copilot/claude-haiku-4.5 "..."
fi
```
For Flynn: check `curl -sf http://127.0.0.1:8080/health` before routing to local.
## Service Management
```bash
# Start local LLMs
systemctl --user start llama-swap
# Stop (save GPU for gaming)
systemctl --user stop llama-swap
# Check status
systemctl --user status llama-swap
```
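For unattended batch jobs, a guard that only starts the service when it isn't already active keeps this idempotent:
```bash
# Ensure llama-swap is up before a batch run
systemctl --user is-active --quiet llama-swap || systemctl --user start llama-swap
```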
## Model Selection Matrix
| Scenario | First Choice | Fallback |
|----------|--------------|----------|
| Private data | `qwen3` (local) | — (no fallback) |
| Long task | `coder` (local) | Homelab Ollama |
| Quick question | `gemma` (local) | `haiku` (Copilot) |
| Code review | `coder` (local) | `sonnet` (Copilot) |
| Complex reasoning | `opus` (cloud) | `qwen3` (local) |
| Bulk processing | Multi-agent local | — |
| 100k+ context | `gemini-2.5-pro` | — |
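The first-choice column can be encoded as a small helper; a sketch, where the scenario names are made up for illustration and the fallbacks are left to the caller:
```bash
pick_model() {
  case "$1" in
    private)   echo "qwen3" ;;   # local only, never falls back to cloud
    long)      echo "coder" ;;   # fallback: homelab Ollama
    quick)     echo "gemma" ;;   # fallback: Copilot haiku
    code)      echo "coder" ;;   # fallback: Copilot sonnet
    reasoning) echo "opus"  ;;   # cloud; fallback: local qwen3
    *)         echo "gemma" ;;   # default
  esac
}
```
Substitute the result into the request payload, e.g. `MODEL=$(pick_model quick)`.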
## For Flynn
### Before using local LLMs:
```bash
curl -sf http://127.0.0.1:8080/health
```
### For parallel work:
- Spawn sub-agents with `sessions_spawn`
- Each can hit local endpoint independently
- Coordinate results in main session
### Privacy rule:
If the data is sensitive (credentials, personal info, proprietary), **only use local**.
Never send to cloud APIs without explicit permission.
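One way to make the rule mechanical, assuming a caller-set `SENSITIVE` flag (a placeholder, not an existing variable):
```bash
# Hard stop: never fall back to a remote model when the data is sensitive.
if [ "$SENSITIVE" = "1" ] && ! curl -sf http://127.0.0.1:8080/health >/dev/null; then
  echo "Local LLM is down and data is sensitive; aborting instead of routing to cloud" >&2
  exit 1
fi
```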
---
*Principle: Local first. Check availability. Parallelize when beneficial.*