# LLM Routing Guide
Use the right model for the job. Local first when possible.
## Core Principles
- Privacy/Confidentiality → Local LLMs (data never leaves the machine)
- Long-running tasks → Local LLMs (no API costs, no rate limits)
- Parallelism → Multi-agent with local LLMs (spawn multiple workers)
- Check availability → Local LLMs may not always be running
- Cost efficiency → Local → Copilot → Cloud APIs
## Available Resources

### Local (llama-swap @ :8080)
```bash
# Check if running
curl -s http://127.0.0.1:8080/health && echo "UP" || echo "DOWN"

# List loaded models
curl -s http://127.0.0.1:8080/v1/models | jq -r '.data[].id'
```
| Alias | Model | Best For |
|---|---|---|
| `gemma` | Gemma-3-12B | Fast, balanced, fits fully |
| `qwen3` | Qwen3-30B-A3B | General purpose, quality |
| `coder` | Qwen3-Coder-30B | Code specialist |
| `glm` | GLM-4.7-Flash | Fast reasoning |
| `reasoning` | Ministral-3-14B | Reasoning tasks |
| `gpt-oss` | GPT-OSS-20B | Experimental |
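To target one of these aliases directly, a minimal request against the OpenAI-compatible endpoint looks something like the sketch below; the prompt and `max_tokens` value are placeholders, and the `jq` filter assumes the standard OpenAI response shape:

```bash
# Ask the code-specialist alias a question; swap "coder" for any alias above
curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "coder",
        "messages": [{"role": "user", "content": "Write a bash one-liner that counts unique IPs in access.log"}],
        "max_tokens": 512
      }' | jq -r '.choices[0].message.content'
```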
### Homelab (Ollama @ 100.85.116.57:11434)

- Larger models, more capacity
- Still private (your network)
- Check:

```bash
curl -s http://100.85.116.57:11434/api/tags | jq '.models[].name'
```
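For an actual request, Ollama's chat API works like the sketch below; the model name is a placeholder, so substitute one reported by `/api/tags`:

```bash
# Non-streaming chat request against the homelab Ollama instance
# (model name is a placeholder -- pick one from /api/tags)
curl -s http://100.85.116.57:11434/api/chat \
  -d '{
        "model": "llama3.1:70b",
        "messages": [{"role": "user", "content": "Summarize the tradeoffs of running LLMs locally."}],
        "stream": false
      }' | jq -r '.message.content'
```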
### GitHub Copilot (via opencode)
- "Free" with subscription
- Good for one-shot tasks
- Has internet access
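Example one-shot, reusing the model string that appears in the fallback snippets later in this guide:

```bash
# One-shot question routed through GitHub Copilot (no local GPU needed)
opencode run -m github-copilot/claude-haiku-4.5 "Explain the difference between a hard link and a symlink in two sentences."
```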
### Cloud APIs (Clawdbot/me)
- Most capable (opus, sonnet)
- Best tool integration
- Paid per token
## When to Use What

### 🔒 Sensitive/Private Data → LOCAL ONLY
```bash
# Credentials, personal info, proprietary code
curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen3", "messages": [{"role": "user", "content": "Review this private config..."}]}'
```

Never send sensitive data to cloud APIs.
### ⏱️ Long-Running Tasks → LOCAL

```bash
# Analysis that takes minutes, batch processing
curl http://127.0.0.1:8080/v1/chat/completions \
  -d '{"model": "coder", "messages": [...], "max_tokens": 4096}'
```
- No API timeouts
- No rate limits
- No cost accumulation
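A sketch of an unattended batch job, assuming a `./reports/` directory of markdown files to summarize (the paths and prompt are illustrative):

```bash
# Summarize every report sequentially; no rate limits or per-token cost to worry about
for f in ./reports/*.md; do
  jq -n --arg doc "$(cat "$f")" \
    '{model: "coder", max_tokens: 4096,
      messages: [{role: "user", content: ("Summarize this document:\n\n" + $doc)}]}' |
  curl -s http://127.0.0.1:8080/v1/chat/completions \
       -H "Content-Type: application/json" -d @- |
  jq -r '.choices[0].message.content' > "${f%.md}.summary.txt"
done
```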
### 🚀 Parallel Work → MULTI-AGENT LOCAL
When speed matters, spawn multiple workers:
```bash
# Flynn can spawn sub-agents hitting local LLMs
# Each agent works independently, results merge
```
- Use for: bulk analysis, multi-file processing, research tasks
- Coordinate via `sessions_spawn` with local model routing (see the sketch below)
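For comparison, the same fan-out can be done in plain shell against the local endpoint; the file glob and prompt below are illustrative, and Flynn's `sessions_spawn` handles the equivalent coordination at the agent level:

```bash
# Fan out one local request per file, wait for all workers, then merge results
analyze() {
  jq -n --arg doc "$(cat "$1")" \
    '{model: "qwen3", messages: [{role: "user", content: ("Review this file:\n\n" + $doc)}]}' |
  curl -s http://127.0.0.1:8080/v1/chat/completions \
       -H "Content-Type: application/json" -d @- |
  jq -r '.choices[0].message.content' > "$1.review"
}

for f in src/*.py; do
  analyze "$f" &   # each worker runs independently
done
wait               # block until every background worker finishes
cat src/*.py.review > combined_review.txt
```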
### ⚡ Quick One-Shot → COPILOT or LOCAL

```bash
# If local is up, prefer it
curl -sf http://127.0.0.1:8080/health && \
  curl http://127.0.0.1:8080/v1/chat/completions -d '{"model": "gemma", ...}' || \
  opencode run -m github-copilot/claude-haiku-4.5 "quick question"
```
### 🧠 Complex Reasoning → CLOUD (opus)
- Multi-step orchestration
- Tool use (browser, APIs, messaging)
- When quality > cost
### 📚 Massive Context (>32k) → GEMINI

```bash
cat huge_file.md | gemini -m gemini-2.5-pro "analyze"
```
## Availability Check Pattern
Before using local LLMs, verify they're up:
```bash
# Quick check
llama_up() { curl -sf http://127.0.0.1:8080/health >/dev/null; }

# Use with fallback
if llama_up; then
  curl http://127.0.0.1:8080/v1/chat/completions -d '{"model": "gemma", ...}'
else
  opencode run -m github-copilot/claude-haiku-4.5 "..."
fi
```
For Flynn: check `curl -sf http://127.0.0.1:8080/health` before routing to local.
## Service Management

```bash
# Start local LLMs
systemctl --user start llama-swap

# Stop (save GPU for gaming)
systemctl --user stop llama-swap

# Check status
systemctl --user status llama-swap
```
## Model Selection Matrix
| Scenario | First Choice | Fallback |
|---|---|---|
| Private data | `qwen3` (local) | — (no fallback) |
| Long task | `coder` (local) | Homelab Ollama |
| Quick question | `gemma` (local) | haiku (Copilot) |
| Code review | `coder` (local) | sonnet (Copilot) |
| Complex reasoning | opus (cloud) | `qwen3` (local) |
| Bulk processing | Multi-agent local | — |
| 100k+ context | gemini-2.5-pro | — |
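A hypothetical helper that encodes the matrix's local-first preference with an availability check; the `route` and `ask_local` functions are illustrative, not existing tools, and Copilot model ids beyond `claude-haiku-4.5` are left as placeholders:

```bash
# Hypothetical router sketch: prefer local, fall back per the matrix when it's down
llama_up() { curl -sf http://127.0.0.1:8080/health >/dev/null; }

ask_local() {  # ask_local <alias> <prompt> -- minimal chat call against llama-swap
  jq -n --arg m "$1" --arg p "$2" '{model: $m, messages: [{role: "user", content: $p}]}' |
  curl -s http://127.0.0.1:8080/v1/chat/completions -H "Content-Type: application/json" -d @- |
  jq -r '.choices[0].message.content'
}

route() {  # route <scenario> <prompt>
  case "$1" in
    private) llama_up && ask_local qwen3 "$2" || { echo "local is down: refusing to send private data" >&2; return 1; } ;;
    code)    llama_up && ask_local coder "$2" || opencode run -m github-copilot/claude-haiku-4.5 "$2" ;;  # or a sonnet-class model
    quick)   llama_up && ask_local gemma "$2" || opencode run -m github-copilot/claude-haiku-4.5 "$2" ;;
    *)       echo "unknown scenario: $1" >&2; return 1 ;;
  esac
}

route quick "What does HTTP status 418 mean?"
```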
## For Flynn

Before using local LLMs:

```bash
curl -sf http://127.0.0.1:8080/health
```
For parallel work:
- Spawn sub-agents with `sessions_spawn`
- Each can hit the local endpoint independently
- Coordinate results in the main session
Privacy rule:
If the data is sensitive (credentials, personal info, proprietary), only use local. Never send to cloud APIs without explicit permission.
Principle: Local first. Check availability. Parallelize when beneficial.