# LLM Routing Guide

Use the right model for the job. **Local first** when possible.

## Core Principles

1. **Privacy/Confidentiality** → Local LLMs (data never leaves the machine)
2. **Long-running tasks** → Local LLMs (no API costs, no rate limits)
3. **Parallelism** → Multi-agent with local LLMs (spawn multiple workers)
4. **Check availability** → Local LLMs may not always be running
5. **Cost efficiency** → Local → Copilot → Cloud APIs

## Available Resources

### Local (llama-swap @ :8080)

```bash
# Check if running
curl -s http://127.0.0.1:8080/health && echo "UP" || echo "DOWN"

# List loaded models
curl -s http://127.0.0.1:8080/v1/models | jq -r '.data[].id'
```

| Alias | Model | Best For |
|-------|-------|----------|
| `gemma` | Gemma-3-12B | Fast, balanced, fits fully |
| `qwen3` | Qwen3-30B-A3B | General purpose, quality |
| `coder` | Qwen3-Coder-30B | Code specialist |
| `glm` | GLM-4.7-Flash | Fast reasoning |
| `reasoning` | Ministral-3-14B | Reasoning tasks |
| `gpt-oss` | GPT-OSS-20B | Experimental |

### Homelab (Ollama @ 100.85.116.57:11434)

- Larger models, more capacity
- Still private (your network)
- Check: `curl -s http://100.85.116.57:11434/api/tags | jq '.models[].name'`

### GitHub Copilot (via opencode)

- "Free" with subscription
- Good for one-shot tasks
- Has internet access

### Cloud APIs (Clawdbot/me)

- Most capable (opus, sonnet)
- Best tool integration
- Paid per token

## When to Use What

### 🔒 Sensitive/Private Data → **LOCAL ONLY**

```bash
# Credentials, personal info, proprietary code
curl http://127.0.0.1:8080/v1/chat/completions \
  -d '{"model": "qwen3", "messages": [{"role": "user", "content": "Review this private config..."}]}'
```

**Never send sensitive data to cloud APIs.**

### ⏱️ Long-Running Tasks → **LOCAL**

```bash
# Analysis that takes minutes, batch processing
curl http://127.0.0.1:8080/v1/chat/completions \
  -d '{"model": "coder", "messages": [...], "max_tokens": 4096}'
```

- No API timeouts
- No rate limits
- No cost accumulation

### 🚀 Parallel Work → **MULTI-AGENT**

When speed matters, spawn multiple workers:

```bash
# Flynn can spawn sub-agents targeting any LLM
# Each agent works independently, results merge
```

- Use for: bulk analysis, multi-file processing, research tasks
- Coordinate via `sessions_spawn` with model param
- **Local:** best for privacy + no rate limits
- **Cloud:** best for complex tasks needing quality
- Mix and match based on task requirements

### ⚡ Quick One-Shot → **COPILOT or LOCAL**

```bash
# If local is up, prefer it
curl -sf http://127.0.0.1:8080/health && \
  curl http://127.0.0.1:8080/v1/chat/completions -d '{"model": "gemma", ...}' || \
  opencode run -m github-copilot/claude-haiku-4.5 "quick question"
```

### 🧠 Complex Reasoning → **CLOUD (opus)**

- Multi-step orchestration
- Tool use (browser, APIs, messaging)
- When quality > cost

### 📚 Massive Context (>32k) → **GEMINI**

```bash
cat huge_file.md | gemini -m gemini-2.5-pro "analyze"
```

## Availability Check Pattern

Before using local LLMs, verify they're up:

```bash
# Quick check
llama_up() { curl -sf http://127.0.0.1:8080/health >/dev/null; }

# Use with fallback
if llama_up; then
  curl http://127.0.0.1:8080/v1/chat/completions -d '{"model": "gemma", ...}'
else
  opencode run -m github-copilot/claude-haiku-4.5 "..."
fi
```

For Flynn: check `curl -sf http://127.0.0.1:8080/health` before routing to local.
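Pulling the pattern together, here is a minimal end-to-end sketch of that check-then-fallback flow. The `ask` helper name, the `Content-Type` header, and the `jq` extraction of `.choices[0].message.content` are assumptions based on the standard OpenAI-compatible response shape; the endpoint, model aliases, and `opencode` fallback come from the examples above.

```bash
#!/usr/bin/env bash
# Illustrative routing helper: local first, Copilot fallback.
# Assumes llama-swap's OpenAI-compatible API and the opencode CLI shown above.

LLAMA=http://127.0.0.1:8080

llama_up() { curl -sf "$LLAMA/health" >/dev/null; }

ask() {
  local model="$1" prompt="$2"
  if llama_up; then
    curl -s "$LLAMA/v1/chat/completions" \
      -H "Content-Type: application/json" \
      -d "$(jq -n --arg m "$model" --arg p "$prompt" \
            '{model: $m, messages: [{role: "user", content: $p}]}')" \
      | jq -r '.choices[0].message.content'
  else
    # Local server is down: fall back to Copilot via opencode
    opencode run -m github-copilot/claude-haiku-4.5 "$prompt"
  fi
}

ask gemma "Summarize the tradeoffs between local and cloud LLM routing."
```

With the helper defined, something like `ask coder "Review this function for bugs"` stays local whenever the server responds, which matches the local-first principle.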
## Service Management

```bash
# Start local LLMs
systemctl --user start llama-swap

# Stop (save GPU for gaming)
systemctl --user stop llama-swap

# Check status
systemctl --user status llama-swap
```

## Model Selection Matrix

| Scenario | First Choice | Fallback |
|----------|--------------|----------|
| Private data | `qwen3` (local) | — (no fallback) |
| Long task | `coder` (local) | Homelab Ollama |
| Quick question | `gemma` (local) | `haiku` (Copilot) |
| Code review | `coder` (local) | `sonnet` (Copilot) |
| Complex reasoning | `opus` (cloud) | `qwen3` (local) |
| Bulk processing | Multi-agent local | — |
| 100k+ context | `gemini-2.5-pro` | — |

## For Flynn

### Before using local LLMs:

```bash
curl -sf http://127.0.0.1:8080/health
```

### For parallel work:

- Spawn sub-agents with `sessions_spawn`
- Each can hit the local endpoint independently (see the sketch at the end of this guide)
- Coordinate results in the main session

### Privacy rule:

If the data is sensitive (credentials, personal info, proprietary), **only use local**. Never send it to cloud APIs without explicit permission.

---

*Principle: Local first. Check availability. Parallelize when beneficial.*
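For the bulk-processing row and the parallel-work notes above, plain shell parallelism against the local endpoint is enough to sketch the idea. This is a minimal illustration, not how `sessions_spawn` itself works; the `src/*.py` glob, the review prompt, and the `.review` output files are placeholders.

```bash
#!/usr/bin/env bash
# Illustrative bulk processing against the local endpoint:
# one background worker per input file, results collected when all finish.

LLAMA=http://127.0.0.1:8080

review_file() {
  local f="$1"
  curl -s "$LLAMA/v1/chat/completions" \
    -H "Content-Type: application/json" \
    -d "$(jq -n --arg code "$(cat "$f")" \
          '{model: "coder", messages: [{role: "user", content: ("Review this file:\n" + $code)}]}')" \
    | jq -r '.choices[0].message.content' > "$f.review"
}

for f in src/*.py; do
  review_file "$f" &   # each worker hits the local endpoint independently
done
wait                   # merge point: all reviews now sit in *.review files
```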