Rewrite LLM routing with local-first principles

- Privacy/confidentiality: local LLMs only for sensitive data
- Check availability before using local (may not be running)
- Long-running tasks: local (no API costs/limits)
- Multi-agent parallelism for speed
- Fallback patterns when local is down
# LLM Routing Guide
Use the right model for the job. **Local first** when possible.
## Core Principles
1. **Privacy/Confidentiality** → Local LLMs (data never leaves the machine)
2. **Long-running tasks** → Local LLMs (no API costs, no rate limits)
3. **Parallelism** → Multi-agent with local LLMs (spawn multiple workers)
4. **Check availability** → Local LLMs may not always be running
5. **Cost efficiency** → Local → Copilot → Cloud APIs
## Available Resources
### Local (llama-swap @ :8080)
```bash
# Check if running
curl -sf http://127.0.0.1:8080/health >/dev/null && echo "UP" || echo "DOWN"
# List loaded models
curl -s http://127.0.0.1:8080/v1/models | jq -r '.data[].id'
```
| Alias | Model | Best For |
|-------|-------|----------|
| `gemma` | Gemma-3-12B | Fast, balanced, fits fully in VRAM |
| `qwen3` | Qwen3-30B-A3B | General purpose, quality |
| `coder` | Qwen3-Coder-30B | Code specialist |
| `glm` | GLM-4.7-Flash | Fast reasoning |
| `reasoning` | Ministral-3-14B | Reasoning tasks |
| `gpt-oss` | GPT-OSS-20B | Experimental |
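
A minimal call sketch, assuming llama-swap fronts the usual OpenAI-compatible `/v1/chat/completions` route (the same endpoint used in the examples below) and picks the backing model from the `model` field; the prompt and `jq` path are illustrative:

```bash
# Ask the local coder alias a question; llama-swap swaps the underlying model in on demand.
curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "coder", "messages": [{"role": "user", "content": "Write a bash one-liner that counts TODO comments"}]}' \
  | jq -r '.choices[0].message.content'
```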
### Homelab (Ollama @ 100.85.116.57:11434)
- Larger models, more capacity
- Still private (your network)
- Check: `curl -s http://100.85.116.57:11434/api/tags | jq '.models[].name'`
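
A hedged example against Ollama's standard `/api/chat` endpoint; the model name below is a placeholder, so substitute whatever the tags check above reports:

```bash
# Non-streaming chat with the homelab node; "llama3.1:70b" is illustrative only.
curl -s http://100.85.116.57:11434/api/chat \
  -H 'Content-Type: application/json' \
  -d '{"model": "llama3.1:70b", "messages": [{"role": "user", "content": "Compare ZFS and btrfs snapshots in three bullets"}], "stream": false}' \
  | jq -r '.message.content'
```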
### GitHub Copilot (via opencode)
- "Free" with subscription
- Good for one-shot tasks
- Has internet access
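
One-shot example, reusing the `github-copilot/...` model naming that appears elsewhere in this guide; the prompt is illustrative:

```bash
# Internet-connected one-shot; fine for non-sensitive questions.
opencode run -m github-copilot/claude-haiku-4.5 "explain the difference between 'docker compose up' and 'docker compose up -d'"
```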
### Cloud APIs (Clawdbot/me)
- Most capable (opus, sonnet)
- Best tool integration
- Paid per token
## When to Use What
### 🔒 Sensitive/Private Data → **LOCAL ONLY**
```bash
# Credentials, personal info, proprietary code
curl http://127.0.0.1:8080/v1/chat/completions \
-d '{"model": "qwen3", "messages": [{"role": "user", "content": "Review this private config..."}]}'
```
**Never send sensitive data to cloud APIs.**
### ⏱️ Long-Running Tasks → **LOCAL**
```bash
# Analysis that takes minutes, batch processing
curl http://127.0.0.1:8080/v1/chat/completions \
-d '{"model": "coder", "messages": [...], "max_tokens": 4096}'
```
- No API timeouts
- No rate limits
- No cost accumulation
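
A batch-processing sketch under the same assumptions (the `coder` alias from the table above; `jq -n` is only there to JSON-escape file contents; paths are illustrative):

```bash
# Review every file in src/ with the local coder model; runtime is free, so let it churn.
mkdir -p reviews
for f in src/*.py; do
  jq -n --arg code "$(cat "$f")" \
     '{model: "coder", max_tokens: 4096,
       messages: [{role: "user", content: ("Review this file for bugs:\n\n" + $code)}]}' |
    curl -s http://127.0.0.1:8080/v1/chat/completions \
      -H 'Content-Type: application/json' -d @- |
    jq -r '.choices[0].message.content' > "reviews/$(basename "$f").md"
done
```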
### 🚀 Parallel Work → **MULTI-AGENT LOCAL**
When speed matters, spawn multiple workers:
```bash
# Flynn can spawn sub-agents hitting local LLMs
# Each agent works independently, results merge
```
- Use for: bulk analysis, multi-file processing, research tasks
- Coordinate via sessions_spawn with local model routing
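
Outside of `sessions_spawn`, the same fan-out can be sketched with plain background jobs against the local endpoint (prompts and output files below are illustrative):

```bash
# Two independent analyses running concurrently; collect and merge once both return.
curl -s http://127.0.0.1:8080/v1/chat/completions -H 'Content-Type: application/json' \
  -d '{"model": "gemma", "messages": [{"role": "user", "content": "List the risks in rollout plan A"}]}' > a.json &
curl -s http://127.0.0.1:8080/v1/chat/completions -H 'Content-Type: application/json' \
  -d '{"model": "gemma", "messages": [{"role": "user", "content": "List the risks in rollout plan B"}]}' > b.json &
wait
jq -r '.choices[0].message.content' a.json b.json
```

How much real concurrency this buys depends on how many parallel slots the local server is configured for; worst case the requests just queue, which still costs nothing per token.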
### ⚡ Quick One-Shot → **COPILOT or LOCAL**
```bash
# If local is up, prefer it
curl -sf http://127.0.0.1:8080/health && \
curl http://127.0.0.1:8080/v1/chat/completions -d '{"model": "gemma", ...}' || \
opencode run -m github-copilot/claude-haiku-4.5 "quick question"
```
### 🧠 Complex Reasoning → **CLOUD (opus)**
- Multi-step orchestration
- Tool use (browser, APIs, messaging)
- When quality > cost
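
Example invocation (the prompt is illustrative):

```bash
claude -p --model opus "given these constraints, design the rollout plan and call out the tradeoffs"
```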
### 📚 Massive Context (>32k) → **GEMINI**
```bash
cat huge_file.md | gemini -m gemini-2.5-pro "analyze"
```
## Availability Check Pattern
Before using local LLMs, verify they're up:
```bash
# Quick check
llama_up() { curl -sf http://127.0.0.1:8080/health >/dev/null; }
# Use with fallback
if llama_up; then
  curl http://127.0.0.1:8080/v1/chat/completions -d '{"model": "gemma", ...}'
else
  opencode run -m github-copilot/claude-haiku-4.5 "..."
fi
```
For Flynn: check `curl -sf http://127.0.0.1:8080/health` before routing to local.
## Service Management
```bash
# Start local LLMs
systemctl --user start llama-swap
# Stop (save GPU for gaming)
systemctl --user stop llama-swap
# Check status
systemctl --user status llama-swap
```
## Model Selection Matrix
| Scenario | First Choice | Fallback |
|----------|--------------|----------|
| Private data | `qwen3` (local) | — (no fallback) |
| Long task | `coder` (local) | Homelab Ollama |
| Quick question | `gemma` (local) | `haiku` (Copilot) |
| Code review | `coder` (local) | `sonnet` (Copilot) |
| Complex reasoning | `opus` (cloud) | `qwen3` (local) |
| Bulk processing | Multi-agent local | — |
| 100k+ context | `gemini-2.5-pro` | — |
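
The matrix can be folded into a small dispatcher. This is a hypothetical sketch (`route` and `local_chat` are illustrative names, not real tooling here), reusing the health check, model aliases, and fallback CLIs listed above:

```bash
# Hypothetical router over the matrix; prompts must be JSON-safe (no embedded quotes) in this naive sketch.
local_chat() {  # local_chat MODEL PROMPT; fails (non-zero) if llama-swap is down
  curl -sf http://127.0.0.1:8080/health >/dev/null || return 1
  curl -s http://127.0.0.1:8080/v1/chat/completions -H 'Content-Type: application/json' \
    -d "{\"model\": \"$1\", \"messages\": [{\"role\": \"user\", \"content\": \"$2\"}]}" |
    jq -r '.choices[0].message.content'
}
route() {
  case "$1" in
    private)   local_chat qwen3 "$2" ;;   # never fall back to cloud
    code)      local_chat coder "$2" || opencode run -m github-copilot/claude-sonnet-4.5 "$2" ;;
    quick)     local_chat gemma "$2" || opencode run -m github-copilot/claude-haiku-4.5 "$2" ;;
    reasoning) claude -p --model opus "$2" ;;
    *)         local_chat qwen3 "$2" ;;
  esac
}
```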
## For Flynn
### Before using local LLMs:
```bash
curl -sf http://127.0.0.1:8080/health
```
### For parallel work:
- Spawn sub-agents with `sessions_spawn`
- Each can hit local endpoint independently
- Coordinate results in main session
### Privacy rule:
If the data is sensitive (credentials, personal info, proprietary), **only use local**.
Never send to cloud APIs without explicit permission.
---
*Principle: Local first. Check availability. Parallelize when beneficial.*