From 3a3838e19bc301c172e15a30128f04af329c9351 Mon Sep 17 00:00:00 2001
From: William Valentin
Date: Mon, 26 Jan 2026 22:34:26 -0800
Subject: [PATCH] Rewrite LLM routing with local-first principles

- Privacy/confidentiality: local LLMs only for sensitive data
- Check availability before using local (may not be running)
- Long-running tasks: local (no API costs/limits)
- Multi-agent parallelism for speed
- Fallback patterns when local is down
---
 LLM-ROUTING.md | 201 ++++++++++++++++++++++++++++++-------------------
 1 file changed, 123 insertions(+), 78 deletions(-)

diff --git a/LLM-ROUTING.md b/LLM-ROUTING.md
index 93a0503..bc4e9af 100644
--- a/LLM-ROUTING.md
+++ b/LLM-ROUTING.md
@@ -1,111 +1,156 @@
 # LLM Routing Guide
 
-Use the right model for the job. Cost and speed matter.
+Use the right model for the job. **Local first** when possible.
 
-## Available CLIs
+## Core Principles
 
-| CLI | Auth | Best For |
-|-----|------|----------|
-| `claude` | Pro subscription | Complex reasoning, this workspace |
-| `opencode` | GitHub Copilot subscription | Code, free Copilot models |
-| `gemini` | Google account (free tier available) | Long context, multimodal |
+1. **Privacy/Confidentiality** → Local LLMs (data never leaves the machine)
+2. **Long-running tasks** → Local LLMs (no API costs, no rate limits)
+3. **Parallelism** → Multi-agent with local LLMs (spawn multiple workers)
+4. **Check availability** → Local LLMs may not always be running
+5. **Cost efficiency** → Local → Copilot → Cloud APIs
 
-## Model Tiers
+## Available Resources
 
-### ⚡ Fast & Cheap (Simple Tasks)
+### Local (llama-swap @ :8080)
 ```bash
-# Quick parsing, extraction, formatting, simple questions
-opencode run -m github-copilot/claude-haiku-4.5 "parse this JSON and extract emails"
-opencode run -m zai-coding-plan/glm-4.5-flash "summarize in 2 sentences"
-gemini -m gemini-2.0-flash "quick question here"
+# Check if running
+curl -s http://127.0.0.1:8080/health && echo "UP" || echo "DOWN"
+
+# List loaded models
+curl -s http://127.0.0.1:8080/v1/models | jq -r '.data[].id'
 ```
 
-**Use for:** Log parsing, data extraction, simple formatting, yes/no questions, summarization
+| Alias | Model | Best For |
+|-------|-------|----------|
+| `gemma` | Gemma-3-12B | Fast, balanced, fits fully |
+| `qwen3` | Qwen3-30B-A3B | General purpose, quality |
+| `coder` | Qwen3-Coder-30B | Code specialist |
+| `glm` | GLM-4.7-Flash | Fast reasoning |
+| `reasoning` | Ministral-3-14B | Reasoning tasks |
+| `gpt-oss` | GPT-OSS-20B | Experimental |
 
-### 🔧 Balanced (Standard Work)
+### Homelab (Ollama @ 100.85.116.57:11434)
+- Larger models, more capacity
+- Still private (your network)
+- Check: `curl -s http://100.85.116.57:11434/api/tags | jq '.models[].name'`
+
+### GitHub Copilot (via opencode)
+- "Free" with subscription
+- Good for one-shot tasks
+- Has internet access
+
+### Cloud APIs (Clawdbot/me)
+- Most capable (opus, sonnet)
+- Best tool integration
+- Paid per token
+
+## When to Use What
+
+### 🔒 Sensitive/Private Data → **LOCAL ONLY**
 ```bash
-# Code review, analysis, standard coding tasks
-opencode run -m github-copilot/claude-sonnet-4.5 "review this code"
-opencode run -m github-copilot/gpt-5-mini "explain this error"
-gemini -m gemini-2.5-pro "analyze this architecture"
+# Credentials, personal info, proprietary code
+curl http://127.0.0.1:8080/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{"model": "qwen3", "messages": [{"role": "user", "content": "Review this private config..."}]}'
 ```
+**Never send sensitive data to cloud APIs.**
+
+### ⏱️ Long-Running Tasks → **LOCAL**
+```bash
+# Analysis that takes minutes, batch processing
+curl http://127.0.0.1:8080/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{"model": "coder", "messages": [...], "max_tokens": 4096}'
+```
+- No API timeouts
+- No rate limits
+- No cost accumulation
+
+### 🚀 Parallel Work → **MULTI-AGENT LOCAL**
+When speed matters, spawn multiple workers:
+```bash
+# Flynn can spawn sub-agents hitting local LLMs
+# Each agent works independently, results merge
+```
+- Use for: bulk analysis, multi-file processing, research tasks
+- Coordinate via sessions_spawn with local model routing (fan-out sketch below)
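+
+Flynn's spawn tooling aside, the same fan-out works from plain bash. A minimal sketch — `ask_local`, `prompts.txt`, and the 4-worker batch are hypothetical, and real concurrency depends on how many parallel slots the llama-swap backend is configured for:
+
+```bash
+# Hypothetical fan-out: one local request per line of prompts.txt, 4 at a time
+ask_local() {
+  curl -s http://127.0.0.1:8080/v1/chat/completions \
+    -H "Content-Type: application/json" \
+    -d "$(jq -n --arg p "$1" '{model: "gemma", messages: [{role: "user", content: $p}]}')" |
+    jq -r '.choices[0].message.content'
+}
+
+mkdir -p out
+i=0
+while IFS= read -r prompt; do
+  i=$((i+1))
+  ask_local "$prompt" > "out/$i.txt" &   # one background worker per prompt
+  (( i % 4 == 0 )) && wait               # throttle: wait after every 4th request
+done < prompts.txt
+wait  # collect the stragglers
+```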
+
+### ⚡ Quick One-Shot → **COPILOT or LOCAL**
+```bash
+# If local is up, prefer it
+curl -sf http://127.0.0.1:8080/health && \
+  curl http://127.0.0.1:8080/v1/chat/completions -d '{"model": "gemma", ...}' || \
+  opencode run -m github-copilot/claude-haiku-4.5 "quick question"
 ```
 
-**Use for:** Code generation, debugging, analysis, documentation
+### 🧠 Complex Reasoning → **CLOUD (opus)**
+- Multi-step orchestration
+- Tool use (browser, APIs, messaging)
+- When quality > cost
 
-### 🧠 Powerful (Complex Reasoning)
+### 📚 Massive Context (>32k) → **GEMINI**
 ```bash
-# Complex reasoning, multi-step planning, difficult problems
-claude -p --model opus "design a system for X"
-opencode run -m github-copilot/gpt-5.2 "complex reasoning task"
-opencode run -m github-copilot/gemini-3-pro-preview "architectural decision"
+cat huge_file.md | gemini -m gemini-2.5-pro "analyze"
 ```
 
-**Use for:** Architecture decisions, complex debugging, multi-step planning
+## Availability Check Pattern
+
+Before using local LLMs, verify they're up:
 
-### 📚 Long Context
 ```bash
-# Large codebases, long documents, big context windows
-gemini -m gemini-2.5-pro "analyze this entire codebase" < large_file.txt
-opencode run -m github-copilot/gemini-3-pro-preview "summarize all these files"
+# Quick check
+llama_up() { curl -sf http://127.0.0.1:8080/health >/dev/null; }
+
+# Use with fallback
+if llama_up; then
+  curl http://127.0.0.1:8080/v1/chat/completions -d '{"model": "gemma", ...}'
+else
+  opencode run -m github-copilot/claude-haiku-4.5 "..."
+fi
 ```
 
-**Use for:** Analyzing large files, long documents, full codebase understanding
+For Flynn: check `curl -sf http://127.0.0.1:8080/health` before routing to local.
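+
+The fallback above jumps straight to Copilot. For data that must stay private, the homelab box is the intermediate stop — a sketch, assuming a recent Ollama (which serves the same OpenAI-compatible route) and a model name taken from its `/api/tags` list (`qwen3:30b` here is a placeholder):
+
+```bash
+# Private fallback chain: local llama-swap first, homelab Ollama second
+if curl -sf http://127.0.0.1:8080/health >/dev/null; then
+  HOST=http://127.0.0.1:8080;      MODEL=gemma
+else
+  HOST=http://100.85.116.57:11434; MODEL=qwen3:30b   # placeholder name
+fi
+curl -s "$HOST/v1/chat/completions" \
+  -H "Content-Type: application/json" \
+  -d "$(jq -n --arg m "$MODEL" '{model: $m, messages: [{role: "user", content: "Summarize these notes"}]}')" |
+  jq -r '.choices[0].message.content'
+```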
 
-## Quick Reference
+## Service Management
 
-| Task | Model | CLI Command |
-|------|-------|-------------|
-| Parse JSON/logs | haiku | `opencode run -m github-copilot/claude-haiku-4.5 "..."` |
-| Simple summary | flash | `gemini -m gemini-2.0-flash "..."` |
-| Code review | sonnet | `opencode run -m github-copilot/claude-sonnet-4.5 "..."` |
-| Write code | codex | `opencode run -m github-copilot/gpt-5.1-codex "..."` |
-| Debug complex issue | sonnet/opus | `claude -p --model sonnet "..."` |
-| Architecture design | opus | `claude -p --model opus "..."` |
-| Analyze large file | gemini-pro | `gemini -m gemini-2.5-pro "..." < file` |
-| Quick kubectl help | flash | `opencode run -m zai-coding-plan/glm-4.5-flash "..."` |
-
-## Cost Optimization Rules
-
-1. **Start small** — Try haiku/flash first, escalate only if needed
-2. **Batch similar tasks** — One opus call > five haiku calls for complex work
-3. **Use subscriptions** — GitHub Copilot models are "free" with subscription
-4. **Cache results** — Don't re-ask the same question
-5. **Context matters** — Smaller context = faster + cheaper
-
-## Example Workflows
-
-### Triage emails (cheap)
 ```bash
-opencode run -m github-copilot/claude-haiku-4.5 "categorize these emails as urgent/normal/spam"
+# Start local LLMs
+systemctl --user start llama-swap
+
+# Stop (save GPU for gaming)
+systemctl --user stop llama-swap
+
+# Check status
+systemctl --user status llama-swap
 ```
 
-### Code review (balanced)
+## Model Selection Matrix
+
+| Scenario | First Choice | Fallback |
+|----------|--------------|----------|
+| Private data | `qwen3` (local) | — (no fallback) |
+| Long task | `coder` (local) | Homelab Ollama |
+| Quick question | `gemma` (local) | `haiku` (Copilot) |
+| Code review | `coder` (local) | `sonnet` (Copilot) |
+| Complex reasoning | `opus` (cloud) | `qwen3` (local) |
+| Bulk processing | Multi-agent local | — |
+| 100k+ context | `gemini-2.5-pro` | — |
+
+## For Flynn
+
+### Before using local LLMs:
 ```bash
-opencode run -m github-copilot/claude-sonnet-4.5 "review this PR for issues"
+curl -sf http://127.0.0.1:8080/health
 ```
 
-### Architectural decision (powerful)
-```bash
-claude -p --model opus "given these constraints, design the best approach for..."
-```
+### For parallel work:
+- Spawn sub-agents with `sessions_spawn`
+- Each can hit the local endpoint independently
+- Coordinate results in main session
 
-### Summarize long doc (long context)
-```bash
-cat huge_document.md | gemini -m gemini-2.5-pro "summarize key points"
-```
-
-## For Flynn (Clawdbot)
-
-When spawning sub-agents or doing background work:
-- Use `sessions_spawn` with appropriate model hints
-- For simple extraction: spawn with default (cheaper model)
-- For complex analysis: explicitly request opus
-
-When using exec to call CLIs:
-- Prefer `opencode run` for one-shot tasks (GitHub Copilot = included)
-- Use `claude -p` when you need Claude-specific capabilities
-- Use `gemini` for very long context or multimodal
+### Privacy rule:
+If the data is sensitive (credentials, personal info, proprietary), **only use local**.
+Never send to cloud APIs without explicit permission.
 
 ---
-*Principle: Don't use a sledgehammer to hang a picture.*
+*Principle: Local first. Check availability. Parallelize when beneficial.*
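+
+## Appendix: Router Sketch
+
+The "Quick question" row of the matrix as code — a minimal sketch, not a drop-in tool. `route_quick` is a hypothetical name; the health check and Copilot invocation are the ones used throughout this guide:
+
+```bash
+# Quick question: gemma locally when llama-swap is up, haiku via Copilot otherwise
+route_quick() {
+  if curl -sf http://127.0.0.1:8080/health >/dev/null; then
+    curl -s http://127.0.0.1:8080/v1/chat/completions \
+      -H "Content-Type: application/json" \
+      -d "$(jq -n --arg p "$1" '{model: "gemma", messages: [{role: "user", content: $p}]}')" |
+      jq -r '.choices[0].message.content'
+  else
+    opencode run -m github-copilot/claude-haiku-4.5 "$1"
+  fi
+}
+
+route_quick "what does set -euo pipefail do?"
+```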