
LLM Routing Guide

Use the right model for the job. Local first when possible.

Core Principles

  1. Privacy/Confidentiality → Local LLMs (data never leaves the machine)
  2. Long-running tasks → Local LLMs (no API costs, no rate limits)
  3. Parallelism → Multi-agent with local LLMs (spawn multiple workers)
  4. Check availability → Local LLMs may not always be running
  5. Cost efficiency → Local → Copilot → Cloud APIs

Available Resources

Local (llama-swap @ :8080)

# Check if running
curl -sf http://127.0.0.1:8080/health >/dev/null && echo "UP" || echo "DOWN"

# List loaded models
curl -s http://127.0.0.1:8080/v1/models | jq -r '.data[].id'
| Alias     | Model           | Best For                   |
|-----------|-----------------|----------------------------|
| gemma     | Gemma-3-12B     | Fast, balanced, fits fully |
| qwen3     | Qwen3-30B-A3B   | General purpose, quality   |
| coder     | Qwen3-Coder-30B | Code specialist            |
| glm       | GLM-4.7-Flash   | Fast reasoning             |
| reasoning | Ministral-3-14B | Reasoning tasks            |
| gpt-oss   | GPT-OSS-20B     | Experimental               |

Homelab (Ollama @ 100.85.116.57:11434)

  • Larger models, more capacity
  • Still private (your network)
  • Check: curl -s http://100.85.116.57:11434/api/tags | jq '.models[].name'
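If the homelab box is reachable, the same chat pattern works against Ollama's native API. A minimal sketch, assuming some model such as qwen2.5:32b has been pulled on that host (substitute whatever /api/tags actually reports):

# Hedged sketch: chat via homelab Ollama; the model name is a placeholder
curl -s http://100.85.116.57:11434/api/chat \
  -d '{"model": "qwen2.5:32b", "messages": [{"role": "user", "content": "Summarize this log output"}], "stream": false}' \
  | jq -r '.message.content'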

GitHub Copilot (via opencode)

  • "Free" with subscription
  • Good for one-shot tasks
  • Has internet access

Cloud APIs (Clawdbot/me)

  • Most capable (opus, sonnet)
  • Best tool integration
  • Paid per token

When to Use What

🔒 Sensitive/Private Data → LOCAL ONLY

# Credentials, personal info, proprietary code
curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen3", "messages": [{"role": "user", "content": "Review this private config..."}]}'

Never send sensitive data to cloud APIs.

⏱️ Long-Running Tasks → LOCAL

# Analysis that takes minutes, batch processing
curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "coder", "messages": [...], "max_tokens": 4096}'

  • No API timeouts
  • No rate limits
  • No cost accumulation
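For example, a minimal batch loop along these lines, assuming a notes/ directory of markdown files to summarize (directory, prompt, and output naming are illustrative):

# Hedged sketch: summarize every file in notes/ with the local coder model
for f in notes/*.md; do
  # jq -Rs safely packs the instruction plus file contents into a JSON string
  prompt=$(jq -Rs '"Summarize this file:\n\n" + .' < "$f")
  curl -s http://127.0.0.1:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d "{\"model\": \"coder\", \"messages\": [{\"role\": \"user\", \"content\": ${prompt}}], \"max_tokens\": 4096}" \
    | jq -r '.choices[0].message.content' > "${f%.md}.summary.txt"
done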

🚀 Parallel Work → MULTI-AGENT LOCAL

When speed matters, spawn multiple workers:

# Flynn can spawn sub-agents hitting local LLMs
# Each agent works independently, results merge
  • Use for: bulk analysis, multi-file processing, research tasks
  • Coordinate via sessions_spawn with local model routing
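At the shell level the same idea is independent requests fanned out as background jobs, with a final wait before merging results. A hedged sketch; the helper function, prompts, and file names are illustrative:

# Hedged sketch: fan out independent local requests, wait for all of them
ask_local() {  # $1 = prompt, $2 = output file
  curl -s http://127.0.0.1:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d "{\"model\": \"qwen3\", \"messages\": [{\"role\": \"user\", \"content\": $(jq -n --arg p "$1" '$p')}]}" \
    | jq -r '.choices[0].message.content' > "$2"
}

ask_local "Summarize module A" a.txt &
ask_local "Summarize module B" b.txt &
ask_local "Summarize module C" c.txt &
wait  # all workers finish before results are merged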

Quick One-Shot → LOCAL or COPILOT

# If local is up, prefer it
curl -sf http://127.0.0.1:8080/health && \
  curl http://127.0.0.1:8080/v1/chat/completions -d '{"model": "gemma", ...}' || \
  opencode run -m github-copilot/claude-haiku-4.5 "quick question"

🧠 Complex Reasoning → CLOUD (opus)

  • Multi-step orchestration
  • Tool use (browser, APIs, messaging)
  • When quality > cost

📚 Massive Context (>32k) → GEMINI

cat huge_file.md | gemini -m gemini-2.5-pro "analyze"
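A rough routing sketch, assuming the common ~4 characters per token heuristic and a ~32k local context ceiling (both are assumptions, not measured limits):

# Hedged sketch: estimate tokens (~4 chars each), send oversized inputs to Gemini
chars=$(wc -c < huge_file.md)
if [ $(( chars / 4 )) -gt 32000 ]; then
  cat huge_file.md | gemini -m gemini-2.5-pro "analyze"
else
  curl -s http://127.0.0.1:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d "{\"model\": \"qwen3\", \"messages\": [{\"role\": \"user\", \"content\": $(jq -Rs '"analyze\n\n" + .' < huge_file.md)}]}"
fi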

Availability Check Pattern

Before using local LLMs, verify they're up:

# Quick check
llama_up() { curl -sf http://127.0.0.1:8080/health >/dev/null; }

# Use with fallback
if llama_up; then
  curl http://127.0.0.1:8080/v1/chat/completions -d '{"model": "gemma", ...}'
else
  opencode run -m github-copilot/claude-haiku-4.5 "..."
fi
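The same pattern extends to the full cost chain (local, then homelab, then Copilot). A hedged sketch; the helper name and model choices are illustrative:

# Hedged sketch: send a prompt to the cheapest backend that is actually reachable
route_prompt() {
  local prompt="$1"
  local json_prompt
  json_prompt=$(jq -n --arg p "$prompt" '$p')
  if curl -sf http://127.0.0.1:8080/health >/dev/null; then
    curl -s http://127.0.0.1:8080/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d "{\"model\": \"qwen3\", \"messages\": [{\"role\": \"user\", \"content\": ${json_prompt}}]}"
  elif curl -sf http://100.85.116.57:11434/api/tags >/dev/null; then
    # Ollama model name is a placeholder; check /api/tags for what is pulled
    curl -s http://100.85.116.57:11434/api/chat \
      -d "{\"model\": \"qwen2.5:32b\", \"messages\": [{\"role\": \"user\", \"content\": ${json_prompt}}], \"stream\": false}"
  else
    opencode run -m github-copilot/claude-haiku-4.5 "$prompt"
  fi
}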

For Flynn: run curl -sf http://127.0.0.1:8080/health and only route to local if it succeeds.

Service Management

# Start local LLMs
systemctl --user start llama-swap

# Stop (save GPU for gaming)
systemctl --user stop llama-swap

# Check status
systemctl --user status llama-swap

Model Selection Matrix

| Scenario          | First Choice      | Fallback          |
|-------------------|-------------------|-------------------|
| Private data      | qwen3 (local)     | — (no fallback)   |
| Long task         | coder (local)     | Homelab Ollama    |
| Quick question    | gemma (local)     | haiku (Copilot)   |
| Code review       | coder (local)     | sonnet (Copilot)  |
| Complex reasoning | opus (cloud)      | qwen3 (local)     |
| Bulk processing   | Multi-agent local |                   |
| 100k+ context     | gemini-2.5-pro    |                   |
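Read mechanically, the matrix is just a lookup. A hedged sketch of a dispatcher that maps a scenario label to its first-choice model; the labels and output strings are illustrative, not an existing interface:

# Hedged sketch: map a scenario label to the first-choice model from the matrix
pick_model() {
  case "$1" in
    private)   echo "local/qwen3" ;;
    long-task) echo "local/coder" ;;
    quick)     echo "local/gemma" ;;
    code)      echo "local/coder" ;;
    reasoning) echo "cloud/opus" ;;
    bulk)      echo "local/multi-agent" ;;
    huge-ctx)  echo "gemini-2.5-pro" ;;
    *)         echo "local/qwen3" ;;  # local first by default
  esac
}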

For Flynn

Before using local LLMs:

curl -sf http://127.0.0.1:8080/health

For parallel work:

  • Spawn sub-agents with sessions_spawn
  • Each can hit local endpoint independently
  • Coordinate results in main session

Privacy rule:

If the data is sensitive (credentials, personal info, proprietary), only use local. Never send to cloud APIs without explicit permission.


Principle: Local first. Check availability. Parallelize when beneficial.