LLM Routing Guide

Use the right model for the job. Local first when possible.

Core Principles

  1. Privacy/Confidentiality → Local LLMs (data never leaves the machine)
  2. Long-running tasks → Local LLMs (no API costs, no rate limits)
  3. Parallelism → Multi-agent with local LLMs (spawn multiple workers)
  4. Check availability → Local LLMs may not always be running
  5. Cost efficiency → Local → Copilot → Cloud APIs

Available Resources

Local (llama-swap @ :8080)

# Check if running
curl -sf http://127.0.0.1:8080/health >/dev/null && echo "UP" || echo "DOWN"

# List loaded models
curl -s http://127.0.0.1:8080/v1/models | jq -r '.data[].id'
Alias      Model            Best For
gemma      Gemma-3-12B      Fast, balanced, fits fully
qwen3      Qwen3-30B-A3B    General purpose, quality
coder      Qwen3-Coder-30B  Code specialist
glm        GLM-4.7-Flash    Fast reasoning
reasoning  Ministral-3-14B  Reasoning tasks
gpt-oss    GPT-OSS-20B      Experimental
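
A complete request against one of these aliases looks like this (the endpoint is OpenAI-compatible, as the snippets below assume; the model choice and prompt here are illustrative):

# Ask the code specialist a question and extract just the reply text
curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "coder", "messages": [{"role": "user", "content": "Write a bash one-liner that counts lines across *.py files"}], "max_tokens": 512}' |
  jq -r '.choices[0].message.content'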

Homelab (Ollama @ 100.85.116.57:11434)

  • Larger models, more capacity
  • Still private (your network)
  • Check: curl -s http://100.85.116.57:11434/api/tags | jq '.models[].name'
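
A sample request against the homelab instance (the model name is illustrative; use whatever /api/tags reports as installed):

# Ollama's native chat endpoint; "stream": false returns one JSON object
curl -s http://100.85.116.57:11434/api/chat \
  -d '{"model": "llama3.1:70b", "messages": [{"role": "user", "content": "Summarize the tradeoffs of RAID-Z2"}], "stream": false}' |
  jq -r '.message.content'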

GitHub Copilot (via opencode)

  • "Free" with subscription
  • Good for one-shot tasks
  • Has internet access

Cloud APIs (Clawdbot/me)

  • Most capable (opus, sonnet)
  • Best tool integration
  • Paid per token

When to Use What

🔒 Sensitive/Private Data → LOCAL ONLY

# Credentials, personal info, proprietary code
curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen3", "messages": [{"role": "user", "content": "Review this private config..."}]}'

Never send sensitive data to cloud APIs.

⏱️ Long-Running Tasks → LOCAL

# Analysis that takes minutes, batch processing
curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "coder", "messages": [...], "max_tokens": 4096}'
  • No API timeouts
  • No rate limits
  • No cost accumulation
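
A minimal batch-processing sketch along these lines (the file glob, output directory, and prompt are illustrative):

# Review every Python file in src/, one local request per file
mkdir -p reviews
for f in src/*.py; do
  jq -n --arg code "$(cat "$f")" \
    '{model: "coder", messages: [{role: "user", content: ("Review this code:\n" + $code)}], max_tokens: 4096}' |
    curl -s http://127.0.0.1:8080/v1/chat/completions \
      -H "Content-Type: application/json" -d @- |
    jq -r '.choices[0].message.content' > "reviews/$(basename "$f").md"
done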

🚀 Parallel Work → MULTI-AGENT

When speed matters, spawn multiple workers:

# Flynn can spawn sub-agents targeting any LLM
# Each agent works independently, results merge
  • Use for: bulk analysis, multi-file processing, research tasks
  • Coordinate via sessions_spawn with model param
  • Local: best for privacy + no rate limits
  • Cloud: best for complex tasks needing quality
  • Mix and match based on task requirements
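
Even without sub-agents, a shell-level fan-out against the local endpoint sketches the shape (prompts and output names are illustrative; sessions_spawn replaces this loop in practice):

# Launch independent workers in the background, then merge
for topic in "auth module" "storage layer" "API routes"; do
  jq -n --arg t "$topic" '{model: "qwen3", messages: [{role: "user", content: ("Summarize the " + $t)}]}' |
    curl -s http://127.0.0.1:8080/v1/chat/completions \
      -H "Content-Type: application/json" -d @- |
    jq -r '.choices[0].message.content' > "summary_${topic// /_}.txt" &
done
wait  # block until every worker finishes
cat summary_*.txt > merged.txt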

Quick One-Shot → COPILOT or LOCAL

# If local is up, prefer it
curl -sf http://127.0.0.1:8080/health && \
  curl http://127.0.0.1:8080/v1/chat/completions \
    -H "Content-Type: application/json" -d '{"model": "gemma", ...}' || \
  opencode run -m github-copilot/claude-haiku-4.5 "quick question"

🧠 Complex Reasoning → CLOUD (opus)

  • Multi-step orchestration
  • Tool use (browser, APIs, messaging)
  • When quality > cost

📚 Massive Context (>32k) → GEMINI

cat huge_file.md | gemini -m gemini-2.5-pro "analyze"

Availability Check Pattern

Before using local LLMs, verify they're up:

# Quick check
llama_up() { curl -sf http://127.0.0.1:8080/health >/dev/null; }

# Use with fallback
if llama_up; then
  curl http://127.0.0.1:8080/v1/chat/completions \
    -H "Content-Type: application/json" -d '{"model": "gemma", ...}'
else
  opencode run -m github-copilot/claude-haiku-4.5 "..."
fi
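
The same pattern as a reusable helper, with the elided JSON filled in (a sketch; the prompt handling and model choices are illustrative):

ask() {
  local prompt="$1"
  if llama_up; then
    jq -n --arg p "$prompt" '{model: "gemma", messages: [{role: "user", content: $p}]}' |
      curl -s http://127.0.0.1:8080/v1/chat/completions \
        -H "Content-Type: application/json" -d @- |
      jq -r '.choices[0].message.content'
  else
    opencode run -m github-copilot/claude-haiku-4.5 "$prompt"
  fi
}

ask "What does SIGHUP conventionally mean to a daemon?"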

For Flynn: run curl -sf http://127.0.0.1:8080/health before routing to local.

Service Management

# Start local LLMs
systemctl --user start llama-swap

# Stop (save GPU for gaming)
systemctl --user stop llama-swap

# Check status
systemctl --user status llama-swap
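
To bring the service up on demand before routing (a sketch; assumes the user unit is installed and the health endpoint above):

# Start only if not already active, then wait until it answers
systemctl --user is-active --quiet llama-swap || systemctl --user start llama-swap
until curl -sf http://127.0.0.1:8080/health >/dev/null; do sleep 1; done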

Model Selection Matrix

Scenario           First Choice       Fallback
Private data       qwen3 (local)      — (no fallback)
Long task          coder (local)      Homelab Ollama
Quick question     gemma (local)      haiku (Copilot)
Code review        coder (local)      sonnet (Copilot)
Complex reasoning  opus (cloud)       qwen3 (local)
Bulk processing    Multi-agent local  —
100k+ context      gemini-2.5-pro     —
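
The matrix condensed into a routing sketch (the scenario labels are invented here; llama_up is the helper defined above):

pick_model() {
  case "$1" in
    private)    llama_up && echo "local:qwen3" || return 1 ;;  # never fall back to cloud
    long)       llama_up && echo "local:coder" || echo "homelab:ollama" ;;
    quick)      llama_up && echo "local:gemma" || echo "copilot:haiku" ;;
    code)       llama_up && echo "local:coder" || echo "copilot:sonnet" ;;
    reasoning)  echo "cloud:opus" ;;
    bigcontext) echo "gemini:gemini-2.5-pro" ;;
  esac
}

pick_model quick   # local:gemma if llama-swap is up, copilot:haiku otherwise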

For Flynn

Before using local LLMs:

curl -sf http://127.0.0.1:8080/health

For parallel work:

  • Spawn sub-agents with sessions_spawn
  • Each can hit local endpoint independently
  • Coordinate results in main session

Privacy rule:

If the data is sensitive (credentials, personal info, proprietary), only use local. Never send to cloud APIs without explicit permission.


Principle: Local first. Check availability. Parallelize when beneficial.