# LLM Routing Guide
Use the right model for the job. Local first when possible.
## Core Principles
- Privacy/Confidentiality → Local LLMs (data never leaves the machine)
- Long-running tasks → Local LLMs (no API costs, no rate limits)
- Parallelism → Multi-agent with local LLMs (spawn multiple workers)
- Check availability → Local LLMs may not always be running
- Cost efficiency → Local → Copilot → Cloud APIs
## Available Resources

### Local (llama-swap @ :8080)
```bash
# Check if running
curl -s http://127.0.0.1:8080/health && echo "UP" || echo "DOWN"

# List loaded models
curl -s http://127.0.0.1:8080/v1/models | jq -r '.data[].id'
```
| Alias | Model | Best For |
|---|---|---|
| `gemma` | Gemma-3-12B | Fast, balanced, fits fully |
| `qwen3` | Qwen3-30B-A3B | General purpose, quality |
| `coder` | Qwen3-Coder-30B | Code specialist |
| `glm` | GLM-4.7-Flash | Fast reasoning |
| `reasoning` | Ministral-3-14B | Reasoning tasks |
| `gpt-oss` | GPT-OSS-20B | Experimental |
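To target one of these aliases directly, a minimal request against the OpenAI-compatible endpoint looks something like the sketch below; the prompt and `max_tokens` value are placeholders, and the `jq` filter assumes the standard OpenAI response shape:

```bash
# Ask the code-specialist alias a question; swap "coder" for any alias above
curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "coder",
        "messages": [{"role": "user", "content": "Write a bash one-liner that counts unique IPs in access.log"}],
        "max_tokens": 512
      }' | jq -r '.choices[0].message.content'
```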
### Homelab (Ollama @ 100.85.116.57:11434)

- Larger models, more capacity
- Still private (your network)
- Check:

```bash
curl -s http://100.85.116.57:11434/api/tags | jq '.models[].name'
```
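For an actual request, Ollama's chat API works like the sketch below; the model name is a placeholder, so substitute one reported by `/api/tags`:

```bash
# Non-streaming chat request against the homelab Ollama instance
# (model name is a placeholder -- pick one from /api/tags)
curl -s http://100.85.116.57:11434/api/chat \
  -d '{
        "model": "llama3.1:70b",
        "messages": [{"role": "user", "content": "Summarize the tradeoffs of running LLMs locally."}],
        "stream": false
      }' | jq -r '.message.content'
```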
### GitHub Copilot (via opencode)
- "Free" with subscription
- Good for one-shot tasks
- Has internet access
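Example one-shot, reusing the model string that appears in the fallback snippets later in this guide:

```bash
# One-shot question routed through GitHub Copilot (no local GPU needed)
opencode run -m github-copilot/claude-haiku-4.5 "Explain the difference between a hard link and a symlink in two sentences."
```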
### Cloud APIs (Clawdbot/me)
- Most capable (opus, sonnet)
- Best tool integration
- Paid per token
## When to Use What

### 🔒 Sensitive/Private Data → LOCAL ONLY
```bash
# Credentials, personal info, proprietary code
curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen3", "messages": [{"role": "user", "content": "Review this private config..."}]}'
```

Never send sensitive data to cloud APIs.
### ⏱️ Long-Running Tasks → LOCAL

```bash
# Analysis that takes minutes, batch processing
curl http://127.0.0.1:8080/v1/chat/completions \
  -d '{"model": "coder", "messages": [...], "max_tokens": 4096}'
```
- No API timeouts
- No rate limits
- No cost accumulation
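A sketch of an unattended batch job, assuming a `./reports/` directory of markdown files to summarize (the paths and prompt are illustrative):

```bash
# Summarize every report sequentially; no rate limits or per-token cost to worry about
for f in ./reports/*.md; do
  jq -n --arg doc "$(cat "$f")" \
    '{model: "coder", max_tokens: 4096,
      messages: [{role: "user", content: ("Summarize this document:\n\n" + $doc)}]}' |
  curl -s http://127.0.0.1:8080/v1/chat/completions \
       -H "Content-Type: application/json" -d @- |
  jq -r '.choices[0].message.content' > "${f%.md}.summary.txt"
done
```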
### 🚀 Parallel Work → MULTI-AGENT LOCAL
When speed matters, spawn multiple workers:
```bash
# Flynn can spawn sub-agents hitting local LLMs
# Each agent works independently, results merge
```
- Use for: bulk analysis, multi-file processing, research tasks
- Coordinate via `sessions_spawn` with local model routing (see the sketch below)
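For comparison, the same fan-out can be done in plain shell against the local endpoint; the file glob and prompt below are illustrative, and Flynn's `sessions_spawn` handles the equivalent coordination at the agent level:

```bash
# Fan out one local request per file, wait for all workers, then merge results
analyze() {
  jq -n --arg doc "$(cat "$1")" \
    '{model: "qwen3", messages: [{role: "user", content: ("Review this file:\n\n" + $doc)}]}' |
  curl -s http://127.0.0.1:8080/v1/chat/completions \
       -H "Content-Type: application/json" -d @- |
  jq -r '.choices[0].message.content' > "$1.review"
}

for f in src/*.py; do
  analyze "$f" &   # each worker runs independently
done
wait               # block until every background worker finishes
cat src/*.py.review > combined_review.txt
```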
### ⚡ Quick One-Shot → COPILOT or LOCAL

```bash
# If local is up, prefer it
curl -sf http://127.0.0.1:8080/health && \
  curl http://127.0.0.1:8080/v1/chat/completions -d '{"model": "gemma", ...}' || \
  opencode run -m github-copilot/claude-haiku-4.5 "quick question"
```
### 🧠 Complex Reasoning → CLOUD (opus)
- Multi-step orchestration
- Tool use (browser, APIs, messaging)
- When quality > cost
### 📚 Massive Context (>32k) → GEMINI

```bash
cat huge_file.md | gemini -m gemini-2.5-pro "analyze"
```
## Availability Check Pattern
Before using local LLMs, verify they're up:
```bash
# Quick check
llama_up() { curl -sf http://127.0.0.1:8080/health >/dev/null; }

# Use with fallback
if llama_up; then
  curl http://127.0.0.1:8080/v1/chat/completions -d '{"model": "gemma", ...}'
else
  opencode run -m github-copilot/claude-haiku-4.5 "..."
fi
```
For Flynn: check `curl -sf http://127.0.0.1:8080/health` before routing to local.
## Service Management

```bash
# Start local LLMs
systemctl --user start llama-swap

# Stop (save GPU for gaming)
systemctl --user stop llama-swap

# Check status
systemctl --user status llama-swap
```
## Model Selection Matrix
| Scenario | First Choice | Fallback |
|---|---|---|
| Private data | `qwen3` (local) | — (no fallback) |
| Long task | `coder` (local) | Homelab Ollama |
| Quick question | `gemma` (local) | haiku (Copilot) |
| Code review | `coder` (local) | sonnet (Copilot) |
| Complex reasoning | opus (cloud) | `qwen3` (local) |
| Bulk processing | Multi-agent local | — |
| 100k+ context | gemini-2.5-pro | — |
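A hypothetical helper that encodes the matrix's local-first preference with an availability check; the `route` and `ask_local` functions are illustrative, not existing tools, and Copilot model ids beyond `claude-haiku-4.5` are left as placeholders:

```bash
# Hypothetical router sketch: prefer local, fall back per the matrix when it's down
llama_up() { curl -sf http://127.0.0.1:8080/health >/dev/null; }

ask_local() {  # ask_local <alias> <prompt> -- minimal chat call against llama-swap
  jq -n --arg m "$1" --arg p "$2" '{model: $m, messages: [{role: "user", content: $p}]}' |
  curl -s http://127.0.0.1:8080/v1/chat/completions -H "Content-Type: application/json" -d @- |
  jq -r '.choices[0].message.content'
}

route() {  # route <scenario> <prompt>
  case "$1" in
    private) llama_up && ask_local qwen3 "$2" || { echo "local is down: refusing to send private data" >&2; return 1; } ;;
    code)    llama_up && ask_local coder "$2" || opencode run -m github-copilot/claude-haiku-4.5 "$2" ;;  # or a sonnet-class model
    quick)   llama_up && ask_local gemma "$2" || opencode run -m github-copilot/claude-haiku-4.5 "$2" ;;
    *)       echo "unknown scenario: $1" >&2; return 1 ;;
  esac
}

route quick "What does HTTP status 418 mean?"
```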
## For Flynn

Before using local LLMs:

```bash
curl -sf http://127.0.0.1:8080/health
```
For parallel work:
- Spawn sub-agents with `sessions_spawn`
- Each can hit the local endpoint independently
- Coordinate results in the main session
Privacy rule:
If the data is sensitive (credentials, personal info, proprietary), only use local. Never send to cloud APIs without explicit permission.
Principle: Local first. Check availability. Parallelize when beneficial.