# LLM Routing Guide

Use the right model for the job. **Local first** when possible.

## Core Principles

1. **Privacy/Confidentiality** → Local LLMs (data never leaves the machine)
2. **Long-running tasks** → Local LLMs (no API costs, no rate limits)
3. **Parallelism** → Multi-agent with local LLMs (spawn multiple workers)
4. **Check availability** → Local LLMs may not always be running
5. **Cost efficiency** → Local → Copilot → Cloud APIs

## Available Resources

### Local (llama-swap @ :8080)

```bash
# Check if running
curl -s http://127.0.0.1:8080/health && echo "UP" || echo "DOWN"

# List loaded models
curl -s http://127.0.0.1:8080/v1/models | jq -r '.data[].id'
```

| Alias | Model | Best For |
|-------|-------|----------|
| `gemma` | Gemma-3-12B | Fast, balanced, fits fully |
| `qwen3` | Qwen3-30B-A3B | General purpose, quality |
| `coder` | Qwen3-Coder-30B | Code specialist |
| `glm` | GLM-4.7-Flash | Fast reasoning |
| `reasoning` | Ministral-3-14B | Reasoning tasks |
| `gpt-oss` | GPT-OSS-20B | Experimental |

### Homelab (Ollama @ 100.85.116.57:11434)

- Larger models, more capacity
- Still private (your network)
- Check: `curl -s http://100.85.116.57:11434/api/tags | jq '.models[].name'`

### GitHub Copilot (via opencode)

- "Free" with subscription
- Good for one-shot tasks
- Has internet access

### Cloud APIs (Clawdbot/me)

- Most capable (opus, sonnet)
- Best tool integration
- Paid per token

## When to Use What

### 🔒 Sensitive/Private Data → **LOCAL ONLY**

```bash
# Credentials, personal info, proprietary code
curl http://127.0.0.1:8080/v1/chat/completions \
  -d '{"model": "qwen3", "messages": [{"role": "user", "content": "Review this private config..."}]}'
```

**Never send sensitive data to cloud APIs.**

### ⏱️ Long-Running Tasks → **LOCAL**

```bash
# Analysis that takes minutes, batch processing
curl http://127.0.0.1:8080/v1/chat/completions \
  -d '{"model": "coder", "messages": [...], "max_tokens": 4096}'
```

- No API timeouts
- No rate limits
- No cost accumulation

### 🚀 Parallel Work → **MULTI-AGENT**

When speed matters, spawn multiple workers:

```bash
# Flynn can spawn sub-agents targeting any LLM
# Each agent works independently, results merge
```

- Use for: bulk analysis, multi-file processing, research tasks
- Coordinate via `sessions_spawn` with model param
- **Local:** best for privacy + no rate limits
- **Cloud:** best for complex tasks needing quality
- Mix and match based on task requirements

### ⚡ Quick One-Shot → **COPILOT or LOCAL**

```bash
# If local is up, prefer it
curl -sf http://127.0.0.1:8080/health && \
  curl http://127.0.0.1:8080/v1/chat/completions -d '{"model": "gemma", ...}' || \
  opencode run -m github-copilot/claude-haiku-4.5 "quick question"
```

### 🧠 Complex Reasoning → **CLOUD (opus)**

- Multi-step orchestration
- Tool use (browser, APIs, messaging)
- When quality > cost

### 📚 Massive Context (>32k) → **GEMINI**

```bash
cat huge_file.md | gemini -m gemini-2.5-pro "analyze"
```

## Availability Check Pattern

Before using local LLMs, verify they're up:

```bash
# Quick check
llama_up() { curl -sf http://127.0.0.1:8080/health >/dev/null; }

# Use with fallback
if llama_up; then
  curl http://127.0.0.1:8080/v1/chat/completions -d '{"model": "gemma", ...}'
else
  opencode run -m github-copilot/claude-haiku-4.5 "..."
fi
```

For Flynn: check `curl -sf http://127.0.0.1:8080/health` before routing to local.
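Pulling the pattern together, here is a minimal end-to-end sketch of that check-then-fallback flow. The `ask` helper name, the `Content-Type` header, and the `jq` extraction of `.choices[0].message.content` are assumptions based on the standard OpenAI-compatible response shape; the endpoint, model aliases, and `opencode` fallback come from the examples above.

```bash
#!/usr/bin/env bash
# Illustrative routing helper: local first, Copilot fallback.
# Assumes llama-swap's OpenAI-compatible API and the opencode CLI shown above.

LLAMA=http://127.0.0.1:8080

llama_up() { curl -sf "$LLAMA/health" >/dev/null; }

ask() {
  local model="$1" prompt="$2"
  if llama_up; then
    curl -s "$LLAMA/v1/chat/completions" \
      -H "Content-Type: application/json" \
      -d "$(jq -n --arg m "$model" --arg p "$prompt" \
            '{model: $m, messages: [{role: "user", content: $p}]}')" \
      | jq -r '.choices[0].message.content'
  else
    # Local server is down: fall back to Copilot via opencode
    opencode run -m github-copilot/claude-haiku-4.5 "$prompt"
  fi
}

ask gemma "Summarize the tradeoffs between local and cloud LLM routing."
```

With the helper defined, something like `ask coder "Review this function for bugs"` stays local whenever the server responds, which matches the local-first principle.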
## Service Management

```bash
# Start local LLMs
systemctl --user start llama-swap

# Stop (save GPU for gaming)
systemctl --user stop llama-swap

# Check status
systemctl --user status llama-swap
```

## Model Selection Matrix

| Scenario | First Choice | Fallback |
|----------|--------------|----------|
| Private data | `qwen3` (local) | — (no fallback) |
| Long task | `coder` (local) | Homelab Ollama |
| Quick question | `gemma` (local) | `haiku` (Copilot) |
| Code review | `coder` (local) | `sonnet` (Copilot) |
| Complex reasoning | `opus` (cloud) | `qwen3` (local) |
| Bulk processing | Multi-agent local | — |
| 100k+ context | `gemini-2.5-pro` | — |

## For Flynn

### Before using local LLMs:

```bash
curl -sf http://127.0.0.1:8080/health
```

### For parallel work:

- Spawn sub-agents with `sessions_spawn`
- Each can hit the local endpoint independently (see the sketch at the end of this guide)
- Coordinate results in the main session

### Privacy rule:

If the data is sensitive (credentials, personal info, proprietary), **only use local**. Never send it to cloud APIs without explicit permission.

---

*Principle: Local first. Check availability. Parallelize when beneficial.*
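For the bulk-processing row and the parallel-work notes above, plain shell parallelism against the local endpoint is enough to sketch the idea. This is a minimal illustration, not how `sessions_spawn` itself works; the `src/*.py` glob, the review prompt, and the `.review` output files are placeholders.

```bash
#!/usr/bin/env bash
# Illustrative bulk processing against the local endpoint:
# one background worker per input file, results collected when all finish.

LLAMA=http://127.0.0.1:8080

review_file() {
  local f="$1"
  curl -s "$LLAMA/v1/chat/completions" \
    -H "Content-Type: application/json" \
    -d "$(jq -n --arg code "$(cat "$f")" \
          '{model: "coder", messages: [{role: "user", content: ("Review this file:\n" + $code)}]}')" \
    | jq -r '.choices[0].message.content' > "$f.review"
}

for f in src/*.py; do
  review_file "$f" &   # each worker hits the local endpoint independently
done
wait                   # merge point: all reviews now sit in *.review files
```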