# llama.cpp Integration Design

**Date:** 2026-02-05
**Status:** Approved
**Target Model:** Qwen 2.5 14B Q4_K_M (fits 12GB VRAM)

## Overview

Add llama.cpp support to Flynn for local, CUDA-accelerated LLM inference. Flynn acts only as an HTTP client: it connects to a `llama-server` instance that the user installs and runs, typically as a systemd user service.

## Architecture

```
┌─────────────────┐     HTTP/SSE      ┌──────────────────┐
│     Flynn       │ ◄───────────────► │   llama-server   │
│  LlamaCppClient │   localhost:8080  │  (systemd user)  │
└─────────────────┘                   └──────────────────┘
                                              │
                                              ▼
                                      ┌──────────────────┐
                                      │  CUDA / GPU      │
                                      │  qwen2.5-14b.gguf│
                                      └──────────────────┘
```

## Integration Approach

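**Dedicated LlamaCppClient** - Kept separate from the OpenAI client, even though the two APIs are nearly identical, because:
- `llamacpp` is already a distinct provider in the config schema
- It keeps cloud and local concerns separate
- It leaves room for llama.cpp-specific error handling, such as a clear message when the server is not running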

## LlamaCppClient Implementation

### File: `src/models/local/llamacpp.ts`

```typescript
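// Assumption: the request/response types live next to ModelClient in src/models/types.ts.
import type { ChatRequest, ChatResponse, ChatStreamEvent, ModelClient } from '../types';

export interface LlamaCppClientConfig {
  endpoint: string;        // Required, e.g., "http://localhost:8080"
  authToken?: string;      // Optional Bearer token
}

// Public surface only; method bodies are omitted in this design doc.
export declare class LlamaCppClient implements ModelClient {
  constructor(config: LlamaCppClientConfig);
  chat(request: ChatRequest): Promise<ChatResponse>;
  chatStream(request: ChatRequest): AsyncIterable<ChatStreamEvent>;
}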
```

### API Details

Calls llama-server's OpenAI-compatible `/v1/chat/completions` endpoint:

**Request (streaming; set `"stream": false` to receive a single JSON response):**
```json
{
  "messages": [{ "role": "user", "content": "Hello" }],
  "stream": true,
  "max_tokens": 2048
}
```

**Response (non-streaming):**
```json
{
  "choices": [{ "message": { "content": "Hi there!" } }],
  "usage": { "prompt_tokens": 5, "completion_tokens": 10 }
}
```
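
The non-streaming exchange above could be handled roughly as follows. This is a minimal sketch, not Flynn's actual implementation: `chatOnce`, `ChatMessage`, and `FetchLike` are hypothetical names, and the fetch function is injected so the helper can be exercised without a running server.

```typescript
// Mirrors LlamaCppClientConfig from the design above.
interface LlamaCppClientConfig {
  endpoint: string;
  authToken?: string;
}

// Hypothetical message shape matching the request example.
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

// Minimal structural type for fetch, so a fake can be injected in tests.
type FetchLike = (
  url: string,
  init: { method: string; headers: Record<string, string>; body: string },
) => Promise<{ ok: boolean; status: number; json(): Promise<unknown> }>;

// Hypothetical helper: sends one non-streaming chat request.
export async function chatOnce(
  config: LlamaCppClientConfig,
  messages: ChatMessage[],
  fetchImpl: FetchLike,
  maxTokens = 2048,
): Promise<{ content: string; promptTokens: number; completionTokens: number }> {
  const headers: Record<string, string> = { "Content-Type": "application/json" };
  if (config.authToken) headers["Authorization"] = `Bearer ${config.authToken}`;

  // Strip trailing slashes so "http://host:8080/" and "http://host:8080" both work.
  const url = `${config.endpoint.replace(/\/+$/, "")}/v1/chat/completions`;
  const res = await fetchImpl(url, {
    method: "POST",
    headers,
    body: JSON.stringify({ messages, stream: false, max_tokens: maxTokens }),
  });
  if (!res.ok) throw new Error(`llama-server returned HTTP ${res.status}`);

  const data = (await res.json()) as {
    choices?: { message?: { content?: string } }[];
    usage?: { prompt_tokens?: number; completion_tokens?: number };
  };
  return {
    content: data.choices?.[0]?.message?.content ?? "",
    promptTokens: data.usage?.prompt_tokens ?? 0,
    completionTokens: data.usage?.completion_tokens ?? 0,
  };
}
```

In Node 18 or later the global `fetch` can be passed as `fetchImpl`; a unit test can pass a stub that returns a canned response.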

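**Streaming:** Server-Sent Events. Each event is a `data: {...}` line carrying a JSON chunk with the incremental text in `choices[0].delta.content`; the stream ends with `data: [DONE]`.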
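
A sketch of how the SSE stream might be decoded into text deltas. It assumes OpenAI-style chunks (`choices[0].delta.content`) and a final `data: [DONE]` sentinel. `parseSseDeltas` is a hypothetical helper name, and the input is an async iterable of already-decoded text chunks rather than a real HTTP body.

```typescript
// Hypothetical helper: turns raw SSE text chunks into content deltas.
export async function* parseSseDeltas(
  chunks: AsyncIterable<string>,
): AsyncGenerator<string> {
  let buffer = "";
  for await (const chunk of chunks) {
    buffer += chunk;
    // An event can be split across network chunks, so only consume complete lines.
    let newline: number;
    while ((newline = buffer.indexOf("\n")) >= 0) {
      const line = buffer.slice(0, newline).trim();
      buffer = buffer.slice(newline + 1);
      if (!line.startsWith("data:")) continue; // skip blank separators and comments
      const payload = line.slice("data:".length).trim();
      if (payload === "[DONE]") return; // assumed end-of-stream sentinel
      const parsed = JSON.parse(payload) as {
        choices?: { delta?: { content?: string } }[];
      };
      const text = parsed.choices?.[0]?.delta?.content;
      if (text) yield text;
    }
  }
}
```

`chatStream` could wrap each yielded delta in a `ChatStreamEvent`; buffering by line keeps the parser correct when a JSON chunk arrives in two pieces.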

### Error Handling

| Error | Message |
|-------|---------|
| Connection refused | "llama-server not running at {endpoint}" |
| Timeout (60s default) | "llama-server request timed out" |
| HTTP 4xx/5xx | Pass through server error message |
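
The table could be implemented along these lines. This is a sketch under stated assumptions: `LlamaCppError`, `mapTransportError`, and `mapHttpError` are hypothetical names; the `ECONNREFUSED` check relies on Node's built-in `fetch` attaching the system error as `cause`; and the `{ "error": { "message": ... } }` body shape is assumed from the OpenAI-style error format.

```typescript
// Hypothetical error type; the class name and `status` field are assumptions.
export class LlamaCppError extends Error {
  status?: number;
  constructor(message: string, status?: number) {
    super(message);
    this.name = "LlamaCppError";
    this.status = status;
  }
}

// Maps a rejected fetch() call to the first two rows of the table.
export function mapTransportError(err: unknown, endpoint: string): LlamaCppError {
  const e = err as { name?: string; message?: string; cause?: { code?: string } } | null;
  // AbortSignal.timeout() rejects with "TimeoutError"; AbortController.abort()
  // rejects with "AbortError". Both are treated as a timeout here.
  if (e?.name === "TimeoutError" || e?.name === "AbortError") {
    return new LlamaCppError("llama-server request timed out");
  }
  // Node's fetch wraps socket failures in a TypeError whose `cause` carries the code.
  if (e?.cause?.code === "ECONNREFUSED") {
    return new LlamaCppError(`llama-server not running at ${endpoint}`);
  }
  return new LlamaCppError(e?.message ?? String(err));
}

// Builds the pass-through message for an HTTP 4xx/5xx response body.
export function mapHttpError(status: number, bodyText: string): LlamaCppError {
  let message = bodyText.trim() || `HTTP ${status}`;
  try {
    const parsed = JSON.parse(bodyText) as { error?: { message?: string } | string };
    if (typeof parsed.error === "string") message = parsed.error;
    else if (parsed.error?.message) message = parsed.error.message;
  } catch {
    // Body was not JSON (or not an object); keep the raw text.
  }
  return new LlamaCppError(message, status);
}
```

Keeping the mapping in pure functions lets the unit tests cover every row of the table without opening a socket.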

## Deployment

### User Responsibility: systemd user service

```ini
# ~/.config/systemd/user/llama-server.service
[Unit]
Description=llama.cpp server
After=network.target

[Service]
ExecStart=/usr/bin/llama-server -m /data/models/qwen2.5-14b-q4_k_m.gguf -ngl 99 -c 8192
Restart=on-failure

[Install]
WantedBy=default.target
```

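Key flags:
- `-ngl 99` - Number of layers to offload to the GPU (CUDA); 99 exceeds the model's layer count, so every layer is offloaded
- `-c 8192` - Context window size, in tokens
- `-m` - Path to the GGUF model file

No `--port` is passed, so the server listens on its default port, 8080, which matches the endpoint shown in the architecture diagram.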

Enable with:
```bash
systemctl --user enable --now llama-server
```

### Flynn Configuration

```yaml
models:
  local:
    provider: llamacpp
    endpoint: http://localhost:8080
    # auth_token: optional
```

## Files Changed

| File | Change |
|------|--------|
| `src/models/local/llamacpp.ts` | **New** - LlamaCppClient implementation |
| `src/models/local/index.ts` | Export LlamaCppClient |
| Router/factory code | Handle `provider: 'llamacpp'` when creating clients |
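
The router/factory change might look like the sketch below. Everything here is illustrative: `LocalModelConfig`, `createLocalClient`, and the stand-in `LlamaCppClient` class are hypothetical, and the snake_case `auth_token` key is taken from the YAML example above.

```typescript
// Hypothetical shape of one `models.*` entry after YAML parsing.
interface LocalModelConfig {
  provider: string;
  endpoint?: string;
  auth_token?: string;
}

interface LlamaCppClientConfig {
  endpoint: string;
  authToken?: string;
}

// Stand-in for the real client so this sketch is self-contained.
class LlamaCppClient {
  readonly config: LlamaCppClientConfig;
  constructor(config: LlamaCppClientConfig) {
    this.config = config;
  }
}

// Hypothetical factory branch for `provider: 'llamacpp'`.
export function createLocalClient(cfg: LocalModelConfig): LlamaCppClient {
  switch (cfg.provider) {
    case "llamacpp": {
      // The endpoint is required by LlamaCppClientConfig, so fail early if it is missing.
      if (!cfg.endpoint) {
        throw new Error("llamacpp provider requires an endpoint");
      }
      return new LlamaCppClient({ endpoint: cfg.endpoint, authToken: cfg.auth_token });
    }
    default:
      throw new Error(`Unsupported local provider: ${cfg.provider}`);
  }
}
```

Validating the endpoint in the factory surfaces configuration mistakes at startup rather than on the first chat request.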

## Files Unchanged

- `src/config/schema.ts` - The provider enum already includes `llamacpp`
- `src/models/types.ts` - The existing `ModelClient` interface already fits

## Testing

1. Unit tests with mocked HTTP responses
2. Integration test against running llama-server (skipped in CI)
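
Both kinds of test can be supported by two small helpers, sketched here without a test framework. The names `mockFetch` and `shouldSkipIntegration` are hypothetical, as is the `LLAMA_SERVER_URL` opt-in variable; `CI` is the variable most CI providers set.

```typescript
// Minimal response shape so the mock needs no HTTP library.
interface MockResponse {
  ok: boolean;
  status: number;
  json(): Promise<unknown>;
  text(): Promise<string>;
}

interface RecordedCall {
  url: string;
  body: unknown;
}

// Hypothetical test helper: a fetch stand-in that always answers with the given
// status and JSON body, and records every request it receives.
export function mockFetch(status: number, body: unknown) {
  const calls: RecordedCall[] = [];
  const fetchImpl = async (
    url: string,
    init?: { body?: string },
  ): Promise<MockResponse> => {
    calls.push({ url, body: init?.body ? JSON.parse(init.body) : undefined });
    return {
      ok: status >= 200 && status < 300,
      status,
      json: async () => body,
      text: async () => JSON.stringify(body),
    };
  };
  return { fetchImpl, calls };
}

// Decides whether the integration test should be skipped: always in CI, and
// locally unless the developer points the test at a running server.
export function shouldSkipIntegration(
  env: Record<string, string | undefined>,
): boolean {
  return Boolean(env["CI"]) || !env["LLAMA_SERVER_URL"];
}
```

A unit test would pass `mockFetch(...).fetchImpl` to the client and assert on `calls`; the integration test would check `shouldSkipIntegration(process.env)` before contacting the server.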