flynn/docs/plans/2026-02-05-llamacpp-integration-design.md
William Valentin 6f5dd741a9 docs: add llama.cpp integration design
Design for adding LlamaCppClient to support local LLM inference
via llama-server with CUDA. Target model: Qwen 2.5 14B Q4_K_M.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 13:05:58 -08:00


# llama.cpp Integration Design
**Date:** 2026-02-05

**Status:** Approved

**Target Model:** Qwen 2.5 14B Q4_K_M (fits in 12 GB of VRAM)
## Overview
Add llama.cpp support to Flynn for local LLM inference on CUDA. Flynn acts as a client that connects to a user-managed `llama-server` instance, which the user runs as a systemd user service.
## Architecture
```
┌─────────────────┐     HTTP/SSE      ┌──────────────────┐
│      Flynn      │ ◄───────────────► │   llama-server   │
│ LlamaCppClient  │  localhost:8080   │  (systemd user)  │
└─────────────────┘                   └──────────────────┘
                                      ┌──────────────────┐
                                      │    CUDA / GPU    │
                                      │ qwen2.5-14b.gguf │
                                      └──────────────────┘
```
## Integration Approach
**Dedicated LlamaCppClient** - Kept separate from the OpenAI client, despite the API similarity, because:
- `llamacpp` is a distinct provider in config schema
- Keeps cloud vs local concerns separate
- Room for llama.cpp-specific error handling
## LlamaCppClient Implementation
### File: `src/models/local/llamacpp.ts`
```typescript
export interface LlamaCppClientConfig {
  endpoint: string;   // Required, e.g., "http://localhost:8080"
  authToken?: string; // Optional Bearer token
}

// Signatures only; method bodies are omitted in this design.
export class LlamaCppClient implements ModelClient {
  chat(request: ChatRequest): Promise<ChatResponse>;
  chatStream(request: ChatRequest): AsyncIterable<ChatStreamEvent>;
}
```
### API Details
Calls llama-server's OpenAI-compatible `/v1/chat/completions` endpoint:

**Request:**
```json
{
  "messages": [{ "role": "user", "content": "Hello" }],
  "stream": true,
  "max_tokens": 2048
}
```
**Response (non-streaming):**
```json
{
  "choices": [{ "message": { "content": "Hi there!" } }],
  "usage": { "prompt_tokens": 5, "completion_tokens": 10 }
}
```
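
The request and response shapes above could be wired together roughly as follows. This is a minimal sketch, not Flynn's actual implementation: `ChatRequest` and `ChatResponse` are hypothetical stand-ins for the types in `src/models/types.ts` (which this document does not show), the helper names `buildChatCall` and `parseChatResponse` are invented for illustration, and the `max_tokens` default of 2048 is simply taken from the example request.

```typescript
// Hypothetical stand-ins for Flynn's types in src/models/types.ts (not shown here).
interface ChatMessage { role: string; content: string }
interface ChatRequest { messages: ChatMessage[]; maxTokens?: number }
interface ChatResponse { content: string; promptTokens: number; completionTokens: number }

interface LlamaCppClientConfig {
  endpoint: string;
  authToken?: string;
}

// Build the URL and fetch options for a chat completion call.
function buildChatCall(config: LlamaCppClientConfig, request: ChatRequest, stream: boolean) {
  const headers: Record<string, string> = { "Content-Type": "application/json" };
  if (config.authToken) headers["Authorization"] = `Bearer ${config.authToken}`;
  return {
    // Strip trailing slashes so "http://host:8080/" does not produce "//v1/...".
    url: `${config.endpoint.replace(/\/+$/, "")}/v1/chat/completions`,
    init: {
      method: "POST",
      headers,
      body: JSON.stringify({
        messages: request.messages,
        stream,
        max_tokens: request.maxTokens ?? 2048, // assumed default, from the example request
      }),
    },
  };
}

// Map the OpenAI-style response body onto the hypothetical ChatResponse shape.
function parseChatResponse(body: any): ChatResponse {
  return {
    content: body?.choices?.[0]?.message?.content ?? "",
    promptTokens: body?.usage?.prompt_tokens ?? 0,
    completionTokens: body?.usage?.completion_tokens ?? 0,
  };
}

// Non-streaming chat: one POST, one JSON body back.
async function chat(config: LlamaCppClientConfig, request: ChatRequest): Promise<ChatResponse> {
  const { url, init } = buildChatCall(config, request, false);
  const res = await fetch(url, init);
  if (!res.ok) throw new Error(await res.text()); // pass the server's message through
  return parseChatResponse(await res.json());
}
```

Keeping request building and response parsing as pure functions means both can be unit-tested without a running llama-server.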
**Streaming:** Server-Sent Events with `data: {...}` chunks.
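
A rough sketch of how the `data: {...}` chunks might be decoded. It assumes llama-server follows the OpenAI streaming convention, where each chunk carries text in `choices[0].delta.content` and the stream ends with a `data: [DONE]` line; the class name `SseDeltaParser` is hypothetical. The parser buffers partial lines because a network chunk can end in the middle of an event.

```typescript
// Accumulates raw SSE text and returns content deltas from complete "data:" lines.
class SseDeltaParser {
  private buffer = "";
  done = false;

  // Feed one decoded network chunk; returns the text deltas it completed.
  push(chunk: string): string[] {
    this.buffer += chunk;
    const lines = this.buffer.split("\n");
    // The last element is an incomplete line (or ""); keep it for the next chunk.
    this.buffer = lines.pop() ?? "";
    const deltas: string[] = [];
    for (const raw of lines) {
      const line = raw.trim();
      if (!line.startsWith("data:")) continue; // skip blank separators and comments
      const payload = line.slice("data:".length).trim();
      if (payload === "[DONE]") {
        this.done = true; // assumed end-of-stream sentinel (OpenAI convention)
        break;
      }
      // Throws on malformed JSON; a real client may prefer to skip bad chunks.
      const delta = JSON.parse(payload)?.choices?.[0]?.delta?.content;
      if (typeof delta === "string" && delta.length > 0) deltas.push(delta);
    }
    return deltas;
  }
}
```

In `chatStream`, each `Uint8Array` read from the response body would be decoded with `TextDecoder` (using `{ stream: true }`) and passed to `push`, yielding one stream event per returned delta until `done` is set.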
### Error Handling
| Error | Message |
|-------|---------|
| Connection refused | "llama-server not running at {endpoint}" |
| Timeout (60s default) | "llama-server request timed out" |
| HTTP 4xx/5xx | Pass through server error message |
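
The table could be implemented with two small helpers along these lines. This is a sketch under stated assumptions: the function names are hypothetical, the `ECONNREFUSED` check assumes Node's built-in `fetch` (which typically reports a refused connection as a `TypeError` whose `cause` carries that code), and the HTTP error body is assumed to be either OpenAI-style JSON (`{ "error": { "message": ... } }`) or plain text.

```typescript
// Translate a low-level fetch failure into the messages from the table above.
function toLlamaCppError(err: unknown, endpoint: string): Error {
  const e = err as { name?: string; code?: string; cause?: { code?: string } } | null;
  // AbortController.abort() yields "AbortError"; AbortSignal.timeout() yields "TimeoutError".
  if (e?.name === "AbortError" || e?.name === "TimeoutError") {
    return new Error("llama-server request timed out");
  }
  // Assumption: the refused-connection code is on the error itself or on its cause.
  if (e?.code === "ECONNREFUSED" || e?.cause?.code === "ECONNREFUSED") {
    return new Error(`llama-server not running at ${endpoint}`);
  }
  return err instanceof Error ? err : new Error(String(err));
}

// Extract the server's own message from an HTTP 4xx/5xx response body.
function httpErrorMessage(status: number, bodyText: string): string {
  try {
    const message = JSON.parse(bodyText)?.error?.message;
    if (typeof message === "string") return message;
  } catch {
    // Body was not JSON; fall through to the raw text.
  }
  return bodyText || `HTTP ${status}`;
}
```

The 60-second default can be enforced by passing `signal: AbortSignal.timeout(60_000)` in the fetch options and routing any thrown error through `toLlamaCppError`.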
## Deployment
### User Responsibility: systemd user service
```ini
# ~/.config/systemd/user/llama-server.service
[Unit]
Description=llama.cpp server
After=network.target
[Service]
ExecStart=/usr/bin/llama-server -m /data/models/qwen2.5-14b-q4_k_m.gguf -ngl 99 -c 8192
Restart=on-failure
[Install]
WantedBy=default.target
```
Key flags:
- `-ngl 99` - Offload all layers to GPU (CUDA)
- `-c 8192` - Context window size
- `-m` - Path to GGUF model file

Enable with:
```bash
systemctl --user enable --now llama-server
```
### Flynn Configuration
```yaml
models:
  local:
    provider: llamacpp
    endpoint: http://localhost:8080
    # auth_token: optional
```
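
As an illustration of what the factory code might do with this entry, the sketch below maps the parsed `models.local` block onto `LlamaCppClientConfig`. The `LocalModelConfig` shape and the `toLlamaCppConfig` name are assumptions made for this example; the real schema lives in `src/config/schema.ts`, which this document does not reproduce.

```typescript
interface LlamaCppClientConfig {
  endpoint: string;
  authToken?: string;
}

// Hypothetical shape of the parsed `models.local` YAML entry.
interface LocalModelConfig {
  provider: string;
  endpoint?: string;
  auth_token?: string;
}

// Validate the YAML entry and convert snake_case keys to the client's camelCase config.
function toLlamaCppConfig(local: LocalModelConfig): LlamaCppClientConfig {
  if (local.provider !== "llamacpp") {
    throw new Error(`Unsupported local provider: ${local.provider}`);
  }
  if (!local.endpoint) {
    // endpoint is marked "Required" on LlamaCppClientConfig, so fail at startup.
    throw new Error("models.local.endpoint is required for provider 'llamacpp'");
  }
  const config: LlamaCppClientConfig = { endpoint: local.endpoint };
  if (local.auth_token) config.authToken = local.auth_token;
  return config;
}
```

The factory would then pass the result to `new LlamaCppClient(...)`, assuming the client takes its config as a constructor argument.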
## Files Changed
| File | Change |
|------|--------|
| `src/models/local/llamacpp.ts` | **New** - LlamaCppClient implementation |
| `src/models/local/index.ts` | Export LlamaCppClient |
| Router/factory code | Handle `provider: 'llamacpp'` when creating clients |
## Files Unchanged
- `src/config/schema.ts` - Provider enum already includes `llamacpp`
- `src/models/types.ts` - ModelClient interface already fits
## Testing
1. Unit tests with mocked HTTP responses
2. Integration test against running llama-server (skipped in CI)
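
For the unit tests, one framework-agnostic way to mock HTTP responses is a fake `fetch` that replays canned replies and records each call. The sketch below is illustrative only: `createMockFetch` is a hypothetical helper, and it assumes the client accepts an injectable fetch function, which this design does not specify.

```typescript
// Minimal response shape the client needs; a stand-in for the Fetch API's Response.
interface MockResponse {
  ok: boolean;
  status: number;
  json(): Promise<unknown>;
  text(): Promise<string>;
}

interface RecordedCall {
  url: string;
  body: unknown;
}

// Build a fake fetch that replays canned responses in order and records every call.
function createMockFetch(canned: Array<{ status: number; body: unknown }>) {
  const calls: RecordedCall[] = [];
  let next = 0;
  const fetchFn = (url: string, init?: { body?: string }): Promise<MockResponse> => {
    // Record the call synchronously so tests can inspect it right away.
    calls.push({ url, body: init?.body ? JSON.parse(init.body) : undefined });
    const reply = canned[next++];
    if (!reply) return Promise.reject(new Error("mock fetch: no canned response left"));
    return Promise.resolve({
      ok: reply.status >= 200 && reply.status < 300,
      status: reply.status,
      json: async () => reply.body,
      text: async () => JSON.stringify(reply.body),
    });
  };
  return { fetchFn, calls };
}
```

A test would seed the mock with the non-streaming response shown under API Details, call the client, and then assert both on the returned content and on `calls[0]` (URL, `stream` flag, `max_tokens`).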