# llama.cpp Integration Design

**Date:** 2026-02-05

**Status:** Approved

**Target Model:** Qwen 2.5 14B Q4_K_M (fits 12GB VRAM)

## Overview

Add llama.cpp support to Flynn for local LLM inference using CUDA. Flynn acts as a client connecting to a user-managed `llama-server` instance (via systemd user service).

## Architecture

```
┌─────────────────┐     HTTP/SSE      ┌──────────────────┐
│  Flynn          │ ◄───────────────► │  llama-server    │
│  LlamaCppClient │   localhost:8080  │  (systemd user)  │
└─────────────────┘                   └──────────────────┘
                                               │
                                               ▼
                                      ┌──────────────────┐
                                      │  CUDA / GPU      │
                                      │  qwen2.5-14b.gguf│
                                      └──────────────────┘
```

## Integration Approach

**Dedicated LlamaCppClient** - Separate from OpenAI client despite API similarity because:

- `llamacpp` is a distinct provider in config schema
- Keeps cloud vs local concerns separate
- Room for llama.cpp-specific error handling

## LlamaCppClient Implementation

### File: `src/models/local/llamacpp.ts`

```typescript
export interface LlamaCppClientConfig {
  endpoint: string;      // Required, e.g., "http://localhost:8080"
  authToken?: string;    // Optional Bearer token
}

export class LlamaCppClient implements ModelClient {
  chat(request: ChatRequest): Promise
  chatStream(request: ChatRequest): AsyncIterable
}
```

### API Details

Calls llama-server's OpenAI-compatible `/v1/chat/completions` endpoint:

**Request:**

```json
{
  "messages": [{ "role": "user", "content": "Hello" }],
  "stream": true,
  "max_tokens": 2048
}
```

**Response (non-streaming):**

```json
{
  "choices": [{ "message": { "content": "Hi there!" } }],
  "usage": { "prompt_tokens": 5, "completion_tokens": 10 }
}
```

**Streaming:** Server-Sent Events with `data: {...}` chunks.

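
The streaming format can be handled with a small line-oriented parser. The sketch below is one possible approach, not the committed implementation. It assumes llama-server emits OpenAI-style chunks (content under `choices[0].delta.content`) and ends the stream with `data: [DONE]`. `parseSseBuffer` and `SseParseResult` are hypothetical names introduced here for illustration.

```typescript
// Hypothetical helper: not part of Flynn or llama.cpp.
interface SseParseResult {
  deltas: string[]; // content fragments found in complete lines
  rest: string;     // trailing partial line to prepend to the next read
  done: boolean;    // true once "data: [DONE]" was seen
}

function parseSseBuffer(buffer: string): SseParseResult {
  const lines = buffer.split("\n");
  // The last element is "" (buffer ended on a newline) or a partial line.
  const rest = lines.pop() ?? "";
  const deltas: string[] = [];
  let done = false;

  for (const raw of lines) {
    const line = raw.trim(); // also strips a trailing "\r"
    if (!line.startsWith("data:")) continue; // skip blank separators and comments
    const payload = line.slice("data:".length).trim();
    if (payload === "[DONE]") {
      done = true;
      break;
    }
    // Assumed chunk shape: { choices: [{ delta: { content?: string } }] }
    const chunk = JSON.parse(payload) as {
      choices?: { delta?: { content?: string } }[];
    };
    const content = chunk.choices?.[0]?.delta?.content;
    if (content) deltas.push(content);
  }
  return { deltas, rest, done };
}
```

Inside `chatStream`, each decoded network chunk would be appended to the previous `rest` and passed through this function again, so a JSON payload split across two reads is only parsed once its line is complete.
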
### Error Handling

| Error | Message |
|-------|---------|
| Connection refused | "llama-server not running at {endpoint}" |
| Timeout (60s default) | "llama-server request timed out" |
| HTTP 4xx/5xx | Pass through server error message |

## Deployment

### User Responsibility: systemd user service

```ini
# ~/.config/systemd/user/llama-server.service
[Unit]
Description=llama.cpp server
After=network.target

[Service]
ExecStart=/usr/bin/llama-server -m /data/models/qwen2.5-14b-q4_k_m.gguf -ngl 99 -c 8192
Restart=on-failure

[Install]
WantedBy=default.target
```

Key flags:

- `-ngl 99` - Offload all layers to GPU (CUDA)
- `-c 8192` - Context window size
- `-m` - Path to GGUF model file

Enable with:

```bash
systemctl --user enable --now llama-server
```

### Flynn Configuration

```yaml
models:
  local:
    provider: llamacpp
    endpoint: http://localhost:8080
    # auth_token: optional
```

## Files Changed

| File | Change |
|------|--------|
| `src/models/local/llamacpp.ts` | **New** - LlamaCppClient implementation |
| `src/models/local/index.ts` | Export LlamaCppClient |
| Router/factory code | Handle `provider: 'llamacpp'` when creating clients |

## Files Unchanged

- `src/config/schema.ts` - Already has `llamacpp` provider enum
- `src/models/types.ts` - ModelClient interface already fits

## Testing

1. Unit tests with mocked HTTP responses
2. Integration test against running llama-server (skipped in CI)
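
For the unit tests in item 1, the Error Handling table is easiest to cover if the mapping from failure to message lives in a pure function that tests can call without a network. The following is a minimal sketch under that assumption. `LlamaFailure` and `llamaErrorMessage` are hypothetical names, and the fallback text for an empty HTTP error body is an illustrative choice that the design does not specify.

```typescript
// Hypothetical failure description; these names are illustrative only.
type LlamaFailure =
  | { kind: "connection_refused" }
  | { kind: "timeout" }
  | { kind: "http"; status: number; body: string };

function llamaErrorMessage(failure: LlamaFailure, endpoint: string): string {
  switch (failure.kind) {
    case "connection_refused":
      return `llama-server not running at ${endpoint}`;
    case "timeout":
      return "llama-server request timed out";
    case "http":
      // Pass the server's message through. The fallback for an empty body
      // is an assumption, not part of the design.
      return failure.body || `llama-server returned HTTP ${failure.status}`;
  }
}
```

A mocked-HTTP test would then translate the simulated condition (a rejected connection, an aborted request, or a 4xx/5xx response) into a `LlamaFailure` and assert on the returned string.
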