Design for adding LlamaCppClient to support local LLM inference via llama-server with CUDA. Target model: Qwen 2.5 14B Q4_K_M. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
llama.cpp Integration Design
Date: 2026-02-05
Status: Approved
Target Model: Qwen 2.5 14B Q4_K_M (fits 12GB VRAM)
Overview
Add llama.cpp support to Flynn for local LLM inference using CUDA. Flynn acts only as a client: it connects to a llama-server instance that the user manages themselves, run as a systemd user service.
Architecture
┌─────────────────┐     HTTP/SSE      ┌──────────────────┐
│     Flynn       │ ◄───────────────► │   llama-server   │
│ LlamaCppClient  │  localhost:8080   │  (systemd user)  │
└─────────────────┘                   └──────────────────┘
                                               │
                                               ▼
                                      ┌──────────────────┐
                                      │    CUDA / GPU    │
                                      │ qwen2.5-14b.gguf │
                                      └──────────────────┘
Integration Approach
Dedicated LlamaCppClient - Separate from OpenAI client despite API similarity because:
- `llamacpp` is a distinct provider in config schema
- Keeps cloud vs local concerns separate
- Room for llama.cpp-specific error handling
LlamaCppClient Implementation
File: src/models/local/llamacpp.ts
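export interface LlamaCppClientConfig {
  endpoint: string;   // Required, e.g., "http://localhost:8080"
  authToken?: string; // Optional Bearer token
}
// Signatures only; method bodies are omitted in this design.
export class LlamaCppClient implements ModelClient {
  // Single request/response round trip.
  chat(request: ChatRequest): Promise<ChatResponse>;
  // Incremental events parsed from the server's SSE stream.
  chatStream(request: ChatRequest): AsyncIterable<ChatStreamEvent>;
}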
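As a rough illustration of how the non-streaming path of this interface might be filled in, the sketch below builds the URL and headers from the config and posts to /v1/chat/completions. The ChatRequest and ChatResponse shapes, the FetchLike type, and the class name LlamaCppClientSketch are assumptions made for this example; the real types live in src/models/types.ts and may differ.

```typescript
// Hypothetical minimal shapes; the real definitions live in src/models/types.ts.
interface ChatMessage { role: "system" | "user" | "assistant"; content: string }
interface ChatRequest { messages: ChatMessage[]; maxTokens?: number }
interface ChatResponse { content: string }

interface LlamaCppClientConfig {
  endpoint: string;   // Required, e.g., "http://localhost:8080"
  authToken?: string; // Optional Bearer token
}

// Assumed injection point: a narrow fetch-like function, so tests can pass a stub.
type FetchLike = (
  url: string,
  init: { method: string; headers: Record<string, string>; body: string },
) => Promise<{ ok: boolean; status: number; text(): Promise<string> }>;

class LlamaCppClientSketch {
  private readonly config: LlamaCppClientConfig;
  private readonly fetchFn: FetchLike;

  constructor(config: LlamaCppClientConfig, fetchFn: FetchLike) {
    this.config = config;
    this.fetchFn = fetchFn;
  }

  async chat(request: ChatRequest): Promise<ChatResponse> {
    const headers: Record<string, string> = { "Content-Type": "application/json" };
    if (this.config.authToken) {
      headers["Authorization"] = `Bearer ${this.config.authToken}`;
    }
    // Strip trailing slashes so "http://host:8080/" does not produce "//v1/...".
    const url = `${this.config.endpoint.replace(/\/+$/, "")}/v1/chat/completions`;
    const res = await this.fetchFn(url, {
      method: "POST",
      headers,
      body: JSON.stringify({
        messages: request.messages,
        stream: false,
        max_tokens: request.maxTokens ?? 2048,
      }),
    });
    const text = await res.text();
    if (!res.ok) {
      // Surface the server's own error text rather than replacing it.
      throw new Error(`llama-server returned HTTP ${res.status}: ${text}`);
    }
    const json = JSON.parse(text);
    return { content: json.choices?.[0]?.message?.content ?? "" };
  }
}
```

Taking the fetch function as a constructor argument is a design choice for this sketch only: it keeps the class testable without a running server. In production the global fetch available in recent Node versions should be usable in its place.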
API Details
Calls llama-server's OpenAI-compatible /v1/chat/completions endpoint:
Request (streaming; set "stream" to false for a single JSON response):
{
  "messages": [{ "role": "user", "content": "Hello" }],
  "stream": true,
  "max_tokens": 2048
}
Response (non-streaming):
{
  "choices": [{ "message": { "content": "Hi there!" } }],
  "usage": { "prompt_tokens": 5, "completion_tokens": 10 }
}
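One way to turn a non-streaming body like the one above into Flynn's response type is to validate it field by field instead of trusting the JSON. The sketch below does that; the ChatResponse shape shown here (content plus optional camelCase usage counts) and the function name toChatResponse are assumptions, not the actual definitions in src/models/types.ts.

```typescript
// Hypothetical shape of Flynn's ChatResponse; the real one is in src/models/types.ts.
interface ChatResponse {
  content: string;
  usage?: { promptTokens: number; completionTokens: number };
}

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null;
}

// Convert a parsed /v1/chat/completions body into the (assumed) ChatResponse.
function toChatResponse(body: unknown): ChatResponse {
  if (!isRecord(body)) {
    throw new Error("llama-server response is not a JSON object");
  }
  // Only the first choice is used; anything missing along the path yields undefined.
  const choices = body.choices;
  const first: unknown = Array.isArray(choices) ? choices[0] : undefined;
  const message = isRecord(first) ? first.message : undefined;
  const content = isRecord(message) ? message.content : undefined;
  if (typeof content !== "string") {
    throw new Error("llama-server response has no message content");
  }

  const response: ChatResponse = { content };
  // Usage is treated as optional: it is copied only when both counts are numbers.
  const usage = body.usage;
  if (isRecord(usage)) {
    const promptTokens = usage.prompt_tokens;
    const completionTokens = usage.completion_tokens;
    if (typeof promptTokens === "number" && typeof completionTokens === "number") {
      response.usage = { promptTokens, completionTokens };
    }
  }
  return response;
}
```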
Streaming: Server-Sent Events with data: {...} chunks.
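Because network chunks do not line up with SSE line boundaries, chatStream needs to buffer partial lines before parsing them. The sketch below shows one way to do that. It assumes the OpenAI-style streaming format, where each chunk carries its text at choices[0].delta.content and the stream ends with a data: [DONE] line; the class name SseDeltaParser is hypothetical.

```typescript
// Incremental parser for "data: {...}" lines arriving in arbitrary chunks.
class SseDeltaParser {
  private buffer = "";
  done = false;

  // Feed one decoded network chunk; returns the content deltas it completed.
  push(chunk: string): string[] {
    const deltas: string[] = [];
    this.buffer += chunk;
    const lines = this.buffer.split(/\r?\n/);
    // The last element is an unfinished line (or ""); keep it for the next chunk.
    this.buffer = lines.pop() ?? "";

    for (const line of lines) {
      // Skip blank separators, comments, and non-data fields.
      if (this.done || !line.startsWith("data:")) continue;
      const payload = line.slice("data:".length).trim();
      if (payload === "[DONE]") {
        this.done = true;
        continue;
      }
      const event = JSON.parse(payload);
      // Assumed OpenAI-style chunk shape: choices[0].delta.content.
      const text = event.choices?.[0]?.delta?.content;
      if (typeof text === "string" && text.length > 0) {
        deltas.push(text);
      }
    }
    return deltas;
  }
}
```

A chatStream implementation could decode each chunk of the response body with a TextDecoder, pass it to push, and yield one ChatStreamEvent per returned delta until done becomes true.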
Error Handling
| Error | Message |
|---|---|
| Connection refused | "llama-server not running at {endpoint}" |
| Timeout (60s default) | "llama-server request timed out" |
| HTTP 4xx/5xx | Pass through server error message |
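The table above could be implemented as a single mapping function, sketched below. Several details are assumptions rather than part of the design: the HttpFailure carrier type, the function names, the OpenAI-style error body of the form {"error": {"message": "..."}}, and the way Node's fetch reports a refused connection (a TypeError whose cause carries code "ECONNREFUSED").

```typescript
// Hypothetical carrier for a non-2xx reply; Flynn may model this differently.
interface HttpFailure { status: number; body: string }

function isHttpFailure(err: unknown): err is HttpFailure {
  if (typeof err !== "object" || err === null) return false;
  const candidate = err as { status?: unknown; body?: unknown };
  return typeof candidate.status === "number" && typeof candidate.body === "string";
}

// Pull a readable message out of an error body, falling back to the raw text.
function serverMessage(body: string): string {
  try {
    const parsed = JSON.parse(body);
    const message = parsed?.error?.message;
    if (typeof message === "string" && message.length > 0) return message;
  } catch {
    // Not JSON: fall through to the raw body.
  }
  return body;
}

// Map a failure to the user-facing message from the error table.
function describeFailure(err: unknown, endpoint: string): string {
  if (isHttpFailure(err)) return serverMessage(err.body);
  if (err instanceof Error) {
    // Aborted or timed-out requests surface under one of these names.
    if (err.name === "AbortError" || err.name === "TimeoutError") {
      return "llama-server request timed out";
    }
    // Assumption: the socket error is attached as `cause` on the thrown error.
    const cause = (err as { cause?: unknown }).cause;
    if (
      typeof cause === "object" &&
      cause !== null &&
      (cause as { code?: unknown }).code === "ECONNREFUSED"
    ) {
      return `llama-server not running at ${endpoint}`;
    }
    return err.message;
  }
  return String(err);
}
```

For the 60s default, one option is to pass AbortSignal.timeout(60_000) as the signal of the fetch call, which rejects with a TimeoutError when the limit is reached.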
Deployment
User Responsibility: systemd user service
# ~/.config/systemd/user/llama-server.service
[Unit]
Description=llama.cpp server
After=network.target
[Service]
ExecStart=/usr/bin/llama-server -m /data/models/qwen2.5-14b-q4_k_m.gguf -ngl 99 -c 8192
Restart=on-failure
[Install]
WantedBy=default.target
Key flags:
- `-ngl 99` - Offload all layers to GPU (CUDA)
- `-c 8192` - Context window size
- `-m` - Path to GGUF model file
Enable with:
systemctl --user enable --now llama-server
Flynn Configuration
models:
  local:
    provider: llamacpp
    endpoint: http://localhost:8080
    # auth_token: optional
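A minimal sketch of how one parsed entry of this config might be checked and converted into a LlamaCppClientConfig follows. The ModelEntry shape and the function names are hypothetical, the entry is assumed to have been parsed from YAML already, and the real validation presumably belongs to src/config/schema.ts.

```typescript
interface LlamaCppClientConfig {
  endpoint: string;
  authToken?: string;
}

// Hypothetical shape of one parsed `models.*` entry from the YAML config.
interface ModelEntry {
  provider: string;
  endpoint?: string;
  auth_token?: string;
}

// Returns undefined instead of throwing when the string is not a URL.
function tryParseUrl(value: string): URL | undefined {
  try {
    return new URL(value);
  } catch {
    return undefined;
  }
}

function toLlamaCppConfig(entry: ModelEntry): LlamaCppClientConfig {
  if (entry.provider !== "llamacpp") {
    throw new Error(`expected provider "llamacpp", got "${entry.provider}"`);
  }
  if (!entry.endpoint) {
    throw new Error("llamacpp provider requires an endpoint");
  }
  const url = tryParseUrl(entry.endpoint);
  // Rejects scheme-less values such as "localhost:8080".
  if (!url || (url.protocol !== "http:" && url.protocol !== "https:")) {
    throw new Error(`endpoint must be an http(s) URL: ${entry.endpoint}`);
  }
  // Normalise trailing slashes and map snake_case auth_token to camelCase authToken.
  const config: LlamaCppClientConfig = { endpoint: entry.endpoint.replace(/\/+$/, "") };
  if (entry.auth_token) {
    config.authToken = entry.auth_token;
  }
  return config;
}
```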
Files Changed
| File | Change |
|---|---|
| `src/models/local/llamacpp.ts` | New - LlamaCppClient implementation |
| `src/models/local/index.ts` | Export LlamaCppClient |
| Router/factory code | Handle provider: 'llamacpp' when creating clients |
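The router/factory change could look roughly like the sketch below, which dispatches on the provider field and uses a never check so that an unhandled provider becomes a compile-time error. Everything here is a stand-in: the ModelEntry union, the "openai" entry shape, and both stub classes are hypothetical, since this document does not show Flynn's actual factory code.

```typescript
// Hypothetical stand-ins; the real ModelClient and clients live elsewhere in Flynn.
interface ModelClient { readonly provider: string }

type ModelEntry =
  | { provider: "openai"; apiKey: string }
  | { provider: "llamacpp"; endpoint: string; auth_token?: string };

class OpenAIClientStub implements ModelClient {
  readonly provider = "openai";
  readonly apiKey: string;
  constructor(apiKey: string) {
    this.apiKey = apiKey;
  }
}

class LlamaCppClientStub implements ModelClient {
  readonly provider = "llamacpp";
  readonly endpoint: string;
  readonly authToken?: string;
  constructor(config: { endpoint: string; authToken?: string }) {
    this.endpoint = config.endpoint;
    this.authToken = config.authToken;
  }
}

function createClient(entry: ModelEntry): ModelClient {
  switch (entry.provider) {
    case "openai":
      return new OpenAIClientStub(entry.apiKey);
    case "llamacpp":
      // New branch: local inference through a user-managed llama-server.
      return new LlamaCppClientStub({
        endpoint: entry.endpoint,
        authToken: entry.auth_token,
      });
    default: {
      // Exhaustiveness check: adding a provider without a case fails to compile.
      const unreachable: never = entry;
      throw new Error(`unknown provider: ${JSON.stringify(unreachable)}`);
    }
  }
}
```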
Files Unchanged
- `src/config/schema.ts` - Already has `llamacpp` provider enum
- `src/models/types.ts` - ModelClient interface already fits
Testing
- Unit tests with mocked HTTP responses
- Integration test against running llama-server (skipped in CI)
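Both kinds of test can be supported by two small helpers, sketched below without committing to a test framework. makeStubFetch stands in for llama-server with a canned reply and records what was sent. integrationTarget decides whether the integration test should run. The helper names and the LLAMA_SERVER_URL variable are hypothetical, and the CI check assumes the CI system sets a CI environment variable.

```typescript
// Canned reply used in place of a real llama-server during unit tests.
interface StubReply { status: number; body: string }
interface RecordedCall { url: string; body: string }

// Build a fetch-like stub that records each call and returns the canned reply.
function makeStubFetch(reply: StubReply) {
  const calls: RecordedCall[] = [];
  const fetchFn = async (url: string, init: { body: string }) => {
    calls.push({ url, body: init.body });
    return {
      ok: reply.status >= 200 && reply.status < 300,
      status: reply.status,
      text: async () => reply.body,
    };
  };
  return { fetchFn, calls };
}

// Decide whether the integration test should run, and against which server.
// Returns undefined (skip) in CI or when no server URL has been provided.
function integrationTarget(env: Record<string, string | undefined>): string | undefined {
  if (env.CI) return undefined;
  return env.LLAMA_SERVER_URL;
}
```

A unit test would hand fetchFn to the client and assert on calls afterwards, while the integration test would call integrationTarget(process.env) first and skip itself when the result is undefined.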