flynn/docs/plans/2026-02-05-llamacpp-integration-design.md
William Valentin 6f5dd741a9 docs: add llama.cpp integration design
Design for adding LlamaCppClient to support local LLM inference
via llama-server with CUDA. Target model: Qwen 2.5 14B Q4_K_M.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 13:05:58 -08:00


llama.cpp Integration Design

Date: 2026-02-05
Status: Approved
Target Model: Qwen 2.5 14B Q4_K_M (fits 12GB VRAM)

Overview

Add llama.cpp support to Flynn for local LLM inference using CUDA. Flynn acts only as an HTTP client: it connects to a llama-server instance that the user runs and manages themselves (as a systemd user service) and does not start or stop the server.

Architecture

┌─────────────────┐     HTTP/SSE      ┌──────────────────┐
│     Flynn       │ ◄───────────────► │   llama-server   │
│  LlamaCppClient │   localhost:8080  │  (systemd user)  │
└─────────────────┘                   └──────────────────┘
                                              │
                                              ▼
                                      ┌──────────────────┐
                                      │  CUDA / GPU      │
                                      │  qwen2.5-14b.gguf│
                                      └──────────────────┘

Integration Approach

Dedicated LlamaCppClient - kept separate from the OpenAI client, despite the API similarity, because:

  • llamacpp is a distinct provider in config schema
  • Keeps cloud vs local concerns separate
  • Room for llama.cpp-specific error handling

LlamaCppClient Implementation

File: src/models/local/llamacpp.ts

export interface LlamaCppClientConfig {
  endpoint: string;        // Required, e.g., "http://localhost:8080"
  authToken?: string;      // Optional Bearer token
}

// Signatures only; method bodies are omitted in this design sketch.
export class LlamaCppClient implements ModelClient {
  chat(request: ChatRequest): Promise<ChatResponse>;
  chatStream(request: ChatRequest): AsyncIterable<ChatStreamEvent>;
}
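How the two config fields might translate into an HTTP request can be sketched as follows. The helper name buildChatRequest, the HttpRequest and ChatMessage shapes, and the defaults for stream and max_tokens are assumptions for illustration, not part of Flynn; only the /v1/chat/completions path (see API Details) and the Bearer token come from this design.

```typescript
export interface LlamaCppClientConfig {
  endpoint: string;
  authToken?: string;
}

// Assumed message shape; Flynn's real ChatRequest type may differ.
export interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

// Hypothetical container for the pieces of one HTTP call.
export interface HttpRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

export function buildChatRequest(
  config: LlamaCppClientConfig,
  messages: ChatMessage[],
  options: { stream?: boolean; maxTokens?: number } = {},
): HttpRequest {
  // Strip trailing slashes so "http://host:8080/" and "http://host:8080" behave the same.
  const base = config.endpoint.replace(/\/+$/, "");
  const headers: Record<string, string> = { "Content-Type": "application/json" };
  // The Authorization header is only sent when a token is configured.
  if (config.authToken) {
    headers["Authorization"] = `Bearer ${config.authToken}`;
  }
  return {
    url: `${base}/v1/chat/completions`,
    method: "POST",
    headers,
    body: JSON.stringify({
      messages,
      stream: options.stream ?? false, // assumed default
      max_tokens: options.maxTokens ?? 2048, // assumed default, taken from the sample request
    }),
  };
}
```

Keeping this step a pure function means chat() and chatStream() can share it and it can be tested without any network access.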

API Details

Calls llama-server's OpenAI-compatible /v1/chat/completions endpoint:

Request:

{
  "messages": [{ "role": "user", "content": "Hello" }],
  "stream": true,
  "max_tokens": 2048
}

Response (non-streaming, i.e. when "stream" is false):

{
  "choices": [{ "message": { "content": "Hi there!" } }],
  "usage": { "prompt_tokens": 5, "completion_tokens": 10 }
}
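A minimal sketch of mapping this payload onto a client-side result follows. The ParsedChatResponse shape and the name parseChatCompletion are hypothetical; Flynn's actual ChatResponse type in src/models/types.ts may use different field names.

```typescript
// Assumed result shape; not Flynn's real ChatResponse.
export interface ParsedChatResponse {
  content: string;
  usage?: { inputTokens: number; outputTokens: number };
}

// Only the fields shown in the sample response are modelled here.
interface ChatCompletionPayload {
  choices?: Array<{ message?: { content?: string | null } }>;
  usage?: { prompt_tokens?: number; completion_tokens?: number };
}

export function parseChatCompletion(json: unknown): ParsedChatResponse {
  const payload = (json ?? {}) as ChatCompletionPayload;
  const content = payload.choices?.[0]?.message?.content;
  if (typeof content !== "string") {
    throw new Error("llama-server response has no choices[0].message.content");
  }
  const usage = payload.usage;
  // Usage is treated as optional: it is reported only when both counters are numbers.
  if (
    usage &&
    typeof usage.prompt_tokens === "number" &&
    typeof usage.completion_tokens === "number"
  ) {
    return {
      content,
      usage: { inputTokens: usage.prompt_tokens, outputTokens: usage.completion_tokens },
    };
  }
  return { content };
}
```

Validating the shape before use turns an unexpected server payload into one clear error instead of an undefined access later on.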

Streaming: Server-Sent Events with data: {...} chunks.
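The streaming format can be sketched as a small incremental parser. This assumes the usual OpenAI-style stream, where each event carries the new text in choices[0].delta.content and the stream ends with a data: [DONE] line; neither detail is spelled out in this design, and parseSseChunk and SseParseResult are hypothetical names.

```typescript
export interface SseParseResult {
  deltas: string[]; // text fragments found in complete lines
  done: boolean; // true once "data: [DONE]" was seen
  rest: string; // incomplete trailing line, to be prepended to the next chunk
}

export function parseSseChunk(buffer: string): SseParseResult {
  const lines = buffer.split("\n");
  // The last element is either "" (buffer ended on a newline) or a partial line.
  const rest = lines.pop() ?? "";
  const deltas: string[] = [];
  let done = false;
  for (const raw of lines) {
    const line = raw.trim(); // also drops a trailing "\r"
    if (!line.startsWith("data:")) continue; // blank separators, comments
    const data = line.slice("data:".length).trim();
    if (data === "[DONE]") {
      done = true;
      break;
    }
    const event = JSON.parse(data) as {
      choices?: Array<{ delta?: { content?: string | null } }>;
    };
    const text = event.choices?.[0]?.delta?.content;
    if (typeof text === "string" && text.length > 0) deltas.push(text);
  }
  return { deltas, done, rest };
}
```

A chatStream() implementation would decode each network chunk to text, call parseSseChunk(rest + chunk), emit the deltas, and carry rest forward so that an event split across two chunks is not lost.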

Error Handling

Error                    Message
Connection refused       "llama-server not running at {endpoint}"
Timeout (60s default)    "llama-server request timed out"
HTTP 4xx/5xx             Pass through server error message
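The three cases listed above could be mapped roughly as follows. This sketch assumes Node's built-in fetch, which reports a refused connection as an error whose cause.code is "ECONNREFUSED" and an expired AbortSignal.timeout() as an error named "TimeoutError". The JSON error body shape ({"error": {"message": ...}}) is also an assumption, and both function names are hypothetical.

```typescript
// Only the fields inspected below; real error objects carry more.
interface ErrorLike {
  name?: string;
  code?: string;
  message?: string;
  cause?: { code?: string };
}

// Maps a failed fetch() call (no HTTP response at all) to a user-facing message.
export function describeTransportError(err: unknown, endpoint: string): string {
  const e = (err ?? {}) as ErrorLike;
  const code = e.code ?? e.cause?.code;
  if (code === "ECONNREFUSED") return `llama-server not running at ${endpoint}`;
  if (e.name === "TimeoutError" || e.name === "AbortError") {
    return "llama-server request timed out";
  }
  return e.message ?? String(err);
}

// Maps an HTTP 4xx/5xx response body to the server's own message where possible.
export function describeHttpError(status: number, bodyText: string): string {
  try {
    const parsed = JSON.parse(bodyText) as { error?: { message?: string } | string } | null;
    const error = parsed?.error;
    if (typeof error === "string") return error;
    if (error && typeof error.message === "string") return error.message;
  } catch {
    // Body was not JSON; fall through to the raw text.
  }
  return bodyText.trim() || `llama-server returned HTTP ${status}`;
}
```

The 60s default would be applied by the caller, for example by passing AbortSignal.timeout(60_000) as the request's signal.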

Deployment

User Responsibility: systemd user service

# ~/.config/systemd/user/llama-server.service
[Unit]
Description=llama.cpp server
After=network.target

[Service]
ExecStart=/usr/bin/llama-server -m /data/models/qwen2.5-14b-q4_k_m.gguf -ngl 99 -c 8192
Restart=on-failure

[Install]
WantedBy=default.target

Key flags:

  • -ngl 99 - Number of layers to offload to the GPU (CUDA); 99 exceeds the model's layer count, so all layers are offloaded
  • -c 8192 - Context window size
  • -m - Path to GGUF model file

Enable with:

systemctl --user enable --now llama-server

Flynn Configuration

models:
  local:
    provider: llamacpp
    endpoint: http://localhost:8080
    # auth_token: optional

Files Changed

File                          Change
src/models/local/llamacpp.ts  New - LlamaCppClient implementation
src/models/local/index.ts     Export LlamaCppClient
Router/factory code           Handle provider: 'llamacpp' when creating clients
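The router/factory change might look like the sketch below. The existing factory is not shown in this document, so createModelClient, ModelConfig, and the stand-in ModelClient and LlamaCppClient types are all hypothetical; the mapping from the YAML key auth_token to authToken is likewise an assumption.

```typescript
// Stand-ins so the example is self-contained; the real types live in src/models.
interface ModelClient {
  readonly provider: string;
}

interface LlamaCppClientConfig {
  endpoint: string;
  authToken?: string;
}

class LlamaCppClient implements ModelClient {
  readonly provider = "llamacpp";
  readonly config: LlamaCppClientConfig;
  constructor(config: LlamaCppClientConfig) {
    this.config = config;
  }
}

// Assumed shape of one entry under "models:" in the Flynn config.
export interface ModelConfig {
  provider: string;
  endpoint?: string;
  auth_token?: string;
}

export function createModelClient(config: ModelConfig): ModelClient {
  switch (config.provider) {
    case "llamacpp": {
      // endpoint is required for this provider, so fail early with a clear message.
      if (!config.endpoint) {
        throw new Error("llamacpp provider requires an endpoint");
      }
      const clientConfig: LlamaCppClientConfig = { endpoint: config.endpoint };
      if (config.auth_token) clientConfig.authToken = config.auth_token;
      return new LlamaCppClient(clientConfig);
    }
    default:
      throw new Error(`Unsupported provider: ${config.provider}`);
  }
}
```

In the real factory the llamacpp case would sit alongside the existing provider cases rather than replacing them.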

Files Unchanged

  • src/config/schema.ts - Already has llamacpp provider enum
  • src/models/types.ts - ModelClient interface already fits

Testing

  1. Unit tests with mocked HTTP responses
  2. Integration test against running llama-server (skipped in CI)
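One way to approach the mocked-HTTP unit tests is to inject the fetch function, so a test can substitute a canned response. The sketch below is illustrative only: chatOnce, mockFetch, and the FetchLike type are hypothetical and say nothing about how LlamaCppClient is actually structured or which test runner Flynn uses.

```typescript
// Minimal fetch-like signature so the example needs no DOM or Node typings.
type FetchLike = (
  url: string,
  init: { method: string; headers: Record<string, string>; body: string },
) => Promise<{ ok: boolean; status: number; json(): Promise<unknown>; text(): Promise<string> }>;

// Hypothetical function under test: one non-streaming chat call.
export async function chatOnce(
  endpoint: string,
  prompt: string,
  fetchImpl: FetchLike,
): Promise<string> {
  const res = await fetchImpl(`${endpoint}/v1/chat/completions`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ messages: [{ role: "user", content: prompt }], stream: false }),
  });
  // Pass the server's own error text through on HTTP 4xx/5xx.
  if (!res.ok) throw new Error(await res.text());
  const json = (await res.json()) as {
    choices?: Array<{ message?: { content?: string } }>;
  };
  return json.choices?.[0]?.message?.content ?? "";
}

// Test double: records each call and answers with a canned status and payload.
export function mockFetch(status: number, payload: unknown) {
  const calls: Array<{ url: string; body: string }> = [];
  const fetchImpl: FetchLike = async (url, init) => {
    calls.push({ url, body: init.body });
    return {
      ok: status >= 200 && status < 300,
      status,
      json: async () => payload,
      text: async () => (typeof payload === "string" ? payload : JSON.stringify(payload)),
    };
  };
  return { fetchImpl, calls };
}
```

A unit test would pass mockFetch(200, {...}).fetchImpl to chatOnce and assert on both the returned text and the recorded calls. The integration test would pass the real fetch and run only when a llama-server is reachable.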