docs: add llama.cpp integration design

Design for adding LlamaCppClient to support local LLM inference via llama-server with CUDA. Target model: Qwen 2.5 14B Q4_K_M.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 13:05:58 -08:00
parent f891c7aee8
commit 6f5dd741a9
1 changed file with 134 additions and 0 deletions
@@ -0,0 +1,134 @@
+# llama.cpp Integration Design
+
+**Date:** 2026-02-05
+**Status:** Approved
+**Target Model:** Qwen 2.5 14B Q4_K_M (fits in 12 GB of VRAM)
+
+## Overview
+
+Add llama.cpp support to Flynn for local LLM inference using CUDA. Flynn acts as a client that connects to a user-managed `llama-server` instance, which the user runs as a systemd user service.
+
+## Architecture
+
+```
+┌─────────────────┐     HTTP/SSE      ┌──────────────────┐
+│     Flynn       │ ◄───────────────► │   llama-server   │
+│  LlamaCppClient │   localhost:8080  │  (systemd user)  │
+└─────────────────┘                   └──────────────────┘
+                                              │
+                                              ▼
+                                      ┌──────────────────┐
+                                      │  CUDA / GPU      │
+                                      │  qwen2.5-14b.gguf│
+                                      └──────────────────┘
+```
+
+## Integration Approach
+
+**Dedicated LlamaCppClient** - Kept separate from the OpenAI client, despite the API similarity, because:
+- `llamacpp` is a distinct provider in the config schema
+- It keeps cloud and local concerns separate
+- It leaves room for llama.cpp-specific error handling
+
+## LlamaCppClient Implementation
+
+### File: `src/models/local/llamacpp.ts`
+
+```typescript
+export interface LlamaCppClientConfig {
+  endpoint: string;        // Required, e.g., "http://localhost:8080"
+  authToken?: string;      // Optional Bearer token
+}
+
+// Public surface only; method bodies are omitted in this design.
+export declare class LlamaCppClient implements ModelClient {
+  chat(request: ChatRequest): Promise<ChatResponse>;
+  chatStream(request: ChatRequest): AsyncIterable<ChatStreamEvent>;
+}
+```
+
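As a minimal sketch of how the config above might be turned into a request target, the helper below derives the URL and headers for a chat call. `buildRequestTarget` is a hypothetical name, not part of the design, and the trailing-slash handling is an assumption about how endpoints may be written in config.

```typescript
// Restated from the design above so the sketch is self-contained.
interface LlamaCppClientConfig {
  endpoint: string; // e.g. "http://localhost:8080"
  authToken?: string; // optional Bearer token
}

// Hypothetical helper: derives the URL and headers for a chat request.
function buildRequestTarget(config: LlamaCppClientConfig): {
  url: string;
  headers: Record<string, string>;
} {
  // Strip trailing slashes so "http://localhost:8080/" does not yield "//v1/...".
  const base = config.endpoint.replace(/\/+$/, "");
  const headers: Record<string, string> = {
    "Content-Type": "application/json",
  };
  // Only send Authorization when a token is configured.
  if (config.authToken) {
    headers["Authorization"] = `Bearer ${config.authToken}`;
  }
  return { url: `${base}/v1/chat/completions`, headers };
}
```

Keeping this logic in a pure function makes it easy to cover in the unit tests without a running server.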
+### API Details
+
+Calls llama-server's OpenAI-compatible `/v1/chat/completions` endpoint:
+
+**Request (streaming; set `"stream": false` to receive a single JSON response):**
+```json
+{
+  "messages": [{ "role": "user", "content": "Hello" }],
+  "stream": true,
+  "max_tokens": 2048
+}
+```
+
+**Response (non-streaming):**
+```json
+{
+  "choices": [{ "message": { "content": "Hi there!" } }],
+  "usage": { "prompt_tokens": 5, "completion_tokens": 10 }
+}
+```
+
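The mapping between Flynn's types and the wire format above could look roughly like the sketch below. The `ChatRequest` and `ChatResponse` shapes here are hypothetical stand-ins, since the real definitions in `src/models/types.ts` are not shown in this design, and `toWireRequest` / `fromWireResponse` are illustrative names only.

```typescript
// Hypothetical stand-ins for Flynn's types; the real fields may differ.
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}
interface ChatRequest {
  messages: ChatMessage[];
  maxTokens?: number;
}
interface ChatResponse {
  content: string;
  promptTokens: number;
  completionTokens: number;
}

// Wire shape of the non-streaming response shown above.
interface CompletionBody {
  choices?: { message: { content: string } }[];
  usage?: { prompt_tokens: number; completion_tokens: number };
}

// Serialize a request body for /v1/chat/completions.
function toWireRequest(request: ChatRequest, stream: boolean): string {
  return JSON.stringify({
    messages: request.messages,
    stream,
    max_tokens: request.maxTokens ?? 2048, // 2048 mirrors the example request
  });
}

// Parse a non-streaming response body into the assumed ChatResponse shape.
function fromWireResponse(json: string): ChatResponse {
  const body = JSON.parse(json) as CompletionBody;
  const first = body.choices?.[0];
  if (first === undefined) {
    throw new Error("llama-server returned no choices");
  }
  return {
    content: first.message.content,
    // Treat missing usage as zero rather than failing the request.
    promptTokens: body.usage?.prompt_tokens ?? 0,
    completionTokens: body.usage?.completion_tokens ?? 0,
  };
}
```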
+**Streaming:** With `"stream": true`, the server responds with Server-Sent Events; each event is a `data: {...}` line carrying one JSON chunk.
+
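One way `chatStream` might consume the event stream is with an incremental parser like the sketch below. It assumes the OpenAI-style chunk layout, where text arrives in `choices[0].delta.content` and the stream ends with a `data: [DONE]` line; neither detail is spelled out in this design, and `SseDeltaParser` is a hypothetical name.

```typescript
// Incremental SSE parser. Feed it decoded text as it arrives; it returns the
// content deltas completed so far and holds back any partial trailing line.
class SseDeltaParser {
  private buffer = "";
  done = false;

  push(text: string): string[] {
    this.buffer += text;
    const lines = this.buffer.split("\n");
    // The last element is an incomplete line (or ""), so keep it for later.
    this.buffer = lines.pop() ?? "";
    const deltas: string[] = [];
    for (const raw of lines) {
      const line = raw.trim(); // also removes a trailing "\r"
      if (!line.startsWith("data:")) continue; // skip blank lines and comments
      const payload = line.slice("data:".length).trim();
      if (payload === "[DONE]") {
        // Assumed end-of-stream sentinel (OpenAI convention).
        this.done = true;
        break;
      }
      const chunk = JSON.parse(payload) as {
        choices?: { delta?: { content?: string | null } }[];
      };
      const content = chunk.choices?.[0]?.delta?.content;
      if (content) deltas.push(content);
    }
    return deltas;
  }
}
```

Buffering the trailing partial line matters because network reads can split an event anywhere, including in the middle of a JSON payload.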
+### Error Handling
+
+| Error | Message |
+|-------|---------|
+| Connection refused | "llama-server not running at {endpoint}" |
+| Timeout (60s default) | "llama-server request timed out" |
+| HTTP 4xx/5xx | Pass through server error message |
+
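The table above can be sketched as a small mapping function. The `Failure` type is hypothetical: how the client detects each case (for example, a refused connection or an expired timer) is left to the implementation. The assumption that error bodies follow the OpenAI-style `{"error": {"message": ...}}` layout is also not stated in this design, so the sketch falls back to the raw body.

```typescript
// Hypothetical description of a failed request.
type Failure =
  | { kind: "connection_refused" }
  | { kind: "timeout" }
  | { kind: "http"; status: number; body: string };

// Map a failure to the user-facing message from the table above.
function describeFailure(failure: Failure, endpoint: string): string {
  switch (failure.kind) {
    case "connection_refused":
      return `llama-server not running at ${endpoint}`;
    case "timeout":
      return "llama-server request timed out";
    case "http":
      // Pass the server's message through; fall back to the status code.
      return (
        extractServerMessage(failure.body) ??
        `llama-server returned HTTP ${failure.status}`
      );
  }
}

// Pull a message out of an error body, falling back to the raw text.
function extractServerMessage(body: string): string | undefined {
  try {
    const parsed = JSON.parse(body) as { error?: { message?: unknown } } | null;
    const message = parsed?.error?.message;
    if (typeof message === "string") return message;
  } catch {
    // Not JSON; fall through to the raw body.
  }
  const trimmed = body.trim();
  return trimmed.length > 0 ? trimmed : undefined;
}
```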
+## Deployment
+
+### User Responsibility: systemd user service
+
+```ini
+# ~/.config/systemd/user/llama-server.service
+[Unit]
+Description=llama.cpp server
+After=network.target
+
+[Service]
+ExecStart=/usr/bin/llama-server -m /data/models/qwen2.5-14b-q4_k_m.gguf -ngl 99 -c 8192
+Restart=on-failure
+
+[Install]
+WantedBy=default.target
+```
+
+Key flags:
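+- `-ngl 99` - Offload up to 99 layers to the GPU (CUDA), which covers every layer of this model
+- `-c 8192` - Context window size, in tokens
+- `-m` - Path to the GGUF model file
+
+No host or port flag is needed: `llama-server` listens on `127.0.0.1:8080` by default, which matches the endpoint in the Flynn configuration.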
+
+Enable with:
+```bash
+systemctl --user enable --now llama-server
+```
+
+### Flynn Configuration
+
+```yaml
+models:
+  local:
+    provider: llamacpp
+    endpoint: http://localhost:8080
+    # auth_token: optional
+```
+
+## Files Changed
+
+| File | Change |
+|------|--------|
+| `src/models/local/llamacpp.ts` | **New** - LlamaCppClient implementation |
+| `src/models/local/index.ts` | Export LlamaCppClient |
+| Router/factory code | Handle `provider: 'llamacpp'` when creating clients |
+
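The router/factory change might look roughly like the sketch below. Everything here is an assumption for illustration: `createClient`, the `ModelConfig` shape, and the trimmed-down `ModelClient` and `LlamaCppClient` stand-ins are hypothetical, since Flynn's real factory and types are not shown in this design.

```typescript
// Hypothetical shapes standing in for src/config/schema.ts and src/models/types.ts.
interface ModelConfig {
  provider: string;
  endpoint?: string;
  auth_token?: string;
}
interface ModelClient {
  readonly provider: string;
}
interface LlamaCppClientConfig {
  endpoint: string;
  authToken?: string;
}

// Stand-in for the real client; only holds its config.
class LlamaCppClient implements ModelClient {
  readonly provider = "llamacpp";
  readonly config: LlamaCppClientConfig;
  constructor(config: LlamaCppClientConfig) {
    this.config = config;
  }
}

// Hypothetical factory: picks a client implementation by provider name.
function createClient(config: ModelConfig): ModelClient {
  switch (config.provider) {
    case "llamacpp":
      // endpoint is required for this provider, so fail early without it.
      if (!config.endpoint) {
        throw new Error("llamacpp provider requires an endpoint");
      }
      // Map the snake_case config key to the camelCase client option.
      return new LlamaCppClient({
        endpoint: config.endpoint,
        authToken: config.auth_token,
      });
    default:
      throw new Error(`Unsupported provider: ${config.provider}`);
  }
}
```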
+## Files Unchanged
+
+- `src/config/schema.ts` - Already has `llamacpp` provider enum
+- `src/models/types.ts` - ModelClient interface already fits
+
+## Testing
+
+1. Unit tests with mocked HTTP responses
+2. Integration test against running llama-server (skipped in CI)