From 6f5dd741a9e97a6a8e2e5430e74f8ba509a1ad95 Mon Sep 17 00:00:00 2001
From: William Valentin
Date: Thu, 5 Feb 2026 13:05:58 -0800
Subject: [PATCH] docs: add llama.cpp integration design

Design for adding LlamaCppClient to support local LLM inference via llama-server with CUDA. Target model: Qwen 2.5 14B Q4_K_M.

Co-Authored-By: Claude Opus 4.5
---
 .../2026-02-05-llamacpp-integration-design.md | 134 ++++++++++++++++++
 1 file changed, 134 insertions(+)
 create mode 100644 docs/plans/2026-02-05-llamacpp-integration-design.md

diff --git a/docs/plans/2026-02-05-llamacpp-integration-design.md b/docs/plans/2026-02-05-llamacpp-integration-design.md
new file mode 100644
index 0000000..c70d244
--- /dev/null
+++ b/docs/plans/2026-02-05-llamacpp-integration-design.md
@@ -0,0 +1,134 @@
+# llama.cpp Integration Design
+
+**Date:** 2026-02-05
+**Status:** Approved
+**Target Model:** Qwen 2.5 14B Q4_K_M (fits 12GB VRAM)
+
+## Overview
+
+Add llama.cpp support to Flynn for local LLM inference using CUDA. Flynn acts as a client connecting to a user-managed `llama-server` instance (via systemd user service).
+
+## Architecture
+
+```
+┌─────────────────┐     HTTP/SSE      ┌──────────────────┐
+│     Flynn       │ ◄───────────────► │  llama-server    │
+│ LlamaCppClient  │  localhost:8080   │  (systemd user)  │
+└─────────────────┘                   └──────────────────┘
+                                               │
+                                               ▼
+                                      ┌──────────────────┐
+                                      │   CUDA / GPU     │
+                                      │  qwen2.5-14b.gguf│
+                                      └──────────────────┘
+```
+
+## Integration Approach
+
+**Dedicated LlamaCppClient** - Separate from OpenAI client despite API similarity because:
+- `llamacpp` is a distinct provider in config schema
+- Keeps cloud vs local concerns separate
+- Room for llama.cpp-specific error handling
+
+## LlamaCppClient Implementation
+
+### File: `src/models/local/llamacpp.ts`
+
+```typescript
+export interface LlamaCppClientConfig {
+  endpoint: string;     // Required, e.g., "http://localhost:8080"
+  authToken?: string;   // Optional Bearer token
+}
+
+export class LlamaCppClient implements ModelClient {
+  chat(request: ChatRequest): Promise
+  chatStream(request: ChatRequest): AsyncIterable
+}
+```
+
+### API Details
+
+Calls llama-server's OpenAI-compatible `/v1/chat/completions` endpoint:
+
+**Request:**
+```json
+{
+  "messages": [{ "role": "user", "content": "Hello" }],
+  "stream": true,
+  "max_tokens": 2048
+}
+```
+
+**Response (non-streaming):**
+```json
+{
+  "choices": [{ "message": { "content": "Hi there!" } }],
+  "usage": { "prompt_tokens": 5, "completion_tokens": 10 }
+}
+```
+
+**Streaming:** Server-Sent Events with `data: {...}` chunks.
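The streaming path is only described in one line above, so here is a minimal sketch of how the `data: {...}` chunks might be decoded. It is not taken from the design: `parseSseBuffer` and `SseParseResult` are hypothetical names, and the sketch assumes llama-server follows the OpenAI streaming shape (text under `choices[0].delta.content`, with a final `data: [DONE]` sentinel).

```typescript
// Hypothetical helper, not part of the design above.
// Assumption: OpenAI-style streaming chunks, i.e. `data: {json}` lines with
// text under choices[0].delta.content and a final `data: [DONE]` sentinel.
interface SseParseResult {
  deltas: string[]; // content fragments decoded from complete lines
  done: boolean;    // true once the [DONE] sentinel has been seen
  rest: string;     // trailing partial line to prepend to the next read
}

function parseSseBuffer(buffer: string): SseParseResult {
  const lines = buffer.split("\n");
  // The last element is "" if the buffer ended on a newline, else a partial line.
  const rest = lines.pop() ?? "";
  const deltas: string[] = [];
  let done = false;

  for (const raw of lines) {
    const line = raw.trim(); // also drops a trailing "\r"
    if (!line.startsWith("data:")) continue; // skip blank separators and comments
    const payload = line.slice("data:".length).trim();
    if (payload === "[DONE]") {
      done = true;
      break;
    }
    const chunk = JSON.parse(payload);
    const content = chunk?.choices?.[0]?.delta?.content;
    if (typeof content === "string") deltas.push(content);
  }
  return { deltas, done, rest };
}
```

A `chatStream` implementation could call this on each decoded network read, yield the `deltas`, and carry `rest` into the next read so that a JSON payload split across reads is never parsed half-way.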
+
+### Error Handling
+
+| Error | Message |
+|-------|---------|
+| Connection refused | "llama-server not running at {endpoint}" |
+| Timeout (60s default) | "llama-server request timed out" |
+| HTTP 4xx/5xx | Pass through server error message |
+
+## Deployment
+
+### User Responsibility: systemd user service
+
+```ini
+# ~/.config/systemd/user/llama-server.service
+[Unit]
+Description=llama.cpp server
+After=network.target
+
+[Service]
+ExecStart=/usr/bin/llama-server -m /data/models/qwen2.5-14b-q4_k_m.gguf -ngl 99 -c 8192
+Restart=on-failure
+
+[Install]
+WantedBy=default.target
+```
+
+Key flags:
+- `-ngl 99` - Offload all layers to GPU (CUDA)
+- `-c 8192` - Context window size
+- `-m` - Path to GGUF model file
+
+Enable with:
+```bash
+systemctl --user enable --now llama-server
+```
+
+### Flynn Configuration
+
+```yaml
+models:
+  local:
+    provider: llamacpp
+    endpoint: http://localhost:8080
+    # auth_token: optional
+```
+
+## Files Changed
+
+| File | Change |
+|------|--------|
+| `src/models/local/llamacpp.ts` | **New** - LlamaCppClient implementation |
+| `src/models/local/index.ts` | Export LlamaCppClient |
+| Router/factory code | Handle `provider: 'llamacpp'` when creating clients |
+
+## Files Unchanged
+
+- `src/config/schema.ts` - Already has `llamacpp` provider enum
+- `src/models/types.ts` - ModelClient interface already fits
+
+## Testing
+
+1. Unit tests with mocked HTTP responses
+2. Integration test against running llama-server (skipped in CI)
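One way to make the Error Handling table unit-testable without a server is to reduce it to pure functions, sketched below. The names `describeFailure` and `describeHttpError` are hypothetical, and the detection rules are assumptions rather than anything the design specifies: that Node's `fetch` reports a refused connection as `error.cause.code === "ECONNREFUSED"`, that a timeout created with `AbortSignal.timeout()` rejects with an error named `"TimeoutError"`, and that error bodies use an OpenAI-style `{"error": {"message": ...}}` shape.

```typescript
// Hypothetical helpers mapping failures to the messages in the Error Handling table.
// Assumptions (not confirmed by the design): Node's fetch exposes ECONNREFUSED
// as error.cause.code, and AbortSignal.timeout() rejects with name "TimeoutError".
function describeFailure(err: unknown, endpoint: string): string {
  const e = err as { name?: string; message?: string; cause?: { code?: string } } | null;
  if (e?.cause?.code === "ECONNREFUSED") {
    return `llama-server not running at ${endpoint}`;
  }
  if (e?.name === "TimeoutError") {
    return "llama-server request timed out";
  }
  // Anything else: surface the original message unchanged.
  return e?.message ?? String(err);
}

// HTTP 4xx/5xx: pass the server's own message through when the body carries one.
// Assumption: error bodies look like {"error": {"message": "..."}}.
function describeHttpError(status: number, body: string): string {
  try {
    const parsed = JSON.parse(body);
    const message = parsed?.error?.message;
    if (typeof message === "string") return message;
  } catch {
    // Body was not JSON; fall through to the raw text.
  }
  return body || `HTTP ${status}`;
}
```

With this split, the mocked-HTTP unit tests only need to feed plain objects and strings to these functions, while the integration test covers the real `fetch` behaviour that the assumptions above depend on.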