Design for adding LlamaCppClient to support local LLM inference via llama-server with CUDA. Target model: Qwen 2.5 14B Q4_K_M. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
# llama.cpp Integration Design

**Date:** 2026-02-05
**Status:** Approved
**Target Model:** Qwen 2.5 14B Q4_K_M (fits 12GB VRAM)

## Overview

Add llama.cpp support to Flynn for local LLM inference using CUDA. Flynn acts as a client connecting to a user-managed `llama-server` instance (via systemd user service).

## Architecture

```
┌─────────────────┐     HTTP/SSE      ┌──────────────────┐
│     Flynn       │ ◄───────────────► │   llama-server   │
│  LlamaCppClient │   localhost:8080  │  (systemd user)  │
└─────────────────┘                   └──────────────────┘
                                               │
                                               ▼
                                      ┌──────────────────┐
                                      │   CUDA / GPU     │
                                      │ qwen2.5-14b.gguf │
                                      └──────────────────┘
```

## Integration Approach

**Dedicated LlamaCppClient** - Separate from the OpenAI client despite API similarity because:
- `llamacpp` is a distinct provider in the config schema
- Keeps cloud vs local concerns separate
- Leaves room for llama.cpp-specific error handling

## LlamaCppClient Implementation

### File: `src/models/local/llamacpp.ts`

```typescript
// ModelClient lives in src/models/types.ts; the request/response types are
// assumed to be exported from the same module.
import type { ChatRequest, ChatResponse, ChatStreamEvent, ModelClient } from "../types";

export interface LlamaCppClientConfig {
  endpoint: string;   // Required, e.g., "http://localhost:8080"
  authToken?: string; // Optional Bearer token
}

// Signatures only; method bodies are omitted in this design.
export class LlamaCppClient implements ModelClient {
  chat(request: ChatRequest): Promise<ChatResponse>;
  chatStream(request: ChatRequest): AsyncIterable<ChatStreamEvent>;
}
```

### API Details

Calls llama-server's OpenAI-compatible `/v1/chat/completions` endpoint:

**Request:**
```json
{
  "messages": [{ "role": "user", "content": "Hello" }],
  "stream": true,
  "max_tokens": 2048
}
```

**Response (non-streaming, i.e. `"stream": false`):**
```json
{
  "choices": [{ "message": { "content": "Hi there!" } }],
  "usage": { "prompt_tokens": 5, "completion_tokens": 10 }
}
```

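The request and non-streaming response shapes above could be mapped along the following lines. This is a minimal sketch, not Flynn's actual code: `SketchMessage`, `SketchChatResponse`, `buildRequestBody`, and `parseChatResponse` are hypothetical names standing in for the real `ChatRequest`/`ChatResponse` handling, which this document does not show.

```typescript
// Hypothetical local types; Flynn's real types in src/models/types.ts may differ.
interface SketchMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

interface SketchChatResponse {
  content: string;
  promptTokens: number;
  completionTokens: number;
}

// Serialize the JSON body for POST /v1/chat/completions.
function buildRequestBody(messages: SketchMessage[], stream: boolean, maxTokens = 2048): string {
  return JSON.stringify({ messages, stream, max_tokens: maxTokens });
}

// Extract the first choice and the token usage from a non-streaming response.
function parseChatResponse(raw: string): SketchChatResponse {
  const data = JSON.parse(raw) as {
    choices?: { message?: { content?: string } }[];
    usage?: { prompt_tokens?: number; completion_tokens?: number };
  };
  const content = data.choices?.[0]?.message?.content;
  if (typeof content !== "string") {
    throw new Error("llama-server response has no message content");
  }
  return {
    content,
    promptTokens: data.usage?.prompt_tokens ?? 0,
    completionTokens: data.usage?.completion_tokens ?? 0,
  };
}
```

Keeping both steps as pure functions means they can be unit tested without a running server.
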
**Streaming:** Server-Sent Events with `data: {...}` chunks.

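A rough sketch of how the `data: {...}` chunks might be decoded. It assumes the OpenAI-style streaming format, where each chunk carries text in `choices[0].delta.content` and the stream ends with `data: [DONE]`; this document does not spell out that detail, so treat it as an assumption. `parseSseBuffer` and `SseParseResult` are hypothetical names.

```typescript
interface SseParseResult {
  deltas: string[]; // text fragments found in complete lines
  rest: string;     // trailing partial line to prepend to the next network read
  done: boolean;    // true once the "[DONE]" terminator was seen
}

// Parse whatever text has arrived so far; incomplete trailing data is returned in `rest`.
function parseSseBuffer(buffer: string): SseParseResult {
  const deltas: string[] = [];
  let done = false;
  const lines = buffer.split("\n");
  // The last element is either empty or an incomplete line; keep it for later.
  const rest = lines.pop() ?? "";
  for (const rawLine of lines) {
    const line = rawLine.trim();
    if (!line.startsWith("data:")) continue; // skip blank separators and other fields
    const payload = line.slice("data:".length).trim();
    if (payload === "[DONE]") {
      done = true;
      break;
    }
    const chunk = JSON.parse(payload) as { choices?: { delta?: { content?: string } }[] };
    const text = chunk.choices?.[0]?.delta?.content;
    if (text) deltas.push(text);
  }
  return { deltas, rest, done };
}
```

A `chatStream` implementation could call this after every network read, yield each entry of `deltas`, and carry `rest` over to the next read.
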
### Error Handling

| Error | Message |
|-------|---------|
| Connection refused | "llama-server not running at {endpoint}" |
| Timeout (60s default) | "llama-server request timed out" |
| HTTP 4xx/5xx | Pass through server error message |

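The mapping in the table above might be implemented roughly as follows. This sketch assumes Node's built-in `fetch`, which reports a refused connection through `cause.code === "ECONNREFUSED"` and a timed-out `AbortSignal.timeout()` as an error named `TimeoutError`. The `{ "error": { "message": ... } }` body shape is also an assumption about how the server reports failures. The function names are hypothetical.

```typescript
// Timeout default taken from the table above.
const DEFAULT_TIMEOUT_MS = 60_000;

// Translate a thrown network error into the user-facing message.
function describeNetworkError(err: unknown, endpoint: string): string {
  const e = (err ?? {}) as {
    name?: string;
    code?: string;
    message?: string;
    cause?: { code?: string };
  };
  const code = e.cause?.code ?? e.code;
  if (code === "ECONNREFUSED") return `llama-server not running at ${endpoint}`;
  if (e.name === "TimeoutError" || e.name === "AbortError") return "llama-server request timed out";
  return e.message ?? String(err);
}

// Pass the server's own message through for HTTP 4xx/5xx responses.
function describeHttpError(status: number, bodyText: string): string {
  try {
    const parsed = JSON.parse(bodyText) as { error?: { message?: string } | string };
    if (typeof parsed.error === "string") return parsed.error;
    if (parsed.error?.message) return parsed.error.message;
  } catch {
    // Body was not JSON; fall through to the raw text.
  }
  return bodyText.trim() || `llama-server returned HTTP ${status}`;
}
```

`DEFAULT_TIMEOUT_MS` would typically be passed to `AbortSignal.timeout()` when issuing the request.
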
## Deployment

### User Responsibility: systemd user service

```ini
# ~/.config/systemd/user/llama-server.service
[Unit]
Description=llama.cpp server
After=network.target

[Service]
ExecStart=/usr/bin/llama-server -m /data/models/qwen2.5-14b-q4_k_m.gguf -ngl 99 -c 8192
Restart=on-failure

[Install]
WantedBy=default.target
```

Key flags:
- `-ngl 99` - Offload all layers to GPU (CUDA)
- `-c 8192` - Context window size
- `-m` - Path to GGUF model file

Enable with:
```bash
systemctl --user enable --now llama-server
```

### Flynn Configuration

```yaml
models:
  local:
    provider: llamacpp
    endpoint: http://localhost:8080
    # auth_token: optional
```

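One way the YAML entry above could be turned into a `LlamaCppClientConfig` and request headers is sketched below. `LocalModelYaml`, `toClientConfig`, and `buildHeaders` are hypothetical names, and the validation rules (http/https scheme check, trailing-slash trimming) are illustrative assumptions rather than part of this design.

```typescript
interface LlamaCppClientConfig {
  endpoint: string;
  authToken?: string;
}

// Hypothetical shape of the parsed `models.local` YAML entry.
interface LocalModelYaml {
  provider: string;
  endpoint?: string;
  auth_token?: string;
}

// Convert the snake_case YAML entry into the camelCase client config.
function toClientConfig(entry: LocalModelYaml): LlamaCppClientConfig {
  if (entry.provider !== "llamacpp") {
    throw new Error(`unexpected provider: ${entry.provider}`);
  }
  if (!entry.endpoint || !/^https?:\/\//.test(entry.endpoint)) {
    throw new Error("llamacpp provider requires an http(s) endpoint");
  }
  // Drop trailing slashes so paths can be appended without doubling them.
  const endpoint = entry.endpoint.replace(/\/+$/, "");
  return entry.auth_token ? { endpoint, authToken: entry.auth_token } : { endpoint };
}

// Only send an Authorization header when a token is configured.
function buildHeaders(config: LlamaCppClientConfig): Record<string, string> {
  const headers: Record<string, string> = { "Content-Type": "application/json" };
  if (config.authToken) headers["Authorization"] = `Bearer ${config.authToken}`;
  return headers;
}
```
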
## Files Changed

| File | Change |
|------|--------|
| `src/models/local/llamacpp.ts` | **New** - LlamaCppClient implementation |
| `src/models/local/index.ts` | Export LlamaCppClient |
| Router/factory code | Handle `provider: 'llamacpp'` when creating clients |

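The router/factory change might look something like the sketch below. Flynn's actual factory code is not shown in this document, so every name here (`ModelClientSketch`, `LlamaCppClientSketch`, `ModelEntry`, `createClient`) is a hypothetical stand-in used only to illustrate the dispatch on `provider`.

```typescript
// Stand-in for the real ModelClient interface.
interface ModelClientSketch {
  readonly provider: string;
}

// Stand-in for LlamaCppClient; the real class would also implement chat/chatStream.
class LlamaCppClientSketch implements ModelClientSketch {
  readonly provider = "llamacpp";
  readonly endpoint: string;
  readonly authToken: string | undefined;

  constructor(endpoint: string, authToken?: string) {
    this.endpoint = endpoint;
    this.authToken = authToken;
  }
}

// Hypothetical shape of one model entry from the config file.
interface ModelEntry {
  provider: string;
  endpoint?: string;
  auth_token?: string;
}

// Dispatch on the provider name; other providers would get their own cases.
function createClient(entry: ModelEntry): ModelClientSketch {
  switch (entry.provider) {
    case "llamacpp":
      if (!entry.endpoint) throw new Error("llamacpp provider requires an endpoint");
      return new LlamaCppClientSketch(entry.endpoint, entry.auth_token);
    default:
      throw new Error(`unsupported provider: ${entry.provider}`);
  }
}
```
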
## Files Unchanged

- `src/config/schema.ts` - Already has `llamacpp` provider enum
- `src/models/types.ts` - ModelClient interface already fits

## Testing

1. Unit tests with mocked HTTP responses
2. Integration test against running llama-server (skipped in CI)

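A mocked-HTTP unit test could be structured roughly as below: the code under test accepts an injected fetch-like function, so the test can supply a fake that returns a canned response. `FetchLike`, `chatOnce`, and `makeFakeFetch` are hypothetical names. The real `LlamaCppClient` may wire its HTTP layer differently, and no particular test framework is assumed.

```typescript
// Minimal subset of the fetch API used here, so a fake can be injected.
interface FetchResponseLike {
  ok: boolean;
  status: number;
  json(): Promise<unknown>;
}

type FetchLike = (
  url: string,
  init: { method: string; headers: Record<string, string>; body: string },
) => Promise<FetchResponseLike>;

// Send one non-streaming chat request and return the assistant's reply text.
async function chatOnce(fetchImpl: FetchLike, endpoint: string, prompt: string): Promise<string> {
  const res = await fetchImpl(`${endpoint}/v1/chat/completions`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      messages: [{ role: "user", content: prompt }],
      stream: false,
      max_tokens: 2048,
    }),
  });
  if (!res.ok) throw new Error(`llama-server returned HTTP ${res.status}`);
  const data = (await res.json()) as { choices: { message: { content: string } }[] };
  const first = data.choices[0];
  if (!first) throw new Error("llama-server response has no choices");
  return first.message.content;
}

// Fake fetch that records requested URLs and returns a canned completion.
function makeFakeFetch(calls: string[]): FetchLike {
  return async (url) => {
    calls.push(url);
    return {
      ok: true,
      status: 200,
      json: async () => ({ choices: [{ message: { content: "Hi there!" } }] }),
    };
  };
}
```

A test case would then call `chatOnce(makeFakeFetch(calls), "http://localhost:8080", "Hello")` and assert on both the returned text and the recorded URL.
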