docs: add llama.cpp integration design
Design for adding LlamaCppClient to support local LLM inference via llama-server with CUDA. Target model: Qwen 2.5 14B Q4_K_M.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@@ -0,0 +1,134 @@
# llama.cpp Integration Design

**Date:** 2026-02-05
**Status:** Approved
**Target Model:** Qwen 2.5 14B Q4_K_M (fits 12GB VRAM)

## Overview

Add llama.cpp support to Flynn for local LLM inference using CUDA. Flynn acts as a client connecting to a user-managed `llama-server` instance (via systemd user service).

## Architecture

```
┌─────────────────┐     HTTP/SSE      ┌──────────────────┐
│     Flynn       │ ◄───────────────► │  llama-server    │
│ LlamaCppClient  │  localhost:8080   │  (systemd user)  │
└─────────────────┘                   └──────────────────┘
                                               │
                                               ▼
                                      ┌──────────────────┐
                                      │   CUDA / GPU     │
                                      │ qwen2.5-14b.gguf │
                                      └──────────────────┘
```

## Integration Approach

**Dedicated LlamaCppClient** - Separate from OpenAI client despite API similarity because:
- `llamacpp` is a distinct provider in config schema
- Keeps cloud vs local concerns separate
- Room for llama.cpp-specific error handling

## LlamaCppClient Implementation

### File: `src/models/local/llamacpp.ts`

```typescript
export interface LlamaCppClientConfig {
  endpoint: string;   // Required, e.g., "http://localhost:8080"
  authToken?: string; // Optional Bearer token
}

export class LlamaCppClient implements ModelClient {
  chat(request: ChatRequest): Promise<ChatResponse>;
  chatStream(request: ChatRequest): AsyncIterable<ChatStreamEvent>;
}
```
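
As a rough illustration of how `chat` might assemble its HTTP call, the sketch below builds the URL, headers, and JSON body for `/v1/chat/completions`. The `ChatMessage` and `ChatCall` shapes, the `buildChatCall` helper, and the `maxTokens` parameter are hypothetical stand-ins; Flynn's real `ChatRequest` type lives in `src/models/types.ts` and may differ. `LlamaCppClientConfig` is repeated only so the block stands alone.

```typescript
// Hypothetical stand-in for Flynn's message type (see src/models/types.ts).
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

// Repeated from the design above so this sketch is self-contained.
interface LlamaCppClientConfig {
  endpoint: string;   // e.g. "http://localhost:8080"
  authToken?: string; // optional Bearer token
}

// Hypothetical container for the pieces of one HTTP POST.
interface ChatCall {
  url: string;
  headers: Record<string, string>;
  body: string;
}

// Assemble a request for llama-server's OpenAI-compatible chat endpoint.
function buildChatCall(
  config: LlamaCppClientConfig,
  messages: ChatMessage[],
  stream: boolean,
  maxTokens = 2048,
): ChatCall {
  // Drop trailing slashes so "http://host:8080/" does not yield "//v1/...".
  const base = config.endpoint.replace(/\/+$/, "");
  const headers: Record<string, string> = { "Content-Type": "application/json" };
  if (config.authToken) {
    headers["Authorization"] = `Bearer ${config.authToken}`;
  }
  return {
    url: `${base}/v1/chat/completions`,
    headers,
    body: JSON.stringify({ messages, stream, max_tokens: maxTokens }),
  };
}
```

Keeping request assembly in a pure function like this would let the unit tests check headers and payload without any network mocking.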

### API Details

Calls llama-server's OpenAI-compatible `/v1/chat/completions` endpoint:

**Request:**
```json
{
  "messages": [{ "role": "user", "content": "Hello" }],
  "stream": true,
  "max_tokens": 2048
}
```

**Response (non-streaming):**
```json
{
  "choices": [{ "message": { "content": "Hi there!" } }],
  "usage": { "prompt_tokens": 5, "completion_tokens": 10 }
}
```
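
The non-streaming response above could be narrowed into a typed result along these lines. This is only a sketch: `ParsedChat` and `parseChatCompletion` are hypothetical names, and Flynn's actual `ChatResponse` may carry more fields than the two shown in the example payload.

```typescript
// Hypothetical result shape; Flynn's real ChatResponse may differ.
interface ParsedChat {
  content: string;
  promptTokens: number;
  completionTokens: number;
}

// Narrow an unknown JSON payload to the fields shown in the example response.
function parseChatCompletion(payload: unknown): ParsedChat {
  const body = payload as {
    choices?: Array<{ message?: { content?: unknown } }>;
    usage?: { prompt_tokens?: unknown; completion_tokens?: unknown };
  } | null;
  const content = body?.choices?.[0]?.message?.content;
  if (typeof content !== "string") {
    throw new Error("llama-server response has no choices[0].message.content");
  }
  // Treat missing or malformed usage counters as zero rather than failing.
  const toCount = (value: unknown): number => (typeof value === "number" ? value : 0);
  return {
    content,
    promptTokens: toCount(body?.usage?.prompt_tokens),
    completionTokens: toCount(body?.usage?.completion_tokens),
  };
}
```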

**Streaming:** Server-Sent Events with `data: {...}` chunks.
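
One way `chatStream` might consume those chunks is to accumulate network data in a string buffer and parse every complete event out of it. The sketch below assumes the chunk payload follows the OpenAI streaming convention (`choices[0].delta.content`, with a final `data: [DONE]` sentinel); the design only specifies `data: {...}`, so treat both details as assumptions to verify against the running server. `parseSseBuffer` and `SseParseResult` are hypothetical names.

```typescript
interface SseParseResult {
  deltas: string[]; // text fragments found in complete events
  done: boolean;    // true once the "[DONE]" sentinel was seen
  rest: string;     // trailing partial event to prepend to the next network chunk
}

// Parse all complete SSE events in `buffer` and return the unparsed remainder.
function parseSseBuffer(buffer: string): SseParseResult {
  const deltas: string[] = [];
  let done = false;
  // Events are separated by a blank line; the last piece may be incomplete.
  const events = buffer.split(/\r?\n\r?\n/);
  const rest = events.pop() ?? "";
  for (const event of events) {
    for (const line of event.split(/\r?\n/)) {
      if (!line.startsWith("data:")) continue; // skip comments and other fields
      const data = line.slice("data:".length).trim();
      if (data === "") continue;
      if (data === "[DONE]") {
        done = true; // assumed OpenAI-style end-of-stream marker
        continue;
      }
      const chunk = JSON.parse(data) as {
        choices?: Array<{ delta?: { content?: string | null } }>;
      };
      const text = chunk.choices?.[0]?.delta?.content;
      if (typeof text === "string" && text.length > 0) deltas.push(text);
    }
  }
  return { deltas, done, rest };
}
```

Returning `rest` matters because a network read can end in the middle of an event; the caller would prepend it to the next chunk before parsing again.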

### Error Handling

| Error | Message |
|-------|---------|
| Connection refused | "llama-server not running at {endpoint}" |
| Timeout (60s default) | "llama-server request timed out" |
| HTTP 4xx/5xx | Pass through server error message |
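
The table could translate into a small mapping function such as the sketch below. `LlamaCppHttpError`, `errorCode`, and `describeLlamaCppError` are hypothetical names. It also assumes that a refused connection surfaces with an `ECONNREFUSED` code on the error's `cause` (as Node's built-in `fetch` tends to report it) and that a timer-driven abort surfaces as an error named `TimeoutError` or `AbortError`; both should be confirmed on the runtime Flynn targets.

```typescript
// Hypothetical error type for non-2xx responses; not part of the original design.
class LlamaCppHttpError extends Error {
  constructor(public readonly status: number, serverMessage: string) {
    super(serverMessage);
    this.name = "LlamaCppHttpError";
    // Keep `instanceof` working even when compiled to older JS targets.
    Object.setPrototypeOf(this, new.target.prototype);
  }
}

// Read a low-level error code from `err.cause`, if one is present.
function errorCode(err: Error): unknown {
  const cause: unknown = (err as Error & { cause?: unknown }).cause;
  if (typeof cause === "object" && cause !== null) {
    return (cause as { code?: unknown }).code;
  }
  return undefined;
}

// Map a failure to the user-facing message from the error table.
function describeLlamaCppError(err: unknown, endpoint: string): string {
  if (err instanceof LlamaCppHttpError) {
    return err.message; // HTTP 4xx/5xx: pass the server's message through
  }
  if (err instanceof Error) {
    // Assumption: timeouts arrive as "TimeoutError" or "AbortError".
    if (err.name === "TimeoutError" || err.name === "AbortError") {
      return "llama-server request timed out";
    }
    // Assumption: socket failures expose their code on `cause`.
    if (errorCode(err) === "ECONNREFUSED") {
      return `llama-server not running at ${endpoint}`;
    }
    return err.message;
  }
  return String(err);
}
```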

## Deployment

### User Responsibility: systemd user service

```ini
# ~/.config/systemd/user/llama-server.service
[Unit]
Description=llama.cpp server
After=network.target

[Service]
ExecStart=/usr/bin/llama-server -m /data/models/qwen2.5-14b-q4_k_m.gguf -ngl 99 -c 8192
Restart=on-failure

[Install]
WantedBy=default.target
```

Key flags:
- `-ngl 99` - Offload up to 99 layers to the GPU (CUDA); this exceeds the model's layer count, so every layer is offloaded
- `-c 8192` - Context window size, in tokens
- `-m` - Path to GGUF model file

Enable with:
```bash
systemctl --user enable --now llama-server
```

### Flynn Configuration

```yaml
models:
  local:
    provider: llamacpp
    endpoint: http://localhost:8080
    # auth_token: optional
```
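
The router/factory code would need to turn such an entry into a `LlamaCppClientConfig`. A minimal sketch, assuming the parsed YAML entry has the shape shown in `ModelEntry` (a hypothetical type; the real schema is in `src/config/schema.ts`) and that `auth_token` maps to `authToken`:

```typescript
// Hypothetical shape of one entry under `models:` after YAML parsing.
interface ModelEntry {
  provider: string;
  endpoint?: string;
  auth_token?: string;
}

// Repeated from the design above so this sketch is self-contained.
interface LlamaCppClientConfig {
  endpoint: string;
  authToken?: string;
}

// Validate a config entry and convert it to the client's config shape.
function toLlamaCppConfig(name: string, entry: ModelEntry): LlamaCppClientConfig {
  if (entry.provider !== "llamacpp") {
    throw new Error(`models.${name}: expected provider "llamacpp", got "${entry.provider}"`);
  }
  if (!entry.endpoint) {
    throw new Error(`models.${name}: "endpoint" is required for llamacpp`);
  }
  const config: LlamaCppClientConfig = { endpoint: entry.endpoint };
  if (entry.auth_token) {
    config.authToken = entry.auth_token; // snake_case in YAML, camelCase in code
  }
  return config;
}
```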

## Files Changed

| File | Change |
|------|--------|
| `src/models/local/llamacpp.ts` | **New** - LlamaCppClient implementation |
| `src/models/local/index.ts` | Export LlamaCppClient |
| Router/factory code | Handle `provider: 'llamacpp'` when creating clients |

## Files Unchanged

- `src/config/schema.ts` - Already has `llamacpp` provider enum
- `src/models/types.ts` - ModelClient interface already fits

## Testing

1. Unit tests with mocked HTTP responses
2. Integration test against running llama-server (skipped in CI)
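
For the first item, one possible approach is to inject the HTTP function so tests can substitute a fake. The sketch below is illustrative only: `FetchLike`, `chatOnce`, and `fakeFetch` are hypothetical names, and the structural types are deliberately minimal so the example does not depend on any particular test framework or runtime typings.

```typescript
// Minimal structural types so the sketch needs no DOM or Node typings.
interface HttpResponseLike {
  ok: boolean;
  status: number;
  json(): Promise<unknown>;
}
type FetchLike = (
  url: string,
  init: { method: string; headers: Record<string, string>; body: string },
) => Promise<HttpResponseLike>;

// Hypothetical helper under test: one non-streaming chat call.
async function chatOnce(fetchImpl: FetchLike, endpoint: string, prompt: string): Promise<string> {
  const response = await fetchImpl(`${endpoint}/v1/chat/completions`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ messages: [{ role: "user", content: prompt }], stream: false }),
  });
  if (!response.ok) {
    throw new Error(`llama-server returned HTTP ${response.status}`);
  }
  const payload = (await response.json()) as {
    choices: Array<{ message: { content: string } }>;
  };
  return payload.choices[0].message.content;
}

// Build a fake that records each request and replies with a canned body.
function fakeFetch(status: number, body: unknown) {
  const calls: Array<{ url: string; body: string }> = [];
  const fetchImpl: FetchLike = async (url, init) => {
    calls.push({ url, body: init.body });
    return { ok: status >= 200 && status < 300, status, json: async () => body };
  };
  return { fetchImpl, calls };
}
```

A unit test would then call `chatOnce(fake.fetchImpl, ...)` and assert on both the returned text and the recorded `calls`, while the integration test passes the real `fetch` against a running llama-server.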