William Valentin 8a6cd7f559 docs: Add comprehensive documentation for production deployment and contribution
2026-02-13 16:07:29 -08:00


# Performance Tuning Guide
This guide covers performance optimization techniques for Flynn in production environments.
## Table of Contents
- [Overview](#overview)
- [Context Management](#context-management)
- [Model Routing](#model-routing)
- [Tool Execution](#tool-execution)
- [Memory & Embeddings](#memory--embeddings)
- [Session Management](#session-management)
- [Database Performance](#database-performance)
- [Gateway Performance](#gateway-performance)
- [Resource Usage](#resource-usage)
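- [Monitoring & Profiling](#monitoring--profiling)
- [Common Performance Issues](#common-performance-issues)
- [Performance Checklist](#performance-checklist)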
## Overview
Flynn's performance depends on several factors:
1. **Context window efficiency**: How efficiently tokens are used
2. **Model selection**: Choosing the right model for each task
3. **Tool execution**: Fast, reliable tool responses
4. **I/O operations**: Database and file system access
5. **Concurrency**: Handling multiple simultaneous requests
### Performance Goals
- **Response time**: < 5 seconds for simple queries
- **Context efficiency**: > 80% token utilization
- **Throughput**: 10-20 concurrent conversations
- **Resource usage**: < 2GB memory, < 50% CPU
## Context Management
### Compaction Settings
Context compaction prevents conversations from exceeding model context windows.
```yaml
agents:
  default:
    compaction:
      # Trigger compaction at 75% of context window
      thresholdPct: 75
      # Keep last 6 turns (user + assistant pairs)
      keepTurns: 6
      # Allow 2048 tokens for summary
      summaryMaxTokens: 2048
      # Preserve high-importance messages
      importanceThreshold: 0.8
```
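As a rough illustration of how these settings could interact, the sketch below decides when to compact and which messages stay verbatim. It is not Flynn's implementation: the `Message` shape, the `planCompaction` function, and the assumption that one turn is a user message plus an assistant reply are all hypothetical.

```typescript
// Hypothetical shapes; Flynn's real message and config types may differ.
interface Message {
  role: 'user' | 'assistant';
  tokens: number;
  importance: number; // assumed to be a score between 0 and 1
}

interface CompactionConfig {
  thresholdPct: number;
  keepTurns: number;
  importanceThreshold: number;
}

interface CompactionPlan {
  compact: boolean;
  keep: Message[]; // kept verbatim
  summarize: Message[]; // would be replaced by a summary
}

function planCompaction(
  messages: Message[],
  contextWindow: number,
  cfg: CompactionConfig,
): CompactionPlan {
  const used = messages.reduce((sum, m) => sum + m.tokens, 0);
  const limit = (contextWindow * cfg.thresholdPct) / 100;
  if (used < limit) {
    return { compact: false, keep: messages, summarize: [] };
  }
  // Assumption: one turn = one user message + one assistant message.
  const tailStart = Math.max(0, messages.length - cfg.keepTurns * 2);
  const keep: Message[] = [];
  const summarize: Message[] = [];
  messages.forEach((m, i) => {
    const recent = i >= tailStart;
    const important = m.importance >= cfg.importanceThreshold;
    if (recent || important) {
      keep.push(m);
    } else {
      summarize.push(m);
    }
  });
  return { compact: true, keep, summarize };
}
```

Under these assumptions, lowering `thresholdPct` makes compaction fire earlier, and lowering `keepTurns` moves more messages into the summarized set.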
### Tuning Guidelines
**For fast interactions:**
```yaml
thresholdPct: 60 # Compact early
keepTurns: 2 # Minimal history
summaryMaxTokens: 512 # Short summaries
```
**For complex reasoning:**
```yaml
thresholdPct: 85 # Maximize context
keepTurns: 10 # More history
summaryMaxTokens: 4096 # Detailed summaries
```
### Context Depth Levels
Control how much context is injected into the system prompt:
```yaml
prompt:
  contextDepth: 'normal' # minimal | normal | detailed | debug
```
- `minimal`: Only basic system prompt
- `normal`: System prompt + basic memory
- `detailed`: Full memory + tool descriptions
- `debug`: Verbose context (development only)
### Token Counting
Flynn uses rule-based token estimation (fast but approximate).
**Exact tokenization (more accurate, slower) is not yet available:**
```typescript
// Currently not implemented
// Future: Use tiktoken or similar for exact token counts
```
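To make "rule-based estimation" concrete, here is a minimal sketch of a character-count heuristic. The function name and the ratio of roughly 4 characters per token are assumptions for illustration; Flynn's actual rules are not shown in this guide.

```typescript
// Assumption: about 4 characters per token for English text.
// Real tokenizers vary by model and language.
const CHARS_PER_TOKEN = 4;

// Hypothetical helper: estimate tokens without running a tokenizer.
function estimateTokens(text: string): number {
  if (text.length === 0) return 0;
  return Math.ceil(text.length / CHARS_PER_TOKEN);
}
```

Because the estimate can be off in either direction, leaving headroom below the real context limit (as `thresholdPct` does) is a reasonable safeguard.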
## Model Routing
### Tier Configuration
Optimize model tiers for cost and latency:
```yaml
models:
  router:
    tiers:
      # Fast, cheap: Quick tasks, delegated calls
      fast: 'anthropic:claude-haiku-4-20250514'
      # Default: General conversation
      default: 'anthropic:claude-sonnet-4-20250514'
      # Complex: Deep reasoning, analysis
      complex: 'anthropic:claude-opus-4-20250514'
      # Fallback: Local models when cloud fails
      local: 'ollama:llama3'
```
### Delegation Tasks
Map delegation tasks to appropriate tiers:
```yaml
agents:
  default:
    delegation:
      tiers:
        compaction: 'fast' # Summarize history
        memoryExtraction: 'fast' # Extract facts
        classification: 'default' # Classify intent
        toolSummarization: 'default' # Summarize tool results
        complexReasoning: 'complex' # Deep analysis
```
### Fallback Chains
Configure fallback chains for resilience:
```yaml
models:
  router:
    # Try same model on different provider
    tierFallbacks:
      default:
        - 'github:claude-sonnet-4-5'
        - 'openai:gpt-4o-mini'
    # Global fallback when all tiers fail
    fallbackChain:
      - 'github:claude-sonnet-4-5'
      - 'local:ollama:llama3'
```
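The idea behind a fallback chain can be sketched as trying each model in order until one succeeds. This is an illustration only: `callWithFallbacks` and the `ModelCall` type are hypothetical names, and the real router may apply retries or tier logic between steps.

```typescript
// Hypothetical signature for "send this request to the given model".
type ModelCall<T> = (modelId: string) => Promise<T>;

// Try the primary model, then each fallback in order.
// Resolves with the first success; rejects only if every model fails.
async function callWithFallbacks<T>(
  primary: string,
  fallbacks: string[],
  call: ModelCall<T>,
): Promise<{ modelId: string; result: T }> {
  const errors: string[] = [];
  for (const modelId of [primary, ...fallbacks]) {
    try {
      return { modelId, result: await call(modelId) };
    } catch (err) {
      const reason = err instanceof Error ? err.message : String(err);
      errors.push(`${modelId}: ${reason}`);
    }
  }
  throw new Error(`All models failed: ${errors.join('; ')}`);
}
```

Each extra entry in the chain adds worst-case latency when earlier models fail, so short chains are usually preferable for interactive use.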
### Retry Configuration
Optimize retry behavior for different scenarios:
```yaml
models:
  router:
    retry:
      # More retries for transient failures
      maxAttempts: 3
      # Start with 1s delay
      initialDelayMs: 1000
      # Exponential backoff
      multiplier: 2
      # Max 30s between retries
      maxDelayMs: 30000
      # Don't retry auth errors. 'rate_limit_exceeded' is also listed here;
      # remove it if rate-limited requests should be retried with backoff.
      nonRetryablePatterns:
        - 'invalid_api_key'
        - 'permission_denied'
        - 'rate_limit_exceeded'
```
**For production reliability:**
```yaml
maxAttempts: 5
initialDelayMs: 500
multiplier: 1.5
maxDelayMs: 60000
```
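To see what these values mean in practice, the sketch below computes the delay before each retry. It assumes `maxAttempts` counts the first attempt (so there are `maxAttempts - 1` delays) and that no jitter is applied; both are assumptions, and `retryDelays` is a hypothetical helper.

```typescript
interface RetryConfig {
  maxAttempts: number;
  initialDelayMs: number;
  multiplier: number;
  maxDelayMs: number;
}

// Delay before retry n (0-based): initialDelayMs * multiplier^n, capped at maxDelayMs.
function retryDelays(cfg: RetryConfig): number[] {
  const delays: number[] = [];
  for (let retry = 0; retry < cfg.maxAttempts - 1; retry++) {
    const delay = cfg.initialDelayMs * Math.pow(cfg.multiplier, retry);
    delays.push(Math.min(delay, cfg.maxDelayMs));
  }
  return delays;
}
```

Under these assumptions the default settings wait 1000 ms and then 2000 ms, while the production settings wait 500, 750, 1125, and 1687.5 ms, so the 60-second cap is never reached with only five attempts.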
### Cost Estimation
Monitor token usage and costs:
```typescript
// Model costs in USD per 1M tokens (examples)
const MODEL_COSTS = {
  'anthropic:claude-sonnet-4-20250514': {
    input: 3.0, // $3 per 1M input tokens
    output: 15.0 // $15 per 1M output tokens
  },
  'anthropic:claude-haiku-4-20250514': {
    input: 0.25,
    output: 1.25
  }
};
```
Track usage with `AgentOrchestrator.getUsageStats()`.
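Given per-million-token prices like those above, a cost estimate is a simple weighted sum. The helper below is a sketch with a hypothetical name; the return shape of `getUsageStats()` is not documented here, so it takes raw token counts instead.

```typescript
// Prices in USD per 1M tokens, matching the table format above.
interface ModelCost {
  input: number;
  output: number;
}

// Hypothetical helper: estimated cost in USD for one request or one period.
function estimateCostUsd(
  cost: ModelCost,
  inputTokens: number,
  outputTokens: number,
): number {
  return (inputTokens * cost.input + outputTokens * cost.output) / 1e6;
}
```

For example, 200,000 input tokens and 10,000 output tokens at $3 and $15 per million come to $0.75.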
## Tool Execution
### Timeout Configuration
Set appropriate timeouts for different tool types:
```yaml
tools:
  executor:
    # Default 30s timeout
    defaultTimeoutMs: 30000
    # Max 50KB output
    maxOutputBytes: 51200
```
**For long-running tools:**
```yaml
tools:
  executor:
    defaultTimeoutMs: 60000 # 60s
```
**For fast tools:**
```yaml
tools:
  executor:
    defaultTimeoutMs: 10000 # 10s
```
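A timeout such as `defaultTimeoutMs` is commonly enforced by racing the tool call against a timer. The sketch below shows that pattern; `withTimeout` is a hypothetical helper, not Flynn's executor, and it does not cancel the underlying work, which keeps running unless the tool supports cancellation.

```typescript
// Reject if `work` does not settle within `timeoutMs`.
async function withTimeout<T>(
  work: Promise<T>,
  timeoutMs: number,
  label: string,
): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`${label} timed out after ${timeoutMs}ms`)),
      timeoutMs,
    );
  });
  try {
    return await Promise.race([work, timeout]);
  } finally {
    // Always clear the timer so it cannot keep the process alive.
    if (timer !== undefined) clearTimeout(timer);
  }
}
```

A shorter timeout frees the conversation sooner when a tool hangs, at the cost of failing tools that are slow but healthy.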
### Caching (Future)
Implement caching for repeated operations:
```yaml
# Not yet implemented
tools:
  cache:
    enabled: true
    ttl: 300 # 5 minutes
    maxSize: 1000
    excludePatterns:
      - 'shell.exec'
      - 'process.*'
```
### Sandbox Performance
The Docker sandbox adds overhead. To reduce it:
```yaml
sandbox:
  enabled: true
  image: 'node:22-alpine'
  # Resource limits
  resourceLimits:
    memory: '512m'
    cpus: '0.5'
    timeoutSec: 60
  # Use host networking if safe
  networkMode: 'host' # Faster than bridge mode
```
**For best performance:**
```yaml
sandbox:
  enabled: false # Disable if not needed
```
### Parallel Tool Execution
Flynn executes tools sequentially. For parallel execution:
```typescript
// Future enhancement
const results = await Promise.all([
  toolRegistry.execute('tool1', args1),
  toolRegistry.execute('tool2', args2),
  toolRegistry.execute('tool3', args3)
]);
```
## Memory & Embeddings
### Embedding Provider Selection
Choose embedding provider based on latency and cost:
```yaml
memory:
  embeddings:
    provider: 'openai' # openai | gemini | ollama | llamacpp | voyage
    openai:
      apiKey: '${OPENAI_API_KEY}'
      model: 'text-embedding-3-small' # Fastest
    # Alternative: Local embeddings
    ollama:
      host: 'localhost:11434'
      model: 'nomic-embed-text'
```
**Latency comparison (approximate, per request; varies with hardware and network):**
- OpenAI `text-embedding-3-small`: ~100ms
- Gemini: ~200ms
- Ollama `nomic-embed-text`: ~500ms (local)
- llama.cpp: ~300ms (local)
### Text Chunking
Optimize chunking for better search:
```yaml
memory:
  embeddings:
    chunking:
      # Smaller chunks for precision
      maxChunkSize: 512
      # Overlap for context preservation
      chunkOverlap: 50
      # Don't chunk small documents
      minChunkSize: 128
```
**For fast indexing:**
```yaml
maxChunkSize: 1024
chunkOverlap: 100
```
**For precise search:**
```yaml
maxChunkSize: 256
chunkOverlap: 25
```
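The effect of `maxChunkSize` and `chunkOverlap` can be sketched with a sliding window. This example assumes sizes are measured in characters and ignores `minChunkSize`; the guide does not say which unit Flynn uses, and `chunkText` is a hypothetical function.

```typescript
interface ChunkingConfig {
  maxChunkSize: number;
  chunkOverlap: number;
}

// Split text into windows of maxChunkSize that share chunkOverlap
// characters with the previous window.
function chunkText(text: string, cfg: ChunkingConfig): string[] {
  if (cfg.chunkOverlap >= cfg.maxChunkSize) {
    throw new Error('chunkOverlap must be smaller than maxChunkSize');
  }
  if (text.length === 0) return [];
  // Documents that fit in one chunk are kept whole.
  if (text.length <= cfg.maxChunkSize) return [text];

  const step = cfg.maxChunkSize - cfg.chunkOverlap;
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + cfg.maxChunkSize));
    // Stop once a window has reached the end of the text.
    if (start + cfg.maxChunkSize >= text.length) break;
  }
  return chunks;
}
```

Smaller chunks produce more embeddings per document, which tends to improve precision but increases indexing time and storage.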
### Hybrid Search Tuning
Balance keyword and vector search:
```yaml
memory:
  search:
    # Weight vector search higher
    vectorWeight: 0.7
    keywordWeight: 0.3
    # Return top results
    limit: 10
    # Minimum relevance threshold
    threshold: 0.5
```
**For keyword-heavy queries:**
```yaml
vectorWeight: 0.4
keywordWeight: 0.6
```
**For semantic queries:**
```yaml
vectorWeight: 0.8
keywordWeight: 0.2
```
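One plausible way these weights combine is a weighted sum of the two scores, followed by the threshold filter and the result limit. The sketch below assumes both scores are already normalized to the range 0 to 1; the types and `rankHybrid` are hypothetical and Flynn's scoring may differ.

```typescript
interface SearchConfig {
  vectorWeight: number;
  keywordWeight: number;
  limit: number;
  threshold: number;
}

interface Candidate {
  id: string;
  vectorScore: number; // assumed normalized to 0..1
  keywordScore: number; // assumed normalized to 0..1
}

interface Ranked {
  id: string;
  score: number;
}

// Combine scores, drop results below the threshold, return the top `limit`.
function rankHybrid(candidates: Candidate[], cfg: SearchConfig): Ranked[] {
  return candidates
    .map((c) => ({
      id: c.id,
      score: cfg.vectorWeight * c.vectorScore + cfg.keywordWeight * c.keywordScore,
    }))
    .filter((r) => r.score >= cfg.threshold)
    .sort((a, b) => b.score - a.score)
    .slice(0, cfg.limit);
}
```

With this formula, a document that matches semantically but shares few keywords can drop below the threshold when `keywordWeight` is raised, so it is worth re-checking `threshold` whenever the weights change.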
### Embedding Caching
Cache embeddings to avoid recomputation:
```yaml
memory:
  embeddings:
    cache:
      enabled: true
      ttl: 86400 # 24 hours
```
## Session Management
### TTL Configuration
Set appropriate session TTLs:
```yaml
sessions:
  ttl: '7d' # Keep sessions for 7 days
  # Maximum concurrent sessions
  maxSessions: 100
```
**For memory efficiency:**
```yaml
ttl: '1d'
maxSessions: 50
```
**For long-term memory:**
```yaml
ttl: '30d'
maxSessions: 200
```
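TTL strings such as `'7d'` need to be converted to milliseconds before they can be compared with timestamps. The sketch below assumes the format is an integer followed by one of `s`, `m`, `h`, or `d`, and that the TTL is measured from the session's last activity; both are assumptions, and the function names are hypothetical.

```typescript
// Parse durations like '30s', '15m', '1h', '7d' into milliseconds.
function parseDurationMs(value: string): number {
  const match = /^(\d+)([smhd])$/.exec(value);
  if (match === null) {
    throw new Error(`Invalid duration: ${value}`);
  }
  const amount = Number(match[1]);
  switch (match[2]) {
    case 's':
      return amount * 1000;
    case 'm':
      return amount * 60 * 1000;
    case 'h':
      return amount * 60 * 60 * 1000;
    default: // 'd' is the only remaining unit the regex allows
      return amount * 24 * 60 * 60 * 1000;
  }
}

// A session is expired once it has been idle for longer than the TTL.
function isExpired(lastActiveAtMs: number, ttl: string, nowMs: number): boolean {
  return nowMs - lastActiveAtMs > parseDurationMs(ttl);
}
```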
### Session Pruning
Prune old sessions regularly:
```yaml
automation:
  sessionPruner:
    enabled: true
    interval: '1h' # Run every hour
    # Prune sessions older than TTL
    pruneOlderThan: '7d'
```
### Session Indexing
Optimize session search with indexes:
```sql
-- SQLite indexes
CREATE INDEX idx_sessions_created_at ON sessions(created_at);
CREATE INDEX idx_sessions_last_active ON sessions(last_active_at);
CREATE INDEX idx_messages_session_id ON messages(session_id);
```
## Database Performance
### SQLite Configuration
Optimize SQLite for Flynn's workload:
```sql
-- In SQLite connection setup
PRAGMA journal_mode = WAL; -- Better concurrency
PRAGMA synchronous = NORMAL; -- Faster writes
PRAGMA cache_size = -64000; -- ~64MB cache (negative value = size in KiB)
PRAGMA temp_store = MEMORY; -- Store temp data in memory
PRAGMA mmap_size = 268435456; -- 256MB mmap
PRAGMA page_size = 4096; -- Default page size; only applies to a new database or after VACUUM (not in WAL mode)
```
### Connection Pooling
Flynn uses a single SQLite connection per database. For high concurrency, consider a connection pool:
```typescript
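// Future: connection pool (not yet implemented).
// `ConnectionPool` is a hypothetical wrapper; better-sqlite3 does not export one.
import Database from 'better-sqlite3'; // driver each pooled connection would use
const pool = new ConnectionPool({
  filename: '/path/to/database.db',
  maxConnections: 10
});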
```
### Query Optimization
Use indexed columns in queries:
```typescript
// Good: filters and sorts on last_active_at, which is indexed
const recentSessions = db.prepare(`
  SELECT * FROM sessions
  WHERE last_active_at > ?
  ORDER BY last_active_at DESC
  LIMIT 10
`).all(threshold);
// Bad: message_count has no index, so this is a full table scan
const busySessions = db.prepare(`
  SELECT * FROM sessions
  WHERE message_count > ?
`).all(threshold);
```
### Vacuum and Analyze
Regular maintenance improves performance:
```bash
# Vacuum to reclaim space
sqlite3 sessions.db "VACUUM;"
# Analyze for query optimization
sqlite3 sessions.db "ANALYZE;"
# Rebuild indexes
sqlite3 sessions.db "REINDEX;"
```
Add to crontab (monthly):
```
0 0 1 * * sqlite3 /var/lib/flynn/sessions.db "VACUUM; ANALYZE;" >> /var/log/flynn-maintenance.log 2>&1
```
## Gateway Performance
### Connection Limits
Limit concurrent connections:
```yaml
gateway:
  enabled: true
  port: 18800
  # Maximum concurrent WebSocket connections
  maxConnections: 50
  # Single-client lock
  lock:
    enabled: true # Only one client at a time
```
**For multiple users:**
```yaml
gateway:
  maxConnections: 100
  lock:
    enabled: false
```
### Lane Queue
The lane queue serializes requests per session:
```yaml
gateway:
  laneQueue:
    # Max requests per session
    maxDepth: 10
    # Request timeout
    requestTimeoutMs: 30000
```
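Per-session serialization can be sketched by chaining each new request onto the previous one for the same session. The class below is an illustration, not the gateway's code: `LaneQueue` here is a hypothetical minimal version, `maxDepth` is interpreted as queued plus running requests per session, and `requestTimeoutMs` is left out.

```typescript
class LaneQueue {
  private readonly maxDepth: number;
  private readonly tails = new Map<string, Promise<void>>();
  private readonly depths = new Map<string, number>();

  constructor(maxDepth: number) {
    this.maxDepth = maxDepth;
  }

  // Run `task` after all earlier tasks for the same session have settled.
  enqueue<T>(sessionId: string, task: () => Promise<T>): Promise<T> {
    const depth = this.depths.get(sessionId) ?? 0;
    if (depth >= this.maxDepth) {
      return Promise.reject(
        new Error(`Lane ${sessionId} is full (maxDepth ${this.maxDepth})`),
      );
    }
    this.depths.set(sessionId, depth + 1);

    const tail = this.tails.get(sessionId) ?? Promise.resolve();
    const result = tail.then(task);
    const done = (): void => {
      this.depths.set(sessionId, (this.depths.get(sessionId) ?? 1) - 1);
    };
    // The stored tail never rejects, so one failed task cannot block the lane.
    this.tails.set(sessionId, result.then(done, done));
    return result;
  }
}
```

Requests for different sessions use separate lanes and can overlap, while requests within one session run strictly in arrival order. A production version would also remove idle lanes from the maps.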
### WebSocket Optimization
Configure WebSocket for performance:
```typescript
// Gateway server WebSocket options
const wsOptions = {
  // Enable compression
  perMessageDeflate: {
    threshold: 1024
  },
  // Track connected clients (lets the server iterate them, e.g. for heartbeat pings)
  clientTracking: true,
  // Maximum message size
  maxPayload: 16 * 1024 * 1024 // 16MB
};
```
### HTTP Server
Optimize HTTP server for static files:
```yaml
gateway:
  static:
    # Enable gzip compression
    gzip: true
    # Cache static assets
    cacheControl: 'public, max-age=3600'
    # Serve index.html for SPA routes
    spa: true
```
## Resource Usage
### Node.js Options
Tune Node.js for production:
```bash
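# Increase the V8 heap (old space) limit to 4GB
export NODE_OPTIONS="--max-old-space-size=4096"
# V8 tuning flags such as --optimize-for-size are not accepted in NODE_OPTIONS;
# pass them on the command line. Avoid --gc-interval in production: it is a
# debugging flag that forces frequent garbage collection.
node --optimize-for-size dist/cli/index.js start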
```
In systemd service:
```ini
Environment="NODE_OPTIONS=--max-old-space-size=4096"
```
### Process Limits
Set appropriate limits:
```ini
[Service]
# Memory limit (2GB); MemoryMax replaces the deprecated MemoryLimit
MemoryMax=2G
# Disable swap for the service
MemorySwapMax=0
# CPU quota (200% = 2 cores)
CPUQuota=200%
# File descriptors
LimitNOFILE=65536
```
### Docker Resource Limits
Constrain Docker container:
```yaml
services:
  flynn:
    deploy:
      resources:
        limits:
          cpus: '2.0'
          memory: 2G
        reservations:
          cpus: '1.0'
          memory: 1G
```
### Memory Monitoring
Monitor memory usage:
```bash
# Check Flynn memory (the [f] keeps grep from matching itself)
ps aux | grep '[f]lynn'
# System memory
free -h
# Node.js heap stats: add this line to the application code (JavaScript, not shell)
# console.log('Heap used:', process.memoryUsage().heapUsed / 1024 / 1024, 'MB');
```
## Monitoring & Profiling
### Health Checks
Enable gateway health endpoint:
```yaml
automation:
  heartbeat:
    enabled: true
    interval: '5m'
    checks:
      - 'gateway'
      - 'model'
      - 'channels'
      - 'memory'
      - 'disk'
```
Check health:
```bash
curl http://localhost:18800/health
```
### Logging Levels
Configure logging appropriately:
```yaml
logging:
level: 'info' # debug | info | warn | error
```
- **Development:** `debug` - All messages
- **Production:** `info` - Normal operation
- **Minimal:** `warn` - Only warnings and errors
### Performance Metrics
Track key metrics:
```typescript
// Future: Metrics collection
interface Metrics {
  // Response times
  avgResponseTime: number;
  p95ResponseTime: number;
  p99ResponseTime: number;
  // Throughput
  requestsPerSecond: number;
  concurrentSessions: number;
  // Token usage
  avgInputTokens: number;
  avgOutputTokens: number;
  totalTokens: number;
  // Errors
  errorRate: number;
  timeoutRate: number;
}
```
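Until built-in metrics exist, the percentile fields above can be computed from recorded response times. The sketch below uses the nearest-rank method, which is one of several common percentile definitions; `percentile` is a hypothetical helper.

```typescript
// Nearest-rank percentile over response times (any consistent unit).
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) {
    throw new Error('percentile requires at least one sample');
  }
  // Copy before sorting so the caller's array is not mutated.
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  const index = Math.min(sorted.length - 1, Math.max(0, rank - 1));
  return sorted[index] as number;
}
```

Tracking p95 and p99 alongside the average matters because a few slow model or tool calls can dominate user-perceived latency while barely moving the mean.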
### Profiling
Profile Node.js execution:
```bash
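# Generate a V8 tick log (writes isolate-*.log)
node --prof dist/cli/index.js start
# Process the log into a readable summary
node --prof-process isolate-*.log > profile.txt
# To analyze in Chrome DevTools instead, write a .cpuprofile file
# and load it in the DevTools Performance panel
node --cpu-prof dist/cli/index.js start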
```
### Flamegraphs
Generate flamegraphs for bottleneck analysis:
```bash
# Install 0x
npm install -g 0x
# Run with profiler
0x dist/cli/index.js start
```
## Common Performance Issues
### High Memory Usage
**Symptoms:**
- OOM errors
- Slow garbage collection
- System swapping
**Solutions:**
1. Reduce `keepTurns` in compaction
2. Decrease session TTL
3. Prune old sessions
4. Increase Node.js memory limit
5. Check for memory leaks
### Slow Response Times
**Symptoms:**
- Responses > 10 seconds
- Timeouts
- Poor user experience
**Solutions:**
1. Switch to faster model tier
2. Enable compaction
3. Use local fallbacks
4. Optimize tool timeouts
5. Check network latency
### High CPU Usage
**Symptoms:**
- CPU > 80%
- Slow system
- High latency
**Solutions:**
1. Reduce concurrent sessions
2. Optimize database queries
3. Use efficient embeddings
4. Disable unnecessary features
5. Scale vertically (more CPU)
### Database Locks
**Symptoms:**
- SQLite database locked errors
- Slow writes
- Concurrent access issues
**Solutions:**
1. Enable WAL mode
2. Reduce write frequency
3. Use connection pooling
4. Add appropriate indexes
### Model Rate Limits
**Symptoms:**
- 429 Too Many Requests errors
- Frequent fallbacks
- Increased latency
**Solutions:**
1. Configure retry with exponential backoff
2. Use faster models for delegated tasks
3. Implement request queuing
4. Add local model fallbacks
## Performance Checklist
Before deploying to production, verify:
- [ ] Compaction configured with appropriate threshold
- [ ] Model tiers configured for cost/latency
- [ ] Fallback chains configured
- [ ] Tool timeouts set appropriately
- [ ] Session TTL reasonable for use case
- [ ] SQLite optimized (WAL mode, cache size)
- [ ] Database indexes created
- [ ] Gateway connection limits set
- [ ] Memory limits configured
- [ ] Monitoring enabled
- [ ] Logging level set to `info` or `warn`
- [ ] Health checks working
- [ ] Backup/restore tested
---
For more information:
- [TROUBLESHOOTING.md](../../TROUBLESHOOTING.md)
- [PRODUCTION.md](../deployment/PRODUCTION.md)
- [ARCHITECTURE.md](../../.planning/codebase/ARCHITECTURE.md)