# Performance Tuning Guide
This guide covers performance optimization techniques for Flynn in production environments.
## Table of Contents
- [Overview](#overview)
- [Context Management](#context-management)
- [Model Routing](#model-routing)
- [Tool Execution](#tool-execution)
- [Memory & Embeddings](#memory--embeddings)
- [Session Management](#session-management)
- [Database Performance](#database-performance)
- [Gateway Performance](#gateway-performance)
- [Resource Usage](#resource-usage)
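- [Monitoring & Profiling](#monitoring--profiling)
- [Common Performance Issues](#common-performance-issues)
- [Performance Checklist](#performance-checklist)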
## Overview
Flynn's performance depends on several factors:
1. **Context window efficiency**: How efficiently tokens are used
2. **Model selection**: Choosing the right model for each task
3. **Tool execution**: Fast, reliable tool responses
4. **I/O operations**: Database and file system access
5. **Concurrency**: Handling multiple simultaneous requests
### Performance Goals
- **Response time**: < 5 seconds for simple queries
- **Context efficiency**: > 80% token utilization
- **Throughput**: 10-20 concurrent conversations
- **Resource usage**: < 2GB memory, < 50% CPU
## Context Management
### Compaction Settings
Context compaction prevents conversations from exceeding model context windows.
```yaml
agents:
  default:
    compaction:
      # Trigger compaction at 75% of context window
      thresholdPct: 75
      # Keep last 6 turns (user + assistant pairs)
      keepTurns: 6
      # Allow 2048 tokens for summary
      summaryMaxTokens: 2048
      # Preserve high-importance messages
      importanceThreshold: 0.8
```
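
To make the interplay between `thresholdPct` and `keepTurns` concrete, here is a minimal sketch of the decision these settings imply. The names `shouldCompact`, `splitForCompaction`, and the `Message` shape are hypothetical, and the sketch assumes one turn is exactly one user message plus one assistant reply; Flynn's actual implementation may differ.

```typescript
// Hypothetical message shape; not Flynn's actual type.
interface Message {
  role: 'user' | 'assistant';
  content: string;
}

// True once usage reaches thresholdPct percent of the context window.
function shouldCompact(usedTokens: number, contextWindow: number, thresholdPct: number): boolean {
  return usedTokens >= (contextWindow * thresholdPct) / 100;
}

// Split history into messages to summarize and messages to keep verbatim.
// Assumption: one turn = one user message + one assistant reply (2 messages).
function splitForCompaction(
  messages: Message[],
  keepTurns: number,
): { toSummarize: Message[]; toKeep: Message[] } {
  const keepCount = Math.min(messages.length, keepTurns * 2);
  const cut = messages.length - keepCount;
  return { toSummarize: messages.slice(0, cut), toKeep: messages.slice(cut) };
}
```

With a 200,000-token window and `thresholdPct: 75`, compaction would start at 150,000 used tokens; with `keepTurns: 6`, the last 12 messages stay verbatim and everything older is summarized.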
### Tuning Guidelines
**For fast interactions:**
```yaml
thresholdPct: 60 # Compact early
keepTurns: 2 # Minimal history
summaryMaxTokens: 512 # Short summaries
```
**For complex reasoning:**
```yaml
thresholdPct: 85 # Maximize context
keepTurns: 10 # More history
summaryMaxTokens: 4096 # Detailed summaries
```
### Proactive Compaction Signals
Use proactive thresholds to checkpoint before compaction cliffs and emit warning telemetry:
```yaml
compaction:
  proactive:
    enabled: true
    warn_pct: 75
    checkpoint_pct: 85
    auto_compact_pct: 95
    checkpoint_cooldown_ms: 300000
    memory_namespace: session/checkpoints
```
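
One plausible reading of these thresholds is an escalation ladder: warn first, then checkpoint, then compact. The sketch below illustrates that reading only. The function `proactiveAction` is hypothetical, and the cooldown behavior (skipping the checkpoint but still warning) is an assumption rather than documented Flynn behavior.

```typescript
type ProactiveAction = 'none' | 'warn' | 'checkpoint' | 'auto_compact';

interface ProactiveConfig {
  warn_pct: number;
  checkpoint_pct: number;
  auto_compact_pct: number;
  checkpoint_cooldown_ms: number;
}

// Pick the most severe action whose threshold has been reached.
// lastCheckpointAt is a millisecond timestamp, or null if no checkpoint exists yet.
function proactiveAction(
  usagePct: number,
  cfg: ProactiveConfig,
  nowMs: number,
  lastCheckpointAt: number | null,
): ProactiveAction {
  if (usagePct >= cfg.auto_compact_pct) return 'auto_compact';
  if (usagePct >= cfg.checkpoint_pct) {
    const coolingDown =
      lastCheckpointAt !== null && nowMs - lastCheckpointAt < cfg.checkpoint_cooldown_ms;
    // Assumption: during the cooldown the checkpoint is skipped but the warning still fires.
    return coolingDown ? 'warn' : 'checkpoint';
  }
  if (usagePct >= cfg.warn_pct) return 'warn';
  return 'none';
}
```

The 300000 ms cooldown in the example configuration would then limit checkpoints to at most one every five minutes while usage hovers between 85% and 95%.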
### Context Depth Levels
Control how much context is injected into the system prompt:
```yaml
prompt:
  contextDepth: 'normal' # minimal | normal | detailed | debug
```
- `minimal`: Only basic system prompt
- `normal`: System prompt + basic memory
- `detailed`: Full memory + tool descriptions
- `debug`: Verbose context (development only)
### Token Counting
Flynn uses rule-based token estimation (fast but approximate).
**Exact tokenization (more accurate but slower; not yet implemented):**
```typescript
// Currently not implemented
// Future: Use tiktoken or similar for exact token counts
```
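
The guide does not spell out the estimation rules, so the following only illustrates what a rule-based estimator can look like. It assumes the common rough heuristic of about four characters per token for English text. The names `estimateTokens` and `estimateConversationTokens`, the ratio, and the per-message overhead are all assumptions, not Flynn's actual rules.

```typescript
// Rough heuristic: about 4 characters per token for English text (assumption).
const CHARS_PER_TOKEN = 4;

function estimateTokens(text: string): number {
  if (text.length === 0) return 0;
  return Math.ceil(text.length / CHARS_PER_TOKEN);
}

// Estimate a whole conversation, adding a small fixed per-message overhead
// for role markers and separators (the default of 4 is also an assumption).
function estimateConversationTokens(messages: string[], perMessageOverhead = 4): number {
  return messages.reduce((sum, m) => sum + estimateTokens(m) + perMessageOverhead, 0);
}
```

An estimator like this runs in constant time per message, which is why it is fast, but it can drift noticeably for code, non-English text, or long numbers. Leaving headroom in `thresholdPct` compensates for that error.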
## Model Routing
### Tier Configuration
Optimize model tiers for cost and latency:
```yaml
models:
  router:
    tiers:
      # Fast, cheap: Quick tasks, delegated calls
      fast: 'anthropic:claude-haiku-4-20250514'
      # Default: General conversation
      default: 'anthropic:claude-sonnet-4-20250514'
      # Complex: Deep reasoning, analysis
      complex: 'anthropic:claude-opus-4-20250514'
      # Fallback: Local models when cloud fails
      local: 'ollama:llama3'
```
### Delegation Tasks
Map delegation tasks to appropriate tiers:
```yaml
agents:
  default:
    delegation:
      tiers:
        compaction: 'fast' # Summarize history
        memoryExtraction: 'fast' # Extract facts
        classification: 'default' # Classify intent
        toolSummarization: 'default' # Summarize tool results
        complexReasoning: 'complex' # Deep analysis
```
### Fallback Chains
Configure fallback chains for resilience:
```yaml
models:
  router:
    # Try same model on different provider
    tierFallbacks:
      default:
        - 'github:claude-sonnet-4-5'
        - 'openai:gpt-4o-mini'
    # Global fallback when all tiers fail
    fallbackChain:
      - 'github:claude-sonnet-4-5'
      - 'local:ollama:llama3'
```
### Retry Configuration
Optimize retry behavior for different scenarios:
```yaml
models:
  router:
    retry:
      # More retries for transient failures
      maxAttempts: 3
      # Start with 1s delay
      initialDelayMs: 1000
      # Exponential backoff
      multiplier: 2
      # Max 30s between retries
      maxDelayMs: 30000
      # Don't retry auth errors; drop 'rate_limit_exceeded' if 429s should back off and retry
      nonRetryablePatterns:
        - 'invalid_api_key'
        - 'permission_denied'
        - 'rate_limit_exceeded'
```
**For production reliability:**
```yaml
maxAttempts: 5
initialDelayMs: 500
multiplier: 1.5
maxDelayMs: 60000
```
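
To see what a given retry configuration means in wall-clock terms, it helps to compute the delay schedule. The sketch below does that under two assumptions: `maxAttempts` counts the initial attempt (so there are `maxAttempts - 1` waits), and non-retryable patterns are matched as substrings of the error message. The names `backoffDelays` and `isRetryable` are hypothetical.

```typescript
interface RetryConfig {
  maxAttempts: number;
  initialDelayMs: number;
  multiplier: number;
  maxDelayMs: number;
}

// Delays waited between attempts; assumes maxAttempts includes the first try.
function backoffDelays(cfg: RetryConfig): number[] {
  const delays: number[] = [];
  for (let retry = 0; retry < cfg.maxAttempts - 1; retry++) {
    const raw = cfg.initialDelayMs * Math.pow(cfg.multiplier, retry);
    delays.push(Math.min(raw, cfg.maxDelayMs)); // cap each wait at maxDelayMs
  }
  return delays;
}

// Assumption: patterns are matched as plain substrings of the error message.
function isRetryable(errorMessage: string, nonRetryablePatterns: string[]): boolean {
  return !nonRetryablePatterns.some((p) => errorMessage.includes(p));
}
```

Under these assumptions the default values wait 1000 ms and then 2000 ms, while the production values wait 500, 750, 1125, and 1687.5 ms. In both cases `maxDelayMs` is never reached, so the cap only matters with more attempts or a larger multiplier.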
### Cost Estimation
Monitor token usage and costs:
```typescript
// Model costs (examples)
const MODEL_COSTS = {
  'anthropic:claude-sonnet-4-20250514': {
    input: 3.0, // $3 per 1M input tokens
    output: 15.0 // $15 per 1M output tokens
  },
  'anthropic:claude-haiku-4-20250514': {
    input: 0.25,
    output: 1.25
  }
};
```
Track usage with `AgentOrchestrator.getUsageStats()`.
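
As a worked example of the per-million-token arithmetic, the sketch below turns token counts into an estimated dollar cost using the example prices above. The function `estimateCostUsd` is hypothetical, and the shape of whatever `getUsageStats()` returns is not assumed here; the token counts are passed in directly.

```typescript
interface ModelCost {
  input: number; // USD per 1M input tokens
  output: number; // USD per 1M output tokens
}

// Example prices copied from the cost table above; not authoritative pricing.
const EXAMPLE_COSTS: Record<string, ModelCost> = {
  'anthropic:claude-sonnet-4-20250514': { input: 3.0, output: 15.0 },
  'anthropic:claude-haiku-4-20250514': { input: 0.25, output: 1.25 },
};

// Returns the estimated cost in USD, or null for a model without a price entry.
function estimateCostUsd(model: string, inputTokens: number, outputTokens: number): number | null {
  const cost = EXAMPLE_COSTS[model];
  if (cost === undefined) return null;
  return (inputTokens * cost.input + outputTokens * cost.output) / 1_000_000;
}
```

For instance, 1,000,000 input tokens and 100,000 output tokens on the Sonnet entry come to $3.00 + $1.50 = $4.50.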
## Tool Execution
### Timeout Configuration
Set appropriate timeouts for different tool types:
```yaml
tools:
  executor:
    # Default 30s timeout
    defaultTimeoutMs: 30000
    # Max 50KB output
    maxOutputBytes: 51200
```
**For long-running tools:**
```yaml
tools:
  executor:
    defaultTimeoutMs: 60000 # 60s
```
**For fast tools:**
```yaml
tools:
  executor:
    defaultTimeoutMs: 10000 # 10s
```
### Caching (Future)
Tool result caching for repeated operations is planned but not yet implemented. A possible configuration:
```yaml
# Not yet implemented
tools:
  cache:
    enabled: true
    ttl: 300 # 5 minutes
    maxSize: 1000
    excludePatterns:
      - 'shell.exec'
      - 'process.*'
```
### Sandbox Performance
Docker sandbox adds overhead. Optimize:
```yaml
sandbox:
  enabled: true
  image: 'node:22-alpine'
  # Resource limits
  resourceLimits:
    memory: '512m'
    cpus: '0.5'
    timeoutSec: 60
  # Use host networking if safe
  networkMode: 'host' # Faster than bridge mode
```
**For best performance:**
```yaml
sandbox:
  enabled: false # Disable if not needed
```
### Parallel Tool Execution
Flynn currently executes tools sequentially. Parallel execution of independent tools is a possible future enhancement:
```typescript
// Future enhancement (illustrative; args1..args3 are placeholders)
const results = await Promise.all([
  toolRegistry.execute('tool1', args1),
  toolRegistry.execute('tool2', args2),
  toolRegistry.execute('tool3', args3)
]);
```
## Memory & Embeddings
### Embedding Provider Selection
Choose embedding provider based on latency and cost:
```yaml
memory:
  embeddings:
    provider: 'openai' # openai | gemini | ollama | llamacpp | voyage
    openai:
      apiKey: '${OPENAI_API_KEY}'
      model: 'text-embedding-3-small' # Fastest
    # Alternative: Local embeddings
    ollama:
      host: 'localhost:11434'
      model: 'nomic-embed-text'
```
**Latency comparison (approximate; varies with hardware and network):**
- OpenAI `text-embedding-3-small`: ~100ms
- Gemini: ~200ms
- Ollama `nomic-embed-text`: ~500ms (local)
- llama.cpp: ~300ms (local)
### Text Chunking
Optimize chunking for better search:
```yaml
memory:
  embeddings:
    chunking:
      # Smaller chunks for precision
      maxChunkSize: 512
      # Overlap for context preservation
      chunkOverlap: 50
      # Don't chunk small documents
      minChunkSize: 128
```
**For fast indexing:**
```yaml
maxChunkSize: 1024
chunkOverlap: 100
```
**For precise search:**
```yaml
maxChunkSize: 256
chunkOverlap: 25
```
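
The effect of these three settings is easier to judge with a concrete chunker in hand. The following is a minimal sliding-window sketch, not Flynn's implementation. It assumes sizes are measured in characters (the real unit may be tokens), that documents no longer than `maxChunkSize` are left whole, and that a trailing fragment shorter than `minChunkSize` is merged into the previous chunk. The name `chunkText` is hypothetical.

```typescript
interface ChunkingConfig {
  maxChunkSize: number;
  chunkOverlap: number;
  minChunkSize: number;
}

// Sliding-window chunking measured in characters (assumption).
function chunkText(text: string, cfg: ChunkingConfig): string[] {
  if (cfg.chunkOverlap >= cfg.maxChunkSize) {
    throw new Error('chunkOverlap must be smaller than maxChunkSize');
  }
  if (text.length === 0) return [];
  // Small documents are returned as a single chunk.
  if (text.length <= cfg.maxChunkSize) return [text];

  const step = cfg.maxChunkSize - cfg.chunkOverlap;
  const starts: number[] = [];
  let start = 0;
  while (true) {
    starts.push(start);
    if (start + cfg.maxChunkSize >= text.length) break;
    start += step;
  }
  // Assumption: a tail shorter than minChunkSize is merged into its predecessor,
  // so the final chunk may exceed maxChunkSize slightly.
  if (starts.length > 1 && text.length - start < cfg.minChunkSize) starts.pop();

  return starts.map((s, i) =>
    i === starts.length - 1 ? text.slice(s) : text.slice(s, s + cfg.maxChunkSize),
  );
}
```

With the default values, each chunk starts 462 characters after the previous one, so the number of embedding calls grows roughly as document length divided by 462. Halving `maxChunkSize` roughly doubles indexing work, which is the trade-off behind the two presets above.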
### Hybrid Search Tuning
Balance keyword and vector search:
```yaml
memory:
  search:
    # Weight vector search higher
    vectorWeight: 0.7
    keywordWeight: 0.3
    # Return top results
    limit: 10
    # Minimum relevance threshold
    threshold: 0.5
```
**For keyword-heavy queries:**
```yaml
vectorWeight: 0.4
keywordWeight: 0.6
```
**For semantic queries:**
```yaml
vectorWeight: 0.8
keywordWeight: 0.2
```
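
To show how the weights, `threshold`, and `limit` interact, here is a sketch of a weighted-sum ranker. It assumes both scores are already normalized to the range 0 to 1 and that the final score is a plain linear combination; Flynn may combine scores differently. The names `ScoredHit` and `rankHybrid` are hypothetical.

```typescript
interface ScoredHit {
  id: string;
  vectorScore: number; // assumed normalized to [0, 1]
  keywordScore: number; // assumed normalized to [0, 1]
}

interface SearchConfig {
  vectorWeight: number;
  keywordWeight: number;
  limit: number;
  threshold: number;
}

// Combine both scores, drop weak results, and return the best `limit` hits.
function rankHybrid(hits: ScoredHit[], cfg: SearchConfig): { id: string; score: number }[] {
  return hits
    .map((h) => ({
      id: h.id,
      score: cfg.vectorWeight * h.vectorScore + cfg.keywordWeight * h.keywordScore,
    }))
    .filter((r) => r.score >= cfg.threshold)
    .sort((a, b) => b.score - a.score)
    .slice(0, cfg.limit);
}
```

A hit with a strong vector score but weak keyword match (0.9 and 0.1) scores about 0.66 with the default weights and passes the 0.5 threshold, but scores about 0.42 with the keyword-heavy preset and is dropped. Changing the weights therefore changes which results appear at all, not only their order.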
### Embedding Caching
Cache embeddings to avoid recomputation:
```yaml
memory:
  embeddings:
    cache:
      enabled: true
      ttl: 86400 # 24 hours
```
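
For readers who want to picture what such a cache does, the sketch below shows a minimal in-memory TTL cache keyed by the exact input text. It is an illustration under assumptions, not Flynn's cache: the class name `EmbeddingCache` is hypothetical, `ttl` is taken to be in seconds (consistent with the `# 24 hours` comment), and entries are evicted lazily on read.

```typescript
interface CacheEntry {
  vector: number[];
  expiresAt: number; // millisecond timestamp
}

class EmbeddingCache {
  private readonly entries = new Map<string, CacheEntry>();
  private readonly ttlMs: number;
  private readonly now: () => number;

  // `now` is injectable so expiry can be tested without waiting.
  constructor(ttlSeconds: number, now: () => number = () => Date.now()) {
    this.ttlMs = ttlSeconds * 1000;
    this.now = now;
  }

  get(text: string): number[] | undefined {
    const entry = this.entries.get(text);
    if (entry === undefined) return undefined;
    if (this.now() >= entry.expiresAt) {
      this.entries.delete(text); // expired: evict lazily on read
      return undefined;
    }
    return entry.vector;
  }

  set(text: string, vector: number[]): void {
    this.entries.set(text, { vector, expiresAt: this.now() + this.ttlMs });
  }
}
```

A cache hit avoids an embedding request entirely, so the benefit scales with how often identical text is embedded again within the TTL window.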
## Session Management
### TTL Configuration
Set appropriate session TTLs:
```yaml
sessions:
  ttl: '7d' # Keep sessions for 7 days
  # Maximum concurrent sessions
  maxSessions: 100
```
**For memory efficiency:**
```yaml
ttl: '1d'
maxSessions: 50
```
**For long-term memory:**
```yaml
ttl: '30d'
maxSessions: 200
```
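
The TTL values above are duration strings. As an illustration of how such a value translates into an expiry check, here is a small sketch. The accepted suffixes (`s`, `m`, `h`, `d`) are an assumption based on the examples in this guide, and `parseDurationMs` and `isSessionExpired` are hypothetical helpers, not Flynn APIs.

```typescript
// Milliseconds per supported unit suffix (the suffix set is an assumption).
const UNIT_MS: Record<string, number> = {
  s: 1000,
  m: 60 * 1000,
  h: 60 * 60 * 1000,
  d: 24 * 60 * 60 * 1000,
};

// Parse strings such as '1h', '7d', or '30d' into milliseconds.
function parseDurationMs(value: string): number {
  const match = /^(\d+)([smhd])$/.exec(value.trim());
  if (match === null) throw new Error(`Invalid duration: ${value}`);
  return Number(match[1]) * UNIT_MS[match[2]];
}

// A session is expired once it has been idle for longer than the TTL.
function isSessionExpired(lastActiveMs: number, ttl: string, nowMs: number): boolean {
  return nowMs - lastActiveMs > parseDurationMs(ttl);
}
```

For example, `'7d'` is 604,800,000 ms, so a session last active eight days ago is expired under the default while the same session survives under the `'30d'` preset.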
### Session Pruning
Prune old sessions regularly:
```yaml
automation:
  sessionPruner:
    enabled: true
    interval: '1h' # Run every hour
    # Prune sessions older than TTL
    pruneOlderThan: '7d'
```
### Session Indexing
Optimize session search with indexes:
```sql
-- SQLite indexes
CREATE INDEX idx_sessions_created_at ON sessions(created_at);
CREATE INDEX idx_sessions_last_active ON sessions(last_active_at);
CREATE INDEX idx_messages_session_id ON messages(session_id);
```
## Database Performance
### SQLite Configuration
Optimize SQLite for Flynn's workload:
```sql
-- In SQLite connection setup
PRAGMA journal_mode = WAL; -- Better concurrency
PRAGMA synchronous = NORMAL; -- Faster writes
PRAGMA cache_size = -64000; -- ~64MB cache (negative value = size in KiB)
PRAGMA temp_store = MEMORY; -- Store temp data in memory
PRAGMA mmap_size = 268435456; -- 256MB mmap
PRAGMA page_size = 4096; -- Default page size (only takes effect on a new, empty database)
```
### Connection Pooling
Flynn uses a single SQLite connection per database. For high concurrency, consider a connection pool:
```typescript
// Future: Connection pool (illustrative only)
// `ConnectionPool` is a hypothetical wrapper; better-sqlite3 does not provide one.
import Database from 'better-sqlite3';
const pool = new ConnectionPool({
  filename: '/path/to/database.db',
  maxConnections: 10
});
```
### Query Optimization
Use indexed columns in queries:
```typescript
// Good: Uses the index on last_active_at
const recentSessions = db.prepare(`
  SELECT * FROM sessions
  WHERE last_active_at > ?
  ORDER BY last_active_at DESC
  LIMIT 10
`).all(threshold);
// Bad: Full table scan (message_count is not indexed)
const busySessions = db.prepare(`
  SELECT * FROM sessions
  WHERE message_count > ?
`).all(threshold);
```
### Vacuum and Analyze
Regular maintenance improves performance:
```bash
# Vacuum to reclaim space
sqlite3 sessions.db "VACUUM;"
# Analyze for query optimization
sqlite3 sessions.db "ANALYZE;"
# Rebuild indexes
sqlite3 sessions.db "REINDEX;"
```
Add to crontab (monthly):
```
0 0 1 * * sqlite3 /var/lib/flynn/sessions.db "VACUUM; ANALYZE;" >> /var/log/flynn-maintenance.log 2>&1
```
## Gateway Performance
### Connection Limits
Limit concurrent connections:
```yaml
gateway:
  enabled: true
  port: 18800
  # Maximum concurrent WebSocket connections
  maxConnections: 50
  # Single-client lock
  lock:
    enabled: true # Only one client at a time
```
**For multiple users:**
```yaml
gateway:
  maxConnections: 100
  lock:
    enabled: false
```
### Lane Queue
The lane queue serializes requests per session:
```yaml
gateway:
  laneQueue:
    # Max requests per session
    maxDepth: 10
    # Request timeout
    requestTimeoutMs: 30000
```
### WebSocket Optimization
Configure WebSocket for performance:
```typescript
// Gateway server WebSocket options
const wsOptions = {
  // Enable compression for messages larger than 1KB
  perMessageDeflate: {
    threshold: 1024
  },
  // Track connected clients (needed to ping them for heartbeats)
  clientTracking: true,
  // Maximum message size
  maxPayload: 16 * 1024 * 1024 // 16MB
};
```
### HTTP Server
Optimize HTTP server for static files:
```yaml
gateway:
  static:
    # Enable gzip compression
    gzip: true
    # Cache static assets
    cacheControl: 'public, max-age=3600'
    # Serve index.html for SPA routes
    spa: true
```
## Resource Usage
### Node.js Options
Tune Node.js for production:
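```bash
# Increase the old-generation heap limit (value in MB)
export NODE_OPTIONS="--max-old-space-size=4096"
# V8-only flags such as --optimize-for-size are rejected in NODE_OPTIONS;
# pass them on the node command line instead.
# Avoid --gc-interval in production: it forces a GC every N allocations.
node --max-old-space-size=4096 --optimize-for-size dist/cli/index.js start
```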
In systemd service:
```ini
Environment="NODE_OPTIONS=--max-old-space-size=4096"
```
### Process Limits
Set appropriate limits:
```ini
[Service]
# Memory limit (2GB); MemoryMax= replaces the deprecated MemoryLimit=
MemoryMax=2G
MemorySwapMax=0
# CPU quota (200% = 2 cores)
CPUQuota=200%
# File descriptors
LimitNOFILE=65536
```
### Docker Resource Limits
Constrain Docker container:
```yaml
services:
  flynn:
    deploy:
      resources:
        limits:
          cpus: '2.0'
          memory: 2G
        reservations:
          cpus: '1.0'
          memory: 1G
```
### Memory Monitoring
Monitor memory usage:
```bash
# Check Flynn memory
ps aux | grep flynn
# System memory
free -h
# Node.js heap stats: add this line to application code (it is JavaScript, not shell):
#   console.log('Heap used:', process.memoryUsage().heapUsed / 1024 / 1024, 'MB');
```
## Monitoring & Profiling
### Health Checks
Enable gateway health endpoint:
```yaml
automation:
  heartbeat:
    enabled: true
    interval: '5m'
    checks:
      - 'gateway'
      - 'model'
      - 'channels'
      - 'memory'
      - 'disk'
```
Check health:
```bash
curl http://localhost:18800/health
```
### Logging Levels
Configure logging appropriately:
```yaml
logging:
  level: 'info' # debug | info | warn | error
```
**Development:** `debug` - All messages
**Production:** `info` - Normal operation
**Minimal:** `warn` - Only warnings and errors
### Performance Metrics
Track key metrics:
```typescript
// Future: Metrics collection
interface Metrics {
  // Response times
  avgResponseTime: number;
  p95ResponseTime: number;
  p99ResponseTime: number;
  // Throughput
  requestsPerSecond: number;
  concurrentSessions: number;
  // Token usage
  avgInputTokens: number;
  avgOutputTokens: number;
  totalTokens: number;
  // Errors
  errorRate: number;
  timeoutRate: number;
}
```
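
Since metrics collection is not yet implemented, the response-time fields can be approximated from raw latency samples in the meantime. The sketch below uses the nearest-rank percentile definition, which is one of several common definitions. The helpers `percentile` and `summarizeResponseTimes` are hypothetical; only the field names come from the interface above.

```typescript
// Nearest-rank percentile: the smallest sample such that at least
// p percent of all samples are less than or equal to it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error('percentile requires at least one sample');
  const sorted = [...samples].sort((a, b) => a - b);
  // Multiply before dividing to avoid floating-point drift in the rank.
  const rank = Math.max(1, Math.ceil((p * sorted.length) / 100));
  return sorted[rank - 1];
}

// Summarize response times (any unit, e.g. milliseconds).
function summarizeResponseTimes(samples: number[]): {
  avgResponseTime: number;
  p95ResponseTime: number;
  p99ResponseTime: number;
} {
  const total = samples.reduce((sum, s) => sum + s, 0);
  return {
    avgResponseTime: total / samples.length,
    p95ResponseTime: percentile(samples, 95),
    p99ResponseTime: percentile(samples, 99),
  };
}
```

Tracking p95 and p99 alongside the average matters because a handful of slow model calls or tool timeouts can leave the average looking healthy while a noticeable share of users still wait far longer than the 5-second goal.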
### Profiling
Profile Node.js execution:
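```bash
# Generate a V8 tick profile (writes isolate-*.log)
node --prof dist/cli/index.js start
# Process the tick profile into a text summary
node --prof-process isolate-*.log > profile.txt
# For Chrome DevTools, record a .cpuprofile instead
node --cpu-prof dist/cli/index.js start
# Open DevTools (for example via chrome://inspect) and load the .cpuprofile
```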
### Flamegraphs
Generate flamegraphs for bottleneck analysis:
```bash
# Install 0x
npm install -g 0x
# Run with profiler
0x dist/cli/index.js start
```
## Common Performance Issues
### High Memory Usage
**Symptoms:**
- OOM errors
- Slow garbage collection
- System swapping
**Solutions:**
1. Reduce `keepTurns` in compaction
2. Decrease session TTL
3. Prune old sessions
4. Increase Node.js memory limit
5. Check for memory leaks
### Slow Response Times
**Symptoms:**
- Responses > 10 seconds
- Timeouts
- Poor user experience
**Solutions:**
1. Switch to faster model tier
2. Enable compaction
3. Use local fallbacks
4. Optimize tool timeouts
5. Check network latency
### High CPU Usage
**Symptoms:**
- CPU > 80%
- Slow system
- High latency
**Solutions:**
1. Reduce concurrent sessions
2. Optimize database queries
3. Use efficient embeddings
4. Disable unnecessary features
5. Scale vertically (more CPU)
### Database Locks
**Symptoms:**
- SQLite database locked errors
- Slow writes
- Concurrent access issues
**Solutions:**
1. Enable WAL mode
2. Reduce write frequency
3. Use connection pooling
4. Add appropriate indexes
### Model Rate Limits
**Symptoms:**
- 429 Too Many Requests errors
- Frequent fallbacks
- Increased latency
**Solutions:**
1. Configure retry with exponential backoff
2. Use faster models for delegated tasks
3. Implement request queuing
4. Add local model fallbacks
## Performance Checklist
Before deploying to production, verify:
- [ ] Compaction configured with appropriate threshold
- [ ] Model tiers configured for cost/latency
- [ ] Fallback chains configured
- [ ] Tool timeouts set appropriately
- [ ] Session TTL reasonable for use case
- [ ] SQLite optimized (WAL mode, cache size)
- [ ] Database indexes created
- [ ] Gateway connection limits set
- [ ] Memory limits configured
- [ ] Monitoring enabled
- [ ] Logging level set to `info` or `warn`
- [ ] Health checks working
- [ ] Backup/restore tested
---
For more information:
- [TROUBLESHOOTING.md](../../TROUBLESHOOTING.md)
- [PRODUCTION.md](../deployment/PRODUCTION.md)
- [ARCHITECTURE.md](../../.planning/codebase/ARCHITECTURE.md)