# Performance Tuning Guide

This guide covers performance optimization techniques for Flynn in production environments.

## Table of Contents

- [Overview](#overview)
- [Context Management](#context-management)
- [Model Routing](#model-routing)
- [Tool Execution](#tool-execution)
- [Memory & Embeddings](#memory--embeddings)
- [Session Management](#session-management)
- [Database Performance](#database-performance)
- [Gateway Performance](#gateway-performance)
- [Resource Usage](#resource-usage)
- [Monitoring & Profiling](#monitoring--profiling)
- [Common Performance Issues](#common-performance-issues)
- [Performance Checklist](#performance-checklist)

## Overview

Flynn's performance depends on several factors:

1. **Context window efficiency**: How efficiently tokens are used
2. **Model selection**: Choosing the right model for each task
3. **Tool execution**: Fast, reliable tool responses
4. **I/O operations**: Database and file system access
5. **Concurrency**: Handling multiple simultaneous requests

### Performance Goals

- **Response time**: < 5 seconds for simple queries
- **Context efficiency**: > 80% token utilization
- **Throughput**: 10-20 concurrent conversations
- **Resource usage**: < 2GB memory, < 50% CPU

## Context Management

### Compaction Settings

Context compaction prevents conversations from exceeding model context windows.

```yaml
agents:
  default:
    compaction:
      # Trigger compaction at 75% of context window
      thresholdPct: 75

      # Keep last 6 turns (user + assistant pairs)
      keepTurns: 6

      # Allow 2048 tokens for summary
      summaryMaxTokens: 2048

      # Preserve high-importance messages
      importanceThreshold: 0.8
```

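As a rough illustration of how `thresholdPct` and `keepTurns` interact, the sketch below decides when to compact and which turns would be summarized. The `CompactionConfig` and `Turn` types and both function names are hypothetical, and the `>=` comparison is an assumption; Flynn's internal logic may differ.

```typescript
// Hypothetical shapes mirroring the YAML keys above; not Flynn's actual types.
interface CompactionConfig {
  thresholdPct: number; // compact once usage reaches this % of the context window
  keepTurns: number;    // most recent user + assistant pairs kept verbatim
}

interface Turn {
  user: string;
  assistant: string;
}

// True once the estimated token usage reaches the configured threshold.
function shouldCompact(usedTokens: number, contextWindow: number, cfg: CompactionConfig): boolean {
  return (usedTokens / contextWindow) * 100 >= cfg.thresholdPct;
}

// Splits history into older turns to summarize and recent turns to keep as-is.
function splitForCompaction(turns: Turn[], cfg: CompactionConfig): { summarize: Turn[]; keep: Turn[] } {
  const cut = Math.max(0, turns.length - cfg.keepTurns);
  return { summarize: turns.slice(0, cut), keep: turns.slice(cut) };
}
```

With the defaults above and a 200,000-token window, compaction would start at 150,000 tokens, and a 8-turn conversation would have its 2 oldest turns summarized.
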
### Tuning Guidelines

**For fast interactions:**

```yaml
thresholdPct: 60 # Compact early
keepTurns: 2 # Minimal history
summaryMaxTokens: 512 # Short summaries
```

**For complex reasoning:**

```yaml
thresholdPct: 85 # Maximize context
keepTurns: 10 # More history
summaryMaxTokens: 4096 # Detailed summaries
```

### Proactive Compaction Signals

Use proactive thresholds to checkpoint before compaction cliffs and emit warning telemetry:

```yaml
compaction:
  proactive:
    enabled: true
    warn_pct: 75
    checkpoint_pct: 85
    auto_compact_pct: 95
    checkpoint_cooldown_ms: 300000
    memory_namespace: session/checkpoints
```

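One way to read these thresholds is as an escalation ladder: warn first, checkpoint next, then compact automatically. The sketch below maps a usage percentage to an action under that reading. `ProactiveAction`, `ProactiveConfig`, and `proactiveAction` are illustrative names, and falling back to a warning while the checkpoint cooldown is active is an assumption rather than documented behavior.

```typescript
type ProactiveAction = 'none' | 'warn' | 'checkpoint' | 'auto_compact';

// Hypothetical shape mirroring the YAML keys above.
interface ProactiveConfig {
  warn_pct: number;
  checkpoint_pct: number;
  auto_compact_pct: number;
  checkpoint_cooldown_ms: number;
}

// Picks the strongest action whose threshold the current usage has reached.
function proactiveAction(
  usagePct: number,
  cfg: ProactiveConfig,
  nowMs: number,
  lastCheckpointMs: number | null
): ProactiveAction {
  if (usagePct >= cfg.auto_compact_pct) return 'auto_compact';
  if (usagePct >= cfg.checkpoint_pct) {
    // Assumed: skip a new checkpoint while the cooldown is still running.
    const coolingDown =
      lastCheckpointMs !== null && nowMs - lastCheckpointMs < cfg.checkpoint_cooldown_ms;
    return coolingDown ? 'warn' : 'checkpoint';
  }
  if (usagePct >= cfg.warn_pct) return 'warn';
  return 'none';
}
```
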
### Context Depth Levels

Control how much context is injected into the system prompt:

```yaml
prompt:
  contextDepth: 'normal' # minimal | normal | detailed | debug
```

- `minimal`: Only basic system prompt
- `normal`: System prompt + basic memory
- `detailed`: Full memory + tool descriptions
- `debug`: Verbose context (development only)

### Token Counting

Flynn uses rule-based token estimation, which is fast but approximate.

**Exact tokenization (more accurate, slower) is not yet available:**

```typescript
// Currently not implemented
// Future: Use tiktoken or similar for exact token counts
```

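To give a feel for what rule-based estimation means, here is a minimal sketch using the common rule of thumb of about four characters per token for English text. The constant, the per-message overhead, and both function names are assumptions for illustration; Flynn's actual rules are not described here and may be more elaborate.

```typescript
// Assumed heuristic: roughly 4 characters per token for English text.
const CHARS_PER_TOKEN = 4;

// Estimates the token count of a single string by length alone.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / CHARS_PER_TOKEN);
}

// Adds a small fixed overhead per message for role markers and separators.
// The default of 4 tokens is a placeholder, not a measured value.
function estimateConversationTokens(messages: string[], perMessageOverhead = 4): number {
  return messages.reduce((sum, m) => sum + estimateTokens(m) + perMessageOverhead, 0);
}
```

An estimator like this runs in constant time per message, which is why it is cheap, but it can drift noticeably for code, non-English text, or long numbers.
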
## Model Routing

### Tier Configuration

Optimize model tiers for cost and latency:

```yaml
models:
  router:
    tiers:
      # Fast, cheap: Quick tasks, delegated calls
      fast: 'anthropic:claude-haiku-4-20250514'

      # Default: General conversation
      default: 'anthropic:claude-sonnet-4-20250514'

      # Complex: Deep reasoning, analysis
      complex: 'anthropic:claude-opus-4-20250514'

      # Fallback: Local models when cloud fails
      local: 'ollama:llama3'
```

### Delegation Tasks

Map delegation tasks to appropriate tiers:

```yaml
agents:
  default:
    delegation:
      tiers:
        compaction: 'fast' # Summarize history
        memoryExtraction: 'fast' # Extract facts
        classification: 'default' # Classify intent
        toolSummarization: 'default' # Summarize tool results
        complexReasoning: 'complex' # Deep analysis
```

### Fallback Chains

Configure fallback chains for resilience:

```yaml
models:
  router:
    # Per-tier fallbacks (for example, the same model on a different provider)
    tierFallbacks:
      default:
        - 'github:claude-sonnet-4-5'
        - 'openai:gpt-4o-mini'

    # Global fallback when all tiers fail
    fallbackChain:
      - 'github:claude-sonnet-4-5'
      - 'local:ollama:llama3'
```

### Retry Configuration

Optimize retry behavior for different scenarios:

```yaml
models:
  router:
    retry:
      # Maximum attempts for transient failures
      maxAttempts: 3

      # Start with 1s delay
      initialDelayMs: 1000

      # Exponential backoff
      multiplier: 2

      # Max 30s between retries
      maxDelayMs: 30000

      # Don't retry errors matching these patterns (auth failures, rate limits)
      nonRetryablePatterns:
        - 'invalid_api_key'
        - 'permission_denied'
        - 'rate_limit_exceeded'
```

**For production reliability:**

```yaml
maxAttempts: 5
initialDelayMs: 500
multiplier: 1.5
maxDelayMs: 60000
```

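To see what a given retry configuration costs in wall-clock time, it can help to compute the delay schedule. The sketch below assumes that `maxAttempts` counts the initial attempt, that each delay is `initialDelayMs * multiplier^n` capped at `maxDelayMs`, and that no jitter is applied. `RetryConfig`, `backoffDelays`, and `isRetryable` are illustrative names, and the substring matching is an assumption about how patterns are applied.

```typescript
// Hypothetical shape mirroring the YAML keys above.
interface RetryConfig {
  maxAttempts: number;
  initialDelayMs: number;
  multiplier: number;
  maxDelayMs: number;
}

// Delays waited between attempts: one fewer than maxAttempts.
function backoffDelays(cfg: RetryConfig): number[] {
  const delays: number[] = [];
  for (let retry = 0; retry < cfg.maxAttempts - 1; retry++) {
    const delay = cfg.initialDelayMs * Math.pow(cfg.multiplier, retry);
    delays.push(Math.min(delay, cfg.maxDelayMs));
  }
  return delays;
}

// Assumed matching rule: an error is non-retryable if its message contains a pattern.
function isRetryable(errorMessage: string, nonRetryablePatterns: string[]): boolean {
  return !nonRetryablePatterns.some((p) => errorMessage.includes(p));
}

// Default settings: waits of 1000 ms, then 2000 ms.
console.log(backoffDelays({ maxAttempts: 3, initialDelayMs: 1000, multiplier: 2, maxDelayMs: 30000 }));
// Production settings: waits of 500, 750, 1125, then 1687.5 ms.
console.log(backoffDelays({ maxAttempts: 5, initialDelayMs: 500, multiplier: 1.5, maxDelayMs: 60000 }));
```

Under these assumptions the production profile retries more often but spends less total time waiting (about 4 seconds) than a naive reading of `maxDelayMs: 60000` might suggest.
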
### Cost Estimation

Monitor token usage and costs:

```typescript
// Example model costs in USD per 1M tokens
const MODEL_COSTS = {
  'anthropic:claude-sonnet-4-20250514': {
    input: 3.0, // $3 per 1M input tokens
    output: 15.0 // $15 per 1M output tokens
  },
  'anthropic:claude-haiku-4-20250514': {
    input: 0.25,
    output: 1.25
  }
};
```

Track usage with `AgentOrchestrator.getUsageStats()`.

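Given a per-model price table like the one above, a cost estimate is a weighted sum of input and output tokens. The sketch below is self-contained and uses the same example prices. `estimateCostUsd` and `EXAMPLE_COSTS` are illustrative names, and the token counts are assumed to come from whatever usage tracking reports; the shape of that data is not specified here.

```typescript
// Prices in USD per 1M tokens (example values copied from the table above).
interface ModelCost {
  input: number;
  output: number;
}

const EXAMPLE_COSTS: Record<string, ModelCost> = {
  'anthropic:claude-sonnet-4-20250514': { input: 3.0, output: 15.0 },
  'anthropic:claude-haiku-4-20250514': { input: 0.25, output: 1.25 }
};

// Returns the estimated cost in USD, or undefined for a model without a price entry.
function estimateCostUsd(model: string, inputTokens: number, outputTokens: number): number | undefined {
  const cost = EXAMPLE_COSTS[model];
  if (cost === undefined) return undefined;
  return (inputTokens / 1_000_000) * cost.input + (outputTokens / 1_000_000) * cost.output;
}
```

Returning `undefined` for unknown models, rather than zero, keeps missing price entries from silently understating spend.
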
## Tool Execution

### Timeout Configuration

Set appropriate timeouts for different tool types:

```yaml
tools:
  executor:
    # Default 30s timeout
    defaultTimeoutMs: 30000

    # Max 50KB output
    maxOutputBytes: 51200
```

**For long-running tools:**

```yaml
tools:
  executor:
    defaultTimeoutMs: 60000 # 60s
```

**For fast tools:**

```yaml
tools:
  executor:
    defaultTimeoutMs: 10000 # 10s
```

### Caching (Future)

Tool result caching for repeated operations is not yet implemented. A planned configuration might look like:

```yaml
# Not yet implemented
tools:
  cache:
    enabled: true
    ttl: 300 # 5 minutes
    maxSize: 1000
    excludePatterns:
      - 'shell.exec'
      - 'process.*'
```

### Sandbox Performance

The Docker sandbox adds overhead. To reduce it:

```yaml
sandbox:
  enabled: true
  image: 'node:22-alpine'

  # Resource limits
  resourceLimits:
    memory: '512m'
    cpus: '0.5'
    timeoutSec: 60

  # Use host networking only if safe (it removes network isolation)
  networkMode: 'host' # Faster than bridge mode
```

**For best performance:**

```yaml
sandbox:
  enabled: false # Disable if not needed
```

### Parallel Tool Execution

Flynn currently executes tools sequentially. Parallel execution of independent tools is a possible future enhancement:

```typescript
// Future enhancement: run independent tool calls concurrently
const results = await Promise.all([
  toolRegistry.execute('tool1', args1),
  toolRegistry.execute('tool2', args2),
  toolRegistry.execute('tool3', args3)
]);
```

## Memory & Embeddings

### Embedding Provider Selection

Choose embedding provider based on latency and cost:

```yaml
memory:
  embeddings:
    provider: 'openai' # openai | gemini | ollama | llamacpp | voyage

    openai:
      apiKey: '${OPENAI_API_KEY}'
      model: 'text-embedding-3-small' # Fastest

    # Alternative: Local embeddings
    ollama:
      host: 'localhost:11434'
      model: 'nomic-embed-text'
```

**Latency comparison:**

- OpenAI `text-embedding-3-small`: ~100ms
- Gemini: ~200ms
- Ollama `nomic-embed-text`: ~500ms (local)
- llama.cpp: ~300ms (local)

### Text Chunking

Optimize chunking for better search:

```yaml
memory:
  embeddings:
    chunking:
      # Smaller chunks for precision
      maxChunkSize: 512

      # Overlap for context preservation
      chunkOverlap: 50

      # Don't chunk small documents
      minChunkSize: 128
```

**For fast indexing:**

```yaml
maxChunkSize: 1024
chunkOverlap: 100
```

**For precise search:**

```yaml
maxChunkSize: 256
chunkOverlap: 25
```

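The sketch below illustrates how sliding-window chunking with overlap behaves under these three settings. It makes several assumptions that are not confirmed by this guide: sizes are measured in characters, documents that fit in one chunk are kept whole, and a trailing chunk shorter than `minChunkSize` is merged into the previous one. `ChunkingConfig` and `chunkText` are illustrative names.

```typescript
// Hypothetical shape mirroring the YAML keys above (sizes assumed to be characters).
interface ChunkingConfig {
  maxChunkSize: number;
  chunkOverlap: number;
  minChunkSize: number;
}

function chunkText(text: string, cfg: ChunkingConfig): string[] {
  if (cfg.chunkOverlap >= cfg.maxChunkSize) {
    throw new Error('chunkOverlap must be smaller than maxChunkSize');
  }
  if (text.length === 0) return [];
  // Small documents are not chunked.
  if (text.length <= cfg.maxChunkSize) return [text];

  // Each window starts (maxChunkSize - chunkOverlap) characters after the last.
  const step = cfg.maxChunkSize - cfg.chunkOverlap;
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += step) {
    const end = start + cfg.maxChunkSize;
    chunks.push(text.slice(start, end));
    if (end >= text.length) break;
  }

  // Assumed rule: fold a too-short tail into the previous chunk,
  // dropping the part that the previous chunk already contains.
  const tail = chunks.pop() as string;
  if (chunks.length > 0 && tail.length < cfg.minChunkSize) {
    const prev = chunks.pop() as string;
    chunks.push(prev + tail.slice(cfg.chunkOverlap));
  } else {
    chunks.push(tail);
  }
  return chunks;
}
```

Larger chunks with the same overlap ratio produce fewer embeddings to compute and store, which is why the fast-indexing profile raises `maxChunkSize`; smaller chunks return tighter matches at the cost of more embedding calls.
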
### Hybrid Search Tuning

Balance keyword and vector search:

```yaml
memory:
  search:
    # Weight vector search higher
    vectorWeight: 0.7
    keywordWeight: 0.3

    # Return top results
    limit: 10

    # Minimum relevance threshold
    threshold: 0.5
```

**For keyword-heavy queries:**

```yaml
vectorWeight: 0.4
keywordWeight: 0.6
```

**For semantic queries:**

```yaml
vectorWeight: 0.8
keywordWeight: 0.2
```

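A common way to combine the two signals is a weighted sum, followed by thresholding and truncation. The sketch below shows that approach under the assumption that both scores are already normalized to the range 0 to 1. `SearchConfig`, `Candidate`, and `rankHybrid` are illustrative names, and Flynn's actual fusion method may differ.

```typescript
// Hypothetical shapes mirroring the YAML keys above.
interface SearchConfig {
  vectorWeight: number;
  keywordWeight: number;
  limit: number;
  threshold: number;
}

interface Candidate {
  id: string;
  vectorScore: number;  // assumed normalized to [0, 1]
  keywordScore: number; // assumed normalized to [0, 1]
}

// Scores each candidate, drops those below the threshold, and returns the top `limit`.
function rankHybrid(candidates: Candidate[], cfg: SearchConfig): { id: string; score: number }[] {
  return candidates
    .map((c) => ({
      id: c.id,
      score: cfg.vectorWeight * c.vectorScore + cfg.keywordWeight * c.keywordScore
    }))
    .filter((r) => r.score >= cfg.threshold)
    .sort((a, b) => b.score - a.score)
    .slice(0, cfg.limit);
}
```

With weights that sum to 1, the combined score stays in the same range as the inputs, so `threshold` keeps a consistent meaning when the weights are retuned.
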
### Embedding Caching

Cache embeddings to avoid recomputation:

```yaml
memory:
  embeddings:
    cache:
      enabled: true
      ttl: 86400 # 24 hours
```

## Session Management

### TTL Configuration

Set appropriate session TTLs:

```yaml
sessions:
  ttl: '7d' # Keep sessions for 7 days

  # Maximum concurrent sessions
  maxSessions: 100
```

**For memory efficiency:**

```yaml
ttl: '1d'
maxSessions: 50
```

**For long-term memory:**

```yaml
ttl: '30d'
maxSessions: 200
```

### Session Pruning

Prune old sessions regularly:

```yaml
automation:
  sessionPruner:
    enabled: true
    interval: '1h' # Run every hour

    # Prune sessions older than TTL
    pruneOlderThan: '7d'
```

### Session Indexing

Optimize session search with indexes:

```sql
-- SQLite indexes (IF NOT EXISTS makes the statements safe to re-run)
CREATE INDEX IF NOT EXISTS idx_sessions_created_at ON sessions(created_at);
CREATE INDEX IF NOT EXISTS idx_sessions_last_active ON sessions(last_active_at);
CREATE INDEX IF NOT EXISTS idx_messages_session_id ON messages(session_id);
```

## Database Performance

### SQLite Configuration

Optimize SQLite for Flynn's workload:

```sql
-- In SQLite connection setup
PRAGMA journal_mode = WAL;    -- Better concurrency
PRAGMA synchronous = NORMAL;  -- Faster writes
PRAGMA cache_size = -64000;   -- ~64MB cache (negative values are in KiB)
PRAGMA temp_store = MEMORY;   -- Store temp data in memory
PRAGMA mmap_size = 268435456; -- 256MB mmap
PRAGMA page_size = 4096;      -- Default page size (set before the database is created)
```

### Connection Pooling

Flynn uses a single SQLite connection per database. For high concurrency, consider:

```typescript
// Future: Connection pool
// Note: better-sqlite3 does not ship a pool; `ConnectionPool` is a
// hypothetical wrapper shown only to illustrate the intended shape.
import Database from 'better-sqlite3';

const pool = new ConnectionPool({
  filename: '/path/to/database.db',
  maxConnections: 10
});
```

### Query Optimization

Use indexed columns in queries:

```typescript
// Good: Uses the index on last_active_at
const recentSessions = db.prepare(`
  SELECT * FROM sessions
  WHERE last_active_at > ?
  ORDER BY last_active_at DESC
  LIMIT 10
`).all(threshold);

// Bad: Full table scan (message_count is not indexed)
const busySessions = db.prepare(`
  SELECT * FROM sessions
  WHERE message_count > ?
`).all(threshold);
```

### Vacuum and Analyze

Regular maintenance improves performance:

```bash
# Vacuum to reclaim space
sqlite3 sessions.db "VACUUM;"

# Analyze for query optimization
sqlite3 sessions.db "ANALYZE;"

# Rebuild indexes
sqlite3 sessions.db "REINDEX;"
```

Add to crontab (monthly):

```
0 0 1 * * sqlite3 /var/lib/flynn/sessions.db "VACUUM; ANALYZE;" >> /var/log/flynn-maintenance.log 2>&1
```

## Gateway Performance

### Connection Limits

Limit concurrent connections:

```yaml
gateway:
  enabled: true
  port: 18800

  # Maximum concurrent WebSocket connections
  maxConnections: 50

  # Single-client lock
  lock:
    enabled: true # Only one client at a time
```

**For multiple users:**

```yaml
gateway:
  maxConnections: 100
  lock:
    enabled: false
```

### Lane Queue

The lane queue serializes requests per session, so requests in the same session run one at a time while different sessions proceed independently:

```yaml
gateway:
  laneQueue:
    # Max queued requests per session
    maxDepth: 10

    # Request timeout
    requestTimeoutMs: 30000
```

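The idea of per-session serialization can be sketched with one promise chain per session. This is an illustration of the technique, not Flynn's implementation: the `LaneQueue` class and its `enqueue` method are hypothetical, `maxDepth` is assumed to count running plus waiting requests, and `requestTimeoutMs` is left out for brevity.

```typescript
// A minimal per-session queue: tasks in one lane run in order, lanes run independently.
class LaneQueue {
  private readonly maxDepth: number;
  private readonly lanes = new Map<string, { tail: Promise<void>; depth: number }>();

  constructor(maxDepth: number) {
    this.maxDepth = maxDepth;
  }

  enqueue<T>(sessionId: string, task: () => Promise<T>): Promise<T> {
    const lane = this.lanes.get(sessionId) ?? { tail: Promise.resolve(), depth: 0 };
    if (lane.depth >= this.maxDepth) {
      return Promise.reject(new Error(`lane ${sessionId} is full`));
    }
    lane.depth++;
    this.lanes.set(sessionId, lane);

    // Run the task only after everything already queued in this lane has settled.
    const result = lane.tail.then(task);

    // The new tail never rejects, so one failed task does not block later ones.
    lane.tail = result
      .then(() => undefined, () => undefined)
      .then(() => {
        lane.depth--;
        if (lane.depth === 0) this.lanes.delete(sessionId);
      });
    return result;
  }
}
```

Rejecting when a lane is full gives callers immediate backpressure instead of letting a slow session accumulate unbounded work.
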
### WebSocket Optimization

Configure WebSocket for performance:

```typescript
// Gateway server WebSocket options (option names as used by the `ws` library)
const wsOptions = {
  // Enable compression for messages of at least 1 KiB
  perMessageDeflate: {
    threshold: 1024
  },

  // Track connected clients (needed to iterate clients for heartbeat pings)
  clientTracking: true,

  // Maximum message size
  maxPayload: 16 * 1024 * 1024 // 16MB
};
```

### HTTP Server

Optimize HTTP server for static files:

```yaml
gateway:
  static:
    # Enable gzip compression
    gzip: true

    # Cache static assets
    cacheControl: 'public, max-age=3600'

    # Serve index.html for SPA routes
    spa: true
```

## Resource Usage

### Node.js Options

Tune Node.js for production:

```bash
# Increase the V8 old-space heap limit (value in MB)
export NODE_OPTIONS="--max-old-space-size=4096"

# Most other V8 flags (e.g. --optimize-for-size) are not accepted in NODE_OPTIONS;
# pass them on the node command line if you need them
node --max-old-space-size=4096 --optimize-for-size dist/cli/index.js start
```

Keep `--max-old-space-size` below any systemd or container memory limit; otherwise the process can be killed before V8 reaches its heap limit.

In systemd service:

```ini
Environment="NODE_OPTIONS=--max-old-space-size=4096"
```

### Process Limits

Set appropriate limits:

```ini
[Service]
# Memory limit (2GB); MemoryMax= replaces the deprecated MemoryLimit=
MemoryMax=2G
MemorySwapMax=0

# CPU quota (200% = 2 cores)
CPUQuota=200%

# File descriptors
LimitNOFILE=65536
```

### Docker Resource Limits

Constrain the Docker container:

```yaml
services:
  flynn:
    deploy:
      resources:
        limits:
          cpus: '2.0'
          memory: 2G
        reservations:
          cpus: '1.0'
          memory: 1G
```

### Memory Monitoring

Monitor memory usage:

```bash
# Check Flynn memory
ps aux | grep flynn

# System memory
free -h

# Node.js heap stats: add this line to the application code (JavaScript, not shell)
# console.log('Heap used:', process.memoryUsage().heapUsed / 1024 / 1024, 'MB');
```

## Monitoring & Profiling

### Health Checks

Enable periodic heartbeat checks:

```yaml
automation:
  heartbeat:
    enabled: true
    interval: '5m'
    checks:
      - 'gateway'
      - 'model'
      - 'channels'
      - 'memory'
      - 'disk'
```

Check health through the gateway endpoint:

```bash
curl http://localhost:18800/health
```

### Logging Levels

Configure logging appropriately:

```yaml
logging:
  level: 'info' # debug | info | warn | error
```

- **Development:** `debug` - All messages
- **Production:** `info` - Normal operation
- **Minimal:** `warn` - Only warnings and errors

### Performance Metrics

Track key metrics:

```typescript
// Future: Metrics collection
interface Metrics {
  // Response times
  avgResponseTime: number;
  p95ResponseTime: number;
  p99ResponseTime: number;

  // Throughput
  requestsPerSecond: number;
  concurrentSessions: number;

  // Token usage
  avgInputTokens: number;
  avgOutputTokens: number;
  totalTokens: number;

  // Errors
  errorRate: number;
  timeoutRate: number;
}
```

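Until built-in metrics collection exists, the response-time fields can be derived from raw latency samples. The sketch below uses the nearest-rank percentile method. `percentile` and `summarizeResponseTimes` are illustrative helpers, not part of Flynn, and the returned field names simply mirror the interface above.

```typescript
// Nearest-rank percentile over an ascending-sorted array; NaN for empty input.
function percentile(sortedAsc: number[], p: number): number {
  if (sortedAsc.length === 0) return NaN;
  const rank = Math.ceil((p * sortedAsc.length) / 100);
  const index = Math.min(sortedAsc.length - 1, Math.max(0, rank - 1));
  return sortedAsc[index];
}

// Computes average, p95, and p99 from response-time samples in milliseconds.
function summarizeResponseTimes(samplesMs: number[]): {
  avgResponseTime: number;
  p95ResponseTime: number;
  p99ResponseTime: number;
} {
  const sorted = [...samplesMs].sort((a, b) => a - b);
  const avg = sorted.length === 0 ? NaN : sorted.reduce((sum, v) => sum + v, 0) / sorted.length;
  return {
    avgResponseTime: avg,
    p95ResponseTime: percentile(sorted, 95),
    p99ResponseTime: percentile(sorted, 99)
  };
}
```

Tail percentiles matter more than the average here: a healthy mean can hide the slow requests that trip the 5-second response-time goal.
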
### Profiling

Profile Node.js execution:

```bash
# Generate a V8 tick profile (writes isolate-*.log)
node --prof dist/cli/index.js start

# Process the tick profile into a text report
node --prof-process isolate-*.log > profile.txt

# For Chrome DevTools, record a .cpuprofile file instead
node --cpu-prof dist/cli/index.js start
# Then load the generated .cpuprofile file in Chrome DevTools
```

### Flamegraphs

Generate flamegraphs for bottleneck analysis:

```bash
# Install 0x
npm install -g 0x

# Run with profiler
0x dist/cli/index.js start
```

## Common Performance Issues

### High Memory Usage

**Symptoms:**

- OOM errors
- Slow garbage collection
- System swapping

**Solutions:**

1. Reduce `keepTurns` in compaction
2. Decrease session TTL
3. Prune old sessions
4. Increase Node.js memory limit
5. Check for memory leaks

### Slow Response Times

**Symptoms:**

- Responses > 10 seconds
- Timeouts
- Poor user experience

**Solutions:**

1. Switch to faster model tier
2. Enable compaction
3. Use local fallbacks
4. Optimize tool timeouts
5. Check network latency

### High CPU Usage

**Symptoms:**

- CPU > 80%
- Slow system
- High latency

**Solutions:**

1. Reduce concurrent sessions
2. Optimize database queries
3. Use efficient embeddings
4. Disable unnecessary features
5. Scale vertically (more CPU)

### Database Locks

**Symptoms:**

- SQLite database locked errors
- Slow writes
- Concurrent access issues

**Solutions:**

1. Enable WAL mode
2. Reduce write frequency
3. Use connection pooling
4. Add appropriate indexes

### Model Rate Limits

**Symptoms:**

- 429 Too Many Requests errors
- Frequent fallbacks
- Increased latency

**Solutions:**

1. Configure retry with exponential backoff
2. Use faster models for delegated tasks
3. Implement request queuing
4. Add local model fallbacks

## Performance Checklist

Before deploying to production, verify:

- [ ] Compaction configured with appropriate threshold
- [ ] Model tiers configured for cost/latency
- [ ] Fallback chains configured
- [ ] Tool timeouts set appropriately
- [ ] Session TTL reasonable for use case
- [ ] SQLite optimized (WAL mode, cache size)
- [ ] Database indexes created
- [ ] Gateway connection limits set
- [ ] Memory limits configured
- [ ] Monitoring enabled
- [ ] Logging level set to `info` or `warn`
- [ ] Health checks working
- [ ] Backup/restore tested

---

For more information:

- [TROUBLESHOOTING.md](../../TROUBLESHOOTING.md)
- [PRODUCTION.md](../deployment/PRODUCTION.md)
- [ARCHITECTURE.md](../../.planning/codebase/ARCHITECTURE.md)