Performance Tuning Guide
This guide covers performance optimization techniques for Flynn in production environments.
Table of Contents
- Overview
- Context Management
- Model Routing
- Tool Execution
- Memory & Embeddings
- Session Management
- Database Performance
- Gateway Performance
- Resource Usage
- Monitoring & Profiling
- Common Performance Issues
- Performance Checklist
Overview
Flynn's performance depends on several factors:
- Context window efficiency: How efficiently tokens are used
- Model selection: Choosing the right model for each task
- Tool execution: Fast, reliable tool responses
- I/O operations: Database and file system access
- Concurrency: Handling multiple simultaneous requests
Performance Goals
- Response time: < 5 seconds for simple queries
- Context efficiency: > 80% token utilization
- Throughput: 10-20 concurrent conversations
- Resource usage: < 2GB memory, < 50% CPU
Context Management
Compaction Settings
Context compaction prevents conversations from exceeding model context windows.
agents:
default:
compaction:
# Trigger compaction at 75% of context window
thresholdPct: 75
# Keep last 6 turns (user + assistant pairs)
keepTurns: 6
# Allow 2048 tokens for summary
summaryMaxTokens: 2048
# Preserve high-importance messages
importanceThreshold: 0.8
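The trigger and the history split described by these settings can be sketched as follows. This is a minimal illustration, not Flynn's implementation: the `CompactionConfig` and `Turn` types and both function names are hypothetical, and it assumes compaction fires once usage reaches `thresholdPct` (inclusive).

```typescript
// Hypothetical types mirroring the compaction keys shown above.
interface CompactionConfig {
  thresholdPct: number;
  keepTurns: number;
}

interface Turn {
  user: string;
  assistant: string;
}

// True once the conversation has used at least thresholdPct of the context window.
// Assumption: the comparison is inclusive.
function shouldCompact(usedTokens: number, contextWindow: number, cfg: CompactionConfig): boolean {
  return (usedTokens / contextWindow) * 100 >= cfg.thresholdPct;
}

// Split history into older turns to summarize and recent turns to keep verbatim.
function splitForCompaction(turns: Turn[], cfg: CompactionConfig): { toSummarize: Turn[]; kept: Turn[] } {
  const cut = Math.max(0, turns.length - cfg.keepTurns);
  return { toSummarize: turns.slice(0, cut), kept: turns.slice(cut) };
}
```

With `thresholdPct: 75` and a 200,000-token window, compaction would start at 150,000 tokens; with `keepTurns: 6`, everything except the last six turns is handed to the summarizer.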
Tuning Guidelines
For fast interactions:
thresholdPct: 60 # Compact early
keepTurns: 2 # Minimal history
summaryMaxTokens: 512 # Short summaries
For complex reasoning:
thresholdPct: 85 # Maximize context
keepTurns: 10 # More history
summaryMaxTokens: 4096 # Detailed summaries
Proactive Compaction Signals
Use proactive thresholds to checkpoint before compaction cliffs and emit warning telemetry:
compaction:
proactive:
enabled: true
warn_pct: 75
checkpoint_pct: 85
auto_compact_pct: 95
checkpoint_cooldown_ms: 300000
memory_namespace: session/checkpoints
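One way to read these thresholds is as a ladder of signals. The sketch below is an assumption about the semantics, not Flynn's code: `proactiveSignal` and the `ProactiveSignal` type are hypothetical names, and it assumes a checkpoint requested during the cooldown is downgraded to a warning.

```typescript
type ProactiveSignal = 'none' | 'warn' | 'checkpoint' | 'auto_compact';

// Field names copied from the config keys above.
interface ProactiveConfig {
  warn_pct: number;
  checkpoint_pct: number;
  auto_compact_pct: number;
  checkpoint_cooldown_ms: number;
}

// Map current context usage (0-100) to the strongest applicable signal.
function proactiveSignal(
  usagePct: number,
  cfg: ProactiveConfig,
  msSinceLastCheckpoint: number,
): ProactiveSignal {
  if (usagePct >= cfg.auto_compact_pct) return 'auto_compact';
  if (usagePct >= cfg.checkpoint_pct) {
    // Assumption: inside the cooldown window we only warn instead of checkpointing again.
    return msSinceLastCheckpoint >= cfg.checkpoint_cooldown_ms ? 'checkpoint' : 'warn';
  }
  if (usagePct >= cfg.warn_pct) return 'warn';
  return 'none';
}
```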
Context Depth Levels
Control how much context is injected into the system prompt:
prompt:
contextDepth: 'normal' # minimal | normal | detailed | debug
- minimal: Only basic system prompt
- normal: System prompt + basic memory
- detailed: Full memory + tool descriptions
- debug: Verbose context (development only)
Token Counting
Flynn estimates token counts with a rule-based heuristic, which is fast but approximate.
Exact tokenizer-based counting (more accurate, but slower) is not available yet:
// Currently not implemented
// Future: Use tiktoken or similar for exact token counts
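For a sense of what rule-based estimation looks like, here is a minimal sketch using the common rule of thumb of roughly four characters per token for English text. The ratio and the `estimateTokens` name are assumptions for illustration; Flynn's actual rules are not documented here and may differ.

```typescript
// Rough heuristic: about 4 characters per token for English text.
// This ratio is an assumption, not Flynn's actual rule set.
function estimateTokens(text: string, charsPerToken: number = 4): number {
  if (text.length === 0) return 0;
  return Math.ceil(text.length / charsPerToken);
}
```

Estimates like this can be off by a wide margin for code, non-English text, or long identifiers, which is one reason to keep `thresholdPct` comfortably below 100.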
Model Routing
Tier Configuration
Optimize model tiers for cost and latency:
models:
router:
tiers:
# Fast, cheap: Quick tasks, delegated calls
fast: 'anthropic:claude-haiku-4-20250514'
# Default: General conversation
default: 'anthropic:claude-sonnet-4-20250514'
# Complex: Deep reasoning, analysis
complex: 'anthropic:claude-opus-4-20250514'
# Fallback: Local models when cloud fails
local: 'ollama:llama3'
Delegation Tasks
Map delegation tasks to appropriate tiers:
agents:
default:
delegation:
tiers:
compaction: 'fast' # Summarize history
memoryExtraction: 'fast' # Extract facts
classification: 'default' # Classify intent
toolSummarization: 'default' # Summarize tool results
complexReasoning: 'complex' # Deep analysis
Fallback Chains
Configure fallback chains for resilience:
models:
router:
# Try same model on different provider
tierFallbacks:
default:
- 'github:claude-sonnet-4-5'
- 'openai:gpt-4o-mini'
# Global fallback when all tiers fail
fallbackChain:
- 'github:claude-sonnet-4-5'
- 'local:ollama:llama3'
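The order in which models are tried can be sketched as a flat candidate list. This assumes the router tries the tier's own model first, then that tier's `tierFallbacks`, then the global `fallbackChain`, skipping duplicates; the `RouterConfig` type and `candidateModels` function are hypothetical.

```typescript
// Hypothetical shape mirroring the router keys shown above.
interface RouterConfig {
  tiers: Record<string, string>;
  tierFallbacks?: Record<string, string[]>;
  fallbackChain?: string[];
}

// Ordered, de-duplicated list of models to try for a tier.
// Assumption: tier model, then tier fallbacks, then the global chain.
function candidateModels(tier: string, cfg: RouterConfig): string[] {
  const ordered: string[] = [];
  const primary = cfg.tiers[tier];
  if (typeof primary === 'string') ordered.push(primary);
  const tierList = (cfg.tierFallbacks && cfg.tierFallbacks[tier]) || [];
  const all = ordered.concat(tierList, cfg.fallbackChain || []);
  // Keep only the first occurrence of each model.
  return all.filter((model, index) => all.indexOf(model) === index);
}
```

With the configuration above, the `default` tier would try the Anthropic Sonnet model, then `github:claude-sonnet-4-5`, then `openai:gpt-4o-mini`, and finally `local:ollama:llama3`.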
Retry Configuration
Optimize retry behavior for different scenarios:
models:
router:
retry:
# More retries for transient failures
maxAttempts: 3
# Start with 1s delay
initialDelayMs: 1000
# Exponential backoff
multiplier: 2
# Max 30s between retries
maxDelayMs: 30000
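# Don't retry errors matching these patterns (auth failures; note that
# listing 'rate_limit_exceeded' also disables backoff for rate-limit errors)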
nonRetryablePatterns:
- 'invalid_api_key'
- 'permission_denied'
- 'rate_limit_exceeded'
For production reliability:
maxAttempts: 5
initialDelayMs: 500
multiplier: 1.5
maxDelayMs: 60000
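To see what a retry configuration costs in waiting time, the delays can be computed directly. This sketch assumes `maxAttempts` counts the first call (so there are `maxAttempts - 1` waits), that patterns are matched as substrings of the error message, and that no jitter is applied; `retryDelays` and `isRetryable` are hypothetical helper names.

```typescript
// Field names copied from the retry config above.
interface RetryConfig {
  maxAttempts: number;
  initialDelayMs: number;
  multiplier: number;
  maxDelayMs: number;
}

// Delay before each retry: initialDelayMs * multiplier^n, capped at maxDelayMs.
function retryDelays(cfg: RetryConfig): number[] {
  const delays: number[] = [];
  for (let retry = 0; retry < cfg.maxAttempts - 1; retry++) {
    const raw = cfg.initialDelayMs * Math.pow(cfg.multiplier, retry);
    delays.push(Math.min(raw, cfg.maxDelayMs));
  }
  return delays;
}

// Assumption: a pattern matches when it appears anywhere in the error message.
function isRetryable(errorMessage: string, nonRetryablePatterns: string[]): boolean {
  return !nonRetryablePatterns.some((pattern) => errorMessage.indexOf(pattern) !== -1);
}
```

Under these assumptions the default settings wait 1000 ms and then 2000 ms, while the production settings wait 500, 750, 1125, and 1687.5 ms, so neither profile comes close to its `maxDelayMs` cap.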
Cost Estimation
Monitor token usage and costs:
// Model costs (examples)
const MODEL_COSTS = {
'anthropic:claude-sonnet-4-20250514': {
input: 3.0, // $3 per 1M input tokens
output: 15.0 // $15 per 1M output tokens
},
'anthropic:claude-haiku-4-20250514': {
input: 0.25,
output: 1.25
}
};
Track usage with AgentOrchestrator.getUsageStats().
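Given per-million-token prices like those above, a per-request estimate is a short calculation. The sketch below is illustrative: `EXAMPLE_COSTS` copies the example prices from this guide, and `estimateCostUsd` is a hypothetical helper, not part of `AgentOrchestrator`.

```typescript
// Prices in USD per 1M tokens, copied from the example table above.
const EXAMPLE_COSTS: Record<string, { input: number; output: number }> = {
  'anthropic:claude-sonnet-4-20250514': { input: 3.0, output: 15.0 },
  'anthropic:claude-haiku-4-20250514': { input: 0.25, output: 1.25 },
};

// Returns the estimated cost in USD, or undefined for a model with no price entry.
function estimateCostUsd(model: string, inputTokens: number, outputTokens: number): number | undefined {
  const price = EXAMPLE_COSTS[model];
  if (!price) return undefined;
  return (inputTokens * price.input + outputTokens * price.output) / 1000000;
}
```

For example, a request with 200,000 input tokens and 10,000 output tokens costs about $0.75 at the Sonnet example prices and about $0.0625 at the Haiku example prices, which is the motivation for routing delegated tasks to the fast tier.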
Tool Execution
Timeout Configuration
Set appropriate timeouts for different tool types:
tools:
executor:
# Default 30s timeout
defaultTimeoutMs: 30000
# Max 50KB output
maxOutputBytes: 51200
For long-running tools:
tools:
executor:
defaultTimeoutMs: 60000 # 60s
For fast tools:
tools:
executor:
defaultTimeoutMs: 10000 # 10s
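The effect of `defaultTimeoutMs` can be sketched as a wrapper that rejects when a tool call takes too long. `withTimeout` is a hypothetical helper, not Flynn's executor; note that it only stops waiting and does not cancel the underlying work, which a real executor would also have to do (for example by killing the child process).

```typescript
// Reject if the wrapped call does not settle within timeoutMs.
// The underlying work is not cancelled; this only stops waiting for it.
function withTimeout<T>(work: Promise<T>, timeoutMs: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => {
      reject(new Error(`Tool timed out after ${timeoutMs}ms`));
    }, timeoutMs);
    work.then(
      (value) => {
        clearTimeout(timer);
        resolve(value);
      },
      (err) => {
        clearTimeout(timer);
        reject(err);
      },
    );
  });
}
```

A call such as `withTimeout(runTool(args), 30000)` (where `runTool` stands in for any function returning a promise) would then surface a timeout error after 30 seconds instead of hanging the conversation.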
Caching (Future)
Caching results of repeated tool operations is planned. A possible configuration:
# Not yet implemented
tools:
cache:
enabled: true
ttl: 300 # 5 minutes
maxSize: 1000
excludePatterns:
- 'shell.exec'
- 'process.*'
Sandbox Performance
The Docker sandbox adds overhead to tool execution. To reduce it:
sandbox:
enabled: true
image: 'node:22-alpine'
# Resource limits
resourceLimits:
memory: '512m'
cpus: '0.5'
timeoutSec: 60
# Use host networking only if tools are trusted (it removes network isolation)
networkMode: 'host' # Faster than bridge mode
For best performance:
sandbox:
enabled: false # Disable if not needed
Parallel Tool Execution
Flynn currently executes tools sequentially. Independent tool calls could run in parallel:
// Future enhancement
const results = await Promise.all([
toolRegistry.execute('tool1', args1),
toolRegistry.execute('tool2', args2),
toolRegistry.execute('tool3', args3)
]);
Memory & Embeddings
Embedding Provider Selection
Choose embedding provider based on latency and cost:
memory:
embeddings:
provider: 'openai' # openai | gemini | ollama | llamacpp | voyage
openai:
apiKey: '${OPENAI_API_KEY}'
model: 'text-embedding-3-small' # Fastest
# Alternative: Local embeddings
ollama:
host: 'localhost:11434'
model: 'nomic-embed-text'
Latency comparison:
- OpenAI text-embedding-3-small: ~100ms
- Gemini: ~200ms
- Ollama nomic-embed-text: ~500ms (local)
- llama.cpp: ~300ms (local)
Text Chunking
Optimize chunking for better search:
memory:
embeddings:
chunking:
# Smaller chunks for precision
maxChunkSize: 512
# Overlap for context preservation
chunkOverlap: 50
# Don't chunk small documents
minChunkSize: 128
For fast indexing:
maxChunkSize: 1024
chunkOverlap: 100
For precise search:
maxChunkSize: 256
chunkOverlap: 25
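How `maxChunkSize` and `chunkOverlap` interact can be seen in a sliding-window sketch. This is an illustration under stated assumptions: sizes are measured in characters (the unit is not specified above), `chunkText` is a hypothetical function, and `minChunkSize` is left out because its exact semantics are not described here.

```typescript
// Sliding-window chunking: each chunk starts (maxChunkSize - chunkOverlap)
// characters after the previous one, so neighbours share chunkOverlap characters.
// Assumption: sizes are in characters.
function chunkText(text: string, maxChunkSize: number, chunkOverlap: number): string[] {
  if (chunkOverlap < 0 || chunkOverlap >= maxChunkSize) {
    throw new Error('chunkOverlap must be >= 0 and smaller than maxChunkSize');
  }
  const chunks: string[] = [];
  const step = maxChunkSize - chunkOverlap;
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + maxChunkSize));
    // Stop once a window has reached the end of the text.
    if (start + maxChunkSize >= text.length) break;
  }
  return chunks;
}
```

Under these assumptions a 1,000-character document yields three chunks with `maxChunkSize: 512` and `chunkOverlap: 50`, but five chunks with the precise-search settings (256 and 25), so smaller chunks mean more embedding calls per document.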
Hybrid Search Tuning
Balance keyword and vector search:
memory:
search:
# Weight vector search higher
vectorWeight: 0.7
keywordWeight: 0.3
# Return top results
limit: 10
# Minimum relevance threshold
threshold: 0.5
For keyword-heavy queries:
vectorWeight: 0.4
keywordWeight: 0.6
For semantic queries:
vectorWeight: 0.8
keywordWeight: 0.2
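The weights can be read as a linear blend of two scores. The sketch below assumes both scores are already normalized to the range 0 to 1 and that `threshold` applies to the blended score; the `Candidate` type and `rankHybrid` function are hypothetical.

```typescript
// Field names copied from the search config above.
interface SearchConfig {
  vectorWeight: number;
  keywordWeight: number;
  limit: number;
  threshold: number;
}

// Assumption: both scores are normalized to [0, 1] before blending.
interface Candidate {
  id: string;
  vectorScore: number;
  keywordScore: number;
}

// Blend the two scores, drop weak matches, and return the top results.
function rankHybrid(candidates: Candidate[], cfg: SearchConfig): { id: string; score: number }[] {
  return candidates
    .map((c) => ({
      id: c.id,
      score: cfg.vectorWeight * c.vectorScore + cfg.keywordWeight * c.keywordScore,
    }))
    .filter((r) => r.score >= cfg.threshold)
    .sort((a, b) => b.score - a.score)
    .slice(0, cfg.limit);
}
```

A document with a strong vector score but a weak keyword score ranks first under the 0.7/0.3 weights and can fall below the threshold entirely under 0.4/0.6, so it is worth re-checking `threshold` whenever the weights change.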
Embedding Caching
Cache embeddings to avoid recomputation:
memory:
embeddings:
cache:
enabled: true
ttl: 86400 # 24 hours
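A TTL cache of this kind can be sketched in a few lines. This is an in-memory illustration, not Flynn's cache: the `EmbeddingCache` class is hypothetical, it keys entries by the raw text (a real cache would more likely key by a hash of model name plus text), and it takes an injectable clock so expiry is easy to test.

```typescript
// Minimal in-memory TTL cache for embedding vectors (illustrative only).
class EmbeddingCache {
  private readonly entries = new Map<string, { vector: number[]; expiresAt: number }>();
  private readonly ttlMs: number;
  private readonly now: () => number;

  // ttlSeconds matches the unit of the ttl setting above; now() is injectable for tests.
  constructor(ttlSeconds: number, now: () => number = () => Date.now()) {
    this.ttlMs = ttlSeconds * 1000;
    this.now = now;
  }

  get(text: string): number[] | undefined {
    const hit = this.entries.get(text);
    if (!hit) return undefined;
    if (this.now() >= hit.expiresAt) {
      // Expired: evict lazily on read.
      this.entries.delete(text);
      return undefined;
    }
    return hit.vector;
  }

  set(text: string, vector: number[]): void {
    this.entries.set(text, { vector, expiresAt: this.now() + this.ttlMs });
  }
}
```

Callers would check `get(text)` before calling the embedding provider and `set(text, vector)` afterwards; an unbounded map like this one also needs a size limit before production use.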
Session Management
TTL Configuration
Set appropriate session TTLs:
sessions:
ttl: '7d' # Keep sessions for 7 days
# Maximum concurrent sessions
maxSessions: 100
For memory efficiency:
ttl: '1d'
maxSessions: 50
For long-term memory:
ttl: '30d'
maxSessions: 200
Session Pruning
Prune old sessions regularly:
automation:
sessionPruner:
enabled: true
interval: '1h' # Run every hour
# Prune sessions older than TTL
pruneOlderThan: '7d'
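The pruning decision itself is a cutoff comparison. The sketch below assumes durations are written as an integer followed by `s`, `m`, `h`, or `d` (matching the `'7d'` and `'1h'` values above) and that age is measured from last activity; `parseDurationMs`, `SessionRow`, and `expiredSessionIds` are hypothetical names.

```typescript
const UNIT_MS: Record<string, number> = { s: 1000, m: 60000, h: 3600000, d: 86400000 };

// Parse strings like '7d' or '1h' into milliseconds.
// Assumption: the format is an integer followed by one of s, m, h, d.
function parseDurationMs(value: string): number {
  const match = /^(\d+)([smhd])$/.exec(value.trim());
  if (!match) throw new Error(`Unsupported duration: ${value}`);
  return Number(match[1]) * UNIT_MS[match[2]];
}

interface SessionRow {
  id: string;
  lastActiveAt: number; // epoch milliseconds
}

// IDs of sessions whose last activity is older than the cutoff.
function expiredSessionIds(sessions: SessionRow[], pruneOlderThan: string, now: number): string[] {
  const cutoff = now - parseDurationMs(pruneOlderThan);
  return sessions.filter((s) => s.lastActiveAt < cutoff).map((s) => s.id);
}
```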
Session Indexing
Optimize session search with indexes:
-- SQLite indexes
CREATE INDEX idx_sessions_created_at ON sessions(created_at);
CREATE INDEX idx_sessions_last_active ON sessions(last_active_at);
CREATE INDEX idx_messages_session_id ON messages(session_id);
Database Performance
SQLite Configuration
Optimize SQLite for Flynn's workload:
-- In SQLite connection setup
PRAGMA journal_mode = WAL; -- Better concurrency
PRAGMA synchronous = NORMAL; -- Faster writes
PRAGMA cache_size = -64000; -- 64MB cache
PRAGMA temp_store = MEMORY; -- Store temp data in memory
PRAGMA mmap_size = 268435456; -- 256MB mmap
PRAGMA page_size = 4096; -- Default page size; only takes effect if set before the database is created
Connection Pooling
Flynn uses a single SQLite connection per database. For high concurrency, consider:
// Future: Connection pool
// Note: ConnectionPool is a placeholder; better-sqlite3 does not provide one
import Database from 'better-sqlite3';
const pool = new ConnectionPool({
filename: '/path/to/database.db',
maxConnections: 10
});
Query Optimization
Use indexed columns in queries:
// Good: Uses index
const recentSessions = db.prepare(`
SELECT * FROM sessions
WHERE last_active_at > ?
ORDER BY last_active_at DESC
LIMIT 10
`).all(threshold);
// Bad: Full table scan (message_count has no index)
const busySessions = db.prepare(`
SELECT * FROM sessions
WHERE message_count > ?
`).all(threshold);
Vacuum and Analyze
Regular maintenance improves performance:
# Vacuum to reclaim space
sqlite3 sessions.db "VACUUM;"
# Analyze for query optimization
sqlite3 sessions.db "ANALYZE;"
# Rebuild indexes
sqlite3 sessions.db "REINDEX;"
Add to crontab (monthly):
0 0 1 * * sqlite3 /var/lib/flynn/sessions.db "VACUUM; ANALYZE;" >> /var/log/flynn-maintenance.log 2>&1
Gateway Performance
Connection Limits
Limit concurrent connections:
gateway:
enabled: true
port: 18800
# Maximum concurrent WebSocket connections
maxConnections: 50
# Single-client lock
lock:
enabled: true # Only one client at a time
For multiple users:
gateway:
maxConnections: 100
lock:
enabled: false
Lane Queue
The lane queue serializes requests per session:
gateway:
laneQueue:
# Max requests per session
maxDepth: 10
# Request timeout
requestTimeoutMs: 30000
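Per-session serialization with a depth limit can be sketched by chaining promises. This is an illustration of the idea, not the gateway's implementation: the `LaneQueue` class is hypothetical, it assumes `maxDepth` counts the running request as well as the waiting ones, and it does not model `requestTimeoutMs`.

```typescript
// Hypothetical per-session lane queue; names are illustrative.
class LaneQueue {
  private readonly tails = new Map<string, Promise<unknown>>();
  private readonly depths = new Map<string, number>();
  private readonly maxDepth: number;

  constructor(maxDepth: number) {
    this.maxDepth = maxDepth;
  }

  // Run task after all earlier tasks for the same session have settled.
  enqueue<T>(sessionId: string, task: () => Promise<T>): Promise<T> {
    const depth = this.depths.get(sessionId) ?? 0;
    if (depth >= this.maxDepth) {
      return Promise.reject(new Error(`Lane ${sessionId} is full (maxDepth=${this.maxDepth})`));
    }
    this.depths.set(sessionId, depth + 1);

    const tail: Promise<unknown> = this.tails.get(sessionId) ?? Promise.resolve();
    const run = tail.then(task);
    const release = (): void => {
      const remaining = (this.depths.get(sessionId) ?? 1) - 1;
      if (remaining <= 0) {
        this.depths.delete(sessionId);
        this.tails.delete(sessionId);
      } else {
        this.depths.set(sessionId, remaining);
      }
    };
    // The stored tail never rejects, so one failed request does not block the lane.
    this.tails.set(sessionId, run.then(release, release));
    return run;
  }
}
```

Requests for different sessions use different lanes and therefore still run concurrently; only requests within one session wait for each other.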
WebSocket Optimization
Configure WebSocket for performance:
// Gateway server WebSocket options
const wsOptions = {
// Enable compression
perMessageDeflate: {
threshold: 1024
},
// Track connected clients (lets the server iterate them for heartbeat pings)
clientTracking: true,
// Maximum message size
maxPayload: 16 * 1024 * 1024 // 16MB
};
HTTP Server
Optimize HTTP server for static files:
gateway:
static:
# Enable gzip compression
gzip: true
# Cache static assets
cacheControl: 'public, max-age=3600'
# Serve index.html for SPA routes
spa: true
Resource Usage
Node.js Options
Tune Node.js for production:
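# Increase the V8 heap limit to 4 GB
export NODE_OPTIONS="--max-old-space-size=4096"
# V8 tuning flags such as --optimize-for-size are rejected in NODE_OPTIONS; pass them on the command line
node --max-old-space-size=4096 --optimize-for-size dist/cli/index.js start
# Avoid --gc-interval in production: it forces a collection every N allocations and is meant for GC stress testing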
In systemd service:
Environment="NODE_OPTIONS=--max-old-space-size=4096"
Process Limits
Set appropriate limits:
[Service]
# Memory limit (2GB); MemoryMax replaces the deprecated MemoryLimit on cgroup v2
MemoryMax=2G
MemorySwapMax=0
# CPU quota (200% = 2 cores)
CPUQuota=200%
# File descriptors
LimitNOFILE=65536
Docker Resource Limits
Constrain Docker container:
services:
flynn:
deploy:
resources:
limits:
cpus: '2.0'
memory: 2G
reservations:
cpus: '1.0'
memory: 1G
Memory Monitoring
Monitor memory usage:
# Check Flynn memory
ps aux | grep flynn
# System memory
free -h
# Node.js heap stats (add to code)
console.log('Heap used:', process.memoryUsage().heapUsed / 1024 / 1024, 'MB');
Monitoring & Profiling
Health Checks
Enable gateway health endpoint:
automation:
heartbeat:
enabled: true
interval: '5m'
checks:
- 'gateway'
- 'model'
- 'channels'
- 'memory'
- 'disk'
Check health:
curl http://localhost:18800/health
Logging Levels
Configure logging appropriately:
logging:
level: 'info' # debug | info | warn | error
- Development: debug - All messages
- Production: info - Normal operation
- Minimal: warn - Only warnings and errors
Performance Metrics
Track key metrics:
// Future: Metrics collection
interface Metrics {
// Response times
avgResponseTime: number;
p95ResponseTime: number;
p99ResponseTime: number;
// Throughput
requestsPerSecond: number;
concurrentSessions: number;
// Token usage
avgInputTokens: number;
avgOutputTokens: number;
totalTokens: number;
// Errors
errorRate: number;
timeoutRate: number;
}
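Until metrics collection exists, the latency fields above can be derived from raw samples. The sketch below uses the nearest-rank percentile definition, which is one of several common definitions; `percentile` and `mean` are hypothetical helpers.

```typescript
// Nearest-rank percentile: the smallest sample such that at least p% of
// samples are less than or equal to it. p is in the range (0, 100].
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error('percentile of empty sample set');
  const sorted = samples.slice().sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(0, rank - 1)];
}

function mean(samples: number[]): number {
  if (samples.length === 0) throw new Error('mean of empty sample set');
  return samples.reduce((sum, value) => sum + value, 0) / samples.length;
}
```

Percentiles matter here because a single slow model call can dominate the average: for response times of 80, 95, 120, 300, and 2000 ms the mean is 519 ms, while the median is 120 ms and p95 is 2000 ms.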
Profiling
Profile Node.js execution:
# Generate CPU profile
node --prof dist/cli/index.js start
# Process profile
node --prof-process isolate-*.log > profile.txt
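# For a profile that Chrome DevTools can load, record with --cpu-prof instead
node --cpu-prof dist/cli/index.js start
# Then open chrome://inspect, open the dedicated DevTools for Node, and load the .cpuprofile file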
Flamegraphs
Generate flamegraphs for bottleneck analysis:
# Install 0x
npm install -g 0x
# Run with profiler
0x dist/cli/index.js start
Common Performance Issues
High Memory Usage
Symptoms:
- OOM errors
- Slow garbage collection
- System swapping
Solutions:
- Reduce keepTurns in compaction
- Decrease session TTL
- Prune old sessions
- Increase Node.js memory limit
- Check for memory leaks
Slow Response Times
Symptoms:
- Responses > 10 seconds
- Timeouts
- Poor user experience
Solutions:
- Switch to faster model tier
- Enable compaction
- Use local fallbacks
- Optimize tool timeouts
- Check network latency
High CPU Usage
Symptoms:
- CPU > 80%
- Slow system
- High latency
Solutions:
- Reduce concurrent sessions
- Optimize database queries
- Use efficient embeddings
- Disable unnecessary features
- Scale vertically (more CPU)
Database Locks
Symptoms:
- SQLite database locked errors
- Slow writes
- Concurrent access issues
Solutions:
- Enable WAL mode
- Reduce write frequency
- Use connection pooling
- Add appropriate indexes
Model Rate Limits
Symptoms:
- 429 Too Many Requests errors
- Frequent fallbacks
- Increased latency
Solutions:
- Configure retry with exponential backoff
- Use faster models for delegated tasks
- Implement request queuing
- Add local model fallbacks
Performance Checklist
Before deploying to production, verify:
- Compaction configured with appropriate threshold
- Model tiers configured for cost/latency
- Fallback chains configured
- Tool timeouts set appropriately
- Session TTL reasonable for use case
- SQLite optimized (WAL mode, cache size)
- Database indexes created
- Gateway connection limits set
- Memory limits configured
- Monitoring enabled
- Logging level set to info or warn
- Health checks working
- Backup/restore tested