# Performance Tuning Guide

This guide covers performance optimization techniques for Flynn in production environments.

## Table of Contents

- [Overview](#overview)
- [Context Management](#context-management)
- [Model Routing](#model-routing)
- [Tool Execution](#tool-execution)
- [Memory & Embeddings](#memory--embeddings)
- [Session Management](#session-management)
- [Database Performance](#database-performance)
- [Gateway Performance](#gateway-performance)
- [Resource Usage](#resource-usage)
- [Monitoring & Profiling](#monitoring--profiling)
- [Common Performance Issues](#common-performance-issues)
- [Performance Checklist](#performance-checklist)

## Overview

Flynn's performance depends on several factors:

1. **Context window efficiency**: How efficiently tokens are used
2. **Model selection**: Choosing the right model for each task
3. **Tool execution**: Fast, reliable tool responses
4. **I/O operations**: Database and file system access
5. **Concurrency**: Handling multiple simultaneous requests

### Performance Goals

- **Response time**: < 5 seconds for simple queries
- **Context efficiency**: > 80% token utilization
- **Throughput**: 10-20 concurrent conversations
- **Resource usage**: < 2GB memory, < 50% CPU

## Context Management

### Compaction Settings

Context compaction prevents conversations from exceeding model context windows.

```yaml
agents:
  default:
    compaction:
      # Trigger compaction at 75% of context window
      thresholdPct: 75
      # Keep last 6 turns (user + assistant pairs)
      keepTurns: 6
      # Allow 2048 tokens for summary
      summaryMaxTokens: 2048
      # Preserve high-importance messages
      importanceThreshold: 0.8
```

### Tuning Guidelines

**For fast interactions:**

```yaml
thresholdPct: 60       # Compact early
keepTurns: 2           # Minimal history
summaryMaxTokens: 512  # Short summaries
```

**For complex reasoning:**

```yaml
thresholdPct: 85        # Maximize context
keepTurns: 10           # More history
summaryMaxTokens: 4096  # Detailed summaries
```

### Proactive Compaction Signals

Use proactive thresholds to checkpoint before compaction cliffs and emit warning telemetry:

```yaml
compaction:
  proactive:
    enabled: true
    warn_pct: 75
    checkpoint_pct: 85
    auto_compact_pct: 95
    checkpoint_cooldown_ms: 300000
    memory_namespace: session/checkpoints
```

### Context Depth Levels

Control how much context is injected into the system prompt:

```yaml
prompt:
  contextDepth: 'normal' # minimal | normal | detailed | debug
```

- `minimal`: Only basic system prompt
- `normal`: System prompt + basic memory
- `detailed`: Full memory + tool descriptions
- `debug`: Verbose context (development only)

### Token Counting

Flynn uses rule-based token estimation (fast but approximate).

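
To make the trade-off concrete, here is a minimal sketch of what a rule-based estimator can look like. It is not Flynn's actual estimator: the names `estimateTokens` and `contextUsagePct` are hypothetical, and the ratios (about 4 characters per token for most text, about one token per CJK character) are assumptions chosen for illustration.

```typescript
// Hypothetical heuristic estimator; not Flynn's implementation.
// Assumption: ~4 characters per token for most text, and roughly
// one token per CJK character.
const CJK = /[\u3040-\u30ff\u3400-\u4dbf\u4e00-\u9fff\uac00-\ud7af]/g;

export function estimateTokens(text: string): number {
  if (text.length === 0) return 0;
  // Count CJK characters separately from everything else
  const cjkCount = (text.match(CJK) ?? []).length;
  const otherCount = text.length - cjkCount;
  return cjkCount + Math.ceil(otherCount / 4);
}

// Percentage of a context window consumed by a list of message bodies.
export function contextUsagePct(messages: string[], contextWindow: number): number {
  const total = messages.reduce((sum, m) => sum + estimateTokens(m), 0);
  return (total * 100) / contextWindow;
}
```

A heuristic like this needs no tokenizer and runs in a single pass over the text, but it can be noticeably off for source code or mixed-language content, so it is prudent to leave headroom between the compaction threshold and the real context limit.
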

**Enable tokenizer for accuracy (slower):**

```typescript
// Currently not implemented
// Future: Use tiktoken or similar for exact token counts
```

## Model Routing

### Tier Configuration

Optimize model tiers for cost and latency:

```yaml
models:
  router:
    tiers:
      # Fast, cheap: Quick tasks, delegated calls
      fast: 'anthropic:claude-haiku-4-20250514'
      # Default: General conversation
      default: 'anthropic:claude-sonnet-4-20250514'
      # Complex: Deep reasoning, analysis
      complex: 'anthropic:claude-opus-4-20250514'
      # Fallback: Local models when cloud fails
      local: 'ollama:llama3'
```

### Delegation Tasks

Map delegation tasks to appropriate tiers:

```yaml
agents:
  default:
    delegation:
      tiers:
        compaction: 'fast'              # Summarize history
        memoryExtraction: 'fast'        # Extract facts
        classification: 'default'       # Classify intent
        toolSummarization: 'default'    # Summarize tool results
        complexReasoning: 'complex'     # Deep analysis
```

### Fallback Chains

Configure fallback chains for resilience:

```yaml
models:
  router:
    # Per-tier fallbacks (e.g. same model on a different provider)
    tierFallbacks:
      default:
        - 'github:claude-sonnet-4-5'
        - 'openai:gpt-4o-mini'
    # Global fallback when all tiers fail
    fallbackChain:
      - 'github:claude-sonnet-4-5'
      - 'local:ollama:llama3'
```

### Retry Configuration

Optimize retry behavior for different scenarios:

```yaml
models:
  router:
    retry:
      # More retries for transient failures
      maxAttempts: 3
      # Start with 1s delay
      initialDelayMs: 1000
      # Exponential backoff
      multiplier: 2
      # Max 30s between retries
      maxDelayMs: 30000
      # Don't retry errors matching these patterns (auth failures, rate limits)
      nonRetryablePatterns:
        - 'invalid_api_key'
        - 'permission_denied'
        - 'rate_limit_exceeded'
```

**For production reliability:**

```yaml
maxAttempts: 5
initialDelayMs: 500
multiplier: 1.5
maxDelayMs: 60000
```

### Cost Estimation

Monitor token usage and costs:

```typescript
// Model costs (examples), in USD per 1M tokens
const MODEL_COSTS = {
  'anthropic:claude-sonnet-4-20250514': {
    input: 3.0,   // $3 per 1M input tokens
    output: 15.0  // $15 per 1M output tokens
  },
  'anthropic:claude-haiku-4-20250514': {
    input: 0.25,  // $0.25 per 1M input tokens
    output: 1.25  // $1.25 per 1M output tokens
  }
};
```

Track usage with `AgentOrchestrator.getUsageStats()`.

## Tool Execution

### Timeout Configuration

Set appropriate timeouts for different tool types:

```yaml
tools:
  executor:
    # Default 30s timeout
    defaultTimeoutMs: 30000
    # Max 50KB output
    maxOutputBytes: 51200
```

**For long-running tools:**

```yaml
tools:
  executor:
    defaultTimeoutMs: 60000 # 60s
```

**For fast tools:**

```yaml
tools:
  executor:
    defaultTimeoutMs: 10000 # 10s
```

### Caching (Future)

Caching results of repeated tool operations is planned but not yet implemented. The intended configuration:

```yaml
# Not yet implemented
tools:
  cache:
    enabled: true
    ttl: 300 # 5 minutes
    maxSize: 1000
    excludePatterns:
      - 'shell.exec'
      - 'process.*'
```

### Sandbox Performance

The Docker sandbox adds overhead. To reduce it:

```yaml
sandbox:
  enabled: true
  image: 'node:22-alpine'
  # Resource limits
  resourceLimits:
    memory: '512m'
    cpus: '0.5'
    timeoutSec: 60
  # Use host networking if safe
  networkMode: 'host' # Faster than bridge mode
```

**For best performance:**

```yaml
sandbox:
  enabled: false # Disable if not needed
```

### Parallel Tool Execution

Flynn executes tools sequentially.
For parallel execution:

```typescript
// Future enhancement
const results = await Promise.all([
  toolRegistry.execute('tool1', args1),
  toolRegistry.execute('tool2', args2),
  toolRegistry.execute('tool3', args3)
]);
```

## Memory & Embeddings

### Embedding Provider Selection

Choose embedding provider based on latency and cost:

```yaml
memory:
  embeddings:
    provider: 'openai' # openai | gemini | ollama | llamacpp | voyage
    openai:
      apiKey: '${OPENAI_API_KEY}'
      model: 'text-embedding-3-small' # Fastest
    # Alternative: Local embeddings
    ollama:
      host: 'localhost:11434'
      model: 'nomic-embed-text'
```

**Latency comparison:**

- OpenAI `text-embedding-3-small`: ~100ms
- Gemini: ~200ms
- Ollama `nomic-embed-text`: ~500ms (local)
- llama.cpp: ~300ms (local)

### Text Chunking

Optimize chunking for better search:

```yaml
memory:
  embeddings:
    chunking:
      # Smaller chunks for precision
      maxChunkSize: 512
      # Overlap for context preservation
      chunkOverlap: 50
      # Don't chunk small documents
      minChunkSize: 128
```

**For fast indexing:**

```yaml
maxChunkSize: 1024
chunkOverlap: 100
```

**For precise search:**

```yaml
maxChunkSize: 256
chunkOverlap: 25
```

### Hybrid Search Tuning

Balance keyword and vector search:

```yaml
memory:
  search:
    # Weight vector search higher
    vectorWeight: 0.7
    keywordWeight: 0.3
    # Return top results
    limit: 10
    # Minimum relevance threshold
    threshold: 0.5
```

**For keyword-heavy queries:**

```yaml
vectorWeight: 0.4
keywordWeight: 0.6
```

**For semantic queries:**

```yaml
vectorWeight: 0.8
keywordWeight: 0.2
```

### Embedding Caching

Cache embeddings to avoid recomputation:

```yaml
memory:
  embeddings:
    cache:
      enabled: true
      ttl: 86400 # 24 hours
```

## Session Management

### TTL Configuration

Set appropriate session TTLs:

```yaml
sessions:
  ttl: '7d' # Keep sessions for 7 days
  # Maximum concurrent sessions
  maxSessions: 100
```

**For memory efficiency:**

```yaml
ttl: '1d'
maxSessions: 50
```

**For long-term memory:**

```yaml
ttl: '30d'
maxSessions: 200
```

### Session Pruning

Prune
old sessions regularly:

```yaml
automation:
  sessionPruner:
    enabled: true
    interval: '1h' # Run every hour
    # Prune sessions older than TTL
    pruneOlderThan: '7d'
```

### Session Indexing

Optimize session search with indexes:

```sql
-- SQLite indexes
CREATE INDEX idx_sessions_created_at ON sessions(created_at);
CREATE INDEX idx_sessions_last_active ON sessions(last_active_at);
CREATE INDEX idx_messages_session_id ON messages(session_id);
```

## Database Performance

### SQLite Configuration

Optimize SQLite for Flynn's workload:

```sql
-- In SQLite connection setup
PRAGMA journal_mode = WAL;      -- Better concurrency
PRAGMA synchronous = NORMAL;    -- Faster writes
PRAGMA cache_size = -64000;     -- 64MB cache
PRAGMA temp_store = MEMORY;     -- Store temp data in memory
PRAGMA mmap_size = 268435456;   -- 256MB mmap
PRAGMA page_size = 4096;        -- Default page size (must be set before the database is created)
```

### Connection Pooling

Flynn uses a single SQLite connection per database. For high concurrency, consider:

```typescript
// Future: Connection pool
// Note: ConnectionPool is a hypothetical class; better-sqlite3 does not provide one
import Database from 'better-sqlite3';

const pool = new ConnectionPool({
  filename: '/path/to/database.db',
  maxConnections: 10
});
```

### Query Optimization

Use indexed columns in queries:

```typescript
// Good: Uses the index on last_active_at
const recentSessions = db.prepare(`
  SELECT * FROM sessions
  WHERE last_active_at > ?
  ORDER BY last_active_at DESC
  LIMIT 10
`).all(threshold);

// Bad: Full table scan (no index on message_count)
const busySessions = db.prepare(`
  SELECT * FROM sessions
  WHERE message_count > ?
`).all(threshold);
```

### Vacuum and Analyze

Regular maintenance improves performance:

```bash
# Vacuum to reclaim space
sqlite3 sessions.db "VACUUM;"

# Analyze for query optimization
sqlite3 sessions.db "ANALYZE;"

# Rebuild indexes
sqlite3 sessions.db "REINDEX;"
```

Add to crontab (monthly):

```
0 0 1 * * sqlite3 /var/lib/flynn/sessions.db "VACUUM; ANALYZE;" >> /var/log/flynn-maintenance.log 2>&1
```

## Gateway Performance

### Connection Limits

Limit concurrent connections:

```yaml
gateway:
  enabled: true
  port: 18800
  # Maximum concurrent WebSocket connections
  maxConnections: 50
  # Single-client lock
  lock:
    enabled: true # Only one client at a time
```

**For multiple users:**

```yaml
gateway:
  maxConnections: 100
  lock:
    enabled: false
```

### Lane Queue

The lane queue serializes requests per session:

```yaml
gateway:
  laneQueue:
    # Max queued requests per session
    maxDepth: 10
    # Request timeout
    requestTimeoutMs: 30000
```

### WebSocket Optimization

Configure WebSocket for performance:

```typescript
// Gateway server WebSocket options
const wsOptions = {
  // Enable compression for messages larger than 1KB
  perMessageDeflate: {
    threshold: 1024
  },
  // Track connected clients (needed to iterate over them for heartbeat pings)
  clientTracking: true,
  // Maximum message size
  maxPayload: 16 * 1024 * 1024 // 16MB
};
```

### HTTP Server

Optimize HTTP server for static files:

```yaml
gateway:
  static:
    # Enable gzip compression
    gzip: true
    # Cache static assets
    cacheControl: 'public, max-age=3600'
    # Serve index.html for SPA routes
    spa: true
```

## Resource Usage

### Node.js Options

Tune Node.js for production:

```bash
# Increase memory limit
export NODE_OPTIONS="--max-old-space-size=4096"

# Favor a smaller memory footprint over speed.
# This V8 flag is not on the NODE_OPTIONS allowlist, so pass it on the command line.
node --max-old-space-size=4096 --optimize-for-size dist/cli/index.js start
```

In systemd service:

```ini
Environment="NODE_OPTIONS=--max-old-space-size=4096"
```

### Process Limits

Set appropriate limits:

```ini
[Service]
# Memory limit (2GB), no swap
MemoryMax=2G
MemorySwapMax=0

# CPU quota (200% = 2 cores)
CPUQuota=200%

# File descriptors
LimitNOFILE=65536
```

### Docker Resource Limits

Constrain Docker container:

```yaml
services:
  flynn:
    deploy:
      resources:
        limits:
          cpus: '2.0'
          memory: 2G
        reservations:
          cpus: '1.0'
          memory: 1G
```

### Memory Monitoring

Monitor memory usage:

```bash
# Check Flynn memory
ps aux | grep flynn

# System memory
free -h

# Node.js heap stats (add to code)
console.log('Heap used:', process.memoryUsage().heapUsed / 1024 / 1024, 'MB');
```

## Monitoring & Profiling

### Health Checks

Enable gateway health endpoint:

```yaml
automation:
  heartbeat:
    enabled: true
    interval: '5m'
    checks:
      - 'gateway'
      - 'model'
      - 'channels'
      - 'memory'
      - 'disk'
```

Check health:

```bash
curl http://localhost:18800/health
```

### Logging Levels

Configure logging appropriately:

```yaml
logging:
  level: 'info' # debug | info | warn | error
```

- **Development:** `debug` - All messages
- **Production:** `info` - Normal operation
- **Minimal:** `warn` - Only warnings and errors

### Performance Metrics

Track key metrics:

```typescript
// Future: Metrics collection
interface Metrics {
  // Response times
  avgResponseTime: number;
  p95ResponseTime: number;
  p99ResponseTime: number;

  // Throughput
  requestsPerSecond: number;
  concurrentSessions: number;

  // Token usage
  avgInputTokens: number;
  avgOutputTokens: number;
  totalTokens: number;

  // Errors
  errorRate: number;
  timeoutRate: number;
}
```

### Profiling

Profile Node.js execution:

```bash
# Generate V8 profiler log
node --prof dist/cli/index.js start

# Process the log into a readable summary
node --prof-process isolate-*.log > profile.txt

# Alternatively, write a .cpuprofile file and load it in Chrome DevTools
node --cpu-prof dist/cli/index.js start
```

### Flamegraphs

Generate flamegraphs for bottleneck analysis:

```bash
# Install 0x
npm install -g 0x

# Run with profiler
0x dist/cli/index.js start
```

## Common Performance Issues

### High Memory Usage

**Symptoms:**

- OOM errors
- Slow garbage collection
- System swapping

**Solutions:**

1. Reduce `keepTurns` in compaction
2. Decrease session TTL
3. Prune old sessions
4. Increase Node.js memory limit
5. Check for memory leaks

### Slow Response Times

**Symptoms:**

- Responses > 10 seconds
- Timeouts
- Poor user experience

**Solutions:**

1. Switch to faster model tier
2. Enable compaction
3. Use local fallbacks
4. Optimize tool timeouts
5. Check network latency

### High CPU Usage

**Symptoms:**

- CPU > 80%
- Slow system
- High latency

**Solutions:**

1. Reduce concurrent sessions
2. Optimize database queries
3. Use efficient embeddings
4. Disable unnecessary features
5. Scale vertically (more CPU)

### Database Locks

**Symptoms:**

- SQLite `database is locked` errors
- Slow writes
- Concurrent access issues

**Solutions:**

1. Enable WAL mode
2. Reduce write frequency
3. Use connection pooling
4. Add appropriate indexes

### Model Rate Limits

**Symptoms:**

- 429 Too Many Requests errors
- Frequent fallbacks
- Increased latency

**Solutions:**

1. Configure retry with exponential backoff
2. Use faster models for delegated tasks
3. Implement request queuing
4. Add local model fallbacks

## Performance Checklist

Before deploying to production, verify:

- [ ] Compaction configured with appropriate threshold
- [ ] Model tiers configured for cost/latency
- [ ] Fallback chains configured
- [ ] Tool timeouts set appropriately
- [ ] Session TTL reasonable for use case
- [ ] SQLite optimized (WAL mode, cache size)
- [ ] Database indexes created
- [ ] Gateway connection limits set
- [ ] Memory limits configured
- [ ] Monitoring enabled
- [ ] Logging level set to `info` or `warn`
- [ ] Health checks working
- [ ] Backup/restore tested

---

For more information:

- [TROUBLESHOOTING.md](../../TROUBLESHOOTING.md)
- [PRODUCTION.md](../deployment/PRODUCTION.md)
- [ARCHITECTURE.md](../../.planning/codebase/ARCHITECTURE.md)