2026-02-16 15:44:13 -08:00

Performance Tuning Guide

This guide covers performance optimization techniques for Flynn in production environments.

Overview

Flynn's performance depends on several factors:

  1. Context window efficiency: How efficiently tokens are used
  2. Model selection: Choosing the right model for each task
  3. Tool execution: Fast, reliable tool responses
  4. I/O operations: Database and file system access
  5. Concurrency: Handling multiple simultaneous requests

Performance Goals

  • Response time: < 5 seconds for simple queries
  • Context efficiency: > 80% token utilization
  • Throughput: 10-20 concurrent conversations
  • Resource usage: < 2GB memory, < 50% CPU

Context Management

Compaction Settings

Context compaction prevents conversations from exceeding model context windows.

agents:
  default:
    compaction:
      # Trigger compaction at 75% of context window
      thresholdPct: 75

      # Keep last 6 turns (user + assistant pairs)
      keepTurns: 6

      # Allow 2048 tokens for summary
      summaryMaxTokens: 2048

      # Preserve high-importance messages
      importanceThreshold: 0.8

Tuning Guidelines

For fast interactions:

thresholdPct: 60      # Compact early
keepTurns: 2          # Minimal history
summaryMaxTokens: 512  # Short summaries

For complex reasoning:

thresholdPct: 85      # Maximize context
keepTurns: 10         # More history
summaryMaxTokens: 4096 # Detailed summaries

Proactive Compaction Signals

Use proactive thresholds to checkpoint before compaction cliffs and emit warning telemetry:

compaction:
  proactive:
    enabled: true
    warn_pct: 75
    checkpoint_pct: 85
    auto_compact_pct: 95
    checkpoint_cooldown_ms: 300000
    memory_namespace: session/checkpoints
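
To make the three thresholds concrete, the sketch below maps current context usage to the action each threshold implies. It assumes usage is measured as a percentage of the model's context window and that the highest crossed threshold wins. The type and function names are illustrative and are not Flynn's API.

```typescript
// Hypothetical helper: names and semantics are assumptions, not Flynn's API.
type ProactiveAction = 'none' | 'warn' | 'checkpoint' | 'auto_compact';

interface ProactiveThresholds {
  warn_pct: number;
  checkpoint_pct: number;
  auto_compact_pct: number;
}

function proactiveAction(
  usedTokens: number,
  contextWindow: number,
  t: ProactiveThresholds
): ProactiveAction {
  const usedPct = (usedTokens / contextWindow) * 100;
  // Check the highest threshold first so the most urgent action wins.
  if (usedPct >= t.auto_compact_pct) return 'auto_compact';
  if (usedPct >= t.checkpoint_pct) return 'checkpoint';
  if (usedPct >= t.warn_pct) return 'warn';
  return 'none';
}

const thresholds: ProactiveThresholds = {
  warn_pct: 75,
  checkpoint_pct: 85,
  auto_compact_pct: 95,
};

// 172,000 of 200,000 tokens is 86% of the window, past checkpoint_pct.
console.log(proactiveAction(172000, 200000, thresholds));
```

Under these assumptions a session passes through warn and checkpoint before it is compacted automatically, which leaves room to persist state first.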

Context Depth Levels

Control how much context is injected into the system prompt:

prompt:
  contextDepth: 'normal'  # minimal | normal | detailed | debug

  • minimal: Only basic system prompt
  • normal: System prompt + basic memory
  • detailed: Full memory + tool descriptions
  • debug: Verbose context (development only)

Token Counting

Flynn uses rule-based token estimation (fast but approximate).

A tokenizer would give exact counts at the cost of speed, but this option is not yet available:

// Currently not implemented
// Future: Use tiktoken or similar for exact token counts
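
As a rough illustration of what rule-based estimation can look like, the sketch below uses the common heuristic of about four characters per token for English text. The ratio and both function names are assumptions; Flynn's actual rules are not documented here and may differ.

```typescript
// Illustrative heuristic only; not Flynn's actual estimator.
const CHARS_PER_TOKEN = 4; // rough average for English prose (assumption)

function estimateTokens(text: string): number {
  if (text.length === 0) return 0;
  return Math.ceil(text.length / CHARS_PER_TOKEN);
}

// Decide whether a conversation has crossed the compaction threshold.
function shouldCompact(
  messages: string[],
  contextWindow: number,
  thresholdPct: number
): boolean {
  const used = messages.reduce((sum, m) => sum + estimateTokens(m), 0);
  return (used / contextWindow) * 100 >= thresholdPct;
}

console.log(estimateTokens('How do I tune compaction?'));
```

Because an estimate like this can undercount code or non-English text, it is safer to pair it with a threshold that leaves some headroom.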

Model Routing

Tier Configuration

Optimize model tiers for cost and latency:

models:
  router:
    tiers:
      # Fast, cheap: Quick tasks, delegated calls
      fast: 'anthropic:claude-haiku-4-20250514'

      # Default: General conversation
      default: 'anthropic:claude-sonnet-4-20250514'

      # Complex: Deep reasoning, analysis
      complex: 'anthropic:claude-opus-4-20250514'

      # Fallback: Local models when cloud fails
      local: 'ollama:llama3'

Delegation Tasks

Map delegation tasks to appropriate tiers:

agents:
  default:
    delegation:
      tiers:
        compaction: 'fast'           # Summarize history
        memoryExtraction: 'fast'     # Extract facts
        classification: 'default'     # Classify intent
        toolSummarization: 'default' # Summarize tool results
        complexReasoning: 'complex'  # Deep analysis

Fallback Chains

Configure fallback chains for resilience:

models:
  router:
    # Try the same or a comparable model on a different provider
    tierFallbacks:
      default:
        - 'github:claude-sonnet-4-5'
        - 'openai:gpt-4o-mini'

    # Global fallback when all tiers fail
    fallbackChain:
      - 'github:claude-sonnet-4-5'
      - 'local:ollama:llama3'
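
The sketch below shows one way the two lists could combine into a single ordered list of candidates for a tier. The resolution order (tier model, then tier fallbacks, then the global chain) and the `candidateModels` helper are assumptions made for illustration; the config shape mirrors the YAML above.

```typescript
// Assumed shape, mirroring the YAML above; the resolution order is an assumption.
interface RouterConfig {
  tiers: Record<string, string>;
  tierFallbacks?: Record<string, string[]>;
  fallbackChain?: string[];
}

function candidateModels(router: RouterConfig, tier: string): string[] {
  const ordered = [
    router.tiers[tier],
    ...(router.tierFallbacks?.[tier] ?? []),
    ...(router.fallbackChain ?? []),
  ].filter((m): m is string => typeof m === 'string');
  // A Set keeps first-seen order, so a model listed twice is tried only once.
  return Array.from(new Set(ordered));
}

const router: RouterConfig = {
  tiers: { default: 'anthropic:claude-sonnet-4-20250514' },
  tierFallbacks: { default: ['github:claude-sonnet-4-5', 'openai:gpt-4o-mini'] },
  fallbackChain: ['github:claude-sonnet-4-5', 'local:ollama:llama3'],
};

console.log(candidateModels(router, 'default'));
```

With the example config, `github:claude-sonnet-4-5` appears in both lists but is attempted once, so the chain has four distinct candidates.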

Retry Configuration

Optimize retry behavior for different scenarios:

models:
  router:
    retry:
      # More retries for transient failures
      maxAttempts: 3

      # Start with 1s delay
      initialDelayMs: 1000

      # Exponential backoff
      multiplier: 2

      # Max 30s between retries
      maxDelayMs: 30000

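      # Don't retry auth errors. Note that listing rate_limit_exceeded here
      # also skips retries for rate limits and moves straight to fallbacks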
      nonRetryablePatterns:
        - 'invalid_api_key'
        - 'permission_denied'
        - 'rate_limit_exceeded'

For production reliability:

maxAttempts: 5
initialDelayMs: 500
multiplier: 1.5
maxDelayMs: 60000
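
To compare the two profiles, the sketch below computes the wait before each retry. It assumes the first attempt is not delayed, so `maxAttempts` attempts produce `maxAttempts - 1` waits, and that no jitter is applied. `retryDelays` is a hypothetical helper, not part of Flynn.

```typescript
interface RetryConfig {
  maxAttempts: number;
  initialDelayMs: number;
  multiplier: number;
  maxDelayMs: number;
}

// Delay before each retry, capped at maxDelayMs.
// Assumption: the first attempt runs immediately.
function retryDelays(cfg: RetryConfig): number[] {
  const delays: number[] = [];
  for (let retry = 0; retry < cfg.maxAttempts - 1; retry++) {
    const raw = cfg.initialDelayMs * Math.pow(cfg.multiplier, retry);
    delays.push(Math.min(Math.round(raw), cfg.maxDelayMs));
  }
  return delays;
}

// Default profile: waits of 1000 and 2000 ms
console.log(retryDelays({ maxAttempts: 3, initialDelayMs: 1000, multiplier: 2, maxDelayMs: 30000 }));

// Production profile: waits of 500, 750, 1125 and 1688 ms
console.log(retryDelays({ maxAttempts: 5, initialDelayMs: 500, multiplier: 1.5, maxDelayMs: 60000 }));
```

Under these assumptions neither profile gets close to its `maxDelayMs` cap: the production profile waits about four seconds in total, so the cap only matters if `maxAttempts` or `multiplier` is raised.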

Cost Estimation

Monitor token usage and costs:

// Model costs (examples)
const MODEL_COSTS = {
  'anthropic:claude-sonnet-4-20250514': {
    input: 3.0,    // $3 per 1M input tokens
    output: 15.0    // $15 per 1M output tokens
  },
  'anthropic:claude-haiku-4-20250514': {
    input: 0.25,
    output: 1.25
  }
};

Track usage with AgentOrchestrator.getUsageStats().
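
A minimal sketch of turning token counts into a dollar estimate with the price table above. The `Usage` record is hypothetical: the shape returned by `getUsageStats()` is not documented here, so adapt the field names to what it actually returns.

```typescript
interface ModelCost {
  input: number;  // USD per 1M input tokens
  output: number; // USD per 1M output tokens
}

// Example prices copied from the table above.
const MODEL_COSTS: Record<string, ModelCost> = {
  'anthropic:claude-sonnet-4-20250514': { input: 3.0, output: 15.0 },
  'anthropic:claude-haiku-4-20250514': { input: 0.25, output: 1.25 },
};

// Hypothetical usage record; the real getUsageStats() shape may differ.
interface Usage {
  model: string;
  inputTokens: number;
  outputTokens: number;
}

function estimateCostUsd(usage: Usage): number | undefined {
  const cost = MODEL_COSTS[usage.model];
  if (!cost) return undefined; // unknown model: no price data
  return (
    (usage.inputTokens / 1e6) * cost.input +
    (usage.outputTokens / 1e6) * cost.output
  );
}

// 200k input tokens and 10k output tokens on the default tier: about $0.75
console.log(
  estimateCostUsd({
    model: 'anthropic:claude-sonnet-4-20250514',
    inputTokens: 200000,
    outputTokens: 10000,
  })
);
```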

Tool Execution

Timeout Configuration

Set appropriate timeouts for different tool types:

tools:
  executor:
    # Default 30s timeout
    defaultTimeoutMs: 30000

    # Max 50KB output
    maxOutputBytes: 51200

For long-running tools:

tools:
  executor:
    defaultTimeoutMs: 60000  # 60s

For fast tools:

tools:
  executor:
    defaultTimeoutMs: 10000  # 10s

Caching (Future)

Implement caching for repeated operations:

# Not yet implemented
tools:
  cache:
    enabled: true
    ttl: 300  # 5 minutes
    maxSize: 1000
    excludePatterns:
      - 'shell.exec'
      - 'process.*'

Sandbox Performance

The Docker sandbox adds overhead. To reduce it:

sandbox:
  enabled: true
  image: 'node:22-alpine'

  # Resource limits
  resourceLimits:
    memory: '512m'
    cpus: '0.5'
    timeoutSec: 60

  # Use host networking if safe
  networkMode: 'host'  # Faster than bridge mode

For best performance:

sandbox:
  enabled: false  # Disable if not needed

Parallel Tool Execution

Flynn currently executes tools sequentially. Independent tool calls could run in parallel in a future version:

// Future enhancement
const results = await Promise.all([
  toolRegistry.execute('tool1', args1),
  toolRegistry.execute('tool2', args2),
  toolRegistry.execute('tool3', args3)
]);

Memory & Embeddings

Embedding Provider Selection

Choose embedding provider based on latency and cost:

memory:
  embeddings:
    provider: 'openai'  # openai | gemini | ollama | llamacpp | voyage

    openai:
      apiKey: '${OPENAI_API_KEY}'
      model: 'text-embedding-3-small'  # Fastest

    # Alternative: Local embeddings
    ollama:
      host: 'localhost:11434'
      model: 'nomic-embed-text'

Approximate latency per embedding request (varies with hardware and network):

  • OpenAI text-embedding-3-small: ~100ms
  • Gemini: ~200ms
  • Ollama nomic-embed-text: ~500ms (local)
  • llama.cpp: ~300ms (local)

Text Chunking

Optimize chunking for better search:

memory:
  embeddings:
    chunking:
      # Smaller chunks for precision
      maxChunkSize: 512

      # Overlap for context preservation
      chunkOverlap: 50

      # Don't chunk small documents
      minChunkSize: 128

For fast indexing:

maxChunkSize: 1024
chunkOverlap: 100

For precise search:

maxChunkSize: 256
chunkOverlap: 25
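
The trade-off between the profiles is easier to see with a small sliding-window chunker. This sketch treats sizes as character counts, which is an assumption (Flynn may count tokens), and leaves out `minChunkSize` because its exact semantics are not specified above. `chunkText` is an illustrative name.

```typescript
// Sliding-window chunker. Sizes are character counts here (assumption).
function chunkText(text: string, maxChunkSize: number, chunkOverlap: number): string[] {
  if (chunkOverlap >= maxChunkSize) {
    throw new Error('chunkOverlap must be smaller than maxChunkSize');
  }
  if (text.length <= maxChunkSize) return [text];

  const step = maxChunkSize - chunkOverlap; // how far the window advances
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + maxChunkSize));
    if (start + maxChunkSize >= text.length) break; // window reached the end
  }
  return chunks;
}

const doc = 'x'.repeat(1000);
console.log(chunkText(doc, 1024, 100).length); // fast indexing profile
console.log(chunkText(doc, 512, 50).length);   // default profile
console.log(chunkText(doc, 256, 25).length);   // precise search profile
```

For a 1,000-character document this gives one, three and five chunks respectively, so the precise profile embeds five times as many chunks as the fast one in exchange for finer-grained matches.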

Hybrid Search Tuning

Balance keyword and vector search:

memory:
  search:
    # Weight vector search higher
    vectorWeight: 0.7
    keywordWeight: 0.3

    # Return top results
    limit: 10

    # Minimum relevance threshold
    threshold: 0.5

For keyword-heavy queries:

vectorWeight: 0.4
keywordWeight: 0.6

For semantic queries:

vectorWeight: 0.8
keywordWeight: 0.2
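
The effect of the weights can be sketched as a weighted sum of the two scores, followed by the threshold and limit. This assumes both scores are already normalized to the range 0 to 1 and that Flynn combines them linearly; neither is confirmed above, and `rankHybrid` is a hypothetical helper.

```typescript
interface SearchHit {
  id: string;
  vectorScore: number;  // assumed normalized to [0, 1]
  keywordScore: number; // assumed normalized to [0, 1]
}

interface SearchConfig {
  vectorWeight: number;
  keywordWeight: number;
  limit: number;
  threshold: number;
}

function rankHybrid(hits: SearchHit[], cfg: SearchConfig) {
  return hits
    .map((h) => ({
      id: h.id,
      score: cfg.vectorWeight * h.vectorScore + cfg.keywordWeight * h.keywordScore,
    }))
    .filter((h) => h.score >= cfg.threshold) // drop weak matches
    .sort((a, b) => b.score - a.score)       // best first
    .slice(0, cfg.limit);
}

const hits: SearchHit[] = [
  { id: 'semantic-match', vectorScore: 0.9, keywordScore: 0.1 },
  { id: 'exact-term-match', vectorScore: 0.3, keywordScore: 0.9 },
];

// Default weights keep only the semantic match; keyword-heavy weights flip the result.
console.log(rankHybrid(hits, { vectorWeight: 0.7, keywordWeight: 0.3, limit: 10, threshold: 0.5 }));
console.log(rankHybrid(hits, { vectorWeight: 0.4, keywordWeight: 0.6, limit: 10, threshold: 0.5 }));
```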

Embedding Caching

Cache embeddings to avoid recomputation:

memory:
  embeddings:
    cache:
      enabled: true
      ttl: 86400  # 24 hours

Session Management

TTL Configuration

Set appropriate session TTLs:

sessions:
  ttl: '7d'  # Keep sessions for 7 days

  # Maximum concurrent sessions
  maxSessions: 100

For memory efficiency:

ttl: '1d'
maxSessions: 50

For long-term memory:

ttl: '30d'
maxSessions: 200

Session Pruning

Prune old sessions regularly:

automation:
  sessionPruner:
    enabled: true
    interval: '1h'  # Run every hour

    # Prune sessions older than TTL
    pruneOlderThan: '7d'
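
A sketch of how duration strings such as `'1h'` and `'7d'` might be interpreted when deciding whether a session is due for pruning. The accepted format (an integer followed by `s`, `m`, `h` or `d`) is inferred from the examples above, and both helpers are hypothetical.

```typescript
const UNIT_MS: Record<string, number> = {
  s: 1000,
  m: 60 * 1000,
  h: 60 * 60 * 1000,
  d: 24 * 60 * 60 * 1000,
};

// Parses strings such as '1h' or '7d'. The format is an assumption
// based on the config examples above.
function parseDurationMs(value: string): number {
  const match = /^(\d+)([smhd])$/.exec(value.trim());
  if (!match) throw new Error(`Unsupported duration: ${value}`);
  return Number(match[1]) * UNIT_MS[match[2]];
}

// A session is prunable once it has been idle longer than pruneOlderThan.
function isExpired(lastActiveAt: Date, pruneOlderThan: string, now: Date = new Date()): boolean {
  return now.getTime() - lastActiveAt.getTime() > parseDurationMs(pruneOlderThan);
}

console.log(parseDurationMs('7d')); // 604800000
```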

Session Indexing

Optimize session search with indexes:

-- SQLite indexes
CREATE INDEX idx_sessions_created_at ON sessions(created_at);
CREATE INDEX idx_sessions_last_active ON sessions(last_active_at);
CREATE INDEX idx_messages_session_id ON messages(session_id);

Database Performance

SQLite Configuration

Optimize SQLite for Flynn's workload:

-- In SQLite connection setup
PRAGMA journal_mode = WAL;        -- Better concurrency
PRAGMA synchronous = NORMAL;      -- Faster writes
PRAGMA cache_size = -64000;       -- ~64MB cache (negative values are in KiB)
PRAGMA temp_store = MEMORY;       -- Store temp data in memory
PRAGMA mmap_size = 268435456;     -- 256MB mmap
PRAGMA page_size = 4096;          -- Default page size; only applies to a new database or after VACUUM

Connection Pooling

Flynn uses a single SQLite connection per database. For high concurrency, consider a connection pool:

// Future: Connection pool (ConnectionPool is a placeholder; better-sqlite3 does not export one)
import Database from 'better-sqlite3';

const pool = new ConnectionPool({
  filename: '/path/to/database.db',
  maxConnections: 10
});

Query Optimization

Use indexed columns in queries:

// Good: filters and sorts on last_active_at, which idx_sessions_last_active covers
const recentSessions = db.prepare(`
  SELECT * FROM sessions
  WHERE last_active_at > ?
  ORDER BY last_active_at DESC
  LIMIT 10
`).all(threshold);

// Bad: message_count has no index, so this is a full table scan
const busySessions = db.prepare(`
  SELECT * FROM sessions
  WHERE message_count > ?
`).all(threshold);

Vacuum and Analyze

Regular maintenance improves performance:

# Vacuum to reclaim space
sqlite3 sessions.db "VACUUM;"

# Analyze for query optimization
sqlite3 sessions.db "ANALYZE;"

# Rebuild indexes
sqlite3 sessions.db "REINDEX;"

Add to crontab (monthly):

0 0 1 * * sqlite3 /var/lib/flynn/sessions.db "VACUUM; ANALYZE;" >> /var/log/flynn-maintenance.log 2>&1

Gateway Performance

Connection Limits

Limit concurrent connections:

gateway:
  enabled: true
  port: 18800

  # Maximum concurrent WebSocket connections
  maxConnections: 50

  # Single-client lock
  lock:
    enabled: true  # Only one client at a time

For multiple users:

gateway:
  maxConnections: 100
  lock:
    enabled: false

Lane Queue

The lane queue serializes requests per session:

gateway:
  laneQueue:
    # Max requests per session
    maxDepth: 10

    # Request timeout
    requestTimeoutMs: 30000
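
The idea of serializing requests per session can be sketched with one promise chain per session ID. This is an illustration of the technique only, assuming `maxDepth` counts running plus waiting requests; the `LaneQueue` class is hypothetical and omits the request timeout.

```typescript
// Minimal per-session serial queue. Class and method names are illustrative.
class LaneQueue {
  private readonly maxDepth: number;
  private tails = new Map<string, Promise<void>>();
  private depths = new Map<string, number>();

  constructor(maxDepth: number) {
    this.maxDepth = maxDepth;
  }

  // Number of running plus waiting requests for a session.
  depth(sessionId: string): number {
    return this.depths.get(sessionId) ?? 0;
  }

  // Runs task after every earlier task for the same session has settled.
  enqueue<T>(sessionId: string, task: () => Promise<T>): Promise<T> {
    if (this.depth(sessionId) >= this.maxDepth) {
      return Promise.reject(new Error(`Lane full for session ${sessionId}`));
    }
    this.depths.set(sessionId, this.depth(sessionId) + 1);

    const previous = this.tails.get(sessionId) ?? Promise.resolve();
    const result = previous.then(() => task());
    // The tail never rejects, so one failed request does not block the lane.
    const tail = result
      .then(() => undefined, () => undefined)
      .then(() => {
        const remaining = this.depth(sessionId) - 1;
        if (remaining > 0) {
          this.depths.set(sessionId, remaining);
        } else {
          // Lane drained: drop bookkeeping for idle sessions.
          this.depths.delete(sessionId);
          this.tails.delete(sessionId);
        }
      });
    this.tails.set(sessionId, tail);
    return result;
  }
}

const lanes = new LaneQueue(10);
lanes.enqueue('session-a', async () => 'first');
lanes.enqueue('session-a', async () => 'second'); // waits for 'first'
lanes.enqueue('session-b', async () => 'independent'); // separate lane, runs concurrently
```

Requests in different sessions never wait on each other, while a session that floods its lane is rejected once `maxDepth` is reached rather than growing an unbounded backlog.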

WebSocket Optimization

Configure WebSocket for performance:

// Gateway server WebSocket options
const wsOptions = {
  // Enable compression
  perMessageDeflate: {
    threshold: 1024
  },

  // Track connected clients (server.clients), e.g. to send heartbeat pings
  clientTracking: true,

  // Maximum message size
  maxPayload: 16 * 1024 * 1024  // 16MB
};

HTTP Server

Optimize HTTP server for static files:

gateway:
  static:
    # Enable gzip compression
    gzip: true

    # Cache static assets
    cacheControl: 'public, max-age=3600'

    # Serve index.html for SPA routes
    spa: true

Resource Usage

Node.js Options

Tune Node.js for production:

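# Raise the V8 old-space heap limit to 4 GB
export NODE_OPTIONS="--max-old-space-size=4096"

# Other V8 flags such as --optimize-for-size are generally rejected in NODE_OPTIONS;
# pass them on the node command line instead and benchmark before keeping them.
# --gc-interval is a V8 debugging flag that forces frequent collections, so it is omitted.
node --max-old-space-size=4096 --optimize-for-size dist/cli/index.js start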

In systemd service:

Environment="NODE_OPTIONS=--max-old-space-size=4096"

Process Limits

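Set appropriate limits. Keep the Node.js heap limit (--max-old-space-size) below the service memory limit, otherwise the process is killed before V8 can report an out-of-memory error: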

[Service]
# Memory limit (2GB)
MemoryMax=2G
MemorySwapMax=0

# CPU quota (200% = 2 cores)
CPUQuota=200%

# File descriptors
LimitNOFILE=65536

Docker Resource Limits

Constrain the Docker container:

services:
  flynn:
    deploy:
      resources:
        limits:
          cpus: '2.0'
          memory: 2G
        reservations:
          cpus: '1.0'
          memory: 1G

Memory Monitoring

Monitor memory usage:

# Check Flynn memory
ps aux | grep flynn

# System memory
free -h

# Node.js heap stats (add to code)
console.log('Heap used:', process.memoryUsage().heapUsed / 1024 / 1024, 'MB');

Monitoring & Profiling

Health Checks

Enable gateway health endpoint:

automation:
  heartbeat:
    enabled: true
    interval: '5m'
    checks:
      - 'gateway'
      - 'model'
      - 'channels'
      - 'memory'
      - 'disk'

Check health:

curl http://localhost:18800/health

Logging Levels

Configure logging appropriately:

logging:
  level: 'info'  # debug | info | warn | error

  • Development: debug (all messages)
  • Production: info (normal operation)
  • Minimal: warn (only warnings and errors)

Performance Metrics

Track key metrics:

// Future: Metrics collection
interface Metrics {
  // Response times
  avgResponseTime: number;
  p95ResponseTime: number;
  p99ResponseTime: number;

  // Throughput
  requestsPerSecond: number;
  concurrentSessions: number;

  // Token usage
  avgInputTokens: number;
  avgOutputTokens: number;
  totalTokens: number;

  // Errors
  errorRate: number;
  timeoutRate: number;
}
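
Until metrics collection exists, the response-time fields can be computed from raw samples. The sketch below uses the nearest-rank method for percentiles; the helper names are illustrative and the choice of method is an assumption.

```typescript
// Nearest-rank percentile over response times in milliseconds.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error('No samples');
  const sorted = samples.slice().sort((a, b) => a - b);
  // Rank is 1-based: the smallest value at or above p percent of the samples.
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

function summarizeResponseTimes(samples: number[]) {
  const total = samples.reduce((sum, v) => sum + v, 0);
  return {
    avgResponseTime: total / samples.length,
    p95ResponseTime: percentile(samples, 95),
    p99ResponseTime: percentile(samples, 99),
  };
}

// One slow outlier barely moves the average but dominates p99.
const samplesMs = [120, 130, 110, 140, 125, 135, 115, 128, 132, 4800];
console.log(summarizeResponseTimes(samplesMs));
```

Tracking p95 and p99 alongside the average matters because a few slow model calls or tool timeouts can hide behind a healthy-looking mean.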

Profiling

Profile Node.js execution:

# Generate a V8 tick log (isolate-*.log)
node --prof dist/cli/index.js start

# Convert the log into a readable summary
node --prof-process isolate-*.log > profile.txt

# For Chrome DevTools, record a .cpuprofile instead and load it in the Performance panel
node --cpu-prof dist/cli/index.js start

Flamegraphs

Generate flamegraphs for bottleneck analysis:

# Install 0x
npm install -g 0x

# Run with profiler
0x dist/cli/index.js start

Common Performance Issues

High Memory Usage

Symptoms:

  • OOM errors
  • Slow garbage collection
  • System swapping

Solutions:

  1. Reduce keepTurns in compaction
  2. Decrease session TTL
  3. Prune old sessions
  4. Increase Node.js memory limit
  5. Check for memory leaks

Slow Response Times

Symptoms:

  • Responses > 10 seconds
  • Timeouts
  • Poor user experience

Solutions:

  1. Switch to faster model tier
  2. Enable compaction
  3. Use local fallbacks
  4. Optimize tool timeouts
  5. Check network latency

High CPU Usage

Symptoms:

  • CPU > 80%
  • Slow system
  • High latency

Solutions:

  1. Reduce concurrent sessions
  2. Optimize database queries
  3. Use efficient embeddings
  4. Disable unnecessary features
  5. Scale vertically (more CPU)

Database Locks

Symptoms:

  • SQLite database locked errors
  • Slow writes
  • Concurrent access issues

Solutions:

  1. Enable WAL mode
  2. Reduce write frequency
  3. Use connection pooling
  4. Add appropriate indexes

Model Rate Limits

Symptoms:

  • 429 Too Many Requests errors
  • Frequent fallbacks
  • Increased latency

Solutions:

  1. Configure retry with exponential backoff
  2. Use faster models for delegated tasks
  3. Implement request queuing
  4. Add local model fallbacks

Performance Checklist

Before deploying to production, verify:

  • Compaction configured with appropriate threshold
  • Model tiers configured for cost/latency
  • Fallback chains configured
  • Tool timeouts set appropriately
  • Session TTL reasonable for use case
  • SQLite optimized (WAL mode, cache size)
  • Database indexes created
  • Gateway connection limits set
  • Memory limits configured
  • Monitoring enabled
  • Logging level set to info or warn
  • Health checks working
  • Backup/restore tested

For more information: