flynn/docs/performance/TUNING.md
William Valentin 8a6cd7f559 docs: Add comprehensive documentation for production deployment and contribution
This commit adds 6 new documentation files to fill critical gaps:

- CONTRIBUTING.md: Developer onboarding guide with setup, workflow,
  code style, testing, and adding features

- TROUBLESHOOTING.md: Common issues and solutions for errors,
  model issues, tool issues, channel issues, gateway issues,
  configuration issues, and memory/database issues

- docs/api/PROTOCOL.md: Gateway JSON-RPC protocol documentation
  with connection, authentication, message format, methods,
  events, error codes, and example client implementation

- docs/api/TOOLS.md: Tools API documentation covering tool interface,
  input schema format, result format, tool patterns,
  tool registration, tool policy, execution flow, and
  builtin tools reference

- docs/deployment/PRODUCTION.md: Production deployment guide
  covering Docker deployment, systemd service, security,
  configuration, monitoring, backup & recovery, and
  performance tuning

- docs/performance/TUNING.md: Performance optimization guide
  covering context management, model routing, tool execution,
  memory & embeddings, session management, database
  performance, gateway performance, and resource usage

These files complement the existing excellent documentation
(README.md, AGENTS.md, ARCHITECTURE.md, STRUCTURE.md,
CONVENTIONS.md) to provide complete coverage for users,
developers, and operators.
2026-02-13 16:07:29 -08:00

Performance Tuning Guide

This guide covers performance optimization techniques for Flynn in production environments.

Overview

Flynn's performance depends on several factors:

  1. Context window efficiency: How efficiently tokens are used
  2. Model selection: Choosing the right model for each task
  3. Tool execution: Fast, reliable tool responses
  4. I/O operations: Database and file system access
  5. Concurrency: Handling multiple simultaneous requests

Performance Goals

  • Response time: < 5 seconds for simple queries
  • Context efficiency: > 80% token utilization
  • Throughput: 10-20 concurrent conversations
  • Resource usage: < 2GB memory, < 50% CPU

Context Management

Compaction Settings

Context compaction prevents conversations from exceeding model context windows.

agents:
  default:
    compaction:
      # Trigger compaction at 75% of context window
      thresholdPct: 75

      # Keep last 6 turns (user + assistant pairs)
      keepTurns: 6

      # Allow 2048 tokens for summary
      summaryMaxTokens: 2048

      # Preserve high-importance messages
      importanceThreshold: 0.8

Tuning Guidelines

For fast interactions:

thresholdPct: 60      # Compact early
keepTurns: 2          # Minimal history
summaryMaxTokens: 512  # Short summaries

For complex reasoning:

thresholdPct: 85      # Maximize context
keepTurns: 10         # More history
summaryMaxTokens: 4096 # Detailed summaries
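
To make the interaction between `thresholdPct` and `keepTurns` concrete, here is a minimal sketch of a compaction trigger. Only the two field names come from the configuration above; `shouldCompact`, `splitForCompaction`, and the `Message` shape are hypothetical names used for illustration, not Flynn's actual API. It also assumes that one turn is a user message plus an assistant reply.

```typescript
// Hypothetical types for illustration; only the field names come from the config above.
interface CompactionSettings {
  thresholdPct: number;
  keepTurns: number;
}

interface Message {
  role: 'user' | 'assistant';
  content: string;
}

// Compaction triggers once the estimated token count reaches thresholdPct of the window.
function shouldCompact(usedTokens: number, contextWindow: number, s: CompactionSettings): boolean {
  return usedTokens >= (contextWindow * s.thresholdPct) / 100;
}

// Assumes one turn = one user message + one assistant reply,
// so the last keepTurns * 2 messages are kept verbatim and the rest is summarized.
function splitForCompaction(history: Message[], s: CompactionSettings) {
  const keep = Math.min(history.length, s.keepTurns * 2);
  return {
    toSummarize: history.slice(0, history.length - keep),
    toKeep: history.slice(history.length - keep),
  };
}
```

With a 200,000-token window and `thresholdPct: 75`, this sketch would trigger at 150,000 estimated tokens; with `keepTurns: 6`, the last 12 messages would survive compaction unchanged.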

Context Depth Levels

Control how much context is injected into the system prompt:

prompt:
  contextDepth: 'normal'  # minimal | normal | detailed | debug
  • minimal: Only basic system prompt
  • normal: System prompt + basic memory
  • detailed: Full memory + tool descriptions
  • debug: Verbose context (development only)

Token Counting

Flynn uses rule-based token estimation (fast but approximate).

An exact tokenizer would be more accurate but slower. It is not available yet:

// Currently not implemented
// Future: Use tiktoken or similar for exact token counts
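
For readers unfamiliar with rule-based estimation, the sketch below shows the general idea. The ratio of roughly four characters per token is a common rule of thumb for English text and is an assumption here; the rule Flynn actually applies may differ, and `estimateTokens` is a hypothetical name.

```typescript
// Rough heuristic: about 4 characters per token for English text.
// This ratio is an assumption, not necessarily the rule Flynn uses.
const CHARS_PER_TOKEN = 4;

function estimateTokens(text: string): number {
  return Math.ceil(text.length / CHARS_PER_TOKEN);
}
```

An estimate like this is cheap enough to run on every message, which is why a safety margin in `thresholdPct` matters: the real token count can be noticeably higher for code or non-English text.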

Model Routing

Tier Configuration

Optimize model tiers for cost and latency:

models:
  router:
    tiers:
      # Fast, cheap: Quick tasks, delegated calls
      fast: 'anthropic:claude-haiku-4-20250514'

      # Default: General conversation
      default: 'anthropic:claude-sonnet-4-20250514'

      # Complex: Deep reasoning, analysis
      complex: 'anthropic:claude-opus-4-20250514'

      # Fallback: Local models when cloud fails
      local: 'ollama:llama3'

Delegation Tasks

Map delegation tasks to appropriate tiers:

agents:
  default:
    delegation:
      tiers:
        compaction: 'fast'           # Summarize history
        memoryExtraction: 'fast'     # Extract facts
        classification: 'default'     # Classify intent
        toolSummarization: 'default' # Summarize tool results
        complexReasoning: 'complex'  # Deep analysis

Fallback Chains

Configure fallback chains for resilience:

models:
  router:
    # Per-tier fallbacks: same model on another provider first, then a different model
    tierFallbacks:
      default:
        - 'github:claude-sonnet-4-5'
        - 'openai:gpt-4o-mini'

    # Global fallback when all tiers fail
    fallbackChain:
      - 'github:claude-sonnet-4-5'
      - 'local:ollama:llama3'
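
One way to picture how these settings combine is as a single ordered list of candidates per tier. The sketch below assumes the router tries the tier's primary model, then that tier's `tierFallbacks`, then the global `fallbackChain`; that ordering and the `candidateModels` helper are assumptions for illustration, not confirmed router behavior.

```typescript
interface RouterConfig {
  tiers: Record<string, string>;
  tierFallbacks?: Record<string, string[]>;
  fallbackChain?: string[];
}

// Ordered, de-duplicated list of models to try for a tier (assumed order):
// primary model, then tierFallbacks for that tier, then the global fallbackChain.
function candidateModels(config: RouterConfig, tier: string): string[] {
  const primary = config.tiers[tier];
  const ordered = [
    ...(primary ? [primary] : []),
    ...(config.tierFallbacks?.[tier] ?? []),
    ...(config.fallbackChain ?? []),
  ];
  // Keep only the first occurrence of each model ID.
  return ordered.filter((model, index) => ordered.indexOf(model) === index);
}
```

With the configuration above, the `default` tier would yield four candidates, because `github:claude-sonnet-4-5` appears in both lists and is only tried once.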

Retry Configuration

Optimize retry behavior for different scenarios:

models:
  router:
    retry:
      # More retries for transient failures
      maxAttempts: 3

      # Start with 1s delay
      initialDelayMs: 1000

      # Exponential backoff
      multiplier: 2

      # Max 30s between retries
      maxDelayMs: 30000

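      # Don't retry auth errors (retrying cannot fix them)
      nonRetryablePatterns:
        - 'invalid_api_key'
        - 'permission_denied'
        # Not an auth error: rate limits are transient. Remove this entry
        # if 429 responses should be retried with backoff.
        - 'rate_limit_exceeded'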

For production reliability:

maxAttempts: 5
initialDelayMs: 500
multiplier: 1.5
maxDelayMs: 60000
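
The delay schedule implied by these settings can be worked out with a few lines of code. This is a sketch under two assumptions: the delay before retry n is `initialDelayMs * multiplier^n` capped at `maxDelayMs`, and `maxAttempts` counts the first try, so there are `maxAttempts - 1` delays. `backoffDelays` is a hypothetical helper, and jitter is not modeled.

```typescript
interface RetryConfig {
  maxAttempts: number;
  initialDelayMs: number;
  multiplier: number;
  maxDelayMs: number;
}

// Delay before each retry: initialDelayMs * multiplier^retryIndex, capped at maxDelayMs.
// Assumes maxAttempts includes the first try, so maxAttempts - 1 delays are produced.
function backoffDelays(c: RetryConfig): number[] {
  const delays: number[] = [];
  for (let retry = 0; retry < c.maxAttempts - 1; retry++) {
    const uncapped = c.initialDelayMs * Math.pow(c.multiplier, retry);
    delays.push(Math.min(uncapped, c.maxDelayMs));
  }
  return delays;
}
```

Under these assumptions the default settings wait 1000 ms and then 2000 ms, while the production settings wait 500, 750, 1125, and 1687.5 ms, about 4 seconds in total. In both cases `maxDelayMs` is never reached, so it only matters if `maxAttempts` is raised considerably.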

Cost Estimation

Monitor token usage and costs:

// Model costs in USD per 1M tokens (example values; check current provider pricing)
const MODEL_COSTS = {
  'anthropic:claude-sonnet-4-20250514': {
    input: 3.0,    // $3 per 1M input tokens
    output: 15.0    // $15 per 1M output tokens
  },
  'anthropic:claude-haiku-4-20250514': {
    input: 0.25,
    output: 1.25
  }
};

Track usage with AgentOrchestrator.getUsageStats().
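
A per-request cost estimate follows directly from a price table like the one above. The sketch below is illustrative: `estimateCostUsd` is a hypothetical helper, and the table is passed in as a parameter rather than read from Flynn.

```typescript
// Prices are in USD per 1M tokens, as in the example table above.
interface ModelCost {
  input: number;
  output: number;
}

// Returns undefined for models missing from the table instead of guessing a price.
function estimateCostUsd(
  costs: Record<string, ModelCost>,
  model: string,
  inputTokens: number,
  outputTokens: number,
): number | undefined {
  const cost = costs[model];
  if (!cost) return undefined;
  return (inputTokens * cost.input + outputTokens * cost.output) / 1e6;
}
```

For example, a request with 10,000 input tokens and 2,000 output tokens at $3 and $15 per 1M tokens costs about $0.06, split evenly between input and output. Output tokens dominate quickly, which is one reason to keep `summaryMaxTokens` modest.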

Tool Execution

Timeout Configuration

Set appropriate timeouts for different tool types:

tools:
  executor:
    # Default 30s timeout
    defaultTimeoutMs: 30000

    # Max 50KB output
    maxOutputBytes: 51200

For long-running tools:

tools:
  executor:
    defaultTimeoutMs: 60000  # 60s

For fast tools:

tools:
  executor:
    defaultTimeoutMs: 10000  # 10s
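
Conceptually, a tool timeout races the tool call against a timer. The sketch below shows that pattern; `withTimeout` is a hypothetical helper, not Flynn's executor, and it only stops waiting for the result. Cancelling the underlying work (for example killing a child process) would need extra handling.

```typescript
// Rejects if the work does not settle within timeoutMs.
// Note: this stops waiting; it does not cancel the underlying operation.
function withTimeout<T>(work: Promise<T>, timeoutMs: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(
      () => reject(new Error(`Tool timed out after ${timeoutMs}ms`)),
      timeoutMs,
    );
    work.then(
      (value) => {
        clearTimeout(timer); // settle first, then release the timer
        resolve(value);
      },
      (err) => {
        clearTimeout(timer);
        reject(err);
      },
    );
  });
}
```

A timeout that is too short turns slow successes into failures that may then be retried, so it is usually safer to size `defaultTimeoutMs` from observed tool latency than to pick the lowest value.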

Caching (Future)

Implement caching for repeated operations:

# Not yet implemented
tools:
  cache:
    enabled: true
    ttl: 300  # 5 minutes
    maxSize: 1000
    excludePatterns:
      - 'shell.exec'
      - 'process.*'

Sandbox Performance

The Docker sandbox adds overhead. To reduce it:

sandbox:
  enabled: true
  image: 'node:22-alpine'

  # Resource limits
  resourceLimits:
    memory: '512m'
    cpus: '0.5'
    timeoutSec: 60

  # Host networking avoids bridge/NAT overhead but removes network isolation;
  # use it only when the sandboxed code is trusted
  networkMode: 'host'

For best performance:

sandbox:
  enabled: false  # Disable if not needed

Parallel Tool Execution

Flynn currently executes tools sequentially. Running independent tools in parallel is a possible future enhancement:

// Future enhancement
const results = await Promise.all([
  toolRegistry.execute('tool1', args1),
  toolRegistry.execute('tool2', args2),
  toolRegistry.execute('tool3', args3)
]);

Memory & Embeddings

Embedding Provider Selection

Choose embedding provider based on latency and cost:

memory:
  embeddings:
    provider: 'openai'  # openai | gemini | ollama | llamacpp | voyage

    openai:
      apiKey: '${OPENAI_API_KEY}'
      model: 'text-embedding-3-small'  # Fastest

    # Alternative: Local embeddings
    ollama:
      host: 'localhost:11434'
      model: 'nomic-embed-text'

Approximate latency per request (varies with hardware, network, and input size):

  • OpenAI text-embedding-3-small: ~100ms
  • Gemini: ~200ms
  • Ollama nomic-embed-text: ~500ms (local)
  • llama.cpp: ~300ms (local)

Text Chunking

Optimize chunking for better search:

memory:
  embeddings:
    chunking:
      # Smaller chunks for precision
      maxChunkSize: 512

      # Overlap for context preservation
      chunkOverlap: 50

      # Don't chunk small documents
      minChunkSize: 128

For fast indexing:

maxChunkSize: 1024
chunkOverlap: 100

For precise search:

maxChunkSize: 256
chunkOverlap: 25
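
The trade-off between these presets is easier to see with a sliding-window chunker. The sketch below treats sizes as character counts, which is an assumption (the real unit may be tokens), ignores `minChunkSize`, and uses a hypothetical `chunkText` helper.

```typescript
// Sliding-window chunker. Sizes are character counts here (an assumption).
// Each chunk starts (maxChunkSize - chunkOverlap) characters after the previous one.
function chunkText(text: string, maxChunkSize: number, chunkOverlap: number): string[] {
  if (chunkOverlap >= maxChunkSize) {
    throw new Error('chunkOverlap must be smaller than maxChunkSize');
  }
  if (text.length <= maxChunkSize) return [text];

  const step = maxChunkSize - chunkOverlap;
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + maxChunkSize));
    if (start + maxChunkSize >= text.length) break; // this window reached the end
  }
  return chunks;
}
```

For a 1,000-character document, this sketch produces 1 chunk with the fast-indexing preset (1024/100), 3 chunks with the default (512/50), and 5 chunks with the precise-search preset (256/25). Each chunk needs its own embedding, so smaller chunks trade indexing time and storage for retrieval precision.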

Hybrid Search Tuning

Balance keyword and vector search:

memory:
  search:
    # Weight vector search higher
    vectorWeight: 0.7
    keywordWeight: 0.3

    # Return top results
    limit: 10

    # Minimum relevance threshold
    threshold: 0.5

For keyword-heavy queries:

vectorWeight: 0.4
keywordWeight: 0.6

For semantic queries:

vectorWeight: 0.8
keywordWeight: 0.2
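
The effect of the weights can be illustrated with a weighted-sum ranker. This is a sketch, not Flynn's search code: it assumes both scores are already normalized to the range 0 to 1 and that `threshold` applies to the combined score. `rankHybrid` and `ScoredDoc` are hypothetical names.

```typescript
interface ScoredDoc {
  id: string;
  vectorScore: number;  // assumed normalized to [0, 1]
  keywordScore: number; // assumed normalized to [0, 1]
}

interface SearchConfig {
  vectorWeight: number;
  keywordWeight: number;
  limit: number;
  threshold: number;
}

// Weighted sum of both scores, then threshold filter, descending sort, and limit.
function rankHybrid(docs: ScoredDoc[], c: SearchConfig): Array<{ id: string; score: number }> {
  return docs
    .map((d) => ({
      id: d.id,
      score: c.vectorWeight * d.vectorScore + c.keywordWeight * d.keywordScore,
    }))
    .filter((d) => d.score >= c.threshold)
    .sort((a, b) => b.score - a.score)
    .slice(0, c.limit);
}
```

Consider a document with a strong semantic match (vector 0.9, keyword 0.1) and one with an exact keyword hit (vector 0.2, keyword 1.0). With weights 0.7/0.3 only the first passes a 0.5 threshold (0.66 versus 0.44); with 0.4/0.6 only the second does (0.42 versus 0.68).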

Embedding Caching

Cache embeddings to avoid recomputation:

memory:
  embeddings:
    cache:
      enabled: true
      ttl: 86400  # 24 hours
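
The idea behind a TTL cache is shown in the sketch below. It is a hypothetical in-memory cache keyed by the exact input text and assumes `ttl` is in seconds, as the "24 hours" comment suggests; Flynn's actual cache may be keyed, stored, and evicted differently.

```typescript
// Hypothetical in-memory TTL cache keyed by the exact input text.
class EmbeddingCache {
  private entries = new Map<string, { vector: number[]; expiresAt: number }>();
  private ttlMs: number;

  constructor(ttlSeconds: number) {
    this.ttlMs = ttlSeconds * 1000;
  }

  get(text: string, nowMs: number = Date.now()): number[] | undefined {
    const entry = this.entries.get(text);
    if (!entry) return undefined;
    if (entry.expiresAt <= nowMs) {
      this.entries.delete(text); // expired: drop it so the embedding is recomputed
      return undefined;
    }
    return entry.vector;
  }

  set(text: string, vector: number[], nowMs: number = Date.now()): void {
    this.entries.set(text, { vector, expiresAt: nowMs + this.ttlMs });
  }
}
```

Because an embedding depends on the model that produced it, a cache like this should be cleared (or keyed by model name as well) whenever the embedding model changes.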

Session Management

TTL Configuration

Set appropriate session TTLs:

sessions:
  ttl: '7d'  # Keep sessions for 7 days

  # Maximum concurrent sessions
  maxSessions: 100

For memory efficiency:

ttl: '1d'
maxSessions: 50

For long-term memory:

ttl: '30d'
maxSessions: 200
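
Duration strings such as `'1d'` and `'7d'` have to be converted to milliseconds before they can be compared with timestamps. The sketch below shows one way to do that; the accepted units (`s`, `m`, `h`, `d`) and the rule that expiry is measured from the last activity are assumptions, and both helper names are hypothetical.

```typescript
const UNIT_MS = { s: 1000, m: 60000, h: 3600000, d: 86400000 } as const;

// Parses strings such as '1h', '7d', or '30d' into milliseconds.
function parseDurationMs(value: string): number {
  const match = /^(\d+)([smhd])$/.exec(value.trim());
  if (!match) throw new Error(`Invalid duration: ${value}`);
  const amount = Number(match[1]);
  const unit = match[2] as keyof typeof UNIT_MS;
  return amount * UNIT_MS[unit];
}

// Assumes a session expires once it has been idle for longer than the TTL.
function isExpired(lastActiveAtMs: number, ttl: string, nowMs: number): boolean {
  return nowMs - lastActiveAtMs > parseDurationMs(ttl);
}
```

Under this assumption, a `ttl` of `'7d'` corresponds to 604,800,000 ms of inactivity.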

Session Pruning

Prune old sessions regularly:

automation:
  sessionPruner:
    enabled: true
    interval: '1h'  # Run every hour

    # Prune sessions older than TTL
    pruneOlderThan: '7d'

Session Indexing

Optimize session search with indexes:

-- SQLite indexes (IF NOT EXISTS makes the statements safe to re-run)
CREATE INDEX IF NOT EXISTS idx_sessions_created_at ON sessions(created_at);
CREATE INDEX IF NOT EXISTS idx_sessions_last_active ON sessions(last_active_at);
CREATE INDEX IF NOT EXISTS idx_messages_session_id ON messages(session_id);

Database Performance

SQLite Configuration

Optimize SQLite for Flynn's workload:

-- In SQLite connection setup
PRAGMA page_size = 4096;          -- Default page size; only affects a new database, so set it before enabling WAL
PRAGMA journal_mode = WAL;        -- Better concurrency
PRAGMA synchronous = NORMAL;      -- Faster writes
PRAGMA cache_size = -64000;       -- Negative values are in KiB: roughly 64MB cache
PRAGMA temp_store = MEMORY;       -- Store temp data in memory
PRAGMA mmap_size = 268435456;     -- 256MB mmap

Connection Pooling

Flynn uses a single SQLite connection per database. For high concurrency, consider:

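// Future: connection pool (sketch only, not runnable as-is)
// better-sqlite3 does not ship a pool; ConnectionPool is a hypothetical
// wrapper that would manage several Database handles.
import Database from 'better-sqlite3';

const pool = new ConnectionPool({
  filename: '/path/to/database.db',
  maxConnections: 10
});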

Query Optimization

Use indexed columns in queries:

// Good: Uses index
const sessions = db.prepare(`
  SELECT * FROM sessions
  WHERE last_active_at > ?
  ORDER BY last_active_at DESC
  LIMIT 10
`).all(threshold);

// Bad: Full table scan (message_count is not indexed)
const busySessions = db.prepare(`
  SELECT * FROM sessions
  WHERE message_count > ?
`).all(threshold);

Vacuum and Analyze

Regular maintenance improves performance:

# Vacuum to reclaim space
sqlite3 sessions.db "VACUUM;"

# Analyze for query optimization
sqlite3 sessions.db "ANALYZE;"

# Rebuild indexes
sqlite3 sessions.db "REINDEX;"

Add to crontab (monthly):

0 0 1 * * sqlite3 /var/lib/flynn/sessions.db "VACUUM; ANALYZE;" >> /var/log/flynn-maintenance.log 2>&1

Gateway Performance

Connection Limits

Limit concurrent connections:

gateway:
  enabled: true
  port: 18800

  # Maximum concurrent WebSocket connections
  maxConnections: 50

  # Single-client lock
  lock:
    enabled: true  # Only one client at a time

For multiple users:

gateway:
  maxConnections: 100
  lock:
    enabled: false

Lane Queue

The lane queue serializes requests per session:

gateway:
  laneQueue:
    # Max requests per session
    maxDepth: 10

    # Request timeout
    requestTimeoutMs: 30000
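
The behavior described here, one request at a time per session with a bounded backlog, can be sketched with a promise chain per session. `LaneQueue` below is a hypothetical illustration rather than Flynn's implementation; it assumes `maxDepth` counts both the running request and the waiting ones, and it does not model `requestTimeoutMs`.

```typescript
// Hypothetical per-session serial queue; not Flynn's actual implementation.
class LaneQueue {
  private tails = new Map<string, Promise<void>>();
  private depths = new Map<string, number>();
  private maxDepth: number;

  constructor(maxDepth: number) {
    this.maxDepth = maxDepth;
  }

  enqueue<T>(sessionId: string, task: () => Promise<T>): Promise<T> {
    const depth = this.depths.get(sessionId) ?? 0;
    if (depth >= this.maxDepth) {
      return Promise.reject(new Error(`Lane for session ${sessionId} is full`));
    }
    this.depths.set(sessionId, depth + 1);

    // The tail never rejects, so a failed task does not block later ones.
    const tail = this.tails.get(sessionId) ?? Promise.resolve();
    const result = tail.then(task);
    const release = () => this.release(sessionId);
    this.tails.set(sessionId, result.then(release, release));
    return result;
  }

  private release(sessionId: string): void {
    const depth = (this.depths.get(sessionId) ?? 1) - 1;
    if (depth <= 0) {
      // Lane is idle: drop its bookkeeping so idle sessions use no memory.
      this.depths.delete(sessionId);
      this.tails.delete(sessionId);
    } else {
      this.depths.set(sessionId, depth);
    }
  }
}
```

Requests for different sessions still run concurrently in this design; only requests that share a session ID wait for each other, which keeps conversation history consistent without serializing the whole gateway.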

WebSocket Optimization

Configure WebSocket for performance:

// Gateway server WebSocket options
const wsOptions = {
  // Enable compression
  perMessageDeflate: {
    threshold: 1024
  },

  // Track connected clients (wss.clients), e.g. to iterate them for heartbeat pings
  clientTracking: true,

  // Maximum message size
  maxPayload: 16 * 1024 * 1024  // 16MB
};

HTTP Server

Optimize HTTP server for static files:

gateway:
  static:
    # Enable gzip compression
    gzip: true

    # Cache static assets
    cacheControl: 'public, max-age=3600'

    # Serve index.html for SPA routes
    spa: true

Resource Usage

Node.js Options

Tune Node.js for production:

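# Increase the V8 heap limit (in MB); keep it below any service or container memory limit
export NODE_OPTIONS="--max-old-space-size=4096"

# V8-only flags such as --optimize-for-size are not accepted in NODE_OPTIONS;
# pass them on the node command line instead. Avoid --gc-interval: it is a
# V8 debugging flag that forces frequent garbage collections.
node --max-old-space-size=4096 --optimize-for-size dist/cli/index.js start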

In systemd service:

Environment="NODE_OPTIONS=--max-old-space-size=4096"

Process Limits

Set appropriate limits:

[Service]
# Memory limit (2GB); MemoryMax= replaces the deprecated MemoryLimit=
MemoryMax=2G
MemorySwapMax=0

# CPU quota (200% = 2 cores)
CPUQuota=200%

# File descriptors
LimitNOFILE=65536

Docker Resource Limits

Constrain the Docker container:

services:
  flynn:
    deploy:
      resources:
        limits:
          cpus: '2.0'
          memory: 2G
        reservations:
          cpus: '1.0'
          memory: 1G

Memory Monitoring

Monitor memory usage:

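# Check Flynn memory (the [f] keeps grep from matching itself)
ps aux | grep '[f]lynn'

# System memory
free -h

# Node.js heap stats: this line is JavaScript, add it to the application code
console.log('Heap used:', (process.memoryUsage().heapUsed / 1024 / 1024).toFixed(1), 'MB');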

Monitoring & Profiling

Health Checks

Enable heartbeat health checks:

automation:
  heartbeat:
    enabled: true
    interval: '5m'
    checks:
      - 'gateway'
      - 'model'
      - 'channels'
      - 'memory'
      - 'disk'

Check health:

curl http://localhost:18800/health

Logging Levels

Configure logging appropriately:

logging:
  level: 'info'  # debug | info | warn | error

  • Development: debug - All messages
  • Production: info - Normal operation
  • Minimal: warn - Only warnings and errors

Performance Metrics

Track key metrics:

// Future: Metrics collection
interface Metrics {
  // Response times
  avgResponseTime: number;
  p95ResponseTime: number;
  p99ResponseTime: number;

  // Throughput
  requestsPerSecond: number;
  concurrentSessions: number;

  // Token usage
  avgInputTokens: number;
  avgOutputTokens: number;
  totalTokens: number;

  // Errors
  errorRate: number;
  timeoutRate: number;
}
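
Until metrics collection exists, the response-time fields can be computed from raw samples. The sketch below uses the nearest-rank method for percentiles; `percentile` and `summarizeResponseTimes` are hypothetical helpers, and the sample array is assumed to hold response times in milliseconds.

```typescript
// Nearest-rank percentile: the smallest sample such that at least p% of samples are <= it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) return NaN;
  const sorted = samples.slice().sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  const index = Math.min(sorted.length - 1, Math.max(0, rank - 1));
  return sorted[index] as number;
}

function summarizeResponseTimes(samplesMs: number[]) {
  const total = samplesMs.reduce((sum, v) => sum + v, 0);
  return {
    avgResponseTime: samplesMs.length > 0 ? total / samplesMs.length : NaN,
    p95ResponseTime: percentile(samplesMs, 95),
    p99ResponseTime: percentile(samplesMs, 99),
  };
}
```

Percentiles are usually more informative than the average for the response-time goal stated at the top of this guide, because a handful of slow model or tool calls can hide behind a healthy mean.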

Profiling

Profile Node.js execution:

# Generate CPU profile
node --prof dist/cli/index.js start

# Process profile
node --prof-process isolate-*.log > profile.txt

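# Analyze with Chrome DevTools: --prof logs cannot be loaded there, so
# record a .cpuprofile instead and load that file in DevTools
node --cpu-prof dist/cli/index.js start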

Flamegraphs

Generate flamegraphs for bottleneck analysis:

# Install 0x
npm install -g 0x

# Run with profiler
0x -- node dist/cli/index.js start

Common Performance Issues

High Memory Usage

Symptoms:

  • OOM errors
  • Slow garbage collection
  • System swapping

Solutions:

  1. Reduce keepTurns in compaction
  2. Decrease session TTL
  3. Prune old sessions
  4. Increase Node.js memory limit
  5. Check for memory leaks

Slow Response Times

Symptoms:

  • Responses > 10 seconds
  • Timeouts
  • Poor user experience

Solutions:

  1. Switch to faster model tier
  2. Enable compaction
  3. Use local fallbacks
  4. Optimize tool timeouts
  5. Check network latency

High CPU Usage

Symptoms:

  • CPU > 80%
  • Slow system
  • High latency

Solutions:

  1. Reduce concurrent sessions
  2. Optimize database queries
  3. Use efficient embeddings
  4. Disable unnecessary features
  5. Scale vertically (more CPU)

Database Locks

Symptoms:

  • SQLite database locked errors
  • Slow writes
  • Concurrent access issues

Solutions:

  1. Enable WAL mode
  2. Reduce write frequency
  3. Use connection pooling
  4. Add appropriate indexes

Model Rate Limits

Symptoms:

  • 429 Too Many Requests errors
  • Frequent fallbacks
  • Increased latency

Solutions:

  1. Configure retry with exponential backoff
  2. Use faster models for delegated tasks
  3. Implement request queuing
  4. Add local model fallbacks

Performance Checklist

Before deploying to production, verify:

  • Compaction configured with appropriate threshold
  • Model tiers configured for cost/latency
  • Fallback chains configured
  • Tool timeouts set appropriately
  • Session TTL reasonable for use case
  • SQLite optimized (WAL mode, cache size)
  • Database indexes created
  • Gateway connection limits set
  • Memory limits configured
  • Monitoring enabled
  • Logging level set to info or warn
  • Health checks working
  • Backup/restore tested

For more information: