William Valentin 9c8da41610 docs: add proactive context usage and compaction guidance

2026-02-16 15:44:13 -08:00

Performance Tuning Guide

This guide covers performance optimization techniques for Flynn in production environments.

Overview
Context Management
Model Routing
Tool Execution
Memory & Embeddings
Session Management
Database Performance
Gateway Performance
Resource Usage
Monitoring & Profiling

Overview

Flynn's performance depends on several factors:

Context window efficiency: How efficiently tokens are used
Model selection: Choosing the right model for each task
Tool execution: Fast, reliable tool responses
I/O operations: Database and file system access
Concurrency: Handling multiple simultaneous requests

Performance Goals

Response time: < 5 seconds for simple queries
Context efficiency: > 80% token utilization
Throughput: 10-20 concurrent conversations
Resource usage: < 2GB memory, < 50% CPU

Context Management

Compaction Settings

Context compaction prevents conversations from exceeding model context windows.

agents:
  default:
    compaction:
      # Trigger compaction at 75% of context window
      thresholdPct: 75

      # Keep last 6 turns (user + assistant pairs)
      keepTurns: 6

      # Allow 2048 tokens for summary
      summaryMaxTokens: 2048

      # Preserve high-importance messages
      importanceThreshold: 0.8
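As a rough illustration of how `thresholdPct` and `keepTurns` might interact, the sketch below decides when to compact and which messages would be summarized. The type and function names are hypothetical, and it assumes strictly alternating user/assistant messages; Flynn's actual compaction code is not shown in this guide, and `importanceThreshold` is not modeled here.

```typescript
// Hypothetical types and helpers for illustration only.
interface CompactionSettings {
  thresholdPct: number; // trigger point as a percentage of the context window
  keepTurns: number;    // recent user + assistant pairs kept verbatim
}

interface Message {
  role: 'user' | 'assistant';
  content: string;
}

// True once estimated usage reaches the configured share of the window.
function shouldCompact(usedTokens: number, contextWindow: number, s: CompactionSettings): boolean {
  return (usedTokens / contextWindow) * 100 >= s.thresholdPct;
}

// Split history into the part to summarize and the part to keep verbatim.
// Assumes one turn = 2 messages (user followed by assistant).
function splitForCompaction(history: Message[], s: CompactionSettings) {
  const keepCount = Math.min(history.length, s.keepTurns * 2);
  const cut = history.length - keepCount;
  return { toSummarize: history.slice(0, cut), toKeep: history.slice(cut) };
}

const settings: CompactionSettings = { thresholdPct: 75, keepTurns: 6 };
console.log(shouldCompact(150000, 200000, settings)); // true (usage is exactly 75%)
```

With these values, a 20-message history would keep the last 12 messages and hand the first 8 to the summarizer.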

Tuning Guidelines

For fast interactions:

thresholdPct: 60      # Compact early
keepTurns: 2          # Minimal history
summaryMaxTokens: 512  # Short summaries

For complex reasoning:

thresholdPct: 85      # Maximize context
keepTurns: 10         # More history
summaryMaxTokens: 4096 # Detailed summaries

Proactive Compaction Signals

Use proactive thresholds to checkpoint before compaction cliffs and emit warning telemetry:

compaction:
  proactive:
    enabled: true
    warn_pct: 75
    checkpoint_pct: 85
    auto_compact_pct: 95
    checkpoint_cooldown_ms: 300000
    memory_namespace: session/checkpoints

Context Depth Levels

Control how much context is injected into the system prompt:

prompt:
  contextDepth: 'normal'  # minimal | normal | detailed | debug

minimal: Only basic system prompt
normal: System prompt + basic memory
detailed: Full memory + tool descriptions
debug: Verbose context (development only)

Token Counting

Flynn uses rule-based token estimation (fast but approximate).

An exact tokenizer would be more accurate but slower; it is not implemented yet:

// Currently not implemented
// Future: Use tiktoken or similar for exact token counts
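For a sense of what rule-based estimation can look like, here is a minimal sketch. It assumes the common rule of thumb of roughly four characters per token for English text; the guide does not state which rules Flynn applies, so treat both the function name and the ratio as hypothetical.

```typescript
// Assumption: about 4 characters per token. Real tokenizers vary by model and language.
function estimateTokens(text: string, charsPerToken = 4): number {
  if (text.length === 0) return 0;
  return Math.ceil(text.length / charsPerToken);
}

console.log(estimateTokens('abcde')); // 2
```

An estimate like this tends to drift on code, non-English text, and long numbers, which is one reason to leave headroom in `thresholdPct`.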

Model Routing

Tier Configuration

Optimize model tiers for cost and latency:

models:
  router:
    tiers:
      # Fast, cheap: Quick tasks, delegated calls
      fast: 'anthropic:claude-haiku-4-20250514'

      # Default: General conversation
      default: 'anthropic:claude-sonnet-4-20250514'

      # Complex: Deep reasoning, analysis
      complex: 'anthropic:claude-opus-4-20250514'

      # Fallback: Local models when cloud fails
      local: 'ollama:llama3'

Delegation Tasks

Map delegation tasks to appropriate tiers:

agents:
  default:
    delegation:
      tiers:
        compaction: 'fast'           # Summarize history
        memoryExtraction: 'fast'     # Extract facts
        classification: 'default'     # Classify intent
        toolSummarization: 'default' # Summarize tool results
        complexReasoning: 'complex'  # Deep analysis

Fallback Chains

Configure fallback chains for resilience:

models:
  router:
    # Per-tier fallbacks: the same model on another provider first, then a different model
    tierFallbacks:
      default:
        - 'github:claude-sonnet-4-5'
        - 'openai:gpt-4o-mini'

    # Global fallback when all tiers fail
    fallbackChain:
      - 'github:claude-sonnet-4-5'
      - 'local:ollama:llama3'
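The order in which these lists are consulted is not spelled out here. A plausible reading, sketched below, is: the tier's own model, then that tier's `tierFallbacks`, then the global `fallbackChain`, with duplicates skipped. The `RouterConfig` shape and `candidateModels` helper are hypothetical; the model strings are copied from the configuration above.

```typescript
// Hypothetical config shape mirroring the YAML keys shown in this section.
interface RouterConfig {
  tiers: Record<string, string>;
  tierFallbacks: Record<string, string[]>;
  fallbackChain: string[];
}

// Assumed resolution order: tier model, tier fallbacks, global chain (deduplicated).
function candidateModels(tier: string, cfg: RouterConfig): string[] {
  const ordered: (string | undefined)[] = [
    cfg.tiers[tier],
    ...(cfg.tierFallbacks[tier] || []),
    ...cfg.fallbackChain,
  ];
  const result: string[] = [];
  for (const model of ordered) {
    if (model !== undefined && result.indexOf(model) === -1) result.push(model);
  }
  return result;
}

const router: RouterConfig = {
  tiers: { default: 'anthropic:claude-sonnet-4-20250514' },
  tierFallbacks: { default: ['github:claude-sonnet-4-5', 'openai:gpt-4o-mini'] },
  fallbackChain: ['github:claude-sonnet-4-5', 'local:ollama:llama3'],
};
console.log(candidateModels('default', router));
```

Here `github:claude-sonnet-4-5` appears in both lists but would only be tried once.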

Retry Configuration

Optimize retry behavior for different scenarios:

models:
  router:
    retry:
      # More retries for transient failures
      maxAttempts: 3

      # Start with 1s delay
      initialDelayMs: 1000

      # Exponential backoff
      multiplier: 2

      # Max 30s between retries
      maxDelayMs: 30000

      # Don't retry errors matching these patterns (auth failures and rate-limit errors)
      nonRetryablePatterns:
        - 'invalid_api_key'
        - 'permission_denied'
        - 'rate_limit_exceeded'

For production reliability:

maxAttempts: 5
initialDelayMs: 500
multiplier: 1.5
maxDelayMs: 60000
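To compare the two profiles, it helps to write out the delays they produce. The sketch below assumes plain exponential backoff with no jitter and assumes `maxAttempts` counts the initial attempt, so there are `maxAttempts - 1` waits; neither detail is confirmed by this guide, and `backoffDelays` is a hypothetical helper.

```typescript
interface RetryConfig {
  maxAttempts: number;
  initialDelayMs: number;
  multiplier: number;
  maxDelayMs: number;
}

// Delay before each retry: initialDelayMs * multiplier^retryIndex, capped at maxDelayMs.
function backoffDelays(cfg: RetryConfig): number[] {
  const delays: number[] = [];
  for (let retry = 0; retry < cfg.maxAttempts - 1; retry++) {
    const raw = cfg.initialDelayMs * Math.pow(cfg.multiplier, retry);
    delays.push(Math.min(raw, cfg.maxDelayMs));
  }
  return delays;
}

const defaultRetry: RetryConfig = { maxAttempts: 3, initialDelayMs: 1000, multiplier: 2, maxDelayMs: 30000 };
const productionRetry: RetryConfig = { maxAttempts: 5, initialDelayMs: 500, multiplier: 1.5, maxDelayMs: 60000 };

console.log(backoffDelays(defaultRetry));    // [ 1000, 2000 ]
console.log(backoffDelays(productionRetry)); // [ 500, 750, 1125, 1687.5 ]
```

Under these assumptions the production profile makes two more attempts while adding only about one extra second of total waiting (roughly 4.1s versus 3s).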

Cost Estimation

Monitor token usage and costs:

// Model costs in USD per 1M tokens (example values)
const MODEL_COSTS = {
  'anthropic:claude-sonnet-4-20250514': {
    input: 3.0,    // $3 per 1M input tokens
    output: 15.0    // $15 per 1M output tokens
  },
  'anthropic:claude-haiku-4-20250514': {
    input: 0.25,
    output: 1.25
  }
};

Track usage with AgentOrchestrator.getUsageStats().
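Given per-million-token prices like the ones above, the cost of a request is simple arithmetic. The sketch below uses a hypothetical `estimateCostUsd` helper and takes token counts as plain arguments, since the shape of the `getUsageStats()` result is not documented here.

```typescript
interface ModelCost {
  input: number;  // USD per 1M input tokens
  output: number; // USD per 1M output tokens
}

// Example prices copied from the table above; check current provider pricing before relying on them.
const EXAMPLE_COSTS: Record<string, ModelCost> = {
  'anthropic:claude-sonnet-4-20250514': { input: 3.0, output: 15.0 },
  'anthropic:claude-haiku-4-20250514': { input: 0.25, output: 1.25 },
};

// Returns undefined for models without a known price instead of guessing.
function estimateCostUsd(model: string, inputTokens: number, outputTokens: number): number | undefined {
  const cost = EXAMPLE_COSTS[model];
  if (!cost) return undefined;
  return (inputTokens * cost.input + outputTokens * cost.output) / 1000000;
}

console.log(estimateCostUsd('anthropic:claude-sonnet-4-20250514', 200000, 10000)); // 0.75
console.log(estimateCostUsd('anthropic:claude-haiku-4-20250514', 200000, 10000));  // 0.0625
```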

Tool Execution

Timeout Configuration

Set appropriate timeouts for different tool types:

tools:
  executor:
    # Default 30s timeout
    defaultTimeoutMs: 30000

    # Max 50KB output
    maxOutputBytes: 51200

For long-running tools:

tools:
  executor:
    defaultTimeoutMs: 60000  # 60s

For fast tools:

tools:
  executor:
    defaultTimeoutMs: 10000  # 10s
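A timeout such as `defaultTimeoutMs` is commonly enforced by racing the tool's promise against a timer. The sketch below shows that pattern with a hypothetical `withTimeout` helper; it is not Flynn's executor, and note that it only stops waiting for the result, it does not cancel the underlying work.

```typescript
// Rejects if `work` has not settled within `timeoutMs`; otherwise passes its result through.
function withTimeout<T>(work: Promise<T>, timeoutMs: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_resolve, reject) => {
    timer = setTimeout(() => reject(new Error(`Tool timed out after ${timeoutMs}ms`)), timeoutMs);
  });
  // Clear the timer in both outcomes so it does not keep the process alive.
  return Promise.race([work, timeout]).finally(() => clearTimeout(timer));
}

withTimeout(Promise.resolve('done'), 10000).then((value) => console.log(value)); // done
```

Tools that spawn processes would also need an explicit kill or abort signal on timeout, which this sketch leaves out.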

Caching (Future)

Result caching for repeated operations is planned. A proposed configuration:

# Not yet implemented
tools:
  cache:
    enabled: true
    ttl: 300  # 5 minutes
    maxSize: 1000
    excludePatterns:
      - 'shell.exec'
      - 'process.*'

Sandbox Performance

Docker sandbox adds overhead. Optimize:

sandbox:
  enabled: true
  image: 'node:22-alpine'

  # Resource limits
  resourceLimits:
    memory: '512m'
    cpus: '0.5'
    timeoutSec: 60

  # Use host networking if safe
  networkMode: 'host'  # Faster than bridge mode

For best performance:

sandbox:
  enabled: false  # Disable if not needed

Parallel Tool Execution

Flynn currently executes tools sequentially. Parallel execution of independent tools is a possible future enhancement:

// Future enhancement
const results = await Promise.all([
  toolRegistry.execute('tool1', args1),
  toolRegistry.execute('tool2', args2),
  toolRegistry.execute('tool3', args3)
]);

Memory & Embeddings

Embedding Provider Selection

Choose embedding provider based on latency and cost:

memory:
  embeddings:
    provider: 'openai'  # openai | gemini | ollama | llamacpp | voyage

    openai:
      apiKey: '${OPENAI_API_KEY}'
      model: 'text-embedding-3-small'  # Fastest

    # Alternative: Local embeddings
    ollama:
      host: 'localhost:11434'
      model: 'nomic-embed-text'

Latency comparison:

OpenAI text-embedding-3-small: ~100ms
Gemini: ~200ms
Ollama nomic-embed-text: ~500ms (local)
llama.cpp: ~300ms (local)

Text Chunking

Optimize chunking for better search:

memory:
  embeddings:
    chunking:
      # Smaller chunks for precision
      maxChunkSize: 512

      # Overlap for context preservation
      chunkOverlap: 50

      # Don't chunk small documents
      minChunkSize: 128

For fast indexing:

maxChunkSize: 1024
chunkOverlap: 100

For precise search:

maxChunkSize: 256
chunkOverlap: 25
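The effect of `maxChunkSize` and `chunkOverlap` is easiest to see with a sliding window. The sketch below assumes sizes are measured in characters (the guide does not say whether Flynn counts characters or tokens) and does not model `minChunkSize`; `chunkText` is a hypothetical helper.

```typescript
// Fixed-size sliding window: each chunk starts (maxChunkSize - chunkOverlap) after the previous one.
function chunkText(text: string, maxChunkSize: number, chunkOverlap: number): string[] {
  if (chunkOverlap < 0 || chunkOverlap >= maxChunkSize) {
    throw new RangeError('chunkOverlap must be >= 0 and smaller than maxChunkSize');
  }
  const step = maxChunkSize - chunkOverlap;
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + maxChunkSize));
    if (start + maxChunkSize >= text.length) break; // this chunk already reached the end
  }
  return chunks;
}

console.log(chunkText('abcdefghij', 4, 1)); // [ 'abcd', 'defg', 'ghij' ]
```

Smaller chunks mean more embeddings to compute and store per document, which is the indexing cost the "fast indexing" profile avoids.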

Hybrid Search Tuning

Balance keyword and vector search:

memory:
  search:
    # Weight vector search higher
    vectorWeight: 0.7
    keywordWeight: 0.3

    # Return top results
    limit: 10

    # Minimum relevance threshold
    threshold: 0.5

For keyword-heavy queries:

vectorWeight: 0.4
keywordWeight: 0.6

For semantic queries:

vectorWeight: 0.8
keywordWeight: 0.2
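A weighted sum is one straightforward way to combine the two scores. The sketch below assumes both scores are already normalized to the range 0 to 1 and that `threshold` applies to the combined score; Flynn's actual fusion method is not described in this guide, and the names are hypothetical.

```typescript
interface SearchConfig {
  vectorWeight: number;
  keywordWeight: number;
  limit: number;
  threshold: number;
}

interface Candidate {
  id: string;
  vectorScore: number;  // assumed normalized to [0, 1]
  keywordScore: number; // assumed normalized to [0, 1]
}

// Combine scores, drop results under the threshold, and return the best `limit` hits.
function rankHybrid(candidates: Candidate[], cfg: SearchConfig): { id: string; score: number }[] {
  return candidates
    .map((c) => ({ id: c.id, score: cfg.vectorWeight * c.vectorScore + cfg.keywordWeight * c.keywordScore }))
    .filter((r) => r.score >= cfg.threshold)
    .sort((a, b) => b.score - a.score)
    .slice(0, cfg.limit);
}

const sampleCandidates: Candidate[] = [
  { id: 'a', vectorScore: 0.9, keywordScore: 0.2 },
  { id: 'b', vectorScore: 0.3, keywordScore: 0.9 },
  { id: 'c', vectorScore: 0.6, keywordScore: 0.8 },
];
const semantic: SearchConfig = { vectorWeight: 0.7, keywordWeight: 0.3, limit: 10, threshold: 0.5 };
console.log(rankHybrid(sampleCandidates, semantic).map((r) => r.id)); // [ 'a', 'c' ]
```

With the keyword-heavy weights (0.4 / 0.6) the same candidates would rank as `c`, `b`, and `a` would fall under the threshold.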

Embedding Caching

Cache embeddings to avoid recomputation:

memory:
  embeddings:
    cache:
      enabled: true
      ttl: 86400  # 24 hours

Session Management

TTL Configuration

Set appropriate session TTLs:

sessions:
  ttl: '7d'  # Keep sessions for 7 days

  # Maximum concurrent sessions
  maxSessions: 100

For memory efficiency:

ttl: '1d'
maxSessions: 50

For long-term memory:

ttl: '30d'
maxSessions: 200
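To make the TTL semantics concrete, the sketch below parses a duration string such as `'7d'` and checks whether a session has been idle longer than that. The accepted units (`s`, `m`, `h`, `d`) and the idea that expiry is measured from last activity are assumptions; the helper names are hypothetical.

```typescript
const UNIT_MS: Record<string, number> = {
  s: 1000,
  m: 60000,
  h: 3600000,
  d: 86400000,
};

// Parses strings like '30m', '1h', '7d' into milliseconds.
function parseDurationMs(ttl: string): number {
  const match = /^(\d+)([smhd])$/.exec(ttl);
  if (!match) throw new Error(`Unsupported duration: ${ttl}`);
  return Number(match[1]) * UNIT_MS[match[2]];
}

// A session counts as expired once it has been idle for longer than the TTL.
function isExpired(lastActiveAtMs: number, nowMs: number, ttl: string): boolean {
  return nowMs - lastActiveAtMs > parseDurationMs(ttl);
}

console.log(parseDurationMs('7d')); // 604800000
```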

Session Pruning

Prune old sessions regularly:

automation:
  sessionPruner:
    enabled: true
    interval: '1h'  # Run every hour

    # Prune sessions older than TTL
    pruneOlderThan: '7d'

Session Indexing

Optimize session search with indexes:

-- SQLite indexes
CREATE INDEX idx_sessions_created_at ON sessions(created_at);
CREATE INDEX idx_sessions_last_active ON sessions(last_active_at);
CREATE INDEX idx_messages_session_id ON messages(session_id);

Database Performance

SQLite Configuration

Optimize SQLite for Flynn's workload:

-- In SQLite connection setup
PRAGMA journal_mode = WAL;        -- Better concurrency
PRAGMA synchronous = NORMAL;      -- Faster writes
PRAGMA cache_size = -64000;       -- ~64MB cache (a negative value is a size in KiB)
PRAGMA temp_store = MEMORY;       -- Store temp data in memory
PRAGMA mmap_size = 268435456;     -- 256MB mmap
PRAGMA page_size = 4096;          -- Default page size; must be set before the database is created

Connection Pooling

Flynn uses a single SQLite connection per database. For high concurrency, consider a pool of connections:

// Future: connection pool.
// ConnectionPool is a hypothetical wrapper; better-sqlite3 does not provide one.
import Database from 'better-sqlite3';

const pool = new ConnectionPool({
  filename: '/path/to/database.db',
  maxConnections: 10
});

Query Optimization

Use indexed columns in queries:

// Good: filters and sorts on last_active_at, which is indexed
const recentSessions = db.prepare(`
  SELECT * FROM sessions
  WHERE last_active_at > ?
  ORDER BY last_active_at DESC
  LIMIT 10
`).all(threshold);

// Bad: message_count has no index, so this is a full table scan
const busySessions = db.prepare(`
  SELECT * FROM sessions
  WHERE message_count > ?
`).all(threshold);

Vacuum and Analyze

Regular maintenance improves performance:

# Vacuum to reclaim space
sqlite3 sessions.db "VACUUM;"

# Analyze for query optimization
sqlite3 sessions.db "ANALYZE;"

# Rebuild indexes
sqlite3 sessions.db "REINDEX;"

Add to crontab (monthly):

0 0 1 * * sqlite3 /var/lib/flynn/sessions.db "VACUUM; ANALYZE;" >> /var/log/flynn-maintenance.log 2>&1

Gateway Performance

Connection Limits

Limit concurrent connections:

gateway:
  enabled: true
  port: 18800

  # Maximum concurrent WebSocket connections
  maxConnections: 50

  # Single-client lock
  lock:
    enabled: true  # Only one client at a time

For multiple users:

gateway:
  maxConnections: 100
  lock:
    enabled: false

Lane Queue

The lane queue serializes requests per session:

gateway:
  laneQueue:
    # Max requests per session
    maxDepth: 10

    # Request timeout
    requestTimeoutMs: 30000
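Per-session serialization can be pictured as one promise chain per session, with `maxDepth` bounding how many requests may wait in that chain. The class below is a minimal sketch of that idea under those assumptions; `LaneQueue` and its methods are hypothetical names, not Flynn's gateway code, and `requestTimeoutMs` is not modeled.

```typescript
class LaneQueue {
  private tails = new Map<string, Promise<unknown>>();
  private depths = new Map<string, number>();

  constructor(private readonly maxDepth: number) {}

  // Runs `task` after all earlier tasks for the same session have settled.
  enqueue<T>(sessionId: string, task: () => Promise<T>): Promise<T> {
    const depth = this.depths.get(sessionId) ?? 0;
    if (depth >= this.maxDepth) {
      return Promise.reject(new Error(`Lane full for session ${sessionId}`));
    }
    this.depths.set(sessionId, depth + 1);

    const release = () => {
      const remaining = (this.depths.get(sessionId) ?? 1) - 1;
      if (remaining <= 0) {
        // Lane drained: drop bookkeeping so idle sessions hold no memory.
        this.depths.delete(sessionId);
        this.tails.delete(sessionId);
      } else {
        this.depths.set(sessionId, remaining);
      }
    };

    const prev = this.tails.get(sessionId) ?? Promise.resolve();
    const result = prev.then(task);
    // The stored tail never rejects, so one failed request does not block the lane.
    this.tails.set(sessionId, result.then(release, release));
    return result;
  }
}

const lanes = new LaneQueue(10);
lanes.enqueue('session-1', async () => 'handled').then((r) => console.log(r)); // handled
```

Requests for different sessions use separate chains, so they still run concurrently with each other.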

WebSocket Optimization

Configure WebSocket for performance:

// Gateway server WebSocket options
const wsOptions = {
  // Enable compression
  perMessageDeflate: {
    threshold: 1024
  },

  // Track connected clients in wss.clients (needed to iterate them for heartbeat pings)
  clientTracking: true,

  // Maximum message size
  maxPayload: 16 * 1024 * 1024  // 16MB
};

HTTP Server

Optimize HTTP server for static files:

gateway:
  static:
    # Enable gzip compression
    gzip: true

    # Cache static assets
    cacheControl: 'public, max-age=3600'

    # Serve index.html for SPA routes
    spa: true

Resource Usage

Node.js Options

Tune Node.js for production:

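# Increase the V8 heap limit to 4GB
export NODE_OPTIONS="--max-old-space-size=4096"

# Other V8 flags such as --optimize-for-size are rejected in NODE_OPTIONS;
# pass them on the node command line instead.
# --gc-interval is left out: it forces a collection every N allocations and is a debugging aid.
node --max-old-space-size=4096 --optimize-for-size dist/cli/index.js start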

In systemd service:

Environment="NODE_OPTIONS=--max-old-space-size=4096"

Process Limits

Set appropriate limits:

[Service]
# Memory limit (2GB), no swap
MemoryMax=2G
MemorySwapMax=0

# CPU quota (200% = 2 cores)
CPUQuota=200%

# File descriptors
LimitNOFILE=65536

Docker Resource Limits

Constrain Docker container:

services:
  flynn:
    deploy:
      resources:
        limits:
          cpus: '2.0'
          memory: 2G
        reservations:
          cpus: '1.0'
          memory: 1G

Memory Monitoring

Monitor memory usage:

# Check Flynn memory
ps aux | grep flynn

# System memory
free -h

# Node.js heap stats (add to code)
console.log('Heap used:', process.memoryUsage().heapUsed / 1024 / 1024, 'MB');

Monitoring & Profiling

Health Checks

Enable gateway health endpoint:

automation:
  heartbeat:
    enabled: true
    interval: '5m'
    checks:
      - 'gateway'
      - 'model'
      - 'channels'
      - 'memory'
      - 'disk'

Check health:

curl http://localhost:18800/health

Logging Levels

Configure logging appropriately:

logging:
  level: 'info'  # debug | info | warn | error

Development: debug - All messages
Production: info - Normal operation
Minimal: warn - Only warnings and errors

Performance Metrics

Track key metrics:

// Future: Metrics collection
interface Metrics {
  // Response times
  avgResponseTime: number;
  p95ResponseTime: number;
  p99ResponseTime: number;

  // Throughput
  requestsPerSecond: number;
  concurrentSessions: number;

  // Token usage
  avgInputTokens: number;
  avgOutputTokens: number;
  totalTokens: number;

  // Errors
  errorRate: number;
  timeoutRate: number;
}
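Until metrics collection exists, the response-time fields can be derived from raw latency samples. The sketch below uses the nearest-rank percentile method; the helper names are hypothetical and only the three field names come from the `Metrics` interface above.

```typescript
// Nearest-rank percentile over an ascending-sorted array.
function percentile(sortedAsc: number[], p: number): number {
  if (sortedAsc.length === 0) return NaN;
  const rank = Math.ceil((p * sortedAsc.length) / 100);
  return sortedAsc[Math.max(rank, 1) - 1];
}

// Computes the response-time fields of Metrics from latency samples in milliseconds.
function summarizeResponseTimes(samplesMs: number[]) {
  const sorted = [...samplesMs].sort((a, b) => a - b);
  const sum = sorted.reduce((acc, v) => acc + v, 0);
  return {
    avgResponseTime: sorted.length > 0 ? sum / sorted.length : NaN,
    p95ResponseTime: percentile(sorted, 95),
    p99ResponseTime: percentile(sorted, 99),
  };
}

const oneToHundred = Array.from({ length: 100 }, (_unused, i) => i + 1);
console.log(summarizeResponseTimes(oneToHundred));
// { avgResponseTime: 50.5, p95ResponseTime: 95, p99ResponseTime: 99 }
```

Percentiles are usually more informative than the average here, because a few slow model calls can dominate user-perceived latency.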

Profiling

Profile Node.js execution:

# Generate CPU profile
node --prof dist/cli/index.js start

# Process profile
node --prof-process isolate-*.log > profile.txt

# For a profile that Chrome DevTools can load, use --cpu-prof instead;
# it writes a .cpuprofile file when the process exits
node --cpu-prof dist/cli/index.js start

Flamegraphs

Generate flamegraphs for bottleneck analysis:

# Install 0x
npm install -g 0x

# Run with profiler
0x dist/cli/index.js start

Common Performance Issues

High Memory Usage

Symptoms:

OOM errors
Slow garbage collection
System swapping

Solutions:

Reduce keepTurns in compaction
Decrease session TTL
Prune old sessions
Increase Node.js memory limit
Check for memory leaks

Slow Response Times

Symptoms:

Responses > 10 seconds
Timeouts
Poor user experience

Solutions:

Switch to faster model tier
Enable compaction
Use local fallbacks
Optimize tool timeouts
Check network latency

High CPU Usage

Symptoms:

CPU > 80%
Slow system
High latency

Solutions:

Reduce concurrent sessions
Optimize database queries
Use efficient embeddings
Disable unnecessary features
Scale vertically (more CPU)

Database Locks

Symptoms:

SQLite database locked errors
Slow writes
Concurrent access issues

Solutions:

Enable WAL mode
Reduce write frequency
Use connection pooling
Add appropriate indexes

Model Rate Limits

Symptoms:

429 Too Many Requests errors
Frequent fallbacks
Increased latency

Solutions:

Configure retry with exponential backoff
Use faster models for delegated tasks
Implement request queuing
Add local model fallbacks

Performance Checklist

Before deploying to production, verify:

Compaction configured with appropriate threshold
Model tiers configured for cost/latency
Fallback chains configured
Tool timeouts set appropriately
Session TTL reasonable for use case
SQLite optimized (WAL mode, cache size)
Database indexes created
Gateway connection limits set
Memory limits configured
Monitoring enabled
Logging level set to info or warn
Health checks working
Backup/restore tested

For more information:
