# Performance Tuning Guide

This guide covers performance optimization techniques for Flynn in production environments.

## Table of Contents

- [Overview](#overview)
- [Context Management](#context-management)
- [Model Routing](#model-routing)
- [Tool Execution](#tool-execution)
- [Memory & Embeddings](#memory--embeddings)
- [Session Management](#session-management)
- [Database Performance](#database-performance)
- [Gateway Performance](#gateway-performance)
- [Resource Usage](#resource-usage)
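- [Monitoring & Profiling](#monitoring--profiling)
- [Common Performance Issues](#common-performance-issues)
- [Performance Checklist](#performance-checklist)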

## Overview

Flynn's performance depends on several factors:

1. **Context window efficiency**: How efficiently tokens are used
2. **Model selection**: Choosing the right model for each task
3. **Tool execution**: Fast, reliable tool responses
4. **I/O operations**: Database and file system access
5. **Concurrency**: Handling multiple simultaneous requests

### Performance Goals

- **Response time**: < 5 seconds for simple queries
- **Context efficiency**: > 80% token utilization
- **Throughput**: 10-20 concurrent conversations
- **Resource usage**: < 2GB memory, < 50% CPU

## Context Management

### Compaction Settings

Context compaction prevents conversations from exceeding model context windows.

```yaml
agents:
  default:
    compaction:
      # Trigger compaction at 75% of context window
      thresholdPct: 75

      # Keep last 6 turns (user + assistant pairs)
      keepTurns: 6

      # Allow 2048 tokens for summary
      summaryMaxTokens: 2048

      # Preserve high-importance messages
      importanceThreshold: 0.8
```
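
As a rough illustration of how these settings interact, the sketch below decides when compaction triggers and which messages are kept verbatim. The `Message` shape and both function names are hypothetical, not Flynn's internal API, and it assumes one turn is one user message plus one assistant reply.

```typescript
// Hypothetical types and helpers for illustration; not Flynn's internal API.
interface Message {
  role: 'user' | 'assistant';
  content: string;
  importance: number; // 0..1, assumed to be scored elsewhere
}

interface CompactionConfig {
  thresholdPct: number;
  keepTurns: number;
  importanceThreshold: number;
}

// Compaction triggers once used tokens reach thresholdPct of the context window.
function shouldCompact(usedTokens: number, contextWindow: number, cfg: CompactionConfig): boolean {
  return usedTokens >= (contextWindow * cfg.thresholdPct) / 100;
}

// Split history into messages to keep verbatim and messages to summarize.
// Assumes one turn = one user message + one assistant message.
function partitionHistory(history: Message[], cfg: CompactionConfig) {
  const keepCount = cfg.keepTurns * 2;
  const cutoff = Math.max(0, history.length - keepCount);
  const keep: Message[] = [];
  const summarize: Message[] = [];
  history.forEach((msg, i) => {
    if (i >= cutoff || msg.importance >= cfg.importanceThreshold) keep.push(msg);
    else summarize.push(msg);
  });
  return { keep, summarize };
}
```

With a 200,000-token window and `thresholdPct: 75`, compaction would start at 150,000 used tokens under these assumptions.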

### Tuning Guidelines

**For fast interactions:**
```yaml
thresholdPct: 60      # Compact early
keepTurns: 2          # Minimal history
summaryMaxTokens: 512  # Short summaries
```

**For complex reasoning:**
```yaml
thresholdPct: 85      # Maximize context
keepTurns: 10         # More history
summaryMaxTokens: 4096 # Detailed summaries
```

### Context Depth Levels

Control how much context is injected into the system prompt:

```yaml
prompt:
  contextDepth: 'normal'  # minimal | normal | detailed | debug
```

- `minimal`: Only basic system prompt
- `normal`: System prompt + basic memory
- `detailed`: Full memory + tool descriptions
- `debug`: Verbose context (development only)

### Token Counting

Flynn uses rule-based token estimation (fast but approximate).

**Exact tokenization (more accurate, but slower) is not yet available:**

```typescript
// Currently not implemented
// Future: Use tiktoken or similar for exact token counts
```
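
For intuition about what a rule-based estimate looks like, here is a minimal sketch. Flynn's actual rules are not shown in this guide; the four-characters-per-token ratio is a common heuristic for English text and is an assumption here, as is the function name.

```typescript
// Rough heuristic: ~4 characters per token for English text.
// This ratio is an assumption, not Flynn's actual rule set.
function estimateTokens(text: string, charsPerToken = 4): number {
  if (text.length === 0) return 0;
  return Math.ceil(text.length / charsPerToken);
}
```

Because estimates like this can undercount for code or non-English text, leaving some headroom below `thresholdPct` is a reasonable precaution.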

## Model Routing

### Tier Configuration

Optimize model tiers for cost and latency:

```yaml
models:
  router:
    tiers:
      # Fast, cheap: Quick tasks, delegated calls
      fast: 'anthropic:claude-haiku-4-20250514'

      # Default: General conversation
      default: 'anthropic:claude-sonnet-4-20250514'

      # Complex: Deep reasoning, analysis
      complex: 'anthropic:claude-opus-4-20250514'

      # Fallback: Local models when cloud fails
      local: 'ollama:llama3'
```

### Delegation Tasks

Map delegation tasks to appropriate tiers:

```yaml
agents:
  default:
    delegation:
      tiers:
        compaction: 'fast'           # Summarize history
        memoryExtraction: 'fast'     # Extract facts
        classification: 'default'     # Classify intent
        toolSummarization: 'default' # Summarize tool results
        complexReasoning: 'complex'  # Deep analysis
```

### Fallback Chains

Configure fallback chains for resilience:

```yaml
models:
  router:
    # Try equivalent models on other providers
    tierFallbacks:
      default:
        - 'github:claude-sonnet-4-5'
        - 'openai:gpt-4o-mini'

    # Global fallback when all tiers fail
    fallbackChain:
      - 'github:claude-sonnet-4-5'
      - 'ollama:llama3'
```
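
Conceptually, a fallback chain tries each model in order and returns the first successful reply. The sketch below shows that idea only; `ModelCall` and `callWithFallback` are hypothetical names, and the router's real behavior (for example, how it combines tier fallbacks with retries) may differ.

```typescript
// Hypothetical signature: calls one model and resolves with its reply.
type ModelCall = (modelId: string, prompt: string) => Promise<string>;

// Try each model in order; return the first success, or throw with all errors.
async function callWithFallback(
  models: string[],
  prompt: string,
  call: ModelCall,
): Promise<{ modelId: string; reply: string }> {
  const errors: string[] = [];
  for (const modelId of models) {
    try {
      const reply = await call(modelId, prompt);
      return { modelId, reply };
    } catch (err) {
      errors.push(`${modelId}: ${err instanceof Error ? err.message : String(err)}`);
    }
  }
  throw new Error(`All models failed: ${errors.join('; ')}`);
}
```

Each extra entry in the chain adds worst-case latency, since every failed attempt must finish before the next one starts.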

### Retry Configuration

Optimize retry behavior for different scenarios:

```yaml
models:
  router:
    retry:
      # More retries for transient failures
      maxAttempts: 3

      # Start with 1s delay
      initialDelayMs: 1000

      # Exponential backoff
      multiplier: 2

      # Max 30s between retries
      maxDelayMs: 30000

      # Don't retry auth errors. Rate-limit errors are transient, so they
      # stay retryable and are absorbed by the backoff above.
      nonRetryablePatterns:
        - 'invalid_api_key'
        - 'permission_denied'
```

**For production reliability:**
```yaml
maxAttempts: 5
initialDelayMs: 500
multiplier: 1.5
maxDelayMs: 60000
```
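
To see what these values mean in practice, the sketch below computes the wait before each retry. It assumes `maxAttempts` counts the initial call and that no jitter is applied; neither is confirmed by this guide, and `backoffDelays` is a hypothetical helper.

```typescript
interface RetryConfig {
  maxAttempts: number;
  initialDelayMs: number;
  multiplier: number;
  maxDelayMs: number;
}

// Delays between attempts: maxAttempts attempts means maxAttempts - 1 waits.
// Assumes no jitter and that maxAttempts includes the first call.
function backoffDelays(cfg: RetryConfig): number[] {
  const delays: number[] = [];
  let delay = cfg.initialDelayMs;
  for (let i = 1; i < cfg.maxAttempts; i++) {
    delays.push(Math.min(delay, cfg.maxDelayMs));
    delay *= cfg.multiplier;
  }
  return delays;
}

const defaults = backoffDelays({ maxAttempts: 3, initialDelayMs: 1000, multiplier: 2, maxDelayMs: 30000 });
// defaults is [1000, 2000]
const production = backoffDelays({ maxAttempts: 5, initialDelayMs: 500, multiplier: 1.5, maxDelayMs: 60000 });
// production is [500, 750, 1125, 1687.5]
```

Under these assumptions the default settings add up to 3 seconds of waiting in the worst case, and the production settings add roughly 4 seconds spread over four retries.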

### Cost Estimation

Monitor token usage and costs:

```typescript
// Model costs (examples)
const MODEL_COSTS = {
  'anthropic:claude-sonnet-4-20250514': {
    input: 3.0,    // $3 per 1M input tokens
    output: 15.0    // $15 per 1M output tokens
  },
  'anthropic:claude-haiku-4-20250514': {
    input: 0.25,
    output: 1.25
  }
};
```

Track usage with `AgentOrchestrator.getUsageStats()`.
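
A per-request cost estimate follows directly from the per-million-token prices above. The helper below is a hypothetical sketch; the shape returned by `getUsageStats()` is not documented here, so the token counts are passed in explicitly.

```typescript
// Prices are in USD per 1M tokens, matching the MODEL_COSTS example.
interface ModelCost {
  input: number;
  output: number;
}

// Hypothetical helper: cost of one request given its token counts.
function estimateCostUsd(cost: ModelCost, inputTokens: number, outputTokens: number): number {
  return (inputTokens / 1e6) * cost.input + (outputTokens / 1e6) * cost.output;
}

const sonnetCost: ModelCost = { input: 3.0, output: 15.0 };
// 10,000 input + 2,000 output tokens: about $0.03 + $0.03 = $0.06
const requestCostUsd = estimateCostUsd(sonnetCost, 10000, 2000);
```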

## Tool Execution

### Timeout Configuration

Set appropriate timeouts for different tool types:

```yaml
tools:
  executor:
    # Default 30s timeout
    defaultTimeoutMs: 30000

    # Max 50KB output
    maxOutputBytes: 51200
```

**For long-running tools:**
```yaml
tools:
  executor:
    defaultTimeoutMs: 60000  # 60s
```

**For fast tools:**
```yaml
tools:
  executor:
    defaultTimeoutMs: 10000  # 10s
```
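
A timeout like `defaultTimeoutMs` is typically enforced by racing the tool's promise against a timer. The sketch below illustrates that pattern with a hypothetical `withTimeout` helper; it is not Flynn's executor, and it only stops waiting rather than cancelling the underlying work.

```typescript
// Reject if the work does not settle within timeoutMs.
// Note: this stops waiting; it does not cancel the underlying operation.
function withTimeout<T>(work: Promise<T>, timeoutMs: number, label = 'tool'): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`${label} timed out after ${timeoutMs}ms`)), timeoutMs);
  });
  // Clear the timer either way so it does not keep the process alive.
  return Promise.race([work, timeout]).finally(() => {
    if (timer !== undefined) clearTimeout(timer);
  });
}
```

A shorter timeout frees the conversation sooner when a tool hangs, at the cost of failing tools that are merely slow.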

### Caching (Future)

Implement caching for repeated operations:

```yaml
# Not yet implemented
tools:
  cache:
    enabled: true
    ttl: 300  # 5 minutes
    maxSize: 1000
    excludePatterns:
      - 'shell.exec'
      - 'process.*'
```

### Sandbox Performance

The Docker sandbox adds overhead to every tool call. To reduce it:

```yaml
sandbox:
  enabled: true
  image: 'node:22-alpine'

  # Resource limits
  resourceLimits:
    memory: '512m'
    cpus: '0.5'
    timeoutSec: 60

  # Use host networking only if safe: it is faster than bridge mode
  # but removes network isolation between the sandbox and the host
  networkMode: 'host'
```

**For best performance:**
```yaml
sandbox:
  enabled: false  # Disable if not needed
```

### Parallel Tool Execution

Flynn currently executes tools sequentially. Independent tool calls could run in parallel in a future release:

```typescript
// Future enhancement
const results = await Promise.all([
  toolRegistry.execute('tool1', args1),
  toolRegistry.execute('tool2', args2),
  toolRegistry.execute('tool3', args3)
]);
```

## Memory & Embeddings

### Embedding Provider Selection

Choose embedding provider based on latency and cost:

```yaml
memory:
  embeddings:
    provider: 'openai'  # openai | gemini | ollama | llamacpp | voyage

    openai:
      apiKey: '${OPENAI_API_KEY}'
      model: 'text-embedding-3-small'  # Fastest

    # Alternative: Local embeddings
    ollama:
      host: 'localhost:11434'
      model: 'nomic-embed-text'
```

**Approximate latency per request (varies with network and hardware):**
- OpenAI `text-embedding-3-small`: ~100ms
- Gemini: ~200ms
- Ollama `nomic-embed-text`: ~500ms (local)
- llama.cpp: ~300ms (local)

### Text Chunking

Optimize chunking for better search:

```yaml
memory:
  embeddings:
    chunking:
      # Smaller chunks for precision
      maxChunkSize: 512

      # Overlap for context preservation
      chunkOverlap: 50

      # Don't chunk small documents
      minChunkSize: 128
```

**For fast indexing:**
```yaml
maxChunkSize: 1024
chunkOverlap: 100
```

**For precise search:**
```yaml
maxChunkSize: 256
chunkOverlap: 25
```
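
The trade-off between these presets is easier to see with a sliding-window chunker. The sketch below measures sizes in characters, which is an assumption (the real unit may be tokens), leaves out `minChunkSize`, and uses a hypothetical `chunkText` helper.

```typescript
// Sliding-window chunker measured in characters (an assumption; the real
// unit may be tokens). Consecutive chunks share chunkOverlap characters.
function chunkText(text: string, maxChunkSize: number, chunkOverlap: number): string[] {
  if (chunkOverlap < 0 || chunkOverlap >= maxChunkSize) {
    throw new Error('chunkOverlap must be in [0, maxChunkSize)');
  }
  if (text.length === 0) return [];
  const step = maxChunkSize - chunkOverlap;
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + maxChunkSize));
    if (start + maxChunkSize >= text.length) break; // last window reached the end
  }
  return chunks;
}
```

Under these assumptions, a 1,000-character document yields 1 chunk at 1024/100, 3 chunks at 512/50, and 5 chunks at 256/25, so smaller chunks mean more embedding calls per document.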

### Hybrid Search Tuning

Balance keyword and vector search:

```yaml
memory:
  search:
    # Weight vector search higher
    vectorWeight: 0.7
    keywordWeight: 0.3

    # Return top results
    limit: 10

    # Minimum relevance threshold
    threshold: 0.5
```

**For keyword-heavy queries:**
```yaml
vectorWeight: 0.4
keywordWeight: 0.6
```

**For semantic queries:**
```yaml
vectorWeight: 0.8
keywordWeight: 0.2
```
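
One plausible reading of these weights is a linear blend of the two scores, followed by the threshold and limit. The sketch below assumes both scores are already normalized to the 0..1 range; the `Candidate` shape and `rankHybrid` are hypothetical, and Flynn's actual scoring formula may differ.

```typescript
interface SearchConfig {
  vectorWeight: number;
  keywordWeight: number;
  limit: number;
  threshold: number;
}

// Both scores are assumed to be normalized to 0..1 before blending.
interface Candidate {
  id: string;
  vectorScore: number;
  keywordScore: number;
}

// Blend scores linearly, drop results below the threshold, return the top N.
function rankHybrid(candidates: Candidate[], cfg: SearchConfig) {
  return candidates
    .map((c) => ({
      id: c.id,
      score: cfg.vectorWeight * c.vectorScore + cfg.keywordWeight * c.keywordScore,
    }))
    .filter((r) => r.score >= cfg.threshold)
    .sort((a, b) => b.score - a.score)
    .slice(0, cfg.limit);
}
```

Shifting weight toward `keywordWeight` promotes exact-term matches such as identifiers and error codes, while a higher `vectorWeight` favors paraphrased matches.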

### Embedding Caching

Cache embeddings to avoid recomputation:

```yaml
memory:
  embeddings:
    cache:
      enabled: true
      ttl: 86400  # 24 hours
```
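
The effect of `ttl` can be pictured as an expiry timestamp stored next to each vector. The class below is a hypothetical in-memory sketch keyed by the exact input text; a real cache would likely hash the key, bound its size, and persist entries.

```typescript
// Minimal in-memory TTL cache keyed by the exact text (hypothetical sketch).
class EmbeddingCache {
  private entries = new Map<string, { vector: number[]; expiresAt: number }>();
  private ttlMs: number;
  private now: () => number;

  // `now` is injectable so expiry can be tested without waiting.
  constructor(ttlSeconds: number, now: () => number = Date.now) {
    this.ttlMs = ttlSeconds * 1000;
    this.now = now;
  }

  get(text: string): number[] | undefined {
    const entry = this.entries.get(text);
    if (!entry) return undefined;
    if (entry.expiresAt <= this.now()) {
      this.entries.delete(text); // expired: drop it and report a miss
      return undefined;
    }
    return entry.vector;
  }

  set(text: string, vector: number[]): void {
    this.entries.set(text, { vector, expiresAt: this.now() + this.ttlMs });
  }
}
```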

## Session Management

### TTL Configuration

Set appropriate session TTLs:

```yaml
sessions:
  ttl: '7d'  # Keep sessions for 7 days

  # Maximum concurrent sessions
  maxSessions: 100
```

**For memory efficiency:**
```yaml
ttl: '1d'
maxSessions: 50
```

**For long-term memory:**
```yaml
ttl: '30d'
maxSessions: 200
```
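
Duration strings such as `'7d'` resolve to a millisecond window that is compared against a session's last activity. The sketch below shows that arithmetic; the accepted unit suffixes and both helper names are assumptions, not Flynn's parser.

```typescript
const UNIT_MS: Record<string, number> = { s: 1000, m: 60000, h: 3600000, d: 86400000 };

// Parse durations like '1h', '7d', '30d'. The accepted units are an assumption.
function parseDurationMs(value: string): number {
  const match = /^(\d+)([smhd])$/.exec(value.trim());
  if (!match) throw new Error(`Invalid duration: ${value}`);
  return Number(match[1]) * UNIT_MS[match[2]];
}

// A session expires once it has been idle for longer than the TTL.
function isExpired(lastActiveAtMs: number, ttl: string, nowMs: number): boolean {
  return nowMs - lastActiveAtMs > parseDurationMs(ttl);
}
```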

### Session Pruning

Prune old sessions regularly:

```yaml
automation:
  sessionPruner:
    enabled: true
    interval: '1h'  # Run every hour

    # Prune sessions older than TTL
    pruneOlderThan: '7d'
```

### Session Indexing

Optimize session search with indexes:

```sql
-- SQLite indexes
CREATE INDEX idx_sessions_created_at ON sessions(created_at);
CREATE INDEX idx_sessions_last_active ON sessions(last_active_at);
CREATE INDEX idx_messages_session_id ON messages(session_id);
```

## Database Performance

### SQLite Configuration

Optimize SQLite for Flynn's workload:

```sql
-- In SQLite connection setup
PRAGMA page_size = 4096;          -- Default page size; takes effect only before the database is created (or after VACUUM)
PRAGMA journal_mode = WAL;        -- Better concurrency
PRAGMA synchronous = NORMAL;      -- Faster writes
PRAGMA cache_size = -64000;       -- 64MB cache
PRAGMA temp_store = MEMORY;       -- Store temp data in memory
PRAGMA mmap_size = 268435456;     -- 256MB mmap
```

### Connection Pooling

Flynn uses a single SQLite connection per database. For high concurrency, consider:

```typescript
// Future: connection pool (not yet implemented).
// ConnectionPool is a hypothetical wrapper; better-sqlite3 does not provide one.

const pool = new ConnectionPool({
  filename: '/path/to/database.db',
  maxConnections: 10
});
```

### Query Optimization

Use indexed columns in queries:

```typescript
// Good: Uses index
const sessions = db.prepare(`
  SELECT * FROM sessions
  WHERE last_active_at > ?
  ORDER BY last_active_at DESC
  LIMIT 10
`).all(threshold);

// Bad: Full table scan (message_count is not indexed)
const busySessions = db.prepare(`
  SELECT * FROM sessions
  WHERE message_count > ?
`).all(threshold);
```

### Vacuum and Analyze

Regular maintenance improves performance:

```bash
# Vacuum to reclaim space
sqlite3 sessions.db "VACUUM;"

# Analyze for query optimization
sqlite3 sessions.db "ANALYZE;"

# Rebuild indexes
sqlite3 sessions.db "REINDEX;"
```

Add to crontab (monthly):
```
0 0 1 * * sqlite3 /var/lib/flynn/sessions.db "VACUUM; ANALYZE;" >> /var/log/flynn-maintenance.log 2>&1
```

## Gateway Performance

### Connection Limits

Limit concurrent connections:

```yaml
gateway:
  enabled: true
  port: 18800

  # Maximum concurrent WebSocket connections
  maxConnections: 50

  # Single-client lock
  lock:
    enabled: true  # Only one client at a time
```

**For multiple users:**
```yaml
gateway:
  maxConnections: 100
  lock:
    enabled: false
```

### Lane Queue

The lane queue serializes requests per session:

```yaml
gateway:
  laneQueue:
    # Max requests per session
    maxDepth: 10

    # Request timeout
    requestTimeoutMs: 30000
```
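
Serializing per session means requests for the same session run one at a time, while different sessions proceed independently. The class below is a hypothetical sketch of that idea using promise chaining, with `maxDepth` counting queued plus running requests; it does not model `requestTimeoutMs` and is not Flynn's implementation.

```typescript
// Hypothetical sketch: run tasks one at a time per session and reject
// new work when a session's lane already holds maxDepth requests.
class LaneQueue {
  private tails = new Map<string, Promise<void>>();
  private depths = new Map<string, number>();
  private maxDepth: number;

  constructor(maxDepth: number) {
    this.maxDepth = maxDepth;
  }

  enqueue<T>(sessionId: string, task: () => Promise<T>): Promise<T> {
    const depth = this.depths.get(sessionId) ?? 0;
    if (depth >= this.maxDepth) {
      return Promise.reject(new Error(`Lane ${sessionId} is full (${this.maxDepth})`));
    }
    this.depths.set(sessionId, depth + 1);

    // Chain onto the previous task in this lane; the tail never rejects.
    const tail = this.tails.get(sessionId) ?? Promise.resolve();
    const result = tail.then(() => task());
    const settled = result
      .then(() => undefined, () => undefined)
      .then(() => {
        const remaining = (this.depths.get(sessionId) ?? 1) - 1;
        if (remaining === 0) {
          this.depths.delete(sessionId);
          this.tails.delete(sessionId);
        } else {
          this.depths.set(sessionId, remaining);
        }
      });
    this.tails.set(sessionId, settled);
    return result;
  }
}
```

A small `maxDepth` applies backpressure early, which keeps a single busy session from accumulating a long backlog.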

### WebSocket Optimization

Configure WebSocket for performance:

```typescript
// Gateway server WebSocket options
const wsOptions = {
  // Enable compression
  perMessageDeflate: {
    threshold: 1024
  },

  // Track connected clients (needed to iterate over them for heartbeat pings)
  clientTracking: true,

  // Maximum message size
  maxPayload: 16 * 1024 * 1024  // 16MB
};
```

### HTTP Server

Optimize HTTP server for static files:

```yaml
gateway:
  static:
    # Enable gzip compression
    gzip: true

    # Cache static assets
    cacheControl: 'public, max-age=3600'

    # Serve index.html for SPA routes
    spa: true
```

## Resource Usage

### Node.js Options

Tune Node.js for production:

```bash
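# Increase the V8 old-space heap limit to 4 GB
export NODE_OPTIONS="--max-old-space-size=4096"

# Avoid adding --optimize-for-size or --gc-interval here: they are V8 flags
# outside the set that Node.js accepts in NODE_OPTIONS, and --gc-interval
# forces frequent garbage collection, which slows the process down.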
```

In systemd service:
```ini
Environment="NODE_OPTIONS=--max-old-space-size=4096"
```

### Process Limits

Set appropriate limits:

```ini
[Service]
# Memory limit (2GB); keep --max-old-space-size below this value
MemoryMax=2G
MemorySwapMax=0

# CPU quota (200% = 2 cores)
CPUQuota=200%

# File descriptors
LimitNOFILE=65536
```

### Docker Resource Limits

Constrain Docker container:

```yaml
services:
  flynn:
    deploy:
      resources:
        limits:
          cpus: '2.0'
          memory: 2G
        reservations:
          cpus: '1.0'
          memory: 1G
```

### Memory Monitoring

Monitor memory usage:

```bash
# Check Flynn memory
ps aux | grep flynn

# System memory
free -h

# Node.js heap stats: add this line to application code (JavaScript, not shell)
# console.log('Heap used:', process.memoryUsage().heapUsed / 1024 / 1024, 'MB');
```

## Monitoring & Profiling

### Health Checks

Enable periodic health checks:

```yaml
automation:
  heartbeat:
    enabled: true
    interval: '5m'
    checks:
      - 'gateway'
      - 'model'
      - 'channels'
      - 'memory'
      - 'disk'
```

Check health:
```bash
curl http://localhost:18800/health
```

### Logging Levels

Configure logging appropriately:

```yaml
logging:
  level: 'info'  # debug | info | warn | error
```

- **Development:** `debug` - All messages
- **Production:** `info` - Normal operation
- **Minimal:** `warn` - Only warnings and errors

### Performance Metrics

Track key metrics:

```typescript
// Future: Metrics collection
interface Metrics {
  // Response times
  avgResponseTime: number;
  p95ResponseTime: number;
  p99ResponseTime: number;

  // Throughput
  requestsPerSecond: number;
  concurrentSessions: number;

  // Token usage
  avgInputTokens: number;
  avgOutputTokens: number;
  totalTokens: number;

  // Errors
  errorRate: number;
  timeoutRate: number;
}
```
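
Until metrics collection exists, the latency fields above can be computed from raw samples. The sketch below uses the nearest-rank method for percentiles; `percentile` and `summarizeLatency` are hypothetical helpers, and other percentile definitions (for example, interpolated ones) give slightly different values.

```typescript
// Nearest-rank percentile over a sample of response times (ms).
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error('no samples');
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(0, rank - 1)];
}

// Fill the latency fields of the Metrics interface from raw samples.
function summarizeLatency(samples: number[]) {
  const avg = samples.reduce((sum, x) => sum + x, 0) / samples.length;
  return {
    avgResponseTime: avg,
    p95ResponseTime: percentile(samples, 95),
    p99ResponseTime: percentile(samples, 99),
  };
}
```

Tail percentiles matter more than the average here, because a few slow model calls can dominate perceived responsiveness without moving the mean much.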

### Profiling

Profile Node.js execution:

```bash
# Generate a V8 tick log (isolate-*.log)
node --prof dist/cli/index.js start

# Convert the tick log into a readable text summary
node --prof-process isolate-*.log > profile.txt

# For Chrome DevTools, write a .cpuprofile file instead
# and load it in DevTools
node --cpu-prof dist/cli/index.js start

### Flamegraphs

Generate flamegraphs for bottleneck analysis:

```bash
# Install 0x
npm install -g 0x

# Run with profiler
0x dist/cli/index.js start
```

## Common Performance Issues

### High Memory Usage

**Symptoms:**
- OOM errors
- Slow garbage collection
- System swapping

**Solutions:**
1. Reduce `keepTurns` in compaction
2. Decrease session TTL
3. Prune old sessions
4. Increase Node.js memory limit
5. Check for memory leaks

### Slow Response Times

**Symptoms:**
- Responses > 10 seconds
- Timeouts
- Poor user experience

**Solutions:**
1. Switch to faster model tier
2. Enable compaction
3. Use local fallbacks
4. Optimize tool timeouts
5. Check network latency

### High CPU Usage

**Symptoms:**
- CPU > 80%
- Slow system
- High latency

**Solutions:**
1. Reduce concurrent sessions
2. Optimize database queries
3. Use efficient embeddings
4. Disable unnecessary features
5. Scale vertically (more CPU)

### Database Locks

**Symptoms:**
- SQLite database locked errors
- Slow writes
- Concurrent access issues

**Solutions:**
1. Enable WAL mode
2. Reduce write frequency
3. Use connection pooling
4. Add appropriate indexes

### Model Rate Limits

**Symptoms:**
- 429 Too Many Requests errors
- Frequent fallbacks
- Increased latency

**Solutions:**
1. Configure retry with exponential backoff
2. Use faster models for delegated tasks
3. Implement request queuing
4. Add local model fallbacks

## Performance Checklist

Before deploying to production, verify:

- [ ] Compaction configured with appropriate threshold
- [ ] Model tiers configured for cost/latency
- [ ] Fallback chains configured
- [ ] Tool timeouts set appropriately
- [ ] Session TTL reasonable for use case
- [ ] SQLite optimized (WAL mode, cache size)
- [ ] Database indexes created
- [ ] Gateway connection limits set
- [ ] Memory limits configured
- [ ] Monitoring enabled
- [ ] Logging level set to `info` or `warn`
- [ ] Health checks working
- [ ] Backup/restore tested

---

For more information:
- [TROUBLESHOOTING.md](../../TROUBLESHOOTING.md)
- [PRODUCTION.md](../deployment/PRODUCTION.md)
- [ARCHITECTURE.md](../../.planning/codebase/ARCHITECTURE.md)