flynn/docs/deployment/PRODUCTION.md

# Production Deployment Guide

This guide covers deploying Flynn in a production environment.

## Table of Contents

- [Prerequisites](#prerequisites)
- [Docker Deployment](#docker-deployment)
- [Systemd Service](#systemd-service)
- [Security](#security)
- [Configuration](#configuration)
- [Monitoring](#monitoring)
- [Backup & Recovery](#backup--recovery)
- [Performance Tuning](#performance-tuning)
- [Scaling Considerations](#scaling-considerations)
- [Troubleshooting Production Issues](#troubleshooting-production-issues)

## Prerequisites

### System Requirements

- **OS**: Linux (Ubuntu 22.04+ recommended) or macOS
- **Node.js**: >= 22.0.0
- **Memory**: Minimum 2GB, 4GB+ recommended
- **Disk**: 10GB+ for sessions, memory, and vectors
- **Docker**: Optional; required only for sandbox features
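
The requirements above can be checked before installing. The following is a minimal preflight sketch, not part of Flynn: the helper names `node_major_ok` and `mem_total_mb` are illustrative, and the memory check assumes a Linux host that exposes `/proc/meminfo`.

```bash
#!/bin/bash
# Preflight helpers (illustrative names, not Flynn commands).

# Succeeds when the major version in $1 (e.g. "v22.3.0") is >= $2.
node_major_ok() {
  local version="${1#v}"        # strip the leading "v" printed by `node --version`
  local major="${version%%.*}"  # keep everything before the first dot
  [ "$major" -ge "$2" ]
}

# Prints total memory in MB from a meminfo-style file (default: /proc/meminfo).
mem_total_mb() {
  awk '/^MemTotal:/ { printf "%d\n", $2 / 1024 }' "${1:-/proc/meminfo}"
}
```

A typical call would be `node_major_ok "$(node --version)" 22 || echo "Node.js 22+ required"`, followed by comparing the output of `mem_total_mb` against the 2GB minimum.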

### Network Requirements

- Public IP or VPN (Tailscale recommended) for remote access
- Open ports: 18800 (gateway), optional 443 (Tailscale Serve)
- Outbound HTTPS access for model providers and web tools

### External Services (Optional)

- **Model Providers**: Anthropic, OpenAI, GitHub Models, etc. (API keys required)
- **Email**: SMTP server for email notifications
- **Object Storage**: MinIO or S3 for backups (optional)

## Docker Deployment

### Quick Start

Using the provided `docker-compose.yml`:

```bash
# Clone repository
git clone <repo-url>
cd flynn

# Create config
cp config/default.yaml config/production.yaml
# Edit config/production.yaml with your settings

# Start services
docker-compose up -d

# View logs
docker-compose logs -f
```

### Dockerfile

The multi-stage Dockerfile:

```dockerfile
# Stage 1: Build
FROM node:22-alpine AS builder
WORKDIR /app
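COPY package*.json ./
# Install all dependencies; the build step typically needs devDependencies
RUN npm ci
COPY . .
# Build, then drop devDependencies so stage 2 copies only runtime packages
RUN npm run build && npm prune --omit=dev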

# Stage 2: Runtime
FROM node:22-alpine
WORKDIR /app
COPY --from=builder /app/dist ./dist
COPY --from=builder /app/node_modules ./node_modules
COPY config ./config
COPY src/gateway/ui ./dist/gateway/ui

# Create data directory
RUN mkdir -p /data

# Health check
HEALTHCHECK --interval=30s --timeout=10s --start-period=5s --retries=3 \
  CMD node -e "require('http').get('http://localhost:18800/health', (r) => {process.exit(r.statusCode === 200 ? 0 : 1)})"

# Expose gateway port
EXPOSE 18800

# Run
CMD ["node", "dist/cli/index.js", "start"]
```

### Docker Compose Configuration

```yaml
version: '3.8'

services:
  flynn:
    build: .
    container_name: flynn
    restart: unless-stopped
    ports:
      - "18800:18800"
    volumes:
      - ./config/production.yaml:/flynn/config.yaml:ro
      - flynn_data:/data
      - /var/run/docker.sock:/var/run/docker.sock  # For sandbox
    environment:
      - NODE_ENV=production
      - FLYNN_CONFIG=/flynn/config.yaml
      - FLYNN_DATA_DIR=/data  # Persist data in the flynn_data volume
    healthcheck:
      test: ["CMD", "wget", "--spider", "-q", "http://localhost:18800/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 5s

  whisper:
    image: openai/whisper-server:latest
    container_name: whisper-server
    restart: unless-stopped
    ports:
      - "8080:8080"
    volumes:
      - whisper_cache:/cache
    environment:
      - WHISPER_MODEL=base
      - WHISPER_HTTP_PORT=8080

volumes:
  flynn_data:
  whisper_cache:
```

### Environment Variables

```bash
# Node environment
export NODE_ENV=production

# Config path
export FLYNN_CONFIG=/path/to/config.yaml

# Data directory (default: ~/.local/share/flynn)
export FLYNN_DATA_DIR=/var/lib/flynn

# Optional: Override model provider credentials
export ANTHROPIC_API_KEY=sk-...
export OPENAI_API_KEY=sk-...
```

## Systemd Service

### Service File

Create `/etc/systemd/system/flynn.service`:

```ini
[Unit]
Description=Flynn AI Assistant Daemon
After=network-online.target
Wants=network-online.target

[Service]
Type=simple
User=flynn
Group=flynn
WorkingDirectory=/opt/flynn
Environment="NODE_ENV=production"
Environment="FLYNN_CONFIG=/etc/flynn/config.yaml"
Environment="FLYNN_DATA_DIR=/var/lib/flynn"
ExecStart=/usr/local/bin/node /opt/flynn/dist/cli/index.js start
Restart=always
RestartSec=10
StandardOutput=journal
StandardError=journal
SyslogIdentifier=flynn

# Security hardening
NoNewPrivileges=true
PrivateTmp=true
ProtectSystem=strict
ProtectHome=true
ReadWritePaths=/var/lib/flynn /var/log/flynn /var/run

# Resource limits
MemoryMax=2G
MemorySwapMax=0
CPUQuota=200%

[Install]
WantedBy=multi-user.target
```

### Create Flynn User

```bash
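# Create group and user
sudo groupadd --system flynn
sudo useradd --system --gid flynn --home-dir /var/lib/flynn --shell /usr/sbin/nologin flynn

# Create directories
sudo mkdir -p /opt/flynn /etc/flynn /var/lib/flynn /var/log/flynn
sudo chown -R flynn:flynn /var/lib/flynn /var/log/flynn

# Copy application files and config
# (the unit file runs /opt/flynn/dist/cli/index.js; node_modules mirrors the Dockerfile runtime stage)
sudo cp -r dist node_modules /opt/flynn/
sudo cp config/production.yaml /etc/flynn/config.yaml
sudo chown -R root:root /opt/flynn /etc/flynn

# The config may hold secrets: readable by root and the flynn group only
sudo chown root:flynn /etc/flynn/config.yaml
sudo chmod 640 /etc/flynn/config.yaml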
```

### Enable and Start Service

```bash
# Reload systemd
sudo systemctl daemon-reload

# Enable service (start on boot)
sudo systemctl enable flynn

# Start service
sudo systemctl start flynn

# Check status
sudo systemctl status flynn

# View logs
sudo journalctl -u flynn -f

# Restart service
sudo systemctl restart flynn
```

### Service Management

```bash
# Stop service
sudo systemctl stop flynn

# Reload config (requires restart)
sudo systemctl restart flynn

# Check if running
sudo systemctl is-active flynn

# View recent logs
sudo journalctl -u flynn -n 100 --no-pager
```

## Security

### Secrets Management

Never commit secrets to version control. Use one of these approaches:

#### Environment Variables

```yaml
# config/production.yaml
models:
  default:
    anthropic:
      apiKey: '${ANTHROPIC_API_KEY}'
```

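Set the variable in the systemd service file, or in a root-owned file such as `/etc/flynn/.env` that the unit loads with `EnvironmentFile=`:
```ini
Environment="ANTHROPIC_API_KEY=sk-..."
# Or keep secrets out of the unit file:
# EnvironmentFile=/etc/flynn/.env
```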
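
Because the config refers to secrets through `${VAR}` placeholders, an unset variable is easy to miss until a provider call fails. The sketch below lists placeholders that have no value in the current environment. `missing_env_vars` is an illustrative helper, not a Flynn command, and it assumes placeholders always use the `${NAME}` form shown above.

```bash
#!/bin/bash
# Lists ${VAR} placeholders in a config file that have no value in the environment.
missing_env_vars() {
  local config_file="$1" placeholder name
  grep -o '\${[A-Za-z_][A-Za-z0-9_]*}' "$config_file" | sort -u |
    while read -r placeholder; do
      name="${placeholder#??}"   # drop the leading "${"
      name="${name%?}"           # drop the trailing "}"
      # printenv only sees exported variables, which is what a child process inherits
      if [ -z "$(printenv "$name")" ]; then
        printf '%s\n' "$name"
      fi
    done
}
```

For example, `missing_env_vars config/production.yaml` prints one variable name per line and prints nothing when every placeholder is covered.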

#### HashiCorp Vault (Advanced)

Use a secrets manager and inject at runtime:

```bash
# Read the secret straight into the environment; nothing is written to disk
export ANTHROPIC_API_KEY="$(vault kv get -field=api_key secret/anthropic)"
```

### Authentication

#### Gateway Auth

```yaml
# config/production.yaml
gateway:
  enabled: true
  auth:
    token: 'your-random-token-here'  # Generate with: openssl rand -hex 32
    trustTailscaleIdentity: true
    applyToHttp: true
```

Generate a secure token:
```bash
openssl rand -hex 32
```

#### Channel Whitelists

Restrict who can interact with Flynn:

```yaml
channels:
  telegram:
    allowedChatIds: ['123456789']  # Your Telegram chat ID
  discord:
    allowedGuildIds: ['987654321098765432']
    allowedChannelIds: ['123456789012345678']
  slack:
    allowedChannelIds: ['C12345678']
    signingSecret: '${SLACK_SIGNING_SECRET}'
```

### Network Security

#### Firewall

```bash
# Ubuntu/Debian (ufw)
sudo ufw allow 22/tcp    # SSH
sudo ufw allow 18800/tcp  # Flynn gateway
sudo ufw enable

# CentOS/RHEL (firewalld)
sudo firewall-cmd --permanent --add-port=18800/tcp
sudo firewall-cmd --reload
```

#### Reverse Proxy (Nginx)

Place Flynn behind Nginx for TLS:

```nginx
server {
    listen 443 ssl http2;
    server_name flynn.example.com;

    ssl_certificate /etc/letsencrypt/live/flynn.example.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/flynn.example.com/privkey.pem;

    # WebSocket upgrade
    location / {
        proxy_pass http://localhost:18800;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;

        # Timeouts (long read/send timeouts keep idle WebSocket connections open)
        proxy_connect_timeout 60s;
        proxy_send_timeout 3600s;
        proxy_read_timeout 3600s;
    }

    # Health check endpoint (no auth required)
    location /health {
        proxy_pass http://localhost:18800/health;
        access_log off;
    }
}
```

Obtain TLS certificate with Let's Encrypt:
```bash
sudo certbot --nginx -d flynn.example.com
```

### File Permissions

```bash
# Data directory
sudo chmod 750 /var/lib/flynn
sudo chown flynn:flynn /var/lib/flynn

# Config file
sudo chmod 640 /etc/flynn/config.yaml
sudo chown root:flynn /etc/flynn/config.yaml

# Logs
sudo chmod 750 /var/log/flynn
sudo chown flynn:flynn /var/log/flynn
```
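
After applying the permissions above, it can be useful to confirm that nothing under the Flynn directories is accessible to other users. The helpers below are a small sketch with illustrative names (`has_mode`, `world_accessible`); they rely on GNU `stat` and `find` as shipped on Linux.

```bash
#!/bin/bash
# Succeeds when the octal mode of path $1 equals $2 (for example 640).
has_mode() {
  [ "$(stat -c '%a' "$1")" = "$2" ]
}

# Prints every non-symlink path under $1 that grants any permission to "other".
world_accessible() {
  find "$1" ! -type l -perm /o=rwx
}
```

For instance, `has_mode /etc/flynn/config.yaml 640` should succeed and `world_accessible /var/lib/flynn` should print nothing once the commands in this section have been run.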

### Sandbox Security

Docker sandbox adds isolation but requires careful configuration:

```yaml
# config/production.yaml
sandbox:
  enabled: true
  image: 'node:22-alpine'
  dockerSocket: '/var/run/docker.sock'
  resourceLimits:
    memory: '512m'
    cpus: '0.5'
    timeoutSec: 60
  networkMode: 'none'  # No network access
```

Ensure Docker is secured:
```bash
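# Allow the flynn user to talk to the Docker daemon.
# Note: membership in the docker group is effectively root access on the host.
sudo usermod -aG docker flynn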

# Configure Docker daemon security
sudo vim /etc/docker/daemon.json
```

```json
{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "10m",
    "max-file": "3"
  },
  "live-restore": true,
  "userland-proxy": false
}
```

## Configuration

### Production Config Template

```yaml
# config/production.yaml
# Base config for production deployment

# ── Gateway ───────────────────────────────────────────────────────────────
gateway:
  enabled: true
  port: 18800
  auth:
    token: '${GATEWAY_TOKEN}'
    trustTailscaleIdentity: true
    applyToHttp: true
  lock:
    enabled: true
  tailscaleServe:
    enabled: false  # Set to true to expose via Tailscale
    hostname: 'flynn'
    port: 443

# ── Models ─────────────────────────────────────────────────────────────────
models:
  default:
    anthropic:
      apiKey: '${ANTHROPIC_API_KEY}'
      model: 'claude-sonnet-4-20250514'
      maxTokens: 4096

  router:
    tiers:
      default: 'anthropic:claude-sonnet-4-20250514'
      fast: 'anthropic:claude-3-5-haiku-20241022'
      complex: 'anthropic:claude-opus-4-20250514'
      local: 'ollama:llama3'

    fallbackChain:
      - 'github:claude-sonnet-4-5'
      - 'local:ollama:llama3'

    retry:
      maxAttempts: 3
      initialDelayMs: 1000
      multiplier: 2
      maxDelayMs: 30000

# ── Channels ───────────────────────────────────────────────────────────────
channels:
  telegram:
    enabled: true
    token: '${TELEGRAM_BOT_TOKEN}'
    allowedChatIds: ['123456789']

  discord:
    enabled: false

  slack:
    enabled: false

  whatsapp:
    enabled: false

# ── Sessions ───────────────────────────────────────────────────────────────
sessions:
  ttl: '7d'
  maxSessions: 100

# ── Memory ────────────────────────────────────────────────────────────────
memory:
  enabled: true
  embeddings:
    provider: 'openai'
    openai:
      apiKey: '${OPENAI_API_KEY}'
      model: 'text-embedding-3-small'

# ── Tools ─────────────────────────────────────────────────────────────────
tools:
  policy: 'coding'  # Restrict tool access

  executor:
    defaultTimeoutMs: 30000
    maxOutputBytes: 51200

  sandbox:
    enabled: false  # Enable if using Docker

# ── Agents ────────────────────────────────────────────────────────────────
agents:
  default:
    modelTier: 'default'
    toolPolicy: 'coding'
    compaction:
      thresholdPct: 80
      keepTurns: 4
      summaryMaxTokens: 1024

# ── Automation ────────────────────────────────────────────────────────────
automation:
  cron:
    enabled: false

  webhooks:
    enabled: false

  heartbeat:
    enabled: true
    interval: '5m'
    checks:
      - 'gateway'
      - 'model'
      - 'channels'
      - 'memory'
      - 'disk'
    notifications:
      - type: 'telegram'
        chatId: '123456789'

# ── Logging ───────────────────────────────────────────────────────────────
logging:
  level: 'info'  # debug, info, warn, error
```
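
The `retry` settings under `models.router` read like an exponential backoff. As a rough sketch, assuming the wait before retry n is `initialDelayMs * multiplier^(n-1)` capped at `maxDelayMs` (Flynn's actual schedule, including any jitter, is not specified in this guide), the delays can be computed like this. `retry_delay_ms` is a hypothetical helper for illustration only.

```bash
#!/bin/bash
# Prints the assumed wait in milliseconds before retry number $1.
# Arguments: retry number, initialDelayMs, multiplier, maxDelayMs.
retry_delay_ms() {
  local retry="$1" initial="$2" multiplier="$3" max="$4"
  local delay="$initial" i=1
  # Multiply once per earlier retry, stopping early once the cap is reached
  while [ "$i" -lt "$retry" ] && [ "$delay" -lt "$max" ]; do
    delay=$((delay * multiplier))
    i=$((i + 1))
  done
  if [ "$delay" -gt "$max" ]; then
    delay="$max"
  fi
  printf '%s\n' "$delay"
}
```

Under that assumption, the template values give waits of 1000 ms and 2000 ms before the second and third attempts (`retry_delay_ms 1 1000 2 30000` and `retry_delay_ms 2 1000 2 30000`); the 30000 ms cap only comes into play if `maxAttempts` is raised.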

### Config Validation

Validate config before starting:

```bash
flynn doctor --config /etc/flynn/config.yaml
```

## Monitoring

### Health Checks

Flynn provides a health check endpoint:

```bash
# HTTP health check
curl http://localhost:18800/health

# Example response:
# {
#   "status": "ok",
#   "version": "0.1.0",
#   "uptime": 12345
# }
```
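
In deployment scripts it is often helpful to wait for the health endpoint rather than probe it once. The following is a generic polling sketch; `wait_until` is an illustrative helper, not something Flynn ships.

```bash
#!/bin/bash
# Runs a command up to $1 times, sleeping $2 seconds between failed attempts.
# Succeeds as soon as the command succeeds; fails if every attempt fails.
wait_until() {
  local attempts="$1" delay="$2" i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}
```

For example, `wait_until 10 3 curl -fsS -o /dev/null http://localhost:18800/health` polls the gateway for up to roughly 30 seconds after a restart and returns a non-zero status if it never responds successfully.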

### Logs

#### Journalctl (systemd)

```bash
# Follow logs
sudo journalctl -u flynn -f

# View last 100 lines
sudo journalctl -u flynn -n 100 --no-pager

# View logs since yesterday
sudo journalctl -u flynn --since yesterday

# Search for errors
sudo journalctl -u flynn | grep -i error
```

#### Log Rotation

Flynn logs to the systemd journal, so size and retention are limited in the journald configuration rather than with logrotate:

```bash
sudo vim /etc/systemd/journald.conf
```

```
[Journal]
SystemMaxUse=100M
MaxRetentionSec=7day
```

Restart journald to apply the change:
```bash
sudo systemctl restart systemd-journald
```

### Heartbeat Monitor

Enable built-in heartbeat monitoring:

```yaml
automation:
  heartbeat:
    enabled: true
    interval: '5m'
    checks:
      - 'gateway'
      - 'model'
      - 'channels'
      - 'memory'
      - 'disk'
    notifications:
      - type: 'telegram'
        chatId: '123456789'
      - type: 'webhook'
        url: 'https://hooks.slack.com/services/...'
```

### External Monitoring

#### Prometheus (Optional)

Prometheus metrics are not currently implemented. A future release could expose them, for example with the Node.js `prom-client` library, through a config such as:

```yaml
# Future feature
monitoring:
  prometheus:
    enabled: true
    port: 9090
```

#### Uptime Monitoring

Use external services:
- UptimeRobot
- Pingdom
- Better Uptime

Monitor:
- Gateway HTTP health endpoint
- WebSocket connection
- Response time

## Backup & Recovery

### What to Backup

1. **Configuration**: `/etc/flynn/config.yaml`
2. **Sessions**: SQLite database at `$FLYNN_DATA_DIR/sessions.db` (the data directory defaults to `~/.local/share/flynn`; the systemd setup above uses `/var/lib/flynn`)
3. **Memory Files**: `$FLYNN_DATA_DIR/memory/`
4. **Vectors**: SQLite database at `$FLYNN_DATA_DIR/vectors.db`
5. **Pairing Codes**: stored in a table within `sessions.db`

### Backup Script

Create `/usr/local/bin/flynn-backup.sh`:

```bash
#!/bin/bash
# Run as root (for example from root's crontab).
set -euo pipefail
umask 077  # Backups contain secrets; keep them private

BACKUP_DIR="/var/backups/flynn"
DATA_DIR="/var/lib/flynn"
CONFIG_DIR="/etc/flynn"
DATE=$(date +%Y%m%d_%H%M%S)
BACKUP_FILE="$BACKUP_DIR/flynn_$DATE.tar.gz"

# Create backup directory
mkdir -p "$BACKUP_DIR"

# Stop Flynn so the SQLite files are consistent, and make sure it is
# restarted even if a later step fails
systemctl stop flynn
trap 'systemctl start flynn' EXIT

# Create backup
tar -czf "$BACKUP_FILE" \
  "$CONFIG_DIR/config.yaml" \
  "$DATA_DIR/sessions.db" \
  "$DATA_DIR/vectors.db" \
  "$DATA_DIR/memory/"

# Delete backups older than 90 days
find "$BACKUP_DIR" -name "flynn_*.tar.gz" -mtime +90 -delete

echo "Backup created: $BACKUP_FILE"
```

Make executable:
```bash
sudo chmod +x /usr/local/bin/flynn-backup.sh
```

### Cron Job

Add to root crontab:

```bash
sudo crontab -e
```

```
# Daily backup at 2 AM
0 2 * * * /usr/local/bin/flynn-backup.sh >> /var/log/flynn-backup.log 2>&1
```

### Restore

```bash
# Stop Flynn
sudo systemctl stop flynn

# Extract backup
sudo tar -xzf /var/backups/flynn/flynn_20250213_020000.tar.gz -C /

# Start Flynn
sudo systemctl start flynn
```
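
Before extracting an archive over a live system, it may be worth confirming that it is readable and contains the expected files. The helper below is a sketch with an illustrative name (`backup_contains`); member paths are given without the leading `/`, which `tar` strips when it creates the archive.

```bash
#!/bin/bash
# Succeeds when archive $1 is readable and contains every member listed after it.
backup_contains() {
  local archive="$1" listing member
  shift
  # Fails here if the file is missing, truncated, or not a gzip-compressed tar
  listing="$(tar -tzf "$archive")" || return 1
  for member in "$@"; do
    # -x matches whole lines, -F treats the member name as a literal string
    printf '%s\n' "$listing" | grep -qxF "$member" || return 1
  done
}
```

For example: `backup_contains /var/backups/flynn/flynn_20250213_020000.tar.gz etc/flynn/config.yaml var/lib/flynn/sessions.db var/lib/flynn/vectors.db`.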

### Database Maintenance

Run SQLite vacuum periodically:

```bash
sqlite3 /var/lib/flynn/sessions.db "VACUUM;"
sqlite3 /var/lib/flynn/vectors.db "VACUUM;"
```

Add to crontab (monthly):
```
0 0 1 * * sqlite3 /var/lib/flynn/sessions.db "VACUUM;" >> /var/log/flynn-maintenance.log 2>&1
```
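
Alongside `VACUUM`, SQLite's built-in `PRAGMA integrity_check` can confirm that a database file is intact, for example after a restore. The wrapper below is a minimal sketch; `db_is_healthy` is an illustrative name and it assumes the `sqlite3` CLI is installed.

```bash
#!/bin/bash
# Succeeds when the SQLite database at $1 exists and passes integrity_check.
db_is_healthy() {
  # Without this guard, sqlite3 would create an empty database and report "ok"
  [ -f "$1" ] || return 1
  [ "$(sqlite3 "$1" 'PRAGMA integrity_check;' 2>/dev/null)" = "ok" ]
}
```

It could be run as `db_is_healthy /var/lib/flynn/sessions.db || echo "sessions.db failed integrity check"`.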

## Performance Tuning

### Node.js Tuning

Set Node.js options for production:

```bash
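# In the systemd service file. Keep the heap limit below the unit's memory
# limit (2G above) so V8 collects garbage before the process is OOM-killed.
Environment="NODE_OPTIONS=--max-old-space-size=1536"

# Or via environment variable
export NODE_OPTIONS="--max-old-space-size=1536"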
```

### Context Management

Optimize compaction settings:

```yaml
agents:
  default:
    compaction:
      thresholdPct: 75  # Trigger earlier
      keepTurns: 6      # Keep more context
      summaryMaxTokens: 2048  # Better summaries
```
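
To get a feel for what a threshold change means in tokens, here is a small arithmetic sketch. It assumes `thresholdPct` is a percentage of the model's context window measured in tokens, which this guide does not state explicitly; the 200000-token window is only an example value and `compaction_trigger_tokens` is a hypothetical helper.

```bash
#!/bin/bash
# Prints the token count at which compaction would start, given a context
# window size in tokens ($1) and a threshold percentage ($2).
compaction_trigger_tokens() {
  local context_window="$1" threshold_pct="$2"
  printf '%s\n' $(( context_window * threshold_pct / 100 ))
}
```

Under that assumption, lowering the threshold from 80 to 75 on a 200000-token window moves the trigger from 160000 to 150000 tokens.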

### SQLite Performance

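Enable WAL mode. The journal mode is stored in the database file, so it only needs to be set once:

```bash
sqlite3 /var/lib/flynn/sessions.db "PRAGMA journal_mode=WAL;"

# synchronous and cache_size apply per connection and are not persisted,
# so running them from the sqlite3 CLI does not affect Flynn's own connections.
# They only take effect if the application sets them when it opens the database:
#   PRAGMA synchronous=NORMAL;
#   PRAGMA cache_size=-64000;  -- about 64MB
```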

### Model Routing

Configure tiers for optimal cost/latency:

```yaml
models:
  router:
    tiers:
      fast: 'anthropic:claude-3-5-haiku-20241022'    # Quick tasks
      default: 'anthropic:claude-sonnet-4-20250514'  # General use
      complex: 'anthropic:claude-opus-4-20250514'     # Complex reasoning
      local: 'ollama:llama3'                          # Fallback
```

### Caching (Future)

Consider adding caching for:
- Repeated tool calls
- Memory search results
- Model responses for common queries

## Scaling Considerations

### Single-Operator Scope

Flynn is designed for a single operator with multiple concurrent users. Limitations:

- **Max Concurrent Sessions**: ~100 (depends on model rate limits)
- **Throughput**: ~10-20 messages/second (varies by model)
- **Memory Usage**: 2-4GB for moderate usage

### When to Scale Up

Consider scaling if:
- Consistent CPU usage > 80%
- Memory usage > 4GB
- Frequent rate limiting from model providers
- Slow response times > 30 seconds
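
The thresholds above can be turned into a simple check for a monitoring script. This is a sketch only: `scale_warnings` is an illustrative helper, and it expects the caller to supply already-measured integer values (average CPU percent, resident memory in MB, response time in seconds).

```bash
#!/bin/bash
# Prints one line per exceeded threshold; succeeds only when none is exceeded.
# Arguments: CPU percent, memory in MB, response time in seconds (integers).
scale_warnings() {
  local cpu_pct="$1" mem_mb="$2" response_sec="$3" status=0
  if [ "$cpu_pct" -gt 80 ]; then
    echo "cpu above 80%"
    status=1
  fi
  if [ "$mem_mb" -gt 4096 ]; then
    echo "memory above 4GB"
    status=1
  fi
  if [ "$response_sec" -gt 30 ]; then
    echo "response time above 30s"
    status=1
  fi
  return "$status"
}
```

For example, `scale_warnings 90 2048 40` reports the CPU and response-time thresholds and returns a non-zero status. Rate limiting from model providers is not covered here and would need to be read from the logs.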

### Scaling Strategies

1. **Horizontal Scaling**: Deploy multiple Flynn instances behind a load balancer (not currently supported - sessions are stateful)

2. **Vertical Scaling**: Increase server resources (CPU, memory)

3. **Multi-Instance Architecture** (future):
   - Shared session storage (PostgreSQL/Redis)
   - Message queue for request distribution
   - Session affinity for stateful connections

### Cost Optimization

- Use local models for non-critical tasks
- Cache embeddings
- Optimize compaction to reduce token usage
- Use efficient models for delegated tasks

## Troubleshooting Production Issues

### Service Won't Start

```bash
# Check status
sudo systemctl status flynn

# View logs
sudo journalctl -u flynn -n 50 --no-pager

# Validate config
flynn doctor --config /etc/flynn/config.yaml
```

### High Memory Usage

```bash
# Check memory
free -h

# Check process memory
ps aux | grep flynn

# Restart service
sudo systemctl restart flynn
```

### Gateway Connection Issues

```bash
# Check if port is listening
sudo ss -tlnp | grep 18800

# Check firewall
sudo ufw status

# Test connectivity
curl http://localhost:18800/health
```

### Slow Response Times

```bash
# Check CPU usage
top

# Check model provider status
# Verify API keys are valid
# Check network latency

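# Enable debug logging: set logging.level to 'debug' in /etc/flynn/config.yaml,
# then restart (variables set in front of sudo do not reach the service)
sudo systemctl restart flynn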
```

---

For additional help, see:
- [TROUBLESHOOTING.md](../../TROUBLESHOOTING.md)
- [README.md](../../README.md)
- GitHub Issues