# swarm

This directory is the source of truth for the OpenClaw VM infrastructure. It is shared into the `zap` VM via virtiofs (mounted at `/mnt/swarm` inside the guest, active after reboot).

## Directory Structure

```
swarm/
├── ansible/                      # VM provisioning and configuration
│   ├── inventory.yml             # Host definitions
│   ├── host_vars/
│   │   └── zap.yml               # All zap-specific variables
│   ├── playbooks/
│   │   ├── provision-vm.yml      # Create the VM on the hypervisor
│   │   ├── install.yml           # Install OpenClaw on the guest
│   │   └── customize.yml         # Post-provision tweaks
│   └── roles/
│       ├── openclaw/             # Upstream role (from openclaw-ansible)
│       └── vm/                   # VM provisioning role (local)
├── openclaw/                     # Live mirror of guest ~/.openclaw/
├── docker-compose.yaml           # LiteLLM + supporting services
├── litellm-config.yaml           # LiteLLM static config
├── litellm-init-credentials.sh   # Register API keys into LiteLLM DB
├── litellm-init-models.sh        # Register models into LiteLLM DB (idempotent)
├── litellm-dedup.sh              # Remove duplicate model DB entries
├── litellm-health-check.sh       # Liveness check + auto-dedup (run by systemd timer)
├── backup-openclaw-vm.sh         # Sync openclaw/ + upload to MinIO
├── restore-openclaw-vm.sh        # Full VM redeploy from scratch
└── README.md                     # This file
```

## VM: zap

| Property | Value |
|----------|-------|
| Libvirt domain | `zap [claw]` |
| Guest hostname | `zap` |
| IP | `192.168.122.182` (static DHCP) |
| MAC | `52:54:00:01:00:71` |
| RAM | 3 GiB |
| vCPUs | 2 |
| Disk | `/var/lib/libvirt/images/claw.qcow2` (60 GiB qcow2) |
| OS | Ubuntu 24.04 |
| Firmware | EFI + Secure Boot + TPM 2.0 |
| Autostart | enabled |
| virtiofs | `~/lab/swarm` → `/mnt/swarm` (active after reboot) |
| Swappiness | 10 |

SSH access:

```bash
ssh root@192.168.122.182      # privileged operations
ssh openclaw@192.168.122.182  # application-level access
```

## Provisioning a New VM

Use this when deploying zap from scratch on a fresh hypervisor, or creating a new instance.
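The four steps below can be chained into one wrapper. This is a hypothetical sketch (the `redeploy` helper name and the `SWARM_DIR`/`ANSIBLE` overrides are illustrative, not the actual `restore-openclaw-vm.sh` implementation); the playbook paths are the ones used throughout this README:

```shell
# Hypothetical bash helper chaining the provisioning steps for one host.
# SWARM_DIR and ANSIBLE are overridable, e.g. ANSIBLE=echo for a dry run.
redeploy() {
  local host="$1"
  local swarm="${SWARM_DIR:-$HOME/lab/swarm}"
  local ansible="${ANSIBLE:-ansible-playbook}"
  local pb
  # Steps 1-3: provision, install, customize
  for pb in provision-vm install customize; do
    ( cd "$swarm/ansible" && \
      "$ansible" -i inventory.yml "playbooks/${pb}.yml" --limit "$host" ) || return 1
  done
  # Step 4: restore config onto the guest
  "$swarm/restore-openclaw-vm.sh" "$host"
}

# Dry run: print the playbook invocations instead of executing them
# ANSIBLE=echo redeploy zap
```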
### Step 1 — Create the VM

```bash
cd ~/lab/swarm/ansible
ansible-playbook -i inventory.yml playbooks/provision-vm.yml --limit zap
```

This will:

- Download the Ubuntu 24.04 cloud image (cached at `/var/lib/libvirt/images/`)
- Create the disk image via copy-on-write (`claw.qcow2`, 60 GiB)
- Build a cloud-init seed ISO with your SSH key and hostname
- Define the VM XML (EFI, memfd shared memory, virtiofs, TPM, watchdog)
- Add a static DHCP reservation for the MAC/IP pair
- Enable autostart and start the VM
- Wait for SSH to become available

### Step 2 — Install OpenClaw

```bash
ansible-playbook -i inventory.yml playbooks/install.yml --limit zap
```

Installs Node.js, pnpm, Docker, UFW, fail2ban, Tailscale, and OpenClaw via the upstream `openclaw-ansible` role.

### Step 3 — Apply customizations

```bash
ansible-playbook -i inventory.yml playbooks/customize.yml --limit zap
```

Applies settings not covered by the upstream role:

- `vm.swappiness=10` (live + persisted)
- virtiofs fstab entry (`swarm` → `/mnt/swarm`)
- `loginctl enable-linger openclaw` (for user systemd services)

### Step 4 — Restore config

```bash
~/lab/swarm/restore-openclaw-vm.sh zap
```

Rsyncs `openclaw/` back to `~/.openclaw/` on the guest and restarts the gateway service.

### All-in-one redeploy

```bash
# Existing VM (just re-provision guest)
~/lab/swarm/restore-openclaw-vm.sh zap

# Fresh VM at a new IP
~/lab/swarm/restore-openclaw-vm.sh zap <new-ip>
```

When a target IP is passed, `restore-openclaw-vm.sh` runs all four steps above in sequence.

## Backup

The `openclaw/` directory is a live rsync mirror of the guest's `~/.openclaw/`, automatically updated daily at 03:00 by a systemd user timer.
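The timer pair behind this lives in the user session; a sketch of what such units look like (the `openclaw-backup.{service,timer}` names are the ones referenced under "Adding a New Instance" below, but the unit contents here are an illustrative assumption, not the deployed files):

```ini
# ~/.config/systemd/user/openclaw-backup.timer (sketch)
[Unit]
Description=Daily OpenClaw backup

[Timer]
OnCalendar=*-*-* 03:00:00
Persistent=true

[Install]
WantedBy=timers.target
```

The matching `.service` unit would invoke `backup-openclaw-vm.sh zap` as a oneshot.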
```bash
# Run manually
~/lab/swarm/backup-openclaw-vm.sh zap

# Check timer status
systemctl --user status openclaw-backup.timer
systemctl --user list-timers openclaw-backup.timer
```

### What is backed up

| Included | Excluded |
|----------|----------|
| `openclaw.json` (main config) | `workspace/` (2.6 GiB conversation history) |
| `secrets.json` (API keys) | `logs/` |
| `credentials/`, `identity/` | `extensions-quarantine/` |
| `memory/`, `agents/` | `*.bak*`, `*.backup-*`, `*.pre-*`, `*.failed` |
| `hooks/`, `cron/`, `telegram/` | |
| `workspace-*/` (provider workspaces) | |

### MinIO

Timestamped archives are uploaded to MinIO on every backup run:

| Property | Value |
|----------|-------|
| Endpoint | `http://192.168.153.253:9000` |
| Bucket | `s3://zap/backups/` |
| Retention | 7 most recent archives |
| Credentials | `~/.aws/credentials` (default profile) |

To list available archives:

```bash
aws s3 ls s3://zap/backups/
```

## LiteLLM

LiteLLM runs as a Docker service (`litellm`, port 18804) backed by a Postgres database (`litellm-db`). It acts as a unified OpenAI-compatible proxy over Anthropic, OpenAI, Gemini, ZAI/GLM, and GitHub Copilot.

### Starting

```bash
cd ~/lab/swarm
docker compose --profile api up -d
```

### Credentials and model registration

On first start, `litellm-init` registers API credentials and all models into the DB. It is idempotent — re-running it when models already exist is a no-op (guarded by a `gpt-4o` sentinel check).

To force a re-run (e.g. to pick up models newly added to the script):

```bash
docker compose --profile api run --rm \
  -e FORCE=1 litellm-init
```

### Adding a new model

1. Add an `add_model` (or `add_copilot_model`) call to `litellm-init-models.sh`
2.
   Register it live via the API (no restart needed):

   ```bash
   source .env
   curl -X POST http://localhost:18804/model/new \
     -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
     -H "Content-Type: application/json" \
     -d '{"model_name":"<name>","litellm_params":{"model":"<provider>/<model>","api_key":"os.environ/<ENV_VAR>"}}'
   ```

### Maintenance scripts

| Script | Purpose |
|--------|---------|
| `litellm-dedup.sh` | Remove duplicate model DB entries (run with `--dry-run` to preview) |
| `litellm-health-check.sh` | Liveness check + auto-dedup; run by systemd timer |

```bash
# Manual dedup
./litellm-dedup.sh

# Manual health check
./litellm-health-check.sh

# Check maintenance log
tail -f litellm-maintenance.log
```

### Systemd timer

`litellm-health-check.timer` runs every 6 hours (user session, enabled at install). It checks liveness (restarting the container if unresponsive) and removes any duplicate model entries.

```bash
systemctl --user status litellm-health-check.timer
systemctl --user list-timers litellm-health-check.timer
journalctl --user -u litellm-health-check.service -n 20
```

### Troubleshooting

**Model returns 429 "No deployments available"**

All deployments for that model group are in cooldown (usually from a transient upstream error). Restart litellm to clear:

```bash
docker restart litellm
```

**Model returns upstream subscription error**

The API key in use does not have access to that model. Check the provider's plan. The model will stay in cooldown until restarted; consider removing it from the DB if access is not expected.

**Duplicate model entries**

Caused by running `litellm-init` multiple times. Run `./litellm-dedup.sh` to clean up. The health-check timer also auto-deduplicates when `DEDUP=1` (default).

## Adding a New Instance

1. Add an entry to `ansible/inventory.yml`
2. Create `ansible/host_vars/<hostname>.yml` with VM and OpenClaw variables (copy `host_vars/zap.yml` as a template)
3. Run the four provisioning steps above
4. Add the instance to `~/.claude/state/openclaw-instances.json`
5.
   Add a backup timer: copy `~/.config/systemd/user/openclaw-backup.{service,timer}`, update the instance name, and reload the user daemon

## Ansible Role Reference

### `vm` role (`roles/vm/`)

Provisions the KVM/libvirt VM on the hypervisor host. Variables (set in `host_vars`):

| Variable | Description | Example |
|----------|-------------|---------|
| `vm_domain` | Libvirt domain name | `"zap [claw]"` |
| `vm_hostname` | Guest hostname | `zap` |
| `vm_memory_mib` | RAM in MiB | `3072` |
| `vm_vcpus` | vCPU count | `2` |
| `vm_disk_path` | qcow2 path on host | `/var/lib/libvirt/images/claw.qcow2` |
| `vm_disk_size` | Disk size | `60G` |
| `vm_mac` | Network MAC address | `52:54:00:01:00:71` |
| `vm_ip` | Static DHCP IP | `192.168.122.182` |
| `vm_virtiofs_source` | Host path to share | `/home/will/lab/swarm` |
| `vm_virtiofs_tag` | Mount tag in guest | `swarm` |

### `openclaw` role (`roles/openclaw/`)

Upstream role from [openclaw-ansible](https://github.com/openclaw/openclaw-ansible). Installs and configures OpenClaw on the guest. Key variables:

| Variable | Value |
|----------|-------|
| `openclaw_install_mode` | `release` |
| `openclaw_ssh_keys` | will's public key |

### `customize.yml` playbook

Post-provision tweaks applied after the upstream role:

- `vm.swappiness = 10`
- `/etc/fstab` entry for virtiofs `swarm` share
- `loginctl enable-linger openclaw`
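Putting the `vm` role's variable table together, a second instance's `host_vars` file might look like this. Every value below (hostname, domain name, disk path, MAC, IP) is a hypothetical placeholder to adapt, not an existing host:

```yaml
# Hypothetical ansible/host_vars/<hostname>.yml for a second instance;
# all values are placeholders, modeled on host_vars/zap.yml.
vm_domain: "newhost [claw2]"
vm_hostname: newhost
vm_memory_mib: 3072
vm_vcpus: 2
vm_disk_path: /var/lib/libvirt/images/claw2.qcow2
vm_disk_size: 60G
vm_mac: "52:54:00:01:00:72"
vm_ip: 192.168.122.183
vm_virtiofs_source: /home/will/lab/swarm
vm_virtiofs_tag: swarm
```

Pick a MAC/IP pair that does not collide with an existing DHCP reservation on the libvirt network.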