
swarm

This directory is the source of truth for the OpenClaw VM infrastructure. It is shared into the zap VM via virtiofs (mounted at /mnt/swarm inside the guest, active after reboot).
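Once the guest is up, the share can be confirmed from inside the VM:

```shell
# Inside the guest: confirm the virtiofs share is mounted at /mnt/swarm
findmnt -t virtiofs /mnt/swarm
```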

Directory Structure

swarm/
├── ansible/                    # VM provisioning and configuration
│   ├── inventory.yml           # Host definitions
│   ├── host_vars/
│   │   └── zap.yml             # All zap-specific variables
│   ├── playbooks/
│   │   ├── provision-vm.yml    # Create the VM on the hypervisor
│   │   ├── install.yml         # Install OpenClaw on the guest
│   │   └── customize.yml       # Post-provision tweaks
│   └── roles/
│       ├── openclaw/           # Upstream role (from openclaw-ansible)
│       └── vm/                 # VM provisioning role (local)
├── openclaw/                   # Live mirror of guest ~/.openclaw/
├── docker-compose.yaml         # LiteLLM + supporting services
├── litellm-config.yaml         # LiteLLM static config
├── litellm-init-credentials.sh # Register API keys into LiteLLM DB
├── litellm-init-models.sh      # Register models into LiteLLM DB (idempotent)
├── litellm-dedup.sh            # Remove duplicate model DB entries
├── litellm-health-check.sh     # Liveness check + auto-dedup (run by systemd timer)
├── backup-openclaw-vm.sh       # Sync openclaw/ + upload to MinIO
├── restore-openclaw-vm.sh      # Full VM redeploy from scratch
└── README.md                   # This file

VM: zap

Property        Value
Libvirt domain  zap [claw]
Guest hostname  zap
IP              192.168.122.182 (static DHCP)
MAC             52:54:00:01:00:71
RAM             3 GiB
vCPUs           2
Disk            /var/lib/libvirt/images/claw.qcow2 (60 GiB qcow2)
OS              Ubuntu 24.04
Firmware        EFI + Secure Boot + TPM 2.0
Autostart       enabled
virtiofs        ~/lab/swarm → /mnt/swarm (active after reboot)
Swappiness      10

SSH access:

ssh root@192.168.122.182      # privileged operations
ssh openclaw@192.168.122.182  # application-level access

Provisioning a New VM

Use this when deploying zap from scratch on a fresh hypervisor, or creating a new instance.

Step 1 — Create the VM

cd ~/lab/swarm/ansible
ansible-playbook -i inventory.yml playbooks/provision-vm.yml --limit zap

This will:

  • Download the Ubuntu 24.04 cloud image (cached at /var/lib/libvirt/images/)
  • Create the disk image via copy-on-write (claw.qcow2, 60 GiB)
  • Build a cloud-init seed ISO with your SSH key and hostname
  • Define the VM XML (EFI, memfd shared memory, virtiofs, TPM, watchdog)
  • Add a static DHCP reservation for the MAC/IP pair
  • Enable autostart and start the VM
  • Wait for SSH to become available
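The result can be spot-checked from the hypervisor once the playbook completes (domain name and IP taken from the table above):

```shell
virsh dominfo 'zap [claw]'           # state, RAM, vCPUs, autostart
virsh domifaddr 'zap [claw]'         # confirm the DHCP reservation (expect 192.168.122.182)
ssh root@192.168.122.182 hostname    # expect: zap
```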

Step 2 — Install OpenClaw

ansible-playbook -i inventory.yml playbooks/install.yml --limit zap

Installs Node.js, pnpm, Docker, UFW, fail2ban, Tailscale, and OpenClaw via the upstream openclaw-ansible role.

Step 3 — Apply customizations

ansible-playbook -i inventory.yml playbooks/customize.yml --limit zap

Applies settings not covered by the upstream role:

  • vm.swappiness=10 (live + persisted)
  • virtiofs fstab entry (tag swarm → /mnt/swarm)
  • loginctl enable-linger openclaw (for user systemd services)
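For reference, these changes are roughly equivalent to running the following on the guest by hand (the sysctl drop-in filename is an assumption; the playbook may use a different path):

```shell
# Live + persisted swappiness
sudo sysctl -w vm.swappiness=10
echo 'vm.swappiness=10' | sudo tee /etc/sysctl.d/99-swappiness.conf

# virtiofs share (tag "swarm" from the VM definition)
echo 'swarm /mnt/swarm virtiofs defaults 0 0' | sudo tee -a /etc/fstab

# Let the openclaw user's systemd services run without a login session
sudo loginctl enable-linger openclaw
```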

Step 4 — Restore config

~/lab/swarm/restore-openclaw-vm.sh zap

Rsyncs openclaw/ back to ~/.openclaw/ on the guest and restarts the gateway service.

All-in-one redeploy

# Existing VM (just re-provision guest)
~/lab/swarm/restore-openclaw-vm.sh zap

# Fresh VM at a new IP
~/lab/swarm/restore-openclaw-vm.sh zap <new-ip>

When a target IP is passed, restore-openclaw-vm.sh runs all four steps above in sequence.

Backup

The openclaw/ directory is a live rsync mirror of the guest's ~/.openclaw/, automatically updated daily at 03:00 by a systemd user timer.

# Run manually
~/lab/swarm/backup-openclaw-vm.sh zap

# Check timer status
systemctl --user status openclaw-backup.timer
systemctl --user list-timers openclaw-backup.timer
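The timer unit looks roughly like this (contents are a sketch; only the 03:00 schedule is taken from above):

```ini
# ~/.config/systemd/user/openclaw-backup.timer (sketch)
[Unit]
Description=Daily OpenClaw backup

[Timer]
OnCalendar=*-*-* 03:00:00
Persistent=true

[Install]
WantedBy=timers.target
```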

What is backed up

Included:

  • openclaw.json (main config)
  • secrets.json (API keys)
  • credentials/, identity/
  • memory/, agents/
  • hooks/, cron/, telegram/
  • workspace-*/ (provider workspaces)

Excluded:

  • workspace/ (2.6 GiB conversation history)
  • logs/
  • extensions-quarantine/
  • *.bak*, *.backup-*, *.pre-*, *.failed
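In rsync terms, the exclusions amount to filters like these (a sketch; backup-openclaw-vm.sh is the authoritative implementation):

```shell
rsync -a --delete \
  --exclude='workspace/' --exclude='logs/' --exclude='extensions-quarantine/' \
  --exclude='*.bak*' --exclude='*.backup-*' --exclude='*.pre-*' --exclude='*.failed' \
  openclaw@192.168.122.182:.openclaw/ ~/lab/swarm/openclaw/
```

Note that `workspace/` matches only the directory of that exact name, so the `workspace-*/` provider workspaces are still copied.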

MinIO

Timestamped archives are uploaded to MinIO on every backup run:

Property     Value
Endpoint     http://192.168.153.253:9000
Bucket       s3://zap/backups/
Retention    7 most recent archives
Credentials  ~/.aws/credentials (default profile)

To list available archives:

aws s3 ls s3://zap/backups/
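To fetch the newest archive (a sketch; archive names are whatever backup-openclaw-vm.sh produces, sorted here by the listing's timestamp):

```shell
latest=$(aws s3 ls s3://zap/backups/ | sort | tail -n 1 | awk '{print $4}')
aws s3 cp "s3://zap/backups/${latest}" /tmp/
```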

LiteLLM

LiteLLM runs as a Docker service (litellm, port 18804) backed by a Postgres database (litellm-db). It acts as a unified OpenAI-compatible proxy over Anthropic, OpenAI, Gemini, ZAI/GLM, and GitHub Copilot.

Starting

cd ~/lab/swarm
docker compose --profile api up -d
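A quick check that the stack came up (the liveliness endpoint path is assumed from LiteLLM's upstream defaults):

```shell
docker compose ps litellm litellm-db
curl -s http://localhost:18804/health/liveliness
```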

Credentials and model registration

On first start, litellm-init registers API credentials and all models into the DB. It is idempotent — re-running it when models already exist is a no-op (guarded by a gpt-4o sentinel check). To force a re-run (e.g. after adding new models to litellm-init-models.sh):

docker compose --profile api run --rm \
  -e FORCE=1 litellm-init

Adding a new model

  1. Add an add_model (or add_copilot_model) call to litellm-init-models.sh
  2. Register it live via the API (no restart needed):
source .env
curl -X POST http://localhost:18804/model/new \
  -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model_name":"<name>","litellm_params":{"model":"<provider>/<model>","api_key":"os.environ/<KEY_VAR>"}}'
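To confirm registration, query the proxy's standard OpenAI-compatible model list and look for the new name:

```shell
source .env
curl -s http://localhost:18804/v1/models \
  -H "Authorization: Bearer $LITELLM_MASTER_KEY" | jq -r '.data[].id'
```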

Maintenance scripts

Script                   Purpose
litellm-dedup.sh         Remove duplicate model DB entries (run --dry-run to preview)
litellm-health-check.sh  Liveness check + auto-dedup; run by systemd timer

# Manual dedup
./litellm-dedup.sh

# Manual health check
./litellm-health-check.sh

# Check maintenance log
tail -f litellm-maintenance.log

Systemd timer

litellm-health-check.timer runs every 6 hours (user session, enabled at install). It checks liveness (restarting the container if unresponsive) and removes any duplicate model entries.

systemctl --user status litellm-health-check.timer
systemctl --user list-timers litellm-health-check.timer
journalctl --user -u litellm-health-check.service -n 20

Troubleshooting

Model returns 429 "No deployments available"

All deployments for that model group are in cooldown (usually from a transient upstream error). Restart litellm to clear:

docker restart litellm

Model returns upstream subscription error

The API key in use does not have access to that model. Check the provider's plan. The model will stay in cooldown until litellm is restarted; consider removing it from the DB if access is not expected.
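Removing a model from the DB can be done through LiteLLM's management API (request shape assumed; verify against your LiteLLM version):

```shell
source .env
# Find the model's DB id first
curl -s http://localhost:18804/model/info \
  -H "Authorization: Bearer $LITELLM_MASTER_KEY" | jq '.data[] | {model_name, id: .model_info.id}'
# Then delete it by id
curl -X POST http://localhost:18804/model/delete \
  -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
  -H "Content-Type: application/json" \
  -d '{"id":"<model-id>"}'
```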

Duplicate model entries

Caused by running litellm-init multiple times. Run ./litellm-dedup.sh to clean up. The health-check timer also auto-deduplicates when DEDUP=1 (the default).

Adding a New Instance

  1. Add an entry to ansible/inventory.yml
  2. Create ansible/host_vars/<name>.yml with VM and OpenClaw variables (copy host_vars/zap.yml as a template)
  3. Run the four provisioning steps above
  4. Add the instance to ~/.claude/state/openclaw-instances.json
  5. Add a backup timer: copy ~/.config/systemd/user/openclaw-backup.{service,timer}, update the instance name, reload
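Step 5 can be sketched as follows (this assumes the instance name appears literally inside the unit files and that per-instance copies are named by suffix; adjust to the actual naming scheme):

```shell
cd ~/.config/systemd/user
for u in service timer; do
  sed 's/zap/<name>/g' "openclaw-backup.$u" > "openclaw-backup-<name>.$u"
done
systemctl --user daemon-reload
systemctl --user enable --now "openclaw-backup-<name>.timer"
```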

Ansible Role Reference

vm role (roles/vm/)

Provisions the KVM/libvirt VM on the hypervisor host. Variables (set in host_vars):

Variable            Description          Example
vm_domain           Libvirt domain name  "zap [claw]"
vm_hostname         Guest hostname       zap
vm_memory_mib       RAM in MiB           3072
vm_vcpus            vCPU count           2
vm_disk_path        qcow2 path on host   /var/lib/libvirt/images/claw.qcow2
vm_disk_size        Disk size            60G
vm_mac              Network MAC address  52:54:00:01:00:71
vm_ip               Static DHCP IP       192.168.122.182
vm_virtiofs_source  Host path to share   /home/will/lab/swarm
vm_virtiofs_tag     Mount tag in guest   swarm
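Collected into a host_vars sketch (values are zap's, from the table above; the real host_vars/zap.yml may contain additional keys):

```yaml
# ansible/host_vars/zap.yml (illustrative)
vm_domain: "zap [claw]"
vm_hostname: zap
vm_memory_mib: 3072
vm_vcpus: 2
vm_disk_path: /var/lib/libvirt/images/claw.qcow2
vm_disk_size: 60G
vm_mac: "52:54:00:01:00:71"
vm_ip: 192.168.122.182
vm_virtiofs_source: /home/will/lab/swarm
vm_virtiofs_tag: swarm
```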

openclaw role (roles/openclaw/)

Upstream role from openclaw-ansible. Installs and configures OpenClaw on the guest. Key variables:

Variable               Value
openclaw_install_mode  release
openclaw_ssh_keys      will's public key

customize.yml playbook

Post-provision tweaks applied after the upstream role:

  • vm.swappiness = 10
  • /etc/fstab entry for virtiofs swarm share
  • loginctl enable-linger openclaw