# swarm

This directory is the source of truth for the OpenClaw VM infrastructure. It is shared into the zap VM via virtiofs (mounted at `/mnt/swarm` inside the guest, active after reboot).
## Directory Structure

```
swarm/
├── ansible/                      # VM provisioning and configuration
│   ├── inventory.yml             # Host definitions
│   ├── host_vars/
│   │   └── zap.yml               # All zap-specific variables
│   ├── playbooks/
│   │   ├── provision-vm.yml      # Create the VM on the hypervisor
│   │   ├── install.yml           # Install OpenClaw on the guest
│   │   └── customize.yml         # Post-provision tweaks
│   └── roles/
│       ├── openclaw/             # Upstream role (from openclaw-ansible)
│       └── vm/                   # VM provisioning role (local)
├── openclaw/                     # Live mirror of guest ~/.openclaw/
├── docker-compose.yaml           # LiteLLM + supporting services
├── litellm-config.yaml           # LiteLLM static config
├── litellm-init-credentials.sh   # Register API keys into LiteLLM DB
├── litellm-init-models.sh        # Register models into LiteLLM DB (idempotent)
├── litellm-dedup.sh              # Remove duplicate model DB entries
├── litellm-health-check.sh       # Liveness check + auto-dedup (run by systemd timer)
├── backup-openclaw-vm.sh         # Sync openclaw/ + upload to MinIO
├── restore-openclaw-vm.sh        # Full VM redeploy from scratch
└── README.md                     # This file
```
## VM: zap
| Property | Value |
|---|---|
| Libvirt domain | zap [claw] |
| Guest hostname | zap |
| IP | 192.168.122.182 (static DHCP) |
| MAC | 52:54:00:01:00:71 |
| RAM | 3 GiB |
| vCPUs | 2 |
| Disk | /var/lib/libvirt/images/claw.qcow2 (60 GiB qcow2) |
| OS | Ubuntu 24.04 |
| Firmware | EFI + Secure Boot + TPM 2.0 |
| Autostart | enabled |
| virtiofs | ~/lab/swarm → /mnt/swarm (active after reboot) |
| Swappiness | 10 |
SSH access:

```sh
ssh root@192.168.122.182      # privileged operations
ssh openclaw@192.168.122.182  # application-level access
```
## Provisioning a New VM

Use this when deploying zap from scratch on a fresh hypervisor, or creating a new instance.
### Step 1 — Create the VM

```sh
cd ~/lab/swarm/ansible
ansible-playbook -i inventory.yml playbooks/provision-vm.yml --limit zap
```
This will:

- Download the Ubuntu 24.04 cloud image (cached at `/var/lib/libvirt/images/`)
- Create the disk image via copy-on-write (`claw.qcow2`, 60 GiB)
- Build a cloud-init seed ISO with your SSH key and hostname
- Define the VM XML (EFI, memfd shared memory, virtiofs, TPM, watchdog)
- Add a static DHCP reservation for the MAC/IP pair
- Enable autostart and start the VM
- Wait for SSH to become available
### Step 2 — Install OpenClaw

```sh
ansible-playbook -i inventory.yml playbooks/install.yml --limit zap
```

Installs Node.js, pnpm, Docker, UFW, fail2ban, Tailscale, and OpenClaw via the upstream `openclaw-ansible` role.
### Step 3 — Apply customizations

```sh
ansible-playbook -i inventory.yml playbooks/customize.yml --limit zap
```
Applies settings not covered by the upstream role:

- `vm.swappiness=10` (live + persisted)
- virtiofs fstab entry (`swarm` → `/mnt/swarm`)
- `loginctl enable-linger openclaw` (for user systemd services)
### Step 4 — Restore config

```sh
~/lab/swarm/restore-openclaw-vm.sh zap
```

Rsyncs `openclaw/` back to `~/.openclaw/` on the guest and restarts the gateway service.
### All-in-one redeploy

```sh
# Existing VM (just re-provision guest)
~/lab/swarm/restore-openclaw-vm.sh zap

# Fresh VM at a new IP
~/lab/swarm/restore-openclaw-vm.sh zap <new-ip>
```
When a target IP is passed, restore-openclaw-vm.sh runs all four steps above in sequence.
## Backup
The `openclaw/` directory is a live rsync mirror of the guest's `~/.openclaw/`, automatically updated daily at 03:00 by a systemd user timer.
```sh
# Run manually
~/lab/swarm/backup-openclaw-vm.sh zap

# Check timer status
systemctl --user status openclaw-backup.timer
systemctl --user list-timers openclaw-backup.timer
```
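The timer pair is an ordinary systemd user unit. A minimal sketch of what `openclaw-backup.{service,timer}` could look like (illustrative only; the actual unit contents on the host may differ):

```ini
# ~/.config/systemd/user/openclaw-backup.service (illustrative)
[Unit]
Description=Backup OpenClaw VM zap

[Service]
Type=oneshot
ExecStart=%h/lab/swarm/backup-openclaw-vm.sh zap
```

```ini
# ~/.config/systemd/user/openclaw-backup.timer (illustrative)
[Unit]
Description=Daily OpenClaw backup

[Timer]
OnCalendar=*-*-* 03:00:00
Persistent=true

[Install]
WantedBy=timers.target
```

`Persistent=true` makes a missed 03:00 run fire at the next login rather than being skipped.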
### What is backed up

| Included | Excluded |
|---|---|
| `openclaw.json` (main config) | `workspace/` (2.6 GiB conversation history) |
| `secrets.json` (API keys) | `logs/` |
| `credentials/`, `identity/` | `extensions-quarantine/` |
| `memory/`, `agents/` | `*.bak*`, `*.backup-*`, `*.pre-*`, `*.failed` |
| `hooks/`, `cron/`, `telegram/` | `workspace-*/` (provider workspaces) |
### MinIO
Timestamped archives are uploaded to MinIO on every backup run:
| Property | Value |
|---|---|
| Endpoint | http://192.168.153.253:9000 |
| Bucket | s3://zap/backups/ |
| Retention | 7 most recent archives |
| Credentials | ~/.aws/credentials (default profile) |
To list available archives:

```sh
aws s3 ls s3://zap/backups/
```
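The 7-archive retention rule can be illustrated with a small stand-alone sketch (archive names here are fabricated, and `backup-openclaw-vm.sh` may implement pruning differently). Because archive names embed a sortable timestamp, a reverse lexicographic sort puts the newest first, and everything past the first seven is a deletion candidate:

```sh
# Illustrative only: list the archives that fall outside the 7-archive
# retention window. Names are made up; the real script's naming and
# pruning logic may differ.
keep=7
printf 'zap-2024010%d-0300.tar.gz\n' 9 3 7 1 5 2 8 4 6 \
  | sort -r \
  | tail -n +"$((keep + 1))"   # prints the two oldest archives
```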
## LiteLLM
LiteLLM runs as a Docker service (litellm, port 18804) backed by a Postgres database (litellm-db). It acts as a unified OpenAI-compatible proxy over Anthropic, OpenAI, Gemini, ZAI/GLM, and GitHub Copilot.
### Starting

```sh
cd ~/lab/swarm
docker compose --profile api up -d
```
### Credentials and model registration
On first start, `litellm-init` registers API credentials and all models into the DB. It is idempotent — re-running it when the models already exist is a no-op (guarded by a `gpt-4o` sentinel check). To force a re-run (e.g. after adding new models to the script):
```sh
docker compose --profile api run --rm \
  -e FORCE=1 litellm-init
```
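The guard follows a common sentinel pattern; a hedged sketch of the idea (this is not the actual code in `litellm-init-models.sh`, and `$registered` stands in for a query against the LiteLLM DB or API):

```sh
# Sketch of a sentinel-style idempotency guard (assumed shape, not the
# real litellm-init code). $registered stands in for the list of models
# already present in the DB.
registered='["claude-sonnet","gpt-4o"]'
if printf '%s' "$registered" | grep -q '"gpt-4o"' && [ "${FORCE:-0}" != "1" ]; then
  echo "models already registered; skipping"
else
  echo "registering models"
fi
```

Setting `FORCE=1` bypasses the sentinel check, which is why the `docker compose run` above forces a re-run.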
### Adding a new model

- Add an `add_model` (or `add_copilot_model`) call to `litellm-init-models.sh`
- Register it live via the API (no restart needed):
```sh
source .env
curl -X POST http://localhost:18804/model/new \
  -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model_name":"<name>","litellm_params":{"model":"<provider>/<model>","api_key":"os.environ/<KEY_VAR>"}}'
```
### Maintenance scripts

| Script | Purpose |
|---|---|
| `litellm-dedup.sh` | Remove duplicate model DB entries (run with `--dry-run` to preview) |
| `litellm-health-check.sh` | Liveness check + auto-dedup; run by systemd timer |
```sh
# Manual dedup
./litellm-dedup.sh

# Manual health check
./litellm-health-check.sh

# Check maintenance log
tail -f litellm-maintenance.log
```
### Systemd timer
litellm-health-check.timer runs every 6 hours (user session, enabled at install). It checks liveness (restarting the container if unresponsive) and removes any duplicate model entries.
```sh
systemctl --user status litellm-health-check.timer
systemctl --user list-timers litellm-health-check.timer
journalctl --user -u litellm-health-check.service -n 20
```
## Troubleshooting

**Model returns 429 "No deployments available"**

All deployments for that model group are in cooldown (usually from a transient upstream error). Restart litellm to clear:

```sh
docker restart litellm
```
**Model returns an upstream subscription error**

The API key in use does not have access to that model. Check the provider's plan. The model will stay in cooldown until litellm is restarted; consider removing it from the DB if access is not expected.
**Duplicate model entries**

Caused by running `litellm-init` multiple times. Run `./litellm-dedup.sh` to clean up. The health-check timer also auto-deduplicates when `DEDUP=1` (the default).
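The dedup rule amounts to "keep the first occurrence of each model name". An illustrative one-liner of that idea (not `litellm-dedup.sh` itself, which works against the Postgres DB):

```sh
# Illustration of the dedup rule (not the real script): flag the second
# and later occurrences of each model name as removal candidates.
printf '%s\n' gpt-4o gpt-4o claude-sonnet gpt-4o claude-sonnet \
  | awk 'seen[$0]++ { print "duplicate:", $0 }'
```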
## Adding a New Instance

- Add an entry to `ansible/inventory.yml`
- Create `ansible/host_vars/<name>.yml` with VM and OpenClaw variables (copy `host_vars/zap.yml` as a template)
- Run the four provisioning steps above
- Add the instance to `~/.claude/state/openclaw-instances.json`
- Add a backup timer: copy `~/.config/systemd/user/openclaw-backup.{service,timer}`, update the instance name, and reload
## Ansible Role Reference

### `vm` role (`roles/vm/`)

Provisions the KVM/libvirt VM on the hypervisor host. Variables (set in `host_vars`):
| Variable | Description | Example |
|---|---|---|
| `vm_domain` | Libvirt domain name | `"zap [claw]"` |
| `vm_hostname` | Guest hostname | `zap` |
| `vm_memory_mib` | RAM in MiB | `3072` |
| `vm_vcpus` | vCPU count | `2` |
| `vm_disk_path` | qcow2 path on host | `/var/lib/libvirt/images/claw.qcow2` |
| `vm_disk_size` | Disk size | `60G` |
| `vm_mac` | Network MAC address | `52:54:00:01:00:71` |
| `vm_ip` | Static DHCP IP | `192.168.122.182` |
| `vm_virtiofs_source` | Host path to share | `/home/will/lab/swarm` |
| `vm_virtiofs_tag` | Mount tag in guest | `swarm` |
### `openclaw` role (`roles/openclaw/`)

Upstream role from `openclaw-ansible`. Installs and configures OpenClaw on the guest. Key variables:
| Variable | Value |
|---|---|
| `openclaw_install_mode` | `release` |
| `openclaw_ssh_keys` | will's public key |
### `customize.yml` playbook

Post-provision tweaks applied after the upstream role:

- `vm.swappiness = 10`
- `/etc/fstab` entry for the virtiofs `swarm` share
- `loginctl enable-linger openclaw`
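These tweaks map onto ordinary Ansible modules; a hedged sketch of what such tasks might look like (module choices and task names are illustrative, not the playbook's actual contents):

```yaml
# Illustrative tasks only; the real customize.yml may differ.
- name: Set swappiness (live + persisted)
  ansible.posix.sysctl:
    name: vm.swappiness
    value: "10"
    state: present

- name: Mount the virtiofs share via fstab
  ansible.posix.mount:
    src: swarm
    path: /mnt/swarm
    fstype: virtiofs
    state: mounted

- name: Enable lingering for the openclaw user
  ansible.builtin.command: loginctl enable-linger openclaw
  changed_when: false
```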