# swarm

This directory is the source of truth for the OpenClaw VM infrastructure. It is shared into the `zap` VM via virtiofs (mounted at `/mnt/swarm` inside the guest, active after reboot).

## Directory Structure

```
swarm/
├── ansible/                      # VM provisioning and configuration
│   ├── inventory.yml             # Host definitions
│   ├── host_vars/
│   │   └── zap.yml               # All zap-specific variables
│   ├── playbooks/
│   │   ├── provision-vm.yml      # Create the VM on the hypervisor
│   │   ├── install.yml           # Install OpenClaw on the guest
│   │   └── customize.yml         # Post-provision tweaks
│   └── roles/
│       ├── openclaw/             # Upstream role (from openclaw-ansible)
│       └── vm/                   # VM provisioning role (local)
├── openclaw/                     # Live mirror of guest ~/.openclaw/
├── docker-compose.yaml           # LiteLLM + supporting services
├── litellm-config.yaml           # LiteLLM static config
├── litellm-init-credentials.sh   # Register API keys into LiteLLM DB
├── litellm-init-models.sh        # Register models into LiteLLM DB (idempotent)
├── litellm-dedup.sh              # Remove duplicate model DB entries
├── litellm-health-check.sh       # Liveness check + auto-dedup (run by systemd timer)
├── backup-openclaw-vm.sh         # Sync openclaw/ + upload to MinIO
├── restore-openclaw-vm.sh        # Full VM redeploy from scratch
└── README.md                     # This file
```
## VM: zap

| Property | Value |
|----------|-------|
| Libvirt domain | `zap [claw]` |
| Guest hostname | `zap` |
| IP | `192.168.122.182` (static DHCP) |
| MAC | `52:54:00:01:00:71` |
| RAM | 3 GiB |
| vCPUs | 2 |
| Disk | `/var/lib/libvirt/images/claw.qcow2` (60 GiB qcow2) |
| OS | Ubuntu 24.04 |
| Firmware | EFI + Secure Boot + TPM 2.0 |
| Autostart | enabled |
| virtiofs | `~/lab/swarm` → `/mnt/swarm` (active after reboot) |
| Swappiness | 10 |

SSH access:

```bash
ssh root@192.168.122.182       # privileged operations
ssh openclaw@192.168.122.182   # application-level access
```
## Provisioning a New VM

Use this when deploying zap from scratch on a fresh hypervisor, or when creating a new instance.

### Step 1 — Create the VM

```bash
cd ~/lab/swarm/ansible
ansible-playbook -i inventory.yml playbooks/provision-vm.yml --limit zap
```

This will:

- Download the Ubuntu 24.04 cloud image (cached at `/var/lib/libvirt/images/`)
- Create the disk image via copy-on-write (`claw.qcow2`, 60 GiB)
- Build a cloud-init seed ISO with your SSH key and hostname
- Define the VM XML (EFI, memfd shared memory, virtiofs, TPM, watchdog)
- Add a static DHCP reservation for the MAC/IP pair
- Enable autostart and start the VM
- Wait for SSH to become available

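The seed ISO boils down to a small cloud-init `user-data` document. A sketch of its shape, for orientation only — the real file is templated by the `vm` role, and the key value here is illustrative:

```yaml
#cloud-config
# Sketch of the user-data the seed ISO carries (not the role's template).
hostname: zap
ssh_authorized_keys:
  - ssh-ed25519 AAAA... will@host   # your public key, injected by the role
```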
### Step 2 — Install OpenClaw

```bash
ansible-playbook -i inventory.yml playbooks/install.yml --limit zap
```

Installs Node.js, pnpm, Docker, UFW, fail2ban, Tailscale, and OpenClaw via the upstream `openclaw-ansible` role.

### Step 3 — Apply customizations
|
|
|
|
```bash
|
|
ansible-playbook -i inventory.yml playbooks/customize.yml --limit zap
|
|
```
|
|
|
|
Applies settings not covered by the upstream role:
|
|
- `vm.swappiness=10` (live + persisted)
|
|
- virtiofs fstab entry (`swarm` → `/mnt/swarm`)
|
|
- `loginctl enable-linger openclaw` (for user systemd services)
|
|
|
|
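The virtiofs fstab entry pairs the mount tag with its mountpoint. Inside the guest it looks like this (mount options shown are the stock defaults, an assumption):

```
swarm  /mnt/swarm  virtiofs  defaults  0  0
```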
### Step 4 — Restore config

```bash
~/lab/swarm/restore-openclaw-vm.sh zap
```

Rsyncs `openclaw/` back to `~/.openclaw/` on the guest and restarts the gateway service.

### All-in-one redeploy

```bash
# Existing VM (just re-provision the guest)
~/lab/swarm/restore-openclaw-vm.sh zap

# Fresh VM at a new IP
~/lab/swarm/restore-openclaw-vm.sh zap <new-ip>
```

When a target IP is passed, `restore-openclaw-vm.sh` runs all four steps above in sequence.

## Backup

The `openclaw/` directory is a live rsync mirror of the guest's `~/.openclaw/`, automatically updated daily at 03:00 by a systemd user timer.

```bash
# Run manually
~/lab/swarm/backup-openclaw-vm.sh zap

# Check timer status
systemctl --user status openclaw-backup.timer
systemctl --user list-timers openclaw-backup.timer
```

### What is backed up

| Included | Excluded |
|----------|----------|
| `openclaw.json` (main config) | `workspace/` (2.6 GiB conversation history) |
| `secrets.json` (API keys) | `logs/` |
| `credentials/`, `identity/` | `extensions-quarantine/` |
| `memory/`, `agents/` | `*.bak*`, `*.backup-*`, `*.pre-*`, `*.failed` |
| `hooks/`, `cron/`, `telegram/` | |
| `workspace-*/` (provider workspaces) | |

### MinIO

Timestamped archives are uploaded to MinIO on every backup run:

| Property | Value |
|----------|-------|
| Endpoint | `http://192.168.153.253:9000` |
| Bucket | `s3://zap/backups/` |
| Retention | 7 most recent archives |
| Credentials | `~/.aws/credentials` (default profile) |

To list available archives:

```bash
aws s3 ls s3://zap/backups/
```

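The 7-archive retention amounts to listing, sorting, and deleting everything past the newest seven. A sketch of the selection step only (archive names are illustrative; since they are timestamped, lexical sort is chronological sort, and the real script would issue `aws s3 rm` for each printed line):

```shell
# Given one archive name per line, print the names that fall
# outside the newest 7 (i.e. the deletion candidates).
prune_candidates() {
  sort -r | tail -n +8
}

printf '%s\n' \
  openclaw-20240101.tar.gz openclaw-20240102.tar.gz \
  openclaw-20240103.tar.gz openclaw-20240104.tar.gz \
  openclaw-20240105.tar.gz openclaw-20240106.tar.gz \
  openclaw-20240107.tar.gz openclaw-20240108.tar.gz \
  openclaw-20240109.tar.gz | prune_candidates
# → the two oldest archives (20240102 and 20240101, newest-first order)
```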
## LiteLLM

LiteLLM runs as a Docker service (`litellm`, port 18804) backed by a Postgres database (`litellm-db`). It acts as a unified OpenAI-compatible proxy over Anthropic, OpenAI, Gemini, ZAI/GLM, and GitHub Copilot.

### Starting

```bash
cd ~/lab/swarm
docker compose --profile api up -d
```

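For orientation, the shape of the two compose services — a sketch, not the actual `docker-compose.yaml`; the image tags, internal port, and volume name are assumptions:

```yaml
services:
  litellm:
    image: ghcr.io/berriai/litellm:main-stable   # illustrative tag
    profiles: [api]
    ports: ["18804:4000"]      # host 18804 → LiteLLM's default port
    env_file: .env             # LITELLM_MASTER_KEY + provider API keys
    depends_on: [litellm-db]
  litellm-db:
    image: postgres:16
    profiles: [api]
    volumes:
      - litellm-db-data:/var/lib/postgresql/data
volumes:
  litellm-db-data:
```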
### Credentials and model registration

On first start, `litellm-init` registers API credentials and all models into the DB. It is idempotent — re-running it when models already exist is a no-op (guarded by a `gpt-4o` sentinel check). To force a re-run (e.g. to register models newly added to the script):

```bash
docker compose --profile api run --rm \
  -e FORCE=1 litellm-init
```

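The idempotency guard boils down to a sentinel check that `FORCE=1` bypasses. A sketch of the pattern — `model_exists` here is a stub standing in for the real DB/API lookup inside `litellm-init-models.sh`:

```shell
# Stub: pretend the sentinel model is already registered.
model_exists() { [ "$1" = "gpt-4o" ]; }

# Guard run before registration: skip unless FORCE=1
# or the sentinel model is missing from the DB.
should_register() {
  if [ "${FORCE:-0}" != "1" ] && model_exists gpt-4o; then
    echo "models already registered; skipping (set FORCE=1 to re-run)"
    return 1
  fi
}

should_register || true                          # prints the skip message
FORCE=1 should_register && echo "re-registering" # guard bypassed
```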
### Adding a new model

1. Add an `add_model` (or `add_copilot_model`) call to `litellm-init-models.sh`
2. Register it live via the API (no restart needed):

```bash
source .env
curl -X POST http://localhost:18804/model/new \
  -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model_name":"<name>","litellm_params":{"model":"<provider>/<model>","api_key":"os.environ/<KEY_VAR>"}}'
```

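Hand-writing that JSON is error-prone. A small helper that builds the payload — a sketch only; the real `add_model` helper in `litellm-init-models.sh` may look different, and the model name in the example is illustrative:

```shell
# Build the /model/new payload. api_key uses LiteLLM's
# "os.environ/<VAR>" indirection so the key is read from the
# proxy's environment at request time, never stored in the DB.
new_model_payload() {
  local name=$1 model=$2 key_var=$3
  printf '{"model_name":"%s","litellm_params":{"model":"%s","api_key":"os.environ/%s"}}' \
    "$name" "$model" "$key_var"
}

new_model_payload sonnet anthropic/claude-sonnet ANTHROPIC_API_KEY
```

The result can be passed straight to the `curl -d` flag shown above.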
### Maintenance scripts

| Script | Purpose |
|--------|---------|
| `litellm-dedup.sh` | Remove duplicate model DB entries (run with `--dry-run` to preview) |
| `litellm-health-check.sh` | Liveness check + auto-dedup; run by systemd timer |

```bash
# Manual dedup
./litellm-dedup.sh

# Manual health check
./litellm-health-check.sh

# Check maintenance log
tail -f litellm-maintenance.log
```

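Deduplication amounts to keeping one DB row per model name and dropping the rest. A sketch of the selection step as pure text processing over `name<TAB>id` lines (the real script then deletes the surplus ids from the LiteLLM DB; ids here are illustrative):

```shell
# Print the id of every entry after the first occurrence of
# each model name — i.e. the rows a dedup pass would delete.
duplicate_ids() {
  awk -F'\t' 'seen[$1]++ { print $2 }'
}

printf 'gpt-4o\tid-1\ngpt-4o\tid-2\nsonnet\tid-3\ngpt-4o\tid-4\n' | duplicate_ids
# → id-2 and id-4 (the first gpt-4o row, id-1, is kept)
```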
### Systemd timer

`litellm-health-check.timer` runs every 6 hours (user session, enabled at install). It checks liveness (restarting the container if unresponsive) and removes any duplicate model entries.

```bash
systemctl --user status litellm-health-check.timer
systemctl --user list-timers litellm-health-check.timer
journalctl --user -u litellm-health-check.service -n 20
```

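The unit pair follows the standard oneshot service + timer pattern. A sketch of what the installed units might look like — the exact directives and paths are assumptions, not a dump of the real files:

```ini
# ~/.config/systemd/user/litellm-health-check.timer
[Unit]
Description=Periodic LiteLLM health check

[Timer]
OnBootSec=10min
OnUnitActiveSec=6h
Persistent=true

[Install]
WantedBy=timers.target

# ~/.config/systemd/user/litellm-health-check.service
[Unit]
Description=LiteLLM liveness check and dedup

[Service]
Type=oneshot
WorkingDirectory=%h/lab/swarm
ExecStart=%h/lab/swarm/litellm-health-check.sh
```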
### Troubleshooting

**Model returns 429 "No deployments available"**

All deployments for that model group are in cooldown (usually after a transient upstream error). Restart the `litellm` container to clear it:

```bash
docker restart litellm
```

**Model returns an upstream subscription error**

The API key in use does not have access to that model. Check the provider's plan. The model will stay in cooldown until the container is restarted; consider removing it from the DB if access is not expected.

**Duplicate model entries**

Caused by running `litellm-init` multiple times. Run `./litellm-dedup.sh` to clean up. The health-check timer also auto-deduplicates when `DEDUP=1` (the default).

## Adding a New Instance

1. Add an entry to `ansible/inventory.yml`
2. Create `ansible/host_vars/<name>.yml` with VM and OpenClaw variables (copy `host_vars/zap.yml` as a template)
3. Run the four provisioning steps above
4. Add the instance to `~/.claude/state/openclaw-instances.json`
5. Add a backup timer: copy `~/.config/systemd/user/openclaw-backup.{service,timer}`, update the instance name, and reload

## Ansible Role Reference

### `vm` role (`roles/vm/`)

Provisions the KVM/libvirt VM on the hypervisor host. Variables (set in `host_vars`):

| Variable | Description | Example |
|----------|-------------|---------|
| `vm_domain` | Libvirt domain name | `"zap [claw]"` |
| `vm_hostname` | Guest hostname | `zap` |
| `vm_memory_mib` | RAM in MiB | `3072` |
| `vm_vcpus` | vCPU count | `2` |
| `vm_disk_path` | qcow2 path on host | `/var/lib/libvirt/images/claw.qcow2` |
| `vm_disk_size` | Disk size | `60G` |
| `vm_mac` | Network MAC address | `52:54:00:01:00:71` |
| `vm_ip` | Static DHCP IP | `192.168.122.182` |
| `vm_virtiofs_source` | Host path to share | `/home/will/lab/swarm` |
| `vm_virtiofs_tag` | Mount tag in guest | `swarm` |

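Put together, the VM half of a `host_vars/zap.yml` looks like this (values taken from the example column above; the real file also carries the OpenClaw variables):

```yaml
vm_domain: "zap [claw]"
vm_hostname: zap
vm_memory_mib: 3072
vm_vcpus: 2
vm_disk_path: /var/lib/libvirt/images/claw.qcow2
vm_disk_size: 60G
vm_mac: "52:54:00:01:00:71"
vm_ip: 192.168.122.182
vm_virtiofs_source: /home/will/lab/swarm
vm_virtiofs_tag: swarm
```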
### `openclaw` role (`roles/openclaw/`)

Upstream role from [openclaw-ansible](https://github.com/openclaw/openclaw-ansible). Installs and configures OpenClaw on the guest. Key variables:

| Variable | Value |
|----------|-------|
| `openclaw_install_mode` | `release` |
| `openclaw_ssh_keys` | will's public key |

### `customize.yml` playbook

Post-provision tweaks applied after the upstream role:

- `vm.swappiness = 10`
- `/etc/fstab` entry for the virtiofs `swarm` share
- `loginctl enable-linger openclaw`