Document LiteLLM setup, model registration, and maintenance

Add LiteLLM section to README covering: service startup, credential and model registration (including FORCE=1 for re-runs), adding new models via API, maintenance scripts, systemd timer, and a troubleshooting guide for the 429/cooldown and duplicate-entry failure modes encountered in practice.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-12 13:33:22 -07:00
parent c94bbe5de8
commit 727069e16d
1 changed files with 81 additions and 0 deletions
@@ -18,6 +18,12 @@ swarm/
 │       ├── openclaw/           # Upstream role (from openclaw-ansible)
 │       └── vm/                 # VM provisioning role (local)
 ├── openclaw/                   # Live mirror of guest ~/.openclaw/
 ├── docker-compose.yaml         # LiteLLM + supporting services
 ├── litellm-config.yaml         # LiteLLM static config
 ├── litellm-init-credentials.sh # Register API keys into LiteLLM DB
 ├── litellm-init-models.sh      # Register models into LiteLLM DB (idempotent)
 ├── litellm-dedup.sh            # Remove duplicate model DB entries
 ├── litellm-health-check.sh     # Liveness check + auto-dedup (run by systemd timer)
 ├── backup-openclaw-vm.sh       # Sync openclaw/ + upload to MinIO
 ├── restore-openclaw-vm.sh      # Full VM redeploy from scratch
 └── README.md                   # This file
@@ -147,6 +153,81 @@ To list available archives:
 aws s3 ls s3://zap/backups/
 ```
 ## LiteLLM
 LiteLLM runs as a Docker service (`litellm`, port 18804) backed by a Postgres database (`litellm-db`). It acts as a unified OpenAI-compatible proxy over Anthropic, OpenAI, Gemini, ZAI/GLM, and GitHub Copilot.
 ### Starting
 ```bash
 cd ~/lab/swarm
 docker compose --profile api up -d
 ```
 ### Credentials and model registration
 On first start, `litellm-init` registers API credentials and all models into the DB. It is idempotent: re-running it when models already exist is a no-op (guarded by a `gpt-4o` sentinel check). To force a re-run (e.g. to register models that were newly added to the script):
 ```bash
 docker compose --profile api run --rm \
  -e FORCE=1 litellm-init
 ```
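 The sentinel guard described above could look roughly like the following. This is only a sketch, not the contents of the init scripts: the function name `should_register` and the convention of passing existing model names on stdin are assumptions made for illustration.

 ```bash
 # Hypothetical sketch of a sentinel guard; not the actual litellm-init logic.
 # Reads existing model names (one per line) on stdin.
 # Returns 0 when registration should run, 1 when it should be skipped.
 should_register() {
   local sentinel="${1:-gpt-4o}"
   # FORCE=1 bypasses the sentinel check entirely.
   if [ "${FORCE:-0}" = "1" ]; then
     cat >/dev/null   # drain stdin so the caller's pipe closes cleanly
     return 0
   fi
   # -x: match the whole line, -F: treat the sentinel as a fixed string.
   if grep -xF -- "$sentinel" >/dev/null; then
     return 1
   fi
   return 0
 }
 ```

 For example, `printf 'gpt-4o\n' | should_register` returns 1 (skip), while the same call with `FORCE=1` in the environment returns 0.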
 ### Adding a new model
 1. Add an `add_model` (or `add_copilot_model`) call to `litellm-init-models.sh`
 2. Register it live via the API (no restart needed):
 ```bash
 source .env
 curl -X POST http://localhost:18804/model/new \
  -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model_name":"<name>","litellm_params":{"model":"<provider>/<model>","api_key":"os.environ/<KEY_VAR>"}}'
 ```
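 If models are added often, the request above can be wrapped in a small helper. The sketch below is an assumption, not part of the repo: the function names `model_payload` and `register_model` are hypothetical, and the payload builder does no JSON escaping.

 ```bash
 # Hypothetical helpers; the function names are illustrative, not part of the repo.
 # Builds the JSON body for POST /model/new from: name, provider/model, key variable.
 # Arguments are interpolated as-is, so they must not contain quotes or backslashes.
 model_payload() {
   printf '{"model_name":"%s","litellm_params":{"model":"%s","api_key":"os.environ/%s"}}' \
     "$1" "$2" "$3"
 }

 # Posts the payload to the proxy (port 18804, as configured in this README).
 # Expects LITELLM_MASTER_KEY to be set, e.g. via `source .env`.
 register_model() {
   curl -fsS -X POST http://localhost:18804/model/new \
     -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
     -H "Content-Type: application/json" \
     -d "$(model_payload "$1" "$2" "$3")"
 }
 ```

 Usage would be `register_model <name> <provider>/<model> <KEY_VAR>`, with the same placeholders as in the `curl` call above.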
 ### Maintenance scripts
 | Script | Purpose |
 |--------|---------|
 | `litellm-dedup.sh` | Remove duplicate model DB entries (pass `--dry-run` to preview) |
 | `litellm-health-check.sh` | Liveness check + auto-dedup; run by systemd timer |
 ```bash
 # Manual dedup
 ./litellm-dedup.sh
 # Manual health check
 ./litellm-health-check.sh
 # Check maintenance log
 tail -f litellm-maintenance.log
 ```
 ### Systemd timer
 `litellm-health-check.timer` runs every 6 hours (user session, enabled at install). It checks liveness (restarting the container if unresponsive) and removes any duplicate model entries.
 ```bash
 systemctl --user status litellm-health-check.timer
 systemctl --user list-timers litellm-health-check.timer
 journalctl --user -u litellm-health-check.service -n 20
 ```
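 The liveness-and-restart step described above can be pictured as follows. This is a minimal sketch and the real `litellm-health-check.sh` may differ: the `HEALTH_URL` default, the 10-second timeout, and the function names are assumptions.

 ```bash
 # Hypothetical sketch of the liveness check; not the actual health-check script.
 # HEALTH_URL is an assumption: point it at whichever health endpoint the proxy exposes.
 HEALTH_URL="${HEALTH_URL:-http://localhost:18804/health/liveliness}"

 is_alive() {
   # -f: treat HTTP errors as failure; --max-time: do not hang on a stuck container.
   curl -fsS --max-time 10 "$HEALTH_URL" >/dev/null 2>&1
 }

 check_and_restart() {
   if is_alive; then
     echo "litellm: alive"
   else
     echo "litellm: unresponsive, restarting"
     docker restart litellm
   fi
 }
 ```

 Calling `check_and_restart` prints one status line and restarts the container only when the probe fails.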
 ### Troubleshooting
 **Model returns 429 "No deployments available"**
 All deployments for that model group are in cooldown (usually after a transient upstream error). Restart the `litellm` container to clear the cooldown state:
 ```bash
 docker restart litellm
 ```
 **Model returns upstream subscription error**
 The API key in use does not have access to that model. Check the provider's plan. The model will stay in cooldown until `litellm` is restarted; consider removing it from the DB if access is not expected.
 **Duplicate model entries**
 Caused by `litellm-init` registering the same models more than once (typically a forced re-run with `FORCE=1`, since unforced re-runs are a no-op). Run `./litellm-dedup.sh` to clean up. The health-check timer also auto-deduplicates when `DEDUP=1` (default).
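 To illustrate what deduplication involves, the sketch below picks out the entries that repeat an earlier model name. It is not the logic of `litellm-dedup.sh`: the tab-separated input format and the function name `find_duplicate_ids` are assumptions, and it keys on the name only, whereas a real dedup would likely also compare the underlying provider model so that intentional multi-deployment groups are kept.

 ```bash
 # Hypothetical sketch of duplicate detection; not the actual litellm-dedup.sh.
 # Input (stdin): one "<model_name><TAB><entry_id>" line per DB entry.
 # Output: the entry id of every repeat after the first occurrence of a name.
 find_duplicate_ids() {
   # seen[$1]++ is 0 (false) the first time a name appears and non-zero afterwards,
   # so only the second and later entries for a name are printed.
   awk -F'\t' 'seen[$1]++ { print $2 }'
 }
 ```

 The ids printed are the candidates for deletion; the first entry for each name is kept.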
 ## Adding a New Instance
 1. Add an entry to `ansible/inventory.yml`