# Plan: Improve pi50 (Control Plane) Resource Usage

## Problem Summary

pi50 (control plane) is running at **73% CPU / 81% memory** while worker nodes have significant headroom:

- pi3: 7% CPU / 65% memory (but only 800MB RAM - memory constrained)
- pi51: 18% CPU / 64% memory (8GB RAM - plenty of capacity)

**Root cause**: pi50 has **NO control-plane taint**, so the scheduler treats it as a general worker node. It currently runs ~85 pods vs 38 on pi51.

## Current State

| Node | Role | CPUs | Memory | CPU Used | Mem Used | Pods |
|------|------|------|--------|----------|----------|------|
| pi50 | control-plane | 4 | 8GB | 73% | 81% | ~85 |
| pi3 | worker | 4 | 800MB | 7% | 65% | 13 |
| pi51 | worker | 4 | 8GB | 18% | 64% | 38 |

## Recommended Approach

### Option A: Add PreferNoSchedule Taint (Recommended)

Add a soft taint to pi50 that tells the scheduler to prefer other nodes for new workloads, while allowing existing pods to remain.

```bash
kubectl taint nodes pi50 node-role.kubernetes.io/control-plane=:PreferNoSchedule
```

**Pros:**

- Non-disruptive - existing pods continue running
- New pods will prefer pi51/pi3
- Gradual rebalancing as pods are recreated
- Easy to remove if needed

**Cons:**

- Won't immediately reduce load
- Existing pods stay where they are

### Option B: Move Heavy Workloads Immediately

Identify and relocate the heaviest workloads from pi50 to pi51.

**Top CPU consumers on pi50:**

1. ArgoCD application-controller (157m CPU, 364Mi) - should stay (manages cluster)
2. Longhorn instance-manager (139m CPU, 707Mi) - must stay (storage)
3. ai-stack workloads (ollama, litellm, open-webui, etc.)

**Candidates to move to pi51:**

- `ai-stack/ollama` - can run on any node with storage
- `ai-stack/litellm` - stateless, can move
- `ai-stack/open-webui` - can move
- `ai-stack/claude-code`, `codex`, `gemini-cli`, `opencode` - can move
- `minio` - can move (uses PVC)
- `pihole2` - can move

**Method**: Add a `nodeSelector` or `nodeAffinity` to the deployments:

```yaml
spec:
  template:
    spec:
      nodeSelector:
        kubernetes.io/hostname: pi51
```

Or use anti-affinity to avoid pi50:

```yaml
spec:
  template:
    spec:
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            preference:
              matchExpressions:
              - key: node-role.kubernetes.io/control-plane
                operator: DoesNotExist
```
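As a reference for Option B, here is a minimal sketch of confirming the heaviest candidates and applying the anti-affinity snippet above as a patch file. It assumes metrics-server is installed (for `kubectl top`) and uses `ollama` purely as an illustrative target; the patch-file name is arbitrary.

```bash
# Rank ai-stack pods by current CPU usage to confirm the heaviest candidates
# (requires metrics-server; use --sort-by=memory to rank by memory instead)
kubectl top pods -n ai-stack --sort-by=cpu

# Save the preferred anti-affinity from above as a strategic-merge patch file
cat > avoid-control-plane.yaml <<'EOF'
spec:
  template:
    spec:
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            preference:
              matchExpressions:
              - key: node-role.kubernetes.io/control-plane
                operator: DoesNotExist
EOF

# Apply it to a candidate deployment; the resulting rollout recreates the pod
# on a node the scheduler now prefers (pi51, given pi3's memory limit)
kubectl -n ai-stack patch deployment ollama --patch-file avoid-control-plane.yaml
```

Patching the deployment triggers a rolling update, so the pod is recreated and rescheduled without deleting it by hand.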
### Option C: Combined Approach (Best)

1. Add `PreferNoSchedule` taint to pi50 (prevents future imbalance)
2. Immediately move 2-3 of the heaviest moveable workloads to pi51
3. Let remaining workloads naturally migrate over time

## Execution Steps

### Step 1: Add taint to pi50

```bash
kubectl taint nodes pi50 node-role.kubernetes.io/control-plane=:PreferNoSchedule
```

### Step 2: Verify existing workloads are still running

```bash
kubectl get pods -A -o wide --field-selector spec.nodeName=pi50 | grep -v Running
```

### Step 3: Move heavy ai-stack workloads (optional, for immediate relief)

For each deployment to move, patch it with a node selector or anti-affinity:

```bash
kubectl patch deployment -n ai-stack ollama --type=merge -p '{"spec":{"template":{"spec":{"nodeSelector":{"kubernetes.io/hostname":"pi51"}}}}}'
```

Or delete individual pods to trigger rescheduling (once the PreferNoSchedule taint is set):

```bash
kubectl delete pod -n ai-stack <pod-name>
```

### Step 4: Monitor

```bash
kubectl top nodes
```

## Workloads That MUST Stay on pi50

- `kube-system/*` - Core cluster components
- `longhorn-system/csi-*` - Storage controllers
- `longhorn-system/longhorn-driver-deployer` - Storage management
- `local-path-storage/*` - Local storage provisioner

## Expected Outcome

After the changes:

- pi50: ~50-60% CPU, ~65-70% memory (control plane + essential services)
- pi51: ~40-50% CPU, ~70-75% memory (absorbs application workloads)
- New pods prefer pi51 automatically

## Risks

- **Low**: PreferNoSchedule is a soft taint - the scheduler may still place pods on pi50 if other nodes cannot take them
- **Low**: Moving workloads may cause brief service interruption during pod recreation
- **Note**: pi3 cannot absorb much due to its 800MB RAM limit

## Selected Approach: A + B (Combined)

User selected the combined approach:

1. Add `PreferNoSchedule` taint to pi50
2. Move heavy ai-stack workloads to pi51 immediately

## Execution Plan

### Phase 1: Add Taint

```bash
kubectl taint nodes pi50 node-role.kubernetes.io/control-plane=:PreferNoSchedule
```

### Phase 2: Move Heavy Workloads to pi51

Target workloads (heaviest on pi50):

- `ai-stack/ollama`
- `ai-stack/open-webui`
- `ai-stack/litellm`
- `ai-stack/claude-code`
- `ai-stack/codex`
- `ai-stack/gemini-cli`
- `ai-stack/opencode`
- `ai-stack/searxng`
- `minio/minio`

Method: Delete the pods to trigger rescheduling (the taint will push them to pi51):

```bash
kubectl delete pod -n ai-stack -l app.kubernetes.io/name=ollama
# etc. for each workload
```

### Phase 3: Verify

```bash
kubectl top nodes
kubectl get pods -A -o wide | grep -E "ollama|open-webui|litellm"
```
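If the rebalancing needs to be undone, a rollback sketch follows, assuming only the taint and the Step 3 nodeSelector patch were applied; `ollama` is again just an illustrative deployment name.

```bash
# Remove the PreferNoSchedule taint from pi50 (the trailing '-' deletes a taint)
kubectl taint nodes pi50 node-role.kubernetes.io/control-plane:PreferNoSchedule-

# Drop the nodeSelector pin added in Step 3, if it was used
# (the patch fails with an error if the field was never set, leaving the deployment unchanged)
kubectl -n ai-stack patch deployment ollama --type=json \
  -p '[{"op":"remove","path":"/spec/template/spec/nodeSelector"}]'
```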