# Plan: Improve pi50 (Control Plane) Resource Usage

## Problem Summary
pi50 (control plane) is running at 73% CPU / 81% memory while worker nodes have significant headroom:
- pi3: 7% CPU / 65% memory (but only 800MB RAM - memory constrained)
- pi51: 18% CPU / 64% memory (8GB RAM - plenty of capacity)
Root cause: pi50 has NO control-plane taint, so the scheduler treats it as a general worker node. It currently runs ~85 pods vs 38 on pi51.
## Current State
| Node | Role | CPUs | Memory | CPU Used | Mem Used | Pods |
|---|---|---|---|---|---|---|
| pi50 | control-plane | 4 | 8GB | 73% | 81% | ~85 |
| pi3 | worker | 4 | 800MB | 7% | 65% | 13 |
| pi51 | worker | 4 | 8GB | 18% | 64% | 38 |
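Before changing anything, the root-cause claim and the table above can be confirmed with read-only checks (standard kubectl; the jsonpath one-liner simply counts pods per node):

```bash
# pi50 should currently report "Taints: <none>"
kubectl describe node pi50 | grep -i taints

# Pod count per node (should roughly match the table above)
kubectl get pods -A -o jsonpath='{range .items[*]}{.spec.nodeName}{"\n"}{end}' | sort | uniq -c
```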
## Recommended Approach

### Option A: Add `PreferNoSchedule` Taint (Recommended)
Add a soft taint to pi50 that tells the scheduler to prefer other nodes for new workloads, while allowing existing pods to remain.
```bash
kubectl taint nodes pi50 node-role.kubernetes.io/control-plane=:PreferNoSchedule
```

**Pros:**
- Non-disruptive - existing pods continue running
- New pods will prefer pi51/pi3
- Gradual rebalancing as pods are recreated
- Easy to remove if needed
**Cons:**
- Won't immediately reduce load
- Existing pods stay where they are
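For reference, since "easy to remove" is listed as a pro: the taint can be dropped later with the standard trailing-dash syntax:

```bash
kubectl taint nodes pi50 node-role.kubernetes.io/control-plane:PreferNoSchedule-
```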
### Option B: Move Heavy Workloads Immediately

Identify and relocate the heaviest workloads from pi50 to pi51:

**Top CPU consumers on pi50:**
- ArgoCD application-controller (157m CPU, 364Mi) - should stay (manages cluster)
- Longhorn instance-manager (139m CPU, 707Mi) - must stay (storage)
- ai-stack workloads (ollama, litellm, open-webui, etc.)
**Candidates to move to pi51:**

- `ai-stack/ollama` - can run on any node with storage
- `ai-stack/litellm` - stateless, can move
- `ai-stack/open-webui` - can move
- `ai-stack/claude-code`, `codex`, `gemini-cli`, `opencode` - can move
- `minio` - can move (uses PVC)
- `pihole2` - can move
**Method:** Add `nodeSelector` or `nodeAffinity` to deployments:

```yaml
spec:
  template:
    spec:
      nodeSelector:
        kubernetes.io/hostname: pi51
```
Or use node anti-affinity to avoid pi50:

```yaml
spec:
  template:
    spec:
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            preference:
              matchExpressions:
              - key: node-role.kubernetes.io/control-plane
                operator: DoesNotExist
```
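As a sketch of applying the anti-affinity above in place (using `ai-stack/litellm` from the candidate list as the example; any moveable deployment works the same way), a JSON merge patch avoids editing manifests by hand. Since this modifies the pod template, the deployment rolls its pods automatically:

```bash
kubectl patch deployment litellm -n ai-stack --type=merge -p '{
  "spec": {"template": {"spec": {"affinity": {"nodeAffinity": {
    "preferredDuringSchedulingIgnoredDuringExecution": [
      {"weight": 100,
       "preference": {"matchExpressions": [
         {"key": "node-role.kubernetes.io/control-plane", "operator": "DoesNotExist"}
       ]}}
    ]
  }}}}}
}'
```

Note that a merge patch replaces list fields, so any existing preferred scheduling terms on that deployment would be overwritten rather than appended.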
### Option C: Combined Approach (Best)

- Add `PreferNoSchedule` taint to pi50 (prevents future imbalance)
- Immediately move 2-3 heaviest moveable workloads to pi51
- Let remaining workloads naturally migrate over time
## Execution Steps

### Step 1: Add taint to pi50

```bash
kubectl taint nodes pi50 node-role.kubernetes.io/control-plane=:PreferNoSchedule
```
### Step 2: Verify existing workloads are still running

```bash
kubectl get pods -A -o wide --field-selector spec.nodeName=pi50 | grep -v Running
```
### Step 3: Move heavy ai-stack workloads (optional, for immediate relief)

For each deployment to move, patch it with a nodeSelector or node anti-affinity:
```bash
kubectl patch deployment -n ai-stack ollama --type=merge -p '{"spec":{"template":{"spec":{"nodeSelector":{"kubernetes.io/hostname":"pi51"}}}}}'
```
Or delete pods to trigger rescheduling (with the `PreferNoSchedule` taint in place, replacements will prefer other nodes):

```bash
kubectl delete pod -n ai-stack <pod-name>
```
### Step 4: Monitor

```bash
kubectl top nodes
```
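To track where the load actually lands, two more read-only checks can help (`kubectl top` requires metrics-server, which the command above already assumes):

```bash
# Heaviest pods cluster-wide, sorted by CPU
kubectl top pods -A --sort-by=cpu | head -20

# How many pods remain scheduled on pi50
kubectl get pods -A --field-selector spec.nodeName=pi50 --no-headers | wc -l
```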
## Workloads That MUST Stay on pi50

- `kube-system/*` - Core cluster components
- `longhorn-system/csi-*` - Storage controllers
- `longhorn-system/longhorn-driver-deployer` - Storage management
- `local-path-storage/*` - Local storage provisioner
## Expected Outcome
After changes:
- pi50: ~50-60% CPU, ~65-70% memory (control plane + essential services)
- pi51: ~40-50% CPU, ~70-75% memory (absorbs application workloads)
- New pods prefer pi51 automatically
## Risks

- **Low:** `PreferNoSchedule` is a soft taint - pods with tolerations can still schedule on pi50
- **Low:** Moving workloads may cause brief service interruption during pod recreation
- **Note:** pi3 cannot absorb much due to its 800MB RAM limit
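To quantify that last point, pi3's allocatable memory and current requests can be read straight from the node object (standard kubectl, no assumptions beyond the node name):

```bash
kubectl get node pi3 -o jsonpath='{.status.allocatable.memory}{"\n"}'
kubectl describe node pi3 | grep -A 8 "Allocated resources"
```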
## Selected Approach: A + B (Combined)
User selected combined approach:
- Add `PreferNoSchedule` taint to pi50
- Move heavy ai-stack workloads to pi51 immediately
## Execution Plan

### Phase 1: Add Taint

```bash
kubectl taint nodes pi50 node-role.kubernetes.io/control-plane=:PreferNoSchedule
```
### Phase 2: Move Heavy Workloads to pi51

**Target workloads (heaviest on pi50):**

- `ai-stack/ollama`
- `ai-stack/open-webui`
- `ai-stack/litellm`
- `ai-stack/claude-code`
- `ai-stack/codex`
- `ai-stack/gemini-cli`
- `ai-stack/opencode`
- `ai-stack/searxng`
- `minio/minio`
**Method:** Delete the pods to trigger rescheduling (the `PreferNoSchedule` taint will steer the replacements toward pi51):

```bash
kubectl delete pod -n ai-stack -l app.kubernetes.io/name=ollama
# etc. for each workload
```
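A minimal loop version of the same step, assuming every target deployment labels its pods with `app.kubernetes.io/name=<workload>` as ollama does above (check with `kubectl get pods -n ai-stack --show-labels` first, since some charts use different labels):

```bash
# Assumes the app.kubernetes.io/name label convention holds for all of these
for app in ollama open-webui litellm claude-code codex gemini-cli opencode searxng; do
  kubectl delete pod -n ai-stack -l app.kubernetes.io/name=$app
done

# minio/minio lives in its own namespace (same label assumption)
kubectl delete pod -n minio -l app.kubernetes.io/name=minio
```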
### Phase 3: Verify

```bash
kubectl top nodes
kubectl get pods -A -o wide | grep -E "ollama|open-webui|litellm"
```
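For a cleaner view of where every ai-stack pod ended up, custom-columns over `spec.nodeName` avoids grepping the wide output (the same pattern works for the `minio` namespace):

```bash
kubectl get pods -n ai-stack -o custom-columns=NAME:.metadata.name,NODE:.spec.nodeName
```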