Plan: Improve pi50 (Control Plane) Resource Usage

Problem Summary

pi50 (control plane) is running at 73% CPU / 81% memory while worker nodes have significant headroom:

  • pi3: 7% CPU / 65% memory (but only 800MB RAM - memory constrained)
  • pi51: 18% CPU / 64% memory (8GB RAM - plenty of capacity)

Root cause: pi50 has NO control-plane taint, so the scheduler treats it as a general worker node. It currently runs ~85 pods vs 38 on pi51.
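
To confirm the imbalance before changing anything (a quick check, assuming the default kubeconfig points at this cluster):

# pi50 should currently show "Taints: <none>"
kubectl describe node pi50 | grep -i taints

# Pod count per node
kubectl get pods -A -o custom-columns=NODE:.spec.nodeName --no-headers | sort | uniq -c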

Current State

Node   Role            CPUs   Memory   CPU Used   Mem Used   Pods
pi50   control-plane   4      8GB      73%        81%        ~85
pi3    worker          4      800MB    7%         65%        13
pi51   worker          4      8GB      18%        64%        38

Option A: Add a PreferNoSchedule Taint

Add a soft taint to pi50 that tells the scheduler to prefer other nodes for new workloads, while allowing existing pods to remain.

kubectl taint nodes pi50 node-role.kubernetes.io/control-plane=:PreferNoSchedule

Pros:

  • Non-disruptive - existing pods continue running
  • New pods will prefer pi51/pi3
  • Gradual rebalancing as pods are recreated
  • Easy to remove if needed (removal command after the cons list)

Cons:

  • Won't immediately reduce load
  • Existing pods stay where they are
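
If the taint ever needs to be rolled back, appending a trailing dash removes it:

kubectl taint nodes pi50 node-role.kubernetes.io/control-plane=:PreferNoSchedule-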

Option B: Move Heavy Workloads Immediately

Identify and relocate the heaviest workloads from pi50 to pi51:

Top CPU consumers on pi50 (see the command sketch after this list to reproduce the ranking):

  1. ArgoCD application-controller (157m CPU, 364Mi) - should stay (manages cluster)
  2. Longhorn instance-manager (139m CPU, 707Mi) - must stay (storage)
  3. ai-stack workloads (ollama, litellm, open-webui, etc.)
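
To reproduce this ranking (a rough sketch, assuming metrics-server is installed and pod names are distinctive enough for a plain-text match):

# Names of pods currently scheduled on pi50
kubectl get pods -A -o wide --field-selector spec.nodeName=pi50 --no-headers | awk '{print $2}' > /tmp/pi50-pods.txt

# Cluster-wide per-pod usage, filtered to pi50 and sorted by CPU
kubectl top pods -A --no-headers --sort-by=cpu | grep -F -f /tmp/pi50-pods.txt | head -20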

Candidates to move to pi51:

  • ai-stack/ollama - can run on any node with storage
  • ai-stack/litellm - stateless, can move
  • ai-stack/open-webui - can move
  • ai-stack/claude-code, codex, gemini-cli, opencode - can move
  • minio - can move, provided its PVC is on replicated storage (e.g. Longhorn) rather than node-bound local-path; see the storage-class check after this list
  • pihole2 - can move
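
Before moving minio (or any other stateful workload), check which storage class backs its PVC; Longhorn volumes are replicated and can follow the pod, while local-path volumes are pinned to the node they were created on:

kubectl get pvc -A -o custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name,STORAGECLASS:.spec.storageClassName,STATUS:.status.phase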

Method: Add nodeSelector or nodeAffinity to deployments:

spec:
  template:
    spec:
      nodeSelector:
        kubernetes.io/hostname: pi51

Or use anti-affinity to avoid pi50:

spec:
  template:
    spec:
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            preference:
              matchExpressions:
              - key: node-role.kubernetes.io/control-plane
                operator: DoesNotExist

Option C: Combined Approach (Best)

  1. Add PreferNoSchedule taint to pi50 (prevents future imbalance)
  2. Immediately move 2-3 heaviest moveable workloads to pi51
  3. Let remaining workloads naturally migrate over time

Execution Steps

Step 1: Add taint to pi50

kubectl taint nodes pi50 node-role.kubernetes.io/control-plane=:PreferNoSchedule
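
Confirm the taint is present:

kubectl get node pi50 -o jsonpath='{.spec.taints}{"\n"}'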

Step 2: Verify existing workloads still running

kubectl get pods -A -o wide --field-selector spec.nodeName=pi50 | grep -v Running

Step 3: Move heavy ai-stack workloads (optional, for immediate relief)

For each deployment to move, patch with node anti-affinity or selector:

kubectl patch deployment -n ai-stack ollama --type=merge -p '{"spec":{"template":{"spec":{"nodeSelector":{"kubernetes.io/hostname":"pi51"}}}}}'
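
Note: if these deployments are managed by ArgoCD (which runs on this cluster), a live kubectl patch may be reverted on the next sync when auto-sync/self-heal is enabled; in that case the nodeSelector or affinity should be added to the source manifests instead.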

Or, once the PreferNoSchedule taint is set, delete pods so the scheduler re-places them (it should prefer pi51/pi3, though the soft taint does not guarantee it):

kubectl delete pod -n ai-stack <pod-name>

Step 4: Monitor

kubectl top nodes
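
Per-node pod counts show whether the rebalancing is actually taking effect:

kubectl get pods -A -o wide --field-selector spec.nodeName=pi50 --no-headers | wc -l
kubectl get pods -A -o wide --field-selector spec.nodeName=pi51 --no-headers | wc -l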

Workloads That MUST Stay on pi50

  • kube-system/* - Core cluster components
  • longhorn-system/csi-* - Storage controllers
  • longhorn-system/longhorn-driver-deployer - Storage management
  • local-path-storage/* - Local storage provisioner

Expected Outcome

After changes:

  • pi50: ~50-60% CPU, ~65-70% memory (control plane + essential services)
  • pi51: ~40-50% CPU, ~70-75% memory (absorbs application workloads)
  • New pods prefer pi51 automatically

Risks

  • Low: PreferNoSchedule is a soft taint - pods with tolerations can still schedule on pi50
  • Low: Moving workloads may cause brief service interruption during pod recreation
  • Note: pi3 cannot absorb much due to 800MB RAM limit

Selected Approach: A + B (Combined)

User selected combined approach:

  1. Add PreferNoSchedule taint to pi50
  2. Move heavy ai-stack workloads to pi51 immediately

Execution Plan

Phase 1: Add Taint

kubectl taint nodes pi50 node-role.kubernetes.io/control-plane=:PreferNoSchedule

Phase 2: Move Heavy Workloads to pi51

Target workloads (heaviest on pi50):

  • ai-stack/ollama
  • ai-stack/open-webui
  • ai-stack/litellm
  • ai-stack/claude-code
  • ai-stack/codex
  • ai-stack/gemini-cli
  • ai-stack/opencode
  • ai-stack/searxng
  • minio/minio

Method: Delete pods so the scheduler re-places them; with the soft taint set they should land on pi51 (a loop over all targets is sketched below):

kubectl delete pod -n ai-stack -l app.kubernetes.io/name=ollama
# etc for each workload
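
A loop over all target workloads (a sketch only; it assumes each chart labels its pods app.kubernetes.io/name=<workload>, which should be verified first with kubectl get pods -n ai-stack --show-labels):

# Label selector is an assumption; confirm with --show-labels before running
for app in ollama open-webui litellm claude-code codex gemini-cli opencode searxng; do
  kubectl delete pod -n ai-stack -l app.kubernetes.io/name="$app"
done
# minio runs in its own namespace
kubectl delete pod -n minio -l app.kubernetes.io/name=minio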

Phase 3: Verify

kubectl top nodes
kubectl get pods -A -o wide | grep -E "ollama|open-webui|litellm"
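
And confirm nothing is left unscheduled after the moves:

kubectl get pods -A --field-selector status.phase=Pending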