# Plan: Improve pi50 (Control Plane) Resource Usage

## Problem Summary

pi50 (control plane) is running at **73% CPU / 81% memory** while worker nodes have significant headroom:

- pi3: 7% CPU / 65% memory (but only 800MB RAM - memory constrained)
- pi51: 18% CPU / 64% memory (8GB RAM - plenty of capacity)

**Root cause**: pi50 has **NO control-plane taint**, so the scheduler treats it as a general worker node. It currently runs ~85 pods vs 38 on pi51.

## Current State

| Node | Role | CPUs | Memory | CPU Used | Mem Used | Pods |
|------|------|------|--------|----------|----------|------|
| pi50 | control-plane | 4 | 8GB | 73% | 81% | ~85 |
| pi3 | worker | 4 | 800MB | 7% | 65% | 13 |
| pi51 | worker | 4 | 8GB | 18% | 64% | 38 |

## Recommended Approach

### Option A: Add PreferNoSchedule Taint (Recommended)

Add a soft taint to pi50 that tells the scheduler to prefer other nodes for new workloads, while allowing existing pods to remain.

```bash
kubectl taint nodes pi50 node-role.kubernetes.io/control-plane=:PreferNoSchedule
```

**Pros:**

- Non-disruptive - existing pods continue running
- New pods will prefer pi51/pi3
- Gradual rebalancing as pods are recreated
- Easy to remove if needed

**Cons:**

- Won't immediately reduce load
- Existing pods stay where they are

### Option B: Move Heavy Workloads Immediately

Identify and relocate the heaviest workloads from pi50 to pi51.

**Top CPU consumers on pi50:**

1. ArgoCD application-controller (157m CPU, 364Mi) - should stay (manages cluster)
2. Longhorn instance-manager (139m CPU, 707Mi) - must stay (storage)
3. ai-stack workloads (ollama, litellm, open-webui, etc.)

**Candidates to move to pi51:**

- `ai-stack/ollama` - can run on any node with storage
- `ai-stack/litellm` - stateless, can move
- `ai-stack/open-webui` - can move
- `ai-stack/claude-code`, `codex`, `gemini-cli`, `opencode` - can move
- `minio` - can move (uses PVC)
- `pihole2` - can move

**Method**: Add a `nodeSelector` or `nodeAffinity` to the deployments:

```yaml
spec:
  template:
    spec:
      nodeSelector:
        kubernetes.io/hostname: pi51
```

Or use anti-affinity to avoid pi50:

```yaml
spec:
  template:
    spec:
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            preference:
              matchExpressions:
              - key: node-role.kubernetes.io/control-plane
                operator: DoesNotExist
```
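As a reference for Option B, here is a minimal sketch of confirming the heaviest candidates and applying the anti-affinity snippet above as a patch file. It assumes metrics-server is installed (for `kubectl top`) and uses `ollama` purely as an illustrative target; the patch-file name is arbitrary.

```bash
# Rank ai-stack pods by current CPU usage to confirm the heaviest candidates
# (requires metrics-server; use --sort-by=memory to rank by memory instead)
kubectl top pods -n ai-stack --sort-by=cpu

# Save the preferred anti-affinity from above as a strategic-merge patch file
cat > avoid-control-plane.yaml <<'EOF'
spec:
  template:
    spec:
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            preference:
              matchExpressions:
              - key: node-role.kubernetes.io/control-plane
                operator: DoesNotExist
EOF

# Apply it to a candidate deployment; the resulting rollout recreates the pod
# on a node the scheduler now prefers (pi51, given pi3's memory limit)
kubectl -n ai-stack patch deployment ollama --patch-file avoid-control-plane.yaml
```

Patching the deployment triggers a rolling update, so the pod is recreated and rescheduled without deleting it by hand.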
### Option C: Combined Approach (Best)

1. Add `PreferNoSchedule` taint to pi50 (prevents future imbalance)
2. Immediately move 2-3 of the heaviest moveable workloads to pi51
3. Let remaining workloads naturally migrate over time

## Execution Steps

### Step 1: Add taint to pi50

```bash
kubectl taint nodes pi50 node-role.kubernetes.io/control-plane=:PreferNoSchedule
```

### Step 2: Verify existing workloads are still running

```bash
kubectl get pods -A -o wide --field-selector spec.nodeName=pi50 | grep -v Running
```

### Step 3: Move heavy ai-stack workloads (optional, for immediate relief)

For each deployment to move, patch it with a node selector or anti-affinity:

```bash
kubectl patch deployment -n ai-stack ollama --type=merge -p '{"spec":{"template":{"spec":{"nodeSelector":{"kubernetes.io/hostname":"pi51"}}}}}'
```

Or delete individual pods to trigger rescheduling (once the PreferNoSchedule taint is set):

```bash
kubectl delete pod -n ai-stack <pod-name>
```

### Step 4: Monitor

```bash
kubectl top nodes
```

## Workloads That MUST Stay on pi50

- `kube-system/*` - Core cluster components
- `longhorn-system/csi-*` - Storage controllers
- `longhorn-system/longhorn-driver-deployer` - Storage management
- `local-path-storage/*` - Local storage provisioner

## Expected Outcome

After the changes:

- pi50: ~50-60% CPU, ~65-70% memory (control plane + essential services)
- pi51: ~40-50% CPU, ~70-75% memory (absorbs application workloads)
- New pods prefer pi51 automatically

## Risks

- **Low**: PreferNoSchedule is a soft taint - the scheduler may still place pods on pi50 if other nodes cannot take them
- **Low**: Moving workloads may cause brief service interruption during pod recreation
- **Note**: pi3 cannot absorb much due to its 800MB RAM limit

## Selected Approach: A + B (Combined)

User selected the combined approach:

1. Add `PreferNoSchedule` taint to pi50
2. Move heavy ai-stack workloads to pi51 immediately

## Execution Plan

### Phase 1: Add Taint

```bash
kubectl taint nodes pi50 node-role.kubernetes.io/control-plane=:PreferNoSchedule
```

### Phase 2: Move Heavy Workloads to pi51

Target workloads (heaviest on pi50):

- `ai-stack/ollama`
- `ai-stack/open-webui`
- `ai-stack/litellm`
- `ai-stack/claude-code`
- `ai-stack/codex`
- `ai-stack/gemini-cli`
- `ai-stack/opencode`
- `ai-stack/searxng`
- `minio/minio`

Method: Delete the pods to trigger rescheduling (the taint will push them to pi51):

```bash
kubectl delete pod -n ai-stack -l app.kubernetes.io/name=ollama
# etc. for each workload
```

### Phase 3: Verify

```bash
kubectl top nodes
kubectl get pods -A -o wide | grep -E "ollama|open-webui|litellm"
```
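If the rebalancing needs to be undone, a rollback sketch follows, assuming only the taint and the Step 3 nodeSelector patch were applied; `ollama` is again just an illustrative deployment name.

```bash
# Remove the PreferNoSchedule taint from pi50 (the trailing '-' deletes a taint)
kubectl taint nodes pi50 node-role.kubernetes.io/control-plane:PreferNoSchedule-

# Drop the nodeSelector pin added in Step 3, if it was used
# (the patch fails with an error if the field was never set, leaving the deployment unchanged)
kubectl -n ai-stack patch deployment ollama --type=json \
  -p '[{"op":"remove","path":"/spec/template/spec/nodeSelector"}]'
```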