Add pi50 resource optimization plan, mark monitoring design complete
- New plan: Improve pi50 control plane resource usage
- Completed: Workstation monitoring design status file

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@@ -0,0 +1,17 @@
{
  "plan": "2026-01-05-workstation-monitoring-design.md",
  "status": "COMPLETE",
  "completed_at": "2026-01-05T14:09:00Z",
  "implementation": {
    "node_exporter": "installed and running (v1.10.2-1)",
    "scrape_config": "deployed (workstation-scrape)",
    "prometheus_rule": "deployed (workstation-alerts, 12 rules)",
    "prometheus_target": "UP and scraping",
    "git_commit": "9d17ac8",
    "network_solution": "Tailscale (100.90.159.78:9100)"
  },
  "verification": {
    "all_success_criteria_met": true,
    "verified_at": "2026-01-05T14:09:19Z"
  }
}
plans/valiant-hugging-dahl.md (new file, 171 lines)
@@ -0,0 +1,171 @@
# Plan: Improve pi50 (Control Plane) Resource Usage

## Problem Summary

pi50 (control plane) is running at **73% CPU / 81% memory** while worker nodes have significant headroom:
- pi3: 7% CPU / 65% memory (but only 800MB RAM - memory constrained)
- pi51: 18% CPU / 64% memory (8GB RAM - plenty of capacity)

**Root cause**: pi50 has **NO control-plane taint**, so the scheduler treats it as a general worker node. It currently runs ~85 pods vs 38 on pi51.
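
A quick way to confirm both points (a sketch, assuming `kubectl` access to the cluster):

```bash
# pi50 should currently report "Taints: <none>"
kubectl describe node pi50 | grep Taints

# Pod count per node (NODE is column 8 of the -o wide output)
kubectl get pods -A -o wide --no-headers | awk '{print $8}' | sort | uniq -c
```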
## Current State

| Node | Role | CPUs | Memory | CPU Used | Mem Used | Pods |
|------|------|------|--------|----------|----------|------|
| pi50 | control-plane | 4 | 8GB | 73% | 81% | ~85 |
| pi3 | worker | 4 | 800MB | 7% | 65% | 13 |
| pi51 | worker | 4 | 8GB | 18% | 64% | 38 |

## Recommended Approach

### Option A: Add PreferNoSchedule Taint (Recommended)

Add a soft taint to pi50 that tells the scheduler to prefer other nodes for new workloads, while allowing existing pods to remain.

```bash
kubectl taint nodes pi50 node-role.kubernetes.io/control-plane=:PreferNoSchedule
```

**Pros:**
- Non-disruptive - existing pods continue running
- New pods will prefer pi51/pi3
- Gradual rebalancing as pods are recreated
- Easy to remove if needed (see the removal command after the cons list)

**Cons:**
- Won't immediately reduce load
- Existing pods stay where they are
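
If the taint needs to be rolled back later, the standard `kubectl taint` removal syntax (trailing `-`) applies:

```bash
# Remove the PreferNoSchedule taint from pi50 (note the trailing "-")
kubectl taint nodes pi50 node-role.kubernetes.io/control-plane:PreferNoSchedule-
```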
### Option B: Move Heavy Workloads Immediately

Identify and relocate the heaviest workloads from pi50 to pi51:

**Top CPU consumers on pi50** (see the commands after this list to re-check the ranking):
1. ArgoCD application-controller (157m CPU, 364Mi) - should stay (manages cluster)
2. Longhorn instance-manager (139m CPU, 707Mi) - must stay (storage)
3. ai-stack workloads (ollama, litellm, open-webui, etc.)
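
A sketch for re-checking this ranking before moving anything (assumes metrics-server is available for `kubectl top`):

```bash
# Heaviest pods cluster-wide, by CPU and by memory
kubectl top pods -A --sort-by=cpu | head -20
kubectl top pods -A --sort-by=memory | head -20

# Cross-reference with what is actually scheduled on pi50
kubectl get pods -A -o wide --field-selector spec.nodeName=pi50
```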
**Candidates to move to pi51:**
- `ai-stack/ollama` - can run on any node with storage
- `ai-stack/litellm` - stateless, can move
- `ai-stack/open-webui` - can move
- `ai-stack/claude-code`, `codex`, `gemini-cli`, `opencode` - can move
- `minio` - can move (uses PVC)
- `pihole2` - can move

**Method**: Add `nodeSelector` or `nodeAffinity` to deployments:
```yaml
spec:
  template:
    spec:
      nodeSelector:
        kubernetes.io/hostname: pi51
```

Or use anti-affinity to avoid pi50:
```yaml
spec:
  template:
    spec:
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            preference:
              matchExpressions:
              - key: node-role.kubernetes.io/control-plane
                operator: DoesNotExist
```
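
One way to apply the anti-affinity snippet above is a patch file (a sketch; `avoid-pi50.yaml` is an illustrative name, and `--patch-file` requires kubectl v1.21+):

```bash
# Write the preferred anti-affinity to a patch file
cat > avoid-pi50.yaml <<'EOF'
spec:
  template:
    spec:
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            preference:
              matchExpressions:
              - key: node-role.kubernetes.io/control-plane
                operator: DoesNotExist
EOF

# Apply it to a deployment, e.g. ollama
kubectl patch deployment -n ai-stack ollama --patch-file avoid-pi50.yaml
```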
### Option C: Combined Approach (Best)

1. Add `PreferNoSchedule` taint to pi50 (prevents future imbalance)
2. Immediately move 2-3 heaviest moveable workloads to pi51
3. Let remaining workloads naturally migrate over time

## Execution Steps

### Step 1: Add taint to pi50
```bash
kubectl taint nodes pi50 node-role.kubernetes.io/control-plane=:PreferNoSchedule
```
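
Optionally, confirm the taint landed:

```bash
# Should print the control-plane taint with effect PreferNoSchedule
kubectl get node pi50 -o jsonpath='{.spec.taints}'
```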
### Step 2: Verify existing workloads still running
```bash
kubectl get pods -A -o wide --field-selector spec.nodeName=pi50 | grep -v Running
```

### Step 3: Move heavy ai-stack workloads (optional, for immediate relief)

For each deployment to move, patch it with a node selector or anti-affinity:
```bash
kubectl patch deployment -n ai-stack ollama --type=merge -p '{"spec":{"template":{"spec":{"nodeSelector":{"kubernetes.io/hostname":"pi51"}}}}}'
```

Or delete pods to trigger rescheduling (if the PreferNoSchedule taint is set):
```bash
kubectl delete pod -n ai-stack <pod-name>
```
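
A gentler alternative to deleting pods by hand is a rolling restart, which recreates pods one at a time (shown for ollama; other deployments follow the same pattern):

```bash
kubectl rollout restart deployment -n ai-stack ollama
```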
### Step 4: Monitor
```bash
kubectl top nodes
```
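
It can also help to track how many pods remain on pi50 as workloads rebalance:

```bash
# Number of pods still scheduled on pi50
kubectl get pods -A -o wide --field-selector spec.nodeName=pi50 --no-headers | wc -l
```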
## Workloads That MUST Stay on pi50

- `kube-system/*` - Core cluster components
- `longhorn-system/csi-*` - Storage controllers
- `longhorn-system/longhorn-driver-deployer` - Storage management
- `local-path-storage/*` - Local storage provisioner

## Expected Outcome

After changes:
- pi50: ~50-60% CPU, ~65-70% memory (control plane + essential services)
- pi51: ~40-50% CPU, ~70-75% memory (absorbs application workloads)
- New pods prefer pi51 automatically

## Risks

- **Low**: PreferNoSchedule is a soft taint - pods with matching tolerations can still schedule on pi50 (see the check after this list)
- **Low**: Moving workloads may cause brief service interruption during pod recreation
- **Note**: pi3 cannot absorb much due to its 800MB RAM limit
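
To see which pods explicitly tolerate the control-plane taint key and could therefore still land on pi50 (a sketch, assuming `jq` is installed; wildcard tolerations with no key are not caught):

```bash
kubectl get pods -A -o json | jq -r '
  .items[]
  | select([.spec.tolerations[]?.key] | index("node-role.kubernetes.io/control-plane"))
  | "\(.metadata.namespace)/\(.metadata.name)"'
```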
## Selected Approach: A + B (Combined)

User selected the combined approach:
1. Add `PreferNoSchedule` taint to pi50
2. Move heavy ai-stack workloads to pi51 immediately

## Execution Plan

### Phase 1: Add Taint
```bash
kubectl taint nodes pi50 node-role.kubernetes.io/control-plane=:PreferNoSchedule
```

### Phase 2: Move Heavy Workloads to pi51

Target workloads (heaviest on pi50):
- `ai-stack/ollama`
- `ai-stack/open-webui`
- `ai-stack/litellm`
- `ai-stack/claude-code`
- `ai-stack/codex`
- `ai-stack/gemini-cli`
- `ai-stack/opencode`
- `ai-stack/searxng`
- `minio/minio`

Method: Delete pods to trigger rescheduling (the taint will push them to pi51):
```bash
kubectl delete pod -n ai-stack -l app.kubernetes.io/name=ollama
# etc for each workload
```
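
A sketch of the "etc" step as a loop, assuming each target carries the same `app.kubernetes.io/name` label convention used above (adjust labels or namespaces that differ):

```bash
# Delete pods for each ai-stack target so they reschedule onto pi51
for app in ollama open-webui litellm claude-code codex gemini-cli opencode searxng; do
  kubectl delete pod -n ai-stack -l app.kubernetes.io/name="$app"
done

# minio lives in its own namespace
kubectl delete pod -n minio -l app.kubernetes.io/name=minio
```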
### Phase 3: Verify
```bash
kubectl top nodes
kubectl get pods -A -o wide | grep -E "ollama|open-webui|litellm"
```
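
Finally, a sanity check that nothing got stuck while rescheduling:

```bash
# Pods that failed to land on any node
kubectl get pods -A --field-selector status.phase=Pending

# Anything else not Running or Completed
kubectl get pods -A | grep -vE "Running|Completed"
```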