diff --git a/docs/plans/.completed/2026-01-05-workstation-monitoring-design.status b/docs/plans/.completed/2026-01-05-workstation-monitoring-design.status
new file mode 100644
index 0000000..f4a6dd6
--- /dev/null
+++ b/docs/plans/.completed/2026-01-05-workstation-monitoring-design.status
@@ -0,0 +1,17 @@
{
  "plan": "2026-01-05-workstation-monitoring-design.md",
  "status": "COMPLETE",
  "completed_at": "2026-01-05T14:09:00Z",
  "implementation": {
    "node_exporter": "installed and running (v1.10.2-1)",
    "scrape_config": "deployed (workstation-scrape)",
    "prometheus_rule": "deployed (workstation-alerts, 12 rules)",
    "prometheus_target": "UP and scraping",
    "git_commit": "9d17ac8",
    "network_solution": "Tailscale (100.90.159.78:9100)"
  },
  "verification": {
    "all_success_criteria_met": true,
    "verified_at": "2026-01-05T14:09:19Z"
  }
}

diff --git a/plans/valiant-hugging-dahl.md b/plans/valiant-hugging-dahl.md
new file mode 100644
index 0000000..19e4ac9
--- /dev/null
+++ b/plans/valiant-hugging-dahl.md
@@ -0,0 +1,171 @@
# Plan: Improve pi50 (Control Plane) Resource Usage

## Problem Summary

pi50 (control plane) is running at **73% CPU / 81% memory** while worker nodes have significant headroom:
- pi3: 7% CPU / 65% memory (but only 800MB RAM - memory constrained)
- pi51: 18% CPU / 64% memory (8GB RAM - plenty of capacity)

**Root cause**: pi50 has **no control-plane taint**, so the scheduler treats it as a general worker node. It currently runs ~85 pods vs 38 on pi51.

## Current State

| Node | Role          | CPUs | Memory | CPU Used | Mem Used | Pods |
|------|---------------|------|--------|----------|----------|------|
| pi50 | control-plane | 4    | 8GB    | 73%      | 81%      | ~85  |
| pi3  | worker        | 4    | 800MB  | 7%       | 65%      | 13   |
| pi51 | worker        | 4    | 8GB    | 18%      | 64%      | 38   |

## Recommended Approach

### Option A: Add PreferNoSchedule Taint (Recommended)

Add a soft taint to pi50 that tells the scheduler to prefer other nodes for new workloads, while allowing existing pods to remain.

```bash
kubectl taint nodes pi50 node-role.kubernetes.io/control-plane=:PreferNoSchedule
```

**Pros:**
- Non-disruptive - existing pods continue running
- New pods will prefer pi51/pi3
- Gradual rebalancing as pods are recreated
- Easy to remove if needed

**Cons:**
- Won't immediately reduce load
- Existing pods stay where they are

### Option B: Move Heavy Workloads Immediately

Identify and relocate the heaviest workloads from pi50 to pi51:

**Top CPU consumers on pi50:**
1. ArgoCD application-controller (157m CPU, 364Mi) - should stay (manages the cluster)
2. Longhorn instance-manager (139m CPU, 707Mi) - must stay (storage)
3. ai-stack workloads (ollama, litellm, open-webui, etc.)

**Candidates to move to pi51:**
- `ai-stack/ollama` - can run on any node with storage
- `ai-stack/litellm` - stateless, can move
- `ai-stack/open-webui` - can move
- `ai-stack/claude-code`, `codex`, `gemini-cli`, `opencode` - can move
- `minio` - can move (uses a PVC)
- `pihole2` - can move

**Method**: Add a `nodeSelector` or `nodeAffinity` to the deployments:
```yaml
spec:
  template:
    spec:
      nodeSelector:
        kubernetes.io/hostname: pi51
```

Or use node anti-affinity to steer pods away from pi50:
```yaml
spec:
  template:
    spec:
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            preference:
              matchExpressions:
              - key: node-role.kubernetes.io/control-plane
                operator: DoesNotExist
```
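If editing manifests directly isn't convenient, the same preference can be applied in place with `kubectl patch`, in the same style as the merge patch shown in Step 3 below. This is a minimal sketch, not part of the plan itself, using `ai-stack/ollama` as the example target:

```bash
# Sketch: apply the preferred node anti-affinity from Option B to one
# deployment (ai-stack/ollama) using a JSON merge patch.
# Changing the pod template triggers a rolling restart, so expect a
# brief pod recreation (see Risks).
kubectl patch deployment -n ai-stack ollama --type=merge -p '{
  "spec": {
    "template": {
      "spec": {
        "affinity": {
          "nodeAffinity": {
            "preferredDuringSchedulingIgnoredDuringExecution": [
              {
                "weight": 100,
                "preference": {
                  "matchExpressions": [
                    {
                      "key": "node-role.kubernetes.io/control-plane",
                      "operator": "DoesNotExist"
                    }
                  ]
                }
              }
            ]
          }
        }
      }
    }
  }
}'
```

Because the rule is `preferredDuringSchedulingIgnoredDuringExecution` with weight 100, scheduling stays soft: if pi51 and pi3 are full, pods can still land on pi50 rather than sit Pending.
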
### Option C: Combined Approach (Best)

1. Add `PreferNoSchedule` taint to pi50 (prevents future imbalance)