Add workstation monitoring design

`docs/plans/2026-01-05-workstation-monitoring-design.md` (new file, 296 lines)
# Workstation Monitoring Design

## Overview

Deploy comprehensive monitoring for the Arch Linux workstation (willlaptop) by integrating with the existing k8s monitoring stack. This enables proactive alerting on resource exhaustion, long-term capacity planning, and performance debugging.

**Reference:** Future consideration `fc-001` (workstation monitoring)

## Current Infrastructure

- **Workstation:** Arch Linux on MacBookPro9,2 (hostname: willlaptop)
- **K8s Cluster:** kube-prometheus-stack deployed with Prometheus, Alertmanager, Grafana
- **Network:** Direct network connectivity between workstation and cluster nodes
- **Existing Monitoring:** 3 node_exporters on cluster nodes, cluster-level alerts configured

## Architecture
### Components

```
┌─────────────────┐      HTTP/9100       ┌──────────────────────┐
│  Workstation    │ ──────────────────>  │   K8s Prometheus     │
│  (willlaptop)   │   scrape every 15s   │   (monitoring ns)    │
│                 │                      │                      │
│  node_exporter  │                      │  workstation rules   │
│  systemd service│                      │  + scrape config     │
└─────────────────┘                      └──────────────────────┘
                                                    │
                                                    v
                                         ┌──────────────────────┐
                                         │    Alertmanager      │
                                         │  (existing setup)    │
                                         │   unified routing    │
                                         └──────────────────────┘
```
### Data Flow

1. **node_exporter** exposes metrics on `http://willlaptop:9100/metrics`
2. **Prometheus** scrapes metrics every 15s via static target configuration
3. **PrometheusRule** evaluates workstation-specific alert rules
4. **Alertmanager** routes alerts to existing notification channels
## Workstation Deployment

### node_exporter Service

**Installation:**

```bash
pacman -S prometheus-node-exporter
```

**Systemd Configuration:**

- Service: `node_exporter.service`
- User: `node_exporter` (created by package)
- Listen address: `0.0.0.0:9100`
- Restart policy: `always` with 10s delay
- Logging: systemd journal (`journalctl -u node_exporter`)

**ExecStart flags:**

```bash
/usr/bin/node_exporter --collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($|/)
```

Excludes system mounts to reduce noise.
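The systemd settings above could be captured in a drop-in override. The path and exact unit layout are assumptions; treat this as a sketch rather than the packaged unit:

```ini
# /etc/systemd/system/node_exporter.service.d/override.conf  (assumed path)
[Service]
Restart=always
RestartSec=10
# Clear the packaged ExecStart, then apply the flags above
ExecStart=
ExecStart=/usr/bin/node_exporter \
    --web.listen-address=0.0.0.0:9100 \
    --collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($|/)
```

Apply with `systemctl daemon-reload && systemctl restart node_exporter`.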
**Firewall Configuration:**

- Allow TCP 9100 from cluster nodes only
- Use ufw or iptables to restrict access
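A possible ufw rule set; the cluster subnet `10.0.0.0/24` is an assumed example and must be replaced with the real cluster network:

```bash
# Permit scrapes from the cluster subnet only (subnet is an assumed example)
ufw allow from 10.0.0.0/24 to any port 9100 proto tcp
# Reject everything else on the node_exporter port
ufw deny 9100/tcp
```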
**Metrics Collected:**

All default collectors except resource-intensive ones:

- CPU, memory, filesystem, network
- System stats (uptime, load average, systemd)
- Thermal (if available on hardware)
- Disk I/O
## Prometheus Configuration

### Static Scrape Target

**Job configuration:**

- Job name: `workstation/willlaptop`
- Target: `willlaptop:9100` (DNS resolution) or the workstation IP
- Scrape interval: `15s` (matches cluster node_exporters)
- Scrape timeout: `10s`
- Metrics path: `/metrics`
- Honor labels: `true`
**Relabeling rules:**

- Add an `env: "workstation"` label for identification
- Preserve `instance: "willlaptop"` from the target

**Integration:**

Add to the existing Prometheus configuration in kube-prometheus-stack. A PrometheusRule cannot carry scrape configuration, so this is done via:

- the `additionalScrapeConfigs` secret referenced by the Prometheus custom resource, or
- a `ScrapeConfig` custom resource on newer prometheus-operator releases
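Sketched as an `additionalScrapeConfigs` entry; field values follow the job configuration above, and the explicit `instance` label keeps Prometheus from overwriting it with the target address:

```yaml
# Sketch of an additionalScrapeConfigs entry (values are assumptions to adapt)
- job_name: workstation/willlaptop
  scrape_interval: 15s
  scrape_timeout: 10s
  metrics_path: /metrics
  honor_labels: true
  static_configs:
    - targets: ["willlaptop:9100"]
      labels:
        env: workstation
        instance: willlaptop
```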
## Alert Rules

### PrometheusRule Resource

- **Namespace:** `monitoring`
- **Kind:** `PrometheusRule`
- **Labels:** standard discovery labels for the Prometheus operator
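A minimal sketch of the resource with one disk rule filled in; the `release` label is an assumed discovery label and must match the operator's `ruleSelector`, and the expression is illustrative rather than tested:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: workstation-alerts
  namespace: monitoring
  labels:
    release: kube-prometheus-stack  # assumed; match your ruleSelector
spec:
  groups:
    - name: workstation.rules
      rules:
        - alert: WorkstationDiskSpaceCritical
          expr: |
            (node_filesystem_avail_bytes{env="workstation"}
              / node_filesystem_size_bytes{env="workstation"}) * 100 < 5
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Disk space critically low on {{ $labels.instance }}"
            description: "{{ $labels.mountpoint }} has less than 5% free"
```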
### Alert Categories

#### Critical Alerts (Paging)

1. **WorkstationDiskSpaceCritical**
   - Condition: `<5%` free on any mounted filesystem
   - Duration: 5m
   - Severity: critical

2. **WorkstationMemoryCritical**
   - Condition: `>95%` memory usage
   - Duration: 5m
   - Severity: critical

3. **WorkstationCPUCritical**
   - Condition: `>95%` CPU usage
   - Duration: 10m
   - Severity: critical

4. **WorkstationSystemdFailed**
   - Condition: failed systemd units detected
   - Duration: 5m
   - Severity: critical
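The memory and CPU conditions might be expressed with standard node_exporter metrics; these expressions are sketches, not tested against this cluster:

```promql
# WorkstationMemoryCritical: >95% memory used
(1 - node_memory_MemAvailable_bytes{env="workstation"}
       / node_memory_MemTotal_bytes{env="workstation"}) * 100 > 95

# WorkstationCPUCritical: >95% average CPU usage across all cores
(1 - avg(rate(node_cpu_seconds_total{env="workstation", mode="idle"}[5m]))) * 100 > 95
```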
#### Warning Alerts (Email/Slack)

1. **WorkstationDiskSpaceWarning**
   - Condition: `<10%` free on any mounted filesystem
   - Duration: 10m
   - Severity: warning

2. **WorkstationMemoryWarning**
   - Condition: `>85%` memory usage
   - Duration: 10m
   - Severity: warning

3. **WorkstationCPUWarning**
   - Condition: `>80%` CPU usage
   - Duration: 15m
   - Severity: warning

4. **WorkstationLoadHigh**
   - Condition: 5m load average above the CPU core count
   - Duration: 10m
   - Severity: warning

5. **WorkstationDiskInodeWarning**
   - Condition: `<10%` inodes free
   - Duration: 10m
   - Severity: warning

6. **WorkstationNetworkErrors**
   - Condition: high packet loss or error rate
   - Duration: 10m
   - Severity: warning
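For `WorkstationLoadHigh`, the core count can be derived from node_exporter itself by counting idle-mode CPU series per instance (a sketch):

```promql
# 5m load average above the number of CPU cores
node_load5{env="workstation"}
  > on(instance) count by (instance) (node_cpu_seconds_total{env="workstation", mode="idle"})
```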
#### Info Alerts (Log Only)

1. **WorkstationDiskSpaceInfo**
   - Condition: `<20%` free on any mounted filesystem
   - Duration: 15m
   - Severity: info

2. **WorkstationUptime**
   - Condition: system uptime metric (recording rule)
   - Severity: info

### Alert Annotations

Each alert includes:

- `summary`: brief description
- `description`: detailed explanation with metric values
- `runbook_url`: link to troubleshooting documentation (if available)
## Versioning
|
||||
|
||||
### Repository Structure
|
||||
|
||||
```
|
||||
~/.claude/repos/homelab/charts/willlaptop-monitoring/
|
||||
├── prometheus-rules.yaml # PrometheusRule for workstation alerts
|
||||
├── values.yaml # Configuration values
|
||||
└── README.md # Documentation
|
||||
```
### values.yaml Configuration

Configurable parameters:

```yaml
workstation:
  hostname: willlaptop
  ip: <workstation_ip>  # optional, fallback to DNS

scrape:
  interval: 15s
  timeout: 10s

alerts:
  disk:
    critical_percent: 5
    warning_percent: 10
    info_percent: 20
  memory:
    critical_percent: 95
    warning_percent: 85
  cpu:
    critical_percent: 95
    critical_duration: 10m
    warning_percent: 80
    warning_duration: 15m
```
### Integration with ArgoCD

Follows the existing GitOps pattern (charts/kube-prometheus-stack). Can be added to ArgoCD for automated deployment if desired.

## Testing and Verification

### Phase 1 - Workstation Deployment
1. Verify node_exporter installation:

   ```bash
   pacman -Q prometheus-node-exporter
   ```

2. Check systemd service status:

   ```bash
   systemctl status node_exporter
   ```

3. Verify the metrics endpoint locally:

   ```bash
   curl -s http://localhost:9100/metrics | head -20
   ```

4. Test accessibility from the cluster:

   ```bash
   kubectl run -it --rm debug --image=curlimages/curl --restart=Never -- curl -s willlaptop:9100/metrics
   ```
### Phase 2 - Prometheus Integration

1. Verify the Prometheus target:
   - Access Prometheus UI → Status → Targets → workstation/willlaptop
   - Confirm the target is UP

2. Verify metric ingestion:

   ```promql
   # Query in the Prometheus UI; `up` is 1 while the scrape succeeds
   up{instance="willlaptop"}
   ```

3. Verify label injection:
   - Confirm the `env="workstation"` label appears on scraped metrics
### Phase 3 - Alert Verification

1. Review the PrometheusRule:

   ```bash
   kubectl get prometheusrule workstation-alerts -n monitoring -o yaml
   ```

2. Verify rule evaluation:
   - Access Prometheus UI → Rules
   - Confirm the workstation rules are active

3. Test a critical alert:
   - Temporarily trigger a low-disk alert (or simulate one)
   - Verify the alert fires in the Prometheus UI

4. Verify Alertmanager integration:
   - Check Alertmanager UI → Alerts
   - Confirm workstation alerts are received
## Success Criteria

- [ ] node_exporter running on the workstation
- [ ] Metrics accessible from cluster nodes
- [ ] Prometheus scraping workstation metrics
- [ ] Alert rules evaluated and firing correctly
- [ ] Alerts routing through Alertmanager
- [ ] Configuration versioned in the homelab repository
- [ ] Documentation complete

## Future Enhancements

- Grafana dashboards for workstation metrics
- Alert tuning based on observed patterns
- Additional collectors (e.g., temperature sensors if available)
- Integration with the morning-report skill for health status