Add workstation monitoring design

2026-01-05 01:31:10 -08:00
parent f3cb082c36
commit 62050faedc
1 changed files with 296 additions and 0 deletions
--- a/docs/plans/2026-01-05-workstation-monitoring-design.md
+++ b/docs/plans/2026-01-05-workstation-monitoring-design.md
@@ -0,0 +1,296 @@
 # Workstation Monitoring Design
 ## Overview
 Deploy comprehensive monitoring for the Arch Linux workstation (willlaptop) by integrating with the existing k8s monitoring stack. This will enable proactive alerting for resource exhaustion, long-term capacity planning, and performance debugging.
 **Reference:** Future consideration `fc-001` (workstation monitoring)
 ## Current Infrastructure
 - **Workstation:** Arch Linux on MacBookPro9,2 (hostname: willlaptop)
 - **K8s Cluster:** kube-prometheus-stack deployed with Prometheus, Alertmanager, Grafana
 - **Network:** Direct network connectivity between workstation and cluster nodes
 - **Existing Monitoring:** 3 node_exporters on cluster nodes, cluster-level alerts configured
 ## Architecture
 ### Components
 ```
 ┌─────────────────┐     HTTP/9100      ┌──────────────────────┐
 │   Workstation   │ ──────────────────> │   K8s Prometheus    │
 │   (willlaptop)  │   scrape every 15s │   (monitoring ns)   │
 │                 │                    │                      │
 │  node_exporter  │                    │  workstation rules  │
 │  systemd service│                    │  + scrape config    │
 └─────────────────┘                    └──────────────────────┘
                                                      │
                                                      v
                                         ┌──────────────────────┐
                                         │  Alertmanager        │
                                         │  (existing setup)    │
                                         │  unified routing     │
                                         └──────────────────────┘
 ```
 ### Data Flow
 1. **node_exporter** exposes metrics on `http://willlaptop:9100/metrics`
 2. **Prometheus** scrapes metrics every 15s via static target configuration
 3. **PrometheusRule** evaluates workstation-specific alert rules
 4. **Alertmanager** routes alerts to existing notification channels
 ## Workstation Deployment
 ### node_exporter Service
 **Installation:**
 ```bash
 pacman -S prometheus-node-exporter
 ```
 **Systemd Configuration:**
 - Service: `node_exporter.service`
 - User: `node_exporter` (created by package)
 - Listen address: `0.0.0.0:9100`
 - Restart policy: `always` with 10s delay
 - Logging: systemd journal (`journalctl -u node_exporter`)
 **ExecStart flags:**
 ```bash
 /usr/bin/node_exporter --collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($|/)
 ```
 Excludes system mounts to reduce noise.
 **Firewall Configuration:**
 - Allow TCP 9100 from cluster nodes
 - Use ufw or iptables to restrict access
 **Metrics Collected:**
 All default collectors except resource-intensive ones:
 - CPU, memory, filesystem, network
 - System stats (uptime, load average, systemd)
 - Thermal (if available on hardware)
 - Disk I/O
 ## Prometheus Configuration
 ### Static Scrape Target
 **Job configuration:**
 - Job name: `workstation/willlaptop`
 - Target: `willlaptop:9100` (DNS resolution) or workstation IP
 - Scrape interval: `15s` (matches cluster node_exporter)
 - Scrape timeout: `10s`
 - Metrics path: `/metrics`
 - Honor labels: `true`
 **Relabeling rules:**
 - Add `env: "workstation"` label for identification
 - Preserve `instance: "willlaptop"` from target
 **Integration:**
 Add to existing Prometheus CRD configuration in kube-prometheus-stack. This can be done via:
 - PrometheusRule with additional scrape config, or
 - Direct modification of Prometheus configuration
 ## Alert Rules
 ### PrometheusRule Resource
 **Namespace:** `monitoring`
 **Kind:** `PrometheusRule`
 **Labels:** Standard discovery labels for Prometheus operator
 ### Alert Categories
 #### Critical Alerts (Paging)
 1. **WorkstationDiskSpaceCritical**
   - Condition: `<5%` free on any mounted filesystem
   - Duration: 5m
   - Severity: critical
 2. **WorkstationMemoryCritical**
   - Condition: `>95%` memory usage
   - Duration: 5m
   - Severity: critical
 3. **WorkstationCPUCritical**
   - Condition: `>95%` CPU usage
   - Duration: 10m
   - Severity: critical
 4. **WorkstationSystemdFailed**
   - Condition: Failed systemd units detected
   - Duration: 5m
   - Severity: critical
 #### Warning Alerts (Email/Slack)
 1. **WorkstationDiskSpaceWarning**
   - Condition: `<10%` free on any mounted filesystem
   - Duration: 10m
   - Severity: warning
 2. **WorkstationMemoryWarning**
   - Condition: `>85%` memory usage
   - Duration: 10m
   - Severity: warning
 3. **WorkstationCPUWarning**
   - Condition: `>80%` CPU usage
   - Duration: 15m
   - Severity: warning
 4. **WorkstationLoadHigh**
   - Condition: 5m load average > # CPU cores
   - Duration: 10m
   - Severity: warning
 5. **WorkstationDiskInodeWarning**
   - Condition: `<10%` inodes free
   - Duration: 10m
   - Severity: warning
 6. **WorkstationNetworkErrors**
   - Condition: High packet loss or error rate
   - Duration: 10m
   - Severity: warning
 #### Info Alerts (Log Only)
 1. **WorkstationDiskSpaceInfo**
   - Condition: `<20%` free on any mounted filesystem
   - Duration: 15m
   - Severity: info
 2. **WorkstationUptime**
   - Condition: System uptime metric (recording rule)
   - Severity: info
 ### Alert Annotations
 Each alert includes:
 - `summary`: Brief description
 - `description`: Detailed explanation with metric values
 - `runbook_url`: Link to troubleshooting documentation (if available)
 ## Versioning
 ### Repository Structure
 ```
 ~/.claude/repos/homelab/charts/willlaptop-monitoring/
 ├── prometheus-rules.yaml         # PrometheusRule for workstation alerts
 ├── values.yaml                    # Configuration values
 └── README.md                     # Documentation
 ```
 ### Values.yaml Configuration
 Configurable parameters:
 ```yaml
 workstation:
  hostname: willlaptop
  ip: <workstation_ip>  # optional, fallback to DNS
 scrape:
  interval: 15s
  timeout: 10s
 alerts:
  disk:
    critical_percent: 5
    warning_percent: 10
    info_percent: 20
  memory:
    critical_percent: 95
    warning_percent: 85
  cpu:
    critical_percent: 95
    critical_duration: 10m
    warning_percent: 80
    warning_duration: 15m
 ```
 ### Integration with ArgoCD
 Follows existing GitOps pattern (charts/kube-prometheus-stack). Can be added to ArgoCD for automated deployments if desired.
 ## Testing and Verification
 ### Phase 1 - Workstation Deployment
 1. Verify node_exporter installation:
   ```bash
   pacman -Q prometheus-node-exporter
   ```
 2. Check systemd service status:
   ```bash
   systemctl status node_exporter
   ```
 3. Verify metrics endpoint locally:
   ```bash
   curl http://localhost:9100/metrics | head -20
   ```
 4. Test accessibility from cluster:
   ```bash
   kubectl run -it --rm debug --image=curlimages/curl -- curl willlaptop:9100/metrics
   ```
 ### Phase 2 - Prometheus Integration
 1. Verify Prometheus target:
   - Access Prometheus UI → Targets → workstation/willlaptop
   - Confirm target is UP
 2. Verify metric ingestion:
   ```bash
   # Query in Prometheus UI
   node_up{instance="willlaptop"}
   ```
 3. Verify label injection:
   - Confirm `env="workstation"` label appears on metrics
 ### Phase 3 - Alert Verification
 1. Review PrometheusRule:
   ```bash
   kubectl get prometheusrule workstation-alerts -n monitoring -o yaml
   ```
 2. Verify rule evaluation:
   - Access Prometheus UI → Rules
   - Confirm workstation rules are active
 3. Test critical alert:
   - Temporarily trigger a low disk alert (or simulate)
   - Verify alert fires in Prometheus UI
 4. Verify Alertmanager integration:
   - Check Alertmanager UI → Alerts
   - Confirm workstation alerts are received
 ## Success Criteria
 - [ ] node_exporter running on workstation
 - [ ] Metrics accessible from cluster nodes
 - [ ] Prometheus scraping workstation metrics
 - [ ] Alert rules evaluated and firing correctly
 - [ ] Alerts routing through Alertmanager
 - [ ] Configuration versioned in homelab repository
 - [ ] Documentation complete
 ## Future Enhancements
 - Grafana dashboards for workstation metrics
 - Alert tuning based on observed patterns
 - Additional collectors (e.g., temperature sensors if available)
 - Integration with morning-report skill for health status