# Workstation Monitoring Design

## Overview

Deploy comprehensive monitoring for the Arch Linux workstation (willlaptop) by integrating with the existing k8s monitoring stack. This will enable proactive alerting for resource exhaustion, long-term capacity planning, and performance debugging.

**Reference:** Future consideration `fc-001` (workstation monitoring)

## Current Infrastructure

- **Workstation:** Arch Linux on MacBookPro9,2 (hostname: willlaptop)
- **K8s Cluster:** kube-prometheus-stack deployed with Prometheus, Alertmanager, Grafana
- **Network:** Direct network connectivity between workstation and cluster nodes
- **Existing Monitoring:** 3 node_exporters on cluster nodes, cluster-level alerts configured

## Architecture

### Components

```
┌─────────────────┐    HTTP/9100        ┌──────────────────────┐
│  Workstation    │ ──────────────────> │  K8s Prometheus      │
│  (willlaptop)   │  scrape every 15s   │  (monitoring ns)     │
│                 │                     │                      │
│  node_exporter  │                     │  workstation rules   │
│  systemd service│                     │  + scrape config     │
└─────────────────┘                     └──────────────────────┘
                                                   │
                                                   v
                                        ┌──────────────────────┐
                                        │  Alertmanager        │
                                        │  (existing setup)    │
                                        │  unified routing     │
                                        └──────────────────────┘
```

### Data Flow

1. **node_exporter** exposes metrics on `http://willlaptop:9100/metrics`
2. **Prometheus** scrapes the metrics every 15s via a static target configuration
3. **PrometheusRule** evaluates workstation-specific alert rules
4. **Alertmanager** routes alerts to the existing notification channels

## Workstation Deployment

### node_exporter Service

**Installation:**

```bash
pacman -S prometheus-node-exporter
```

**Systemd Configuration:**

- Service: `node_exporter.service`
- User: `node_exporter` (created by the package)
- Listen address: `0.0.0.0:9100`
- Restart policy: `always` with a 10s delay
- Logging: systemd journal (`journalctl -u node_exporter`)

**ExecStart flags:**

```bash
/usr/bin/node_exporter --collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($|/)
```

Excludes system mounts to reduce noise.

**Firewall Configuration:**

- Allow TCP 9100 from cluster nodes only
- Use ufw or iptables to restrict access

**Metrics Collected:** All default collectors except resource-intensive ones:

- CPU, memory, filesystem, network
- System stats (uptime, load average, systemd)
- Thermal (if available on this hardware)
- Disk I/O

## Prometheus Configuration

### Static Scrape Target

**Job configuration:**

- Job name: `workstation/willlaptop`
- Target: `willlaptop:9100` (DNS resolution) or the workstation IP
- Scrape interval: `15s` (matches the cluster node_exporters)
- Scrape timeout: `10s`
- Metrics path: `/metrics`
- Honor labels: `true`

**Relabeling rules:**

- Add an `env: "workstation"` label for identification
- Preserve `instance: "willlaptop"` from the target

**Integration:** Add to the Prometheus configuration managed by kube-prometheus-stack. This can be done via:

- the `additionalScrapeConfigs` field in the chart values (rendered into the Prometheus configuration by the operator), or
- a `ScrapeConfig` custom resource, on prometheus-operator releases that support it

A sketch of the first option follows.
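Assuming the chart's `prometheus.prometheusSpec.additionalScrapeConfigs` values field, a minimal sketch of the scrape job is below. Field names follow the standard Prometheus `scrape_config` schema; the port-stripping relabel rule is one possible way to realize `instance="willlaptop"`, not a confirmed part of the design.

```yaml
# Sketch: static scrape job for the workstation, as a kube-prometheus-stack
# values entry. The operator renders this into the Prometheus configuration.
prometheus:
  prometheusSpec:
    additionalScrapeConfigs:
      - job_name: workstation/willlaptop
        scrape_interval: 15s
        scrape_timeout: 10s
        metrics_path: /metrics
        honor_labels: true
        static_configs:
          - targets: ["willlaptop:9100"]
            labels:
              env: workstation            # identification label from the design
        relabel_configs:
          - source_labels: [__address__]  # "willlaptop:9100"
            regex: "([^:]+):\\d+"
            target_label: instance
            replacement: "$1"             # yields instance="willlaptop"
```

The config-reloader sidecar picks up the rendered configuration automatically, so no manual Prometheus restart is needed after a values change.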
## Alert Rules

### PrometheusRule Resource

- **Namespace:** `monitoring`
- **Kind:** `PrometheusRule`
- **Labels:** Standard discovery labels for the Prometheus operator

### Alert Categories

#### Critical Alerts (Paging)

1. **WorkstationDiskSpaceCritical**
   - Condition: `<5%` free on any mounted filesystem
   - Duration: 5m
   - Severity: critical

2. **WorkstationMemoryCritical**
   - Condition: `>95%` memory usage
   - Duration: 5m
   - Severity: critical

3. **WorkstationCPUCritical**
   - Condition: `>95%` CPU usage
   - Duration: 10m
   - Severity: critical

4. **WorkstationSystemdFailed**
   - Condition: Failed systemd units detected
   - Duration: 5m
   - Severity: critical

#### Warning Alerts (Email/Slack)

1. **WorkstationDiskSpaceWarning**
   - Condition: `<10%` free on any mounted filesystem
   - Duration: 10m
   - Severity: warning

2. **WorkstationMemoryWarning**
   - Condition: `>85%` memory usage
   - Duration: 10m
   - Severity: warning

3. **WorkstationCPUWarning**
   - Condition: `>80%` CPU usage
   - Duration: 15m
   - Severity: warning

4. **WorkstationLoadHigh**
   - Condition: 5m load average exceeds the CPU core count
   - Duration: 10m
   - Severity: warning

5. **WorkstationDiskInodeWarning**
   - Condition: `<10%` inodes free
   - Duration: 10m
   - Severity: warning

6. **WorkstationNetworkErrors**
   - Condition: Sustained packet loss or interface error rate
   - Duration: 10m
   - Severity: warning

#### Info Alerts (Log Only)

1. **WorkstationDiskSpaceInfo**
   - Condition: `<20%` free on any mounted filesystem
   - Duration: 15m
   - Severity: info

2. **WorkstationUptime**
   - Condition: System uptime metric (recording rule)
   - Severity: info

### Alert Annotations

Each alert includes:

- `summary`: Brief description
- `description`: Detailed explanation with metric values
- `runbook_url`: Link to troubleshooting documentation (if available)
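As an illustration, here is a trimmed `prometheus-rules.yaml` covering one critical and one warning rule from the tables above, built from standard node_exporter metrics. The `release: kube-prometheus-stack` discovery label is an assumption; it must match whatever `ruleSelector` the Prometheus custom resource actually uses.

```yaml
# Sketch: two representative rules; thresholds and durations match the tables
# above. The metadata labels are assumptions and must match the operator's
# ruleSelector for the rules to be discovered.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: workstation-alerts
  namespace: monitoring
  labels:
    release: kube-prometheus-stack   # assumed discovery label
spec:
  groups:
    - name: workstation.rules
      rules:
        - alert: WorkstationDiskSpaceCritical
          expr: |
            (node_filesystem_avail_bytes{instance="willlaptop",fstype!=""}
              / node_filesystem_size_bytes{instance="willlaptop",fstype!=""}) * 100 < 5
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Disk space critically low on {{ $labels.mountpoint }}"
            description: "Only {{ $value | humanize }}% free on {{ $labels.mountpoint }}."
        - alert: WorkstationMemoryWarning
          expr: |
            (1 - node_memory_MemAvailable_bytes{instance="willlaptop"}
              / node_memory_MemTotal_bytes{instance="willlaptop"}) * 100 > 85
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Memory usage above 85% on willlaptop"
            description: "Current memory usage: {{ $value | humanize }}%."
```

Pinning every expression to `instance="willlaptop"` keeps these rules from also firing for the cluster's own node_exporters, which share the same metric names.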
## Versioning

### Repository Structure

```
~/.claude/repos/homelab/charts/willlaptop-monitoring/
├── prometheus-rules.yaml   # PrometheusRule for workstation alerts
├── values.yaml             # Configuration values
└── README.md               # Documentation
```

### Values.yaml Configuration

Configurable parameters:

```yaml
workstation:
  hostname: willlaptop
  ip: # optional, fallback to DNS
scrape:
  interval: 15s
  timeout: 10s
alerts:
  disk:
    critical_percent: 5
    warning_percent: 10
    info_percent: 20
  memory:
    critical_percent: 95
    warning_percent: 85
  cpu:
    critical_percent: 95
    critical_duration: 10m
    warning_percent: 80
    warning_duration: 15m
```

### Integration with ArgoCD

Follows the existing GitOps pattern (charts/kube-prometheus-stack). Can be added to ArgoCD for automated deployments if desired.

## Testing and Verification

### Phase 1 - Workstation Deployment

1. Verify the node_exporter installation:

   ```bash
   pacman -Q prometheus-node-exporter
   ```

2. Check the systemd service status:

   ```bash
   systemctl status node_exporter
   ```

3. Verify the metrics endpoint locally:

   ```bash
   curl http://localhost:9100/metrics | head -20
   ```

4. Test accessibility from the cluster:

   ```bash
   kubectl run -it --rm debug --image=curlimages/curl -- curl willlaptop:9100/metrics
   ```

### Phase 2 - Prometheus Integration

1. Verify the Prometheus target:
   - Access Prometheus UI → Targets → workstation/willlaptop
   - Confirm the target is UP

2. Verify metric ingestion:

   ```bash
   # Query in the Prometheus UI; `up` is the standard scrape-health metric
   up{instance="willlaptop"}
   ```

3. Verify label injection:
   - Confirm the `env="workstation"` label appears on the metrics

### Phase 3 - Alert Verification

1. Review the PrometheusRule:

   ```bash
   kubectl get prometheusrule workstation-alerts -n monitoring -o yaml
   ```

2. Verify rule evaluation:
   - Access Prometheus UI → Rules
   - Confirm the workstation rules are active

3. Test a critical alert:
   - Temporarily trigger a low-disk alert (or simulate one)
   - Verify the alert fires in the Prometheus UI

4. Verify Alertmanager integration:
   - Check Alertmanager UI → Alerts
   - Confirm workstation alerts are received

## Success Criteria

- [ ] node_exporter running on the workstation
- [ ] Metrics accessible from cluster nodes
- [ ] Prometheus scraping workstation metrics
- [ ] Alert rules evaluated and firing correctly
- [ ] Alerts routing through Alertmanager
- [ ] Configuration versioned in the homelab repository
- [ ] Documentation complete

## Future Enhancements

- Grafana dashboards for workstation metrics (see the provisioning sketch below)
- Alert tuning based on observed patterns
- Additional collectors (e.g., temperature sensors if available)
- Integration with the morning-report skill for health status
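The Grafana dashboard enhancement could piggyback on the dashboard sidecar that kube-prometheus-stack deploys alongside Grafana, which loads dashboards from labeled ConfigMaps. A minimal sketch, assuming the sidecar's default `grafana_dashboard: "1"` discovery label; the dashboard JSON here is only a stub with a single panel, and a real dashboard would be exported from the Grafana UI:

```yaml
# Sketch: dashboard delivered as a ConfigMap for the Grafana sidecar.
# The label name and value assume the chart's default sidecar configuration.
apiVersion: v1
kind: ConfigMap
metadata:
  name: willlaptop-dashboard
  namespace: monitoring
  labels:
    grafana_dashboard: "1"
data:
  willlaptop.json: |
    {
      "title": "Workstation: willlaptop",
      "panels": [
        {
          "type": "timeseries",
          "title": "Memory usage %",
          "targets": [
            { "expr": "(1 - node_memory_MemAvailable_bytes{instance=\"willlaptop\"} / node_memory_MemTotal_bytes{instance=\"willlaptop\"}) * 100" }
          ]
        }
      ]
    }
```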