diff --git a/docs/plans/2026-01-05-workstation-monitoring-design.md b/docs/plans/2026-01-05-workstation-monitoring-design.md new file mode 100644 index 0000000..2bdc8bd --- /dev/null +++ b/docs/plans/2026-01-05-workstation-monitoring-design.md @@ -0,0 +1,296 @@ +# Workstation Monitoring Design + +## Overview + +Deploy comprehensive monitoring for the Arch Linux workstation (willlaptop) by integrating with the existing k8s monitoring stack. This will enable proactive alerting for resource exhaustion, long-term capacity planning, and performance debugging. + +**Reference:** Future consideration `fc-001` (workstation monitoring) + +## Current Infrastructure + +- **Workstation:** Arch Linux on MacBookPro9,2 (hostname: willlaptop) +- **K8s Cluster:** kube-prometheus-stack deployed with Prometheus, Alertmanager, Grafana +- **Network:** Direct network connectivity between workstation and cluster nodes +- **Existing Monitoring:** 3 node_exporters on cluster nodes, cluster-level alerts configured + +## Architecture + +### Components + +``` +┌─────────────────┐ HTTP/9100 ┌──────────────────────┐ +│ Workstation │ ──────────────────> │ K8s Prometheus │ +│ (willlaptop) │ scrape every 15s │ (monitoring ns) │ +│ │ │ │ +│ node_exporter │ │ workstation rules │ +│ systemd service│ │ + scrape config │ +└─────────────────┘ └──────────────────────┘ + │ + v + ┌──────────────────────┐ + │ Alertmanager │ + │ (existing setup) │ + │ unified routing │ + └──────────────────────┘ +``` + +### Data Flow + +1. **node_exporter** exposes metrics on `http://willlaptop:9100/metrics` +2. **Prometheus** scrapes metrics every 15s via static target configuration +3. **PrometheusRule** evaluates workstation-specific alert rules +4. **Alertmanager** routes alerts to existing notification channels + +## Workstation Deployment + +### node_exporter Service + +**Installation:** +```bash +pacman -S prometheus-node-exporter +``` + +**Systemd Configuration:** +- Service: `node_exporter.service` +- User: `node_exporter` (created by package) +- Listen address: `0.0.0.0:9100` +- Restart policy: `always` with 10s delay +- Logging: systemd journal (`journalctl -u node_exporter`) + +**ExecStart flags:** +```bash +/usr/bin/node_exporter --collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($|/) +``` + +Excludes system mounts to reduce noise. + +**Firewall Configuration:** +- Allow TCP 9100 from cluster nodes +- Use ufw or iptables to restrict access + +**Metrics Collected:** +All default collectors except resource-intensive ones: +- CPU, memory, filesystem, network +- System stats (uptime, load average, systemd) +- Thermal (if available on hardware) +- Disk I/O + +## Prometheus Configuration + +### Static Scrape Target + +**Job configuration:** +- Job name: `workstation/willlaptop` +- Target: `willlaptop:9100` (DNS resolution) or workstation IP +- Scrape interval: `15s` (matches cluster node_exporter) +- Scrape timeout: `10s` +- Metrics path: `/metrics` +- Honor labels: `true` + +**Relabeling rules:** +- Add `env: "workstation"` label for identification +- Preserve `instance: "willlaptop"` from target + +**Integration:** +Add to existing Prometheus CRD configuration in kube-prometheus-stack. This can be done via: +- PrometheusRule with additional scrape config, or +- Direct modification of Prometheus configuration + +## Alert Rules + +### PrometheusRule Resource + +**Namespace:** `monitoring` +**Kind:** `PrometheusRule` +**Labels:** Standard discovery labels for Prometheus operator + +### Alert Categories + +#### Critical Alerts (Paging) + +1. **WorkstationDiskSpaceCritical** + - Condition: `<5%` free on any mounted filesystem + - Duration: 5m + - Severity: critical + +2. **WorkstationMemoryCritical** + - Condition: `>95%` memory usage + - Duration: 5m + - Severity: critical + +3. **WorkstationCPUCritical** + - Condition: `>95%` CPU usage + - Duration: 10m + - Severity: critical + +4. **WorkstationSystemdFailed** + - Condition: Failed systemd units detected + - Duration: 5m + - Severity: critical + +#### Warning Alerts (Email/Slack) + +1. **WorkstationDiskSpaceWarning** + - Condition: `<10%` free on any mounted filesystem + - Duration: 10m + - Severity: warning + +2. **WorkstationMemoryWarning** + - Condition: `>85%` memory usage + - Duration: 10m + - Severity: warning + +3. **WorkstationCPUWarning** + - Condition: `>80%` CPU usage + - Duration: 15m + - Severity: warning + +4. **WorkstationLoadHigh** + - Condition: 5m load average > # CPU cores + - Duration: 10m + - Severity: warning + +5. **WorkstationDiskInodeWarning** + - Condition: `<10%` inodes free + - Duration: 10m + - Severity: warning + +6. **WorkstationNetworkErrors** + - Condition: High packet loss or error rate + - Duration: 10m + - Severity: warning + +#### Info Alerts (Log Only) + +1. **WorkstationDiskSpaceInfo** + - Condition: `<20%` free on any mounted filesystem + - Duration: 15m + - Severity: info + +2. **WorkstationUptime** + - Condition: System uptime metric (recording rule) + - Severity: info + +### Alert Annotations + +Each alert includes: +- `summary`: Brief description +- `description`: Detailed explanation with metric values +- `runbook_url`: Link to troubleshooting documentation (if available) + +## Versioning + +### Repository Structure + +``` +~/.claude/repos/homelab/charts/willlaptop-monitoring/ +├── prometheus-rules.yaml # PrometheusRule for workstation alerts +├── values.yaml # Configuration values +└── README.md # Documentation +``` + +### Values.yaml Configuration + +Configurable parameters: +```yaml +workstation: + hostname: willlaptop + ip: # optional, fallback to DNS + +scrape: + interval: 15s + timeout: 10s + +alerts: + disk: + critical_percent: 5 + warning_percent: 10 + info_percent: 20 + memory: + critical_percent: 95 + warning_percent: 85 + cpu: + critical_percent: 95 + critical_duration: 10m + warning_percent: 80 + warning_duration: 15m +``` + +### Integration with ArgoCD + +Follows existing GitOps pattern (charts/kube-prometheus-stack). Can be added to ArgoCD for automated deployments if desired. + +## Testing and Verification + +### Phase 1 - Workstation Deployment + +1. Verify node_exporter installation: + ```bash + pacman -Q prometheus-node-exporter + ``` + +2. Check systemd service status: + ```bash + systemctl status node_exporter + ``` + +3. Verify metrics endpoint locally: + ```bash + curl http://localhost:9100/metrics | head -20 + ``` + +4. Test accessibility from cluster: + ```bash + kubectl run -it --rm debug --image=curlimages/curl -- curl willlaptop:9100/metrics + ``` + +### Phase 2 - Prometheus Integration + +1. Verify Prometheus target: + - Access Prometheus UI → Targets → workstation/willlaptop + - Confirm target is UP + +2. Verify metric ingestion: + ```bash + # Query in Prometheus UI + node_up{instance="willlaptop"} + ``` + +3. Verify label injection: + - Confirm `env="workstation"` label appears on metrics + +### Phase 3 - Alert Verification + +1. Review PrometheusRule: + ```bash + kubectl get prometheusrule workstation-alerts -n monitoring -o yaml + ``` + +2. Verify rule evaluation: + - Access Prometheus UI → Rules + - Confirm workstation rules are active + +3. Test critical alert: + - Temporarily trigger a low disk alert (or simulate) + - Verify alert fires in Prometheus UI + +4. Verify Alertmanager integration: + - Check Alertmanager UI → Alerts + - Confirm workstation alerts are received + +## Success Criteria + +- [ ] node_exporter running on workstation +- [ ] Metrics accessible from cluster nodes +- [ ] Prometheus scraping workstation metrics +- [ ] Alert rules evaluated and firing correctly +- [ ] Alerts routing through Alertmanager +- [ ] Configuration versioned in homelab repository +- [ ] Documentation complete + +## Future Enhancements + +- Grafana dashboards for workstation metrics +- Alert tuning based on observed patterns +- Additional collectors (e.g., temperature sensors if available) +- Integration with morning-report skill for health status