# Workstation Monitoring Design

## Overview
Deploy comprehensive monitoring for the Arch Linux workstation (willlaptop) by integrating with the existing k8s monitoring stack. This will enable proactive alerting for resource exhaustion, long-term capacity planning, and performance debugging.
Reference: Future consideration fc-001 (workstation monitoring)
## Current Infrastructure
- Workstation: Arch Linux on MacBookPro9,2 (hostname: willlaptop)
- K8s Cluster: kube-prometheus-stack deployed with Prometheus, Alertmanager, Grafana
- Network: Direct network connectivity between workstation and cluster nodes
- Existing Monitoring: 3 node_exporters on cluster nodes, cluster-level alerts configured
## Architecture

### Components
```
┌─────────────────┐       HTTP/9100       ┌──────────────────────┐
│  Workstation    │ ──────────────────>   │   K8s Prometheus     │
│  (willlaptop)   │   scrape every 15s    │   (monitoring ns)    │
│                 │                       │                      │
│  node_exporter  │                       │  workstation rules   │
│  systemd service│                       │  + scrape config     │
└─────────────────┘                       └──────────────────────┘
                                                     │
                                                     v
                                          ┌──────────────────────┐
                                          │     Alertmanager     │
                                          │   (existing setup)   │
                                          │   unified routing    │
                                          └──────────────────────┘
```
### Data Flow

1. node_exporter exposes metrics at `http://willlaptop:9100/metrics`
2. Prometheus scrapes the target every 15s via static target configuration
3. PrometheusRule resources evaluate workstation-specific alert rules
4. Alertmanager routes alerts to the existing notification channels
## Workstation Deployment

### node_exporter Service

**Installation:**

```sh
pacman -S prometheus-node-exporter
```
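Then enable and start the unit (a minimal sketch; the unit name below follows this doc's naming, and the Arch package may register it as `prometheus-node-exporter.service` instead):

```sh
# Verify the actual unit name first: systemctl list-unit-files | grep node_exporter
sudo systemctl enable --now node_exporter.service
```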
**Systemd Configuration:**

- Service: `node_exporter.service`
- User: `node_exporter` (created by package)
- Listen address: `0.0.0.0:9100`
- Restart policy: `always` with 10s delay
- Logging: systemd journal (`journalctl -u node_exporter`)
**ExecStart flags:**

```sh
/usr/bin/node_exporter --collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($|/)
```

This excludes system mounts to reduce noise.
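One way to apply these flags without editing the packaged unit is a drop-in override (a sketch; unit and binary names are taken from this doc and may differ in the Arch package):

```ini
# /etc/systemd/system/node_exporter.service.d/override.conf
[Service]
# Clear the packaged ExecStart, then set the flags from this design
ExecStart=
ExecStart=/usr/bin/node_exporter \
    --web.listen-address=0.0.0.0:9100 \
    --collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($|/)
Restart=always
RestartSec=10
```

Apply with `sudo systemctl daemon-reload && sudo systemctl restart node_exporter.service`.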
**Firewall Configuration:**
- Allow TCP 9100 from cluster nodes
- Use ufw or iptables to restrict access
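For example, with ufw (the cluster subnet 192.168.1.0/24 below is an assumed placeholder; substitute the real node CIDR):

```sh
# Permit scrapes from cluster nodes only; everything else stays blocked by the default policy
sudo ufw allow from 192.168.1.0/24 to any port 9100 proto tcp
```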
**Metrics Collected:** all default collectors except resource-intensive ones:
- CPU, memory, filesystem, network
- System stats (uptime, load average, systemd)
- Thermal (if available on hardware)
- Disk I/O
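Whether thermal data is available on this hardware can be checked directly against the endpoint (assuming node_exporter's standard hwmon/thermal metric names):

```sh
# Prints sensor samples if the hardware exposes them; empty output means no thermal metrics
curl -s http://localhost:9100/metrics | grep -E '^node_(hwmon|thermal_zone)' | head
```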
## Prometheus Configuration

### Static Scrape Target

**Job configuration:**
- Job name: `workstation/willlaptop`
- Target: `willlaptop:9100` (DNS resolution) or the workstation IP
- Scrape interval: `15s` (matches cluster node_exporters)
- Scrape timeout: `10s`
- Metrics path: `/metrics`
- Honor labels: `true`
**Relabeling rules:**

- Add an `env: "workstation"` label for identification
- Preserve `instance: "willlaptop"` from the target
**Integration:** add to the existing Prometheus CRD configuration in kube-prometheus-stack. This can be done via:

- `additionalScrapeConfigs` in the kube-prometheus-stack Helm values (PrometheusRule resources carry only alerting and recording rules, not scrape config), or
- direct modification of the Prometheus configuration
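A minimal sketch of the Helm-values route, assuming the kube-prometheus-stack value layout (verify key names against the deployed chart version):

```yaml
prometheus:
  prometheusSpec:
    additionalScrapeConfigs:
      - job_name: workstation/willlaptop
        scrape_interval: 15s
        scrape_timeout: 10s
        metrics_path: /metrics
        honor_labels: true
        static_configs:
          - targets: ["willlaptop:9100"]
            labels:
              env: workstation
        relabel_configs:
          # Strip the port so the instance label reads "willlaptop"
          - source_labels: [__address__]
            regex: '([^:]+):\d+'
            target_label: instance
            replacement: '$1'
```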
## Alert Rules

### PrometheusRule Resource

- Namespace: `monitoring`
- Kind: `PrometheusRule`
- Labels: standard discovery labels for the Prometheus operator
### Alert Categories

**Critical Alerts (Paging)**

- `WorkstationDiskSpaceCritical`
  - Condition: < 5% free on any mounted filesystem
  - Duration: 5m
  - Severity: critical
- `WorkstationMemoryCritical`
  - Condition: > 95% memory usage
  - Duration: 5m
  - Severity: critical
- `WorkstationCPUCritical`
  - Condition: > 95% CPU usage
  - Duration: 10m
  - Severity: critical
- `WorkstationSystemdFailed`
  - Condition: failed systemd units detected
  - Duration: 5m
  - Severity: critical
**Warning Alerts (Email/Slack)**

- `WorkstationDiskSpaceWarning`
  - Condition: < 10% free on any mounted filesystem
  - Duration: 10m
  - Severity: warning
- `WorkstationMemoryWarning`
  - Condition: > 85% memory usage
  - Duration: 10m
  - Severity: warning
- `WorkstationCPUWarning`
  - Condition: > 80% CPU usage
  - Duration: 15m
  - Severity: warning
- `WorkstationLoadHigh`
  - Condition: 5-minute load average greater than the CPU core count (see the PromQL sketch after this list)
  - Duration: 10m
  - Severity: warning
- `WorkstationDiskInodeWarning`
  - Condition: < 10% inodes free
  - Duration: 10m
  - Severity: warning
- `WorkstationNetworkErrors`
  - Condition: high packet loss or error rate
  - Duration: 10m
  - Severity: warning
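node_exporter exposes no core-count gauge, so the load-versus-cores comparison has to derive the count from per-CPU samples; a hedged PromQL sketch:

```promql
# One node_cpu_seconds_total{mode="idle"} series exists per core, so the count equals the core count
node_load5{env="workstation"}
  > on (instance) count by (instance) (node_cpu_seconds_total{env="workstation", mode="idle"})
```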
**Info Alerts (Log Only)**

- `WorkstationDiskSpaceInfo`
  - Condition: < 20% free on any mounted filesystem
  - Duration: 15m
  - Severity: info
- `WorkstationUptime`
  - Condition: system uptime metric (recording rule)
  - Severity: info
### Alert Annotations

Each alert includes:

- `summary`: brief description
- `description`: detailed explanation with metric values
- `runbook_url`: link to troubleshooting documentation (if available)
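Putting the pieces together, a sketch of the PrometheusRule with one critical alert (the `release` discovery label is an assumption; match it to the operator's `ruleSelector`, and adjust the fstype exclusions to taste):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: workstation-alerts
  namespace: monitoring
  labels:
    release: kube-prometheus-stack  # assumed discovery label
spec:
  groups:
    - name: workstation.rules
      rules:
        - alert: WorkstationDiskSpaceCritical
          expr: |
            100 * min by (instance, mountpoint) (
              node_filesystem_avail_bytes{env="workstation", fstype!~"tmpfs|overlay"}
                / node_filesystem_size_bytes{env="workstation", fstype!~"tmpfs|overlay"}
            ) < 5
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Disk space critically low on {{ $labels.instance }}"
            description: "{{ $labels.mountpoint }} has {{ $value | humanize }}% free (< 5% threshold)."
```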
## Versioning

### Repository Structure

```
~/.claude/repos/homelab/charts/willlaptop-monitoring/
├── prometheus-rules.yaml   # PrometheusRule for workstation alerts
├── values.yaml             # Configuration values
└── README.md               # Documentation
```
### values.yaml Configuration

Configurable parameters:

```yaml
workstation:
  hostname: willlaptop
  ip: <workstation_ip>  # optional, fallback to DNS

scrape:
  interval: 15s
  timeout: 10s

alerts:
  disk:
    critical_percent: 5
    warning_percent: 10
    info_percent: 20
  memory:
    critical_percent: 95
    warning_percent: 85
  cpu:
    critical_percent: 95
    critical_duration: 10m
    warning_percent: 80
    warning_duration: 15m
```
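If prometheus-rules.yaml is rendered as a Helm template, these values can feed the rule thresholds directly. A sketch of one templated rule fragment, assuming node_exporter's standard memory gauges:

```yaml
# Fragment of a templated PrometheusRule (sketch): threshold wired to values.yaml
- alert: WorkstationMemoryWarning
  expr: |
    100 * (1 - node_memory_MemAvailable_bytes{env="workstation"}
             / node_memory_MemTotal_bytes{env="workstation"})
      > {{ .Values.alerts.memory.warning_percent }}
  for: 10m
  labels:
    severity: warning
```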
### Integration with ArgoCD
Follows existing GitOps pattern (charts/kube-prometheus-stack). Can be added to ArgoCD for automated deployments if desired.
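A minimal Application sketch, assuming the repository layout above (the repoURL placeholder must be filled in with the homelab remote):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: willlaptop-monitoring
  namespace: argocd
spec:
  project: default
  source:
    repoURL: <homelab_repo_url>  # placeholder: the homelab repository remote
    path: charts/willlaptop-monitoring
    targetRevision: HEAD
  destination:
    server: https://kubernetes.default.svc
    namespace: monitoring
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
```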
## Testing and Verification

### Phase 1 - Workstation Deployment

1. Verify node_exporter installation: `pacman -Q prometheus-node-exporter`
2. Check systemd service status: `systemctl status node_exporter`
3. Verify the metrics endpoint locally: `curl http://localhost:9100/metrics | head -20`
4. Test accessibility from the cluster:
   `kubectl run -it --rm debug --image=curlimages/curl -- curl willlaptop:9100/metrics`
### Phase 2 - Prometheus Integration

1. Verify the Prometheus target:
   - Access Prometheus UI → Targets → workstation/willlaptop
   - Confirm the target is UP
2. Verify metric ingestion by querying in the Prometheus UI:
   `up{instance="willlaptop"}`
3. Verify label injection:
   - Confirm the `env="workstation"` label appears on ingested metrics
### Phase 3 - Alert Verification

1. Review the PrometheusRule:
   `kubectl get prometheusrule workstation-alerts -n monitoring -o yaml`
2. Verify rule evaluation:
   - Access Prometheus UI → Rules
   - Confirm the workstation rules are active
3. Test a critical alert:
   - Temporarily trigger a low-disk alert (or simulate one, as sketched after this list)
   - Verify the alert fires in the Prometheus UI
4. Verify Alertmanager integration:
   - Check Alertmanager UI → Alerts
   - Confirm workstation alerts are received
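Routing can be exercised without actually filling a disk by injecting a synthetic alert with amtool (a sketch; `alertmanager-operated` is the operator's default service name and should be verified):

```sh
# Port-forward Alertmanager from the cluster (check `kubectl get svc -n monitoring` for the name)
kubectl -n monitoring port-forward svc/alertmanager-operated 9093:9093 &

# Fire a synthetic workstation alert; it should appear in the Alertmanager UI
# and route like a real one. amtool ships with Alertmanager.
amtool alert add WorkstationDiskSpaceCritical \
  severity=critical env=workstation instance=willlaptop \
  --annotation=summary="synthetic test alert" \
  --alertmanager.url=http://localhost:9093
```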
## Success Criteria
- node_exporter running on workstation
- Metrics accessible from cluster nodes
- Prometheus scraping workstation metrics
- Alert rules evaluated and firing correctly
- Alerts routing through Alertmanager
- Configuration versioned in homelab repository
- Documentation complete
## Future Enhancements
- Grafana dashboards for workstation metrics
- Alert tuning based on observed patterns
- Additional collectors (e.g., temperature sensors if available)
- Integration with morning-report skill for health status