Add workstation monitoring design

`docs/plans/2026-01-05-workstation-monitoring-design.md` (new file, 296 lines)
# Workstation Monitoring Design

## Overview

Deploy comprehensive monitoring for the Arch Linux workstation (willlaptop) by integrating with the existing k8s monitoring stack. This enables proactive alerting on resource exhaustion, long-term capacity planning, and performance debugging.

**Reference:** Future consideration `fc-001` (workstation monitoring)

## Current Infrastructure

- **Workstation:** Arch Linux on MacBookPro9,2 (hostname: willlaptop)
- **K8s Cluster:** kube-prometheus-stack deployed with Prometheus, Alertmanager, Grafana
- **Network:** Direct network connectivity between workstation and cluster nodes
- **Existing Monitoring:** 3 node_exporters on cluster nodes, cluster-level alerts configured

## Architecture
### Components

```
┌─────────────────┐      HTTP/9100       ┌──────────────────────┐
│  Workstation    │ ──────────────────>  │   K8s Prometheus     │
│  (willlaptop)   │   scrape every 15s   │   (monitoring ns)    │
│                 │                      │                      │
│  node_exporter  │                      │  workstation rules   │
│  systemd service│                      │  + scrape config     │
└─────────────────┘                      └──────────────────────┘
                                                    │
                                                    v
                                         ┌──────────────────────┐
                                         │    Alertmanager      │
                                         │  (existing setup)    │
                                         │   unified routing    │
                                         └──────────────────────┘
```
### Data Flow

1. **node_exporter** exposes metrics on `http://willlaptop:9100/metrics`
2. **Prometheus** scrapes metrics every 15s via static target configuration
3. **PrometheusRule** evaluates workstation-specific alert rules
4. **Alertmanager** routes alerts to existing notification channels
## Workstation Deployment

### node_exporter Service

**Installation:**

```bash
pacman -S prometheus-node-exporter
```

**Systemd Configuration:**

- Service: `node_exporter.service`
- User: `node_exporter` (created by package)
- Listen address: `0.0.0.0:9100`
- Restart policy: `always` with 10s delay
- Logging: systemd journal (`journalctl -u node_exporter`)

**ExecStart flags:**

```bash
/usr/bin/node_exporter --collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($|/)
```

Excludes system mounts to reduce noise.
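The systemd settings above could be captured in a drop-in override. The path and exact unit layout are assumptions; treat this as a sketch rather than the packaged unit:

```ini
# /etc/systemd/system/node_exporter.service.d/override.conf  (assumed path)
[Service]
Restart=always
RestartSec=10
# Clear the packaged ExecStart, then apply the flags above
ExecStart=
ExecStart=/usr/bin/node_exporter \
    --web.listen-address=0.0.0.0:9100 \
    --collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($|/)
```

Apply with `systemctl daemon-reload && systemctl restart node_exporter`.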
**Firewall Configuration:**

- Allow TCP 9100 from cluster nodes only
- Use ufw or iptables to restrict access
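A possible ufw rule set; the cluster subnet `10.0.0.0/24` is an assumed example and must be replaced with the real cluster network:

```bash
# Permit scrapes from the cluster subnet only (subnet is an assumed example)
ufw allow from 10.0.0.0/24 to any port 9100 proto tcp
# Reject everything else on the node_exporter port
ufw deny 9100/tcp
```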
**Metrics Collected:**

All default collectors except resource-intensive ones:

- CPU, memory, filesystem, network
- System stats (uptime, load average, systemd)
- Thermal (if available on hardware)
- Disk I/O
## Prometheus Configuration

### Static Scrape Target

**Job configuration:**

- Job name: `workstation/willlaptop`
- Target: `willlaptop:9100` (DNS resolution) or the workstation IP
- Scrape interval: `15s` (matches cluster node_exporters)
- Scrape timeout: `10s`
- Metrics path: `/metrics`
- Honor labels: `true`
**Relabeling rules:**

- Add an `env: "workstation"` label for identification
- Preserve `instance: "willlaptop"` from the target

**Integration:**

Add to the existing Prometheus configuration in kube-prometheus-stack. A PrometheusRule cannot carry scrape configuration, so this is done via:

- the `additionalScrapeConfigs` secret referenced by the Prometheus custom resource, or
- a `ScrapeConfig` custom resource on newer prometheus-operator releases
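Sketched as an `additionalScrapeConfigs` entry; field values follow the job configuration above, and the explicit `instance` label keeps Prometheus from overwriting it with the target address:

```yaml
# Sketch of an additionalScrapeConfigs entry (values are assumptions to adapt)
- job_name: workstation/willlaptop
  scrape_interval: 15s
  scrape_timeout: 10s
  metrics_path: /metrics
  honor_labels: true
  static_configs:
    - targets: ["willlaptop:9100"]
      labels:
        env: workstation
        instance: willlaptop
```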
## Alert Rules

### PrometheusRule Resource

- **Namespace:** `monitoring`
- **Kind:** `PrometheusRule`
- **Labels:** standard discovery labels for the Prometheus operator
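A minimal sketch of the resource with one disk rule filled in; the `release` label is an assumed discovery label and must match the operator's `ruleSelector`, and the expression is illustrative rather than tested:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: workstation-alerts
  namespace: monitoring
  labels:
    release: kube-prometheus-stack  # assumed; match your ruleSelector
spec:
  groups:
    - name: workstation.rules
      rules:
        - alert: WorkstationDiskSpaceCritical
          expr: |
            (node_filesystem_avail_bytes{env="workstation"}
              / node_filesystem_size_bytes{env="workstation"}) * 100 < 5
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Disk space critically low on {{ $labels.instance }}"
            description: "{{ $labels.mountpoint }} has less than 5% free"
```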
### Alert Categories

#### Critical Alerts (Paging)

1. **WorkstationDiskSpaceCritical**
   - Condition: `<5%` free on any mounted filesystem
   - Duration: 5m
   - Severity: critical

2. **WorkstationMemoryCritical**
   - Condition: `>95%` memory usage
   - Duration: 5m
   - Severity: critical

3. **WorkstationCPUCritical**
   - Condition: `>95%` CPU usage
   - Duration: 10m
   - Severity: critical

4. **WorkstationSystemdFailed**
   - Condition: failed systemd units detected
   - Duration: 5m
   - Severity: critical
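The memory and CPU conditions might be expressed with standard node_exporter metrics; these expressions are sketches, not tested against this cluster:

```promql
# WorkstationMemoryCritical: >95% memory used
(1 - node_memory_MemAvailable_bytes{env="workstation"}
       / node_memory_MemTotal_bytes{env="workstation"}) * 100 > 95

# WorkstationCPUCritical: >95% average CPU usage across all cores
(1 - avg(rate(node_cpu_seconds_total{env="workstation", mode="idle"}[5m]))) * 100 > 95
```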
#### Warning Alerts (Email/Slack)

1. **WorkstationDiskSpaceWarning**
   - Condition: `<10%` free on any mounted filesystem
   - Duration: 10m
   - Severity: warning

2. **WorkstationMemoryWarning**
   - Condition: `>85%` memory usage
   - Duration: 10m
   - Severity: warning

3. **WorkstationCPUWarning**
   - Condition: `>80%` CPU usage
   - Duration: 15m
   - Severity: warning

4. **WorkstationLoadHigh**
   - Condition: 5m load average above the CPU core count
   - Duration: 10m
   - Severity: warning

5. **WorkstationDiskInodeWarning**
   - Condition: `<10%` inodes free
   - Duration: 10m
   - Severity: warning

6. **WorkstationNetworkErrors**
   - Condition: high packet loss or error rate
   - Duration: 10m
   - Severity: warning
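For `WorkstationLoadHigh`, the core count can be derived from node_exporter itself by counting idle-mode CPU series per instance (a sketch):

```promql
# 5m load average above the number of CPU cores
node_load5{env="workstation"}
  > on(instance) count by (instance) (node_cpu_seconds_total{env="workstation", mode="idle"})
```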
#### Info Alerts (Log Only)

1. **WorkstationDiskSpaceInfo**
   - Condition: `<20%` free on any mounted filesystem
   - Duration: 15m
   - Severity: info

2. **WorkstationUptime**
   - Condition: system uptime metric (recording rule)
   - Severity: info

### Alert Annotations

Each alert includes:

- `summary`: brief description
- `description`: detailed explanation with metric values
- `runbook_url`: link to troubleshooting documentation (if available)
## Versioning
|
||||
|
||||
### Repository Structure
|
||||
|
||||
```
|
||||
~/.claude/repos/homelab/charts/willlaptop-monitoring/
|
||||
├── prometheus-rules.yaml # PrometheusRule for workstation alerts
|
||||
├── values.yaml # Configuration values
|
||||
└── README.md # Documentation
|
||||
```
### values.yaml Configuration

Configurable parameters:

```yaml
workstation:
  hostname: willlaptop
  ip: <workstation_ip>  # optional, fallback to DNS

scrape:
  interval: 15s
  timeout: 10s

alerts:
  disk:
    critical_percent: 5
    warning_percent: 10
    info_percent: 20
  memory:
    critical_percent: 95
    warning_percent: 85
  cpu:
    critical_percent: 95
    critical_duration: 10m
    warning_percent: 80
    warning_duration: 15m
```
### Integration with ArgoCD

Follows the existing GitOps pattern (charts/kube-prometheus-stack). Can be added to ArgoCD for automated deployment if desired.

## Testing and Verification

### Phase 1 - Workstation Deployment
1. Verify node_exporter installation:

   ```bash
   pacman -Q prometheus-node-exporter
   ```

2. Check systemd service status:

   ```bash
   systemctl status node_exporter
   ```

3. Verify the metrics endpoint locally:

   ```bash
   curl -s http://localhost:9100/metrics | head -20
   ```

4. Test accessibility from the cluster:

   ```bash
   kubectl run -it --rm debug --image=curlimages/curl --restart=Never -- curl -s willlaptop:9100/metrics
   ```
### Phase 2 - Prometheus Integration

1. Verify the Prometheus target:
   - Access Prometheus UI → Status → Targets → workstation/willlaptop
   - Confirm the target is UP

2. Verify metric ingestion:

   ```promql
   # Query in the Prometheus UI; `up` is 1 while the scrape succeeds
   up{instance="willlaptop"}
   ```

3. Verify label injection:
   - Confirm the `env="workstation"` label appears on scraped metrics
### Phase 3 - Alert Verification

1. Review the PrometheusRule:

   ```bash
   kubectl get prometheusrule workstation-alerts -n monitoring -o yaml
   ```

2. Verify rule evaluation:
   - Access Prometheus UI → Rules
   - Confirm the workstation rules are active

3. Test a critical alert:
   - Temporarily trigger a low-disk alert (or simulate one)
   - Verify the alert fires in the Prometheus UI

4. Verify Alertmanager integration:
   - Check Alertmanager UI → Alerts
   - Confirm workstation alerts are received
## Success Criteria

- [ ] node_exporter running on the workstation
- [ ] Metrics accessible from cluster nodes
- [ ] Prometheus scraping workstation metrics
- [ ] Alert rules evaluated and firing correctly
- [ ] Alerts routing through Alertmanager
- [ ] Configuration versioned in the homelab repository
- [ ] Documentation complete

## Future Enhancements

- Grafana dashboards for workstation metrics
- Alert tuning based on observed patterns
- Additional collectors (e.g., temperature sensors if available)
- Integration with the morning-report skill for health status