Add workstation monitoring design
296
docs/plans/2026-01-05-workstation-monitoring-design.md
Normal file
@@ -0,0 +1,296 @@
# Workstation Monitoring Design

## Overview

Deploy comprehensive monitoring for the Arch Linux workstation (willlaptop) by integrating with the existing k8s monitoring stack. This will enable proactive alerting for resource exhaustion, long-term capacity planning, and performance debugging.

**Reference:** Future consideration `fc-001` (workstation monitoring)

## Current Infrastructure

- **Workstation:** Arch Linux on MacBookPro9,2 (hostname: willlaptop)
- **K8s Cluster:** kube-prometheus-stack deployed with Prometheus, Alertmanager, Grafana
- **Network:** Direct network connectivity between workstation and cluster nodes
- **Existing Monitoring:** 3 node_exporters on cluster nodes, cluster-level alerts configured

## Architecture

### Components

```
┌─────────────────┐      HTTP/9100       ┌──────────────────────┐
│  Workstation    │ ──────────────────>  │  K8s Prometheus      │
│  (willlaptop)   │   scrape every 15s   │  (monitoring ns)     │
│                 │                      │                      │
│  node_exporter  │                      │  workstation rules   │
│  systemd service│                      │  + scrape config     │
└─────────────────┘                      └──────────────────────┘
                                                     │
                                                     v
                                         ┌──────────────────────┐
                                         │  Alertmanager        │
                                         │  (existing setup)    │
                                         │  unified routing     │
                                         └──────────────────────┘
```

### Data Flow

1. **node_exporter** exposes metrics on `http://willlaptop:9100/metrics`
2. **Prometheus** scrapes metrics every 15s via static target configuration
3. **PrometheusRule** evaluates workstation-specific alert rules
4. **Alertmanager** routes alerts to existing notification channels

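
Step 4 relies on the routing already configured in Alertmanager. As a rough sketch only, a severity-based route tree for workstation alerts could look like the fragment below. The receiver names (`pager`, `email-slack`, `"null"`) are hypothetical placeholders, since the existing receivers are not listed in this design; in kube-prometheus-stack such a fragment would typically live under the `alertmanager.config` chart value.

```yaml
# Sketch of an Alertmanager route tree keyed on the `env` and `severity` labels.
# All receiver names are hypothetical placeholders.
route:
  receiver: email-slack            # fallback when no child route matches
  group_by: [alertname, instance]
  routes:
    - matchers:
        - env = "workstation"
        - severity = "critical"
      receiver: pager
    - matchers:
        - env = "workstation"
        - severity = "warning"
      receiver: email-slack
    - matchers:
        - env = "workstation"
        - severity = "info"
      receiver: "null"             # no notifier configured, so nothing is sent
receivers:
  - name: pager                    # notifier configs omitted in this sketch
  - name: email-slack
  - name: "null"
```

The `env = "workstation"` matcher keeps these routes from affecting the cluster-level alerts that are already configured.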
## Workstation Deployment

### node_exporter Service

**Installation:**
```bash
sudo pacman -S prometheus-node-exporter
```

**Systemd Configuration:**
- Service: `prometheus-node-exporter.service` (the unit name shipped by the Arch package)
- User: `node_exporter` (created by package)
- Listen address: `0.0.0.0:9100`
- Restart policy: `always` with 10s delay
- Logging: systemd journal (`journalctl -u prometheus-node-exporter`)

**ExecStart flags:**
```bash
/usr/bin/prometheus-node-exporter --collector.systemd \
  --collector.filesystem.mount-points-exclude='^/(sys|proc|dev|host|etc)($|/)'
```

Excludes system mounts to reduce noise. The regex is quoted so that `|`, `(`, and `$` are not interpreted if the flags pass through a shell. `--collector.systemd` is included because the systemd collector is disabled by default and the `WorkstationSystemdFailed` alert depends on it.

**Firewall Configuration:**
- Allow TCP 9100 from cluster nodes
- Use ufw or iptables to restrict access

**Metrics Collected:**
All default collectors except resource-intensive ones, plus the opt-in systemd collector:
- CPU, memory, filesystem, network
- System stats (uptime, load average, systemd)
- Thermal (if available on hardware)
- Disk I/O

## Prometheus Configuration

### Static Scrape Target

**Job configuration:**
- Job name: `workstation/willlaptop`
- Target: `willlaptop:9100` (DNS resolution) or workstation IP
- Scrape interval: `15s` (matches cluster node_exporter)
- Scrape timeout: `10s`
- Metrics path: `/metrics`
- Honor labels: `true`

**Relabeling rules:**
- Add `env: "workstation"` label for identification
- Set `instance: "willlaptop"` explicitly (without relabeling, Prometheus derives `instance` from the target address, i.e. `willlaptop:9100`)

**Integration:**
Add to the existing Prometheus CRD configuration in kube-prometheus-stack. A PrometheusRule cannot carry scrape configuration, so this can be done via:
- `additionalScrapeConfigs` on the Prometheus resource (`prometheus.prometheusSpec.additionalScrapeConfigs` in the chart values), or
- Direct modification of Prometheus configuration

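
The job settings above could be expressed as a kube-prometheus-stack values fragment roughly like the following. This is a sketch under the assumption that the stack is configured through Helm values and that `willlaptop` resolves from inside the cluster; substitute the workstation IP otherwise.

```yaml
# Sketch: values fragment for kube-prometheus-stack (assumes Helm-managed stack).
prometheus:
  prometheusSpec:
    additionalScrapeConfigs:
      - job_name: workstation/willlaptop
        scrape_interval: 15s
        scrape_timeout: 10s
        metrics_path: /metrics
        honor_labels: true
        static_configs:
          - targets: ["willlaptop:9100"]   # or "<workstation_ip>:9100"
            labels:
              env: workstation             # identification label from the design
        relabel_configs:
          # Overwrite the default instance ("willlaptop:9100") with the bare hostname.
          - target_label: instance
            replacement: willlaptop
```

With the relabel rule in place, queries and alert annotations can refer to `instance="willlaptop"` without the port suffix.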
## Alert Rules

### PrometheusRule Resource

**Namespace:** `monitoring`
**Kind:** `PrometheusRule`
**Labels:** Standard discovery labels for Prometheus operator

### Alert Categories

#### Critical Alerts (Paging)

1. **WorkstationDiskSpaceCritical**
   - Condition: `<5%` free on any mounted filesystem
   - Duration: 5m
   - Severity: critical

2. **WorkstationMemoryCritical**
   - Condition: `>95%` memory usage
   - Duration: 5m
   - Severity: critical

3. **WorkstationCPUCritical**
   - Condition: `>95%` CPU usage
   - Duration: 10m
   - Severity: critical

4. **WorkstationSystemdFailed**
   - Condition: Failed systemd units detected
   - Duration: 5m
   - Severity: critical

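
A minimal sketch of how the four critical alerts might be written as a PrometheusRule follows. The PromQL expressions are one plausible reading of the conditions above, not a settled implementation. The `release` label value is an assumption and has to match whatever rule selector the existing Prometheus resource uses; the `fstype` exclusions are likewise illustrative.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: workstation-alerts
  namespace: monitoring
  labels:
    release: kube-prometheus-stack   # assumption: must match the operator's ruleSelector
spec:
  groups:
    - name: workstation.critical
      rules:
        - alert: WorkstationDiskSpaceCritical
          # Percentage of space available per filesystem, ignoring in-memory mounts.
          expr: |
            node_filesystem_avail_bytes{env="workstation",fstype!~"tmpfs|devtmpfs"}
              / node_filesystem_size_bytes{env="workstation",fstype!~"tmpfs|devtmpfs"} * 100 < 5
          for: 5m
          labels:
            severity: critical
        - alert: WorkstationMemoryCritical
          expr: |
            (1 - node_memory_MemAvailable_bytes{env="workstation"}
               / node_memory_MemTotal_bytes{env="workstation"}) * 100 > 95
          for: 5m
          labels:
            severity: critical
        - alert: WorkstationCPUCritical
          # Non-idle CPU share averaged over all cores.
          expr: |
            (1 - avg by (instance) (rate(node_cpu_seconds_total{env="workstation",mode="idle"}[5m]))) * 100 > 95
          for: 10m
          labels:
            severity: critical
        - alert: WorkstationSystemdFailed
          # Requires node_exporter to run with --collector.systemd.
          expr: node_systemd_unit_state{env="workstation",state="failed"} == 1
          for: 5m
          labels:
            severity: critical
```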
#### Warning Alerts (Email/Slack)

1. **WorkstationDiskSpaceWarning**
   - Condition: `<10%` free on any mounted filesystem
   - Duration: 10m
   - Severity: warning

2. **WorkstationMemoryWarning**
   - Condition: `>85%` memory usage
   - Duration: 10m
   - Severity: warning

3. **WorkstationCPUWarning**
   - Condition: `>80%` CPU usage
   - Duration: 15m
   - Severity: warning

4. **WorkstationLoadHigh**
   - Condition: 5-minute load average exceeds the number of CPU cores
   - Duration: 10m
   - Severity: warning

5. **WorkstationDiskInodeWarning**
   - Condition: `<10%` inodes free
   - Duration: 10m
   - Severity: warning

6. **WorkstationNetworkErrors**
   - Condition: High packet drop or error rate on a network interface
   - Duration: 10m
   - Severity: warning

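
The disk, memory, and CPU warnings follow the same shape as their critical counterparts with different thresholds. The three warning-only alerts are less obvious, so here is a hedged sketch of a rule group for them, meant to sit under `spec.groups` of the PrometheusRule. The network threshold of 1 error per second is a placeholder assumption, since the design only says "high".

```yaml
# Sketch: one entry for spec.groups of the workstation PrometheusRule.
- name: workstation.warning
  rules:
    - alert: WorkstationLoadHigh
      # count(...) over the idle-mode series yields the number of CPU cores.
      expr: |
        node_load5{env="workstation"}
          > on (instance) count by (instance) (node_cpu_seconds_total{env="workstation",mode="idle"})
      for: 10m
      labels:
        severity: warning
    - alert: WorkstationDiskInodeWarning
      expr: |
        node_filesystem_files_free{env="workstation"}
          / node_filesystem_files{env="workstation"} * 100 < 10
      for: 10m
      labels:
        severity: warning
    - alert: WorkstationNetworkErrors
      # Placeholder threshold: more than 1 receive+transmit error per second.
      expr: |
        rate(node_network_receive_errs_total{env="workstation"}[5m])
          + rate(node_network_transmit_errs_total{env="workstation"}[5m]) > 1
      for: 10m
      labels:
        severity: warning
```

The `on (instance)` modifier is needed in the load rule because the aggregated core count keeps only the `instance` label, while `node_load5` still carries `job` and `env`.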
#### Info Alerts (Log Only)

1. **WorkstationDiskSpaceInfo**
   - Condition: `<20%` free on any mounted filesystem
   - Duration: 15m
   - Severity: info

2. **WorkstationUptime**
   - Condition: System uptime metric (recording rule)
   - Severity: info

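
Since `WorkstationUptime` is described as a recording rule and not an alert, it could be sketched as the rule group below. The record name `workstation:uptime_seconds` is a hypothetical choice, not something the design prescribes.

```yaml
# Sketch: recording rule group for spec.groups of the workstation PrometheusRule.
- name: workstation.recording
  rules:
    - record: workstation:uptime_seconds        # hypothetical series name
      # Seconds since boot, derived from node_exporter's boot timestamp.
      expr: time() - node_boot_time_seconds{env="workstation"}
```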
### Alert Annotations

Each alert includes:
- `summary`: Brief description
- `description`: Detailed explanation with metric values
- `runbook_url`: Link to troubleshooting documentation (if available)

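
As an illustration of these three annotations, the fragment below shows what the `annotations` block of a disk-space alert might look like, using Prometheus alert templating to include metric values. The wording and the runbook URL are placeholders.

```yaml
# Sketch: annotations block for a disk-space alert rule.
annotations:
  summary: "Low disk space on {{ $labels.instance }}"
  description: >-
    Filesystem {{ $labels.mountpoint }} on {{ $labels.instance }} has
    {{ printf "%.1f" $value }}% free space left.
  runbook_url: https://example.invalid/runbooks/workstation-disk   # placeholder URL
```

If the rule file is rendered through Helm templating, the `{{ ... }}` expressions would need to be escaped so that Helm passes them through to Prometheus unchanged.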
## Versioning

### Repository Structure

```
~/.claude/repos/homelab/charts/willlaptop-monitoring/
├── prometheus-rules.yaml    # PrometheusRule for workstation alerts
├── values.yaml              # Configuration values
└── README.md                # Documentation
```

### Values.yaml Configuration

Configurable parameters:
```yaml
workstation:
  hostname: willlaptop
  ip: <workstation_ip>  # optional, fallback to DNS

scrape:
  interval: 15s
  timeout: 10s

alerts:
  disk:
    critical_percent: 5
    warning_percent: 10
    info_percent: 20
  memory:
    critical_percent: 95
    warning_percent: 85
  cpu:
    critical_percent: 95
    critical_duration: 10m
    warning_percent: 80
    warning_duration: 15m
```

### Integration with ArgoCD

Follows existing GitOps pattern (charts/kube-prometheus-stack). Can be added to ArgoCD for automated deployments if desired.

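
If the ArgoCD route is taken, an Application manifest along the following lines would be one way to do it. This is a sketch: the repository URL, the `argocd` namespace, the `default` project, and the automated sync policy are all assumptions, and only the chart path comes from the repository structure above.

```yaml
# Sketch: ArgoCD Application for the workstation monitoring chart.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: willlaptop-monitoring
  namespace: argocd                      # assumption: ArgoCD's own namespace
spec:
  project: default                       # assumption
  source:
    repoURL: https://git.example.invalid/homelab.git   # hypothetical repository URL
    targetRevision: HEAD
    path: charts/willlaptop-monitoring
  destination:
    server: https://kubernetes.default.svc
    namespace: monitoring
  syncPolicy:
    automated:                           # assumption: drop this block for manual syncs
      prune: true
      selfHeal: true
```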
## Testing and Verification

### Phase 1 - Workstation Deployment

1. Verify node_exporter installation:
   ```bash
   pacman -Q prometheus-node-exporter
   ```

2. Check systemd service status:
   ```bash
   systemctl status prometheus-node-exporter
   ```

3. Verify metrics endpoint locally:
   ```bash
   curl -s http://localhost:9100/metrics | head -20
   ```

4. Test accessibility from cluster:
   ```bash
   kubectl run -it --rm debug --restart=Never --image=curlimages/curl -- curl -s willlaptop:9100/metrics
   ```

### Phase 2 - Prometheus Integration

1. Verify Prometheus target:
   - Access Prometheus UI → Targets → workstation/willlaptop
   - Confirm target is UP

2. Verify metric ingestion:
   ```promql
   # Query in Prometheus UI; a value of 1 means the last scrape succeeded
   up{job="workstation/willlaptop", instance="willlaptop"}
   ```

3. Verify label injection:
   - Confirm `env="workstation"` label appears on metrics

### Phase 3 - Alert Verification

1. Review PrometheusRule:
   ```bash
   kubectl get prometheusrule workstation-alerts -n monitoring -o yaml
   ```

2. Verify rule evaluation:
   - Access Prometheus UI → Rules
   - Confirm workstation rules are active

3. Test critical alert:
   - Temporarily trigger a low disk alert (or simulate)
   - Verify alert fires in Prometheus UI

4. Verify Alertmanager integration:
   - Check Alertmanager UI → Alerts
   - Confirm workstation alerts are received

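
For the "simulate" option in Phase 3, one approach that avoids actually filling the disk is a `promtool test rules` unit test. The sketch below assumes a plain Prometheus rule file named `workstation-rules.yaml` (hypothetical; the `groups:` content of the PrometheusRule extracted to a standalone file) containing a `WorkstationDiskSpaceCritical` rule that compares available to total filesystem bytes against 5%, with `for: 5m`, a `severity: critical` label, and no annotations. If the rule carries annotations, matching `exp_annotations` have to be added for the test to pass.

```yaml
# Run with: promtool test rules workstation-rules-test.yaml   (file names are hypothetical)
rule_files:
  - workstation-rules.yaml
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      # 4 of 100 bytes available = 4% free, held constant from 0m to 10m.
      - series: 'node_filesystem_avail_bytes{instance="willlaptop",env="workstation",mountpoint="/",fstype="ext4"}'
        values: '4+0x10'
      - series: 'node_filesystem_size_bytes{instance="willlaptop",env="workstation",mountpoint="/",fstype="ext4"}'
        values: '100+0x10'
    alert_rule_test:
      # After 10 minutes the 5m "for" duration has elapsed, so the alert should be firing.
      - eval_time: 10m
        alertname: WorkstationDiskSpaceCritical
        exp_alerts:
          - exp_labels:
              severity: critical
              instance: willlaptop
              env: workstation
              mountpoint: /
              fstype: ext4
```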
## Success Criteria

- [ ] node_exporter running on workstation
- [ ] Metrics accessible from cluster nodes
- [ ] Prometheus scraping workstation metrics
- [ ] Alert rules evaluated and firing correctly
- [ ] Alerts routing through Alertmanager
- [ ] Configuration versioned in homelab repository
- [ ] Documentation complete

## Future Enhancements

- Grafana dashboards for workstation metrics
- Alert tuning based on observed patterns
- Additional collectors (e.g., temperature sensors if available)
- Integration with morning-report skill for health status