Add workstation monitoring design

OpenCode Test
2026-01-05 01:31:10 -08:00
parent f3cb082c36
commit 62050faedc

# Workstation Monitoring Design
## Overview
Deploy comprehensive monitoring for the Arch Linux workstation (willlaptop) by integrating with the existing k8s monitoring stack. This will enable proactive alerting for resource exhaustion, long-term capacity planning, and performance debugging.
**Reference:** Future consideration `fc-001` (workstation monitoring)
## Current Infrastructure
- **Workstation:** Arch Linux on MacBookPro9,2 (hostname: willlaptop)
- **K8s Cluster:** kube-prometheus-stack deployed with Prometheus, Alertmanager, Grafana
- **Network:** Direct network connectivity between workstation and cluster nodes
- **Existing Monitoring:** 3 node_exporters on cluster nodes, cluster-level alerts configured
## Architecture
### Components
```
┌─────────────────┐     HTTP/9100      ┌──────────────────────┐
│  Workstation    │ ─────────────────> │   K8s Prometheus     │
│  (willlaptop)   │  scrape every 15s  │   (monitoring ns)    │
│                 │                    │                      │
│  node_exporter  │                    │  workstation rules   │
│ systemd service │                    │  + scrape config     │
└─────────────────┘                    └──────────┬───────────┘
                                                  │
                                                  v
                                       ┌──────────────────────┐
                                       │    Alertmanager      │
                                       │   (existing setup)   │
                                       │   unified routing    │
                                       └──────────────────────┘
```
### Data Flow
1. **node_exporter** exposes metrics on `http://willlaptop:9100/metrics`
2. **Prometheus** scrapes metrics every 15s via static target configuration
3. **PrometheusRule** evaluates workstation-specific alert rules
4. **Alertmanager** routes alerts to existing notification channels
## Workstation Deployment
### node_exporter Service
**Installation:**
```bash
pacman -S prometheus-node-exporter
```
**Systemd Configuration:**
- Service: `prometheus-node-exporter.service` (unit name as packaged on Arch)
- User: `node_exporter` (dedicated unprivileged user created by the package)
- Listen address: `0.0.0.0:9100`
- Restart policy: `always` with 10s delay
- Logging: systemd journal (`journalctl -u prometheus-node-exporter`)
**ExecStart flags:**
```bash
/usr/bin/prometheus-node-exporter --collector.filesystem.mount-points-exclude='^/(sys|proc|dev|host|etc)($|/)'
```
Excludes system mounts to reduce noise.
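One way to apply these flags is a systemd drop-in, which survives package upgrades. A minimal sketch, assuming the Arch unit name:
```bash
# create a drop-in that overrides the packaged ExecStart
mkdir -p /etc/systemd/system/prometheus-node-exporter.service.d
cat > /etc/systemd/system/prometheus-node-exporter.service.d/override.conf <<'EOF'
[Service]
# clear the packaged ExecStart before setting our own
ExecStart=
ExecStart=/usr/bin/prometheus-node-exporter \
    --web.listen-address=0.0.0.0:9100 \
    --collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($|/)
Restart=always
RestartSec=10
EOF
systemctl daemon-reload && systemctl restart prometheus-node-exporter
```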
**Firewall Configuration:**
- Allow TCP 9100 from cluster nodes
- Use ufw or iptables to restrict access (example below)
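For example, with ufw (the 192.168.1.0/24 cluster subnet is an assumption; substitute the real node subnet):
```bash
# allow Prometheus scrapes from cluster nodes only
ufw allow from 192.168.1.0/24 to any port 9100 proto tcp
# explicit deny for everything else on 9100 (redundant if the default inbound policy is deny)
ufw deny 9100/tcp
```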
**Metrics Collected:**
All default collectors except resource-intensive ones:
- CPU, memory, filesystem, network
- System stats (uptime, load average, systemd)
- Thermal (if available on hardware)
- Disk I/O
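If any collector proves heavy on this hardware, node_exporter can disable collectors individually; the flags below are illustrative:
```bash
# each default collector has a matching --no-collector.<name> switch
/usr/bin/prometheus-node-exporter --no-collector.nfs --no-collector.infiniband
```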
## Prometheus Configuration
### Static Scrape Target
**Job configuration:**
- Job name: `workstation/willlaptop`
- Target: `willlaptop:9100` (DNS resolution) or workstation IP
- Scrape interval: `15s` (matches cluster node_exporter)
- Scrape timeout: `10s`
- Metrics path: `/metrics`
- Honor labels: `true`
**Relabeling rules:**
- Add `env: "workstation"` label for identification
- Preserve `instance: "willlaptop"` from target
**Integration:**
Add to the existing Prometheus configuration in kube-prometheus-stack. This can be done via:
- `additionalScrapeConfigs` in the Helm values (rendered into a Secret by the chart), or
- a `ScrapeConfig` custom resource, where the operator version supports it
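A sketch of the corresponding `additionalScrapeConfigs` entry in the kube-prometheus-stack Helm values, mirroring the settings above:
```yaml
prometheus:
  prometheusSpec:
    additionalScrapeConfigs:
      - job_name: workstation/willlaptop
        scrape_interval: 15s
        scrape_timeout: 10s
        metrics_path: /metrics
        honor_labels: true
        static_configs:
          - targets: ["willlaptop:9100"]   # or the workstation IP
            labels:
              env: workstation
        relabel_configs:
          # strip the port so instance reads "willlaptop"
          - target_label: instance
            replacement: willlaptop
```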
## Alert Rules
### PrometheusRule Resource
**Namespace:** `monitoring`
**Kind:** `PrometheusRule`
**Labels:** Standard discovery labels for Prometheus operator
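A skeleton of the resource, shown with the first critical alert from the categories below; the `release: kube-prometheus-stack` label is an assumption and must match the operator's ruleSelector:
```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: workstation-alerts
  namespace: monitoring
  labels:
    release: kube-prometheus-stack   # assumed discovery label
spec:
  groups:
    - name: workstation.critical
      rules:
        - alert: WorkstationDiskSpaceCritical
          expr: |
            100 * node_filesystem_avail_bytes{env="workstation"}
              / node_filesystem_size_bytes{env="workstation"} < 5
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Disk space critical on {{ $labels.instance }}"
            description: "{{ $labels.mountpoint }} has less than 5% free."
```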
### Alert Categories
#### Critical Alerts (Paging)
1. **WorkstationDiskSpaceCritical**
- Condition: `<5%` free on any mounted filesystem
- Duration: 5m
- Severity: critical
2. **WorkstationMemoryCritical**
- Condition: `>95%` memory usage
- Duration: 5m
- Severity: critical
3. **WorkstationCPUCritical**
- Condition: `>95%` CPU usage
- Duration: 10m
- Severity: critical
4. **WorkstationSystemdFailed**
- Condition: Failed systemd units detected
- Duration: 5m
- Severity: critical
#### Warning Alerts (Email/Slack)
1. **WorkstationDiskSpaceWarning**
- Condition: `<10%` free on any mounted filesystem
- Duration: 10m
- Severity: warning
2. **WorkstationMemoryWarning**
- Condition: `>85%` memory usage
- Duration: 10m
- Severity: warning
3. **WorkstationCPUWarning**
- Condition: `>80%` CPU usage
- Duration: 15m
- Severity: warning
4. **WorkstationLoadHigh**
- Condition: 5m load average exceeds the number of CPU cores (see the PromQL sketch after this list)
- Duration: 10m
- Severity: warning
5. **WorkstationDiskInodeWarning**
- Condition: `<10%` inodes free
- Duration: 10m
- Severity: warning
6. **WorkstationNetworkErrors**
- Condition: High packet loss or error rate
- Duration: 10m
- Severity: warning
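The load alert compares against the machine's core count rather than a fixed threshold; a PromQL sketch of its condition:
```bash
# Query in Prometheus UI: WorkstationLoadHigh condition
node_load5{env="workstation"}
  > scalar(count(node_cpu_seconds_total{env="workstation", mode="idle"}))
```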
#### Info Alerts (Log Only)
1. **WorkstationDiskSpaceInfo**
- Condition: `<20%` free on any mounted filesystem
- Duration: 15m
- Severity: info
2. **WorkstationUptime**
- Condition: System uptime metric (recording rule)
- Severity: info
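A sketch of the uptime recording rule (the rule name is illustrative):
```yaml
- record: workstation:uptime_seconds
  expr: time() - node_boot_time_seconds{env="workstation"}
```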
### Alert Annotations
Each alert includes:
- `summary`: Brief description
- `description`: Detailed explanation with metric values
- `runbook_url`: Link to troubleshooting documentation (if available)
## Versioning
### Repository Structure
```
~/.claude/repos/homelab/charts/willlaptop-monitoring/
├── prometheus-rules.yaml # PrometheusRule for workstation alerts
├── values.yaml # Configuration values
└── README.md # Documentation
```
### Values.yaml Configuration
Configurable parameters:
```yaml
workstation:
  hostname: willlaptop
  ip: <workstation_ip>   # optional, fallback to DNS
scrape:
  interval: 15s
  timeout: 10s
alerts:
  disk:
    critical_percent: 5
    warning_percent: 10
    info_percent: 20
  memory:
    critical_percent: 95
    warning_percent: 85
  cpu:
    critical_percent: 95
    critical_duration: 10m
    warning_percent: 80
    warning_duration: 15m
```
### Integration with ArgoCD
Follows existing GitOps pattern (charts/kube-prometheus-stack). Can be added to ArgoCD for automated deployments if desired.
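A sketch of the corresponding ArgoCD Application; the repo URL placeholder and project name are assumptions to be filled from the existing homelab setup:
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: willlaptop-monitoring
  namespace: argocd
spec:
  project: default                  # assumed project
  source:
    repoURL: <homelab_repo_url>     # existing homelab repository
    path: charts/willlaptop-monitoring
    targetRevision: main
  destination:
    server: https://kubernetes.default.svc
    namespace: monitoring
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
```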
## Testing and Verification
### Phase 1 - Workstation Deployment
1. Verify node_exporter installation:
```bash
pacman -Q prometheus-node-exporter
```
2. Check systemd service status:
```bash
systemctl status node_exporter
```
3. Verify metrics endpoint locally:
```bash
curl http://localhost:9100/metrics | head -20
```
4. Test accessibility from cluster:
```bash
# use the workstation IP if cluster DNS cannot resolve the hostname
kubectl run -it --rm debug --image=curlimages/curl --restart=Never -- curl willlaptop:9100/metrics
```
### Phase 2 - Prometheus Integration
1. Verify Prometheus target:
- Access Prometheus UI → Targets → workstation/willlaptop
- Confirm target is UP
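If the UI is not exposed via an ingress, a port-forward works; the service name assumes a default kube-prometheus-stack release name:
```bash
kubectl -n monitoring port-forward svc/kube-prometheus-stack-prometheus 9090:9090
# then browse http://localhost:9090/targets
```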
2. Verify metric ingestion:
```bash
# Query in Prometheus UI
up{job="workstation/willlaptop"}
```
3. Verify label injection:
- Confirm `env="workstation"` label appears on metrics
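For example, a cheap single-series query that should carry the injected label:
```bash
# Query in Prometheus UI
node_uname_info{env="workstation"}
```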
### Phase 3 - Alert Verification
1. Review PrometheusRule:
```bash
kubectl get prometheusrule workstation-alerts -n monitoring -o yaml
```
2. Verify rule evaluation:
- Access Prometheus UI → Rules
- Confirm workstation rules are active
3. Test critical alert:
- Temporarily trigger a low disk alert, or simulate one (see the sketch after this list)
- Verify alert fires in Prometheus UI
4. Verify Alertmanager integration:
- Check Alertmanager UI → Alerts
- Confirm workstation alerts are received
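One way to simulate the low-disk condition for step 3 (path and size are illustrative; clean up afterwards):
```bash
# consume space on the target filesystem until free space drops below 5%
fallocate -l 50G /var/tmp/alert-test.fill
# wait out the alert's 5m "for" window, confirm it fires, then clean up
rm /var/tmp/alert-test.fill
```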
## Success Criteria
- [ ] node_exporter running on workstation
- [ ] Metrics accessible from cluster nodes
- [ ] Prometheus scraping workstation metrics
- [ ] Alert rules evaluated and firing correctly
- [ ] Alerts routing through Alertmanager
- [ ] Configuration versioned in homelab repository
- [ ] Documentation complete
## Future Enhancements
- Grafana dashboards for workstation metrics
- Alert tuning based on observed patterns
- Additional collectors (e.g., temperature sensors if available)
- Integration with morning-report skill for health status