
Workstation Monitoring Design

Overview

Deploy comprehensive monitoring for the Arch Linux workstation (willlaptop) by integrating with the existing k8s monitoring stack. This will enable proactive alerting for resource exhaustion, long-term capacity planning, and performance debugging.

Reference: Future consideration fc-001 (workstation monitoring)

Current Infrastructure

  • Workstation: Arch Linux on MacBookPro9,2 (hostname: willlaptop)
  • K8s Cluster: kube-prometheus-stack deployed with Prometheus, Alertmanager, Grafana
  • Network: Direct network connectivity between workstation and cluster nodes
  • Existing Monitoring: 3 node_exporters on cluster nodes, cluster-level alerts configured

Architecture

Components

┌─────────────────┐     HTTP/9100      ┌──────────────────────┐
│   Workstation   │ ─────────────────> │    K8s Prometheus    │
│   (willlaptop)  │  scrape every 15s  │    (monitoring ns)   │
│                 │                    │                      │
│  node_exporter  │                    │  workstation rules   │
│  systemd service│                    │  + scrape config     │
└─────────────────┘                    └──────────────────────┘
                                                   │
                                                   v
                                       ┌──────────────────────┐
                                       │  Alertmanager        │
                                       │  (existing setup)    │
                                       │  unified routing     │
                                       └──────────────────────┘

Data Flow

  1. node_exporter exposes metrics on http://willlaptop:9100/metrics
  2. Prometheus scrapes metrics every 15s via static target configuration
  3. PrometheusRule evaluates workstation-specific alert rules
  4. Alertmanager routes alerts to existing notification channels
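For illustration, the metrics exposed at step 1 look like the following (real node_exporter metric names; the values are made up):

node_memory_MemAvailable_bytes 8.123456e+09
node_filesystem_avail_bytes{device="/dev/sda2",fstype="ext4",mountpoint="/"} 4.2e+10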

Workstation Deployment

node_exporter Service

Installation:

pacman -S prometheus-node-exporter

Systemd Configuration:

  • Service: prometheus-node-exporter.service (the unit name shipped by the Arch package)
  • User: node_exporter (created by the package)
  • Listen address: 0.0.0.0:9100
  • Restart policy: always with 10s delay
  • Logging: systemd journal (journalctl -u prometheus-node-exporter)

ExecStart flags:

/usr/bin/prometheus-node-exporter --collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($|/)

Excludes system mounts to reduce noise.
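These settings can be carried in a systemd drop-in. A minimal sketch, assuming the unit shipped by the Arch package allows its ExecStart to be overridden this way:

# /etc/systemd/system/prometheus-node-exporter.service.d/override.conf
[Service]
# Reset the packaged ExecStart before replacing it
ExecStart=
ExecStart=/usr/bin/prometheus-node-exporter \
  --web.listen-address=0.0.0.0:9100 \
  --collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($|/)
Restart=always
RestartSec=10

Apply with systemctl daemon-reload, then restart the service.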

Firewall Configuration:

  • Allow TCP 9100 from cluster nodes
  • Use ufw or iptables to restrict access
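A sketch using ufw; the CIDR 10.0.0.0/24 is a placeholder for the actual cluster node subnet:

# allow scrapes from cluster nodes only (ufw's default incoming policy denies the rest)
ufw allow from 10.0.0.0/24 to any port 9100 proto tcp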

Metrics Collected: All default collectors, with resource-intensive ones disabled:

  • CPU, memory, filesystem, network
  • System stats (uptime, load average, systemd)
  • Thermal (if available on hardware)
  • Disk I/O

Prometheus Configuration

Static Scrape Target

Job configuration:

  • Job name: workstation/willlaptop
  • Target: willlaptop:9100 (DNS resolution) or workstation IP
  • Scrape interval: 15s (matches cluster node_exporter)
  • Scrape timeout: 10s
  • Metrics path: /metrics
  • Honor labels: true

Relabeling rules:

  • Add env: "workstation" label for identification
  • Set instance: "willlaptop" via target labels (otherwise Prometheus defaults it to the host:port address)

Integration: Add to the existing Prometheus configuration in kube-prometheus-stack. This can be done via:

  • additionalScrapeConfigs in the Helm values (rendered into the Prometheus CRD), or
  • a ScrapeConfig custom resource, on operator versions that support it
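A sketch of the Helm-values route; the env and instance labels implement the relabeling intent above (setting instance explicitly keeps Prometheus from defaulting it to willlaptop:9100):

prometheus:
  prometheusSpec:
    additionalScrapeConfigs:
      - job_name: workstation/willlaptop
        scrape_interval: 15s
        scrape_timeout: 10s
        metrics_path: /metrics
        honor_labels: true
        static_configs:
          - targets: ["willlaptop:9100"]
            labels:
              env: workstation
              instance: willlaptop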

Alert Rules

PrometheusRule Resource

  • Namespace: monitoring
  • Kind: PrometheusRule
  • Labels: standard discovery labels for the Prometheus operator
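A skeleton of the resource; the release label is an assumption and must match the ruleSelector of the deployed Prometheus:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: workstation-alerts
  namespace: monitoring
  labels:
    release: kube-prometheus-stack  # assumed discovery label
spec:
  groups:
    - name: workstation.rules
      rules: []  # populated with the alerts below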

Alert Categories

Critical Alerts (Paging)

  1. WorkstationDiskSpaceCritical (sketched after this list)

    • Condition: <5% free on any mounted filesystem
    • Duration: 5m
    • Severity: critical
  2. WorkstationMemoryCritical

    • Condition: >95% memory usage
    • Duration: 5m
    • Severity: critical
  3. WorkstationCPUCritical

    • Condition: >95% CPU usage
    • Duration: 10m
    • Severity: critical
  4. WorkstationSystemdFailed

    • Condition: Failed systemd units detected
    • Duration: 5m
    • Severity: critical
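As a sketch, the disk-space alert might be expressed as follows; the tmpfs/overlay exclusion is an added assumption to keep virtual filesystems out of the alert:

- alert: WorkstationDiskSpaceCritical
  expr: |
    (node_filesystem_avail_bytes{env="workstation",fstype!~"tmpfs|overlay"}
      / node_filesystem_size_bytes{env="workstation",fstype!~"tmpfs|overlay"}) * 100 < 5
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Disk space critically low on {{ $labels.instance }}"
    description: "{{ $labels.mountpoint }} has {{ $value | humanize }}% free."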

Warning Alerts (Email/Slack)

  1. WorkstationDiskSpaceWarning

    • Condition: <10% free on any mounted filesystem
    • Duration: 10m
    • Severity: warning
  2. WorkstationMemoryWarning

    • Condition: >85% memory usage
    • Duration: 10m
    • Severity: warning
  3. WorkstationCPUWarning (sketched after this list)

    • Condition: >80% CPU usage
    • Duration: 15m
    • Severity: warning
  4. WorkstationLoadHigh

    • Condition: 5-minute load average exceeds the CPU core count
    • Duration: 10m
    • Severity: warning
  5. WorkstationDiskInodeWarning

    • Condition: <10% inodes free
    • Duration: 10m
    • Severity: warning
  6. WorkstationNetworkErrors

    • Condition: High packet loss or error rate
    • Duration: 10m
    • Severity: warning
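One plausible expression for the CPU warning, derived from the idle counter and averaged across cores:

- alert: WorkstationCPUWarning
  expr: |
    100 - (avg by (instance) (rate(node_cpu_seconds_total{env="workstation",mode="idle"}[5m])) * 100) > 80
  for: 15m
  labels:
    severity: warning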

Info Alerts (Log Only)

  1. WorkstationDiskSpaceInfo

    • Condition: <20% free on any mounted filesystem
    • Duration: 15m
    • Severity: info
  2. WorkstationUptime

    • Condition: System uptime metric (recording rule)
    • Severity: info
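The uptime recording rule could be as simple as (the rule name is a suggestion):

- record: workstation:uptime_seconds
  expr: time() - node_boot_time_seconds{env="workstation"}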

Alert Annotations

Each alert includes:

  • summary: Brief description
  • description: Detailed explanation with metric values
  • runbook_url: Link to troubleshooting documentation (if available)

Versioning

Repository Structure

~/.claude/repos/homelab/charts/willlaptop-monitoring/
├── prometheus-rules.yaml   # PrometheusRule for workstation alerts
├── values.yaml             # Configuration values
└── README.md               # Documentation

Values.yaml Configuration

Configurable parameters:

workstation:
  hostname: willlaptop
  ip: <workstation_ip>  # optional, fallback to DNS

scrape:
  interval: 15s
  timeout: 10s

alerts:
  disk:
    critical_percent: 5
    warning_percent: 10
    info_percent: 20
  memory:
    critical_percent: 95
    warning_percent: 85
  cpu:
    critical_percent: 95
    critical_duration: 10m
    warning_percent: 80
    warning_duration: 15m

Integration with ArgoCD

This follows the existing GitOps pattern (charts/kube-prometheus-stack); the chart can be added to ArgoCD for automated deployment if desired, as sketched below.
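A sketch of the corresponding ArgoCD Application; repoURL is a placeholder, and the automated sync policy is a suggestion:

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: willlaptop-monitoring
  namespace: argocd
spec:
  project: default
  source:
    repoURL: <homelab_repo_url>  # placeholder
    path: charts/willlaptop-monitoring
    targetRevision: main
  destination:
    server: https://kubernetes.default.svc
    namespace: monitoring
  syncPolicy:
    automated: {}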

Testing and Verification

Phase 1 - Workstation Deployment

  1. Verify node_exporter installation:

    pacman -Q prometheus-node-exporter
    
  2. Check systemd service status:

    systemctl status prometheus-node-exporter
    
  3. Verify metrics endpoint locally:

    curl http://localhost:9100/metrics | head -20
    
  4. Test accessibility from cluster:

    kubectl run -it --rm debug --image=curlimages/curl --restart=Never \
      --command -- curl -s http://willlaptop:9100/metrics
    

Phase 2 - Prometheus Integration

  1. Verify Prometheus target:

    • Access Prometheus UI → Targets → workstation/willlaptop
    • Confirm target is UP
  2. Verify metric ingestion:

    # Query in Prometheus UI
    up{job="workstation/willlaptop", instance="willlaptop"}
    
  3. Verify label injection:

    • Confirm env="workstation" label appears on metrics

Phase 3 - Alert Verification

  1. Review PrometheusRule:

    kubectl get prometheusrule workstation-alerts -n monitoring -o yaml
    
  2. Verify rule evaluation:

    • Access Prometheus UI → Rules
    • Confirm workstation rules are active
  3. Test critical alert:

    • Temporarily trigger a low disk alert (or simulate)
    • Verify alert fires in Prometheus UI
  4. Verify Alertmanager integration:

    • Check Alertmanager UI → Alerts
    • Confirm workstation alerts are received
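If no ingress exposes the UI, a port-forward works; the service name assumes the prometheus-operator default:

kubectl port-forward -n monitoring svc/alertmanager-operated 9093
# then browse http://localhost:9093 and check the Alerts tab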

Success Criteria

  • node_exporter running on workstation
  • Metrics accessible from cluster nodes
  • Prometheus scraping workstation metrics
  • Alert rules evaluated and firing correctly
  • Alerts routing through Alertmanager
  • Configuration versioned in homelab repository
  • Documentation complete

Future Enhancements

  • Grafana dashboards for workstation metrics
  • Alert tuning based on observed patterns
  • Additional collectors (e.g., temperature sensors if available)
  • Integration with morning-report skill for health status