
Workstation Monitoring Design

Overview

Deploy comprehensive monitoring for the Arch Linux workstation (willlaptop) by integrating with the existing k8s monitoring stack. This will enable proactive alerting for resource exhaustion, long-term capacity planning, and performance debugging.

Reference: Future consideration fc-001 (workstation monitoring)

Current Infrastructure

  • Workstation: Arch Linux on MacBookPro9,2 (hostname: willlaptop)
  • K8s Cluster: kube-prometheus-stack deployed with Prometheus, Alertmanager, Grafana
  • Network: Direct network connectivity between workstation and cluster nodes
  • Existing Monitoring: 3 node_exporters on cluster nodes, cluster-level alerts configured

Architecture

Components

┌─────────────────┐     HTTP/9100      ┌──────────────────────┐
│   Workstation   │ ─────────────────> │    K8s Prometheus    │
│   (willlaptop)  │  scrape every 15s  │    (monitoring ns)   │
│                 │                    │                      │
│  node_exporter  │                    │  workstation rules   │
│  systemd service│                    │  + scrape config     │
└─────────────────┘                    └──────────────────────┘
                                                   │
                                                   v
                                       ┌──────────────────────┐
                                       │  Alertmanager        │
                                       │  (existing setup)    │
                                       │  unified routing     │
                                       └──────────────────────┘

Data Flow

  1. node_exporter exposes metrics on http://willlaptop:9100/metrics
  2. Prometheus scrapes metrics every 15s via static target configuration
  3. PrometheusRule evaluates workstation-specific alert rules
  4. Alertmanager routes alerts to existing notification channels
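For illustration, the metrics exposed at step 1 look like the following (real node_exporter metric names; the values are made up):

node_memory_MemAvailable_bytes 8.123456e+09
node_filesystem_avail_bytes{device="/dev/sda2",fstype="ext4",mountpoint="/"} 4.2e+10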

Workstation Deployment

node_exporter Service

Installation:

pacman -S prometheus-node-exporter

Systemd Configuration:

  • Service: prometheus-node-exporter.service (the unit name shipped by the Arch package)
  • User: node_exporter (created by the package)
  • Listen address: 0.0.0.0:9100
  • Restart policy: always with 10s delay
  • Logging: systemd journal (journalctl -u prometheus-node-exporter)

ExecStart flags:

/usr/bin/prometheus-node-exporter --collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($|/)

Excludes system mounts to reduce noise.
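These settings can be carried in a systemd drop-in. A minimal sketch, assuming the unit shipped by the Arch package allows its ExecStart to be overridden this way:

# /etc/systemd/system/prometheus-node-exporter.service.d/override.conf
[Service]
# Reset the packaged ExecStart before replacing it
ExecStart=
ExecStart=/usr/bin/prometheus-node-exporter \
  --web.listen-address=0.0.0.0:9100 \
  --collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($|/)
Restart=always
RestartSec=10

Apply with systemctl daemon-reload, then restart the service.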

Firewall Configuration:

  • Allow TCP 9100 from cluster nodes
  • Use ufw or iptables to restrict access
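A sketch using ufw; the CIDR 10.0.0.0/24 is a placeholder for the actual cluster node subnet:

# allow scrapes from cluster nodes only (ufw's default incoming policy denies the rest)
ufw allow from 10.0.0.0/24 to any port 9100 proto tcp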

Metrics Collected: All default collectors, with resource-intensive ones disabled:

  • CPU, memory, filesystem, network
  • System stats (uptime, load average, systemd)
  • Thermal (if available on hardware)
  • Disk I/O

Prometheus Configuration

Static Scrape Target

Job configuration:

  • Job name: workstation/willlaptop
  • Target: willlaptop:9100 (DNS resolution) or workstation IP
  • Scrape interval: 15s (matches cluster node_exporter)
  • Scrape timeout: 10s
  • Metrics path: /metrics
  • Honor labels: true

Relabeling rules:

  • Add env: "workstation" label for identification
  • Set instance: "willlaptop" via target labels (otherwise Prometheus defaults it to the host:port address)

Integration: Add to the existing Prometheus configuration in kube-prometheus-stack. This can be done via:

  • additionalScrapeConfigs in the Helm values (rendered into the Prometheus CRD), or
  • a ScrapeConfig custom resource, on operator versions that support it
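A sketch of the Helm-values route; the env and instance labels implement the relabeling intent above (setting instance explicitly keeps Prometheus from defaulting it to willlaptop:9100):

prometheus:
  prometheusSpec:
    additionalScrapeConfigs:
      - job_name: workstation/willlaptop
        scrape_interval: 15s
        scrape_timeout: 10s
        metrics_path: /metrics
        honor_labels: true
        static_configs:
          - targets: ["willlaptop:9100"]
            labels:
              env: workstation
              instance: willlaptop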

Alert Rules

PrometheusRule Resource

  • Namespace: monitoring
  • Kind: PrometheusRule
  • Labels: standard discovery labels for the Prometheus operator
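A skeleton of the resource; the release label is an assumption and must match the ruleSelector of the deployed Prometheus:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: workstation-alerts
  namespace: monitoring
  labels:
    release: kube-prometheus-stack  # assumed discovery label
spec:
  groups:
    - name: workstation.rules
      rules: []  # populated with the alerts below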

Alert Categories

Critical Alerts (Paging)

  1. WorkstationDiskSpaceCritical (sketched after this list)

    • Condition: <5% free on any mounted filesystem
    • Duration: 5m
    • Severity: critical
  2. WorkstationMemoryCritical

    • Condition: >95% memory usage
    • Duration: 5m
    • Severity: critical
  3. WorkstationCPUCritical

    • Condition: >95% CPU usage
    • Duration: 10m
    • Severity: critical
  4. WorkstationSystemdFailed

    • Condition: Failed systemd units detected
    • Duration: 5m
    • Severity: critical
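As a sketch, the disk-space alert might be expressed as follows; the tmpfs/overlay exclusion is an added assumption to keep virtual filesystems out of the alert:

- alert: WorkstationDiskSpaceCritical
  expr: |
    (node_filesystem_avail_bytes{env="workstation",fstype!~"tmpfs|overlay"}
      / node_filesystem_size_bytes{env="workstation",fstype!~"tmpfs|overlay"}) * 100 < 5
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Disk space critically low on {{ $labels.instance }}"
    description: "{{ $labels.mountpoint }} has {{ $value | humanize }}% free."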

Warning Alerts (Email/Slack)

  1. WorkstationDiskSpaceWarning

    • Condition: <10% free on any mounted filesystem
    • Duration: 10m
    • Severity: warning
  2. WorkstationMemoryWarning

    • Condition: >85% memory usage
    • Duration: 10m
    • Severity: warning
  3. WorkstationCPUWarning (sketched after this list)

    • Condition: >80% CPU usage
    • Duration: 15m
    • Severity: warning
  4. WorkstationLoadHigh

    • Condition: 5-minute load average exceeds the CPU core count
    • Duration: 10m
    • Severity: warning
  5. WorkstationDiskInodeWarning

    • Condition: <10% inodes free
    • Duration: 10m
    • Severity: warning
  6. WorkstationNetworkErrors

    • Condition: High packet loss or error rate
    • Duration: 10m
    • Severity: warning
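One plausible expression for the CPU warning, derived from the idle counter and averaged across cores:

- alert: WorkstationCPUWarning
  expr: |
    100 - (avg by (instance) (rate(node_cpu_seconds_total{env="workstation",mode="idle"}[5m])) * 100) > 80
  for: 15m
  labels:
    severity: warning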

Info Alerts (Log Only)

  1. WorkstationDiskSpaceInfo

    • Condition: <20% free on any mounted filesystem
    • Duration: 15m
    • Severity: info
  2. WorkstationUptime

    • Condition: System uptime metric (recording rule)
    • Severity: info
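The uptime recording rule could be as simple as (the rule name is a suggestion):

- record: workstation:uptime_seconds
  expr: time() - node_boot_time_seconds{env="workstation"}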

Alert Annotations

Each alert includes:

  • summary: Brief description
  • description: Detailed explanation with metric values
  • runbook_url: Link to troubleshooting documentation (if available)

Versioning

Repository Structure

~/.claude/repos/homelab/charts/willlaptop-monitoring/
├── prometheus-rules.yaml   # PrometheusRule for workstation alerts
├── values.yaml             # Configuration values
└── README.md               # Documentation

Values.yaml Configuration

Configurable parameters:

workstation:
  hostname: willlaptop
  ip: <workstation_ip>  # optional, fallback to DNS

scrape:
  interval: 15s
  timeout: 10s

alerts:
  disk:
    critical_percent: 5
    warning_percent: 10
    info_percent: 20
  memory:
    critical_percent: 95
    warning_percent: 85
  cpu:
    critical_percent: 95
    critical_duration: 10m
    warning_percent: 80
    warning_duration: 15m

Integration with ArgoCD

This follows the existing GitOps pattern (charts/kube-prometheus-stack); the chart can be added to ArgoCD for automated deployment if desired, as sketched below.
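A sketch of the corresponding ArgoCD Application; repoURL is a placeholder, and the automated sync policy is a suggestion:

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: willlaptop-monitoring
  namespace: argocd
spec:
  project: default
  source:
    repoURL: <homelab_repo_url>  # placeholder
    path: charts/willlaptop-monitoring
    targetRevision: main
  destination:
    server: https://kubernetes.default.svc
    namespace: monitoring
  syncPolicy:
    automated: {}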

Testing and Verification

Phase 1 - Workstation Deployment

  1. Verify node_exporter installation:

    pacman -Q prometheus-node-exporter
    
  2. Check systemd service status:

    systemctl status prometheus-node-exporter
    
  3. Verify metrics endpoint locally:

    curl http://localhost:9100/metrics | head -20
    
  4. Test accessibility from cluster:

    kubectl run -it --rm debug --image=curlimages/curl --restart=Never \
      --command -- curl -s http://willlaptop:9100/metrics
    

Phase 2 - Prometheus Integration

  1. Verify Prometheus target:

    • Access Prometheus UI → Targets → workstation/willlaptop
    • Confirm target is UP
  2. Verify metric ingestion:

    # Query in Prometheus UI
    up{job="workstation/willlaptop", instance="willlaptop"}
    
  3. Verify label injection:

    • Confirm env="workstation" label appears on metrics

Phase 3 - Alert Verification

  1. Review PrometheusRule:

    kubectl get prometheusrule workstation-alerts -n monitoring -o yaml
    
  2. Verify rule evaluation:

    • Access Prometheus UI → Rules
    • Confirm workstation rules are active
  3. Test critical alert:

    • Temporarily trigger a low disk alert (or simulate)
    • Verify alert fires in Prometheus UI
  4. Verify Alertmanager integration:

    • Check Alertmanager UI → Alerts
    • Confirm workstation alerts are received
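If no ingress exposes the UI, a port-forward works; the service name assumes the prometheus-operator default:

kubectl port-forward -n monitoring svc/alertmanager-operated 9093
# then browse http://localhost:9093 and check the Alerts tab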

Success Criteria

  • node_exporter running on workstation
  • Metrics accessible from cluster nodes
  • Prometheus scraping workstation metrics
  • Alert rules evaluated and firing correctly
  • Alerts routing through Alertmanager
  • Configuration versioned in homelab repository
  • Documentation complete

Future Enhancements

  • Grafana dashboards for workstation metrics
  • Alert tuning based on observed patterns
  • Additional collectors (e.g., temperature sensors if available)
  • Integration with morning-report skill for health status