# Workstation Monitoring Design

## Overview

Deploy comprehensive monitoring for the Arch Linux workstation (willlaptop) by integrating with the existing k8s monitoring stack. This will enable proactive alerting for resource exhaustion, long-term capacity planning, and performance debugging.

**Reference:** Future consideration `fc-001` (workstation monitoring)

## Current Infrastructure

- **Workstation:** Arch Linux on MacBookPro9,2 (hostname: willlaptop)
- **K8s Cluster:** kube-prometheus-stack deployed with Prometheus, Alertmanager, Grafana
- **Network:** Direct network connectivity between workstation and cluster nodes
- **Existing Monitoring:** 3 node_exporters on cluster nodes, cluster-level alerts configured

## Architecture

### Components

```
┌─────────────────┐    HTTP/9100        ┌──────────────────────┐
│  Workstation    │ ──────────────────> │  K8s Prometheus      │
│  (willlaptop)   │  scrape every 15s   │  (monitoring ns)     │
│                 │                     │                      │
│  node_exporter  │                     │  workstation rules   │
│  systemd service│                     │  + scrape config     │
└─────────────────┘                     └──────────────────────┘
                                                   │
                                                   v
                                        ┌──────────────────────┐
                                        │  Alertmanager        │
                                        │  (existing setup)    │
                                        │  unified routing     │
                                        └──────────────────────┘
```

### Data Flow

1. **node_exporter** exposes metrics on `http://willlaptop:9100/metrics`
2. **Prometheus** scrapes the metrics every 15s via a static target configuration
3. **PrometheusRule** evaluates workstation-specific alert rules
4. **Alertmanager** routes alerts to the existing notification channels

## Workstation Deployment

### node_exporter Service

**Installation:**

```bash
pacman -S prometheus-node-exporter
```

**Systemd Configuration:**

- Service: `node_exporter.service`
- User: `node_exporter` (created by the package)
- Listen address: `0.0.0.0:9100`
- Restart policy: `always` with a 10s delay
- Logging: systemd journal (`journalctl -u node_exporter`)

**ExecStart flags:**

```bash
/usr/bin/node_exporter --collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($|/)
```

Excludes system mounts to reduce noise.

**Firewall Configuration:**

- Allow TCP 9100 from cluster nodes only
- Use ufw or iptables to restrict access

**Metrics Collected:** All default collectors except resource-intensive ones:

- CPU, memory, filesystem, network
- System stats (uptime, load average, systemd)
- Thermal (if available on this hardware)
- Disk I/O

## Prometheus Configuration

### Static Scrape Target

**Job configuration:**

- Job name: `workstation/willlaptop`
- Target: `willlaptop:9100` (DNS resolution) or the workstation IP
- Scrape interval: `15s` (matches the cluster node_exporters)
- Scrape timeout: `10s`
- Metrics path: `/metrics`
- Honor labels: `true`

**Relabeling rules:**

- Add an `env: "workstation"` label for identification
- Preserve `instance: "willlaptop"` from the target

**Integration:** Add to the Prometheus configuration managed by kube-prometheus-stack. This can be done via:

- the `additionalScrapeConfigs` field in the chart values (rendered into the Prometheus configuration by the operator), or
- a `ScrapeConfig` custom resource, on prometheus-operator releases that support it

A sketch of the first option follows.
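Assuming the chart's `prometheus.prometheusSpec.additionalScrapeConfigs` values field, a minimal sketch of the scrape job is below. Field names follow the standard Prometheus `scrape_config` schema; the port-stripping relabel rule is one possible way to realize `instance="willlaptop"`, not a confirmed part of the design.

```yaml
# Sketch: static scrape job for the workstation, as a kube-prometheus-stack
# values entry. The operator renders this into the Prometheus configuration.
prometheus:
  prometheusSpec:
    additionalScrapeConfigs:
      - job_name: workstation/willlaptop
        scrape_interval: 15s
        scrape_timeout: 10s
        metrics_path: /metrics
        honor_labels: true
        static_configs:
          - targets: ["willlaptop:9100"]
            labels:
              env: workstation            # identification label from the design
        relabel_configs:
          - source_labels: [__address__]  # "willlaptop:9100"
            regex: "([^:]+):\\d+"
            target_label: instance
            replacement: "$1"             # yields instance="willlaptop"
```

The config-reloader sidecar picks up the rendered configuration automatically, so no manual Prometheus restart is needed after a values change.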
## Alert Rules

### PrometheusRule Resource

- **Namespace:** `monitoring`
- **Kind:** `PrometheusRule`
- **Labels:** Standard discovery labels for the Prometheus operator

### Alert Categories

#### Critical Alerts (Paging)

1. **WorkstationDiskSpaceCritical**
   - Condition: `<5%` free on any mounted filesystem
   - Duration: 5m
   - Severity: critical

2. **WorkstationMemoryCritical**
   - Condition: `>95%` memory usage
   - Duration: 5m
   - Severity: critical

3. **WorkstationCPUCritical**
   - Condition: `>95%` CPU usage
   - Duration: 10m
   - Severity: critical

4. **WorkstationSystemdFailed**
   - Condition: Failed systemd units detected
   - Duration: 5m
   - Severity: critical

#### Warning Alerts (Email/Slack)

1. **WorkstationDiskSpaceWarning**
   - Condition: `<10%` free on any mounted filesystem
   - Duration: 10m
   - Severity: warning

2. **WorkstationMemoryWarning**
   - Condition: `>85%` memory usage
   - Duration: 10m
   - Severity: warning

3. **WorkstationCPUWarning**
   - Condition: `>80%` CPU usage
   - Duration: 15m
   - Severity: warning

4. **WorkstationLoadHigh**
   - Condition: 5m load average exceeds the CPU core count
   - Duration: 10m
   - Severity: warning

5. **WorkstationDiskInodeWarning**
   - Condition: `<10%` inodes free
   - Duration: 10m
   - Severity: warning

6. **WorkstationNetworkErrors**
   - Condition: Sustained packet loss or interface error rate
   - Duration: 10m
   - Severity: warning

#### Info Alerts (Log Only)

1. **WorkstationDiskSpaceInfo**
   - Condition: `<20%` free on any mounted filesystem
   - Duration: 15m
   - Severity: info

2. **WorkstationUptime**
   - Condition: System uptime metric (recording rule)
   - Severity: info

### Alert Annotations

Each alert includes:

- `summary`: Brief description
- `description`: Detailed explanation with metric values
- `runbook_url`: Link to troubleshooting documentation (if available)
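As an illustration, here is a trimmed `prometheus-rules.yaml` covering one critical and one warning rule from the tables above, built from standard node_exporter metrics. The `release: kube-prometheus-stack` discovery label is an assumption; it must match whatever `ruleSelector` the Prometheus custom resource actually uses.

```yaml
# Sketch: two representative rules; thresholds and durations match the tables
# above. The metadata labels are assumptions and must match the operator's
# ruleSelector for the rules to be discovered.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: workstation-alerts
  namespace: monitoring
  labels:
    release: kube-prometheus-stack   # assumed discovery label
spec:
  groups:
    - name: workstation.rules
      rules:
        - alert: WorkstationDiskSpaceCritical
          expr: |
            (node_filesystem_avail_bytes{instance="willlaptop",fstype!=""}
              / node_filesystem_size_bytes{instance="willlaptop",fstype!=""}) * 100 < 5
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Disk space critically low on {{ $labels.mountpoint }}"
            description: "Only {{ $value | humanize }}% free on {{ $labels.mountpoint }}."
        - alert: WorkstationMemoryWarning
          expr: |
            (1 - node_memory_MemAvailable_bytes{instance="willlaptop"}
              / node_memory_MemTotal_bytes{instance="willlaptop"}) * 100 > 85
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Memory usage above 85% on willlaptop"
            description: "Current memory usage: {{ $value | humanize }}%."
```

Pinning every expression to `instance="willlaptop"` keeps these rules from also firing for the cluster's own node_exporters, which share the same metric names.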
## Versioning

### Repository Structure

```
~/.claude/repos/homelab/charts/willlaptop-monitoring/
├── prometheus-rules.yaml   # PrometheusRule for workstation alerts
├── values.yaml             # Configuration values
└── README.md               # Documentation
```

### Values.yaml Configuration

Configurable parameters:

```yaml
workstation:
  hostname: willlaptop
  ip: # optional, fallback to DNS
scrape:
  interval: 15s
  timeout: 10s
alerts:
  disk:
    critical_percent: 5
    warning_percent: 10
    info_percent: 20
  memory:
    critical_percent: 95
    warning_percent: 85
  cpu:
    critical_percent: 95
    critical_duration: 10m
    warning_percent: 80
    warning_duration: 15m
```

### Integration with ArgoCD

Follows the existing GitOps pattern (charts/kube-prometheus-stack). Can be added to ArgoCD for automated deployments if desired.

## Testing and Verification

### Phase 1 - Workstation Deployment

1. Verify the node_exporter installation:

   ```bash
   pacman -Q prometheus-node-exporter
   ```

2. Check the systemd service status:

   ```bash
   systemctl status node_exporter
   ```

3. Verify the metrics endpoint locally:

   ```bash
   curl http://localhost:9100/metrics | head -20
   ```

4. Test accessibility from the cluster:

   ```bash
   kubectl run -it --rm debug --image=curlimages/curl -- curl willlaptop:9100/metrics
   ```

### Phase 2 - Prometheus Integration

1. Verify the Prometheus target:
   - Access Prometheus UI → Targets → workstation/willlaptop
   - Confirm the target is UP

2. Verify metric ingestion:

   ```bash
   # Query in the Prometheus UI; `up` is the standard scrape-health metric
   up{instance="willlaptop"}
   ```

3. Verify label injection:
   - Confirm the `env="workstation"` label appears on the metrics

### Phase 3 - Alert Verification

1. Review the PrometheusRule:

   ```bash
   kubectl get prometheusrule workstation-alerts -n monitoring -o yaml
   ```

2. Verify rule evaluation:
   - Access Prometheus UI → Rules
   - Confirm the workstation rules are active

3. Test a critical alert:
   - Temporarily trigger a low-disk alert (or simulate one)
   - Verify the alert fires in the Prometheus UI

4. Verify Alertmanager integration:
   - Check Alertmanager UI → Alerts
   - Confirm workstation alerts are received

## Success Criteria

- [ ] node_exporter running on the workstation
- [ ] Metrics accessible from cluster nodes
- [ ] Prometheus scraping workstation metrics
- [ ] Alert rules evaluated and firing correctly
- [ ] Alerts routing through Alertmanager
- [ ] Configuration versioned in the homelab repository
- [ ] Documentation complete

## Future Enhancements

- Grafana dashboards for workstation metrics (see the provisioning sketch below)
- Alert tuning based on observed patterns
- Additional collectors (e.g., temperature sensors if available)
- Integration with the morning-report skill for health status
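The Grafana dashboard enhancement could piggyback on the dashboard sidecar that kube-prometheus-stack deploys alongside Grafana, which loads dashboards from labeled ConfigMaps. A minimal sketch, assuming the sidecar's default `grafana_dashboard: "1"` discovery label; the dashboard JSON here is only a stub with a single panel, and a real dashboard would be exported from the Grafana UI:

```yaml
# Sketch: dashboard delivered as a ConfigMap for the Grafana sidecar.
# The label name and value assume the chart's default sidecar configuration.
apiVersion: v1
kind: ConfigMap
metadata:
  name: willlaptop-dashboard
  namespace: monitoring
  labels:
    grafana_dashboard: "1"
data:
  willlaptop.json: |
    {
      "title": "Workstation: willlaptop",
      "panels": [
        {
          "type": "timeseries",
          "title": "Memory usage %",
          "targets": [
            { "expr": "(1 - node_memory_MemAvailable_bytes{instance=\"willlaptop\"} / node_memory_MemTotal_bytes{instance=\"willlaptop\"}) * 100" }
          ]
        }
      ]
    }
```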