Automation components for scheduled and event-driven workflows:

Scheduler:
- scheduler.sh for cron-based workflow execution
- Logs workflow runs to ~/.claude/logs/workflows/
- Notifies dashboard on completion

Alertmanager Integration:
- webhook-receiver.sh for processing alerts
- Dashboard endpoint /api/webhooks/alertmanager
- Example alertmanager-config.yaml with routing rules
- Maps alerts to workflows (crashloop, node issues, resources)

New Incident Workflows:
- node-issue-response.yaml: Handle NotReady/unreachable nodes
- resource-pressure-response.yaml: Respond to memory/CPU overcommit
- argocd-sync-failure.yaml: Investigate and fix sync failures

Dashboard Updates:
- POST /api/webhooks/alertmanager endpoint
- POST /api/workflows/{name}/complete endpoint
- Alerts create pending actions for visibility

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
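
For orientation, the Alertmanager routing mentioned in the commit message could look roughly like the fragment below. This is a minimal sketch and not the contents of the committed alertmanager-config.yaml: the receiver names (`default`, `claude-dashboard`) and the dashboard host and port are hypothetical placeholders. Only the `/api/webhooks/alertmanager` path and the two alert names come from this commit.

```yaml
# Hypothetical sketch; receiver names and the dashboard host/port are assumptions.
route:
  receiver: default
  group_by: ['alertname']
  routes:
    # Send the resource-pressure alerts to the dashboard webhook.
    - matchers:
        - alertname =~ "KubeMemoryOvercommit|KubeCPUOvercommit"
      receiver: claude-dashboard

receivers:
  - name: default
  - name: claude-dashboard
    webhook_configs:
      # Path taken from the commit message; the host and port are placeholders.
      - url: http://dashboard.example.internal:8080/api/webhooks/alertmanager
        send_resolved: false
```

The alert names matched here are the ones listed under `trigger` in resource-pressure-response.yaml, so a firing alert would reach the endpoint that creates the pending action.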
98 lines
2.4 KiB
YAML
name: resource-pressure-response
description: Respond to cluster resource pressure alerts
version: "1.0"

trigger:
  - alert:
      match:
        alertname: KubeMemoryOvercommit
  - alert:
      match:
        alertname: KubeCPUOvercommit
  - manual: true

defaults:
  model: sonnet

steps:
  - name: assess-pressure
    agent: prometheus-analyst
    model: sonnet
    task: |
      Assess current resource pressure:
      - Per-node CPU usage and requests vs limits
      - Per-node memory usage and requests vs limits
      - Identify nodes under most pressure
      - Check for OOM events in last hour

      Focus on Pi cluster constraints:
      - Pi 5 (8GB): Higher capacity
      - Pi 3 (1GB): Very limited, check if overloaded
    output: pressure_analysis

  - name: identify-hogs
    agent: k8s-diagnostician
    model: haiku
    task: |
      Identify resource-heavy workloads:
      - Top 5 pods by CPU usage
      - Top 5 pods by memory usage
      - Any pods exceeding their requests
      - Any pods with no limits set
    output: resource_hogs

  - name: check-scaling
    agent: argocd-operator
    model: haiku
    task: |
      Check if any deployments can be scaled:
      - List deployments with >1 replica
      - Check HPA configurations
      - Identify candidates for scale-down
    output: scaling_options

  - name: recommend-actions
    agent: k8s-orchestrator
    model: sonnet
    task: |
      Recommend resource optimization actions:

      Analysis:
      - Pressure: {{ steps.assess-pressure.output }}
      - Top consumers: {{ steps.identify-hogs.output }}
      - Scaling options: {{ steps.check-scaling.output }}

      Prioritize actions by impact and safety:

      [SAFE] - Can be auto-applied:
      - Clean up completed jobs/pods
      - Identify and report issues

      [CONFIRM] - Require approval:
      - Scale down non-critical deployments
      - Adjust resource limits
      - Evict low-priority pods

      [FORBIDDEN] - Never auto-apply:
      - Delete PVCs
      - Delete critical workloads
    output: recommendations

  - name: cleanup
    agent: k8s-diagnostician
    model: haiku
    task: |
      Perform safe cleanup actions:
      - Delete completed jobs older than 1 hour
      - Delete succeeded pods
      - Delete failed pods older than 24 hours

      Report what was cleaned up.
    output: cleanup_result
    confirm: false

outputs:
  - pressure_analysis
  - recommendations
  - cleanup_result