feat: Implement Phase 1 K8s agent orchestrator system
Core agent system for Raspberry Pi k0s cluster management: Agents: - k8s-orchestrator: Central task delegation and decision making - k8s-diagnostician: Cluster health, logs, troubleshooting - argocd-operator: GitOps deployments and rollbacks - prometheus-analyst: Metrics queries and alert analysis - git-operator: Manifest management and PR workflows Workflows: - cluster-health-check.yaml: Scheduled health assessment - deploy-app.md: Application deployment guide - pod-crashloop.yaml: Automated incident response Skills: - /cluster-status: Quick health overview - /deploy: Deploy or update applications - /diagnose: Investigate cluster issues Configuration: - Agent definitions with model assignments (Opus/Sonnet) - Autonomy rules (safe/confirm/forbidden actions) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
This commit is contained in:
79
workflows/health/cluster-health-check.yaml
Normal file
79
workflows/health/cluster-health-check.yaml
Normal file
@@ -0,0 +1,79 @@
|
||||
name: cluster-health-check
|
||||
description: Comprehensive cluster health assessment
|
||||
version: "1.0"
|
||||
|
||||
trigger:
|
||||
- schedule: "0 */6 * * *" # every 6 hours
|
||||
- manual: true
|
||||
|
||||
defaults:
|
||||
model: sonnet
|
||||
|
||||
steps:
|
||||
- name: check-nodes
|
||||
agent: k8s-diagnostician
|
||||
model: haiku
|
||||
task: |
|
||||
Get node status for all nodes:
|
||||
- Check node conditions (Ready, MemoryPressure, DiskPressure, PIDPressure)
|
||||
- Report any nodes not in Ready state
|
||||
- Check resource usage with kubectl top nodes
|
||||
output: node_status
|
||||
|
||||
- name: check-pods
|
||||
agent: k8s-diagnostician
|
||||
model: haiku
|
||||
task: |
|
||||
Get pod status across all namespaces:
|
||||
- Count pods by status (Running, Pending, Failed, CrashLoopBackOff)
|
||||
- List any unhealthy pods with their namespace and reason
|
||||
- Check for high restart counts (>5 in last hour)
|
||||
output: pod_status
|
||||
|
||||
- name: check-metrics
|
||||
agent: prometheus-analyst
|
||||
model: haiku
|
||||
task: |
|
||||
Query key cluster metrics:
|
||||
- Node CPU and memory usage (current and 1h average)
|
||||
- Top 5 pods by CPU usage
|
||||
- Top 5 pods by memory usage
|
||||
- Any active firing alerts
|
||||
output: metrics_summary
|
||||
|
||||
- name: check-argocd
|
||||
agent: argocd-operator
|
||||
model: haiku
|
||||
task: |
|
||||
Check ArgoCD application status:
|
||||
- List all applications with sync and health status
|
||||
- Report any apps that are OutOfSync or Degraded
|
||||
- Note last sync time for each app
|
||||
output: argocd_status
|
||||
|
||||
- name: analyze-and-report
|
||||
agent: k8s-orchestrator
|
||||
model: sonnet
|
||||
task: |
|
||||
Analyze the health check results and create a summary report:
|
||||
|
||||
Inputs:
|
||||
- Node status: {{ steps.check-nodes.output }}
|
||||
- Pod status: {{ steps.check-pods.output }}
|
||||
- Metrics: {{ steps.check-metrics.output }}
|
||||
- ArgoCD: {{ steps.check-argocd.output }}
|
||||
|
||||
Create a report with:
|
||||
1. Overall cluster health (Healthy/Degraded/Critical)
|
||||
2. Summary table of key metrics
|
||||
3. List of issues found (if any)
|
||||
4. Recommended actions (mark as safe/confirm)
|
||||
|
||||
If issues are critical, propose immediate remediation steps.
|
||||
output: health_report
|
||||
confirm_if: actions_proposed
|
||||
|
||||
outputs:
|
||||
- health_report
|
||||
- node_status
|
||||
- pod_status
|
||||
Reference in New Issue
Block a user