feat: Implement Phase 1 K8s agent orchestrator system

Core agent system for Raspberry Pi k0s cluster management:

Agents:
- k8s-orchestrator: Central task delegation and decision making
- k8s-diagnostician: Cluster health, logs, troubleshooting
- argocd-operator: GitOps deployments and rollbacks
- prometheus-analyst: Metrics queries and alert analysis
- git-operator: Manifest management and PR workflows

Workflows:
- cluster-health-check.yaml: Scheduled health assessment
- deploy-app.md: Application deployment guide
- pod-crashloop.yaml: Automated incident response

Skills:
- /cluster-status: Quick health overview
- /deploy: Deploy or update applications
- /diagnose: Investigate cluster issues

Configuration:
- Agent definitions with model assignments (Opus/Sonnet)
- Autonomy rules (safe/confirm/forbidden actions)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-26 11:25:11 -08:00
parent 216a95cec4
commit a80f714fc2
12 changed files with 1302 additions and 1 deletions
--- a/workflows/health/cluster-health-check.yaml
+++ b/workflows/health/cluster-health-check.yaml
@@ -0,0 +1,79 @@
+name: cluster-health-check
+description: Comprehensive cluster health assessment
+version: "1.0"
+
+trigger:
+  - schedule: "0 */6 * * *"  # every 6 hours
+  - manual: true
+
+defaults:
+  model: sonnet
+
+steps:
+  - name: check-nodes
+    agent: k8s-diagnostician
+    model: haiku
+    task: |
+      Get node status for all nodes:
+      - Check node conditions (Ready, MemoryPressure, DiskPressure, PIDPressure)
+      - Report any nodes not in Ready state
+      - Check resource usage with kubectl top nodes
+    output: node_status
+
+  - name: check-pods
+    agent: k8s-diagnostician
+    model: haiku
+    task: |
+      Get pod status across all namespaces:
+      - Count pods by status (Running, Pending, Failed, CrashLoopBackOff)
+      - List any unhealthy pods with their namespace and reason
+      - Check for high restart counts (>5 in last hour)
+    output: pod_status
+
+  - name: check-metrics
+    agent: prometheus-analyst
+    model: haiku
+    task: |
+      Query key cluster metrics:
+      - Node CPU and memory usage (current and 1h average)
+      - Top 5 pods by CPU usage
+      - Top 5 pods by memory usage
+      - Any currently firing alerts
+    output: metrics_summary
+
+  - name: check-argocd
+    agent: argocd-operator
+    model: haiku
+    task: |
+      Check ArgoCD application status:
+      - List all applications with sync and health status
+      - Report any apps that are OutOfSync or Degraded
+      - Note last sync time for each app
+    output: argocd_status
+
+  - name: analyze-and-report
+    agent: k8s-orchestrator
+    model: sonnet
+    task: |
+      Analyze the health check results and create a summary report:
+
+      Inputs:
+      - Node status: {{ steps.check-nodes.output }}
+      - Pod status: {{ steps.check-pods.output }}
+      - Metrics: {{ steps.check-metrics.output }}
+      - ArgoCD: {{ steps.check-argocd.output }}
+
+      Create a report with:
+      1. Overall cluster health (Healthy/Degraded/Critical)
+      2. Summary table of key metrics
+      3. List of issues found (if any)
+      4. Recommended actions (mark as safe/confirm)
+
+      If issues are critical, propose immediate remediation steps.
+    output: health_report
+    confirm_if: actions_proposed
+
+outputs:
+  - health_report
+  - node_status
+  - pod_status
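
The `analyze-and-report` step above pulls in earlier results through `{{ steps.<name>.output }}` placeholders. The engine that consumes this file is not part of the diff, so the following is only a minimal sketch of how such references might be checked and substituted. It assumes plain string substitution keyed by step name; `STEP_REF`, `unresolved_refs`, and `render` are hypothetical helpers, not names from this repository.

```python
import re

# Assumed placeholder syntax, taken from the analyze-and-report task text.
STEP_REF = re.compile(r"\{\{\s*steps\.([\w-]+)\.output\s*\}\}")


def unresolved_refs(steps):
    """Return (step_name, referenced_step) pairs that point at a step not yet run."""
    seen = set()
    problems = []
    for step in steps:
        for ref in STEP_REF.findall(step.get("task", "")):
            if ref not in seen:
                problems.append((step["name"], ref))
        seen.add(step["name"])
    return problems


def render(task, outputs):
    """Replace each placeholder with the stored output of the step it names."""
    return STEP_REF.sub(lambda m: outputs[m.group(1)], task)


# Step names mirror the workflow; the task strings are abbreviated.
steps = [
    {"name": "check-nodes", "task": "Get node status for all nodes"},
    {"name": "check-pods", "task": "Get pod status across all namespaces"},
    {"name": "check-metrics", "task": "Query key cluster metrics"},
    {"name": "check-argocd", "task": "Check ArgoCD application status"},
    {
        "name": "analyze-and-report",
        "task": (
            "Node status: {{ steps.check-nodes.output }}\n"
            "Pod status: {{ steps.check-pods.output }}\n"
            "Metrics: {{ steps.check-metrics.output }}\n"
            "ArgoCD: {{ steps.check-argocd.output }}"
        ),
    },
]

if __name__ == "__main__":
    # Lists any placeholder that names a step defined later or not at all.
    print(unresolved_refs(steps))
```

Running the check before the first step executes would surface a misspelled step name up front rather than after the four data-gathering steps have already run.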