feat: Implement Phase 1 K8s agent orchestrator system

Core agent system for Raspberry Pi k0s cluster management:

Agents:
- k8s-orchestrator: Central task delegation and decision making
- k8s-diagnostician: Cluster health, logs, troubleshooting
- argocd-operator: GitOps deployments and rollbacks
- prometheus-analyst: Metrics queries and alert analysis
- git-operator: Manifest management and PR workflows

Workflows:
- cluster-health-check.yaml: Scheduled health assessment
- deploy-app.md: Application deployment guide
- pod-crashloop.yaml: Automated incident response

Skills:
- /cluster-status: Quick health overview
- /deploy: Deploy or update applications
- /diagnose: Investigate cluster issues

Configuration:
- Agent definitions with model assignments (Opus/Sonnet)
- Autonomy rules (safe/confirm/forbidden actions)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-26 11:25:11 -08:00
parent 216a95cec4
commit a80f714fc2
12 changed files with 1302 additions and 1 deletions
--- a/workflows/deploy/deploy-app.md
+++ b/workflows/deploy/deploy-app.md
@@ -0,0 +1,97 @@
+# Deploy Application Workflow
+
+A simple workflow for deploying new applications or updating existing ones.
+
+## When to use
+
+Use this workflow when:
+- Deploying a new application to the cluster
+- Updating an existing application's configuration
+- Rolling out a new version of an application
+
+## Steps
+
+### 1. Gather Requirements
+
+Ask the user for:
+- Application name
+- Container image and tag
+- Namespace (default: `default`)
+- Resource requirements (CPU/memory limits)
+- Exposed ports
+- Any special requirements (tolerations for Pi 3, etc.)
+
+### 2. Check Existing State
+
+Delegate to **argocd-operator** (haiku):
+- Check whether the application already exists in ArgoCD
+- If it exists, get its current sync status and deployed version
+
+Delegate to **k8s-diagnostician** (haiku):
+- If the application exists, check its current pod status
+- Check that the target namespace exists
+
+### 3. Create/Update Manifests
+
+Delegate to **git-operator** (sonnet):
+- Create or update deployment manifest
+- Create or update service manifest (if ports exposed)
+- Create or update kustomization.yaml
+- Include appropriate resource limits for Pi cluster:
+  ```yaml
+  resources:
+    requests:
+      memory: "64Mi"
+      cpu: "50m"
+    limits:
+      memory: "128Mi"
+      cpu: "200m"
+  ```
+- If targeting Pi 3, add tolerations:
+  ```yaml
+  tolerations:
+    - key: "node-type"
+      operator: "Equal"
+      value: "pi3"
+      effect: "NoSchedule"
+  ```
+
+### 4. Commit Changes
+
+Delegate to **git-operator** (sonnet):
+- Create feature branch: `deploy/<app-name>`
+- Commit with message: `feat: deploy <app-name>`
+- Push branch to origin
+- Create pull request
+
+**[CONFIRM]** User must approve the PR creation.
+
+### 5. Sync Application
+
+After PR is merged:
+
+Delegate to **argocd-operator** (sonnet):
+- Create ArgoCD application if new
+- Trigger sync for the application
+- Wait for sync to complete
+
+**[CONFIRM]** User must approve the sync operation.
+
+### 6. Verify Deployment
+
+Delegate to **k8s-diagnostician** (haiku):
+- Check that all pods are running
+- Check that no containers are stuck in restart loops
+- Verify resource usage is within limits
+
+Report final status to user.
+
+## Rollback
+
+If deployment fails:
+
+Delegate to **argocd-operator**:
+- Check application history
+- Propose rollback to previous version
+
+**[CONFIRM]** User must approve rollback.
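
Review note on `deploy-app.md`: step 3 could be sketched as a small manifest builder. This is only an illustration, not the git-operator's actual behavior. The function name `build_deployment`, the `app` label, and the single replica are assumptions. The resource values and the Pi 3 toleration are copied from the workflow text. The output is JSON, which `kubectl` also accepts, so the sketch needs nothing beyond the standard library.

```python
import copy
import json

# Defaults copied from step 3 of deploy-app.md.
DEFAULT_RESOURCES = {
    "requests": {"memory": "64Mi", "cpu": "50m"},
    "limits": {"memory": "128Mi", "cpu": "200m"},
}

# Toleration copied from step 3; only added when targeting Pi 3 nodes.
PI3_TOLERATION = {
    "key": "node-type",
    "operator": "Equal",
    "value": "pi3",
    "effect": "NoSchedule",
}


def build_deployment(name, image, namespace="default", ports=(), target_pi3=False):
    """Return a Deployment manifest as a plain dict (hypothetical helper)."""
    container = {
        "name": name,
        "image": image,
        "resources": copy.deepcopy(DEFAULT_RESOURCES),
    }
    if ports:
        container["ports"] = [{"containerPort": port} for port in ports]

    pod_spec = {"containers": [container]}
    if target_pi3:
        pod_spec["tolerations"] = [dict(PI3_TOLERATION)]

    # The "app" label and replicas=1 are assumptions, not part of the workflow.
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "namespace": namespace, "labels": {"app": name}},
        "spec": {
            "replicas": 1,
            "selector": {"matchLabels": {"app": name}},
            "template": {
                "metadata": {"labels": {"app": name}},
                "spec": pod_spec,
            },
        },
    }


if __name__ == "__main__":
    manifest = build_deployment("demo", "example/demo:1.0", ports=[8080], target_pi3=True)
    print(json.dumps(manifest, indent=2))
```

Keeping the defaults in module-level constants means a reviewer can see at a glance which values come from the workflow and which were supplied by the user in step 1.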
--- a/workflows/health/cluster-health-check.yaml
+++ b/workflows/health/cluster-health-check.yaml
@@ -0,0 +1,79 @@
+name: cluster-health-check
+description: Comprehensive cluster health assessment
+version: "1.0"
+
+trigger:
+  - schedule: "0 */6 * * *"  # every 6 hours
+  - manual: true
+
+defaults:
+  model: sonnet
+
+steps:
+  - name: check-nodes
+    agent: k8s-diagnostician
+    model: haiku
+    task: |
+      Get node status for all nodes:
+      - Check node conditions (Ready, MemoryPressure, DiskPressure, PIDPressure)
+      - Report any nodes not in Ready state
+      - Check resource usage with kubectl top nodes
+    output: node_status
+
+  - name: check-pods
+    agent: k8s-diagnostician
+    model: haiku
+    task: |
+      Get pod status across all namespaces:
+      - Count pods by status (Running, Pending, Failed, CrashLoopBackOff)
+      - List any unhealthy pods with their namespace and reason
+      - Check for high restart counts (>5 in last hour)
+    output: pod_status
+
+  - name: check-metrics
+    agent: prometheus-analyst
+    model: haiku
+    task: |
+      Query key cluster metrics:
+      - Node CPU and memory usage (current and 1h average)
+      - Top 5 pods by CPU usage
+      - Top 5 pods by memory usage
+      - Any active firing alerts
+    output: metrics_summary
+
+  - name: check-argocd
+    agent: argocd-operator
+    model: haiku
+    task: |
+      Check ArgoCD application status:
+      - List all applications with sync and health status
+      - Report any apps that are OutOfSync or Degraded
+      - Note last sync time for each app
+    output: argocd_status
+
+  - name: analyze-and-report
+    agent: k8s-orchestrator
+    model: sonnet
+    task: |
+      Analyze the health check results and create a summary report:
+
+      Inputs:
+      - Node status: {{ steps.check-nodes.output }}
+      - Pod status: {{ steps.check-pods.output }}
+      - Metrics: {{ steps.check-metrics.output }}
+      - ArgoCD: {{ steps.check-argocd.output }}
+
+      Create a report with:
+      1. Overall cluster health (Healthy/Degraded/Critical)
+      2. Summary table of key metrics
+      3. List of issues found (if any)
+      4. Recommended actions (mark as safe/confirm)
+
+      If issues are critical, propose immediate remediation steps.
+    output: health_report
+    confirm_if: actions_proposed
+
+outputs:
+  - health_report
+  - node_status
+  - pod_status
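
Review note on `cluster-health-check.yaml`: the `analyze-and-report` step depends on `{{ steps.<name>.output }}` placeholders being filled from earlier steps. The engine that does this is not part of the diff, so the following is a minimal sketch of one way the substitution might work. `render_task` and the regex are hypothetical. A regex is used instead of a template engine because hyphenated step names such as `check-nodes` would be parsed as a subtraction by stock Jinja2 attribute access.

```python
import re

# Matches placeholders of the form {{ steps.<step-name>.output }}.
# Step names may contain hyphens, as in the workflow file.
PLACEHOLDER = re.compile(r"\{\{\s*steps\.([\w-]+)\.output\s*\}\}")


def render_task(task, step_outputs):
    """Replace step-output placeholders in a task prompt (hypothetical helper).

    step_outputs maps a step name to the text that step produced.
    A reference to a step with no recorded output raises KeyError,
    so ordering mistakes surface before the prompt reaches an agent.
    """
    def substitute(match):
        step_name = match.group(1)
        if step_name not in step_outputs:
            raise KeyError(f"step {step_name!r} has not produced output yet")
        return str(step_outputs[step_name])

    return PLACEHOLDER.sub(substitute, task)


if __name__ == "__main__":
    task = (
        "Inputs:\n"
        "- Node status: {{ steps.check-nodes.output }}\n"
        "- Pod status: {{ steps.check-pods.output }}\n"
    )
    outputs = {"check-nodes": "all nodes Ready", "check-pods": "no unhealthy pods"}
    print(render_task(task, outputs))
```

Failing loudly on a missing step is a design choice for this sketch. The real engine may instead substitute an empty string or skip the step.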
--- a/workflows/incidents/pod-crashloop.yaml
+++ b/workflows/incidents/pod-crashloop.yaml
@@ -0,0 +1,140 @@
+name: pod-crashloop-remediation
+description: Diagnose and remediate pods in CrashLoopBackOff
+version: "1.0"
+
+trigger:
+  - alert:
+      match:
+        alertname: KubePodCrashLooping
+  - manual: true
+    inputs:
+      - name: namespace
+        description: Pod namespace
+        required: true
+      - name: pod
+        description: Pod name (or prefix)
+        required: true
+
+defaults:
+  model: sonnet
+
+steps:
+  - name: identify-pod
+    agent: k8s-diagnostician
+    model: haiku
+    task: |
+      Identify the crashing pod:
+      - Namespace: {{ inputs.namespace | default(alert.labels.namespace) }}
+      - Pod: {{ inputs.pod | default(alert.labels.pod) }}
+
+      Get pod details:
+      - Current status and restart count
+      - Last restart reason
+      - Container statuses
+    output: pod_info
+
+  - name: analyze-logs
+    agent: k8s-diagnostician
+    model: sonnet
+    task: |
+      Analyze pod logs for crash cause:
+      - Get current container logs (last 50 lines)
+      - Get previous container logs if available
+      - Look for error patterns:
+        - OOMKilled (exit code 137)
+        - Segfault (exit code 139)
+        - Application errors
+        - Configuration errors
+        - Dependency failures
+
+      Pod info: {{ steps.identify-pod.output }}
+    output: log_analysis
+
+  - name: check-resources
+    agent: prometheus-analyst
+    model: haiku
+    task: |
+      Check resource usage before crash:
+      - Memory usage trend (last 30 min)
+      - CPU usage trend (last 30 min)
+      - Compare to resource limits
+
+      Pod: {{ steps.identify-pod.output.pod_name }}
+      Namespace: {{ steps.identify-pod.output.namespace }}
+    output: resource_analysis
+
+  - name: check-dependencies
+    agent: k8s-diagnostician
+    model: haiku
+    task: |
+      Check pod dependencies:
+      - ConfigMaps and Secrets exist?
+      - PVCs bound?
+      - Service account valid?
+      - Init containers completed?
+
+      Pod info: {{ steps.identify-pod.output }}
+    output: dependency_check
+
+  - name: diagnose-and-recommend
+    agent: k8s-orchestrator
+    model: sonnet
+    task: |
+      Analyze all findings and determine root cause:
+
+      Evidence:
+      - Pod info: {{ steps.identify-pod.output }}
+      - Log analysis: {{ steps.analyze-logs.output }}
+      - Resource usage: {{ steps.check-resources.output }}
+      - Dependencies: {{ steps.check-dependencies.output }}
+
+      Determine:
+      1. Root cause (OOM, config error, dependency, application bug, etc.)
+      2. Severity (auto-recoverable, needs intervention, critical)
+      3. Recommended actions
+
+      Action classification:
+      - [SAFE] Restart pod, clear stuck jobs
+      - [CONFIRM] Increase resources, modify config
+      - [FORBIDDEN] Delete PVC, delete namespace
+    output: diagnosis
+
+  - name: apply-safe-remediation
+    condition: "{{ steps.diagnose-and-recommend.output.has_safe_action }}"
+    agent: k8s-diagnostician
+    model: haiku
+    task: |
+      Apply safe remediation actions:
+      {{ steps.diagnose-and-recommend.output.safe_actions }}
+
+      Report what was done.
+    output: safe_actions_result
+
+  - name: propose-confirm-actions
+    condition: "{{ steps.diagnose-and-recommend.output.has_confirm_action }}"
+    agent: k8s-orchestrator
+    model: haiku
+    task: |
+      Present actions requiring confirmation:
+
+      {{ steps.diagnose-and-recommend.output.confirm_actions }}
+
+      For each action, explain:
+      - What will change
+      - Potential impact
+      - Rollback option
+    output: confirm_proposal
+    confirm: true
+
+outputs:
+  - diagnosis
+  - safe_actions_result
+  - confirm_proposal
+
+notifications:
+  on_complete:
+    summary: |
+      CrashLoop remediation for {{ steps.identify-pod.output.pod_name }}:
+      - Root cause: {{ steps.diagnose-and-recommend.output.root_cause }}
+      - Actions taken: {{ steps.apply-safe-remediation.output.actions | default('none') }}
+      - Pending approval: {{ steps.propose-confirm-actions.output | default('none') }}
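
Review note on `pod-crashloop.yaml`: the `diagnose-and-recommend` step sorts actions into `[SAFE]`, `[CONFIRM]`, and `[FORBIDDEN]`, and the two later steps are gated on that result. The sketch below shows one way such a gate could be enforced in code rather than left to the model. The action identifiers, `classify`, and `partition` are hypothetical names based on the examples in the task text. Treating unknown actions as `CONFIRM` is an assumption, not something the workflow states.

```python
from enum import Enum


class Autonomy(Enum):
    SAFE = "safe"
    CONFIRM = "confirm"
    FORBIDDEN = "forbidden"


# Hypothetical identifiers for the example actions named in the workflow.
ACTION_POLICY = {
    "restart-pod": Autonomy.SAFE,
    "clear-stuck-jobs": Autonomy.SAFE,
    "increase-resources": Autonomy.CONFIRM,
    "modify-config": Autonomy.CONFIRM,
    "delete-pvc": Autonomy.FORBIDDEN,
    "delete-namespace": Autonomy.FORBIDDEN,
}


def classify(action):
    """Look up an action's autonomy level.

    Assumption: anything not listed requires confirmation, so a new
    or misspelled action can never run unattended.
    """
    return ACTION_POLICY.get(action, Autonomy.CONFIRM)


def partition(actions):
    """Group proposed actions by autonomy level, preserving order."""
    buckets = {level: [] for level in Autonomy}
    for action in actions:
        buckets[classify(action)].append(action)
    return buckets


if __name__ == "__main__":
    proposed = ["restart-pod", "increase-resources", "delete-pvc", "scale-up"]
    for level, items in partition(proposed).items():
        print(f"{level.value}: {items}")
```

With a gate like this, `apply-safe-remediation` would only ever receive the `SAFE` bucket, `propose-confirm-actions` would receive the `CONFIRM` bucket, and the `FORBIDDEN` bucket would be reported but never executed.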