feat: Implement Phase 1 K8s agent orchestrator system

Core agent system for Raspberry Pi k0s cluster management:

Agents:
- k8s-orchestrator: Central task delegation and decision making
- k8s-diagnostician: Cluster health, logs, troubleshooting
- argocd-operator: GitOps deployments and rollbacks
- prometheus-analyst: Metrics queries and alert analysis
- git-operator: Manifest management and PR workflows

Workflows:
- cluster-health-check.yaml: Scheduled health assessment
- deploy-app.md: Application deployment guide
- pod-crashloop.yaml: Automated incident response

Skills:
- /cluster-status: Quick health overview
- /deploy: Deploy or update applications
- /diagnose: Investigate cluster issues

Configuration:
- Agent definitions with model assignments (Opus/Sonnet)
- Autonomy rules (safe/confirm/forbidden actions)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
OpenCode Test
2025-12-26 11:25:11 -08:00
parent 216a95cec4
commit a80f714fc2
12 changed files with 1302 additions and 1 deletions


@@ -0,0 +1,97 @@
# Deploy Application Workflow

A simple workflow for deploying new applications or updating existing ones.

## When to use

Use this workflow when:

- Deploying a new application to the cluster
- Updating an existing application's configuration
- Rolling out a new version of an application

## Steps

### 1. Gather Requirements

Ask the user for:

- Application name
- Container image and tag
- Namespace (default: `default`)
- Resource requirements (CPU/memory limits)
- Exposed ports
- Any special requirements (tolerations for Pi 3, etc.)

### 2. Check Existing State

Delegate to **argocd-operator** (haiku):

- Check if application already exists in ArgoCD
- If exists, get current status and version

Delegate to **k8s-diagnostician** (haiku):

- If exists, check current pod status
- Check namespace exists
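
As a rough illustration of what the two checks above need to report back, the sketch below reduces an ArgoCD Application object and a list of namespace names to a small summary. It assumes the agents have already fetched the Application as JSON (for example with `argocd app get <name> -o json`) and the namespace names from the cluster. The function name `summarize_existing_state` and the keys of the returned dictionary are hypothetical and not part of this workflow.

```python
from typing import Any, Optional


def summarize_existing_state(
    app: Optional[dict[str, Any]],
    namespaces: list[str],
    target_namespace: str = "default",
) -> dict[str, Any]:
    """Summarize whether the application and its namespace already exist.

    `app` is the ArgoCD Application object as a dict, or None if ArgoCD
    does not know the application yet.
    """
    summary: dict[str, Any] = {
        "app_exists": app is not None,
        "namespace_exists": target_namespace in namespaces,
        "sync_status": None,
        "health_status": None,
        "revision": None,
    }
    if app is not None:
        status = app.get("status", {})
        # status.sync and status.health are standard Application status fields.
        summary["sync_status"] = status.get("sync", {}).get("status")
        summary["health_status"] = status.get("health", {}).get("status")
        summary["revision"] = status.get("sync", {}).get("revision")
    return summary
```

A new deployment would show `app_exists` as `False`, which tells the later steps to create manifests rather than update them.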

### 3. Create/Update Manifests

Delegate to **git-operator** (sonnet):

- Create or update deployment manifest
- Create or update service manifest (if ports exposed)
- Create or update kustomization.yaml
- Include appropriate resource limits for Pi cluster:
  ```yaml
  resources:
    requests:
      memory: "64Mi"
      cpu: "50m"
    limits:
      memory: "128Mi"
      cpu: "200m"
  ```
- If targeting Pi 3, add tolerations:
  ```yaml
  tolerations:
    - key: "node-type"
      operator: "Equal"
      value: "pi3"
      effect: "NoSchedule"
  ```
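
To show how the resource limits and the Pi 3 toleration above fit into a full manifest, here is a minimal sketch that assembles a Deployment as a Python dictionary. The helper name `build_deployment`, the single-replica default, the `app` label, and the example image are assumptions made for illustration. Only the resource values and the toleration are taken from this workflow.

```python
import copy
import json
from typing import Any, Optional

# Values copied from the workflow text above.
PI_RESOURCES = {
    "requests": {"memory": "64Mi", "cpu": "50m"},
    "limits": {"memory": "128Mi", "cpu": "200m"},
}
PI3_TOLERATION = {
    "key": "node-type",
    "operator": "Equal",
    "value": "pi3",
    "effect": "NoSchedule",
}


def build_deployment(
    name: str,
    image: str,
    namespace: str = "default",
    port: Optional[int] = None,
    target_pi3: bool = False,
) -> dict[str, Any]:
    """Build an apps/v1 Deployment with the Pi cluster defaults applied."""
    container: dict[str, Any] = {
        "name": name,
        "image": image,
        "resources": copy.deepcopy(PI_RESOURCES),
    }
    if port is not None:
        container["ports"] = [{"containerPort": port}]

    pod_spec: dict[str, Any] = {"containers": [container]}
    if target_pi3:
        # Only pods meant for Pi 3 nodes tolerate the node-type=pi3 taint.
        pod_spec["tolerations"] = [dict(PI3_TOLERATION)]

    labels = {"app": name}  # hypothetical labelling scheme
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "namespace": namespace, "labels": dict(labels)},
        "spec": {
            "replicas": 1,
            "selector": {"matchLabels": dict(labels)},
            "template": {"metadata": {"labels": dict(labels)}, "spec": pod_spec},
        },
    }


if __name__ == "__main__":
    manifest = build_deployment(
        "demo-app", "example.invalid/demo-app:1.0", port=8080, target_pi3=True
    )
    # JSON is also valid YAML, so this output can be saved as a manifest file.
    print(json.dumps(manifest, indent=2))
```

Note that a toleration only allows scheduling onto tainted Pi 3 nodes. Pinning a workload to them would additionally need a node selector or affinity rule, which this workflow does not describe.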

### 4. Commit Changes

Delegate to **git-operator** (sonnet):

- Create feature branch: `deploy/<app-name>`
- Commit with message: `feat: deploy <app-name>`
- Push branch to origin
- Create pull request

**[CONFIRM]** User must approve the PR creation.
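
The branch and commit naming convention above can be derived mechanically from the application name. The sketch below does that and, as an added assumption, rejects names that are not valid DNS-1123 labels, since the same name is likely to be reused for Kubernetes resources. The function name `deploy_git_refs` is hypothetical.

```python
import re

# DNS-1123 label: lowercase alphanumerics and '-', starting and ending with
# an alphanumeric character, at most 63 characters long.
_DNS_LABEL = re.compile(r"[a-z0-9]([-a-z0-9]{0,61}[a-z0-9])?")


def deploy_git_refs(app_name: str) -> dict[str, str]:
    """Return the branch name and commit message for deploying `app_name`."""
    if not _DNS_LABEL.fullmatch(app_name):
        raise ValueError(f"invalid application name: {app_name!r}")
    return {
        "branch": f"deploy/{app_name}",
        "commit_message": f"feat: deploy {app_name}",
    }
```

Validating the name before creating the branch keeps a malformed name from reaching the manifests and failing later during the sync.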

### 5. Sync Application

After the PR is merged:

Delegate to **argocd-operator** (sonnet):

- Create ArgoCD application if new
- Trigger sync for the application
- Wait for sync to complete

**[CONFIRM]** User must approve the sync operation.
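
The **[CONFIRM]** markers in this workflow, together with the safe/confirm/forbidden autonomy rules named in the commit message, suggest a simple gate in front of every action. The following is a minimal sketch of such a gate, not the orchestrator's actual implementation. `ActionClass`, `run_gated`, and the returned status strings are hypothetical names.

```python
from enum import Enum
from typing import Callable


class ActionClass(Enum):
    SAFE = "safe"            # may run without asking
    CONFIRM = "confirm"      # needs explicit user approval
    FORBIDDEN = "forbidden"  # never executed by an agent


def run_gated(
    action_class: ActionClass,
    action: Callable[[], str],
    approve: Callable[[], bool],
) -> str:
    """Run `action` only if its class allows it and, where needed, the user approves."""
    if action_class is ActionClass.FORBIDDEN:
        return "refused: forbidden action"
    if action_class is ActionClass.CONFIRM and not approve():
        return "skipped: not approved"
    return action()
```

For the sync step, `action` would trigger the ArgoCD sync and `approve` would prompt the user. The approval callback is never consulted for safe or forbidden actions.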

### 6. Verify Deployment

Delegate to **k8s-diagnostician** (haiku):

- Check that pods are running
- Check that no containers are in a restart loop
- Verify resource usage is within limits

Report final status to user.
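
The first two verification checks can be expressed against the JSON that `kubectl get pods -o json` returns. The sketch below is one possible way to do it. The function name `find_unhealthy_pods` is hypothetical, and the restart threshold of 5 is borrowed from the health-check workflow in this commit. Note that `restartCount` is cumulative, so it only approximates a per-hour rate.

```python
from typing import Any


def find_unhealthy_pods(pod_list: dict[str, Any], max_restarts: int = 5) -> list[str]:
    """Return human-readable problems found in a Kubernetes PodList dict."""
    problems: list[str] = []
    for pod in pod_list.get("items", []):
        meta = pod.get("metadata", {})
        name = f'{meta.get("namespace", "default")}/{meta.get("name", "?")}'
        status = pod.get("status", {})

        phase = status.get("phase")
        if phase not in ("Running", "Succeeded"):
            problems.append(f"{name}: phase is {phase}")

        for cs in status.get("containerStatuses", []):
            container = f"{name}/{cs.get('name', '?')}"
            waiting = cs.get("state", {}).get("waiting") or {}
            if waiting.get("reason") == "CrashLoopBackOff":
                problems.append(f"{container}: CrashLoopBackOff")
            restarts = cs.get("restartCount", 0)
            if restarts > max_restarts:
                problems.append(f"{container}: {restarts} restarts (threshold {max_restarts})")
    return problems
```

An empty result means the deployment passed these two checks. Resource usage still has to be compared against the limits separately, for example from metrics.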

## Rollback

If the deployment fails:

Delegate to **argocd-operator**:

- Check application history
- Propose rollback to previous version

**[CONFIRM]** User must approve rollback.
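
Choosing the rollback target from the application history could look roughly like the sketch below. It assumes the history has been read from the Application's `status.history` list, whose entries carry an `id` and a `revision`, and it proposes the most recent entry whose revision differs from the current one. The function name `pick_rollback_target` is hypothetical.

```python
from typing import Any, Optional


def pick_rollback_target(history: list[dict[str, Any]]) -> Optional[dict[str, Any]]:
    """Return the latest history entry with a different revision than the current one."""
    if len(history) < 2:
        return None  # nothing earlier to roll back to
    ordered = sorted(history, key=lambda entry: entry["id"])
    current = ordered[-1]
    for entry in reversed(ordered[:-1]):
        if entry.get("revision") != current.get("revision"):
            return entry
    return None
```

The `id` of the returned entry is what would be passed to `argocd app rollback <app> <id>` once the user has approved.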


@@ -0,0 +1,79 @@
name: cluster-health-check
description: Comprehensive cluster health assessment
version: "1.0"

trigger:
  - schedule: "0 */6 * * *"  # every 6 hours
  - manual: true

defaults:
  model: sonnet

steps:
  - name: check-nodes
    agent: k8s-diagnostician
    model: haiku
    task: |
      Get node status for all nodes:
      - Check node conditions (Ready, MemoryPressure, DiskPressure, PIDPressure)
      - Report any nodes not in Ready state
      - Check resource usage with kubectl top nodes
    output: node_status

  - name: check-pods
    agent: k8s-diagnostician
    model: haiku
    task: |
      Get pod status across all namespaces:
      - Count pods by status (Running, Pending, Failed, CrashLoopBackOff)
      - List any unhealthy pods with their namespace and reason
      - Check for high restart counts (>5 in last hour)
    output: pod_status

  - name: check-metrics
    agent: prometheus-analyst
    model: haiku
    task: |
      Query key cluster metrics:
      - Node CPU and memory usage (current and 1h average)
      - Top 5 pods by CPU usage
      - Top 5 pods by memory usage
      - Any active firing alerts
    output: metrics_summary

  - name: check-argocd
    agent: argocd-operator
    model: haiku
    task: |
      Check ArgoCD application status:
      - List all applications with sync and health status
      - Report any apps that are OutOfSync or Degraded
      - Note last sync time for each app
    output: argocd_status

  - name: analyze-and-report
    agent: k8s-orchestrator
    model: sonnet
    task: |
      Analyze the health check results and create a summary report:

      Inputs:
      - Node status: {{ steps.check-nodes.output }}
      - Pod status: {{ steps.check-pods.output }}
      - Metrics: {{ steps.check-metrics.output }}
      - ArgoCD: {{ steps.check-argocd.output }}

      Create a report with:
      1. Overall cluster health (Healthy/Degraded/Critical)
      2. Summary table of key metrics
      3. List of issues found (if any)
      4. Recommended actions (mark as safe/confirm)

      If issues are critical, propose immediate remediation steps.
    output: health_report
    confirm_if: actions_proposed

outputs:
  - health_report
  - node_status
  - pod_status
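
For reference, the schedule `0 */6 * * *` in the workflow above fires at minute 0 of hours 0, 6, 12 and 18. The sketch below computes the next fire time for that specific "every N hours, on the hour" shape only. It is not a general cron parser, and it works in whatever time zone the given `datetime` uses, since the scheduler's time zone is not stated here. The function name `next_run` is hypothetical.

```python
from datetime import datetime, timedelta


def next_run(now: datetime, every_hours: int = 6) -> datetime:
    """Next fire time strictly after `now` for the schedule `0 */N * * *`."""
    # Start at the top of the next hour, then step forward to a matching hour.
    candidate = now.replace(minute=0, second=0, microsecond=0) + timedelta(hours=1)
    while candidate.hour % every_hours != 0:
        candidate += timedelta(hours=1)
    return candidate


if __name__ == "__main__":
    print(next_run(datetime(2025, 12, 26, 11, 25, 11)))
```

With a start time of 11:25:11 on 2025-12-26, the next scheduled health check falls at 12:00:00 the same day.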


@@ -0,0 +1,140 @@
name: pod-crashloop-remediation
description: Diagnose and remediate pods in CrashLoopBackOff
version: "1.0"

trigger:
  - alert:
      match:
        alertname: KubePodCrashLooping
  - manual: true

inputs:
  - name: namespace
    description: Pod namespace
    required: true
  - name: pod
    description: Pod name (or prefix)
    required: true

defaults:
  model: sonnet

steps:
  - name: identify-pod
    agent: k8s-diagnostician
    model: haiku
    task: |
      Identify the crashing pod:
      - Namespace: {{ inputs.namespace | default(alert.labels.namespace) }}
      - Pod: {{ inputs.pod | default(alert.labels.pod) }}

      Get pod details:
      - Current status and restart count
      - Last restart reason
      - Container statuses
    output: pod_info

  - name: analyze-logs
    agent: k8s-diagnostician
    model: sonnet
    task: |
      Analyze pod logs for crash cause:
      - Get current container logs (last 50 lines)
      - Get previous container logs if available
      - Look for error patterns:
        - OOMKilled (exit code 137)
        - Segfault (exit code 139)
        - Application errors
        - Configuration errors
        - Dependency failures

      Pod info: {{ steps.identify-pod.output }}
    output: log_analysis

  - name: check-resources
    agent: prometheus-analyst
    model: haiku
    task: |
      Check resource usage before crash:
      - Memory usage trend (last 30 min)
      - CPU usage trend (last 30 min)
      - Compare to resource limits

      Pod: {{ steps.identify-pod.output.pod_name }}
      Namespace: {{ steps.identify-pod.output.namespace }}
    output: resource_analysis

  - name: check-dependencies
    agent: k8s-diagnostician
    model: haiku
    task: |
      Check pod dependencies:
      - ConfigMaps and Secrets exist?
      - PVCs bound?
      - Service account valid?
      - Init containers completed?

      Pod info: {{ steps.identify-pod.output }}
    output: dependency_check

  - name: diagnose-and-recommend
    agent: k8s-orchestrator
    model: sonnet
    task: |
      Analyze all findings and determine root cause:

      Evidence:
      - Pod info: {{ steps.identify-pod.output }}
      - Log analysis: {{ steps.analyze-logs.output }}
      - Resource usage: {{ steps.check-resources.output }}
      - Dependencies: {{ steps.check-dependencies.output }}

      Determine:
      1. Root cause (OOM, config error, dependency, application bug, etc.)
      2. Severity (auto-recoverable, needs intervention, critical)
      3. Recommended actions

      Action classification:
      - [SAFE] Restart pod, clear stuck jobs
      - [CONFIRM] Increase resources, modify config
      - [FORBIDDEN] Delete PVC, delete namespace
    output: diagnosis

  - name: apply-safe-remediation
    condition: "{{ steps.diagnose-and-recommend.output.has_safe_action }}"
    agent: k8s-diagnostician
    model: haiku
    task: |
      Apply safe remediation actions:
      {{ steps.diagnose-and-recommend.output.safe_actions }}

      Report what was done.
    output: safe_actions_result

  - name: propose-confirm-actions
    condition: "{{ steps.diagnose-and-recommend.output.has_confirm_action }}"
    agent: k8s-orchestrator
    model: haiku
    task: |
      Present actions requiring confirmation:
      {{ steps.diagnose-and-recommend.output.confirm_actions }}

      For each action, explain:
      - What will change
      - Potential impact
      - Rollback option
    output: confirm_proposal
    confirm: true

outputs:
  - diagnosis
  - safe_actions_result
  - confirm_proposal

notifications:
  on_complete:
    summary: |
      CrashLoop remediation for {{ steps.identify-pod.output.pod_name }}:
      - Root cause: {{ steps.diagnose-and-recommend.output.root_cause }}
      - Actions taken: {{ steps.apply-safe-remediation.output.actions | default('none') }}
      - Pending approval: {{ steps.propose-confirm-actions.output | default('none') }}
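
The exit codes named in the `analyze-logs` step follow the usual convention that a code above 128 means the process was terminated by signal `code - 128`. So 137 corresponds to SIGKILL (9) and 139 to SIGSEGV (11). The sketch below decodes a container's last termination under that convention. The function name `describe_exit` and the wording of its messages are hypothetical, and the signal table covers only a few common Linux signal numbers. An exit code of 137 alone does not prove an out-of-memory kill, so the sketch prefers the `reason` field that Kubernetes reports in the container's terminated state.

```python
# Common Linux signal numbers; deliberately not exhaustive.
_SIGNALS = {6: "SIGABRT", 9: "SIGKILL", 11: "SIGSEGV", 15: "SIGTERM"}


def describe_exit(exit_code: int, reason: str = "") -> str:
    """Describe a container termination from its exit code and reported reason."""
    if reason == "OOMKilled":
        return "OOMKilled: terminated by the out-of-memory killer"
    if exit_code == 0:
        return "clean exit"
    if exit_code > 128:
        signum = exit_code - 128
        name = _SIGNALS.get(signum, f"signal {signum}")
        return f"terminated by {name}"
    return f"application error (exit code {exit_code})"
```

A diagnosis step could use this to separate resource problems, which are candidates for a **[CONFIRM]** increase of limits, from application or configuration errors that need a code or manifest fix.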