feat: Implement Phase 1 K8s agent orchestrator system

Core agent system for Raspberry Pi k0s cluster management:

Agents:
- k8s-orchestrator: Central task delegation and decision making
- k8s-diagnostician: Cluster health, logs, troubleshooting
- argocd-operator: GitOps deployments and rollbacks
- prometheus-analyst: Metrics queries and alert analysis
- git-operator: Manifest management and PR workflows

Workflows:
- cluster-health-check.yaml: Scheduled health assessment
- deploy-app.md: Application deployment guide
- pod-crashloop.yaml: Automated incident response

Skills:
- /cluster-status: Quick health overview
- /deploy: Deploy or update applications
- /diagnose: Investigate cluster issues

Configuration:
- Agent definitions with model assignments (Opus/Sonnet)
- Autonomy rules (safe/confirm/forbidden actions)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

97
workflows/deploy/deploy-app.md
Normal file
@@ -0,0 +1,97 @@
# Deploy Application Workflow

A simple workflow for deploying new applications or updating existing ones.

## When to use

Use this workflow when:
- Deploying a new application to the cluster
- Updating an existing application's configuration
- Rolling out a new version of an application

## Steps

### 1. Gather Requirements

Ask the user for:
- Application name
- Container image and tag
- Namespace (default: `default`)
- Resource requirements (CPU/memory limits)
- Exposed ports
- Any special requirements (tolerations for Pi 3, etc.)

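As an illustration of what a complete set of answers might look like, the requirements above could be recorded along these lines. This is a hypothetical sketch: the field names, the application name, the image, and the port are all placeholders, not a schema the agents are known to expect.

```yaml
# Hypothetical record of the gathered requirements (all names and values are placeholders).
app: example-app
image: registry.example.com/example-app:1.0.0
namespace: default          # the documented default
resources:
  requests:
    memory: "64Mi"
    cpu: "50m"
  limits:
    memory: "128Mi"
    cpu: "200m"
ports:
  - 8080
special:
  pi3_toleration: true      # assumed flag meaning "may run on Pi 3 nodes"
```
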
### 2. Check Existing State

Delegate to **argocd-operator** (haiku):
- Check if application already exists in ArgoCD
- If exists, get current status and version

Delegate to **k8s-diagnostician** (haiku):
- If exists, check current pod status
- Check namespace exists

### 3. Create/Update Manifests

Delegate to **git-operator** (sonnet):
- Create or update deployment manifest
- Create or update service manifest (if ports exposed)
- Create or update kustomization.yaml
- Include appropriate resource limits for Pi cluster:
  ```yaml
  resources:
    requests:
      memory: "64Mi"
      cpu: "50m"
    limits:
      memory: "128Mi"
      cpu: "200m"
  ```
- If targeting Pi 3, add tolerations:
  ```yaml
  tolerations:
    - key: "node-type"
      operator: "Equal"
      value: "pi3"
      effect: "NoSchedule"
  ```

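To show where the two fragments above sit in a full manifest, here is a minimal sketch of a Deployment and Service. The application name, image, and port numbers are placeholders chosen for illustration. The `resources` block belongs to each container, while `tolerations` belongs to the pod spec.

```yaml
# Illustrative sketch only: name, image, and ports are placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-app
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: example-app
  template:
    metadata:
      labels:
        app: example-app
    spec:
      containers:
        - name: example-app
          image: registry.example.com/example-app:1.0.0
          ports:
            - containerPort: 8080
          # Per-container resource settings from this workflow
          resources:
            requests:
              memory: "64Mi"
              cpu: "50m"
            limits:
              memory: "128Mi"
              cpu: "200m"
      # Pod-level toleration, only needed when targeting Pi 3 nodes
      tolerations:
        - key: "node-type"
          operator: "Equal"
          value: "pi3"
          effect: "NoSchedule"
---
apiVersion: v1
kind: Service
metadata:
  name: example-app
  namespace: default
spec:
  selector:
    app: example-app
  ports:
    - port: 80
      targetPort: 8080
```

Note that a toleration only permits scheduling onto nodes carrying the matching taint. It does not pin the pod to those nodes. Pinning would additionally need a node selector or node affinity.
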
### 4. Commit Changes

Delegate to **git-operator** (sonnet):
- Create feature branch: `deploy/<app-name>`
- Commit with message: `feat: deploy <app-name>`
- Push branch to origin
- Create pull request

**[CONFIRM]** User must approve the PR creation.

### 5. Sync Application

After PR is merged:

Delegate to **argocd-operator** (sonnet):
- Create ArgoCD application if new
- Trigger sync for the application
- Wait for sync to complete

**[CONFIRM]** User must approve the sync operation.

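For a new application, the ArgoCD application created in this step might look roughly like the sketch below. The repository URL, path, and names are placeholders, and the `argocd` namespace is an assumption about where ArgoCD is installed.

```yaml
# Illustrative sketch only: repoURL, path, and names are placeholders.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: example-app
  namespace: argocd        # assumed ArgoCD install namespace
spec:
  project: default
  source:
    repoURL: https://git.example.com/homelab/cluster-manifests.git
    targetRevision: main
    path: apps/example-app # directory holding kustomization.yaml
  destination:
    server: https://kubernetes.default.svc
    namespace: default
  # No syncPolicy.automated block, so syncs are triggered manually.
```

Leaving out an automated sync policy keeps each sync a deliberate action, which fits the **[CONFIRM]** gate above.
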
### 6. Verify Deployment

Delegate to **k8s-diagnostician** (haiku):
- Check pods are running
- Check no restart loops
- Verify resource usage is within limits

Report final status to user.

## Rollback

If deployment fails:

Delegate to **argocd-operator**:
- Check application history
- Propose rollback to previous version

**[CONFIRM]** User must approve rollback.
79
workflows/health/cluster-health-check.yaml
Normal file
@@ -0,0 +1,79 @@
name: cluster-health-check
description: Comprehensive cluster health assessment
version: "1.0"

trigger:
  - schedule: "0 */6 * * *" # every 6 hours
  - manual: true

defaults:
  model: sonnet

steps:
  - name: check-nodes
    agent: k8s-diagnostician
    model: haiku
    task: |
      Get node status for all nodes:
      - Check node conditions (Ready, MemoryPressure, DiskPressure, PIDPressure)
      - Report any nodes not in Ready state
      - Check resource usage with kubectl top nodes
    output: node_status

  - name: check-pods
    agent: k8s-diagnostician
    model: haiku
    task: |
      Get pod status across all namespaces:
      - Count pods by status (Running, Pending, Failed, CrashLoopBackOff)
      - List any unhealthy pods with their namespace and reason
      - Check for high restart counts (>5 in last hour)
    output: pod_status

  - name: check-metrics
    agent: prometheus-analyst
    model: haiku
    task: |
      Query key cluster metrics:
      - Node CPU and memory usage (current and 1h average)
      - Top 5 pods by CPU usage
      - Top 5 pods by memory usage
      - Any active firing alerts
    output: metrics_summary

  - name: check-argocd
    agent: argocd-operator
    model: haiku
    task: |
      Check ArgoCD application status:
      - List all applications with sync and health status
      - Report any apps that are OutOfSync or Degraded
      - Note last sync time for each app
    output: argocd_status

  - name: analyze-and-report
    agent: k8s-orchestrator
    model: sonnet
    task: |
      Analyze the health check results and create a summary report:

      Inputs:
      - Node status: {{ steps.check-nodes.output }}
      - Pod status: {{ steps.check-pods.output }}
      - Metrics: {{ steps.check-metrics.output }}
      - ArgoCD: {{ steps.check-argocd.output }}

      Create a report with:
      1. Overall cluster health (Healthy/Degraded/Critical)
      2. Summary table of key metrics
      3. List of issues found (if any)
      4. Recommended actions (mark as safe/confirm)

      If issues are critical, propose immediate remediation steps.
    output: health_report
    confirm_if: actions_proposed

outputs:
  - health_report
  - node_status
  - pod_status
140
workflows/incidents/pod-crashloop.yaml
Normal file
@@ -0,0 +1,140 @@
name: pod-crashloop-remediation
description: Diagnose and remediate pods in CrashLoopBackOff
version: "1.0"

trigger:
  - alert:
      match:
        alertname: KubePodCrashLooping
  - manual: true
    inputs:
      - name: namespace
        description: Pod namespace
        required: true
      - name: pod
        description: Pod name (or prefix)
        required: true

defaults:
  model: sonnet

steps:
  - name: identify-pod
    agent: k8s-diagnostician
    model: haiku
    task: |
      Identify the crashing pod:
      - Namespace: {{ inputs.namespace | default(alert.labels.namespace) }}
      - Pod: {{ inputs.pod | default(alert.labels.pod) }}

      Get pod details:
      - Current status and restart count
      - Last restart reason
      - Container statuses
    output: pod_info

  - name: analyze-logs
    agent: k8s-diagnostician
    model: sonnet
    task: |
      Analyze pod logs for crash cause:
      - Get current container logs (last 50 lines)
      - Get previous container logs if available
      - Look for error patterns:
        - OOMKilled (exit code 137)
        - Segfault (exit code 139)
        - Application errors
        - Configuration errors
        - Dependency failures

      Pod info: {{ steps.identify-pod.output }}
    output: log_analysis

  - name: check-resources
    agent: prometheus-analyst
    model: haiku
    task: |
      Check resource usage before crash:
      - Memory usage trend (last 30 min)
      - CPU usage trend (last 30 min)
      - Compare to resource limits

      Pod: {{ steps.identify-pod.output.pod_name }}
      Namespace: {{ steps.identify-pod.output.namespace }}
    output: resource_analysis

  - name: check-dependencies
    agent: k8s-diagnostician
    model: haiku
    task: |
      Check pod dependencies:
      - ConfigMaps and Secrets exist?
      - PVCs bound?
      - Service account valid?
      - Init containers completed?

      Pod info: {{ steps.identify-pod.output }}
    output: dependency_check

  - name: diagnose-and-recommend
    agent: k8s-orchestrator
    model: sonnet
    task: |
      Analyze all findings and determine root cause:

      Evidence:
      - Pod info: {{ steps.identify-pod.output }}
      - Log analysis: {{ steps.analyze-logs.output }}
      - Resource usage: {{ steps.check-resources.output }}
      - Dependencies: {{ steps.check-dependencies.output }}

      Determine:
      1. Root cause (OOM, config error, dependency, application bug, etc.)
      2. Severity (auto-recoverable, needs intervention, critical)
      3. Recommended actions

      Action classification:
      - [SAFE] Restart pod, clear stuck jobs
      - [CONFIRM] Increase resources, modify config
      - [FORBIDDEN] Delete PVC, delete namespace
    output: diagnosis

  - name: apply-safe-remediation
    condition: "{{ steps.diagnose-and-recommend.output.has_safe_action }}"
    agent: k8s-diagnostician
    model: haiku
    task: |
      Apply safe remediation actions:
      {{ steps.diagnose-and-recommend.output.safe_actions }}

      Report what was done.
    output: safe_actions_result

  - name: propose-confirm-actions
    condition: "{{ steps.diagnose-and-recommend.output.has_confirm_action }}"
    agent: k8s-orchestrator
    model: haiku
    task: |
      Present actions requiring confirmation:

      {{ steps.diagnose-and-recommend.output.confirm_actions }}

      For each action, explain:
      - What will change
      - Potential impact
      - Rollback option
    output: confirm_proposal
    confirm: true

outputs:
  - diagnosis
  - safe_actions_result
  - confirm_proposal

notifications:
  on_complete:
    summary: |
      CrashLoop remediation for {{ steps.identify-pod.output.pod_name }}:
      - Root cause: {{ steps.diagnose-and-recommend.output.root_cause }}
      - Actions taken: {{ steps.apply-safe-remediation.output.actions | default('none') }}
      - Pending approval: {{ steps.propose-confirm-actions.output | default('none') }}