feat: Implement Phase 1 K8s agent orchestrator system

Core agent system for Raspberry Pi k0s cluster management:

Agents:
- k8s-orchestrator: Central task delegation and decision making
- k8s-diagnostician: Cluster health, logs, troubleshooting
- argocd-operator: GitOps deployments and rollbacks
- prometheus-analyst: Metrics queries and alert analysis
- git-operator: Manifest management and PR workflows

Workflows:
- cluster-health-check.yaml: Scheduled health assessment
- deploy-app.md: Application deployment guide
- pod-crashloop.yaml: Automated incident response

Skills:
- /cluster-status: Quick health overview
- /deploy: Deploy or update applications
- /diagnose: Investigate cluster issues

Configuration:
- Agent definitions with model assignments (Opus/Sonnet)
- Autonomy rules (safe/confirm/forbidden actions)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
OpenCode Test
2025-12-26 11:25:11 -08:00
parent 216a95cec4
commit a80f714fc2
12 changed files with 1302 additions and 1 deletion

skills/cluster-status.md

@@ -0,0 +1,64 @@
# Cluster Status
Get a quick health overview of the Raspberry Pi Kubernetes cluster.
## Usage
```
/cluster-status
```
## What it does
Invokes the k8s-orchestrator to provide a comprehensive cluster health overview by delegating to specialized agents.
## Steps
1. **Node Health** (k8s-diagnostician, haiku)
   - Get all node statuses
   - Check for adverse node conditions (e.g. MemoryPressure, DiskPressure)
   - Report resource usage per node
2. **Active Alerts** (prometheus-analyst, haiku)
   - Query Alertmanager for firing alerts
   - List alert names and severities
3. **ArgoCD Status** (argocd-operator, haiku)
   - List all applications
   - Report sync status (Synced/OutOfSync)
   - Report health status (Healthy/Degraded)
4. **Summary** (k8s-orchestrator, sonnet)
   - Aggregate findings
   - Produce an overall health rating
   - Recommend actions if issues are found
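The following Python sketch illustrates the Node Health step. It reads the JSON printed by `kubectl get nodes -o json` and flags any node that is not Ready or that reports a pressure condition. `summarize_nodes` is a hypothetical helper and is not part of the skill. Per-node CPU and memory percentages would have to come from another source, such as `kubectl top nodes` (which requires metrics-server), so they are omitted here.

```python
import json

# Node conditions that signal trouble when their status is "True".
PRESSURE_CONDITIONS = ("MemoryPressure", "DiskPressure", "PIDPressure")


def summarize_nodes(nodes_json: str) -> list[dict]:
    """Summarize `kubectl get nodes -o json` output (hypothetical helper).

    Returns one dict per node with its Ready state and any active
    pressure conditions ("OK" when there are none).
    """
    summary = []
    for node in json.loads(nodes_json)["items"]:
        conditions = {
            c["type"]: c["status"] for c in node["status"].get("conditions", [])
        }
        # A Ready condition of "False" or "Unknown" counts as NotReady.
        ready = conditions.get("Ready") == "True"
        problems = [t for t in PRESSURE_CONDITIONS if conditions.get(t) == "True"]
        summary.append({
            "node": node["metadata"]["name"],
            "status": "Ready" if ready else "NotReady",
            "conditions": problems or ["OK"],
        })
    return summary


if __name__ == "__main__":
    sample = json.dumps({"items": [
        {"metadata": {"name": "pi5-1"},
         "status": {"conditions": [{"type": "Ready", "status": "True"}]}},
        {"metadata": {"name": "pi3"},
         "status": {"conditions": [{"type": "Ready", "status": "True"},
                                   {"type": "MemoryPressure", "status": "True"}]}},
    ]})
    for row in summarize_nodes(sample):
        print(row)
```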
## Output Format
```
Cluster Status: [Healthy/Degraded/Critical]

Nodes:
| Node   | Status | CPU  | Memory | Conditions |
|--------|--------|------|--------|------------|
| pi5-1  | Ready  | 45%  | 68%    | OK         |
| pi5-2  | Ready  | 32%  | 52%    | OK         |
| pi3    | Ready  | 78%  | 89%    | MemPressure|

Active Alerts: [count]
- [FIRING] AlertName - description

ArgoCD Apps:
| App       | Sync     | Health    |
|-----------|----------|-----------|
| homepage  | Synced   | Healthy   |
| api       | OutOfSync| Degraded  |

Recommendations:
- [action if needed]
```
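The template above does not say how the overall `[Healthy/Degraded/Critical]` rating is derived. One plausible rule, sketched below in Python, works as follows:

- Any NotReady node or critical-severity alert makes the cluster Critical.
- Any pressure condition, any firing alert, or any app that is not both Synced and Healthy makes it Degraded.

The rule, the `Findings` shape and `overall_status` are all assumptions, not the orchestrator's documented logic.

```python
from dataclasses import dataclass, field


@dataclass
class Findings:
    """Results gathered by the specialist agents (hypothetical shape)."""
    node_status: dict[str, str] = field(default_factory=dict)          # node -> "Ready"/"NotReady"
    node_pressure: dict[str, list[str]] = field(default_factory=dict)  # node -> active pressure conditions
    alert_severities: list[str] = field(default_factory=list)          # severity of each firing alert
    apps: dict[str, tuple[str, str]] = field(default_factory=dict)     # app -> (sync, health)


def overall_status(f: Findings) -> str:
    """Collapse findings into Healthy/Degraded/Critical (assumed rule)."""
    if any(status != "Ready" for status in f.node_status.values()):
        return "Critical"
    if "critical" in (s.lower() for s in f.alert_severities):
        return "Critical"
    degraded = (
        any(f.node_pressure.values())  # any node with a non-empty pressure list
        or bool(f.alert_severities)    # any firing alert at all
        or any(app != ("Synced", "Healthy") for app in f.apps.values())
    )
    return "Degraded" if degraded else "Healthy"


if __name__ == "__main__":
    findings = Findings(
        node_status={"pi5-1": "Ready", "pi5-2": "Ready", "pi3": "Ready"},
        node_pressure={"pi3": ["MemoryPressure"]},
        apps={"homepage": ("Synced", "Healthy"), "api": ("OutOfSync", "Degraded")},
    )
    print(overall_status(findings))  # Degraded
```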
## Options
- `--full` - Run the complete cluster-health-check workflow
- `--quick` - Just node and pod status (faster)

skills/deploy.md

@@ -0,0 +1,83 @@
# Deploy Application
Deploy a new application or update an existing one on the Raspberry Pi Kubernetes cluster.
## Usage
```
/deploy <app-name>
/deploy <app-name> --image <image:tag> [--namespace <namespace>] [--port <port>] [--pi3]
/deploy <app-name> --update
```
## What it does
Guides you through deploying an application using the GitOps workflow with ArgoCD.
## Interactive Mode
When run without all required arguments, the skill prompts for:
1. **Application name** - Name for the deployment
2. **Container image** - Full image path with tag
3. **Namespace** - Target namespace (defaults to `default`)
4. **Ports** - Exposed ports (comma-separated)
5. **Resources** - Memory/CPU requests and limits (Pi-optimized defaults provided)
6. **Pi 3 compatible?** - Whether to add tolerations so the pod can be scheduled on the Pi 3 node
## Quick Deploy
```
/deploy myapp --image ghcr.io/user/myapp:latest --namespace apps --port 8080
```
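The flags shown in the usage lines and examples on this page are `--image`, `--namespace`, `--port`, `--update` and `--pi3`. The hypothetical Python sketch below parses them with the standard `argparse` module. Two choices here are assumptions: `--port` accepts a comma-separated list to mirror the interactive prompt, and an unset `--image` or `--port` is reported in a `missing` list. In interactive mode, those missing values would be asked for instead.

```python
import argparse
import shlex


def build_parser() -> argparse.ArgumentParser:
    """Parser for `/deploy` arguments (hypothetical; mirrors this page's examples)."""
    p = argparse.ArgumentParser(prog="/deploy")
    p.add_argument("app_name")
    p.add_argument("--image", help="full image path with tag")
    p.add_argument("--namespace", default="default")
    p.add_argument("--port", default="", help="one or more ports, comma-separated")
    p.add_argument("--update", action="store_true", help="update an existing app")
    p.add_argument("--pi3", action="store_true", help="add Pi 3 tolerations")
    return p


def parse_deploy(command_line: str) -> dict:
    """Parse the text after `/deploy` into a dict of options."""
    opts = vars(build_parser().parse_args(shlex.split(command_line)))
    # "8080,9090" -> [8080, 9090]; "" -> []
    opts["port"] = [int(p) for p in opts["port"].split(",") if p.strip()]
    # Values not given on the command line, which interactive mode would ask for.
    opts["missing"] = [name for name in ("image", "port") if not opts[name]]
    return opts


if __name__ == "__main__":
    print(parse_deploy("myapp --image ghcr.io/user/myapp:latest --namespace apps --port 8080"))
```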
## Steps
1. **Check existing state** - See whether the app already exists and, if so, its current status
2. **Generate manifests** - Create the Deployment, Service, and kustomization
3. **Create PR** - Push the manifests to the GitOps repo and open a PR
4. **Sync** - After the PR is merged, trigger an ArgoCD sync
5. **Verify** - Confirm pods are running
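The following Python sketch illustrates the "Generate manifests" step. It builds a Deployment, a Service and a kustomization as plain dicts, one per file.

- **Container resources:** these use the standard-workload defaults listed on this page.
- **Assumptions:** the `app` label, the single replica, the file names and the `generate_manifests` name are all hypothetical.
- **Service ports:** each port gets a name because Kubernetes requires names when a Service exposes more than one port.
- **Output format:** JSON is valid YAML, so the output could be saved as `.yaml` files. A real implementation would more likely emit YAML through a library such as PyYAML.

```python
import json


def generate_manifests(name: str, image: str, namespace: str = "default",
                       ports: tuple[int, ...] = (80,)) -> dict[str, dict]:
    """Build Deployment, Service and kustomization for one app (hypothetical)."""
    labels = {"app": name}
    container = {
        "name": name,
        "image": image,
        "ports": [{"containerPort": p} for p in ports],
        # Standard-workload defaults from "Resource Defaults (Pi-optimized)".
        "resources": {
            "requests": {"memory": "64Mi", "cpu": "50m"},
            "limits": {"memory": "128Mi", "cpu": "200m"},
        },
    }
    deployment = {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "namespace": namespace, "labels": labels},
        "spec": {
            "replicas": 1,
            "selector": {"matchLabels": labels},
            "template": {"metadata": {"labels": labels},
                         "spec": {"containers": [container]}},
        },
    }
    service = {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "selector": labels,
            # Named ports are required when a Service has more than one port.
            "ports": [{"name": f"port-{p}", "port": p, "targetPort": p} for p in ports],
        },
    }
    kustomization = {
        "apiVersion": "kustomize.config.k8s.io/v1beta1",
        "kind": "Kustomization",
        "resources": ["deployment.yaml", "service.yaml"],
    }
    return {"deployment.yaml": deployment, "service.yaml": service,
            "kustomization.yaml": kustomization}


if __name__ == "__main__":
    for filename, body in generate_manifests("homepage", "nginx:alpine", "web").items():
        print(f"# {filename}\n{json.dumps(body, indent=2)}")
```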
## Resource Defaults (Pi-optimized)
```yaml
# Standard workload
requests:
  memory: "64Mi"
  cpu: "50m"
limits:
  memory: "128Mi"
  cpu: "200m"
---
# Lightweight (Pi 3 compatible)
requests:
  memory: "32Mi"
  cpu: "25m"
limits:
  memory: "64Mi"
  cpu: "100m"
```
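Each profile is meant to be copied into a container spec, so it is worth checking that no request exceeds its limit, because the API server rejects such a spec. The hypothetical Python helper below does this check for both profiles. It parses only the two quantity forms used here: binary memory suffixes such as `Mi`, and millicore CPU values such as `50m`. It is not a full implementation of the Kubernetes quantity grammar.

```python
# Profiles copied from the YAML above.
PROFILES = {
    "standard": {"requests": {"memory": "64Mi", "cpu": "50m"},
                 "limits": {"memory": "128Mi", "cpu": "200m"}},
    "lightweight": {"requests": {"memory": "32Mi", "cpu": "25m"},
                    "limits": {"memory": "64Mi", "cpu": "100m"}},
}

# Binary suffixes only; decimal (k, M, G) and exponent forms are not handled.
_BINARY = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30}


def parse_quantity(value: str) -> float:
    """Convert '64Mi' to bytes or '50m' to cores (subset of Kubernetes quantities)."""
    for suffix, factor in _BINARY.items():
        if value.endswith(suffix):
            return float(value[:-len(suffix)]) * factor
    if value.endswith("m"):
        return float(value[:-1]) / 1000  # millicores -> cores
    return float(value)


def requests_within_limits(profile: dict) -> bool:
    """True if every request is <= the matching limit."""
    return all(
        parse_quantity(req) <= parse_quantity(profile["limits"][resource])
        for resource, req in profile["requests"].items()
    )


if __name__ == "__main__":
    for name, profile in PROFILES.items():
        print(name, requests_within_limits(profile))
```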
## Examples
### Deploy new app
```
/deploy homepage --image nginx:alpine --port 80 --namespace web
```
### Update existing app
```
/deploy api --update --image api:v2.0.0
```
### Deploy to Pi 3
```
/deploy lightweight-app --image app:latest --pi3
```
## Confirmation Points
- **[CONFIRM]** Creating PR in GitOps repo
- **[CONFIRM]** Syncing ArgoCD application
- **[CONFIRM]** Rollback if deployment fails
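The confirmation points above, together with the safe/confirm/forbidden autonomy rules mentioned in the commit message, suggest a simple gate in front of every mutating action. The Python sketch below is a hypothetical model of such a gate. The action names and their tiers are illustrative assumptions, and the `approve` callback stands in for however the skill actually asks the user. Unknown actions are treated as forbidden, so anything not explicitly classified fails closed.

```python
from enum import Enum
from typing import Callable


class Tier(Enum):
    SAFE = "safe"            # run without asking
    CONFIRM = "confirm"      # run only after explicit approval
    FORBIDDEN = "forbidden"  # never run


# Hypothetical action -> tier mapping; the first three mirror the [CONFIRM] points above.
RULES = {
    "create_pr": Tier.CONFIRM,
    "argocd_sync": Tier.CONFIRM,
    "argocd_rollback": Tier.CONFIRM,
    "get_pods": Tier.SAFE,
    "delete_namespace": Tier.FORBIDDEN,
}


def gate(action: str, approve: Callable[[str], bool]) -> bool:
    """Return True if `action` may run; unknown actions are treated as forbidden."""
    tier = RULES.get(action, Tier.FORBIDDEN)
    if tier is Tier.SAFE:
        return True
    if tier is Tier.CONFIRM:
        return approve(action)
    return False


if __name__ == "__main__":
    ask = lambda action: input(f"Allow {action}? [y/N] ").strip().lower() == "y"
    print(gate("get_pods", ask))
```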

skills/diagnose.md

@@ -0,0 +1,124 @@
# Diagnose Issue
Investigate and diagnose problems in the Raspberry Pi Kubernetes cluster.
## Usage
```
/diagnose <issue-description>
/diagnose pod <pod-name> -n <namespace>
/diagnose app <argocd-app-name>
/diagnose node <node-name>
```
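The four usage forms can be told apart by their first word. The hypothetical Python sketch below maps the text after `/diagnose` onto a target. `Target` and `parse_diagnose` are illustrative names. A free-text description that happens to begin with `pod`, `app` or `node` would be misread unless it is quoted, which is one reason the general-issue example quotes its text.

```python
import shlex
from typing import NamedTuple, Optional


class Target(NamedTuple):
    kind: str                        # "issue", "pod", "app" or "node"
    name: str                        # issue text or resource name
    namespace: Optional[str] = None  # only used by the pod form


def parse_diagnose(command_line: str) -> Target:
    """Map the text after `/diagnose` onto one of the four forms (hypothetical)."""
    tokens = shlex.split(command_line)
    if len(tokens) >= 2 and tokens[0] in ("pod", "app", "node"):
        kind, name, rest = tokens[0], tokens[1], tokens[2:]
        namespace = None
        if kind == "pod" and "-n" in rest:
            i = rest.index("-n")
            namespace = rest[i + 1] if i + 1 < len(rest) else None
        return Target(kind, name, namespace)
    # Anything else, including a quoted sentence, is a free-text issue.
    return Target("issue", " ".join(tokens))


if __name__ == "__main__":
    print(parse_diagnose("pod myapp-7d9f8b6c5-x2k4m -n production"))
```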
## What it does
Invokes the k8s-orchestrator to investigate issues by coordinating multiple specialist agents.
## Diagnosis Types
### General Issue
```
/diagnose "my app is returning 503 errors"
```
The orchestrator will:
1. Identify relevant resources
2. Check pod status and logs
3. Query relevant metrics
4. Analyze ArgoCD sync state
5. Provide diagnosis and recommendations
### Pod Diagnosis
```
/diagnose pod myapp-7d9f8b6c5-x2k4m -n production
```
Focuses on:
- Pod status and events
- Container logs (current and previous)
- Resource usage vs limits
- Restart history
- Related alerts
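For the restart-history part of a pod diagnosis, the relevant fields are in each entry of the pod's `status.containerStatuses`: `restartCount`, `state.waiting.reason` and `lastState.terminated`. Logs from the previous container instance come from `kubectl logs <pod> -n <namespace> --previous`. The hypothetical Python helper below turns `kubectl get pod <pod> -n <namespace> -o json` output into short findings. A last termination reason of `OOMKilled` is the usual hint that the memory limit is too low.

```python
import json


def pod_findings(pod_json: str) -> list[str]:
    """Summarize restarts and waiting/termination reasons for one pod (hypothetical)."""
    pod = json.loads(pod_json)
    findings = [f"phase={pod['status'].get('phase', 'Unknown')}"]
    for cs in pod["status"].get("containerStatuses", []):
        name = cs["name"]
        findings.append(f"{name}: restarts={cs.get('restartCount', 0)}")
        # e.g. CrashLoopBackOff, ImagePullBackOff
        waiting = cs.get("state", {}).get("waiting")
        if waiting:
            findings.append(f"{name}: waiting ({waiting.get('reason', '?')})")
        # How the previous instance ended, e.g. OOMKilled or Error
        last = cs.get("lastState", {}).get("terminated")
        if last:
            findings.append(
                f"{name}: last exit {last.get('exitCode')} ({last.get('reason', '?')})"
            )
    return findings
```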
### ArgoCD App Diagnosis
```
/diagnose app homepage
```
Focuses on:
- Sync status and history
- Health status of resources
- Diff between desired and live state
- Recent sync errors
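Argo CD records this state on the `Application` resource itself, in these fields:

- `status.sync.status`: the app's sync status
- `status.health.status`: the app's health status
- `status.resources`: per-resource entries
- `status.operationState`: the last sync attempt
- `status.conditions`: errors such as `ComparisonError`

The desired-versus-live diff and the sync history come from `argocd app diff <name>` and `argocd app history <name>`. The hypothetical Python helper below reads `kubectl get application <name> -n argocd -o json` output. The `argocd` namespace is the default install location and is an assumption for this cluster.

```python
import json


def app_findings(app_json: str) -> dict:
    """Extract sync/health details from an Argo CD Application (hypothetical helper)."""
    status = json.loads(app_json).get("status", {})
    # Resources without a health entry (e.g. ConfigMaps) are skipped.
    unhealthy = [
        f"{r.get('kind')}/{r.get('name')}: {r['health']['status']}"
        for r in status.get("resources", [])
        if r.get("health", {}).get("status") not in (None, "Healthy")
    ]
    op = status.get("operationState", {})
    return {
        "sync": status.get("sync", {}).get("status", "Unknown"),
        "health": status.get("health", {}).get("status", "Unknown"),
        "unhealthy_resources": unhealthy,
        "last_operation": (op.get("phase"), op.get("message")),
        "conditions": [c.get("type") for c in status.get("conditions", [])],
    }
```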
### Node Diagnosis
```
/diagnose node pi5-1
```
Focuses on:
- Node conditions
- Resource pressure
- Running pods count
- System events
- Disk and network status
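The running-pod count is most useful next to the node's pod capacity, reported in `status.allocatable.pods`. Pods on one node can also be listed with `kubectl get pods -A --field-selector spec.nodeName=<node>`. The hypothetical Python helper below combines `kubectl get nodes -o json` and `kubectl get pods -A -o json` output to compute both numbers. `pods_on_node` is an illustrative name.

```python
import json


def pods_on_node(node_name: str, nodes_json: str, pods_json: str) -> tuple[int, int]:
    """Return (running pods on the node, allocatable pod slots) (hypothetical)."""
    nodes = {n["metadata"]["name"]: n for n in json.loads(nodes_json)["items"]}
    # allocatable.pods is reported as a string, e.g. "110"
    allocatable = int(nodes[node_name]["status"]["allocatable"]["pods"])
    running = sum(
        1 for p in json.loads(pods_json)["items"]
        if p["spec"].get("nodeName") == node_name
        and p["status"].get("phase") == "Running"
    )
    return running, allocatable
```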
## Investigation Flow
```
      User describes issue
               │
               ▼
      ┌─────────────────┐
      │ k8s-orchestrator│ ─── Analyze issue, plan investigation
      └────────┬────────┘
   ┌───────┬───┴───┬───────┐
   ▼       ▼       ▼       ▼
┌──────┐┌──────┐┌──────┐┌──────┐
│diag- ││argo- ││prom- ││git-  │
│nosti-││cd-   ││etheus││opera-│
│cian  ││oper- ││analy-││tor   │
│      ││ator  ││st    ││      │
└──┬───┘└──┬───┘└──┬───┘└──┬───┘
   │       │       │       │
   └───────┴───┬───┴───────┘
               ▼
      ┌─────────────────┐
      │ k8s-orchestrator│ ─── Synthesize findings
      └────────┬────────┘
               ▼
  Diagnosis + Recommendations
```
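The fan-out/fan-in shape of this flow can be sketched with Python's `concurrent.futures`. In this sketch:

- The specialist callables are hypothetical stand-ins for whatever mechanism actually invokes the agents.
- A specialist that fails is recorded as an error rather than aborting the investigation.
- The final synthesis is left to the orchestrator.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

# Hypothetical specialist signature: takes the issue text, returns a finding.
Specialist = Callable[[str], str]


def investigate(issue: str, specialists: dict[str, Specialist]) -> dict[str, str]:
    """Fan the issue out to all specialists in parallel, then gather their findings."""
    results: dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=len(specialists) or 1) as pool:
        futures = {name: pool.submit(fn, issue) for name, fn in specialists.items()}
        for name, future in futures.items():
            try:
                results[name] = future.result()
            except Exception as exc:  # keep the other agents' findings
                results[name] = f"error: {exc}"
    return results


if __name__ == "__main__":
    print(investigate("my app is returning 503 errors", {
        "k8s-diagnostician": lambda issue: "2/3 pods in CrashLoopBackOff",
        "prometheus-analyst": lambda issue: "HTTP 5xx rate elevated",
    }))
```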
## Output Format
```
Diagnosis for: [issue description]
Status: [Investigating/Identified/Resolved]

Findings:
1. [Finding with evidence]
2. [Finding with evidence]

Root Cause:
[Explanation of what's causing the issue]

Evidence:
- [Relevant log lines or metrics]
- [Command outputs]

Recommended Actions:
- [SAFE] Action that can be auto-applied
- [CONFIRM] Action requiring approval
- [INFO] Suggestion for manual follow-up

Severity: [Low/Medium/High/Critical]
```
## Options
- `--verbose` - Include full command outputs
- `--logs` - Focus on log analysis
- `--metrics` - Focus on metrics analysis
- `--quick` - Fast surface-level check only