# K8s Agent Orchestrator System - Design Document

## Overview

A user-level Claude Code agent system for autonomous Kubernetes cluster management. The system consists of an orchestrator agent that delegates to specialized subagents, with workflow definitions for common operations and a tiered autonomy model.

**Location**: `~/.claude/`
**Primary Domain**: DevOps/Infrastructure
**Target**: Raspberry Pi k0s cluster

---

## Cluster Environment

### Hardware

| Node | Hardware | RAM | Role |
|------|----------|-----|------|
| Node 1 | Raspberry Pi 5 | 8GB | Control plane + Worker |
| Node 2 | Raspberry Pi 5 | 8GB | Worker |
| Node 3 | Raspberry Pi 3B+ | 1GB | Worker (tainted, tolerations required) |

- **Architecture**: All nodes run arm64 (64-bit OS)
- **Pi 3 node**: Reserved for lightweight workloads only

### Stack

| Component | Technology |
|-----------|------------|
| K8s Distribution | k0s |
| GitOps | ArgoCD |
| Git Hosting | Self-hosted Gitea/Forgejo |
| Monitoring | Prometheus + Alertmanager + Grafana |

### CLI Tools Available

- `kubectl`
- `argocd`
- `k0sctl`

---

## Architecture

### Three-Layer Design

```
┌─────────────────────────────────────────────────────────────┐
│                       User Interface                        │
│             Terminal (CLI)  |  Dashboard (Web)              │
└─────────────────────┬───────────────────┬───────────────────┘
                      │                   │
┌─────────────────────▼───────────────────▼───────────────────┐
│                     Orchestrator Layer                      │
│                      k8s-orchestrator                       │
│         (Opus - complex reasoning, task delegation)         │
└─────────────────────┬───────────────────────────────────────┘
                      │ delegates to
┌─────────────────────▼───────────────────────────────────────┐
│                      Specialist Layer                       │
│  ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────┐│
│  │k8s-         │ │argocd-      │ │prometheus-  │ │git-     ││
│  │diagnostician│ │operator     │ │analyst      │ │operator ││
│  │(Sonnet)     │ │(Sonnet)     │ │(Sonnet)     │ │(Sonnet) ││
│  └─────────────┘ └─────────────┘ └─────────────┘ └─────────┘│
└─────────────────────────────────────────────────────────────┘
                      │ defined by
┌─────────────────────▼───────────────────────────────────────┐
│                       Workflow Layer                        │
│            YAML (complex)  |  Markdown (simple)             │
└─────────────────────────────────────────────────────────────┘
```

### Directory Structure

```
~/.claude/
├── settings.json                    # Agent definitions, autonomy rules
├── agents/
│   ├── k8s-orchestrator.md          # Orchestrator prompt
│   ├── k8s-diagnostician.md         # Cluster diagnostics specialist
│   ├── argocd-operator.md           # GitOps operations specialist
│   ├── prometheus-analyst.md        # Metrics analysis specialist
│   └── git-operator.md              # Git/Gitea operations specialist
├── workflows/
│   ├── health/
│   │   ├── cluster-health-check.yaml
│   │   └── node-pressure-response.yaml
│   ├── deploy/
│   │   ├── deploy-app.md
│   │   └── rollback-app.yaml
│   └── incidents/
│       └── pod-crashloop.yaml
├── skills/
│   ├── cluster-status.md
│   ├── deploy.md
│   ├── diagnose.md
│   ├── rollback.md
│   └── workflow.md
├── logs/
│   ├── actions/                     # Action audit trail
│   └── workflows/                   # Workflow execution logs
└── docs/plans/
```

---

## Subagent Definitions

### settings.json

```json
{
  "agents": {
    "k8s-orchestrator": {
      "model": "opus",
      "promptFile": "agents/k8s-orchestrator.md"
    },
    "k8s-diagnostician": {
      "model": "sonnet",
      "promptFile": "agents/k8s-diagnostician.md"
    },
    "argocd-operator": {
      "model": "sonnet",
      "promptFile": "agents/argocd-operator.md"
    },
    "prometheus-analyst": {
      "model": "sonnet",
      "promptFile": "agents/prometheus-analyst.md"
    },
    "git-operator": {
      "model": "sonnet",
      "promptFile": "agents/git-operator.md"
    }
  },
  "autonomy": {
    "safe_actions": ["get", "describe", "logs", "list", "top", "diff"],
    "confirm_actions": ["delete", "patch", "edit", "scale", "rollout", "apply"],
    "forbidden_actions": ["drain", "cordon", "delete node", "reset"]
  }
}
```

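As a rough illustration of how the `autonomy` lists could be applied, the sketch below classifies a kubectl-style command string. The function name `classify`, the prefix-matching rule, and the fallback to `confirm` for unrecognized verbs are assumptions made for this example; the design does not specify them.

```python
# Rules copied from the "autonomy" block of settings.json.
AUTONOMY = {
    "safe_actions": ["get", "describe", "logs", "list", "top", "diff"],
    "confirm_actions": ["delete", "patch", "edit", "scale", "rollout", "apply"],
    "forbidden_actions": ["drain", "cordon", "delete node", "reset"],
}


def classify(command: str, rules: dict = AUTONOMY) -> str:
    """Return 'forbidden', 'confirm', or 'safe' for a kubectl-style command.

    Assumptions (not part of the design): the command is written as
    '<verb> [resource] ...' without the binary name, and anything that
    matches no list falls back to 'confirm'.
    """
    words = command.lower().split()

    def matches(entry: str) -> bool:
        # An entry matches when its words are a prefix of the command words,
        # so "delete node" matches "delete node pi3" but not "delete pod x".
        parts = entry.split()
        return words[: len(parts)] == parts

    # Check the strictest tier first so "delete node" wins over "delete".
    for tier, key in (("forbidden", "forbidden_actions"),
                      ("confirm", "confirm_actions"),
                      ("safe", "safe_actions")):
        if any(matches(entry) for entry in rules[key]):
            return tier
    return "confirm"
```

Checking the forbidden list before the confirm list matters here because `delete node` and `delete` share a first word.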
### Subagent Responsibilities

| Agent | Scope | Tools |
|-------|-------|-------|
| **k8s-orchestrator** | Task analysis, delegation, decision making | All (via delegation) |
| **k8s-diagnostician** | Cluster health, pod/node status, logs | kubectl, log tools |
| **argocd-operator** | App sync, deployments, rollbacks | argocd CLI, kubectl |
| **prometheus-analyst** | Metrics, alerts, trends | PromQL, Prometheus API |
| **git-operator** | Manifest commits, PRs, GitOps repo | git, Gitea API |

---

## Model Assignment

### Defaults

- **Orchestrator**: Opus (complex reasoning, task delegation)
- **Subagents**: Sonnet (standard operations)

### Override Levels

1. **Per-workflow**: Specify in workflow YAML
2. **Per-step**: Specify for individual workflow steps
3. **Dynamic**: Orchestrator selects based on task complexity

### Dynamic Model Selection (Orchestrator Logic)

| Task Complexity | Model | Examples |
|-----------------|-------|----------|
| Simple | Haiku | Get status, list resources, log tail |
| Standard | Sonnet | Analyze logs, diagnose issues, sync apps |
| Complex | Opus | Root cause analysis, cascading failures, trade-off decisions |

**Delegation syntax:**
```markdown
Delegate to k8s-diagnostician (haiku):
Task: Get current node status

Delegate to prometheus-analyst (sonnet):
Task: Analyze memory trends for namespace "prod" over last 24h

Delegate to k8s-diagnostician (opus):
Task: Investigate cascading failure across multiple services
```

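The complexity table and the delegation syntax can be tied together with a small helper. This is only a sketch: the dictionary keys and the function name `format_delegation` are hypothetical, and how the orchestrator judges complexity in the first place is left open by the design.

```python
# Mapping taken from the Dynamic Model Selection table.
MODEL_BY_COMPLEXITY = {
    "simple": "haiku",
    "standard": "sonnet",
    "complex": "opus",
}


def format_delegation(agent: str, complexity: str, task: str) -> str:
    """Render a delegation message in the syntax shown above.

    Raises KeyError for a complexity level the table does not define.
    """
    model = MODEL_BY_COMPLEXITY[complexity]
    return f"Delegate to {agent} ({model}):\nTask: {task}"


# Reproduces the first delegation example from the design document.
message = format_delegation("k8s-diagnostician", "simple", "Get current node status")
```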
---

## Workflow Definitions

### YAML Workflows (Complex)

```yaml
name: cluster-health-check
description: Comprehensive cluster health assessment
model: sonnet  # optional default override
trigger:
  - schedule: "0 */6 * * *"  # every 6 hours
  - manual: true

steps:
  - agent: k8s-diagnostician
    model: haiku  # simple status check
    task: Check node status and resource pressure

  - agent: prometheus-analyst
    task: Query for anomalies in last 6 hours

  - agent: argocd-operator
    model: haiku
    task: Check all apps sync status

  - agent: k8s-orchestrator
    task: Summarize findings and recommend actions
    confirm_if: actions_proposed
```

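To make the interaction between per-step, per-workflow, and agent-default models concrete, here is a minimal sketch. The precedence order (step, then workflow, then agent default) is an assumption for illustration; the document lists the override levels but does not state which one wins. The workflow is written as a Python dict to keep the example free of a YAML dependency.

```python
# Agent defaults mirror settings.json; WORKFLOW mirrors cluster-health-check.
AGENT_DEFAULTS = {
    "k8s-orchestrator": "opus",
    "k8s-diagnostician": "sonnet",
    "argocd-operator": "sonnet",
    "prometheus-analyst": "sonnet",
    "git-operator": "sonnet",
}

WORKFLOW = {
    "name": "cluster-health-check",
    "model": "sonnet",
    "steps": [
        {"agent": "k8s-diagnostician", "model": "haiku"},
        {"agent": "prometheus-analyst"},
        {"agent": "argocd-operator", "model": "haiku"},
        {"agent": "k8s-orchestrator"},
    ],
}


def resolve_model(step: dict, workflow: dict, defaults: dict = AGENT_DEFAULTS) -> str:
    """Assumed precedence: step override, then workflow override, then agent default."""
    return step.get("model") or workflow.get("model") or defaults[step["agent"]]


def plan(workflow: dict) -> list:
    """Return (agent, model) pairs in execution order."""
    return [(step["agent"], resolve_model(step, workflow)) for step in workflow["steps"]]
```

With these inputs, `plan(WORKFLOW)` assigns Haiku to the two status checks and Sonnet to the remaining steps, including the final `k8s-orchestrator` summary, because the workflow-level `model: sonnet` outranks the agent default under the assumed precedence. If the summary step should stay on Opus, it would need an explicit step-level `model`.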
### Markdown Workflows (Simple)

```markdown
# Deploy New App

When asked to deploy a new application:

1. Ask git-operator to create the manifest structure in the GitOps repo
2. Ask argocd-operator to create and sync the ArgoCD application
3. Ask k8s-diagnostician to verify pods are running
4. Report deployment status
```

### Incident Response Workflow Example

```yaml
name: pod-crashloop-remediation
trigger:
  type: alert
  match:
    alertname: KubePodCrashLooping

steps:
  - name: diagnose
    agent: k8s-diagnostician
    action: get-pod-status
    inputs:
      namespace: "{{ alert.labels.namespace }}"
      pod: "{{ alert.labels.pod }}"

  - name: check-logs
    agent: k8s-diagnostician
    action: analyze-logs
    inputs:
      pod: "{{ steps.diagnose.pod }}"
      lines: 100

  - name: decide-action
    condition: "{{ steps.check-logs.cause == 'oom' }}"
    branches:
      true:
        agent: argocd-operator
        action: update-resources
        confirm: true  # risky action
      false:
        agent: k8s-diagnostician
        action: restart-pod
        confirm: false  # safe action

  - name: notify
    action: report
    outputs:
      - summary
      - actions-taken
```

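The `{{ ... }}` placeholders and the `condition` field imply some form of template evaluation. The sketch below shows one minimal way this could work, assuming placeholders are plain dotted paths and conditions have the single form `path == 'literal'`. The helper names and the sample context values (`prod`, `web-0`) are hypothetical; a real implementation might use a full template engine instead.

```python
import re

# Matches "{{ something }}" and captures the inner expression without padding.
PLACEHOLDER = re.compile(r"\{\{\s*(.*?)\s*\}\}")


def lookup(path: str, context: dict):
    """Walk a dotted path such as 'alert.labels.pod' through nested dicts."""
    value = context
    for key in path.split("."):
        value = value[key]
    return value


def render(template: str, context: dict) -> str:
    """Replace each '{{ path }}' placeholder with its value from the context."""
    return PLACEHOLDER.sub(lambda m: str(lookup(m.group(1), context)), template)


def evaluate(condition: str, context: dict) -> bool:
    """Evaluate a condition of the form "{{ path == 'literal' }}" (assumed form)."""
    expr = PLACEHOLDER.fullmatch(condition).group(1)
    path, literal = (part.strip() for part in expr.split("=="))
    return lookup(path, context) == literal.strip("'\"")


# Illustrative context: alert labels plus the output of an earlier step.
context = {
    "alert": {"labels": {"namespace": "prod", "pod": "web-0"}},
    "steps": {"check-logs": {"cause": "oom"}},
}
rendered = render("{{ alert.labels.namespace }}/{{ alert.labels.pod }}", context)
take_true_branch = evaluate("{{ steps.check-logs.cause == 'oom' }}", context)
```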
---

## Autonomy Model

### Tiered Autonomy

| Action Type | Behavior | Examples |
|-------------|----------|----------|
| **Safe** | Auto-execute, log action | get, describe, logs, list, restart pod |
| **Confirm** | Require user approval | delete, patch, scale, apply, modify config |
| **Forbidden** | Reject with explanation | drain, cordon, delete node |

### Confirmation Flow

```
1. Agent proposes action with rationale
2. System checks action against autonomy rules
3. If safe → execute immediately, log action
4. If confirm → present to user (CLI prompt or dashboard queue)
5. If forbidden → reject with explanation
```

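Steps 3 to 5 of the flow can be sketched as a single dispatch function. This is an illustration under assumptions: the function name, the shape of the returned record, and the `execute` and `ask_user` callbacks are all hypothetical stand-ins for whatever the CLI prompt or dashboard queue would provide.

```python
from typing import Callable


def handle_proposal(action: str, tier: str,
                    execute: Callable[[str], str],
                    ask_user: Callable[[str], bool]) -> dict:
    """Apply steps 3-5 of the confirmation flow to an already-classified action."""
    if tier == "forbidden":
        # Step 5: never execute, and explain why.
        return {"action": action, "outcome": "rejected", "approval": None,
                "reason": f"'{action}' is forbidden by the autonomy rules"}
    if tier == "confirm":
        # Step 4: hand the decision to the user before doing anything.
        if not ask_user(action):
            return {"action": action, "outcome": "declined", "approval": None}
        approval = "user-confirmed"
    elif tier == "safe":
        # Step 3: execute immediately.
        approval = "auto"
    else:
        raise ValueError(f"unknown tier: {tier!r}")
    return {"action": action, "outcome": execute(action), "approval": approval}
```

The `tier` argument is expected to come from step 2, the check against the autonomy rules in `settings.json`. An unknown tier raises rather than defaulting to execution, which keeps a misconfiguration from silently running an action.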
### Per-Workflow Overrides

```yaml
name: emergency-pod-restart
autonomy:
  auto_approve:
    - restart_pod
    - scale_replicas
  always_confirm:
    - delete_pvc
```

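One possible reading of how a workflow's `autonomy` block adjusts the global tiers is sketched below. Two rules here are assumptions rather than part of the design: `always_confirm` takes priority over `auto_approve`, and a workflow can never relax a forbidden action.

```python
# Mirrors the "autonomy" block of the emergency-pod-restart workflow.
OVERRIDES = {
    "auto_approve": ["restart_pod", "scale_replicas"],
    "always_confirm": ["delete_pvc"],
}


def effective_tier(action: str, base_tier: str, overrides: dict) -> str:
    """Adjust a globally assigned tier using a workflow's overrides.

    Assumptions: forbidden stays forbidden, and always_confirm wins
    if an action appears in both override lists.
    """
    if base_tier == "forbidden":
        return "forbidden"
    if action in overrides.get("always_confirm", []):
        return "confirm"
    if action in overrides.get("auto_approve", []):
        return "safe"
    return base_tier
```

Under these rules, `scale_replicas` would run without a prompt inside this workflow even though `scale` normally requires confirmation, while `delete_pvc` would always prompt.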
### Action Logging

```
~/.claude/logs/actions/2025-12-26-actions.jsonl
```

Each entry includes:
- Timestamp
- Agent
- Action
- Inputs
- Outcome
- Approval type (auto/user-confirmed)

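
A minimal sketch of appending one such entry to the daily JSONL file follows. The JSON key names, the use of UTC timestamps, and the function name `log_action` are assumptions; the design only lists which fields an entry contains and the file naming pattern.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def log_action(log_dir: Path, agent: str, action: str, inputs: dict,
               outcome: str, approval: str) -> Path:
    """Append one JSON object per line to the current day's action log."""
    now = datetime.now(timezone.utc)  # assumption: timestamps are recorded in UTC
    log_dir.mkdir(parents=True, exist_ok=True)
    path = log_dir / f"{now:%Y-%m-%d}-actions.jsonl"
    entry = {
        "timestamp": now.isoformat(),
        "agent": agent,
        "action": action,
        "inputs": inputs,
        "outcome": outcome,
        "approval": approval,  # "auto" or "user-confirmed"
    }
    # Append mode keeps earlier entries intact; each line is a complete record.
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry) + "\n")
    return path
```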
---

## Skills (User Entry Points)

| Skill | Command | Purpose |
|-------|---------|---------|
| cluster-status | `/cluster-status` | Quick health overview |
| deploy | `/deploy <app>` | Deploy or update an app |
| diagnose | `/diagnose <issue>` | Investigate a problem |
| rollback | `/rollback <app>` | Revert to previous version |
| workflow | `/workflow <name>` | Run a named workflow |

### Example Skill: cluster-status.md

```markdown
# Cluster Status

Invoke the k8s-orchestrator to provide a quick health overview.

## Steps
1. Delegate to k8s-diagnostician: get node status
2. Delegate to prometheus-analyst: check for active alerts
3. Delegate to argocd-operator: list out-of-sync apps
4. Summarize in a concise table

## Output Format
- Node health: table
- Active alerts: bullet list
- ArgoCD status: table
- Recommendations: if any issues found
```

---

## Interaction Methods

### Terminal/CLI

- Primary interaction via Claude Code
- Fallback when the cluster, and with it the on-cluster dashboard, is unavailable
- Use skills to invoke workflows

### Dashboard (Web UI)

- Deployed on cluster (Pi 3 node)
- Views: Status, Pending Confirmations, History, Workflows
- Approve/reject risky actions

### Push Notifications (Future)

- Discord, Slack, or Telegram integration
- Alert on issues requiring attention

---

## Dashboard Specification

### Tech Stack

- **Backend**: Go binary (single static binary, embedded assets)
- **Storage**: SQLite or flat JSON files
- **Resources**: Minimal footprint for Pi 3

### Deployment

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: k8s-agent-dashboard
spec:
  replicas: 1
  selector:  # required by apps/v1; must match the pod template labels
    matchLabels:
      app: k8s-agent-dashboard  # label name is illustrative
  template:
    metadata:
      labels:
        app: k8s-agent-dashboard
    spec:
      containers:
        - name: dashboard
          image: k8s-agent-dashboard:latest
          resources:
            requests:
              memory: "32Mi"
              cpu: "10m"
            limits:
              memory: "64Mi"
              cpu: "100m"
      tolerations:  # permits scheduling onto the tainted Pi 3 node
        - key: "node-type"
          operator: "Equal"
          value: "pi3"
          effect: "NoSchedule"
      nodeSelector:
        kubernetes.io/arch: arm64
```

### Views

| View | Description |
|------|-------------|
| Status | Current cluster health, active alerts, ArgoCD sync state |
| Pending | Actions awaiting confirmation with approve/reject buttons |
| History | Recent actions taken, filterable by agent/workflow |
| Workflows | List of defined workflows, manual trigger capability |

---

## Implementation Phases

### Phase 1: Core Agent System

**Deliverables:**
- `~/.claude/` directory structure
- Orchestrator and 4 subagent prompt files
- `settings.json` with agent configurations
- 3-4 essential workflows (cluster-health, deploy, diagnose)
- Core skills (/cluster-status, /deploy, /diagnose)

**Validation:**
- Manual CLI invocation
- Test each subagent independently
- Run health check workflow end-to-end

### Phase 2: Dashboard

**Deliverables:**
- Go-based dashboard application
- Kubernetes manifests for Pi 3 deployment
- Pending confirmations queue
- Action history view
- Approval flow integration

### Phase 3: Automation

**Deliverables:**
- Scheduled workflow execution
- Alertmanager webhook integration
- Expanded incident response workflows

### Phase 4: Expansion (Future)

**Potential additions:**
- Push notifications (Discord/Telegram)
- Additional domains (development, research, productivity)
- SDK-based background daemon for true autonomy

---

## Future Domain Expansion

The system is designed to expand beyond DevOps:

| Domain | Use Cases |
|--------|-----------|
| Software Development | Code generation, refactoring, testing across repos |
| Research & Analysis | Information gathering, summarizing, recommendations |
| Personal Productivity | File management, notes, task tracking |

New domains would add:
- Additional subagents with specialized prompts
- Domain-specific workflows
- New skills for user invocation