# K8s Agent Orchestrator System - Design Document
## Overview
A user-level Claude Code agent system for autonomous Kubernetes cluster management. The system consists of an orchestrator agent that delegates to specialized subagents, with workflow definitions for common operations and a tiered autonomy model.

- **Location**: `~/.claude/`
- **Primary Domain**: DevOps/Infrastructure
- **Target**: Raspberry Pi k0s cluster

---
## Cluster Environment
### Hardware
| Node | Hardware | RAM | Role |
|------|----------|-----|------|
| Node 1 | Raspberry Pi 5 | 8GB | Control plane + Worker |
| Node 2 | Raspberry Pi 5 | 8GB | Worker |
| Node 3 | Raspberry Pi 3B+ | 1GB | Worker (tainted, tolerations required) |
- **Architecture**: All nodes run arm64 (64-bit OS)
- **Pi 3 node**: Reserved for lightweight workloads only
### Stack
| Component | Technology |
|-----------|------------|
| K8s Distribution | k0s |
| GitOps | ArgoCD |
| Git Hosting | Self-hosted Gitea/Forgejo |
| Monitoring | Prometheus + Alertmanager + Grafana |
### CLI Tools Available
- `kubectl`
- `argocd`
- `k0sctl`
---
## Architecture
### Three-Layer Design
```
┌─────────────────────────────────────────────────────────────┐
│                       User Interface                        │
│              Terminal (CLI)   |  Dashboard (Web)            │
└─────────────────────┬───────────────────┬───────────────────┘
                      │                   │
┌─────────────────────▼───────────────────▼───────────────────┐
│                     Orchestrator Layer                      │
│                      k8s-orchestrator                       │
│         (Opus - complex reasoning, task delegation)         │
└─────────────────────┬───────────────────────────────────────┘
                      │ delegates to
┌─────────────────────▼───────────────────────────────────────┐
│                      Specialist Layer                       │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐  ┌─────────┐│
│ │k8s-         │ │argocd-      │ │prometheus-  │  │git-     ││
│ │diagnostician│ │operator     │ │analyst      │  │operator ││
│ │(Sonnet)     │ │(Sonnet)     │ │(Sonnet)     │  │(Sonnet) ││
│ └─────────────┘ └─────────────┘ └─────────────┘  └─────────┘│
└─────────────────────────────────────────────────────────────┘
                      │ defined by
┌─────────────────────▼───────────────────────────────────────┐
│                       Workflow Layer                        │
│             YAML (complex) | Markdown (simple)              │
└─────────────────────────────────────────────────────────────┘
```
### Directory Structure
```
~/.claude/
├── settings.json                 # Agent definitions, autonomy rules
├── agents/
│   ├── k8s-orchestrator.md       # Orchestrator prompt
│   ├── k8s-diagnostician.md      # Cluster diagnostics specialist
│   ├── argocd-operator.md        # GitOps operations specialist
│   ├── prometheus-analyst.md     # Metrics analysis specialist
│   └── git-operator.md           # Git/Gitea operations specialist
├── workflows/
│   ├── health/
│   │   ├── cluster-health-check.yaml
│   │   └── node-pressure-response.yaml
│   ├── deploy/
│   │   ├── deploy-app.md
│   │   └── rollback-app.yaml
│   └── incidents/
│       └── pod-crashloop.yaml
├── skills/
│   ├── cluster-status.md
│   ├── deploy.md
│   ├── diagnose.md
│   ├── rollback.md
│   └── workflow.md
├── logs/
│   ├── actions/                  # Action audit trail
│   └── workflows/                # Workflow execution logs
└── docs/plans/
```
---
## Subagent Definitions
### settings.json
```json
{
  "agents": {
    "k8s-orchestrator": {
      "model": "opus",
      "promptFile": "agents/k8s-orchestrator.md"
    },
    "k8s-diagnostician": {
      "model": "sonnet",
      "promptFile": "agents/k8s-diagnostician.md"
    },
    "argocd-operator": {
      "model": "sonnet",
      "promptFile": "agents/argocd-operator.md"
    },
    "prometheus-analyst": {
      "model": "sonnet",
      "promptFile": "agents/prometheus-analyst.md"
    },
    "git-operator": {
      "model": "sonnet",
      "promptFile": "agents/git-operator.md"
    }
  },
  "autonomy": {
    "safe_actions": ["get", "describe", "logs", "list", "top", "diff"],
    "confirm_actions": ["delete", "patch", "edit", "scale", "rollout", "apply"],
    "forbidden_actions": ["drain", "cordon", "delete node", "reset"]
  }
}
```
### Subagent Responsibilities
| Agent | Scope | Tools |
|-------|-------|-------|
| **k8s-orchestrator** | Task analysis, delegation, decision making | All (via delegation) |
| **k8s-diagnostician** | Cluster health, pod/node status, logs | kubectl, log tools |
| **argocd-operator** | App sync, deployments, rollbacks | argocd CLI, kubectl |
| **prometheus-analyst** | Metrics, alerts, trends | PromQL, Prometheus API |
| **git-operator** | Manifest commits, PRs, GitOps repo | git, Gitea API |
---
## Model Assignment
### Defaults
- **Orchestrator**: Opus (complex reasoning, task delegation)
- **Subagents**: Sonnet (standard operations)
### Override Levels
1. **Per-workflow**: Specify in workflow YAML
2. **Per-step**: Specify for individual workflow steps
3. **Dynamic**: Orchestrator selects based on task complexity
### Dynamic Model Selection (Orchestrator Logic)
| Task Complexity | Model | Examples |
|-----------------|-------|----------|
| Simple | Haiku | Get status, list resources, log tail |
| Standard | Sonnet | Analyze logs, diagnose issues, sync apps |
| Complex | Opus | Root cause analysis, cascading failures, trade-off decisions |

**Delegation syntax:**
```markdown
Delegate to k8s-diagnostician (haiku):
  Task: Get current node status

Delegate to prometheus-analyst (sonnet):
  Task: Analyze memory trends for namespace "prod" over last 24h

Delegate to k8s-diagnostician (opus):
  Task: Investigate cascading failure across multiple services
```
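The delegation syntax above is regular enough to parse mechanically. The following is a minimal sketch of such a parser, not part of the design itself: the `Delegation` type and `parse_delegations` function are hypothetical names introduced for illustration, and the sketch assumes each delegation is exactly one `Delegate to ...` line followed by one `Task:` line.

```python
import re
from typing import NamedTuple


class Delegation(NamedTuple):
    """One parsed delegation (hypothetical helper type)."""
    agent: str
    model: str
    task: str


# "Delegate to <agent> (<model>):" followed by a "Task: ..." line.
_DELEGATION_RE = re.compile(
    r"Delegate to (?P<agent>[\w-]+) \((?P<model>\w+)\):\s*\n\s*Task: (?P<task>.+)"
)


def parse_delegations(text: str) -> list[Delegation]:
    """Extract every delegation found in the given text, in order."""
    return [
        Delegation(m["agent"], m["model"], m["task"].strip())
        for m in _DELEGATION_RE.finditer(text)
    ]


example = """\
Delegate to k8s-diagnostician (haiku):
  Task: Get current node status

Delegate to prometheus-analyst (sonnet):
  Task: Analyze memory trends for namespace "prod" over last 24h
"""

delegations = parse_delegations(example)
for d in delegations:
    print(f"{d.agent} [{d.model}]: {d.task}")
```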
---
## Workflow Definitions
### YAML Workflows (Complex)
```yaml
name: cluster-health-check
description: Comprehensive cluster health assessment
model: sonnet  # optional default override
trigger:
  - schedule: "0 */6 * * *"  # every 6 hours
  - manual: true
steps:
  - agent: k8s-diagnostician
    model: haiku  # simple status check
    task: Check node status and resource pressure
  - agent: prometheus-analyst
    task: Query for anomalies in last 6 hours
  - agent: argocd-operator
    model: haiku
    task: Check all apps sync status
  - agent: k8s-orchestrator
    task: Summarize findings and recommend actions
    confirm_if: actions_proposed
```
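The design lists per-workflow and per-step model overrides but does not state which one wins. The sketch below assumes the most specific setting takes precedence (step, then workflow, then the agent default from `settings.json`). The `resolve_model` function is a hypothetical helper, and the workflow is written as a Python dict mirroring the YAML above so the example needs no YAML library.

```python
# Agent defaults, mirroring the "agents" section of settings.json.
AGENT_DEFAULTS = {
    "k8s-orchestrator": "opus",
    "k8s-diagnostician": "sonnet",
    "argocd-operator": "sonnet",
    "prometheus-analyst": "sonnet",
    "git-operator": "sonnet",
}

# The cluster-health-check workflow as a dict (same fields as the YAML).
workflow = {
    "name": "cluster-health-check",
    "model": "sonnet",
    "steps": [
        {"agent": "k8s-diagnostician", "model": "haiku",
         "task": "Check node status and resource pressure"},
        {"agent": "prometheus-analyst",
         "task": "Query for anomalies in last 6 hours"},
        {"agent": "argocd-operator", "model": "haiku",
         "task": "Check all apps sync status"},
        {"agent": "k8s-orchestrator",
         "task": "Summarize findings and recommend actions",
         "confirm_if": "actions_proposed"},
    ],
}


def resolve_model(step: dict, workflow: dict) -> str:
    """Assumed precedence: step override, workflow override, agent default."""
    return (
        step.get("model")
        or workflow.get("model")
        or AGENT_DEFAULTS[step["agent"]]
    )


models = [resolve_model(step, workflow) for step in workflow["steps"]]
print(models)
```

Under this assumed precedence, the workflow-level `model: sonnet` also applies to the final `k8s-orchestrator` step, which would otherwise default to Opus. If the summary step should keep the orchestrator default, the step would need its own `model: opus`, or the precedence rule would need an exception.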
### Markdown Workflows (Simple)
```markdown
# Deploy New App
When asked to deploy a new application:
1. Ask git-operator to create the manifest structure in the GitOps repo
2. Ask argocd-operator to create and sync the ArgoCD application
3. Ask k8s-diagnostician to verify pods are running
4. Report deployment status
```
### Incident Response Workflow Example
```yaml
name: pod-crashloop-remediation
trigger:
  type: alert
  match:
    alertname: KubePodCrashLooping
steps:
  - name: diagnose
    agent: k8s-diagnostician
    action: get-pod-status
    inputs:
      namespace: "{{ alert.labels.namespace }}"
      pod: "{{ alert.labels.pod }}"
  - name: check-logs
    agent: k8s-diagnostician
    action: analyze-logs
    inputs:
      pod: "{{ steps.diagnose.pod }}"
      lines: 100
  - name: decide-action
    condition: "{{ steps.check-logs.cause == 'oom' }}"
    branches:
      true:
        agent: argocd-operator
        action: update-resources
        confirm: true  # risky action
      false:
        agent: k8s-diagnostician
        action: restart-pod
        confirm: false  # safe action
  - name: notify
    action: report
    outputs:
      - summary
      - actions-taken
```
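Two mechanics in the workflow above can be sketched in a few lines: matching an incoming alert against `trigger.match`, and filling `{{ ... }}` placeholders in step inputs. This is an illustration under assumptions, not the design's implementation. The helper names (`trigger_matches`, `lookup`, `render`) and the pod name `api-7d9f` are hypothetical, and the renderer only resolves dotted lookups such as `alert.labels.pod`; it does not evaluate expressions like the `== 'oom'` condition.

```python
import re

TRIGGER = {"type": "alert", "match": {"alertname": "KubePodCrashLooping"}}


def trigger_matches(trigger: dict, alert: dict) -> bool:
    """True when every label in trigger.match equals the alert's label."""
    labels = alert.get("labels", {})
    return all(labels.get(key) == value for key, value in trigger["match"].items())


# Matches "{{ dotted.path }}" placeholders (plain lookups only).
_PLACEHOLDER = re.compile(r"\{\{\s*([\w.\-]+)\s*\}\}")


def lookup(path: str, context: dict):
    """Walk a dotted path such as "alert.labels.pod" through nested dicts."""
    value = context
    for key in path.split("."):
        value = value[key]
    return value


def render(template: str, context: dict) -> str:
    """Replace each placeholder with the value found in the context."""
    return _PLACEHOLDER.sub(lambda m: str(lookup(m.group(1), context)), template)


# Hypothetical alert carrying the labels the workflow reads.
alert = {
    "labels": {
        "alertname": "KubePodCrashLooping",
        "namespace": "prod",
        "pod": "api-7d9f",
    }
}

if trigger_matches(TRIGGER, alert):
    context = {"alert": alert}
    inputs = {
        "namespace": render("{{ alert.labels.namespace }}", context),
        "pod": render("{{ alert.labels.pod }}", context),
    }
    print(inputs)
```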
---
## Autonomy Model
### Tiered Autonomy
| Action Type | Behavior | Examples |
|-------------|----------|----------|
| **Safe** | Auto-execute, log action | get, describe, logs, list, top, diff; restart pod only where a workflow auto-approves it |
| **Confirm** | Require user approval | delete, patch, scale, apply, modify config |
| **Forbidden** | Reject with explanation | drain, cordon, delete node, reset |
### Confirmation Flow
```
1. Agent proposes action with rationale
2. System checks action against autonomy rules
3. If safe → execute immediately, log action
4. If confirm → present to user (CLI prompt or dashboard queue)
5. If forbidden → reject with explanation
```
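Step 2 of the flow (checking an action against the autonomy rules) might look like the sketch below. It reuses the `autonomy` lists from `settings.json`. The `classify` function is hypothetical, and two behaviors are assumptions rather than stated design: forbidden rules are checked first so that `delete node` is not mistaken for a plain `delete`, and any action that matches no rule falls back to requiring confirmation.

```python
# The "autonomy" section of settings.json.
AUTONOMY = {
    "safe_actions": ["get", "describe", "logs", "list", "top", "diff"],
    "confirm_actions": ["delete", "patch", "edit", "scale", "rollout", "apply"],
    "forbidden_actions": ["drain", "cordon", "delete node", "reset"],
}

# Checked in this order so multi-word forbidden rules win over single verbs.
_TIERS = (
    ("forbidden", "forbidden_actions"),
    ("confirm", "confirm_actions"),
    ("safe", "safe_actions"),
)


def _matches(action_words: list[str], rule: str) -> bool:
    """True when the action starts with the words of the rule."""
    rule_words = rule.split()
    return action_words[: len(rule_words)] == rule_words


def classify(action: str) -> str:
    """Return "safe", "confirm", or "forbidden" for an action string."""
    words = action.lower().split()
    for tier, key in _TIERS:
        if any(_matches(words, rule) for rule in AUTONOMY[key]):
            return tier
    return "confirm"  # assumption: unknown actions need user approval


for action in ("get pods -n prod", "delete pod api-1", "delete node worker-3"):
    print(action, "->", classify(action))
```

Defaulting unknown actions to the confirm tier keeps the system conservative; a stricter variant could reject them outright.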
### Per-Workflow Overrides
```yaml
name: emergency-pod-restart
autonomy:
  auto_approve:
    - restart_pod
    - scale_replicas
  always_confirm:
    - delete_pvc
```
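How a workflow override combines with the global tier is not spelled out in the design. The sketch below shows one plausible reading, with a hypothetical `effective_tier` helper: forbidden actions can never be overridden, `always_confirm` wins over `auto_approve`, and anything not listed keeps its global tier. All three rules are assumptions.

```python
# The "autonomy" section of the emergency-pod-restart workflow.
OVERRIDES = {
    "auto_approve": ["restart_pod", "scale_replicas"],
    "always_confirm": ["delete_pvc"],
}


def effective_tier(action: str, base_tier: str, overrides: dict) -> str:
    """Combine the global tier of an action with per-workflow overrides."""
    if base_tier == "forbidden":
        return "forbidden"  # assumption: workflows cannot unlock forbidden actions
    if action in overrides.get("always_confirm", []):
        return "confirm"
    if action in overrides.get("auto_approve", []):
        return "safe"
    return base_tier


print(effective_tier("scale_replicas", "confirm", OVERRIDES))
print(effective_tier("delete_pvc", "confirm", OVERRIDES))
```

Checking `always_confirm` before `auto_approve` means a misconfigured workflow that lists the same action in both places still asks the user.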
### Action Logging
```
~/.claude/logs/actions/2025-12-26-actions.jsonl
```
Each entry includes:
- Timestamp
- Agent
- Action
- Inputs
- Outcome
- Approval type (auto/user-confirmed)
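
Appending such entries to a dated JSONL file could be sketched as follows. The `log_action` function and its field names are hypothetical (the design names the fields but not their keys), timestamps are assumed to be UTC, and the example writes to a temporary directory rather than `~/.claude/logs/actions/`.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional


def log_action(log_dir: Path, agent: str, action: str, inputs: dict,
               outcome: str, approval: str,
               now: Optional[datetime] = None) -> Path:
    """Append one entry to <log_dir>/<YYYY-MM-DD>-actions.jsonl and return the path."""
    now = now or datetime.now(timezone.utc)
    log_dir.mkdir(parents=True, exist_ok=True)
    path = log_dir / f"{now:%Y-%m-%d}-actions.jsonl"
    entry = {
        "timestamp": now.isoformat(),
        "agent": agent,
        "action": action,
        "inputs": inputs,
        "outcome": outcome,
        "approval": approval,  # "auto" or "user-confirmed"
    }
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry) + "\n")  # one JSON object per line
    return path


# Usage: write two entries to a throwaway directory and read them back.
fixed_time = datetime(2025, 12, 26, 19, 16, 7, tzinfo=timezone.utc)
with tempfile.TemporaryDirectory() as tmp:
    log_dir = Path(tmp) / "logs" / "actions"
    log_action(log_dir, "k8s-diagnostician", "get", {"resource": "nodes"},
               "ok", "auto", now=fixed_time)
    log_path = log_action(log_dir, "argocd-operator", "apply", {"app": "demo"},
                          "ok", "user-confirmed", now=fixed_time)
    records = [json.loads(line)
               for line in log_path.read_text(encoding="utf-8").splitlines()]

print(log_path.name, len(records))
```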
---
## Skills (User Entry Points)
| Skill | Command | Purpose |
|-------|---------|---------|
| cluster-status | `/cluster-status` | Quick health overview |
| deploy | `/deploy <app>` | Deploy or update an app |
| diagnose | `/diagnose <issue>` | Investigate a problem |
| rollback | `/rollback <app>` | Revert to previous version |
| workflow | `/workflow <name>` | Run a named workflow |
### Example Skill: cluster-status.md
```markdown
# Cluster Status
Invoke the k8s-orchestrator to provide a quick health overview.
## Steps
1. Delegate to k8s-diagnostician: get node status
2. Delegate to prometheus-analyst: check for active alerts
3. Delegate to argocd-operator: list out-of-sync apps
4. Summarize in a concise table
## Output Format
- Node health: table
- Active alerts: bullet list
- ArgoCD status: table
- Recommendations: if any issues found
```
---
## Interaction Methods
### Terminal/CLI
- Primary interaction via Claude Code
- Fallback when the cluster (and the dashboard hosted on it) is unavailable
- Use skills to invoke workflows
### Dashboard (Web UI)
- Deployed on cluster (Pi 3 node)
- Views: Status, Pending Confirmations, History, Workflows
- Approve/reject risky actions
### Push Notifications (Future)
- Discord, Slack, or Telegram integration
- Alert on issues requiring attention
---
## Dashboard Specification
### Tech Stack
- **Backend**: Go binary (single static binary, embedded assets)
- **Storage**: SQLite or flat JSON files
- **Resources**: Minimal footprint for Pi 3
### Deployment
```yaml
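apiVersion: apps/v1
kind: Deployment
metadata:
  name: k8s-agent-dashboard
spec:
  replicas: 1
  # apps/v1 requires a selector matching the pod template labels;
  # the "app" label value here is an assumed placeholder
  selector:
    matchLabels:
      app: k8s-agent-dashboard
  template:
    metadata:
      labels:
        app: k8s-agent-dashboard
    spec:
      containers:
        - name: dashboard
          image: k8s-agent-dashboard:latest
          resources:
            requests:
              memory: "32Mi"
              cpu: "10m"
            limits:
              memory: "64Mi"
              cpu: "100m"
      # Tolerates the Pi 3 taint; this permits scheduling there but does not force it
      tolerations:
        - key: "node-type"
          operator: "Equal"
          value: "pi3"
          effect: "NoSchedule"
      nodeSelector:
        kubernetes.io/arch: arm64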
```
### Views
| View | Description |
|------|-------------|
| Status | Current cluster health, active alerts, ArgoCD sync state |
| Pending | Actions awaiting confirmation with approve/reject buttons |
| History | Recent actions taken, filterable by agent/workflow |
| Workflows | List of defined workflows, manual trigger capability |
---
## Implementation Phases
### Phase 1: Core Agent System
**Deliverables:**
- `~/.claude/` directory structure
- Orchestrator and 4 subagent prompt files
- `settings.json` with agent configurations
- 3-4 essential workflows (cluster-health, deploy, diagnose)
- Core skills (/cluster-status, /deploy, /diagnose)

**Validation:**
- Manual CLI invocation
- Test each subagent independently
- Run health check workflow end-to-end
### Phase 2: Dashboard
**Deliverables:**
- Go-based dashboard application
- Kubernetes manifests for Pi 3 deployment
- Pending confirmations queue
- Action history view
- Approval flow integration
### Phase 3: Automation
**Deliverables:**
- Scheduled workflow execution
- Alertmanager webhook integration
- Expanded incident response workflows
### Phase 4: Expansion (Future)
**Potential additions:**
- Push notifications (Discord/Telegram)
- Additional domains (development, research, productivity)
- SDK-based background daemon for true autonomy
---
## Future Domain Expansion
The system is designed to expand beyond DevOps:

| Domain | Use Cases |
|--------|-----------|
| Software Development | Code generation, refactoring, testing across repos |
| Research & Analysis | Information gathering, summarizing, recommendations |
| Personal Productivity | File management, notes, task tracking |

New domains would add:
- Additional subagents with specialized prompts
- Domain-specific workflows
- New skills for user invocation