Initial commit: Claude Code config and K8s agent orchestrator design

- Add .gitignore for logs, caches, credentials, and history
- Add K8s agent orchestrator design document
- Include existing Claude Code settings and plugin configs

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
482
docs/plans/2025-12-26-k8s-agent-orchestrator-design.md
Normal file
@@ -0,0 +1,482 @@
# K8s Agent Orchestrator System - Design Document

## Overview

A user-level Claude Code agent system for autonomous Kubernetes cluster management. The system consists of an orchestrator agent that delegates to specialized subagents, with workflow definitions for common operations and a tiered autonomy model.

**Location**: `~/.claude/`
**Primary Domain**: DevOps/Infrastructure
**Target**: Raspberry Pi k0s cluster

---

## Cluster Environment

### Hardware

| Node | Hardware | RAM | Role |
|------|----------|-----|------|
| Node 1 | Raspberry Pi 5 | 8GB | Control plane + Worker |
| Node 2 | Raspberry Pi 5 | 8GB | Worker |
| Node 3 | Raspberry Pi 3B+ | 1GB | Worker (tainted, tolerations required) |

- **Architecture**: All nodes run arm64 (64-bit OS)
- **Pi 3 node**: Reserved for lightweight workloads only

### Stack

| Component | Technology |
|-----------|------------|
| K8s Distribution | k0s |
| GitOps | ArgoCD |
| Git Hosting | Self-hosted Gitea/Forgejo |
| Monitoring | Prometheus + Alertmanager + Grafana |

### CLI Tools Available

- `kubectl`
- `argocd`
- `k0sctl`

---

## Architecture

### Three-Layer Design

```
┌─────────────────────────────────────────────────────────────┐
│                       User Interface                        │
│              Terminal (CLI) | Dashboard (Web)               │
└─────────────────────┬───────────────────┬───────────────────┘
                      │                   │
┌─────────────────────▼───────────────────▼───────────────────┐
│                     Orchestrator Layer                      │
│                      k8s-orchestrator                       │
│         (Opus - complex reasoning, task delegation)         │
└─────────────────────┬───────────────────────────────────────┘
                      │ delegates to
┌─────────────────────▼───────────────────────────────────────┐
│                      Specialist Layer                       │
│  ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────┐│
│  │k8s-         │ │argocd-      │ │prometheus-  │ │git-     ││
│  │diagnostician│ │operator     │ │analyst      │ │operator ││
│  │(Sonnet)     │ │(Sonnet)     │ │(Sonnet)     │ │(Sonnet) ││
│  └─────────────┘ └─────────────┘ └─────────────┘ └─────────┘│
└─────────────────────────────────────────────────────────────┘
                      │ defined by
┌─────────────────────▼───────────────────────────────────────┐
│                       Workflow Layer                        │
│             YAML (complex) | Markdown (simple)              │
└─────────────────────────────────────────────────────────────┘
```

### Directory Structure

```
~/.claude/
├── settings.json              # Agent definitions, autonomy rules
├── agents/
│   ├── k8s-orchestrator.md    # Orchestrator prompt
│   ├── k8s-diagnostician.md   # Cluster diagnostics specialist
│   ├── argocd-operator.md     # GitOps operations specialist
│   ├── prometheus-analyst.md  # Metrics analysis specialist
│   └── git-operator.md        # Git/Gitea operations specialist
├── workflows/
│   ├── health/
│   │   ├── cluster-health-check.yaml
│   │   └── node-pressure-response.yaml
│   ├── deploy/
│   │   ├── deploy-app.md
│   │   └── rollback-app.yaml
│   └── incidents/
│       └── pod-crashloop.yaml
├── skills/
│   ├── cluster-status.md
│   ├── deploy.md
│   ├── diagnose.md
│   ├── rollback.md
│   └── workflow.md
├── logs/
│   ├── actions/               # Action audit trail
│   └── workflows/             # Workflow execution logs
└── docs/plans/
```

---

## Subagent Definitions

### settings.json

```json
{
  "agents": {
    "k8s-orchestrator": {
      "model": "opus",
      "promptFile": "agents/k8s-orchestrator.md"
    },
    "k8s-diagnostician": {
      "model": "sonnet",
      "promptFile": "agents/k8s-diagnostician.md"
    },
    "argocd-operator": {
      "model": "sonnet",
      "promptFile": "agents/argocd-operator.md"
    },
    "prometheus-analyst": {
      "model": "sonnet",
      "promptFile": "agents/prometheus-analyst.md"
    },
    "git-operator": {
      "model": "sonnet",
      "promptFile": "agents/git-operator.md"
    }
  },
  "autonomy": {
    "safe_actions": ["get", "describe", "logs", "list", "top", "diff"],
    "confirm_actions": ["delete", "patch", "edit", "scale", "rollout", "apply"],
    "forbidden_actions": ["drain", "cordon", "delete node", "reset"]
  }
}
```
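
The `autonomy` lists above can be read as a lookup from a CLI command to a tier. The following is a minimal sketch of that lookup, not the system's actual enforcement code: the `classify` function, the prefix-matching rule, and the fallback to `confirm` for unlisted verbs are all assumptions made for illustration. It does not handle flags placed before the verb (for example `kubectl -n prod get pods`).

```python
import shlex

# Tiers copied from the "autonomy" section of settings.json above.
AUTONOMY = {
    "safe_actions": ["get", "describe", "logs", "list", "top", "diff"],
    "confirm_actions": ["delete", "patch", "edit", "scale", "rollout", "apply"],
    "forbidden_actions": ["drain", "cordon", "delete node", "reset"],
}

# CLI names taken from the "CLI Tools Available" list.
KNOWN_BINARIES = {"kubectl", "argocd", "k0sctl"}


def classify(command: str, rules: dict = AUTONOMY) -> str:
    """Return "safe", "confirm", or "forbidden" for a CLI command string."""
    words = shlex.split(command)
    # Drop the binary name so matching starts at the verb.
    if words and words[0] in KNOWN_BINARIES:
        words = words[1:]
    text = " ".join(words)

    def matches(patterns):
        # A pattern matches when the command begins with all of its words.
        return any(text == p or text.startswith(p + " ") for p in patterns)

    # Forbidden is checked first so "delete node" wins over plain "delete".
    if matches(rules["forbidden_actions"]):
        return "forbidden"
    if matches(rules["confirm_actions"]):
        return "confirm"
    if matches(rules["safe_actions"]):
        return "safe"
    # Assumption: anything not listed is escalated to the user.
    return "confirm"


print(classify("kubectl get pods -n prod"))
print(classify("kubectl delete pod web-0"))
print(classify("kubectl delete node pi3"))
```

Checking the forbidden list before the confirm list matters here because `delete node` shares its first word with the confirmable `delete` action.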

### Subagent Responsibilities

| Agent | Scope | Tools |
|-------|-------|-------|
| **k8s-orchestrator** | Task analysis, delegation, decision making | All (via delegation) |
| **k8s-diagnostician** | Cluster health, pod/node status, logs | kubectl, log tools |
| **argocd-operator** | App sync, deployments, rollbacks | argocd CLI, kubectl |
| **prometheus-analyst** | Metrics, alerts, trends | PromQL, Prometheus API |
| **git-operator** | Manifest commits, PRs, GitOps repo | git, Gitea API |

---

## Model Assignment

### Defaults

- **Orchestrator**: Opus (complex reasoning, task delegation)
- **Subagents**: Sonnet (standard operations)

### Override Levels

1. **Per-workflow**: Specify in workflow YAML
2. **Per-step**: Specify for individual workflow steps
3. **Dynamic**: Orchestrator selects based on task complexity

### Dynamic Model Selection (Orchestrator Logic)

| Task Complexity | Model | Examples |
|-----------------|-------|----------|
| Simple | Haiku | Get status, list resources, log tail |
| Standard | Sonnet | Analyze logs, diagnose issues, sync apps |
| Complex | Opus | Root cause analysis, cascading failures, trade-off decisions |

**Delegation syntax:**
```markdown
Delegate to k8s-diagnostician (haiku):
  Task: Get current node status

Delegate to prometheus-analyst (sonnet):
  Task: Analyze memory trends for namespace "prod" over last 24h

Delegate to k8s-diagnostician (opus):
  Task: Investigate cascading failure across multiple services
```
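
If delegation blocks in this syntax ever need to be processed by tooling (for logging or for the dashboard, say), they are regular enough to parse with a small regex. This is only a sketch: `parse_delegations` and the output dictionary shape are hypothetical, and the design does not state that the orchestrator's output is parsed this way.

```python
import re

# Matches a header such as "Delegate to k8s-diagnostician (haiku):".
HEADER = re.compile(r"^Delegate to (?P<agent>[\w-]+) \((?P<model>\w+)\):$")


def parse_delegations(text: str) -> list[dict]:
    """Turn delegation blocks into {"agent", "model", "task"} dicts."""
    delegations = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        header = HEADER.match(line)
        if header:
            # Start a new delegation; its task arrives on a later line.
            delegations.append({**header.groupdict(), "task": None})
        elif line.startswith("Task:") and delegations:
            delegations[-1]["task"] = line[len("Task:"):].strip()
    return delegations


example = """\
Delegate to k8s-diagnostician (haiku):
  Task: Get current node status

Delegate to prometheus-analyst (sonnet):
  Task: Analyze memory trends for namespace "prod" over last 24h
"""

for item in parse_delegations(example):
    print(item["agent"], item["model"], item["task"])
```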

---

## Workflow Definitions

### YAML Workflows (Complex)

```yaml
name: cluster-health-check
description: Comprehensive cluster health assessment
model: sonnet  # optional default override
trigger:
  - schedule: "0 */6 * * *"  # every 6 hours
  - manual: true

steps:
  - agent: k8s-diagnostician
    model: haiku  # simple status check
    task: Check node status and resource pressure

  - agent: prometheus-analyst
    task: Query for anomalies in last 6 hours

  - agent: argocd-operator
    model: haiku
    task: Check all apps sync status

  - agent: k8s-orchestrator
    task: Summarize findings and recommend actions
    confirm_if: actions_proposed
```
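
The workflow above mixes a workflow-level `model` with per-step `model` keys. One plausible way to resolve the model for each step is sketched below. The precedence (per-step, then per-workflow, then the agent default from `settings.json`) is an assumption, since the design lists the override levels without ranking them, and `resolve_model` is a hypothetical helper. The workflow is written as a plain dictionary to keep the example free of third-party YAML libraries.

```python
# Agent defaults copied from settings.json.
AGENT_DEFAULTS = {
    "k8s-orchestrator": "opus",
    "k8s-diagnostician": "sonnet",
    "argocd-operator": "sonnet",
    "prometheus-analyst": "sonnet",
    "git-operator": "sonnet",
}


def resolve_model(step: dict, workflow: dict, agent_defaults: dict = AGENT_DEFAULTS) -> str:
    """Pick a model: step override, else workflow override, else agent default."""
    return step.get("model") or workflow.get("model") or agent_defaults[step["agent"]]


# The cluster-health-check workflow, as it would look after YAML parsing.
workflow = {
    "name": "cluster-health-check",
    "model": "sonnet",
    "steps": [
        {"agent": "k8s-diagnostician", "model": "haiku"},
        {"agent": "prometheus-analyst"},
        {"agent": "argocd-operator", "model": "haiku"},
        {"agent": "k8s-orchestrator"},
    ],
}

models = [resolve_model(step, workflow) for step in workflow["steps"]]
print(models)
```

Under this assumed precedence, the workflow-level `model: sonnet` also applies to the final `k8s-orchestrator` step, which would otherwise default to Opus. If the summary step should stay on Opus, it would need an explicit per-step `model`.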

### Markdown Workflows (Simple)

```markdown
# Deploy New App

When asked to deploy a new application:

1. Ask git-operator to create the manifest structure in the GitOps repo
2. Ask argocd-operator to create and sync the ArgoCD application
3. Ask k8s-diagnostician to verify pods are running
4. Report deployment status
```

### Incident Response Workflow Example

```yaml
name: pod-crashloop-remediation
trigger:
  type: alert
  match:
    alertname: KubePodCrashLooping

steps:
  - name: diagnose
    agent: k8s-diagnostician
    action: get-pod-status
    inputs:
      namespace: "{{ alert.labels.namespace }}"
      pod: "{{ alert.labels.pod }}"

  - name: check-logs
    agent: k8s-diagnostician
    action: analyze-logs
    inputs:
      pod: "{{ steps.diagnose.pod }}"
      lines: 100

  - name: decide-action
    condition: "{{ steps.check-logs.cause == 'oom' }}"
    branches:
      true:
        agent: argocd-operator
        action: update-resources
        confirm: true  # risky action
      false:
        agent: k8s-diagnostician
        action: restart-pod
        confirm: false  # safe action

  - name: notify
    action: report
    outputs:
      - summary
      - actions-taken
```
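
The `{{ ... }}` placeholders in this workflow refer to the triggering alert and to earlier step outputs. The design does not name a template engine, so the following is a hand-rolled sketch under that assumption. `lookup`, `render`, and `evaluate` are hypothetical helpers, the sample context values are invented, and only the single `path == 'literal'` condition form used above is supported. Hyphenated step names such as `check-logs` are treated as plain dictionary keys.

```python
import re

# "{{ alert.labels.pod }}" style placeholder: a dotted path, hyphens allowed.
PLACEHOLDER = re.compile(r"\{\{\s*([\w.-]+)\s*\}\}")
# "{{ steps.check-logs.cause == 'oom' }}" style condition.
CONDITION = re.compile(r"\{\{\s*([\w.-]+)\s*==\s*'([^']*)'\s*\}\}")


def lookup(context: dict, dotted_path: str):
    """Walk nested dicts following a dotted path such as "alert.labels.pod"."""
    value = context
    for key in dotted_path.split("."):
        value = value[key]
    return value


def render(template: str, context: dict) -> str:
    """Replace every placeholder in the template with its context value."""
    return PLACEHOLDER.sub(lambda m: str(lookup(context, m.group(1))), template)


def evaluate(condition: str, context: dict) -> bool:
    """Evaluate a single equality condition against the context."""
    match = CONDITION.fullmatch(condition.strip())
    if match is None:
        raise ValueError(f"unsupported condition: {condition!r}")
    return lookup(context, match.group(1)) == match.group(2)


# Hypothetical context after the first two steps have run.
context = {
    "alert": {"labels": {"namespace": "prod", "pod": "web-0"}},
    "steps": {"diagnose": {"pod": "web-0"}, "check-logs": {"cause": "oom"}},
}

print(render("{{ alert.labels.namespace }}/{{ steps.diagnose.pod }}", context))
print(evaluate("{{ steps.check-logs.cause == 'oom' }}", context))
```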

---

## Autonomy Model

### Tiered Autonomy

| Action Type | Behavior | Examples |
|-------------|----------|----------|
| **Safe** | Auto-execute, log action | get, describe, logs, list, restart pod |
| **Confirm** | Require user approval | delete, patch, scale, apply, modify config |
| **Forbidden** | Reject with explanation | drain, cordon, delete node |

### Confirmation Flow

```
1. Agent proposes action with rationale
2. System checks action against autonomy rules
3. If safe → execute immediately, log action
4. If confirm → present to user (CLI prompt or dashboard queue)
5. If forbidden → reject with explanation
```

### Per-Workflow Overrides

```yaml
name: emergency-pod-restart
autonomy:
  auto_approve:
    - restart_pod
    - scale_replicas
  always_confirm:
    - delete_pvc
```
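
A workflow-level `autonomy` block has to be combined with the global tier for each action. A minimal sketch of one way to combine them follows. Two rules in it are assumptions not stated in the design: a globally forbidden action can never be unlocked by a workflow, and `always_confirm` takes priority over `auto_approve` when an action appears in both. `effective_tier` is a hypothetical helper.

```python
def effective_tier(action: str, global_tier: str, overrides: dict) -> str:
    """Combine the global tier with a workflow's autonomy overrides."""
    # Assumption: workflows cannot loosen a forbidden action.
    if global_tier == "forbidden":
        return "forbidden"
    # Assumption: the stricter override wins if an action is in both lists.
    if action in overrides.get("always_confirm", []):
        return "confirm"
    if action in overrides.get("auto_approve", []):
        return "safe"
    return global_tier


# The autonomy block from the emergency-pod-restart workflow above.
overrides = {
    "auto_approve": ["restart_pod", "scale_replicas"],
    "always_confirm": ["delete_pvc"],
}

print(effective_tier("restart_pod", "confirm", overrides))
print(effective_tier("delete_pvc", "safe", overrides))
print(effective_tier("drain_node", "forbidden", {"auto_approve": ["drain_node"]}))
```

Keeping the forbidden check first means a mistyped or overly broad `auto_approve` list in one workflow cannot widen what the system as a whole is allowed to do.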

### Action Logging

```
~/.claude/logs/actions/2025-12-26-actions.jsonl
```

Each entry includes:
- Timestamp
- Agent
- Action
- Inputs
- Outcome
- Approval type (auto/user-confirmed)
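
The entry fields listed above map naturally onto one JSON object per line. The sketch below appends such an entry to a dated `.jsonl` file matching the path pattern shown. The JSON key names, the use of UTC for both the timestamp and the file date, and the `log_action` helper itself are assumptions for illustration.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def log_action(log_dir, agent, action, inputs, outcome, approval) -> Path:
    """Append one audit entry to <log_dir>/<YYYY-MM-DD>-actions.jsonl."""
    now = datetime.now(timezone.utc)
    entry = {
        "timestamp": now.isoformat(),
        "agent": agent,
        "action": action,
        "inputs": inputs,
        "outcome": outcome,
        "approval": approval,  # "auto" or "user-confirmed"
    }
    path = Path(log_dir) / f"{now:%Y-%m-%d}-actions.jsonl"
    path.parent.mkdir(parents=True, exist_ok=True)
    # Append mode keeps earlier entries; one JSON object per line.
    with path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(entry) + "\n")
    return path


if __name__ == "__main__":
    written = log_action(
        Path.home() / ".claude" / "logs" / "actions",
        agent="k8s-diagnostician",
        action="get",
        inputs={"resource": "nodes"},
        outcome="success",
        approval="auto",
    )
    print(f"logged to {written}")
```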

---

## Skills (User Entry Points)

| Skill | Command | Purpose |
|-------|---------|---------|
| cluster-status | `/cluster-status` | Quick health overview |
| deploy | `/deploy <app>` | Deploy or update an app |
| diagnose | `/diagnose <issue>` | Investigate a problem |
| rollback | `/rollback <app>` | Revert to previous version |
| workflow | `/workflow <name>` | Run a named workflow |

### Example Skill: cluster-status.md

```markdown
# Cluster Status

Invoke the k8s-orchestrator to provide a quick health overview.

## Steps
1. Delegate to k8s-diagnostician: get node status
2. Delegate to prometheus-analyst: check for active alerts
3. Delegate to argocd-operator: list out-of-sync apps
4. Summarize in a concise table

## Output Format
- Node health: table
- Active alerts: bullet list
- ArgoCD status: table
- Recommendations: if any issues found
```

---

## Interaction Methods

### Terminal/CLI

- Primary interaction via Claude Code
- Fallback when cluster is unavailable
- Use skills to invoke workflows

### Dashboard (Web UI)

- Deployed on cluster (Pi 3 node)
- Views: Status, Pending Confirmations, History, Workflows
- Approve/reject risky actions

### Push Notifications (Future)

- Discord, Slack, or Telegram integration
- Alert on issues requiring attention

---

## Dashboard Specification

### Tech Stack

- **Backend**: Go binary (single static binary, embedded assets)
- **Storage**: SQLite or flat JSON files
- **Resources**: Minimal footprint for Pi 3

### Deployment

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: k8s-agent-dashboard
spec:
  replicas: 1
  # apps/v1 requires a selector that matches the pod template labels.
  # The label key/value below is illustrative.
  selector:
    matchLabels:
      app: k8s-agent-dashboard
  template:
    metadata:
      labels:
        app: k8s-agent-dashboard
    spec:
      containers:
        - name: dashboard
          image: k8s-agent-dashboard:latest
          resources:
            requests:
              memory: "32Mi"
              cpu: "10m"
            limits:
              memory: "64Mi"
              cpu: "100m"
      # Allows (but does not force) scheduling onto the tainted Pi 3 node.
      tolerations:
        - key: "node-type"
          operator: "Equal"
          value: "pi3"
          effect: "NoSchedule"
      # All three nodes are arm64, so this selector alone does not pin
      # the pod to the Pi 3.
      nodeSelector:
        kubernetes.io/arch: arm64
```

### Views

| View | Description |
|------|-------------|
| Status | Current cluster health, active alerts, ArgoCD sync state |
| Pending | Actions awaiting confirmation with approve/reject buttons |
| History | Recent actions taken, filterable by agent/workflow |
| Workflows | List of defined workflows, manual trigger capability |

---

## Implementation Phases

### Phase 1: Core Agent System

**Deliverables:**
- `~/.claude/` directory structure
- Orchestrator and 4 subagent prompt files
- `settings.json` with agent configurations
- 3-4 essential workflows (cluster-health, deploy, diagnose)
- Core skills (/cluster-status, /deploy, /diagnose)

**Validation:**
- Manual CLI invocation
- Test each subagent independently
- Run health check workflow end-to-end

### Phase 2: Dashboard

**Deliverables:**
- Go-based dashboard application
- Kubernetes manifests for Pi 3 deployment
- Pending confirmations queue
- Action history view
- Approval flow integration

### Phase 3: Automation

**Deliverables:**
- Scheduled workflow execution
- Alertmanager webhook integration
- Expanded incident response workflows

### Phase 4: Expansion (Future)

**Potential additions:**
- Push notifications (Discord/Telegram)
- Additional domains (development, research, productivity)
- SDK-based background daemon for true autonomy

---

## Future Domain Expansion

The system is designed to expand beyond DevOps:

| Domain | Use Cases |
|--------|-----------|
| Software Development | Code generation, refactoring, testing across repos |
| Research & Analysis | Information gathering, summarizing, recommendations |
| Personal Productivity | File management, notes, task tracking |

New domains would add:
- Additional subagents with specialized prompts
- Domain-specific workflows
- New skills for user invocation