K8s Agent Orchestrator System - Design Document
Overview
A user-level Claude Code agent system for autonomous Kubernetes cluster management. The system consists of an orchestrator agent that delegates to specialized subagents, with workflow definitions for common operations and a tiered autonomy model.
Location: ~/.claude/
Primary Domain: DevOps/Infrastructure
Target: Raspberry Pi k0s cluster
Cluster Environment
Hardware
| Node | Hardware | RAM | Role |
|---|---|---|---|
| Node 1 | Raspberry Pi 5 | 8GB | Control plane + Worker |
| Node 2 | Raspberry Pi 5 | 8GB | Worker |
| Node 3 | Raspberry Pi 3B+ | 1GB | Worker (tainted, tolerations required) |
- Architecture: All nodes run arm64 (64-bit OS)
- Pi 3 node: Reserved for lightweight workloads only
Stack
| Component | Technology |
|---|---|
| K8s Distribution | k0s |
| GitOps | ArgoCD |
| Git Hosting | Self-hosted Gitea/Forgejo |
| Monitoring | Prometheus + Alertmanager + Grafana |
CLI Tools Available
- kubectl
- argocd
- k0sctl
Architecture
Three-Layer Design
┌─────────────────────────────────────────────────────────────┐
│                       User Interface                        │
│              Terminal (CLI) | Dashboard (Web)               │
└─────────────────────┬───────────────────┬───────────────────┘
                      │                   │
┌─────────────────────▼───────────────────▼───────────────────┐
│                     Orchestrator Layer                      │
│                      k8s-orchestrator                       │
│         (Opus - complex reasoning, task delegation)         │
└─────────────────────┬───────────────────────────────────────┘
                      │ delegates to
┌─────────────────────▼───────────────────────────────────────┐
│                      Specialist Layer                       │
│  ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────┐│
│  │k8s-         │ │argocd-      │ │prometheus-  │ │git-     ││
│  │diagnostician│ │operator     │ │analyst      │ │operator ││
│  │(Sonnet)     │ │(Sonnet)     │ │(Sonnet)     │ │(Sonnet) ││
│  └─────────────┘ └─────────────┘ └─────────────┘ └─────────┘│
└─────────────────────────────────────────────────────────────┘
                      │ defined by
┌─────────────────────▼───────────────────────────────────────┐
│                       Workflow Layer                        │
│             YAML (complex) | Markdown (simple)              │
└─────────────────────────────────────────────────────────────┘
Directory Structure
~/.claude/
├── settings.json                  # Agent definitions, autonomy rules
├── agents/
│   ├── k8s-orchestrator.md        # Orchestrator prompt
│   ├── k8s-diagnostician.md       # Cluster diagnostics specialist
│   ├── argocd-operator.md         # GitOps operations specialist
│   ├── prometheus-analyst.md      # Metrics analysis specialist
│   └── git-operator.md            # Git/Gitea operations specialist
├── workflows/
│   ├── health/
│   │   ├── cluster-health-check.yaml
│   │   └── node-pressure-response.yaml
│   ├── deploy/
│   │   ├── deploy-app.md
│   │   └── rollback-app.yaml
│   └── incidents/
│       └── pod-crashloop.yaml
├── skills/
│   ├── cluster-status.md
│   ├── deploy.md
│   ├── diagnose.md
│   ├── rollback.md
│   └── workflow.md
├── logs/
│   ├── actions/                   # Action audit trail
│   └── workflows/                 # Workflow execution logs
└── docs/plans/
Subagent Definitions
settings.json
{
  "agents": {
    "k8s-orchestrator": {
      "model": "opus",
      "promptFile": "agents/k8s-orchestrator.md"
    },
    "k8s-diagnostician": {
      "model": "sonnet",
      "promptFile": "agents/k8s-diagnostician.md"
    },
    "argocd-operator": {
      "model": "sonnet",
      "promptFile": "agents/argocd-operator.md"
    },
    "prometheus-analyst": {
      "model": "sonnet",
      "promptFile": "agents/prometheus-analyst.md"
    },
    "git-operator": {
      "model": "sonnet",
      "promptFile": "agents/git-operator.md"
    }
  },
  "autonomy": {
    "safe_actions": ["get", "describe", "logs", "list", "top", "diff"],
    "confirm_actions": ["delete", "patch", "edit", "scale", "rollout", "apply"],
    "forbidden_actions": ["drain", "cordon", "delete node", "reset"]
  }
}
Subagent Responsibilities
| Agent | Scope | Tools |
|---|---|---|
| k8s-orchestrator | Task analysis, delegation, decision making | All (via delegation) |
| k8s-diagnostician | Cluster health, pod/node status, logs | kubectl, log tools |
| argocd-operator | App sync, deployments, rollbacks | argocd CLI, kubectl |
| prometheus-analyst | Metrics, alerts, trends | PromQL, Prometheus API |
| git-operator | Manifest commits, PRs, GitOps repo | git, Gitea API |
Model Assignment
Defaults
- Orchestrator: Opus (complex reasoning, task delegation)
- Subagents: Sonnet (standard operations)
Override Levels
- Per-workflow: Specify in workflow YAML
- Per-step: Specify for individual workflow steps
- Dynamic: Orchestrator selects based on task complexity
Dynamic Model Selection (Orchestrator Logic)
| Task Complexity | Model | Examples |
|---|---|---|
| Simple | Haiku | Get status, list resources, log tail |
| Standard | Sonnet | Analyze logs, diagnose issues, sync apps |
| Complex | Opus | Root cause analysis, cascading failures, trade-off decisions |
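The override levels and the complexity table can be combined into a single lookup. The following is a minimal sketch, not the orchestrator's actual logic: the function name `resolve_model` and the precedence order (per-step, then per-workflow, then dynamic selection) are assumptions, since the design does not state which override wins.

```python
from typing import Optional

# Complexity tiers and models as listed in the table above.
MODEL_BY_COMPLEXITY = {
    "simple": "haiku",
    "standard": "sonnet",
    "complex": "opus",
}


def resolve_model(
    complexity: str,
    workflow_model: Optional[str] = None,
    step_model: Optional[str] = None,
) -> str:
    """Pick a model for one delegated task.

    Assumed precedence (not stated in the design): a per-step override
    wins over a per-workflow override, which wins over the dynamic
    choice based on task complexity.
    """
    if step_model:
        return step_model
    if workflow_model:
        return workflow_model
    if complexity not in MODEL_BY_COMPLEXITY:
        raise ValueError(f"unknown complexity tier: {complexity!r}")
    return MODEL_BY_COMPLEXITY[complexity]
```

With no overrides, `resolve_model("simple")` falls through to the table and selects Haiku; a step-level `model: haiku` would take priority even inside a workflow that defaults to Sonnet.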
Delegation syntax:
Delegate to k8s-diagnostician (haiku):
  Task: Get current node status

Delegate to prometheus-analyst (sonnet):
  Task: Analyze memory trends for namespace "prod" over last 24h

Delegate to k8s-diagnostician (opus):
  Task: Investigate cascading failure across multiple services
Workflow Definitions
YAML Workflows (Complex)
name: cluster-health-check
description: Comprehensive cluster health assessment
model: sonnet # optional default override
trigger:
  - schedule: "0 */6 * * *" # every 6 hours
  - manual: true
steps:
  - agent: k8s-diagnostician
    model: haiku # simple status check
    task: Check node status and resource pressure
  - agent: prometheus-analyst
    task: Query for anomalies in last 6 hours
  - agent: argocd-operator
    model: haiku
    task: Check all apps sync status
  - agent: k8s-orchestrator
    task: Summarize findings and recommend actions
    confirm_if: actions_proposed
Markdown Workflows (Simple)
# Deploy New App
When asked to deploy a new application:
1. Ask git-operator to create the manifest structure in the GitOps repo
2. Ask argocd-operator to create and sync the ArgoCD application
3. Ask k8s-diagnostician to verify pods are running
4. Report deployment status
Incident Response Workflow Example
name: pod-crashloop-remediation
trigger:
  type: alert
  match:
    alertname: KubePodCrashLooping
steps:
  - name: diagnose
    agent: k8s-diagnostician
    action: get-pod-status
    inputs:
      namespace: "{{ alert.labels.namespace }}"
      pod: "{{ alert.labels.pod }}"
  - name: check-logs
    agent: k8s-diagnostician
    action: analyze-logs
    inputs:
      pod: "{{ steps.diagnose.pod }}"
      lines: 100
  - name: decide-action
    condition: "{{ steps.check-logs.cause == 'oom' }}"
    branches:
      true:
        agent: argocd-operator
        action: update-resources
        confirm: true # risky action
      false:
        agent: k8s-diagnostician
        action: restart-pod
        confirm: false # safe action
  - name: notify
    action: report
    outputs:
      - summary
      - actions-taken
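The `{{ ... }}` placeholders in the workflow above imply some form of template substitution against the alert and earlier step outputs. The design does not say which engine performs it, so the following is only a sketch of one possible approach: `render` and `lookup` are hypothetical helpers that resolve plain dotted paths, and expressions such as the `== 'oom'` condition are deliberately left out of scope.

```python
import re
from typing import Any, Mapping

# Matches "{{ dotted.path }}"; hyphens are allowed because step names
# such as "check-logs" contain them.
PLACEHOLDER = re.compile(r"\{\{\s*([\w.\-]+)\s*\}\}")


def lookup(context: Mapping[str, Any], dotted_path: str) -> Any:
    """Walk nested mappings following a path like 'alert.labels.pod'."""
    value: Any = context
    for key in dotted_path.split("."):
        value = value[key]  # raises KeyError if the path does not exist
    return value


def render(template: str, context: Mapping[str, Any]) -> str:
    """Replace every simple placeholder with its value from the context."""
    return PLACEHOLDER.sub(
        lambda match: str(lookup(context, match.group(1))), template
    )
```

Given a context such as `{"alert": {"labels": {"namespace": "prod"}}}`, `render("{{ alert.labels.namespace }}", context)` yields `prod`. A missing path raises `KeyError`, which seems preferable to silently passing an empty namespace to a cluster command.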
Autonomy Model
Tiered Autonomy
| Action Type | Behavior | Examples |
|---|---|---|
| Safe | Auto-execute, log action | get, describe, logs, list, restart pod |
| Confirm | Require user approval | delete, patch, scale, apply, modify config |
| Forbidden | Reject with explanation | drain, cordon, delete node |
Confirmation Flow
1. Agent proposes action with rationale
2. System checks action against autonomy rules
3. If safe → execute immediately, log action
4. If confirm → present to user (CLI prompt or dashboard queue)
5. If forbidden → reject with explanation
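The rule check in step 2 could look roughly like the sketch below. The action lists are copied from the `autonomy` block in settings.json; everything else is an assumption made for illustration: the `Tier` enum, the `classify` function, prefix matching on the command text, and the choice to treat unlisted actions as requiring confirmation.

```python
from enum import Enum
from typing import Dict, List

# Copied from the "autonomy" section of settings.json.
AUTONOMY: Dict[str, List[str]] = {
    "safe_actions": ["get", "describe", "logs", "list", "top", "diff"],
    "confirm_actions": ["delete", "patch", "edit", "scale", "rollout", "apply"],
    "forbidden_actions": ["drain", "cordon", "delete node", "reset"],
}


class Tier(Enum):
    SAFE = "safe"
    CONFIRM = "confirm"
    FORBIDDEN = "forbidden"


def _matches(command: str, pattern: str) -> bool:
    """True if the command is the pattern or starts with it as whole words."""
    return command == pattern or command.startswith(pattern + " ")


def classify(command: str, rules: Dict[str, List[str]] = AUTONOMY) -> Tier:
    """Map a proposed command (e.g. 'delete pod web-1') to an autonomy tier."""
    normalized = " ".join(command.lower().split())
    # Check forbidden first so "delete node ..." is not treated as "delete".
    if any(_matches(normalized, p) for p in rules["forbidden_actions"]):
        return Tier.FORBIDDEN
    if any(_matches(normalized, p) for p in rules["safe_actions"]):
        return Tier.SAFE
    # Assumption: anything not explicitly safe needs user approval.
    return Tier.CONFIRM
```

Checking the forbidden list before the others matters because `delete node` shares a prefix with the confirmable `delete`. Defaulting unknown actions to confirmation is a fail-closed choice; the design itself does not specify what happens to unlisted actions.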
Per-Workflow Overrides
name: emergency-pod-restart
autonomy:
  auto_approve:
    - restart_pod
    - scale_replicas
  always_confirm:
    - delete_pvc
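How a workflow's `autonomy` block interacts with the global tiers is not spelled out, so the sketch below picks one plausible interpretation. `effective_tier` is a hypothetical helper, and the rule that overrides can never unlock a forbidden action is an assumption chosen for safety.

```python
from typing import Dict, List


def effective_tier(
    action: str,
    global_tier: str,
    overrides: Dict[str, List[str]],
) -> str:
    """Apply a workflow's autonomy overrides to a globally assigned tier.

    Tiers are the strings "safe", "confirm" and "forbidden".
    """
    # Assumption: forbidden actions stay forbidden regardless of overrides.
    if global_tier == "forbidden":
        return "forbidden"
    # always_confirm takes priority over auto_approve if both list the action.
    if action in overrides.get("always_confirm", []):
        return "confirm"
    if action in overrides.get("auto_approve", []):
        return "safe"
    return global_tier
```

For the `emergency-pod-restart` workflow, `scale_replicas` would move from confirm to auto-execute, while `delete_pvc` would still prompt the user.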
Action Logging
~/.claude/logs/actions/2025-12-26-actions.jsonl
Each entry includes:
- Timestamp
- Agent
- Action
- Inputs
- Outcome
- Approval type (auto/user-confirmed)
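A log writer for this format could be as small as the sketch below. The dated file name follows the example path above; the JSON key names and the `log_action` function itself are assumptions, since the design lists the fields but not their exact spelling.

```python
import json
from datetime import datetime, timezone
from pathlib import Path
from typing import Any, Dict, Optional


def log_action(
    log_dir: Path,
    agent: str,
    action: str,
    inputs: Dict[str, Any],
    outcome: str,
    approval: str,
    now: Optional[datetime] = None,
) -> Path:
    """Append one audit entry to <log_dir>/<YYYY-MM-DD>-actions.jsonl."""
    if approval not in ("auto", "user-confirmed"):
        raise ValueError(f"unexpected approval type: {approval!r}")
    now = now or datetime.now(timezone.utc)
    log_dir.mkdir(parents=True, exist_ok=True)
    path = log_dir / f"{now:%Y-%m-%d}-actions.jsonl"
    entry = {
        "timestamp": now.isoformat(),
        "agent": agent,
        "action": action,
        "inputs": inputs,
        "outcome": outcome,
        "approval": approval,
    }
    # One JSON object per line keeps the file appendable and easy to grep.
    with path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(entry) + "\n")
    return path
```

Calling `log_action(Path.home() / ".claude/logs/actions", "k8s-diagnostician", "get", {"resource": "nodes"}, "ok", "auto")` appends a line to the current day's file, creating the directory if needed.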
Skills (User Entry Points)
| Skill | Command | Purpose |
|---|---|---|
| cluster-status | /cluster-status | Quick health overview |
| deploy | /deploy <app> | Deploy or update an app |
| diagnose | /diagnose <issue> | Investigate a problem |
| rollback | /rollback <app> | Revert to previous version |
| workflow | /workflow <name> | Run a named workflow |
Example Skill: cluster-status.md
# Cluster Status
Invoke the k8s-orchestrator to provide a quick health overview.
## Steps
1. Delegate to k8s-diagnostician: get node status
2. Delegate to prometheus-analyst: check for active alerts
3. Delegate to argocd-operator: list out-of-sync apps
4. Summarize in a concise table
## Output Format
- Node health: table
- Active alerts: bullet list
- ArgoCD status: table
- Recommendations: if any issues found
Interaction Methods
Terminal/CLI
- Primary interaction via Claude Code
- Fallback when the cluster, and with it the dashboard, is unavailable
- Use skills to invoke workflows
Dashboard (Web UI)
- Deployed on cluster (Pi 3 node)
- Views: Status, Pending Confirmations, History, Workflows
- Approve/reject risky actions
Push Notifications (Future)
- Discord, Slack, or Telegram integration
- Alert on issues requiring attention
Dashboard Specification
Tech Stack
- Backend: Go binary (single static binary, embedded assets)
- Storage: SQLite or flat JSON files
- Resources: Minimal footprint for Pi 3
Deployment
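apiVersion: apps/v1
kind: Deployment
metadata:
  name: k8s-agent-dashboard
spec:
  replicas: 1
  # apps/v1 requires a selector that matches the pod template labels;
  # the "app" label name is an assumed choice.
  selector:
    matchLabels:
      app: k8s-agent-dashboard
  template:
    metadata:
      labels:
        app: k8s-agent-dashboard
    spec:
      containers:
        - name: dashboard
          image: k8s-agent-dashboard:latest
          resources:
            requests:
              memory: "32Mi"
              cpu: "10m"
            limits:
              memory: "64Mi"
              cpu: "100m"
      # Tolerates the Pi 3 taint so the pod may be scheduled there.
      tolerations:
        - key: "node-type"
          operator: "Equal"
          value: "pi3"
          effect: "NoSchedule"
      nodeSelector:
        kubernetes.io/arch: arm64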
Views
| View | Description |
|---|---|
| Status | Current cluster health, active alerts, ArgoCD sync state |
| Pending | Actions awaiting confirmation with approve/reject buttons |
| History | Recent actions taken, filterable by agent/workflow |
| Workflows | List of defined workflows, manual trigger capability |
Implementation Phases
Phase 1: Core Agent System
Deliverables:
- ~/.claude/ directory structure
- Orchestrator and 4 subagent prompt files
- settings.json with agent configurations
- 3-4 essential workflows (cluster-health, deploy, diagnose)
- Core skills (/cluster-status, /deploy, /diagnose)
Validation:
- Manual CLI invocation
- Test each subagent independently
- Run health check workflow end-to-end
Phase 2: Dashboard
Deliverables:
- Go-based dashboard application
- Kubernetes manifests for Pi 3 deployment
- Pending confirmations queue
- Action history view
- Approval flow integration
Phase 3: Automation
Deliverables:
- Scheduled workflow execution
- Alertmanager webhook integration
- Expanded incident response workflows
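For the Alertmanager webhook integration, the receiving side needs to decide which alerts in a notification should start a workflow. The sketch below assumes the standard Alertmanager webhook payload, which carries an `alerts` list whose items have `status` and `labels` fields, and compares labels against a workflow's `trigger.match` block. `matching_alerts` is a hypothetical helper, not part of any existing component.

```python
from typing import Any, Dict, List


def matching_alerts(
    payload: Dict[str, Any],
    match: Dict[str, str],
) -> List[Dict[str, Any]]:
    """Return firing alerts whose labels contain every key/value in `match`."""
    selected = []
    for alert in payload.get("alerts", []):
        if alert.get("status") != "firing":
            continue  # resolved alerts should not start remediation
        labels = alert.get("labels", {})
        if all(labels.get(key) == value for key, value in match.items()):
            selected.append(alert)
    return selected
```

With `match = {"alertname": "KubePodCrashLooping"}`, each returned alert could then be handed to the pod-crashloop workflow as its `alert` context.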
Phase 4: Expansion (Future)
Potential additions:
- Push notifications (Discord/Telegram)
- Additional domains (development, research, productivity)
- SDK-based background daemon for true autonomy
Future Domain Expansion
The system is designed to expand beyond DevOps:
| Domain | Use Cases |
|---|---|
| Software Development | Code generation, refactoring, testing across repos |
| Research & Analysis | Information gathering, summarizing, recommendations |
| Personal Productivity | File management, notes, task tracking |
New domains would add:
- Additional subagents with specialized prompts
- Domain-specific workflows
- New skills for user invocation