# K8s Agent Orchestrator System - Design Document

## Overview

A user-level Claude Code agent system for autonomous Kubernetes cluster management. The system consists of an orchestrator agent that delegates to specialized subagents, with workflow definitions for common operations and a tiered autonomy model.

**Location**: `~/.claude/`

**Primary Domain**: DevOps/Infrastructure

**Target**: Raspberry Pi k0s cluster

---

## Cluster Environment

### Hardware

| Node | Hardware | RAM | Role |
|------|----------|-----|------|
| Node 1 | Raspberry Pi 5 | 8GB | Control plane + Worker |
| Node 2 | Raspberry Pi 5 | 8GB | Worker |
| Node 3 | Raspberry Pi 3B+ | 1GB | Worker (tainted, tolerations required) |

- **Architecture**: All nodes run arm64 (64-bit OS)
- **Pi 3 node**: Reserved for lightweight workloads only

### Stack

| Component | Technology |
|-----------|------------|
| K8s Distribution | k0s |
| GitOps | ArgoCD |
| Git Hosting | Self-hosted Gitea/Forgejo |
| Monitoring | Prometheus + Alertmanager + Grafana |

### CLI Tools Available

- `kubectl`
- `argocd`
- `k0sctl`

---

## Architecture

### Three-Layer Design

```
┌─────────────────────────────────────────────────────────────┐
│                       User Interface                        │
│              Terminal (CLI) | Dashboard (Web)               │
└─────────────────────┬───────────────────┬───────────────────┘
                      │                   │
┌─────────────────────▼───────────────────▼───────────────────┐
│                     Orchestrator Layer                      │
│                      k8s-orchestrator                       │
│         (Opus - complex reasoning, task delegation)         │
└─────────────────────┬───────────────────────────────────────┘
                      │ delegates to
┌─────────────────────▼───────────────────────────────────────┐
│                      Specialist Layer                       │
│  ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────┐│
│  │k8s-         │ │argocd-      │ │prometheus-  │ │git-     ││
│  │diagnostician│ │operator     │ │analyst      │ │operator ││
│  │(Sonnet)     │ │(Sonnet)     │ │(Sonnet)     │ │(Sonnet) ││
│  └─────────────┘ └─────────────┘ └─────────────┘ └─────────┘│
└─────────────────────────────────────────────────────────────┘
                      │ defined by
┌─────────────────────▼───────────────────────────────────────┐
│                       Workflow Layer                        │
│             YAML (complex) | Markdown (simple)              │
└─────────────────────────────────────────────────────────────┘
```

### Directory Structure

```
~/.claude/
├── settings.json                 # Agent definitions, autonomy rules
├── agents/
│   ├── k8s-orchestrator.md       # Orchestrator prompt
│   ├── k8s-diagnostician.md      # Cluster diagnostics specialist
│   ├── argocd-operator.md        # GitOps operations specialist
│   ├── prometheus-analyst.md     # Metrics analysis specialist
│   └── git-operator.md           # Git/Gitea operations specialist
├── workflows/
│   ├── health/
│   │   ├── cluster-health-check.yaml
│   │   └── node-pressure-response.yaml
│   ├── deploy/
│   │   ├── deploy-app.md
│   │   └── rollback-app.yaml
│   └── incidents/
│       └── pod-crashloop.yaml
├── skills/
│   ├── cluster-status.md
│   ├── deploy.md
│   ├── diagnose.md
│   ├── rollback.md
│   └── workflow.md
├── logs/
│   ├── actions/                  # Action audit trail
│   └── workflows/                # Workflow execution logs
└── docs/plans/
```

---

## Subagent Definitions

### settings.json

```json
{
  "agents": {
    "k8s-orchestrator": {
      "model": "opus",
      "promptFile": "agents/k8s-orchestrator.md"
    },
    "k8s-diagnostician": {
      "model": "sonnet",
      "promptFile": "agents/k8s-diagnostician.md"
    },
    "argocd-operator": {
      "model": "sonnet",
      "promptFile": "agents/argocd-operator.md"
    },
    "prometheus-analyst": {
      "model": "sonnet",
      "promptFile": "agents/prometheus-analyst.md"
    },
    "git-operator": {
      "model": "sonnet",
      "promptFile": "agents/git-operator.md"
    }
  },
  "autonomy": {
    "safe_actions": ["get", "describe", "logs", "list", "top", "diff"],
    "confirm_actions": ["delete", "patch", "edit", "scale", "rollout", "apply"],
    "forbidden_actions": ["drain", "cordon", "delete node", "reset"]
  }
}
```

### Subagent Responsibilities

| Agent | Scope | Tools |
|-------|-------|-------|
| **k8s-orchestrator** | Task analysis, delegation, decision making | All (via delegation) |
| **k8s-diagnostician** | Cluster health, pod/node status, logs | kubectl, log tools |
| **argocd-operator** | App sync, deployments, rollbacks | argocd CLI, kubectl |
| **prometheus-analyst** | Metrics, alerts, trends | PromQL, Prometheus API |
| **git-operator** | Manifest commits, PRs, GitOps repo | git, Gitea API |

---

## Model Assignment

### Defaults

- **Orchestrator**: Opus (complex reasoning, task delegation)
- **Subagents**: Sonnet (standard operations)

### Override Levels

1. **Per-workflow**: Specify in workflow YAML
2. **Per-step**: Specify for individual workflow steps
3. **Dynamic**: Orchestrator selects based on task complexity

### Dynamic Model Selection (Orchestrator Logic)

| Task Complexity | Model | Examples |
|-----------------|-------|----------|
| Simple | Haiku | Get status, list resources, log tail |
| Standard | Sonnet | Analyze logs, diagnose issues, sync apps |
| Complex | Opus | Root cause analysis, cascading failures, trade-off decisions |

**Delegation syntax:**

```markdown
Delegate to k8s-diagnostician (haiku):
  Task: Get current node status

Delegate to prometheus-analyst (sonnet):
  Task: Analyze memory trends for namespace "prod" over last 24h

Delegate to k8s-diagnostician (opus):
  Task: Investigate cascading failure across multiple services
```

---

## Workflow Definitions

### YAML Workflows (Complex)

```yaml
name: cluster-health-check
description: Comprehensive cluster health assessment
model: sonnet  # optional default override

trigger:
  - schedule: "0 */6 * * *"  # every 6 hours
  - manual: true

steps:
  - agent: k8s-diagnostician
    model: haiku  # simple status check
    task: Check node status and resource pressure

  - agent: prometheus-analyst
    task: Query for anomalies in last 6 hours

  - agent: argocd-operator
    model: haiku
    task: Check all apps sync status

  - agent: k8s-orchestrator
    task: Summarize findings and recommend actions
    confirm_if: actions_proposed
```

### Markdown Workflows (Simple)

```markdown
# Deploy New App

When asked to deploy a new application:

1. Ask git-operator to create the manifest structure in the GitOps repo
2. Ask argocd-operator to create and sync the ArgoCD application
3. Ask k8s-diagnostician to verify pods are running
4. Report deployment status
```

### Incident Response Workflow Example

```yaml
name: pod-crashloop-remediation
trigger:
  type: alert
  match:
    alertname: KubePodCrashLooping

steps:
  - name: diagnose
    agent: k8s-diagnostician
    action: get-pod-status
    inputs:
      namespace: "{{ alert.labels.namespace }}"
      pod: "{{ alert.labels.pod }}"

  - name: check-logs
    agent: k8s-diagnostician
    action: analyze-logs
    inputs:
      pod: "{{ steps.diagnose.pod }}"
      lines: 100

  - name: decide-action
    condition: "{{ steps.check-logs.cause == 'oom' }}"
    branches:
      true:
        agent: argocd-operator
        action: update-resources
        confirm: true  # risky action
      false:
        agent: k8s-diagnostician
        action: restart-pod
        confirm: false  # safe action

  - name: notify
    action: report
    outputs:
      - summary
      - actions-taken
```

---

## Autonomy Model

### Tiered Autonomy

| Action Type | Behavior | Examples |
|-------------|----------|----------|
| **Safe** | Auto-execute, log action | get, describe, logs, list, restart pod |
| **Confirm** | Require user approval | delete, patch, scale, apply, modify config |
| **Forbidden** | Reject with explanation | drain, cordon, delete node |

### Confirmation Flow

```
1. Agent proposes action with rationale
2. System checks action against autonomy rules
3. If safe → execute immediately, log action
4. If confirm → present to user (CLI prompt or dashboard queue)
5. If forbidden → reject with explanation
```

### Per-Workflow Overrides

```yaml
name: emergency-pod-restart
autonomy:
  auto_approve:
    - restart_pod
    - scale_replicas
  always_confirm:
    - delete_pvc
```

### Action Logging

```
~/.claude/logs/actions/2025-12-26-actions.jsonl
```

Each entry includes:

- Timestamp
- Agent
- Action
- Inputs
- Outcome
- Approval type (auto/user-confirmed)

---

## Skills (User Entry Points)

| Skill | Command | Purpose |
|-------|---------|---------|
| cluster-status | `/cluster-status` | Quick health overview |
| deploy | `/deploy ` | Deploy or update an app |
| diagnose | `/diagnose ` | Investigate a problem |
| rollback | `/rollback ` | Revert to previous version |
| workflow | `/workflow ` | Run a named workflow |

### Example Skill: cluster-status.md

```markdown
# Cluster Status

Invoke the k8s-orchestrator to provide a quick health overview.

## Steps

1. Delegate to k8s-diagnostician: get node status
2. Delegate to prometheus-analyst: check for active alerts
3. Delegate to argocd-operator: list out-of-sync apps
4. Summarize in a concise table

## Output Format

- Node health: table
- Active alerts: bullet list
- ArgoCD status: table
- Recommendations: if any issues found
```

---

## Interaction Methods

### Terminal/CLI

- Primary interaction via Claude Code
- Fallback when the cluster, and with it the dashboard, is unavailable
- Use skills to invoke workflows

### Dashboard (Web UI)

- Deployed on cluster (Pi 3 node)
- Views: Status, Pending Confirmations, History, Workflows
- Approve/reject risky actions

### Push Notifications (Future)

- Discord, Slack, or Telegram integration
- Alert on issues requiring attention

---

## Dashboard Specification

### Tech Stack

- **Backend**: Go binary (single static binary, embedded assets)
- **Storage**: SQLite or flat JSON files
- **Resources**: Minimal footprint for Pi 3

### Deployment

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: k8s-agent-dashboard
spec:
  replicas: 1
  # apps/v1 requires a selector that matches the pod template labels;
  # the "app" label key and value here are an assumed convention.
  selector:
    matchLabels:
      app: k8s-agent-dashboard
  template:
    metadata:
      labels:
        app: k8s-agent-dashboard
    spec:
      containers:
        - name: dashboard
          image: k8s-agent-dashboard:latest
          resources:
            requests:
              memory: "32Mi"
              cpu: "10m"
            limits:
              memory: "64Mi"
              cpu: "100m"
      # The toleration permits scheduling on the tainted Pi 3 node but does not
      # force it; the arm64 selector below matches every node in this cluster.
      tolerations:
        - key: "node-type"
          operator: "Equal"
          value: "pi3"
          effect: "NoSchedule"
      nodeSelector:
        kubernetes.io/arch: arm64
```

### Views

| View | Description |
|------|-------------|
| Status | Current cluster health, active alerts, ArgoCD sync state |
| Pending | Actions awaiting confirmation with approve/reject buttons |
| History | Recent actions taken, filterable by agent/workflow |
| Workflows | List of defined workflows, manual trigger capability |

---

## Implementation Phases

### Phase 1: Core Agent System

**Deliverables:**

- `~/.claude/` directory structure
- Orchestrator and 4 subagent prompt files
- `settings.json` with agent configurations
- 3-4 essential workflows (cluster-health, deploy, diagnose)
- Core skills (/cluster-status, /deploy, /diagnose)

**Validation:**

- Manual CLI invocation
- Test each subagent independently
- Run health check workflow end-to-end

### Phase 2: Dashboard

**Deliverables:**

- Go-based dashboard application
- Kubernetes manifests for Pi 3 deployment
- Pending confirmations queue
- Action history view
- Approval flow integration

### Phase 3: Automation

**Deliverables:**

- Scheduled workflow execution
- Alertmanager webhook integration
- Expanded incident response workflows

### Phase 4: Expansion (Future)

**Potential additions:**

- Push notifications (Discord/Telegram)
- Additional domains (development, research, productivity)
- SDK-based background daemon for true autonomy

---

## Future Domain Expansion

The system is designed to expand beyond DevOps:

| Domain | Use Cases |
|--------|-----------|
| Software Development | Code generation, refactoring, testing across repos |
| Research & Analysis | Information gathering, summarizing, recommendations |
| Personal Productivity | File management, notes, task tracking |

New domains would add:

- Additional subagents with specialized prompts
- Domain-specific workflows
- New skills for user invocation
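
---

## Appendix: Autonomy Classification Sketch

The confirmation flow in the Autonomy Model section checks each proposed action against the three rule lists in `settings.json`. The following is a minimal sketch of how that check might work, not the system's actual implementation. The function names `classify` and `_matches`, the whole-word prefix matching, and the fall-back to "confirm" for unlisted commands are all assumptions made for illustration; only the rule lists themselves come from this document.

```python
# Rule lists copied from the "autonomy" block of settings.json in this document.
AUTONOMY = {
    "safe_actions": ["get", "describe", "logs", "list", "top", "diff"],
    "confirm_actions": ["delete", "patch", "edit", "scale", "rollout", "apply"],
    "forbidden_actions": ["drain", "cordon", "delete node", "reset"],
}

# Tiers are checked in this order so the strictest matching rule wins.
TIER_ORDER = (
    ("forbidden", "forbidden_actions"),
    ("confirm", "confirm_actions"),
    ("safe", "safe_actions"),
)


def _matches(command: str, rule: str) -> bool:
    """True if the command equals the rule or begins with it as whole words."""
    return command == rule or command.startswith(rule + " ")


def classify(command: str, rules: dict = AUTONOMY) -> str:
    """Return "forbidden", "confirm", or "safe" for a kubectl-style verb phrase.

    Forbidden rules are checked first, so "delete node ..." is rejected
    before the shorter "delete" confirm rule can match it.
    Commands matching no rule fall back to "confirm" (assumed fail-closed default).
    """
    normalized = " ".join(command.lower().split())
    for tier, key in TIER_ORDER:
        if any(_matches(normalized, rule) for rule in rules[key]):
            return tier
    return "confirm"


for cmd in ("get pods -A", "delete pod web-1", "delete node worker-2"):
    print(f"{cmd!r} -> {classify(cmd)}")
# 'get pods -A' -> safe
# 'delete pod web-1' -> confirm
# 'delete node worker-2' -> forbidden
```

Checking the forbidden list first matters because "delete node" shares a prefix with the confirm-tier "delete". A real implementation would also need to apply the per-workflow `auto_approve` and `always_confirm` overrides, which this sketch leaves out.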