claude-code/docs/plans/2025-12-26-k8s-agent-orchestrator-design.md

# K8s Agent Orchestrator System - Design Document

## Overview

A user-level Claude Code agent system for autonomous Kubernetes cluster management. The system consists of an orchestrator agent that delegates to specialized subagents, with workflow definitions for common operations and a tiered autonomy model.

**Location**: `~/.claude/`
**Primary Domain**: DevOps/Infrastructure
**Target**: Raspberry Pi k0s cluster

---

## Cluster Environment

### Hardware

| Node | Hardware | RAM | Role |
|------|----------|-----|------|
| Node 1 | Raspberry Pi 5 | 8GB | Control plane + Worker |
| Node 2 | Raspberry Pi 5 | 8GB | Worker |
| Node 3 | Raspberry Pi 3B+ | 1GB | Worker (tainted, tolerations required) |

- **Architecture**: All nodes run arm64 (64-bit OS)
- **Pi 3 node**: Reserved for lightweight workloads only

### Stack

| Component | Technology |
|-----------|------------|
| K8s Distribution | k0s |
| GitOps | ArgoCD |
| Git Hosting | Self-hosted Gitea/Forgejo |
| Monitoring | Prometheus + Alertmanager + Grafana |

### CLI Tools Available

- `kubectl`
- `argocd`
- `k0sctl`

---

## Architecture

### Three-Layer Design

```
┌─────────────────────────────────────────────────────────────┐
│                     User Interface                          │
│              Terminal (CLI)  |  Dashboard (Web)             │
└─────────────────────┬───────────────────┬───────────────────┘
                      │                   │
┌─────────────────────▼───────────────────▼───────────────────┐
│                   Orchestrator Layer                         │
│                    k8s-orchestrator                          │
│         (Opus - complex reasoning, task delegation)          │
└─────────────────────┬───────────────────────────────────────┘
                      │ delegates to
┌─────────────────────▼───────────────────────────────────────┐
│                   Specialist Layer                           │
│  ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────┐│
│  │k8s-         │ │argocd-      │ │prometheus-  │ │git-     ││
│  │diagnostician│ │operator     │ │analyst      │ │operator ││
│  │(Sonnet)     │ │(Sonnet)     │ │(Sonnet)     │ │(Sonnet) ││
│  └─────────────┘ └─────────────┘ └─────────────┘ └─────────┘│
└─────────────────────────────────────────────────────────────┘
                      │ defined by
┌─────────────────────▼───────────────────────────────────────┐
│                   Workflow Layer                             │
│            YAML (complex)  |  Markdown (simple)              │
└─────────────────────────────────────────────────────────────┘
```

### Directory Structure

```
~/.claude/
├── settings.json              # Agent definitions, autonomy rules
├── agents/
│   ├── k8s-orchestrator.md    # Orchestrator prompt
│   ├── k8s-diagnostician.md   # Cluster diagnostics specialist
│   ├── argocd-operator.md     # GitOps operations specialist
│   ├── prometheus-analyst.md  # Metrics analysis specialist
│   └── git-operator.md        # Git/Gitea operations specialist
├── workflows/
│   ├── health/
│   │   ├── cluster-health-check.yaml
│   │   └── node-pressure-response.yaml
│   ├── deploy/
│   │   ├── deploy-app.md
│   │   └── rollback-app.yaml
│   └── incidents/
│       └── pod-crashloop.yaml
├── skills/
│   ├── cluster-status.md
│   ├── deploy.md
│   ├── diagnose.md
│   ├── rollback.md
│   └── workflow.md
├── logs/
│   ├── actions/               # Action audit trail
│   └── workflows/             # Workflow execution logs
└── docs/plans/
```

---

## Subagent Definitions

### settings.json

```json
{
  "agents": {
    "k8s-orchestrator": {
      "model": "opus",
      "promptFile": "agents/k8s-orchestrator.md"
    },
    "k8s-diagnostician": {
      "model": "sonnet",
      "promptFile": "agents/k8s-diagnostician.md"
    },
    "argocd-operator": {
      "model": "sonnet",
      "promptFile": "agents/argocd-operator.md"
    },
    "prometheus-analyst": {
      "model": "sonnet",
      "promptFile": "agents/prometheus-analyst.md"
    },
    "git-operator": {
      "model": "sonnet",
      "promptFile": "agents/git-operator.md"
    }
  },
  "autonomy": {
    "safe_actions": ["get", "describe", "logs", "list", "top", "diff"],
    "confirm_actions": ["delete", "patch", "edit", "scale", "rollout", "apply"],
    "forbidden_actions": ["drain", "cordon", "delete node", "reset"]
  }
}
```

### Subagent Responsibilities

| Agent | Scope | Tools |
|-------|-------|-------|
| **k8s-orchestrator** | Task analysis, delegation, decision making | All (via delegation) |
| **k8s-diagnostician** | Cluster health, pod/node status, logs | kubectl, log tools |
| **argocd-operator** | App sync, deployments, rollbacks | argocd CLI, kubectl |
| **prometheus-analyst** | Metrics, alerts, trends | PromQL, Prometheus API |
| **git-operator** | Manifest commits, PRs, GitOps repo | git, Gitea API |

---

## Model Assignment

### Defaults

- **Orchestrator**: Opus (complex reasoning, task delegation)
- **Subagents**: Sonnet (standard operations)

### Override Levels

1. **Per-workflow**: Specify in workflow YAML
2. **Per-step**: Specify for individual workflow steps
3. **Dynamic**: Orchestrator selects based on task complexity

### Dynamic Model Selection (Orchestrator Logic)

| Task Complexity | Model | Examples |
|-----------------|-------|----------|
| Simple | Haiku | Get status, list resources, log tail |
| Standard | Sonnet | Analyze logs, diagnose issues, sync apps |
| Complex | Opus | Root cause analysis, cascading failures, trade-off decisions |

**Delegation syntax:**
```markdown
Delegate to k8s-diagnostician (haiku):
  Task: Get current node status

Delegate to prometheus-analyst (sonnet):
  Task: Analyze memory trends for namespace "prod" over last 24h

Delegate to k8s-diagnostician (opus):
  Task: Investigate cascading failure across multiple services
```
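
The override levels and the complexity table above can be combined into a single lookup. The following is a minimal sketch, not the orchestrator's actual logic: the function name `resolve_model` and the precedence order (per-step, then per-workflow, then dynamic, then the agent default) are assumptions, since the design does not state which level wins.

```python
from typing import Optional

# Agent defaults mirror settings.json; the complexity mapping mirrors the table above.
AGENT_DEFAULTS = {
    "k8s-orchestrator": "opus",
    "k8s-diagnostician": "sonnet",
    "argocd-operator": "sonnet",
    "prometheus-analyst": "sonnet",
    "git-operator": "sonnet",
}
COMPLEXITY_MODELS = {"simple": "haiku", "standard": "sonnet", "complex": "opus"}


def resolve_model(
    agent: str,
    step_model: Optional[str] = None,
    workflow_model: Optional[str] = None,
    complexity: Optional[str] = None,
) -> str:
    """Pick a model for one delegation.

    Assumed precedence (not specified by the design):
    per-step > per-workflow > dynamic complexity > agent default.
    """
    if step_model:
        return step_model
    if workflow_model:
        return workflow_model
    if complexity:
        return COMPLEXITY_MODELS[complexity]
    return AGENT_DEFAULTS[agent]
```

For example, `resolve_model("k8s-diagnostician", step_model="haiku", workflow_model="sonnet")` picks the per-step value under this assumed ordering.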

---

## Workflow Definitions

### YAML Workflows (Complex)

```yaml
name: cluster-health-check
description: Comprehensive cluster health assessment
model: sonnet  # optional default override
trigger:
  - schedule: "0 */6 * * *"  # every 6 hours
  - manual: true

steps:
  - agent: k8s-diagnostician
    model: haiku  # simple status check
    task: Check node status and resource pressure

  - agent: prometheus-analyst
    task: Query for anomalies in last 6 hours

  - agent: argocd-operator
    model: haiku
    task: Check all apps sync status

  - agent: k8s-orchestrator
    task: Summarize findings and recommend actions
    confirm_if: actions_proposed
```
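
Before a workflow like the one above runs, it may help to check that every step names a known agent and model. This is a hedged sketch that assumes the YAML has already been parsed into a plain dictionary by a YAML library; `validate_workflow` and its error messages are hypothetical and not part of the design.

```python
# Agent and model names are taken from settings.json and the model table.
KNOWN_AGENTS = {
    "k8s-orchestrator",
    "k8s-diagnostician",
    "argocd-operator",
    "prometheus-analyst",
    "git-operator",
}
KNOWN_MODELS = {"haiku", "sonnet", "opus"}


def validate_workflow(workflow: dict) -> list[str]:
    """Return a list of problems found in a parsed task-style workflow."""
    errors: list[str] = []
    if not workflow.get("name"):
        errors.append("workflow is missing 'name'")
    default_model = workflow.get("model")
    if default_model is not None and default_model not in KNOWN_MODELS:
        errors.append(f"unknown workflow model {default_model!r}")
    steps = workflow.get("steps") or []
    if not steps:
        errors.append("workflow has no steps")
    for index, step in enumerate(steps, start=1):
        agent = step.get("agent")
        if agent not in KNOWN_AGENTS:
            errors.append(f"step {index}: unknown agent {agent!r}")
        model = step.get("model")
        if model is not None and model not in KNOWN_MODELS:
            errors.append(f"step {index}: unknown model {model!r}")
        if not step.get("task"):
            errors.append(f"step {index}: missing 'task'")
    return errors
```

An empty list means the workflow passed these checks; anything else can be shown to the user before execution.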

### Markdown Workflows (Simple)

```markdown
# Deploy New App

When asked to deploy a new application:

1. Ask git-operator to create the manifest structure in the GitOps repo
2. Ask argocd-operator to create and sync the ArgoCD application
3. Ask k8s-diagnostician to verify pods are running
4. Report deployment status
```

### Incident Response Workflow Example

```yaml
name: pod-crashloop-remediation
trigger:
  type: alert
  match:
    alertname: KubePodCrashLooping

steps:
  - name: diagnose
    agent: k8s-diagnostician
    action: get-pod-status
    inputs:
      namespace: "{{ alert.labels.namespace }}"
      pod: "{{ alert.labels.pod }}"

  - name: check-logs
    agent: k8s-diagnostician
    action: analyze-logs
    inputs:
      pod: "{{ steps.diagnose.pod }}"
      lines: 100

  - name: decide-action
    condition: "{{ steps.check-logs.cause == 'oom' }}"
    branches:
      true:
        agent: argocd-operator
        action: update-resources
        confirm: true  # risky action
      false:
        agent: k8s-diagnostician
        action: restart-pod
        confirm: false  # auto-approved in this workflow; settings.json otherwise lists delete/rollout as confirm actions

  - name: notify
    action: report
    outputs:
      - summary
      - actions-taken
```
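
The `{{ ... }}` placeholders in this workflow need to be resolved against the alert and earlier step outputs. The design does not name a template engine, so the following is only a sketch of one possible resolver: `render`, `evaluate_condition`, and the context layout are assumptions, and only dotted paths and a single `== 'literal'` comparison are handled. Hyphens are allowed in path segments so that a step name such as `check-logs` resolves as a key.

```python
import re
from typing import Any, Mapping

# Matches "{{ some.dotted-path }}".
PLACEHOLDER = re.compile(r"\{\{\s*([\w.\-]+)\s*\}\}")
# Matches "{{ some.dotted-path == 'literal' }}".
EQUALS = re.compile(r"^\{\{\s*([\w.\-]+)\s*==\s*'([^']*)'\s*\}\}$")


def lookup(path: str, context: Mapping[str, Any]) -> Any:
    """Walk a dotted path such as 'alert.labels.pod' through nested mappings."""
    value: Any = context
    for key in path.split("."):
        value = value[key]
    return value


def render(template: str, context: Mapping[str, Any]) -> str:
    """Replace each {{ path }} placeholder with its value from the context."""
    return PLACEHOLDER.sub(lambda m: str(lookup(m.group(1), context)), template)


def evaluate_condition(condition: str, context: Mapping[str, Any]) -> bool:
    """Evaluate a condition of the form {{ path == 'literal' }}."""
    match = EQUALS.match(condition.strip())
    if match is None:
        raise ValueError(f"unsupported condition: {condition!r}")
    path, literal = match.groups()
    return str(lookup(path, context)) == literal
```

A missing key raises `KeyError`, which seems preferable to silently rendering an empty namespace or pod name into a command.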

---

## Autonomy Model

### Tiered Autonomy

| Action Type | Behavior | Examples |
|-------------|----------|----------|
| **Safe** | Auto-execute, log action | get, describe, logs, list, top, diff |
| **Confirm** | Require user approval | delete, patch, scale, apply, modify config |
| **Forbidden** | Reject with explanation | drain, cordon, delete node |

### Confirmation Flow

```
1. Agent proposes action with rationale
2. System checks action against autonomy rules
3. If safe → execute immediately, log action
4. If confirm → present to user (CLI prompt or dashboard queue)
5. If forbidden → reject with explanation
```
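
Step 2 of this flow has one subtlety: `delete node` is forbidden while plain `delete` only needs confirmation, so the forbidden list has to be checked first. The sketch below illustrates that ordering with the lists from `settings.json`. The `classify` function, the prefix-matching rule, and the choice to treat unknown verbs as `confirm` are assumptions rather than specified behavior.

```python
# Lists copied from the "autonomy" section of settings.json.
SAFE = ["get", "describe", "logs", "list", "top", "diff"]
CONFIRM = ["delete", "patch", "edit", "scale", "rollout", "apply"]
FORBIDDEN = ["drain", "cordon", "delete node", "reset"]


def _matches(command_words: list[str], rule: str) -> bool:
    """True if the command starts with every word of the rule, in order."""
    rule_words = rule.split()
    return command_words[: len(rule_words)] == rule_words


def classify(command: str) -> str:
    """Return 'safe', 'confirm', or 'forbidden' for a verb-first command string."""
    words = command.lower().split()
    # Forbidden is checked first so "delete node" is not downgraded to "delete".
    if any(_matches(words, rule) for rule in FORBIDDEN):
        return "forbidden"
    if any(_matches(words, rule) for rule in CONFIRM):
        return "confirm"
    if any(_matches(words, rule) for rule in SAFE):
        return "safe"
    # Assumption: anything unrecognized is escalated to the user.
    return "confirm"
```

With this ordering, `classify("delete node pi3")` is rejected while `classify("delete pod api-0")` is queued for approval.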

### Per-Workflow Overrides

```yaml
name: emergency-pod-restart
autonomy:
  auto_approve:
    - restart_pod
    - scale_replicas
  always_confirm:
    - delete_pvc
```
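
One way to apply such an override block on top of the global tiers is sketched below. The function name `effective_tier` is hypothetical, and two rules are assumptions the design does not state: a workflow can never unlock a forbidden action, and `always_confirm` wins over `auto_approve` if an action appears in both.

```python
def effective_tier(action: str, base_tier: str, overrides: dict) -> str:
    """Combine the global tier of an action with a workflow's autonomy block."""
    # Assumption: forbidden actions cannot be unlocked by any workflow.
    if base_tier == "forbidden":
        return "forbidden"
    # Assumption: the stricter list takes priority when both mention the action.
    if action in overrides.get("always_confirm", []):
        return "confirm"
    if action in overrides.get("auto_approve", []):
        return "safe"
    return base_tier
```

For the `emergency-pod-restart` example, `restart_pod` would run without a prompt while `delete_pvc` would still wait for approval.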

### Action Logging

```
~/.claude/logs/actions/2025-12-26-actions.jsonl
```

Each entry includes:
- Timestamp
- Agent
- Action
- Inputs
- Outcome
- Approval type (auto/user-confirmed)
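
A minimal sketch of an appender for this log follows. The JSON field names, the use of UTC for both the timestamp and the file name, and the `log_action` helper itself are assumptions; the design only lists which pieces of information each entry carries.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def log_action(
    log_dir: Path,
    agent: str,
    action: str,
    inputs: dict,
    outcome: str,
    approval: str,
) -> Path:
    """Append one audit entry to <log_dir>/<YYYY-MM-DD>-actions.jsonl."""
    now = datetime.now(timezone.utc)
    entry = {
        "timestamp": now.isoformat(),
        "agent": agent,
        "action": action,
        "inputs": inputs,
        "outcome": outcome,
        "approval": approval,  # "auto" or "user-confirmed"
    }
    log_dir.mkdir(parents=True, exist_ok=True)
    path = log_dir / f"{now:%Y-%m-%d}-actions.jsonl"
    # One JSON object per line, so the file can be tailed or grepped.
    with path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(entry) + "\n")
    return path
```

Append-only JSONL keeps each write independent, so a crash mid-run leaves earlier entries readable.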

---

## Skills (User Entry Points)

| Skill | Command | Purpose |
|-------|---------|---------|
| cluster-status | `/cluster-status` | Quick health overview |
| deploy | `/deploy <app>` | Deploy or update an app |
| diagnose | `/diagnose <issue>` | Investigate a problem |
| rollback | `/rollback <app>` | Revert to previous version |
| workflow | `/workflow <name>` | Run a named workflow |

### Example Skill: cluster-status.md

```markdown
# Cluster Status

Invoke the k8s-orchestrator to provide a quick health overview.

## Steps
1. Delegate to k8s-diagnostician: get node status
2. Delegate to prometheus-analyst: check for active alerts
3. Delegate to argocd-operator: list out-of-sync apps
4. Summarize in a concise table

## Output Format
- Node health: table
- Active alerts: bullet list
- ArgoCD status: table
- Recommendations: if any issues found
```

---

## Interaction Methods

### Terminal/CLI

- Primary interaction via Claude Code
- Fallback when the cluster, and with it the cluster-hosted dashboard, is unavailable
- Use skills to invoke workflows

### Dashboard (Web UI)

- Deployed on cluster (Pi 3 node)
- Views: Status, Pending Confirmations, History, Workflows
- Approve/reject risky actions

### Push Notifications (Future)

- Discord, Slack, or Telegram integration
- Alert on issues requiring attention

---

## Dashboard Specification

### Tech Stack

- **Backend**: Single static Go binary with embedded assets
- **Storage**: SQLite or flat JSON files
- **Resources**: Minimal footprint for Pi 3

### Deployment

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: k8s-agent-dashboard
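spec:
  replicas: 1
  # apps/v1 Deployments require a selector that matches the pod template labels.
  selector:
    matchLabels:
      app: k8s-agent-dashboard
  template:
    metadata:
      labels:
        app: k8s-agent-dashboard
    spec: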
      containers:
        - name: dashboard
          image: k8s-agent-dashboard:latest
          resources:
            requests:
              memory: "32Mi"
              cpu: "10m"
            limits:
              memory: "64Mi"
              cpu: "100m"
      tolerations:
        - key: "node-type"
          operator: "Equal"
          value: "pi3"
          effect: "NoSchedule"
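      # Note: every node is arm64, so this selector alone does not pin the pod
      # to the Pi 3; a Pi 3-specific node label would be needed for that.
      nodeSelector:
        kubernetes.io/arch: arm64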
```
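
The toleration in this manifest implies that the Pi 3 node carries a `node-type=pi3:NoSchedule` taint. As a sanity check on how the two interact, the sketch below follows the matching rules described in the Kubernetes taints and tolerations documentation. It is an illustration in Python, not scheduler code, and the `tolerates` helper is hypothetical.

```python
def tolerates(toleration: dict, taint: dict) -> bool:
    """Return True if a single toleration matches a single taint."""
    operator = toleration.get("operator", "Equal")  # Equal is the default
    key = toleration.get("key", "")
    effect = toleration.get("effect", "")
    # An empty effect on the toleration matches every effect.
    if effect and effect != taint["effect"]:
        return False
    # An empty key is only meaningful with Exists and matches every key.
    if not key:
        return operator == "Exists"
    if key != taint["key"]:
        return False
    if operator == "Exists":
        return True
    return toleration.get("value", "") == taint.get("value", "")
```

A toleration only permits scheduling onto the tainted node; it does not attract the pod there.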

### Views

| View | Description |
|------|-------------|
| Status | Current cluster health, active alerts, ArgoCD sync state |
| Pending | Actions awaiting confirmation with approve/reject buttons |
| History | Recent actions taken, filterable by agent/workflow |
| Workflows | List of defined workflows, manual trigger capability |
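
The Pending view needs some store of proposed actions and their decisions. The following in-memory sketch shows the state transitions only; the real dashboard is planned as a Go binary with SQLite or JSON storage, and the `PendingQueue` and `PendingAction` names and fields here are hypothetical.

```python
from dataclasses import dataclass
from itertools import count


@dataclass
class PendingAction:
    id: int
    agent: str
    action: str
    rationale: str
    status: str = "pending"  # pending | approved | rejected


class PendingQueue:
    """In-memory queue of actions awaiting a user decision."""

    def __init__(self) -> None:
        self._ids = count(1)
        self._items: dict[int, PendingAction] = {}

    def submit(self, agent: str, action: str, rationale: str) -> PendingAction:
        """Record a proposed action and return it with its assigned id."""
        item = PendingAction(next(self._ids), agent, action, rationale)
        self._items[item.id] = item
        return item

    def decide(self, action_id: int, approve: bool) -> PendingAction:
        """Approve or reject a pending action; a decision is final."""
        item = self._items[action_id]
        if item.status != "pending":
            raise ValueError(f"action {action_id} already {item.status}")
        item.status = "approved" if approve else "rejected"
        return item

    def pending(self) -> list[PendingAction]:
        """Return the actions still waiting for a decision."""
        return [i for i in self._items.values() if i.status == "pending"]
```

Making decisions final avoids an approval and a rejection racing between the CLI prompt and the dashboard.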

---

## Implementation Phases

### Phase 1: Core Agent System

**Deliverables:**
- `~/.claude/` directory structure
- Orchestrator and 4 subagent prompt files
- `settings.json` with agent configurations
- 3-4 essential workflows (cluster-health, deploy, diagnose)
- Core skills (/cluster-status, /deploy, /diagnose)

**Validation:**
- Manual CLI invocation
- Test each subagent independently
- Run health check workflow end-to-end

### Phase 2: Dashboard

**Deliverables:**
- Go-based dashboard application
- Kubernetes manifests for Pi 3 deployment
- Pending confirmations queue
- Action history view
- Approval flow integration

### Phase 3: Automation

**Deliverables:**
- Scheduled workflow execution
- Alertmanager webhook integration
- Expanded incident response workflows
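
For the Alertmanager integration, an incoming webhook payload has to be matched against workflows whose trigger has `type: alert`. The sketch below assumes the standard Alertmanager webhook body (a top-level `alerts` list whose items carry `status` and `labels`) and workflows already parsed into dictionaries. The `matching_workflows` function is hypothetical, and list-style triggers such as the scheduled health check are simply skipped.

```python
def matching_workflows(payload: dict, workflows: list[dict]) -> list[tuple[str, dict]]:
    """Return (workflow name, alert labels) pairs for every firing alert that matches."""
    matches: list[tuple[str, dict]] = []
    for alert in payload.get("alerts", []):
        # Resolved alerts should not start a remediation workflow.
        if alert.get("status") != "firing":
            continue
        labels = alert.get("labels", {})
        for workflow in workflows:
            trigger = workflow.get("trigger")
            if not isinstance(trigger, dict) or trigger.get("type") != "alert":
                continue
            wanted = trigger.get("match", {})
            # Every label in the trigger's match block must equal the alert's label.
            if all(labels.get(key) == value for key, value in wanted.items()):
                matches.append((workflow["name"], labels))
    return matches
```

The returned labels can then serve as the `alert.labels` context for the workflow's input templates.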

### Phase 4: Expansion (Future)

**Potential additions:**
- Push notifications (Discord/Slack/Telegram)
- Additional domains (development, research, productivity)
- SDK-based background daemon for true autonomy

---

## Future Domain Expansion

The system is designed to expand beyond DevOps:

| Domain | Use Cases |
|--------|-----------|
| Software Development | Code generation, refactoring, testing across repos |
| Research & Analysis | Information gathering, summarizing, recommendations |
| Personal Productivity | File management, notes, task tracking |

New domains would add:
- Additional subagents with specialized prompts
- Domain-specific workflows
- New skills for user invocation