claude-code/docs/plans/2025-12-26-k8s-agent-orchestrator-design.md
OpenCode Test 216a95cec4 Initial commit: Claude Code config and K8s agent orchestrator design
- Add .gitignore for logs, caches, credentials, and history
- Add K8s agent orchestrator design document
- Include existing Claude Code settings and plugin configs

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-26 11:16:07 -08:00


K8s Agent Orchestrator System - Design Document

Overview

A user-level Claude Code agent system for autonomous Kubernetes cluster management. The system consists of an orchestrator agent that delegates to specialized subagents, with workflow definitions for common operations and a tiered autonomy model.

Location: ~/.claude/
Primary Domain: DevOps/Infrastructure
Target: Raspberry Pi k0s cluster


Cluster Environment

Hardware

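| Node   | Hardware         | RAM | Role                                   |
|--------|------------------|-----|----------------------------------------|
| Node 1 | Raspberry Pi 5   | 8GB | Control plane + Worker                 |
| Node 2 | Raspberry Pi 5   | 8GB | Worker                                 |
| Node 3 | Raspberry Pi 3B+ | 1GB | Worker (tainted, tolerations required) |
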
  • Architecture: All nodes run arm64 (64-bit OS)
  • Pi 3 node: Reserved for lightweight workloads only

Stack

| Component        | Technology                          |
|------------------|-------------------------------------|
| K8s Distribution | k0s                                 |
| GitOps           | ArgoCD                              |
| Git Hosting      | Self-hosted Gitea/Forgejo           |
| Monitoring       | Prometheus + Alertmanager + Grafana |

CLI Tools Available

  • kubectl
  • argocd
  • k0sctl

Architecture

Three-Layer Design

┌─────────────────────────────────────────────────────────────┐
│                     User Interface                          │
│              Terminal (CLI)  |  Dashboard (Web)             │
└─────────────────────┬───────────────────┬───────────────────┘
                      │                   │
┌─────────────────────▼───────────────────▼───────────────────┐
│                   Orchestrator Layer                         │
│                    k8s-orchestrator                          │
│         (Opus - complex reasoning, task delegation)          │
└─────────────────────┬───────────────────────────────────────┘
                      │ delegates to
┌─────────────────────▼───────────────────────────────────────┐
│                   Specialist Layer                           │
│  ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────┐│
│  │k8s-         │ │argocd-      │ │prometheus-  │ │git-     ││
│  │diagnostician│ │operator     │ │analyst      │ │operator ││
│  │(Sonnet)     │ │(Sonnet)     │ │(Sonnet)     │ │(Sonnet) ││
│  └─────────────┘ └─────────────┘ └─────────────┘ └─────────┘│
└─────────────────────────────────────────────────────────────┘
                      │ defined by
┌─────────────────────▼───────────────────────────────────────┐
│                   Workflow Layer                             │
│            YAML (complex)  |  Markdown (simple)              │
└─────────────────────────────────────────────────────────────┘

Directory Structure

~/.claude/
├── settings.json              # Agent definitions, autonomy rules
├── agents/
│   ├── k8s-orchestrator.md    # Orchestrator prompt
│   ├── k8s-diagnostician.md   # Cluster diagnostics specialist
│   ├── argocd-operator.md     # GitOps operations specialist
│   ├── prometheus-analyst.md  # Metrics analysis specialist
│   └── git-operator.md        # Git/Gitea operations specialist
├── workflows/
│   ├── health/
│   │   ├── cluster-health-check.yaml
│   │   └── node-pressure-response.yaml
│   ├── deploy/
│   │   ├── deploy-app.md
│   │   └── rollback-app.yaml
│   └── incidents/
│       └── pod-crashloop.yaml
├── skills/
│   ├── cluster-status.md
│   ├── deploy.md
│   ├── diagnose.md
│   ├── rollback.md
│   └── workflow.md
├── logs/
│   ├── actions/               # Action audit trail
│   └── workflows/             # Workflow execution logs
└── docs/plans/

Subagent Definitions

settings.json

{
  "agents": {
    "k8s-orchestrator": {
      "model": "opus",
      "promptFile": "agents/k8s-orchestrator.md"
    },
    "k8s-diagnostician": {
      "model": "sonnet",
      "promptFile": "agents/k8s-diagnostician.md"
    },
    "argocd-operator": {
      "model": "sonnet",
      "promptFile": "agents/argocd-operator.md"
    },
    "prometheus-analyst": {
      "model": "sonnet",
      "promptFile": "agents/prometheus-analyst.md"
    },
    "git-operator": {
      "model": "sonnet",
      "promptFile": "agents/git-operator.md"
    }
  },
  "autonomy": {
    "safe_actions": ["get", "describe", "logs", "list", "top", "diff"],
    "confirm_actions": ["delete", "patch", "edit", "scale", "rollout", "apply"],
    "forbidden_actions": ["drain", "cordon", "delete node", "reset"]
  }
}

Subagent Responsibilities

| Agent              | Scope                                    | Tools                   |
|--------------------|------------------------------------------|-------------------------|
| k8s-orchestrator   | Task analysis, delegation, decision making | All (via delegation)  |
| k8s-diagnostician  | Cluster health, pod/node status, logs    | kubectl, log tools      |
| argocd-operator    | App sync, deployments, rollbacks         | argocd CLI, kubectl     |
| prometheus-analyst | Metrics, alerts, trends                  | PromQL, Prometheus API  |
| git-operator       | Manifest commits, PRs, GitOps repo       | git, Gitea API          |

Model Assignment

Defaults

  • Orchestrator: Opus (complex reasoning, task delegation)
  • Subagents: Sonnet (standard operations)

Override Levels

  1. Per-workflow: Specify in workflow YAML
  2. Per-step: Specify for individual workflow steps
  3. Dynamic: Orchestrator selects based on task complexity

Dynamic Model Selection (Orchestrator Logic)

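| Task Complexity | Model  | Examples                                                     |
|-----------------|--------|--------------------------------------------------------------|
| Simple          | Haiku  | Get status, list resources, log tail                         |
| Standard        | Sonnet | Analyze logs, diagnose issues, sync apps                     |
| Complex         | Opus   | Root cause analysis, cascading failures, trade-off decisions |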

Delegation syntax:

Delegate to k8s-diagnostician (haiku):
  Task: Get current node status

Delegate to prometheus-analyst (sonnet):
  Task: Analyze memory trends for namespace "prod" over last 24h

Delegate to k8s-diagnostician (opus):
  Task: Investigate cascading failure across multiple services
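
The mapping from task complexity to model, and the delegation syntax above, can be sketched in a few lines of Python. This is only an illustration: the `COMPLEXITY_TO_MODEL` table and `format_delegation` helper are hypothetical names, and the complexity label is assumed to be chosen by the orchestrator itself before this code runs.

```python
# Hypothetical sketch: these names are illustrative and not part of the design.
# The complexity label ("simple", "standard", "complex") is assumed to come
# from the orchestrator's own judgment of the task.
COMPLEXITY_TO_MODEL = {
    "simple": "haiku",     # get status, list resources, log tail
    "standard": "sonnet",  # analyze logs, diagnose issues, sync apps
    "complex": "opus",     # root cause analysis, cascading failures
}


def format_delegation(agent: str, task: str, complexity: str) -> str:
    """Render a delegation message in the syntax shown above."""
    model = COMPLEXITY_TO_MODEL[complexity]
    return f"Delegate to {agent} ({model}):\n  Task: {task}"


print(format_delegation("k8s-diagnostician", "Get current node status", "simple"))
```

Keeping the table as plain data means the tiers can be adjusted without touching the delegation logic.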

Workflow Definitions

YAML Workflows (Complex)

name: cluster-health-check
description: Comprehensive cluster health assessment
model: sonnet  # optional default override
trigger:
  - schedule: "0 */6 * * *"  # every 6 hours
  - manual: true

steps:
  - agent: k8s-diagnostician
    model: haiku  # simple status check
    task: Check node status and resource pressure

  - agent: prometheus-analyst
    task: Query for anomalies in last 6 hours

  - agent: argocd-operator
    model: haiku
    task: Check all apps sync status

  - agent: k8s-orchestrator
    task: Summarize findings and recommend actions
    confirm_if: actions_proposed
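
The workflow above sets a model at two levels, so a resolution order is needed. The sketch below assumes the precedence is step model, then workflow model, then the agent default from settings.json; the design does not state this order explicitly, and `resolve_model` is a hypothetical helper. The workflow is written as an already-parsed dictionary so the example needs no YAML library.

```python
# Agent defaults as listed in settings.json.
AGENT_DEFAULTS = {
    "k8s-orchestrator": "opus",
    "k8s-diagnostician": "sonnet",
    "argocd-operator": "sonnet",
    "prometheus-analyst": "sonnet",
    "git-operator": "sonnet",
}


def resolve_model(step: dict, workflow: dict) -> str:
    """Assumed precedence: step model > workflow model > agent default."""
    return (
        step.get("model")
        or workflow.get("model")
        or AGENT_DEFAULTS[step["agent"]]
    )


# The cluster-health-check workflow, as it might look after YAML parsing.
workflow = {
    "name": "cluster-health-check",
    "model": "sonnet",
    "steps": [
        {"agent": "k8s-diagnostician", "model": "haiku",
         "task": "Check node status and resource pressure"},
        {"agent": "prometheus-analyst",
         "task": "Query for anomalies in last 6 hours"},
        {"agent": "argocd-operator", "model": "haiku",
         "task": "Check all apps sync status"},
        {"agent": "k8s-orchestrator",
         "task": "Summarize findings and recommend actions"},
    ],
}

plan = [(step["agent"], resolve_model(step, workflow)) for step in workflow["steps"]]
for agent, model in plan:
    print(f"{agent}: {model}")
```

Under this assumed order, the workflow-level `model: sonnet` also applies to the final k8s-orchestrator step, overriding its Opus default. If the summary step should stay on Opus, it would need an explicit step-level `model: opus`, or the precedence would have to treat agent defaults differently.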

Markdown Workflows (Simple)

# Deploy New App

When asked to deploy a new application:

1. Ask git-operator to create the manifest structure in the GitOps repo
2. Ask argocd-operator to create and sync the ArgoCD application
3. Ask k8s-diagnostician to verify pods are running
4. Report deployment status

Incident Response Workflow Example

name: pod-crashloop-remediation
trigger:
  type: alert
  match:
    alertname: KubePodCrashLooping

steps:
  - name: diagnose
    agent: k8s-diagnostician
    action: get-pod-status
    inputs:
      namespace: "{{ alert.labels.namespace }}"
      pod: "{{ alert.labels.pod }}"

  - name: check-logs
    agent: k8s-diagnostician
    action: analyze-logs
    inputs:
      pod: "{{ steps.diagnose.pod }}"
      lines: 100

  - name: decide-action
    condition: "{{ steps.check-logs.cause == 'oom' }}"
    branches:
      true:
        agent: argocd-operator
        action: update-resources
        confirm: true  # risky action
      false:
        agent: k8s-diagnostician
        action: restart-pod
        confirm: false  # safe action

  - name: notify
    action: report
    outputs:
      - summary
      - actions-taken
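
Two mechanics in this workflow are worth illustrating: matching an incoming alert against `trigger.match`, and filling `{{ ... }}` placeholders from the alert. The sketch below is one possible reading, not a defined template engine. `trigger_matches`, `lookup`, and `render` are hypothetical helpers, the pod name is made up, and only plain dotted paths are handled (expressions such as the `==` comparison in `decide-action` are left untouched).

```python
import re

# Matches placeholders that contain a single dotted path, e.g. {{ alert.labels.pod }}.
_PLACEHOLDER = re.compile(r"\{\{\s*([\w.-]+)\s*\}\}")


def trigger_matches(trigger: dict, alert: dict) -> bool:
    """True when every label under trigger['match'] equals the alert's label."""
    labels = alert.get("labels", {})
    return all(labels.get(key) == value
               for key, value in trigger.get("match", {}).items())


def lookup(path: str, context: dict):
    """Walk a dotted path such as 'alert.labels.pod' through nested dicts."""
    value = context
    for key in path.split("."):
        value = value[key]
    return value


def render(template: str, context: dict) -> str:
    """Replace each {{ dotted.path }} placeholder with its value from context."""
    return _PLACEHOLDER.sub(lambda m: str(lookup(m.group(1), context)), template)


trigger = {"type": "alert", "match": {"alertname": "KubePodCrashLooping"}}
# Illustrative alert; the pod name is invented for the example.
alert = {"labels": {"alertname": "KubePodCrashLooping",
                    "namespace": "prod", "pod": "api-0"}}

if trigger_matches(trigger, alert):
    inputs = {
        "namespace": render("{{ alert.labels.namespace }}", {"alert": alert}),
        "pod": render("{{ alert.labels.pod }}", {"alert": alert}),
    }
    print(inputs)
```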

Autonomy Model

Tiered Autonomy

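| Action Type | Behavior                 | Examples                                   |
|-------------|--------------------------|--------------------------------------------|
| Safe        | Auto-execute, log action | get, describe, logs, list, restart pod     |
| Confirm     | Require user approval    | delete, patch, scale, apply, modify config |
| Forbidden   | Reject with explanation  | drain, cordon, delete node                 |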

Confirmation Flow

1. Agent proposes action with rationale
2. System checks action against autonomy rules
3. If safe → execute immediately, log action
4. If confirm → present to user (CLI prompt or dashboard queue)
5. If forbidden → reject with explanation
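
Step 2 of this flow, checking an action against the autonomy rules, might look like the sketch below. It uses the lists from settings.json and naive prefix matching on a kubectl-style command. Two choices here are assumptions rather than part of the design: forbidden rules are checked first so that "delete node" wins over "delete", and any unrecognized action falls back to requiring confirmation.

```python
AUTONOMY = {
    "safe_actions": ["get", "describe", "logs", "list", "top", "diff"],
    "confirm_actions": ["delete", "patch", "edit", "scale", "rollout", "apply"],
    "forbidden_actions": ["drain", "cordon", "delete node", "reset"],
}

# Most restrictive tier first, so multi-word forbidden entries are seen
# before their single-word confirm counterparts.
_TIERS = (
    ("forbidden", "forbidden_actions"),
    ("confirm", "confirm_actions"),
    ("safe", "safe_actions"),
)


def classify(command: str, rules: dict = AUTONOMY) -> str:
    """Return 'forbidden', 'confirm', or 'safe' for a kubectl-style command."""
    words = command.split()
    if words and words[0] == "kubectl":
        words = words[1:]
    normalized = " ".join(words)

    def starts_with(entry: str) -> bool:
        return normalized == entry or normalized.startswith(entry + " ")

    for tier, key in _TIERS:
        if any(starts_with(entry) for entry in rules[key]):
            return tier
    # Assumption: anything unrecognized fails closed and asks the user.
    return "confirm"


for cmd in ("kubectl get pods -n prod",
            "kubectl delete pod api-0",
            "kubectl delete node node-3"):
    print(f"{cmd!r}: {classify(cmd)}")
```

Prefix matching is deliberately simple. Commands with flags before the verb, or plural resource names such as `delete nodes`, would need real argument parsing before a check like this could be relied on.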

Per-Workflow Overrides

name: emergency-pod-restart
autonomy:
  auto_approve:
    - restart_pod
    - scale_replicas
  always_confirm:
    - delete_pvc
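
How such an override combines with the global tiers is not spelled out, so the following is one conservative interpretation. `effective_tier` is a hypothetical helper. It assumes that `always_confirm` takes priority over `auto_approve`, and that a workflow can never unlock an action the global rules forbid.

```python
def effective_tier(action: str, base_tier: str, workflow: dict) -> str:
    """Apply a workflow's autonomy overrides to a globally assigned tier."""
    autonomy = workflow.get("autonomy", {})
    if base_tier == "forbidden":
        # Assumption: per-workflow overrides cannot unlock forbidden actions.
        return "forbidden"
    if action in autonomy.get("always_confirm", []):
        return "confirm"
    if action in autonomy.get("auto_approve", []):
        return "safe"
    return base_tier


workflow = {
    "name": "emergency-pod-restart",
    "autonomy": {
        "auto_approve": ["restart_pod", "scale_replicas"],
        "always_confirm": ["delete_pvc"],
    },
}

print(effective_tier("scale_replicas", "confirm", workflow))  # auto-approved here
print(effective_tier("delete_pvc", "confirm", workflow))      # still confirmed
```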

Action Logging

~/.claude/logs/actions/2025-12-26-actions.jsonl

Each entry includes:

  • Timestamp
  • Agent
  • Action
  • Inputs
  • Outcome
  • Approval type (auto/user-confirmed)
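
A minimal sketch of writing such entries as JSON Lines follows. The field names and the `log_action` helper are assumptions based on the list above, and the log file is named by UTC date, which the design does not specify. The demo writes to a temporary directory rather than `~/.claude/logs/actions/`.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def log_action(log_dir, agent, action, inputs, outcome, approval, now=None):
    """Append one audit entry to <log_dir>/<YYYY-MM-DD>-actions.jsonl."""
    now = now or datetime.now(timezone.utc)
    entry = {
        "timestamp": now.isoformat(),
        "agent": agent,
        "action": action,
        "inputs": inputs,
        "outcome": outcome,
        "approval": approval,  # "auto" or "user-confirmed"
    }
    path = Path(log_dir) / f"{now:%Y-%m-%d}-actions.jsonl"
    path.parent.mkdir(parents=True, exist_ok=True)
    # Append mode keeps earlier entries; one JSON object per line.
    with path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(entry) + "\n")
    return path


with tempfile.TemporaryDirectory() as tmp:
    log_path = log_action(
        tmp,
        agent="k8s-diagnostician",
        action="get",
        inputs={"resource": "nodes"},
        outcome="success",
        approval="auto",
    )
    print(log_path.name)
    print(log_path.read_text(encoding="utf-8").strip())
```

Appending one object per line keeps the log easy to tail and filter, which suits the dashboard's History view.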

Skills (User Entry Points)

| Skill          | Command           | Purpose                    |
|----------------|-------------------|----------------------------|
| cluster-status | /cluster-status   | Quick health overview      |
| deploy         | /deploy <app>     | Deploy or update an app    |
| diagnose       | /diagnose <issue> | Investigate a problem      |
| rollback       | /rollback <app>   | Revert to previous version |
| workflow       | /workflow <name>  | Run a named workflow       |

Example Skill: cluster-status.md

# Cluster Status

Invoke the k8s-orchestrator to provide a quick health overview.

## Steps
1. Delegate to k8s-diagnostician: get node status
2. Delegate to prometheus-analyst: check for active alerts
3. Delegate to argocd-operator: list out-of-sync apps
4. Summarize in a concise table

## Output Format
- Node health: table
- Active alerts: bullet list
- ArgoCD status: table
- Recommendations: if any issues found

Interaction Methods

Terminal/CLI

  • Primary interaction via Claude Code
  • Remains usable when the cluster (and the dashboard hosted on it) is unavailable
  • Use skills to invoke workflows

Dashboard (Web UI)

  • Deployed on cluster (Pi 3 node)
  • Views: Status, Pending Confirmations, History, Workflows
  • Approve/reject risky actions

Push Notifications (Future)

  • Discord, Slack, or Telegram integration
  • Alert on issues requiring attention

Dashboard Specification

Tech Stack

  • Backend: Go binary (single static binary, embedded assets)
  • Storage: SQLite or flat JSON files
  • Resources: Minimal footprint for Pi 3

Deployment

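apiVersion: apps/v1
kind: Deployment
metadata:
  name: k8s-agent-dashboard
spec:
  replicas: 1
  # apps/v1 requires a selector that matches the pod template labels.
  # The "app" label name is illustrative.
  selector:
    matchLabels:
      app: k8s-agent-dashboard
  template:
    metadata:
      labels:
        app: k8s-agent-dashboard
    spec:
      containers:
        - name: dashboard
          image: k8s-agent-dashboard:latest
          resources:
            requests:
              memory: "32Mi"
              cpu: "10m"
            limits:
              memory: "64Mi"
              cpu: "100m"
      # Allows scheduling onto the tainted Pi 3 node; it does not force it.
      tolerations:
        - key: "node-type"
          operator: "Equal"
          value: "pi3"
          effect: "NoSchedule"
      # All three nodes are arm64, so this selector alone does not pin the pod
      # to the Pi 3. A node-specific label would be needed for that.
      nodeSelector:
        kubernetes.io/arch: arm64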

Views

| View      | Description                                               |
|-----------|-----------------------------------------------------------|
| Status    | Current cluster health, active alerts, ArgoCD sync state  |
| Pending   | Actions awaiting confirmation with approve/reject buttons |
| History   | Recent actions taken, filterable by agent/workflow        |
| Workflows | List of defined workflows, manual trigger capability      |

Implementation Phases

Phase 1: Core Agent System

Deliverables:

  • ~/.claude/ directory structure
  • Orchestrator and 4 subagent prompt files
  • settings.json with agent configurations
  • 3-4 essential workflows (cluster-health, deploy, diagnose)
  • Core skills (/cluster-status, /deploy, /diagnose)

Validation:

  • Manual CLI invocation
  • Test each subagent independently
  • Run health check workflow end-to-end

Phase 2: Dashboard

Deliverables:

  • Go-based dashboard application
  • Kubernetes manifests for Pi 3 deployment
  • Pending confirmations queue
  • Action history view
  • Approval flow integration

Phase 3: Automation

Deliverables:

  • Scheduled workflow execution
  • Alertmanager webhook integration
  • Expanded incident response workflows

Phase 4: Expansion (Future)

Potential additions:

  • Push notifications (Discord, Slack, or Telegram)
  • Additional domains (development, research, productivity)
  • SDK-based background daemon for true autonomy

Future Domain Expansion

The system is designed to expand beyond DevOps:

| Domain                | Use Cases                                          |
|-----------------------|----------------------------------------------------|
| Software Development  | Code generation, refactoring, testing across repos |
| Research & Analysis   | Information gathering, summarizing, recommendations |
| Personal Productivity | File management, notes, task tracking              |

New domains would add:

  • Additional subagents with specialized prompts
  • Domain-specific workflows
  • New skills for user invocation