OpenCode Test 216a95cec4 Initial commit: Claude Code config and K8s agent orchestrator design

- Add .gitignore for logs, caches, credentials, and history
- Add K8s agent orchestrator design document
- Include existing Claude Code settings and plugin configs

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

2025-12-26 11:16:07 -08:00

K8s Agent Orchestrator System - Design Document

Overview

A user-level Claude Code agent system for autonomous Kubernetes cluster management. The system consists of an orchestrator agent that delegates to specialized subagents, with workflow definitions for common operations and a tiered autonomy model.

Location: ~/.claude/
Primary Domain: DevOps/Infrastructure
Target: Raspberry Pi k0s cluster

Cluster Environment

Hardware

Node	Hardware	RAM	Role
Node 1	Raspberry Pi 5	8GB	Control plane + Worker
Node 2	Raspberry Pi 5	8GB	Worker
Node 3	Raspberry Pi 3B+	1GB	Worker (tainted, tolerations required)

Architecture: All nodes run arm64 (64-bit OS)
Pi 3 node: Reserved for lightweight workloads only

Stack

Component	Technology
K8s Distribution	k0s
GitOps	ArgoCD
Git Hosting	Self-hosted Gitea/Forgejo
Monitoring	Prometheus + Alertmanager + Grafana

CLI Tools Available

kubectl
argocd
k0sctl

Architecture

Three-Layer Design

┌─────────────────────────────────────────────────────────────┐
│                     User Interface                          │
│              Terminal (CLI)  |  Dashboard (Web)             │
└─────────────────────┬───────────────────┬───────────────────┘
                      │                   │
┌─────────────────────▼───────────────────▼───────────────────┐
│                   Orchestrator Layer                         │
│                    k8s-orchestrator                          │
│         (Opus - complex reasoning, task delegation)          │
└─────────────────────┬───────────────────────────────────────┘
                      │ delegates to
┌─────────────────────▼───────────────────────────────────────┐
│                   Specialist Layer                           │
│  ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────┐│
│  │k8s-         │ │argocd-      │ │prometheus-  │ │git-     ││
│  │diagnostician│ │operator     │ │analyst      │ │operator ││
│  │(Sonnet)     │ │(Sonnet)     │ │(Sonnet)     │ │(Sonnet) ││
│  └─────────────┘ └─────────────┘ └─────────────┘ └─────────┘│
└─────────────────────────────────────────────────────────────┘
                      │ defined by
┌─────────────────────▼───────────────────────────────────────┐
│                   Workflow Layer                             │
│            YAML (complex)  |  Markdown (simple)              │
└─────────────────────────────────────────────────────────────┘

Directory Structure

~/.claude/
├── settings.json              # Agent definitions, autonomy rules
├── agents/
│   ├── k8s-orchestrator.md    # Orchestrator prompt
│   ├── k8s-diagnostician.md   # Cluster diagnostics specialist
│   ├── argocd-operator.md     # GitOps operations specialist
│   ├── prometheus-analyst.md  # Metrics analysis specialist
│   └── git-operator.md        # Git/Gitea operations specialist
├── workflows/
│   ├── health/
│   │   ├── cluster-health-check.yaml
│   │   └── node-pressure-response.yaml
│   ├── deploy/
│   │   ├── deploy-app.md
│   │   └── rollback-app.yaml
│   └── incidents/
│       └── pod-crashloop.yaml
├── skills/
│   ├── cluster-status.md
│   ├── deploy.md
│   ├── diagnose.md
│   ├── rollback.md
│   └── workflow.md
├── logs/
│   ├── actions/               # Action audit trail
│   └── workflows/             # Workflow execution logs
└── docs/plans/

Subagent Definitions

settings.json

{
  "agents": {
    "k8s-orchestrator": {
      "model": "opus",
      "promptFile": "agents/k8s-orchestrator.md"
    },
    "k8s-diagnostician": {
      "model": "sonnet",
      "promptFile": "agents/k8s-diagnostician.md"
    },
    "argocd-operator": {
      "model": "sonnet",
      "promptFile": "agents/argocd-operator.md"
    },
    "prometheus-analyst": {
      "model": "sonnet",
      "promptFile": "agents/prometheus-analyst.md"
    },
    "git-operator": {
      "model": "sonnet",
      "promptFile": "agents/git-operator.md"
    }
  },
  "autonomy": {
    "safe_actions": ["get", "describe", "logs", "list", "top", "diff"],
    "confirm_actions": ["delete", "patch", "edit", "scale", "rollout", "apply"],
    "forbidden_actions": ["drain", "cordon", "delete node", "reset"]
  }
}

Subagent Responsibilities

Agent	Scope	Tools
k8s-orchestrator	Task analysis, delegation, decision making	All (via delegation)
k8s-diagnostician	Cluster health, pod/node status, logs	kubectl, log tools
argocd-operator	App sync, deployments, rollbacks	argocd CLI, kubectl
prometheus-analyst	Metrics, alerts, trends	PromQL, Prometheus API
git-operator	Manifest commits, PRs, GitOps repo	git, Gitea API

Model Assignment

Defaults

Orchestrator: Opus (complex reasoning, task delegation)
Subagents: Sonnet (standard operations)

Override Levels

Per-workflow: Specify in workflow YAML
Per-step: Specify for individual workflow steps
Dynamic: Orchestrator selects based on task complexity
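
The static override levels can be read as a precedence chain. The sketch below assumes a step-level model beats a workflow-level model, which in turn beats the agent default from settings.json; the document does not state this order explicitly, and `resolve_model` and `AGENT_DEFAULTS` are hypothetical names. Dynamic selection by the orchestrator is left out.

```python
from typing import Optional

# Agent defaults as listed in settings.json in this document.
AGENT_DEFAULTS = {
    "k8s-orchestrator": "opus",
    "k8s-diagnostician": "sonnet",
    "argocd-operator": "sonnet",
    "prometheus-analyst": "sonnet",
    "git-operator": "sonnet",
}


def resolve_model(agent: str,
                  workflow_model: Optional[str] = None,
                  step_model: Optional[str] = None) -> str:
    """Pick the model for one step (assumed precedence: step > workflow > agent)."""
    if step_model:
        return step_model
    if workflow_model:
        return workflow_model
    return AGENT_DEFAULTS[agent]
```

For example, `resolve_model("k8s-diagnostician", workflow_model="sonnet", step_model="haiku")` would pick `haiku` for that step only.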

Dynamic Model Selection (Orchestrator Logic)

Task Complexity	Model	Examples
Simple	Haiku	Get status, list resources, log tail
Standard	Sonnet	Analyze logs, diagnose issues, sync apps
Complex	Opus	Root cause analysis, cascading failures, trade-off decisions

Delegation syntax:

Delegate to k8s-diagnostician (haiku):
  Task: Get current node status

Delegate to prometheus-analyst (sonnet):
  Task: Analyze memory trends for namespace "prod" over last 24h

Delegate to k8s-diagnostician (opus):
  Task: Investigate cascading failure across multiple services
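
As a rough illustration, the complexity table and the delegation syntax can be combined into one helper. How the orchestrator decides that a task is simple, standard, or complex is not specified here, so the tier is passed in as an input; `format_delegation` is a hypothetical name.

```python
# Complexity tiers and models from the table above.
MODEL_BY_COMPLEXITY = {"simple": "haiku", "standard": "sonnet", "complex": "opus"}


def format_delegation(agent: str, complexity: str, task: str) -> str:
    """Render a delegation message in the syntax shown above."""
    model = MODEL_BY_COMPLEXITY[complexity]
    return f"Delegate to {agent} ({model}):\n  Task: {task}"


print(format_delegation("k8s-diagnostician", "simple", "Get current node status"))
```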

Workflow Definitions

YAML Workflows (Complex)

name: cluster-health-check
description: Comprehensive cluster health assessment
model: sonnet  # optional default override
trigger:
  - schedule: "0 */6 * * *"  # every 6 hours
  - manual: true

steps:
  - agent: k8s-diagnostician
    model: haiku  # simple status check
    task: Check node status and resource pressure

  - agent: prometheus-analyst
    task: Query for anomalies in last 6 hours

  - agent: argocd-operator
    model: haiku
    task: Check all apps sync status

  - agent: k8s-orchestrator
    task: Summarize findings and recommend actions
    confirm_if: actions_proposed
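
The schedule trigger uses standard cron syntax: `0 */6 * * *` fires at minute 0 of hours 0, 6, 12, and 18. A minimal sketch of computing the next firing time for this one pattern follows; `next_run` is a hypothetical helper, not a general cron parser, and it returns the first firing strictly after `now`.

```python
from datetime import datetime, timedelta


def next_run(now: datetime, every_hours: int = 6) -> datetime:
    """Next firing time for the cron pattern `0 */N * * *`.

    The hour field `*/N` matches hours 0, N, 2N, ... below 24, and the
    minute field `0` matches the top of the hour.
    """
    # Start at the next full hour, then walk forward to a matching hour.
    candidate = now.replace(minute=0, second=0, microsecond=0) + timedelta(hours=1)
    while candidate.hour % every_hours != 0:
        candidate += timedelta(hours=1)
    return candidate
```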

Markdown Workflows (Simple)

# Deploy New App

When asked to deploy a new application:

1. Ask git-operator to create the manifest structure in the GitOps repo
2. Ask argocd-operator to create and sync the ArgoCD application
3. Ask k8s-diagnostician to verify pods are running
4. Report deployment status

Incident Response Workflow Example

name: pod-crashloop-remediation
trigger:
  type: alert
  match:
    alertname: KubePodCrashLooping

steps:
  - name: diagnose
    agent: k8s-diagnostician
    action: get-pod-status
    inputs:
      namespace: "{{ alert.labels.namespace }}"
      pod: "{{ alert.labels.pod }}"

  - name: check-logs
    agent: k8s-diagnostician
    action: analyze-logs
    inputs:
      pod: "{{ steps.diagnose.pod }}"
      lines: 100

  - name: decide-action
    condition: "{{ steps.check-logs.cause == 'oom' }}"
    branches:
      "true":   # quoted so YAML keeps the key a string rather than a boolean
        agent: argocd-operator
        action: update-resources
        confirm: true  # risky action
      "false":
        agent: k8s-diagnostician
        action: restart-pod
        confirm: false  # safe action

  - name: notify
    action: report
    outputs:
      - summary
      - actions-taken
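
The trigger match and the `{{ ... }}` placeholders in this workflow could be handled roughly as sketched below. The payload shape follows the Alertmanager webhook format (a top-level `alerts` list whose entries carry `labels`), but the label values are made up, and `matches`, `lookup`, and `render` are hypothetical helpers. Only plain dotted-path substitution is covered; condition expressions such as `== 'oom'` would need a real expression evaluator.

```python
import re
from typing import Any

# Example payload in the Alertmanager webhook shape; label values are invented.
payload = {
    "status": "firing",
    "alerts": [
        {
            "status": "firing",
            "labels": {
                "alertname": "KubePodCrashLooping",
                "namespace": "prod",
                "pod": "api-7d9f",
            },
        }
    ],
}

# The `match` block from the workflow trigger above.
TRIGGER_MATCH = {"alertname": "KubePodCrashLooping"}


def matches(alert: dict, match: dict) -> bool:
    """True when every key in `match` equals the alert's label value."""
    labels = alert.get("labels", {})
    return all(labels.get(k) == v for k, v in match.items())


def lookup(context: dict, dotted: str) -> Any:
    """Resolve a dotted path such as `alert.labels.pod` against nested dicts."""
    value: Any = context
    for part in dotted.split("."):
        value = value[part]
    return value


def render(template: str, context: dict) -> str:
    """Replace each `{{ path }}` placeholder with the value found in context."""
    return re.sub(
        r"\{\{\s*([\w.\-]+)\s*\}\}",
        lambda m: str(lookup(context, m.group(1))),
        template,
    )


for alert in payload["alerts"]:
    if matches(alert, TRIGGER_MATCH):
        print(render("{{ alert.labels.namespace }}/{{ alert.labels.pod }}",
                     {"alert": alert}))
```

The placeholder pattern allows hyphens so that step names like `check-logs` resolve as dictionary keys instead of being read as subtraction.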

Autonomy Model

Tiered Autonomy

Action Type	Behavior	Examples
Safe	Auto-execute, log action	get, describe, logs, list, restart pod
Confirm	Require user approval	delete, patch, scale, apply, modify config
Forbidden	Reject with explanation	drain, cordon, delete node

Confirmation Flow

1. Agent proposes action with rationale
2. System checks action against autonomy rules
3. If safe → execute immediately, log action
4. If confirm → present to user (CLI prompt or dashboard queue)
5. If forbidden → reject with explanation
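
Step 2 of this flow, checking an action against the autonomy rules, might look like the sketch below. It assumes an action is a verb followed by arguments (as in a kubectl command line) and that unknown actions fall back to confirmation; neither is stated in the design, and `classify` is a hypothetical name.

```python
# Rule lists copied from the "autonomy" section of settings.json above.
AUTONOMY = {
    "safe_actions": ["get", "describe", "logs", "list", "top", "diff"],
    "confirm_actions": ["delete", "patch", "edit", "scale", "rollout", "apply"],
    "forbidden_actions": ["drain", "cordon", "delete node", "reset"],
}


def classify(action: str, rules: dict = AUTONOMY) -> str:
    """Return "forbidden", "confirm", or "safe" for an action string.

    Forbidden rules are checked first so that "delete node" is not
    downgraded to the single-word "delete" confirm rule. Unknown actions
    fall back to "confirm" (an assumption: fail closed).
    """
    words = action.lower().split()

    def starts_with(rule: str) -> bool:
        rule_words = rule.split()
        return words[: len(rule_words)] == rule_words

    for tier, key in (("forbidden", "forbidden_actions"),
                      ("confirm", "confirm_actions"),
                      ("safe", "safe_actions")):
        if any(starts_with(rule) for rule in rules[key]):
            return tier
    return "confirm"
```

Checking the tiers from most to least restrictive matters because `delete node` and `delete` share a prefix: with the order reversed, deleting a node would only ask for confirmation.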

Per-Workflow Overrides

name: emergency-pod-restart
autonomy:
  auto_approve:
    - restart_pod
    - scale_replicas
  always_confirm:
    - delete_pvc

Action Logging

~/.claude/logs/actions/2025-12-26-actions.jsonl

Each entry includes:

Timestamp
Agent
Action
Inputs
Outcome
Approval type (auto/user-confirmed)
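
A minimal sketch of appending such an entry to the daily JSONL file follows. The JSON key names and the use of UTC timestamps are assumptions; only the list of fields and the file naming pattern come from this document.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def log_action(log_dir: Path, agent: str, action: str, inputs: dict,
               outcome: str, approval: str) -> Path:
    """Append one JSON object per line to <log_dir>/<YYYY-MM-DD>-actions.jsonl."""
    now = datetime.now(timezone.utc)
    entry = {
        "timestamp": now.isoformat(),
        "agent": agent,
        "action": action,
        "inputs": inputs,
        "outcome": outcome,
        "approval": approval,  # "auto" or "user-confirmed"
    }
    log_dir.mkdir(parents=True, exist_ok=True)
    path = log_dir / f"{now:%Y-%m-%d}-actions.jsonl"
    # Append mode keeps earlier entries; each line is a standalone JSON object.
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry) + "\n")
    return path
```

A call such as `log_action(Path.home() / ".claude/logs/actions", "k8s-diagnostician", "get", {"resource": "nodes"}, "ok", "auto")` would add one line to that day's file.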

Skills (User Entry Points)

Skill	Command	Purpose
cluster-status	`/cluster-status`	Quick health overview
deploy	`/deploy <app>`	Deploy or update an app
diagnose	`/diagnose <issue>`	Investigate a problem
rollback	`/rollback <app>`	Revert to previous version
workflow	`/workflow <name>`	Run a named workflow

Example Skill: cluster-status.md

# Cluster Status

Invoke the k8s-orchestrator to provide a quick health overview.

## Steps
1. Delegate to k8s-diagnostician: get node status
2. Delegate to prometheus-analyst: check for active alerts
3. Delegate to argocd-operator: list out-of-sync apps
4. Summarize in a concise table

## Output Format
- Node health: table
- Active alerts: bullet list
- ArgoCD status: table
- Recommendations: if any issues found

Interaction Methods

Terminal/CLI

Primary interaction via Claude Code
Fallback when cluster is unavailable
Use skills to invoke workflows

Dashboard (Web UI)

Deployed on cluster (Pi 3 node)
Views: Status, Pending Confirmations, History, Workflows
Approve/reject risky actions

Push Notifications (Future)

Discord, Slack, or Telegram integration
Alert on issues requiring attention

Dashboard Specification

Tech Stack

Backend: Go binary (single static binary, embedded assets)
Storage: SQLite or flat JSON files
Resources: Minimal footprint for Pi 3

Deployment

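apiVersion: apps/v1
kind: Deployment
metadata:
  name: k8s-agent-dashboard
spec:
  replicas: 1
  selector:  # required by apps/v1; must match the pod template labels
    matchLabels:
      app: k8s-agent-dashboard  # label name is illustrative
  template:
    metadata:
      labels:
        app: k8s-agent-dashboard
    spec:
      containers:
        - name: dashboard
          image: k8s-agent-dashboard:latest
          resources:
            requests:
              memory: "32Mi"
              cpu: "10m"
            limits:
              memory: "64Mi"
              cpu: "100m"
      tolerations:  # permits scheduling onto the tainted Pi 3 node
        - key: "node-type"
          operator: "Equal"
          value: "pi3"
          effect: "NoSchedule"
      nodeSelector:
        kubernetes.io/arch: arm64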
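
To show why the toleration in this manifest lets the pod onto the Pi 3 node, here is a simplified version of the Kubernetes rule for matching a toleration against a taint. The taint itself is hypothetical: it is inferred from the toleration, since the node's actual taint is not listed in this document.

```python
def tolerates(toleration: dict, taint: dict) -> bool:
    """Simplified Kubernetes rule for whether a toleration matches a taint."""
    # An empty effect on the toleration matches any taint effect.
    if toleration.get("effect") and toleration["effect"] != taint["effect"]:
        return False
    operator = toleration.get("operator", "Equal")
    if operator == "Exists":
        # Exists ignores the value; an empty key matches every taint key.
        return not toleration.get("key") or toleration["key"] == taint["key"]
    return (toleration.get("key") == taint["key"]
            and toleration.get("value", "") == taint.get("value", ""))


# Toleration from the manifest above.
DASHBOARD_TOLERATION = {"key": "node-type", "operator": "Equal",
                        "value": "pi3", "effect": "NoSchedule"}
# Hypothetical taint on the Pi 3 node, inferred from that toleration.
PI3_TAINT = {"key": "node-type", "value": "pi3", "effect": "NoSchedule"}
```

A toleration only permits scheduling onto a tainted node; it does not attract the pod to it. Because all three nodes are arm64, the `kubernetes.io/arch` selector alone does not pin the dashboard to the Pi 3, so a node-specific label would be needed if that placement is a hard requirement.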

Views

View	Description
Status	Current cluster health, active alerts, ArgoCD sync state
Pending	Actions awaiting confirmation with approve/reject buttons
History	Recent actions taken, filterable by agent/workflow
Workflows	List of defined workflows, manual trigger capability

Implementation Phases

Phase 1: Core Agent System

Deliverables:

~/.claude/ directory structure
Orchestrator and 4 subagent prompt files
settings.json with agent configurations
3-4 essential workflows (cluster-health, deploy, diagnose)
Core skills (/cluster-status, /deploy, /diagnose)

Validation:

Manual CLI invocation
Test each subagent independently
Run health check workflow end-to-end

Phase 2: Dashboard

Deliverables:

Go-based dashboard application
Kubernetes manifests for Pi 3 deployment
Pending confirmations queue
Action history view
Approval flow integration

Phase 3: Automation

Deliverables:

Scheduled workflow execution
Alertmanager webhook integration
Expanded incident response workflows

Phase 4: Expansion (Future)

Potential additions:

Push notifications (Discord/Telegram)
Additional domains (development, research, productivity)
SDK-based background daemon for true autonomy

Future Domain Expansion

The system is designed to expand beyond DevOps:

Domain	Use Cases
Software Development	Code generation, refactoring, testing across repos
Research & Analysis	Information gathering, summarizing, recommendations
Personal Productivity	File management, notes, task tracking

New domains would add:

Additional subagents with specialized prompts
Domain-specific workflows
New skills for user invocation
