Initial commit: Claude Code config and K8s agent orchestrator design

- Add .gitignore for logs, caches, credentials, and history
- Add K8s agent orchestrator design document
- Include existing Claude Code settings and plugin configs

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-26 11:16:07 -08:00
commit 216a95cec4
9 changed files with 1116 additions and 0 deletions
--- a/docs/plans/2025-12-26-k8s-agent-orchestrator-design.md
+++ b/docs/plans/2025-12-26-k8s-agent-orchestrator-design.md
@@ -0,0 +1,482 @@
+# K8s Agent Orchestrator System - Design Document
+
+## Overview
+
+A user-level Claude Code agent system for autonomous Kubernetes cluster management. The system consists of an orchestrator agent that delegates to specialized subagents, with workflow definitions for common operations and a tiered autonomy model.
+
+**Location**: `~/.claude/`
+**Primary Domain**: DevOps/Infrastructure
+**Target**: Raspberry Pi k0s cluster
+
+---
+
+## Cluster Environment
+
+### Hardware
+
+| Node | Hardware | RAM | Role |
+|------|----------|-----|------|
+| Node 1 | Raspberry Pi 5 | 8GB | Control plane + Worker |
+| Node 2 | Raspberry Pi 5 | 8GB | Worker |
+| Node 3 | Raspberry Pi 3B+ | 1GB | Worker (tainted, tolerations required) |
+
+- **Architecture**: All nodes run arm64 (64-bit OS)
+- **Pi 3 node**: Reserved for lightweight workloads only
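+
+The toleration in the dashboard Deployment later in this document (key `node-type`, value `pi3`, effect `NoSchedule`) implies a matching taint on Node 3. A minimal sketch of the relevant Node fields, assuming a node named `node-3` (hypothetical) and an optional `node-type` label that this document does not define:
+
+```yaml
+# Fragment of the Node object for the Pi 3 (illustrative, not a full manifest)
+apiVersion: v1
+kind: Node
+metadata:
+  name: node-3              # hypothetical node name
+  labels:
+    node-type: pi3          # assumed label; would let a nodeSelector target this node
+spec:
+  taints:
+    - key: node-type        # mirrors the toleration in the dashboard Deployment
+      value: pi3
+      effect: NoSchedule
+```
+
+In practice a taint like this is usually applied with `kubectl taint nodes node-3 node-type=pi3:NoSchedule` rather than by editing the Node object.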
+
+### Stack
+
+| Component | Technology |
+|-----------|------------|
+| K8s Distribution | k0s |
+| GitOps | ArgoCD |
+| Git Hosting | Self-hosted Gitea/Forgejo |
+| Monitoring | Prometheus + Alertmanager + Grafana |
+
+### CLI Tools Available
+
+- `kubectl`
+- `argocd`
+- `k0sctl`
+
+---
+
+## Architecture
+
+### Three-Layer Design
+
+```
+┌─────────────────────────────────────────────────────────────┐
+│                     User Interface                          │
+│              Terminal (CLI)  |  Dashboard (Web)             │
+└─────────────────────┬───────────────────┬───────────────────┘
+                      │                   │
+┌─────────────────────▼───────────────────▼───────────────────┐
+│                   Orchestrator Layer                         │
+│                    k8s-orchestrator                          │
+│         (Opus - complex reasoning, task delegation)          │
+└─────────────────────┬───────────────────────────────────────┘
+                      │ delegates to
+┌─────────────────────▼───────────────────────────────────────┐
+│                   Specialist Layer                           │
+│  ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────┐│
+│  │k8s-         │ │argocd-      │ │prometheus-  │ │git-     ││
+│  │diagnostician│ │operator     │ │analyst      │ │operator ││
+│  │(Sonnet)     │ │(Sonnet)     │ │(Sonnet)     │ │(Sonnet) ││
+│  └─────────────┘ └─────────────┘ └─────────────┘ └─────────┘│
+└─────────────────────────────────────────────────────────────┘
+                      │ defined by
+┌─────────────────────▼───────────────────────────────────────┐
+│                   Workflow Layer                             │
+│            YAML (complex)  |  Markdown (simple)              │
+└─────────────────────────────────────────────────────────────┘
+```
+
+### Directory Structure
+
+```
+~/.claude/
+├── settings.json              # Agent definitions, autonomy rules
+├── agents/
+│   ├── k8s-orchestrator.md    # Orchestrator prompt
+│   ├── k8s-diagnostician.md   # Cluster diagnostics specialist
+│   ├── argocd-operator.md     # GitOps operations specialist
+│   ├── prometheus-analyst.md  # Metrics analysis specialist
+│   └── git-operator.md        # Git/Gitea operations specialist
+├── workflows/
+│   ├── health/
+│   │   ├── cluster-health-check.yaml
+│   │   └── node-pressure-response.yaml
+│   ├── deploy/
+│   │   ├── deploy-app.md
+│   │   └── rollback-app.yaml
+│   └── incidents/
+│       └── pod-crashloop.yaml
+├── skills/
+│   ├── cluster-status.md
+│   ├── deploy.md
+│   ├── diagnose.md
+│   ├── rollback.md
+│   └── workflow.md
+├── logs/
+│   ├── actions/               # Action audit trail
+│   └── workflows/             # Workflow execution logs
+└── docs/plans/
+```
+
+---
+
+## Subagent Definitions
+
+### settings.json
+
+```json
+{
+  "agents": {
+    "k8s-orchestrator": {
+      "model": "opus",
+      "promptFile": "agents/k8s-orchestrator.md"
+    },
+    "k8s-diagnostician": {
+      "model": "sonnet",
+      "promptFile": "agents/k8s-diagnostician.md"
+    },
+    "argocd-operator": {
+      "model": "sonnet",
+      "promptFile": "agents/argocd-operator.md"
+    },
+    "prometheus-analyst": {
+      "model": "sonnet",
+      "promptFile": "agents/prometheus-analyst.md"
+    },
+    "git-operator": {
+      "model": "sonnet",
+      "promptFile": "agents/git-operator.md"
+    }
+  },
+  "autonomy": {
+    "safe_actions": ["get", "describe", "logs", "list", "top", "diff"],
+    "confirm_actions": ["delete", "patch", "edit", "scale", "rollout", "apply"],
+    "forbidden_actions": ["drain", "cordon", "delete node", "reset"]
+  }
+}
+```
+
+### Subagent Responsibilities
+
+| Agent | Scope | Tools |
+|-------|-------|-------|
+| **k8s-orchestrator** | Task analysis, delegation, decision making | All (via delegation) |
+| **k8s-diagnostician** | Cluster health, pod/node status, logs | kubectl, log tools |
+| **argocd-operator** | App sync, deployments, rollbacks | argocd CLI, kubectl |
+| **prometheus-analyst** | Metrics, alerts, trends | PromQL, Prometheus API |
+| **git-operator** | Manifest commits, PRs, GitOps repo | git, Gitea API |
+
+---
+
+## Model Assignment
+
+### Defaults
+
+- **Orchestrator**: Opus (complex reasoning, task delegation)
+- **Subagents**: Sonnet (standard operations)
+
+### Override Levels
+
+1. **Per-workflow**: Specify in workflow YAML
+2. **Per-step**: Specify for individual workflow steps
+3. **Dynamic**: Orchestrator selects based on task complexity
+
+### Dynamic Model Selection (Orchestrator Logic)
+
+| Task Complexity | Model | Examples |
+|-----------------|-------|----------|
+| Simple | Haiku | Get status, list resources, log tail |
+| Standard | Sonnet | Analyze logs, diagnose issues, sync apps |
+| Complex | Opus | Root cause analysis, cascading failures, trade-off decisions |
+
+**Delegation syntax:**
+```markdown
+Delegate to k8s-diagnostician (haiku):
+  Task: Get current node status
+
+Delegate to prometheus-analyst (sonnet):
+  Task: Analyze memory trends for namespace "prod" over last 24h
+
+Delegate to k8s-diagnostician (opus):
+  Task: Investigate cascading failure across multiple services
+```
+
+---
+
+## Workflow Definitions
+
+### YAML Workflows (Complex)
+
+```yaml
+name: cluster-health-check
+description: Comprehensive cluster health assessment
+model: sonnet  # optional default override
+trigger:
+  - schedule: "0 */6 * * *"  # every 6 hours
+  - manual: true
+
+steps:
+  - agent: k8s-diagnostician
+    model: haiku  # simple status check
+    task: Check node status and resource pressure
+
+  - agent: prometheus-analyst
+    task: Query for anomalies in last 6 hours
+
+  - agent: argocd-operator
+    model: haiku
+    task: Check all apps sync status
+
+  - agent: k8s-orchestrator
+    task: Summarize findings and recommend actions
+    confirm_if: actions_proposed
+```
+
+### Markdown Workflows (Simple)
+
+```markdown
+# Deploy New App
+
+When asked to deploy a new application:
+
+1. Ask git-operator to create the manifest structure in the GitOps repo
+2. Ask argocd-operator to create and sync the ArgoCD application
+3. Ask k8s-diagnostician to verify pods are running
+4. Report deployment status
+```
+
+### Incident Response Workflow Example
+
+```yaml
+name: pod-crashloop-remediation
+trigger:
+  type: alert
+  match:
+    alertname: KubePodCrashLooping
+
+steps:
+  - name: diagnose
+    agent: k8s-diagnostician
+    action: get-pod-status
+    inputs:
+      namespace: "{{ alert.labels.namespace }}"
+      pod: "{{ alert.labels.pod }}"
+
+  - name: check-logs
+    agent: k8s-diagnostician
+    action: analyze-logs
+    inputs:
+      pod: "{{ steps.diagnose.pod }}"
+      lines: 100
+
+  - name: decide-action
+    condition: "{{ steps.check-logs.cause == 'oom' }}"
+    branches:
+      true:
+        agent: argocd-operator
+        action: update-resources
+        confirm: true  # risky action
+      false:
+        agent: k8s-diagnostician
+        action: restart-pod
+        confirm: false  # safe action
+
+  - name: notify
+    action: report
+    outputs:
+      - summary
+      - actions-taken
+```
+
+---
+
+## Autonomy Model
+
+### Tiered Autonomy
+
+| Action Type | Behavior | Examples |
+|-------------|----------|----------|
+| **Safe** | Auto-execute, log action | get, describe, logs, list, top, diff, restart pod |
+| **Confirm** | Require user approval | delete, patch, edit, scale, rollout, apply, modify config |
+| **Forbidden** | Reject with explanation | drain, cordon, delete node, reset |
+
+### Confirmation Flow
+
+```
+1. Agent proposes action with rationale
+2. System checks action against autonomy rules
+3. If safe → execute immediately, log action
+4. If confirm → present to user (CLI prompt or dashboard queue)
+5. If forbidden → reject with explanation
+```
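+
+To illustrate the flow, here is how a few proposed commands might be classified against the `autonomy` lists in `settings.json`. This is a sketch only: the matching rules are not specified in this document, and it assumes forbidden entries are checked first so that the two-word entry `delete node` takes precedence over the plain `delete` verb. The field names (`command`, `matched`, `tier`, `result`) and the resource names `web` and `node-3` are hypothetical.
+
+```yaml
+- command: kubectl get pods -n prod
+  matched: get
+  tier: safe
+  result: execute immediately, log action
+- command: kubectl scale deployment web --replicas=3
+  matched: scale
+  tier: confirm
+  result: queue for user approval
+- command: kubectl delete node node-3
+  matched: delete node      # assumed to win over "delete" in confirm_actions
+  tier: forbidden
+  result: reject with explanation
+```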
+
+### Per-Workflow Overrides
+
+```yaml
+name: emergency-pod-restart
+autonomy:
+  auto_approve:
+    - restart_pod
+    - scale_replicas
+  always_confirm:
+    - delete_pvc
+```
+
+### Action Logging
+
+```
+~/.claude/logs/actions/2025-12-26-actions.jsonl
+```
+
+Each entry includes:
+- Timestamp
+- Agent
+- Action
+- Inputs
+- Outcome
+- Approval type (auto/user-confirmed)
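+
+A hypothetical entry with these fields is shown below as YAML for readability; on disk each entry would be a single JSON object on one line of the `.jsonl` file. The key names and values are assumptions, since this document does not fix a schema.
+
+```yaml
+timestamp: "2025-12-26T19:30:00Z"   # illustrative value
+agent: k8s-diagnostician
+action: get
+inputs:
+  resource: pods
+  namespace: prod
+outcome: success
+approval: auto                      # or "user-confirmed"
+```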
+
+---
+
+## Skills (User Entry Points)
+
+| Skill | Command | Purpose |
+|-------|---------|---------|
+| cluster-status | `/cluster-status` | Quick health overview |
+| deploy | `/deploy <app>` | Deploy or update an app |
+| diagnose | `/diagnose <issue>` | Investigate a problem |
+| rollback | `/rollback <app>` | Revert to previous version |
+| workflow | `/workflow <name>` | Run a named workflow |
+
+### Example Skill: cluster-status.md
+
+```markdown
+# Cluster Status
+
+Invoke the k8s-orchestrator to provide a quick health overview.
+
+## Steps
+1. Delegate to k8s-diagnostician: get node status
+2. Delegate to prometheus-analyst: check for active alerts
+3. Delegate to argocd-operator: list out-of-sync apps
+4. Summarize in a concise table
+
+## Output Format
+- Node health: table
+- Active alerts: bullet list
+- ArgoCD status: table
+- Recommendations: if any issues found
+```
+
+---
+
+## Interaction Methods
+
+### Terminal/CLI
+
+- Primary interaction via Claude Code
+- Fallback when the cluster (and therefore the dashboard hosted on it) is unavailable
+- Use skills to invoke workflows
+
+### Dashboard (Web UI)
+
+- Deployed on cluster (Pi 3 node)
+- Views: Status, Pending Confirmations, History, Workflows
+- Approve/reject risky actions
+
+### Push Notifications (Future)
+
+- Discord, Slack, or Telegram integration
+- Alert on issues requiring attention
+
+---
+
+## Dashboard Specification
+
+### Tech Stack
+
+- **Backend**: Go binary (single static binary, embedded assets)
+- **Storage**: SQLite or flat JSON files
+- **Resources**: Minimal footprint for Pi 3
+
+### Deployment
+
+```yaml
+apiVersion: apps/v1
+kind: Deployment
+metadata:
+  name: k8s-agent-dashboard
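+spec:
+  replicas: 1
+  selector:                          # required for apps/v1 Deployments
+    matchLabels:
+      app: k8s-agent-dashboard       # label name is an assumption
+  template:
+    metadata:
+      labels:
+        app: k8s-agent-dashboard     # must match spec.selector.matchLabels
+    spec: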
+      containers:
+        - name: dashboard
+          image: k8s-agent-dashboard:latest
+          resources:
+            requests:
+              memory: "32Mi"
+              cpu: "10m"
+            limits:
+              memory: "64Mi"
+              cpu: "100m"
+      tolerations:
+        - key: "node-type"
+          operator: "Equal"
+          value: "pi3"
+          effect: "NoSchedule"
+      nodeSelector:
+        kubernetes.io/arch: arm64    # matches every node; the toleration permits the Pi 3 but does not pin the pod to it
+```
+
+### Views
+
+| View | Description |
+|------|-------------|
+| Status | Current cluster health, active alerts, ArgoCD sync state |
+| Pending | Actions awaiting confirmation with approve/reject buttons |
+| History | Recent actions taken, filterable by agent/workflow |
+| Workflows | List of defined workflows, manual trigger capability |
+
+---
+
+## Implementation Phases
+
+### Phase 1: Core Agent System
+
+**Deliverables:**
+- `~/.claude/` directory structure
+- Orchestrator and 4 subagent prompt files
+- `settings.json` with agent configurations
+- 3-4 essential workflows (cluster-health, deploy, diagnose)
+- Core skills (/cluster-status, /deploy, /diagnose)
+
+**Validation:**
+- Manual CLI invocation
+- Test each subagent independently
+- Run health check workflow end-to-end
+
+### Phase 2: Dashboard
+
+**Deliverables:**
+- Go-based dashboard application
+- Kubernetes manifests for Pi 3 deployment
+- Pending confirmations queue
+- Action history view
+- Approval flow integration
+
+### Phase 3: Automation
+
+**Deliverables:**
+- Scheduled workflow execution
+- Alertmanager webhook integration
+- Expanded incident response workflows
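+
+For the Alertmanager webhook integration, a routing fragment along these lines could forward the `KubePodCrashLooping` alert used by the incident workflow to a webhook receiver. The receiver names and the URL are hypothetical; this document does not specify which component accepts the webhook.
+
+```yaml
+# Alertmanager configuration fragment (sketch)
+route:
+  receiver: default                  # hypothetical catch-all receiver
+  routes:
+    - matchers:
+        - alertname = "KubePodCrashLooping"
+      receiver: k8s-agent
+receivers:
+  - name: default
+  - name: k8s-agent
+    webhook_configs:
+      - url: http://k8s-agent-dashboard.default.svc:8080/alerts   # hypothetical endpoint
+        send_resolved: false         # only firing alerts should trigger a workflow
+```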
+
+### Phase 4: Expansion (Future)
+
+**Potential additions:**
+- Push notifications (Discord/Slack/Telegram)
+- Additional domains (development, research, productivity)
+- SDK-based background daemon for true autonomy
+
+---
+
+## Future Domain Expansion
+
+The system is designed to expand beyond DevOps:
+
+| Domain | Use Cases |
+|--------|-----------|
+| Software Development | Code generation, refactoring, testing across repos |
+| Research & Analysis | Information gathering, summarizing, recommendations |
+| Personal Productivity | File management, notes, task tracking |
+
+New domains would add:
+- Additional subagents with specialized prompts
+- Domain-specific workflows
+- New skills for user invocation