claude-code/docs/plans/2025-12-26-agent-orchestrator-brainstorm.md

# Agent Orchestrator System - Brainstorming Notes

## Overview
User-level Claude Code agent system with orchestrator + specialized subagents + workflows.
Location: `~/.claude/`

## Target Domains (A is primary; B-E for future expansion)
- **A) DevOps/Infrastructure** - PRIMARY - Raspberry Pi K8s cluster management
- B) Software development - Code generation, refactoring, testing
- C) Research & analysis - Information gathering, summarizing
- D) Personal productivity - Files, notes, tasks, schedules
- E) Multi-domain - General-purpose tasks

## Primary Use Case
- Raspberry Pi Kubernetes cluster management
- App deployment to the cluster
- K8s distribution: **k0s**
- Deployment method: **GitOps with ArgoCD**

## Cluster Hardware
| Node | Hardware | RAM | Role |
|------|----------|-----|------|
| Node 1 | Raspberry Pi 5 | 8GB | Control plane + Worker |
| Node 2 | Raspberry Pi 5 | 8GB | Worker |
| Node 3 | Raspberry Pi 3B+ | 1GB | Worker (tainted, tolerations required) |

**Pi 3 node**: Reserved for lightweight workloads only. Good candidate for dashboard deployment.

**Architecture**: All nodes run arm64 (64-bit OS).
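
Because Node 3 is tainted, only pods with a matching toleration can be scheduled there. The sketch below models the standard Kubernetes taint/toleration matching rules in plain Python, so an agent could pre-check a manifest before committing it. The taint key and value (`node-class=lightweight`) are hypothetical placeholders; these notes do not specify the actual taint.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Taint:
    key: str
    value: str
    effect: str  # "NoSchedule", "PreferNoSchedule", or "NoExecute"


@dataclass(frozen=True)
class Toleration:
    key: str = ""
    operator: str = "Equal"  # "Equal" or "Exists"
    value: str = ""
    effect: str = ""  # an empty effect matches every taint effect


# Hypothetical taint for the Pi 3B+ node; the real key/value is not defined in these notes.
PI3_TAINT = Taint(key="node-class", value="lightweight", effect="NoSchedule")


def tolerates(toleration: Toleration, taint: Taint) -> bool:
    """Return True if a single toleration matches a single taint."""
    if toleration.effect and toleration.effect != taint.effect:
        return False
    if toleration.operator == "Exists":
        # An empty key with Exists tolerates every taint.
        return toleration.key == "" or toleration.key == taint.key
    return toleration.key == taint.key and toleration.value == taint.value


def can_schedule(tolerations, taints) -> bool:
    """Every hard taint must be matched by at least one toleration."""
    hard_taints = [t for t in taints if t.effect != "PreferNoSchedule"]
    return all(any(tolerates(tol, t) for tol in tolerations) for t in hard_taints)
```

A dashboard pod meant for the Pi 3 node would then carry something like `Toleration(key="node-class", value="lightweight", effect="NoSchedule")`, while pods without it stay on the Pi 5 nodes.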

## Workloads
- Self-hosted services (home automation, media, personal tools)
- Development/testing environments
- Infrastructure services (monitoring, logging, databases)

## Agent Tasks (priority order)
1. **Cluster health monitoring** - Detect issues, diagnose, suggest/apply fixes (TOP PRIORITY)
2. Deployment management - Create/update deployments, ArgoCD sync, rollbacks
3. Resource management - Scaling, allocation, cleanup
4. App lifecycle - End-to-end, from "I want to run X" to a deployed app
5. Incident response - Alerting, investigation, remediation

## Autonomy Model
- **Tiered autonomy**: Safe actions auto-apply, risky actions require confirmation
- Safe: restart pod, scale replicas, clear completed jobs
- Risky: delete PVC, modify configs, node operations
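
One possible way to encode the tiers above is a lookup table plus a gate function. This is a minimal sketch: the action identifiers are hypothetical names for the examples listed, and treating unknown actions as risky is an assumption rather than a decision recorded in these notes.

```python
from enum import Enum


class Tier(Enum):
    SAFE = "safe"    # auto-apply
    RISKY = "risky"  # requires confirmation


# Hypothetical action identifiers mirroring the examples in the list above.
ACTION_TIERS = {
    "restart_pod": Tier.SAFE,
    "scale_replicas": Tier.SAFE,
    "clear_completed_jobs": Tier.SAFE,
    "delete_pvc": Tier.RISKY,
    "modify_config": Tier.RISKY,
    "node_operation": Tier.RISKY,
}


def classify(action: str) -> Tier:
    """Look up an action's tier; unknown actions default to RISKY (assumption)."""
    return ACTION_TIERS.get(action, Tier.RISKY)


def may_run(action: str, confirmed: bool = False) -> bool:
    """Safe actions run immediately; risky ones only after user confirmation."""
    return classify(action) is Tier.SAFE or confirmed
```

Defaulting to the risky tier keeps the system fail-closed: a new action type needs an explicit entry before it can auto-apply.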

## Interaction Methods
- **Terminal/CLI** - Primary interaction via Claude Code (also fallback when cluster is down)
- **Dashboard/UI** - Web interface deployed on cluster via ArgoCD
- **Push notifications** - Future consideration (Discord/Slack/Telegram)

## Infrastructure Stack
- Monitoring: **Prometheus + Alertmanager + Grafana**
- GitOps repo: **Self-hosted Gitea/Forgejo**
- Workflow triggers: **Scheduled + Event-driven (Alertmanager webhooks)**
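
For the event-driven trigger, Alertmanager's webhook receiver POSTs a JSON body with an `alerts` array, where each alert carries a `status` and a `labels` map. The sketch below shows how firing alerts might be mapped to workflows. The alert-to-workflow table, the workflow names, and the fallback are all hypothetical.

```python
# Hypothetical mapping from alert names to workflow names.
ALERT_WORKFLOWS = {
    "KubePodCrashLooping": "diagnose-crashloop",
    "NodeMemoryPressure": "free-node-memory",
}
FALLBACK_WORKFLOW = "generic-triage"  # hypothetical catch-all workflow


def workflows_for(payload: dict) -> list:
    """Return the workflows to trigger for firing alerts, deduplicated, in order."""
    selected = []
    for alert in payload.get("alerts", []):
        if alert.get("status") != "firing":
            continue  # resolved alerts do not trigger a workflow in this sketch
        name = alert.get("labels", {}).get("alertname", "")
        workflow = ALERT_WORKFLOWS.get(name, FALLBACK_WORKFLOW)
        if workflow not in selected:
            selected.append(workflow)
    return selected


# Trimmed example payload; real payloads contain more fields.
SAMPLE_PAYLOAD = {
    "version": "4",
    "status": "firing",
    "alerts": [
        {"status": "firing", "labels": {"alertname": "KubePodCrashLooping", "namespace": "media"}},
        {"status": "resolved", "labels": {"alertname": "NodeMemoryPressure"}},
        {"status": "firing", "labels": {"alertname": "SomethingElse"}},
    ],
}
```

With the sample payload, `workflows_for(SAMPLE_PAYLOAD)` selects `diagnose-crashloop` and then the fallback, skipping the resolved alert.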

## Implementation Approach
**Phase 1**: Claude Code skills + custom subagent types in `~/.claude/`

**Phase 2 (later)**: Add SDK-based daemon for background automation

## Subagents
1. **k8s-diagnostician** - Cluster health, pod/node status, resource utilization, log analysis
2. **argocd-operator** - App sync, deployments, rollbacks, GitOps operations
3. **prometheus-analyst** - Query metrics, analyze trends, interpret alerts
4. **git-operator** - Commit manifests, create PRs in Gitea, manage GitOps repo

## Workflow Definitions
- **YAML** - Complex workflows with branching, conditions, multi-step
- **Markdown** - Simple workflows, prose-like descriptions

## CLI Tools Available
- kubectl
- argocd CLI
- k0sctl

## Model Assignment
- **Default**: Orchestrator = Opus, Subagents = Sonnet
- **Override levels**:
  1. Per-workflow: specify model in workflow YAML
  2. Per-step: specify model for individual workflow steps
  3. Dynamic: Orchestrator can downgrade/upgrade model per-delegation based on task complexity
- **Cost optimization**: Orchestrator evaluates task complexity and selects appropriate model
  - Simple queries (get status, list) → Haiku
  - Standard operations (analyze, diagnose) → Sonnet
  - Complex reasoning (root cause, multi-factor decisions) → Opus
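
The selection rules above could be sketched as a single function. The precedence order (per-step beats per-workflow, which beats dynamic complexity-based selection) is an assumption, since these notes list the override levels without ranking them. The complexity labels and lowercase model names are likewise illustrative.

```python
# Complexity-to-model mapping taken from the cost-optimization list above.
COMPLEXITY_MODEL = {
    "simple": "haiku",     # get status, list
    "standard": "sonnet",  # analyze, diagnose
    "complex": "opus",     # root cause, multi-factor decisions
}
DEFAULT_SUBAGENT_MODEL = "sonnet"


def select_model(complexity=None, workflow_model=None, step_model=None) -> str:
    """Pick a subagent model.

    Assumed precedence: step override > workflow override > dynamic
    selection by complexity > default subagent model.
    """
    if step_model:
        return step_model
    if workflow_model:
        return workflow_model
    return COMPLEXITY_MODEL.get(complexity, DEFAULT_SUBAGENT_MODEL)
```

For example, `select_model("simple")` yields `"haiku"`, while `select_model("simple", workflow_model="opus")` honors the workflow-level override.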