# Agent Orchestrator System - Brainstorming Notes

## Overview

User-level Claude Code agent system with orchestrator + specialized subagents + workflows.

Location: `~/.claude/`

## Target Domains (for future expansion)

- **A) DevOps/Infrastructure** - PRIMARY - Raspberry Pi K8s cluster management
- B) Software development - Code generation, refactoring, testing
- C) Research & analysis - Information gathering, summarizing
- D) Personal productivity - Files, notes, tasks, schedules
- E) Multi-domain - General-purpose tasks

## Primary Use Case

- Raspberry Pi Kubernetes cluster management
- App deployment to the cluster
- K8s distribution: **k0s**
- Deployment method: **GitOps with ArgoCD**

## Cluster Hardware

| Node | Hardware | RAM | Role |
|------|----------|-----|------|
| Node 1 | Raspberry Pi 5 | 8GB | Control plane + Worker |
| Node 2 | Raspberry Pi 5 | 8GB | Worker |
| Node 3 | Raspberry Pi 3B+ | 1GB | Worker (tainted, tolerations required) |

**Pi 3 node**: Reserved for lightweight workloads only. Good candidate for dashboard deployment.

**Architecture**: All nodes run arm64 (64-bit OS).

## Workloads

- Self-hosted services (home automation, media, personal tools)
- Development/testing environments
- Infrastructure services (monitoring, logging, databases)

## Agent Tasks (priority order)

1. **Cluster health monitoring** - Detect issues, diagnose, suggest/apply fixes (TOP PRIORITY)
2. Deployment management - Create/update deployments, ArgoCD sync, rollbacks
3. Resource management - Scaling, allocation, cleanup
4. App lifecycle - End-to-end "I want to run X" to deployed
5. Incident response - Alerting, investigation, remediation

## Autonomy Model

- **Tiered autonomy**: Safe actions auto-apply, risky actions require confirmation
  - Safe: restart pod, scale replicas, clear completed jobs
  - Risky: delete PVC, modify configs, node operations

## Interaction Methods

- **Terminal/CLI** - Primary interaction via Claude Code (also fallback when cluster is down)
- **Dashboard/UI** - Web interface deployed on cluster via ArgoCD
- **Push notifications** - Future consideration (Discord/Slack/Telegram)

## Infrastructure Stack

- Monitoring: **Prometheus + Alertmanager + Grafana**
- GitOps repo: **Self-hosted Gitea/Forgejo**
- Workflow triggers: **Scheduled + Event-driven (Alertmanager webhooks)**

## Implementation Approach

**Phase 1**: Claude Code skills + custom subagent types in `~/.claude/`

**Phase 2 (later)**: Add SDK-based daemon for background automation

## Subagents

1. **k8s-diagnostician** - Cluster health, pod/node status, resource utilization, log analysis
2. **argocd-operator** - App sync, deployments, rollbacks, GitOps operations
3. **prometheus-analyst** - Query metrics, analyze trends, interpret alerts
4. **git-operator** - Commit manifests, create PRs in Gitea, manage GitOps repo

## Workflow Definitions

- **YAML** - Complex workflows with branching, conditions, multi-step
- **Markdown** - Simple workflows, prose-like descriptions

## CLI Tools Available

- kubectl
- argocd CLI
- k0sctl

## Model Assignment

- **Default**: Orchestrator = Opus, Subagents = Sonnet
- **Override levels**:
  1. Per-workflow: specify model in workflow YAML
  2. Per-step: specify model for individual workflow steps
  3. Dynamic: Orchestrator can downgrade/upgrade model per-delegation based on task complexity
- **Cost optimization**: Orchestrator evaluates task complexity and selects appropriate model
  - Simple queries (get status, list) → Haiku
  - Standard operations (analyze, diagnose) → Sonnet
  - Complex reasoning (root cause, multi-factor decisions) → Opus
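
The model assignment rules above could be sketched roughly as follows. This is only an illustration: the names `Complexity`, `select_model`, `workflow_model`, and `step_model` are hypothetical, the tier strings are labels rather than real model identifiers, and the precedence (per-step beats per-workflow beats dynamic selection) is an assumption these notes do not state.

```python
from enum import Enum
from typing import Optional


class Complexity(Enum):
    """Task complexity buckets taken from the cost-optimization list."""
    SIMPLE = "simple"      # get status, list
    STANDARD = "standard"  # analyze, diagnose
    COMPLEX = "complex"    # root cause, multi-factor decisions


# Tier labels from the notes; not real model identifiers.
MODEL_FOR_COMPLEXITY = {
    Complexity.SIMPLE: "haiku",
    Complexity.STANDARD: "sonnet",
    Complexity.COMPLEX: "opus",
}

# Default for subagents per the notes (the orchestrator itself defaults to Opus).
DEFAULT_SUBAGENT_MODEL = "sonnet"


def select_model(
    complexity: Optional[Complexity] = None,
    workflow_model: Optional[str] = None,
    step_model: Optional[str] = None,
) -> str:
    """Pick a model tier for one delegation to a subagent.

    Assumed precedence (not specified in the notes):
    per-step override, then per-workflow override,
    then dynamic choice by complexity, then the default.
    """
    if step_model is not None:
        return step_model
    if workflow_model is not None:
        return workflow_model
    if complexity is not None:
        return MODEL_FOR_COMPLEXITY[complexity]
    return DEFAULT_SUBAGENT_MODEL
```

For example, `select_model(Complexity.SIMPLE)` picks the Haiku tier for a status query, while passing `step_model="opus"` forces Opus for that step regardless of the estimated complexity. If the intended precedence is different (say, dynamic selection should be able to override a per-workflow setting), only the order of the checks needs to change.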