claude-code/docs/plans/2025-12-26-agent-orchestrator-brainstorm.md
OpenCode Test 216a95cec4 Initial commit: Claude Code config and K8s agent orchestrator design
- Add .gitignore for logs, caches, credentials, and history
- Add K8s agent orchestrator design document
- Include existing Claude Code settings and plugin configs

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-26 11:16:07 -08:00

# Agent Orchestrator System - Brainstorming Notes
## Overview
User-level Claude Code agent system with orchestrator + specialized subagents + workflows.

Location: `~/.claude/`
## Target Domains (for future expansion)
- **A) DevOps/Infrastructure** - PRIMARY - Raspberry Pi K8s cluster management
- B) Software development - Code generation, refactoring, testing
- C) Research & analysis - Information gathering, summarizing
- D) Personal productivity - Files, notes, tasks, schedules
- E) Multi-domain - General-purpose tasks
## Primary Use Case
- Raspberry Pi Kubernetes cluster management
- App deployment to the cluster
- K8s distribution: **k0s**
- Deployment method: **GitOps with ArgoCD**
## Cluster Hardware
| Node | Hardware | RAM | Role |
|------|----------|-----|------|
| Node 1 | Raspberry Pi 5 | 8GB | Control plane + Worker |
| Node 2 | Raspberry Pi 5 | 8GB | Worker |
| Node 3 | Raspberry Pi 3B+ | 1GB | Worker (tainted, tolerations required) |

**Pi 3 node**: Reserved for lightweight workloads only. Good candidate for dashboard deployment.

**Architecture**: All nodes run arm64 (64-bit OS).
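
Because the Pi 3 node is tainted, any workload meant for it needs a matching toleration. The sketch below builds the relevant pod-spec fields as a plain Python dict. The taint key, value, and effect (`node-class=pi3:NoSchedule`) and the matching node label are assumptions for illustration; the notes do not name them.

```python
# Hypothetical taint; the notes only say the node is tainted, not how.
PI3_TAINT = {"key": "node-class", "value": "pi3", "effect": "NoSchedule"}


def pi3_pod_spec_fragment() -> dict:
    """Return pod-spec fields that let a lightweight pod land on the Pi 3 node."""
    return {
        # Assumes the node also carries a label that mirrors the taint.
        "nodeSelector": {PI3_TAINT["key"]: PI3_TAINT["value"]},
        # Tolerate the taint so the scheduler will accept the node.
        "tolerations": [
            {
                "key": PI3_TAINT["key"],
                "operator": "Equal",
                "value": PI3_TAINT["value"],
                "effect": PI3_TAINT["effect"],
            }
        ],
    }
```

The fragment would be merged into a Deployment's `spec.template.spec` before the manifest is committed to the GitOps repo.
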
## Workloads
- Self-hosted services (home automation, media, personal tools)
- Development/testing environments
- Infrastructure services (monitoring, logging, databases)
## Agent Tasks (priority order)
1. **Cluster health monitoring** - Detect issues, diagnose, suggest/apply fixes (TOP PRIORITY)
2. Deployment management - Create/update deployments, ArgoCD sync, rollbacks
3. Resource management - Scaling, allocation, cleanup
4. App lifecycle - End-to-end path from "I want to run X" to a deployed app
5. Incident response - Alerting, investigation, remediation
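
As a rough illustration of the top-priority task, the sketch below scans the JSON that `kubectl get pods -A -o json` emits and flags pods that look unhealthy. The function name, the restart threshold, and the set of waiting reasons are assumptions, not decisions from these notes.

```python
import json

# Waiting reasons treated as unhealthy; this set is an illustrative assumption.
BAD_WAITING_REASONS = {"CrashLoopBackOff", "ImagePullBackOff", "ErrImagePull"}


def find_unhealthy_pods(pod_list_json: str, max_restarts: int = 5) -> list[dict]:
    """Return one finding per pod that shows a bad phase, waiting reason, or restart count."""
    findings = []
    for pod in json.loads(pod_list_json).get("items", []):
        meta = pod.get("metadata", {})
        status = pod.get("status", {})
        reasons = []
        phase = status.get("phase")
        if phase in ("Failed", "Unknown"):
            reasons.append(f"phase {phase}")
        for cs in status.get("containerStatuses", []):
            waiting = cs.get("state", {}).get("waiting") or {}
            if waiting.get("reason") in BAD_WAITING_REASONS:
                reasons.append(f"{cs['name']}: {waiting['reason']}")
            if cs.get("restartCount", 0) > max_restarts:
                reasons.append(f"{cs['name']}: {cs['restartCount']} restarts")
        if reasons:
            findings.append(
                {"namespace": meta.get("namespace"), "name": meta.get("name"), "reasons": reasons}
            )
    return findings
```
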
## Autonomy Model
- **Tiered autonomy**: Safe actions auto-apply, risky actions require confirmation
  - Safe: restart pod, scale replicas, clear completed jobs
  - Risky: delete PVC, modify configs, node operations
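
One way the tiers could be enforced is a lookup table plus a confirmation callback, sketched below. The action names are hypothetical labels for the examples listed above, and treating unknown actions as risky is an assumption (fail closed), not something these notes decide.

```python
from enum import Enum
from typing import Callable


class Tier(Enum):
    SAFE = "safe"
    RISKY = "risky"


# Hypothetical action names covering the examples in the notes.
ACTION_TIERS = {
    "restart_pod": Tier.SAFE,
    "scale_replicas": Tier.SAFE,
    "clear_completed_jobs": Tier.SAFE,
    "delete_pvc": Tier.RISKY,
    "modify_config": Tier.RISKY,
    "node_operation": Tier.RISKY,
}


def may_apply(action: str, confirm: Callable[[str], bool]) -> bool:
    """Return True if the action may run now; risky actions need confirm() to agree."""
    # Assumption: anything not explicitly listed is treated as risky.
    tier = ACTION_TIERS.get(action, Tier.RISKY)
    if tier is Tier.SAFE:
        return True
    return confirm(action)
```

In the terminal, `confirm` could wrap a yes/no prompt; a background daemon might instead queue the request for later approval.
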
## Interaction Methods
- **Terminal/CLI** - Primary interaction via Claude Code (also fallback when cluster is down)
- **Dashboard/UI** - Web interface deployed on cluster via ArgoCD
- **Push notifications** - Future consideration (Discord/Slack/Telegram)
## Infrastructure Stack
- Monitoring: **Prometheus + Alertmanager + Grafana**
- GitOps repo: **Self-hosted Gitea/Forgejo**
- Workflow triggers: **Scheduled + Event-driven (Alertmanager webhooks)**
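
For the event-driven trigger, Alertmanager's webhook receiver POSTs a JSON body with an `alerts` array. The sketch below pulls out the firing alerts in a shape an orchestrator could act on. The function name and output fields are assumptions, and the `severity` label and `summary` annotation are common conventions rather than guaranteed fields.

```python
import json


def firing_alerts(webhook_body: str) -> list[dict]:
    """Extract the firing alerts from an Alertmanager webhook payload."""
    payload = json.loads(webhook_body)
    return [
        {
            "name": alert.get("labels", {}).get("alertname", "unknown"),
            # severity and summary are conventions; default when absent.
            "severity": alert.get("labels", {}).get("severity", "none"),
            "summary": alert.get("annotations", {}).get("summary", ""),
            "started": alert.get("startsAt"),
        }
        for alert in payload.get("alerts", [])
        if alert.get("status") == "firing"
    ]
```
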
## Implementation Approach
**Phase 1**: Claude Code skills + custom subagent types in `~/.claude/`

**Phase 2 (later)**: Add SDK-based daemon for background automation
## Subagents
1. **k8s-diagnostician** - Cluster health, pod/node status, resource utilization, log analysis
2. **argocd-operator** - App sync, deployments, rollbacks, GitOps operations
3. **prometheus-analyst** - Query metrics, analyze trends, interpret alerts
4. **git-operator** - Commit manifests, create PRs in Gitea, manage GitOps repo
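
The division of labor above could be expressed as a simple routing table that the orchestrator consults before delegating. The subagent names come from the list; the task categories are hypothetical and only meant to show the shape.

```python
# Task categories are hypothetical; subagent names come from the list above.
ROUTES = {
    "cluster_health": "k8s-diagnostician",
    "log_analysis": "k8s-diagnostician",
    "app_sync": "argocd-operator",
    "rollback": "argocd-operator",
    "metrics_query": "prometheus-analyst",
    "alert_triage": "prometheus-analyst",
    "commit_manifest": "git-operator",
    "open_pull_request": "git-operator",
}


def route(task_category: str) -> str:
    """Return the subagent for a category, or raise so the gap is visible."""
    try:
        return ROUTES[task_category]
    except KeyError:
        raise ValueError(f"no subagent registered for {task_category!r}") from None
```
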
## Workflow Definitions
- **YAML** - Complex workflows with branching, conditions, multi-step
- **Markdown** - Simple workflows, prose-like descriptions
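
To make "branching, conditions, multi-step" concrete, here is a minimal sketch of the step graph a YAML workflow might parse into, along with a runner that follows the success or failure branch of each step. The `Step` fields and `run_workflow` are assumptions; the notes do not define a workflow schema.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Step:
    name: str
    agent: str                        # subagent that would run the step
    on_success: Optional[str] = None  # next step name when the step succeeds
    on_failure: Optional[str] = None  # next step name when the step fails


def run_workflow(
    steps: dict[str, Step], start: str, execute: Callable[[Step], bool]
) -> list[str]:
    """Walk the step graph from `start` and return the names of the steps visited."""
    visited: list[str] = []
    current: Optional[str] = start
    while current is not None:
        if current in visited:
            raise ValueError(f"cycle at step {current!r}")
        step = steps[current]
        visited.append(current)
        # Branch on the outcome reported by the executor callback.
        current = step.on_success if execute(step) else step.on_failure
    return visited
```

A health workflow might use a `check` step whose failure branch leads to `diagnose` and then `fix`, while a passing check ends the run immediately.
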
## CLI Tools Available
- kubectl
- argocd CLI
- k0sctl
## Model Assignment
- **Default**: Orchestrator = Opus, Subagents = Sonnet
- **Override levels**:
  1. Per-workflow: specify model in workflow YAML
  2. Per-step: specify model for individual workflow steps
  3. Dynamic: Orchestrator can downgrade/upgrade model per-delegation based on task complexity
- **Cost optimization**: Orchestrator evaluates task complexity and selects appropriate model
  - Simple queries (get status, list) → Haiku
  - Standard operations (analyze, diagnose) → Sonnet
  - Complex reasoning (root cause, multi-factor decisions) → Opus
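
The override levels and the complexity mapping can be combined into one resolution function, sketched below. The precedence used here (per-step, then per-workflow, then dynamic, then the Sonnet default) is an assumption; the notes list the levels but do not say which wins. The complexity labels are also hypothetical.

```python
from typing import Optional

DEFAULT_SUBAGENT_MODEL = "sonnet"

# Hypothetical complexity labels mapped to the tiers listed above.
COMPLEXITY_TO_MODEL = {
    "simple": "haiku",     # get status, list
    "standard": "sonnet",  # analyze, diagnose
    "complex": "opus",     # root cause, multi-factor decisions
}


def select_model(
    step_model: Optional[str] = None,
    workflow_model: Optional[str] = None,
    complexity: Optional[str] = None,
) -> str:
    """Resolve the model for one delegation (assumed order: step > workflow > dynamic > default)."""
    if step_model:
        return step_model
    if workflow_model:
        return workflow_model
    if complexity is not None:
        return COMPLEXITY_TO_MODEL[complexity]
    return DEFAULT_SUBAGENT_MODEL
```
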