- Add .gitignore for logs, caches, credentials, and history - Add K8s agent orchestrator design document - Include existing Claude Code settings and plugin configs 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Agent Orchestrator System - Brainstorming Notes
Overview
User-level Claude Code agent system with orchestrator + specialized subagents + workflows.
Location: ~/.claude/
Target Domains (for future expansion)
- A) DevOps/Infrastructure - PRIMARY - Raspberry Pi K8s cluster management
- B) Software development - Code generation, refactoring, testing
- C) Research & analysis - Information gathering, summarizing
- D) Personal productivity - Files, notes, tasks, schedules
- E) Multi-domain - General-purpose tasks
Primary Use Case
- Raspberry Pi Kubernetes cluster management
- App deployment to the cluster
- K8s distribution: k0s
- Deployment method: GitOps with ArgoCD
Cluster Hardware
| Node | Hardware | RAM | Role |
|---|---|---|---|
| Node 1 | Raspberry Pi 5 | 8GB | Control plane + Worker |
| Node 2 | Raspberry Pi 5 | 8GB | Worker |
| Node 3 | Raspberry Pi 3B+ | 1GB | Worker (tainted, tolerations required) |
Pi 3 node: Reserved for lightweight workloads only. Good candidate for dashboard deployment.
Architecture: All nodes run arm64 (64-bit OS).
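As a rough sketch of what "tainted, tolerations required" means for a workload, the function below builds a pod spec fragment as a plain Python dict. The taint key and value (`node-class=pi3`), the function name, and the default memory limit are assumptions for illustration; the notes do not specify them.

```python
from typing import Any

# Assumed taint on the Pi 3 node; the notes do not name the key or value.
PI3_TAINT = {"key": "node-class", "value": "pi3", "effect": "NoSchedule"}


def pi3_pod_spec(image: str, memory_limit: str = "128Mi") -> dict[str, Any]:
    """Build a pod spec fragment that is allowed onto the tainted Pi 3 node."""
    return {
        # Every node is arm64, so pin the architecture with the standard label.
        "nodeSelector": {"kubernetes.io/arch": "arm64"},
        "tolerations": [
            {
                "key": PI3_TAINT["key"],
                "operator": "Equal",
                "value": PI3_TAINT["value"],
                "effect": PI3_TAINT["effect"],
            }
        ],
        "containers": [
            {
                "name": "app",
                "image": image,
                # Keep the limit small: the node has only 1GB of RAM.
                "resources": {"limits": {"memory": memory_limit}},
            }
        ],
    }
```

A toleration only permits scheduling onto the tainted node; it does not force it. Pinning a workload such as the dashboard to the Pi 3 would also need a node-specific label in `nodeSelector` or a node affinity rule.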
Workloads
- Self-hosted services (home automation, media, personal tools)
- Development/testing environments
- Infrastructure services (monitoring, logging, databases)
Agent Tasks (priority order)
1. Cluster health monitoring - Detect issues, diagnose, suggest/apply fixes (TOP PRIORITY)
2. Deployment management - Create/update deployments, ArgoCD sync, rollbacks
3. Resource management - Scaling, allocation, cleanup
4. App lifecycle - End-to-end "I want to run X" to deployed
5. Incident response - Alerting, investigation, remediation
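For the top-priority task, one possible first step is scanning pod status for obvious problems. The sketch below assumes its input is the parsed JSON from `kubectl get pods -A -o json`; the function name and the restart threshold are hypothetical.

```python
def find_unhealthy_pods(pod_list: dict, restart_threshold: int = 5) -> list[dict]:
    """Return pods that are not running cleanly, with human-readable reasons."""
    findings = []
    for pod in pod_list.get("items", []):
        meta = pod.get("metadata", {})
        status = pod.get("status", {})
        reasons = []

        # Anything other than Running or Succeeded deserves a look.
        phase = status.get("phase", "Unknown")
        if phase not in ("Running", "Succeeded"):
            reasons.append(f"phase={phase}")

        for cs in status.get("containerStatuses", []):
            # A waiting container carries a reason such as CrashLoopBackOff.
            waiting = cs.get("state", {}).get("waiting")
            if waiting:
                reasons.append(f"{cs['name']} waiting: {waiting.get('reason', 'unknown')}")
            if cs.get("restartCount", 0) >= restart_threshold:
                reasons.append(f"{cs['name']} restarted {cs['restartCount']} times")

        if reasons:
            findings.append(
                {
                    "namespace": meta.get("namespace"),
                    "name": meta.get("name"),
                    "reasons": reasons,
                }
            )
    return findings
```

The findings list is deliberately plain data, so a subagent can summarize it or hand it to a diagnosis step without re-parsing kubectl output.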
Autonomy Model
- Tiered autonomy: Safe actions auto-apply, risky actions require confirmation
  - Safe: restart pod, scale replicas, clear completed jobs
  - Risky: delete PVC, modify configs, node operations
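The tiered model above could be enforced with a small gate in front of every cluster action. This is a minimal sketch: the action identifiers are hypothetical names for the examples listed, and treating unknown actions as risky is an assumption, not something the notes state.

```python
from enum import Enum


class Risk(Enum):
    SAFE = "safe"
    RISKY = "risky"


# Hypothetical identifiers for the safe examples listed in the notes.
SAFE_ACTIONS = {"restart_pod", "scale_replicas", "clear_completed_jobs"}


def classify(action: str) -> Risk:
    """Classify an action; anything not explicitly safe is treated as risky."""
    return Risk.SAFE if action in SAFE_ACTIONS else Risk.RISKY


def may_apply(action: str, confirmed: bool = False) -> bool:
    """Safe actions apply automatically; risky ones need explicit confirmation."""
    return classify(action) is Risk.SAFE or confirmed
```

Defaulting unknown actions to risky means a newly added operation cannot auto-apply until someone deliberately adds it to the safe set.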
Interaction Methods
- Terminal/CLI - Primary interaction via Claude Code (also fallback when cluster is down)
- Dashboard/UI - Web interface deployed on cluster via ArgoCD
- Push notifications - Future consideration (Discord/Slack/Telegram)
Infrastructure Stack
- Monitoring: Prometheus + Alertmanager + Grafana
- GitOps repo: Self-hosted Gitea/Forgejo
- Workflow triggers: Scheduled + Event-driven (Alertmanager webhooks)
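For the event-driven path, an Alertmanager webhook receiver gets a JSON body with a top-level `alerts` list, where each alert carries `status`, `labels`, and `annotations`. The sketch below picks out firing alerts and maps them to a workflow. The workflow names and the alert-to-workflow table are hypothetical; which alert rules exist depends on the Prometheus rules actually deployed.

```python
# Hypothetical mapping from alert name to workflow name.
WORKFLOW_BY_ALERT = {
    "KubePodCrashLooping": "diagnose-crashloop",
    "NodeMemoryPressure": "free-node-memory",
}


def firing_alerts(payload: dict) -> list[dict]:
    """Extract the firing alerts from an Alertmanager webhook payload."""
    results = []
    for alert in payload.get("alerts", []):
        if alert.get("status") != "firing":
            continue  # Resolved alerts do not trigger a workflow here.
        labels = alert.get("labels", {})
        results.append(
            {
                "alertname": labels.get("alertname", "unknown"),
                "severity": labels.get("severity", "none"),
                "summary": alert.get("annotations", {}).get("summary", ""),
            }
        )
    return results


def select_workflow(alertname: str) -> str:
    """Pick a workflow for an alert, falling back to a generic triage workflow."""
    return WORKFLOW_BY_ALERT.get(alertname, "generic-triage")
```

In Phase 1 there is no daemon listening for webhooks, so a function like this would only become relevant once the SDK-based daemon from Phase 2 exists.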
Implementation Approach
Phase 1: Claude Code skills + custom subagent types in ~/.claude/
Phase 2 (later): Add SDK-based daemon for background automation
Subagents
- k8s-diagnostician - Cluster health, pod/node status, resource utilization, log analysis
- argocd-operator - App sync, deployments, rollbacks, GitOps operations
- prometheus-analyst - Query metrics, analyze trends, interpret alerts
- git-operator - Commit manifests, create PRs in Gitea, manage GitOps repo
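To make the division of responsibilities concrete, here is a toy keyword router over the four subagents. It is only an illustration of the mapping: the keyword sets are assumptions, and in practice the orchestrator would most likely choose a subagent from its description rather than by keyword matching.

```python
# Subagent names come from the list above; the keyword sets are assumptions.
SUBAGENTS = {
    "k8s-diagnostician": {"pod", "node", "logs", "health", "resources"},
    "argocd-operator": {"sync", "deploy", "rollback"},
    "prometheus-analyst": {"metrics", "alert", "trend"},
    "git-operator": {"commit", "pr", "manifest"},
}


def route(task: str) -> list[str]:
    """Return every subagent whose keywords appear in the task description."""
    words = set(task.lower().split())
    return [name for name, keywords in SUBAGENTS.items() if words & keywords]
```

A task that matches several subagents, such as "check pod logs and metrics", returns more than one name, which mirrors the orchestrator delegating parts of one request to different specialists.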
Workflow Definitions
- YAML - Complex workflows with branching, conditions, multi-step
- Markdown - Simple workflows, prose-like descriptions
CLI Tools Available
- kubectl
- argocd CLI
- k0sctl
Model Assignment
- Default: Orchestrator = Opus, Subagents = Sonnet
- Override levels:
  - Per-workflow: specify model in workflow YAML
  - Per-step: specify model for individual workflow steps
  - Dynamic: Orchestrator can downgrade/upgrade model per-delegation based on task complexity
- Cost optimization: Orchestrator evaluates task complexity and selects appropriate model
  - Simple queries (get status, list) → Haiku
  - Standard operations (analyze, diagnose) → Sonnet
  - Complex reasoning (root cause, multi-factor decisions) → Opus
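The override levels can be read as a precedence chain. The sketch below assumes the order per-step, then per-workflow, then dynamic complexity, then the subagent default; the notes list the levels but do not fix their precedence, so that ordering is an assumption, as are the function name and the complexity labels.

```python
from typing import Optional

# Complexity tiers from the cost-optimization list; the labels are assumed names.
COMPLEXITY_MODEL = {"simple": "haiku", "standard": "sonnet", "complex": "opus"}
DEFAULT_SUBAGENT_MODEL = "sonnet"


def resolve_model(
    step_model: Optional[str] = None,
    workflow_model: Optional[str] = None,
    complexity: Optional[str] = None,
) -> str:
    """Pick the model for one delegation, most specific setting first."""
    if step_model:
        return step_model  # Per-step override wins.
    if workflow_model:
        return workflow_model  # Then the workflow-level setting.
    if complexity in COMPLEXITY_MODEL:
        return COMPLEXITY_MODEL[complexity]  # Then the orchestrator's estimate.
    return DEFAULT_SUBAGENT_MODEL
```

For example, `resolve_model(complexity="simple")` returns `"haiku"`, while `resolve_model(step_model="opus", complexity="simple")` returns `"opus"` because the explicit per-step setting takes priority under the assumed ordering.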