- Add .gitignore for logs, caches, credentials, and history
- Add K8s agent orchestrator design document
- Include existing Claude Code settings and plugin configs

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
86 lines
3.5 KiB
Markdown
# Agent Orchestrator System - Brainstorming Notes

## Overview
User-level Claude Code agent system with orchestrator + specialized subagents + workflows.

Location: `~/.claude/`

## Target Domains (for future expansion)
- **A) DevOps/Infrastructure** - PRIMARY - Raspberry Pi K8s cluster management
- B) Software development - Code generation, refactoring, testing
- C) Research & analysis - Information gathering, summarizing
- D) Personal productivity - Files, notes, tasks, schedules
- E) Multi-domain - General-purpose tasks

## Primary Use Case
- Raspberry Pi Kubernetes cluster management
- App deployment to the cluster
- K8s distribution: **k0s**
- Deployment method: **GitOps with ArgoCD**

## Cluster Hardware
| Node | Hardware | RAM | Role |
|------|----------|-----|------|
| Node 1 | Raspberry Pi 5 | 8GB | Control plane + Worker |
| Node 2 | Raspberry Pi 5 | 8GB | Worker |
| Node 3 | Raspberry Pi 3B+ | 1GB | Worker (tainted, tolerations required) |

**Pi 3 node**: Reserved for lightweight workloads only. Good candidate for dashboard deployment.

**Architecture**: All nodes run arm64 (64-bit OS).

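Because the Pi 3B+ node is tainted, any workload meant to run there (the dashboard, for example) needs a matching toleration. The notes do not say which taint is used, so the following is only a minimal sketch: the taint `node-class=lightweight:NoSchedule`, the helper name `allow_on_pi3`, and the sample `dashboard` manifest are all hypothetical.

```python
import copy

# Hypothetical taint: the notes do not specify the key, value, or effect
# applied to the Pi 3B+ node.
PI3_TAINT = {"key": "node-class", "value": "lightweight", "effect": "NoSchedule"}


def allow_on_pi3(deployment: dict, taint: dict = PI3_TAINT) -> dict:
    """Return a copy of a Deployment manifest whose pods tolerate the given taint."""
    patched = copy.deepcopy(deployment)  # leave the caller's manifest untouched
    pod_spec = patched["spec"]["template"]["spec"]
    toleration = {
        "key": taint["key"],
        "operator": "Equal",
        "value": taint["value"],
        "effect": taint["effect"],
    }
    tolerations = pod_spec.setdefault("tolerations", [])
    if toleration not in tolerations:  # keep the helper idempotent
        tolerations.append(toleration)
    return patched


# Sample manifest with a placeholder image name.
dashboard = {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "metadata": {"name": "dashboard"},
    "spec": {
        "template": {
            "spec": {"containers": [{"name": "web", "image": "example/dashboard:latest"}]}
        }
    },
}
patched = allow_on_pi3(dashboard)
```

A toleration only permits scheduling onto the tainted node; it does not force it. Pinning a workload to the Pi 3B+ would additionally need a node selector or node affinity.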
## Workloads
- Self-hosted services (home automation, media, personal tools)
- Development/testing environments
- Infrastructure services (monitoring, logging, databases)

## Agent Tasks (priority order)
1. **Cluster health monitoring** - Detect issues, diagnose, suggest/apply fixes (TOP PRIORITY)
2. Deployment management - Create/update deployments, ArgoCD sync, rollbacks
3. Resource management - Scaling, allocation, cleanup
4. App lifecycle - End-to-end, from "I want to run X" to a running deployment
5. Incident response - Alerting, investigation, remediation

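As an illustration of the top-priority task, here is a rough sketch of how a health check might flag problem pods from the JSON that `kubectl get pods --all-namespaces -o json` returns. The function name, the restart threshold of 5, and the sample pods are assumptions made for the example, not part of the design.

```python
import json


def find_unhealthy_pods(pod_list: dict, restart_threshold: int = 5) -> list[dict]:
    """Collect pods with a bad phase, a waiting container, or many restarts."""
    findings = []
    for pod in pod_list.get("items", []):
        meta = pod.get("metadata", {})
        status = pod.get("status", {})
        reasons = []

        phase = status.get("phase", "Unknown")
        if phase not in ("Running", "Succeeded"):
            reasons.append(f"phase={phase}")

        for cs in status.get("containerStatuses", []):
            # A waiting container carries a reason such as CrashLoopBackOff.
            waiting = cs.get("state", {}).get("waiting")
            if waiting:
                reasons.append(f"{cs['name']} waiting: {waiting.get('reason', 'unknown')}")
            if cs.get("restartCount", 0) >= restart_threshold:
                reasons.append(f"{cs['name']} restarted {cs['restartCount']} times")

        if reasons:
            findings.append(
                {"namespace": meta.get("namespace"), "name": meta.get("name"), "reasons": reasons}
            )
    return findings


# Trimmed, made-up sample in the shape of a kubectl pod list.
SAMPLE = """
{"items": [
  {"metadata": {"namespace": "monitoring", "name": "grafana-0"},
   "status": {"phase": "Running",
              "containerStatuses": [{"name": "grafana", "restartCount": 0,
                                     "state": {"running": {}}}]}},
  {"metadata": {"namespace": "media", "name": "jellyfin-0"},
   "status": {"phase": "Running",
              "containerStatuses": [{"name": "app", "restartCount": 12,
                                     "state": {"waiting": {"reason": "CrashLoopBackOff"}}}]}}
]}
"""
findings = find_unhealthy_pods(json.loads(SAMPLE))
```

The diagnosis and fix steps would build on findings like these; the sketch covers detection only.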
## Autonomy Model
- **Tiered autonomy**: Safe actions auto-apply, risky actions require confirmation
  - Safe: restart pod, scale replicas, clear completed jobs
  - Risky: delete PVC, modify configs, node operations

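The tiers above could be encoded as a simple lookup that gates execution. This is a sketch under assumptions: the action identifiers are hypothetical labels for the examples listed, and treating unknown actions as risky is a fail-closed choice the notes do not state.

```python
from enum import Enum
from typing import Callable


class Tier(Enum):
    SAFE = "safe"
    RISKY = "risky"


# Hypothetical action names standing in for the examples in the notes.
ACTION_TIERS = {
    "restart_pod": Tier.SAFE,
    "scale_replicas": Tier.SAFE,
    "clear_completed_jobs": Tier.SAFE,
    "delete_pvc": Tier.RISKY,
    "modify_config": Tier.RISKY,
    "node_operation": Tier.RISKY,
}


def needs_confirmation(action: str) -> bool:
    """Unknown actions are treated as risky (assumption: fail closed)."""
    return ACTION_TIERS.get(action, Tier.RISKY) is Tier.RISKY


def run_action(
    action: str,
    execute: Callable[[str], object],
    confirm: Callable[[str], bool],
) -> str:
    """Auto-apply safe actions; ask `confirm` before applying risky ones."""
    if needs_confirmation(action) and not confirm(action):
        return "skipped"
    execute(action)
    return "applied"
```

In a terminal session `confirm` could wrap an interactive prompt, while a background daemon might always return `False` for risky actions and raise a notification instead.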
## Interaction Methods
- **Terminal/CLI** - Primary interaction via Claude Code (also fallback when cluster is down)
- **Dashboard/UI** - Web interface deployed on cluster via ArgoCD
- **Push notifications** - Future consideration (Discord/Slack/Telegram)

## Infrastructure Stack
- Monitoring: **Prometheus + Alertmanager + Grafana**
- GitOps repo: **Self-hosted Gitea/Forgejo**
- Workflow triggers: **Scheduled + Event-driven (Alertmanager webhooks)**

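For the event-driven trigger, Alertmanager posts a JSON body containing an `alerts` array, where each alert has a `status`, `labels`, and `annotations`. A minimal sketch of turning that body into work items might look like the following. The function name and output shape are hypothetical, and the `severity` label and `summary` annotation are common conventions that depend on how the alert rules are written.

```python
import json


def firing_alerts(body: str) -> list[dict]:
    """Extract one work item per firing alert from an Alertmanager webhook body."""
    payload = json.loads(body)
    tasks = []
    for alert in payload.get("alerts", []):
        if alert.get("status") != "firing":
            continue  # resolved alerts need no new investigation
        labels = alert.get("labels", {})
        tasks.append(
            {
                "alertname": labels.get("alertname", "unknown"),
                # `severity` is a convention, not guaranteed to be present.
                "severity": labels.get("severity", "none"),
                "summary": alert.get("annotations", {}).get("summary", ""),
            }
        )
    return tasks


# Made-up payload, trimmed to the fields used above.
SAMPLE_BODY = json.dumps(
    {
        "version": "4",
        "status": "firing",
        "alerts": [
            {
                "status": "firing",
                "labels": {"alertname": "NodeMemoryPressure", "severity": "warning"},
                "annotations": {"summary": "Memory pressure on node3"},
            },
            {
                "status": "resolved",
                "labels": {"alertname": "PodCrashLooping", "severity": "critical"},
                "annotations": {"summary": "Pod recovered"},
            },
        ],
    }
)
tasks = firing_alerts(SAMPLE_BODY)
```

Each resulting item could then be handed to the orchestrator, which decides which subagent or workflow should pick it up.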
## Implementation Approach
**Phase 1**: Claude Code skills + custom subagent types in `~/.claude/`

**Phase 2 (later)**: Add SDK-based daemon for background automation

## Subagents
1. **k8s-diagnostician** - Cluster health, pod/node status, resource utilization, log analysis
2. **argocd-operator** - App sync, deployments, rollbacks, GitOps operations
3. **prometheus-analyst** - Query metrics, analyze trends, interpret alerts
4. **git-operator** - Commit manifests, create PRs in Gitea, manage GitOps repo

## Workflow Definitions
- **YAML** - Complex workflows with branching, conditions, multi-step
- **Markdown** - Simple workflows, prose-like descriptions

## CLI Tools Available
- kubectl
- argocd CLI
- k0sctl

## Model Assignment
- **Default**: Orchestrator = Opus, Subagents = Sonnet
- **Override levels**:
  1. Per-workflow: specify model in workflow YAML
  2. Per-step: specify model for individual workflow steps
  3. Dynamic: Orchestrator can downgrade/upgrade model per-delegation based on task complexity
- **Cost optimization**: Orchestrator evaluates task complexity and selects appropriate model
  - Simple queries (get status, list) → Haiku
  - Standard operations (analyze, diagnose) → Sonnet
  - Complex reasoning (root cause, multi-factor decisions) → Opus
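One way the override levels might combine is sketched below. The notes list the levels but not their precedence, so this assumes the most specific explicit setting wins (step over workflow) and that dynamic selection applies only when nothing is pinned. The names `haiku`, `sonnet`, and `opus` are tier labels taken from the notes, not exact model identifiers, and the complexity categories are hypothetical.

```python
from typing import Optional

DEFAULT_SUBAGENT_MODEL = "sonnet"  # default for subagents per the notes

# Hypothetical complexity labels mapped to the tiers listed above.
COMPLEXITY_TO_MODEL = {
    "simple": "haiku",      # get status, list
    "standard": "sonnet",   # analyze, diagnose
    "complex": "opus",      # root cause, multi-factor decisions
}


def resolve_model(
    workflow_model: Optional[str] = None,
    step_model: Optional[str] = None,
    complexity: Optional[str] = None,
) -> str:
    """Pick a model for one delegation (assumed order: step > workflow > dynamic > default)."""
    if step_model:
        return step_model
    if workflow_model:
        return workflow_model
    if complexity:
        return COMPLEXITY_TO_MODEL.get(complexity, DEFAULT_SUBAGENT_MODEL)
    return DEFAULT_SUBAGENT_MODEL
```

For example, `resolve_model(complexity="simple")` selects the Haiku tier, while a step that pins `opus` keeps it regardless of the complexity estimate.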