claude-code/docs/plans/2025-12-26-agent-orchestrator-brainstorm.md
OpenCode Test 216a95cec4 Initial commit: Claude Code config and K8s agent orchestrator design
2025-12-26 11:16:07 -08:00

Agent Orchestrator System - Brainstorming Notes

Overview

User-level Claude Code agent system with orchestrator + specialized subagents + workflows. Location: ~/.claude/

Target Domains (for future expansion)

  • A) DevOps/Infrastructure - PRIMARY - Raspberry Pi K8s cluster management
  • B) Software development - Code generation, refactoring, testing
  • C) Research & analysis - Information gathering, summarizing
  • D) Personal productivity - Files, notes, tasks, schedules
  • E) Multi-domain - General-purpose tasks

Primary Use Case

  • Raspberry Pi Kubernetes cluster management
  • App deployment to the cluster
  • K8s distribution: k0s
  • Deployment method: GitOps with ArgoCD

Cluster Hardware

Node    Hardware          RAM  Role
Node 1  Raspberry Pi 5    8GB  Control plane + Worker
Node 2  Raspberry Pi 5    8GB  Worker
Node 3  Raspberry Pi 3B+  1GB  Worker (tainted, tolerations required)

Pi 3 node: Reserved for lightweight workloads only. Good candidate for the dashboard deployment.

Architecture: All nodes run arm64 (64-bit OS).
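A toleration only *permits* a pod to land on the tainted Pi 3. To place the pod there deliberately, for example the dashboard, you also need a node selector. The sketch below builds such a Deployment manifest as a Python dict. It assumes a hypothetical taint and label `workload-class=lightweight` on Node 3, because these notes do not name the actual taint key. The image name and the 128Mi memory limit are also illustrative.

```python
import json

# Hypothetical taint/label on Node 3; the real key is not specified in these notes.
# Applied with e.g.: kubectl taint nodes <node3> workload-class=lightweight:NoSchedule
#                    kubectl label nodes <node3> workload-class=lightweight
TAINT_KEY = "workload-class"
TAINT_VALUE = "lightweight"


def lightweight_deployment(name: str, image: str, memory_limit: str = "128Mi") -> dict:
    """Build a Deployment that tolerates, and is pinned to, the tainted Pi 3 node."""
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name},
        "spec": {
            "replicas": 1,
            "selector": {"matchLabels": {"app": name}},
            "template": {
                "metadata": {"labels": {"app": name}},
                "spec": {
                    # The toleration only permits scheduling on the tainted node...
                    "tolerations": [{
                        "key": TAINT_KEY,
                        "operator": "Equal",
                        "value": TAINT_VALUE,
                        "effect": "NoSchedule",
                    }],
                    # ...the nodeSelector is what actually pins the pod there.
                    "nodeSelector": {TAINT_KEY: TAINT_VALUE},
                    "containers": [{
                        "name": name,
                        "image": image,
                        # Tight limit: the Pi 3B+ has only 1GB of RAM.
                        "resources": {"limits": {"memory": memory_limit}},
                    }],
                },
            },
        },
    }


if __name__ == "__main__":
    # Hypothetical image name, for illustration only.
    print(json.dumps(lightweight_deployment("dashboard", "example.org/dashboard:latest"), indent=2))
```

`kubectl apply -f` accepts JSON as well as YAML. The output can therefore be committed to the GitOps repo as-is or converted to YAML first.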

Workloads

  • Self-hosted services (home automation, media, personal tools)
  • Development/testing environments
  • Infrastructure services (monitoring, logging, databases)

Agent Tasks (priority order)

  1. Cluster health monitoring - Detect issues, diagnose, suggest/apply fixes (TOP PRIORITY)
  2. Deployment management - Create/update deployments, ArgoCD sync, rollbacks
  3. Resource management - Scaling, allocation, cleanup
  4. App lifecycle - End-to-end "I want to run X" to deployed
  5. Incident response - Alerting, investigation, remediation
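Cluster health monitoring is the top priority, and much of the first diagnostic pass amounts to scanning pod status. The sketch below shows what a k8s-diagnostician check might do. It assumes `kubectl` is on PATH with a working kubeconfig. The set of "problem" waiting reasons and the restart threshold of 5 are illustrative choices, not values taken from these notes. Parsing is kept separate from the kubectl call so the logic can be tested offline.

```python
import json
import subprocess

# Container waiting reasons that usually indicate a real problem, not a normal startup.
PROBLEM_REASONS = {"CrashLoopBackOff", "ImagePullBackOff", "ErrImagePull", "CreateContainerConfigError"}


def find_unhealthy_pods(pod_list: dict, restart_threshold: int = 5) -> list[dict]:
    """Scan the JSON from `kubectl get pods --all-namespaces -o json` for problem pods."""
    findings = []
    for pod in pod_list.get("items", []):
        meta = pod.get("metadata", {})
        status = pod.get("status", {})
        phase = status.get("phase", "Unknown")
        reasons = []
        # Succeeded pods (completed jobs) are fine; Failed/Unknown are not.
        if phase in ("Failed", "Unknown"):
            reasons.append(f"phase={phase}")
        for cs in status.get("containerStatuses", []):
            waiting = cs.get("state", {}).get("waiting")
            if waiting and waiting.get("reason") in PROBLEM_REASONS:
                reasons.append(f"{cs['name']}: {waiting['reason']}")
            if cs.get("restartCount", 0) >= restart_threshold:
                reasons.append(f"{cs['name']}: {cs['restartCount']} restarts")
        if reasons:
            findings.append({
                "namespace": meta.get("namespace"),
                "name": meta.get("name"),
                "reasons": reasons,
            })
    return findings


def fetch_pods() -> dict:
    """Run kubectl against the current kube context and parse its JSON output."""
    out = subprocess.run(
        ["kubectl", "get", "pods", "--all-namespaces", "-o", "json"],
        check=True, capture_output=True, text=True, timeout=60,
    ).stdout
    return json.loads(out)
```

Usage: `for f in find_unhealthy_pods(fetch_pods()): print(f["namespace"], f["name"], f["reasons"])`. Each finding is a natural starting point for the diagnose and suggest/apply-fix steps.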

Autonomy Model

  • Tiered autonomy: Safe actions auto-apply, risky actions require confirmation
  • Safe: restart pod, scale replicas, clear completed jobs
  • Risky: delete PVC, modify configs, node operations
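The tiered model can be sketched as a small policy gate. The action identifiers below are hypothetical names for the examples listed above. The key design choice, which is an assumption and not something stated in these notes, is to fail closed: anything not explicitly marked safe is treated as risky. The `run` and `confirm` callables are injected, so the policy can be tested without a cluster.

```python
from enum import Enum


class Tier(Enum):
    SAFE = "safe"    # auto-apply
    RISKY = "risky"  # requires explicit confirmation


# Hypothetical identifiers for the safe examples above. Everything else
# (delete PVC, modify configs, node operations, and any unlisted action)
# falls through to RISKY.
SAFE_ACTIONS = {"restart_pod", "scale_replicas", "clear_completed_jobs"}


def classify(action: str) -> Tier:
    """Fail closed: only explicitly listed actions are safe."""
    return Tier.SAFE if action in SAFE_ACTIONS else Tier.RISKY


def execute(action: str, run, confirm) -> bool:
    """Run `action` immediately if safe, otherwise only after `confirm(action)` is True."""
    if classify(action) is Tier.RISKY and not confirm(action):
        return False
    run(action)
    return True
```

Failing closed means a newly added capability has to be promoted to the safe list deliberately. It can never be auto-applied by accident.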

Interaction Methods

  • Terminal/CLI - Primary interaction via Claude Code (also the fallback when the cluster, and therefore the dashboard, is down)
  • Dashboard/UI - Web interface deployed on cluster via ArgoCD
  • Push notifications - Future consideration (Discord/Slack/Telegram)

Infrastructure Stack

  • Monitoring: Prometheus + Alertmanager + Grafana
  • GitOps repo: Self-hosted Gitea/Forgejo
  • Workflow triggers: Scheduled + Event-driven (Alertmanager webhooks)
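For the event-driven trigger, an Alertmanager receiver configured with `webhook_configs` POSTs a JSON body. That body carries an `alerts` array, and each alert has `status`, `labels`, `annotations`, `startsAt`, and `fingerprint` fields. The sketch below maps firing alerts to workflows. The workflow names and the alertname-to-workflow table are hypothetical. The HTTP listener that would receive the POST is left out.

```python
import json

# Hypothetical mapping from alertname to a workflow defined under ~/.claude/.
# The alert names are examples from the common kube-prometheus rule set.
ALERT_WORKFLOWS = {
    "KubePodCrashLooping": "diagnose-crashloop",
    "KubeNodeNotReady": "diagnose-node-not-ready",
}
DEFAULT_WORKFLOW = "triage-unknown-alert"


def route_alerts(body: bytes) -> list[tuple[str, dict]]:
    """Map each firing alert in an Alertmanager webhook payload to (workflow, context)."""
    payload = json.loads(body)
    jobs = []
    for alert in payload.get("alerts", []):
        # Resolved notifications are skipped; only firing alerts start a workflow.
        if alert.get("status") != "firing":
            continue
        labels = alert.get("labels", {})
        workflow = ALERT_WORKFLOWS.get(labels.get("alertname"), DEFAULT_WORKFLOW)
        context = {
            "labels": labels,
            "annotations": alert.get("annotations", {}),
            "startsAt": alert.get("startsAt"),
            "fingerprint": alert.get("fingerprint"),
        }
        jobs.append((workflow, context))
    return jobs
```

Routing unknown alerts to a generic triage workflow, instead of dropping them, keeps new Prometheus rules visible to the orchestrator before they get a dedicated workflow.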

Implementation Approach

Phase 1: Claude Code skills + custom subagent types in ~/.claude/

Phase 2 (later): Add SDK-based daemon for background automation

Subagents

  1. k8s-diagnostician - Cluster health, pod/node status, resource utilization, log analysis
  2. argocd-operator - App sync, deployments, rollbacks, GitOps operations
  3. prometheus-analyst - Query metrics, analyze trends, interpret alerts
  4. git-operator - Commit manifests, create PRs in Gitea, manage GitOps repo
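For Phase 1, each subagent could be a file in ~/.claude/agents/. The sketch below assumes Claude Code's Markdown subagent format: YAML frontmatter with `name`, `description`, `tools`, and `model`, followed by a system prompt. Check the field names against the current Claude Code documentation. The tool list and the one-line prompt body are placeholders.

```python
from pathlib import Path

# Assumed layout: user-level subagents live in ~/.claude/agents/*.md.
AGENTS_DIR = Path.home() / ".claude" / "agents"

SUBAGENTS = {
    "k8s-diagnostician": "Cluster health, pod/node status, resource utilization, log analysis",
    "argocd-operator": "App sync, deployments, rollbacks, GitOps operations",
    "prometheus-analyst": "Query metrics, analyze trends, interpret alerts",
    "git-operator": "Commit manifests, create PRs in Gitea, manage GitOps repo",
}


def render_agent(name: str, description: str, model: str = "sonnet",
                 tools: tuple[str, ...] = ("Bash", "Read", "Grep", "Glob")) -> str:
    """Render one subagent file; the default model follows 'Subagents = Sonnet'."""
    frontmatter = "\n".join([
        "---",
        f"name: {name}",
        f"description: {description}",
        f"tools: {', '.join(tools)}",  # placeholder tool list, adjust per agent
        f"model: {model}",
        "---",
    ])
    # Placeholder prompt body; a real agent needs a detailed system prompt.
    body = f"You are the {name} subagent. Scope: {description}."
    return f"{frontmatter}\n\n{body}\n"


def write_agents(target: Path = AGENTS_DIR) -> list[Path]:
    """Write one Markdown file per subagent into `target` and return the paths."""
    target.mkdir(parents=True, exist_ok=True)
    paths = []
    for name, desc in SUBAGENTS.items():
        path = target / f"{name}.md"
        path.write_text(render_agent(name, desc))
        paths.append(path)
    return paths
```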

Workflow Definitions

  • YAML - Complex, multi-step workflows with branching and conditions
  • Markdown - Simple workflows, prose-like descriptions
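A YAML workflow with branching reduces to a list of steps, each optionally gated on an earlier step's result. The sketch below shows a hypothetical schema (`steps`, `when`, per-workflow and per-step `model`) as the dict a YAML loader might produce, plus a tiny interpreter. The field names are illustrative, not a fixed format. `dispatch(agent, task, model)` stands in for delegating to a subagent.

```python
# Hypothetical schema: what a YAML workflow file might parse into.
WORKFLOW = {
    "name": "pod-crashloop",
    "model": "sonnet",  # per-workflow default
    "steps": [
        {"id": "check", "agent": "k8s-diagnostician", "task": "Get pod status"},
        {"id": "metrics", "agent": "prometheus-analyst",
         "task": "Pull memory usage for the pod",
         "when": {"step": "check", "equals": "unhealthy"}},
        {"id": "root-cause", "agent": "k8s-diagnostician",
         "task": "Correlate logs and metrics", "model": "opus",  # per-step override
         "when": {"step": "check", "equals": "unhealthy"}},
    ],
}


def run_workflow(workflow: dict, dispatch) -> dict:
    """Execute steps in order, skipping any whose `when` condition is not met.

    `dispatch(agent, task, model)` is an injected callable that delegates to a
    subagent and returns its result as a string.
    """
    results: dict[str, str] = {}
    for step in workflow["steps"]:
        cond = step.get("when")
        if cond and results.get(cond["step"]) != cond["equals"]:
            continue  # branch not taken
        model = step.get("model", workflow.get("model", "sonnet"))
        results[step["id"]] = dispatch(step["agent"], step["task"], model)
    return results
```

Markdown workflows, being prose, would instead be handed to the orchestrator as instructions, not interpreted step by step.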

CLI Tools Available

  • kubectl
  • argocd CLI
  • k0sctl
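Subagents will mostly drive these three CLIs. A thin wrapper that builds argv lists, and never shell strings, keeps commands inspectable and avoids shell injection. The subcommands used below (`argocd app sync`, `argocd app rollback`, `kubectl rollout restart`, `kubectl scale`, `k0sctl apply --config`) are standard. The wrapper itself is a hypothetical sketch, and it defaults to a dry run that only echoes the command.

```python
import shlex
import subprocess


def argocd_sync(app: str) -> list[str]:
    return ["argocd", "app", "sync", app]


def argocd_rollback(app: str, history_id: int) -> list[str]:
    # Argo CD rejects a rollback while automated sync is enabled for the app.
    return ["argocd", "app", "rollback", app, str(history_id)]


def kubectl_restart(deployment: str, namespace: str) -> list[str]:
    return ["kubectl", "rollout", "restart", f"deployment/{deployment}", "-n", namespace]


def kubectl_scale(deployment: str, namespace: str, replicas: int) -> list[str]:
    if replicas < 0:
        raise ValueError("replicas must be >= 0")
    return ["kubectl", "scale", f"deployment/{deployment}",
            f"--replicas={replicas}", "-n", namespace]


def k0sctl_apply(config: str = "k0sctl.yaml") -> list[str]:
    return ["k0sctl", "apply", "--config", config]


def run(argv: list[str], dry_run: bool = True, timeout: int = 120) -> str:
    """Execute argv without a shell, or just return the quoted command when dry_run is set."""
    if dry_run:
        return shlex.join(argv)
    proc = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
    if proc.returncode != 0:
        raise RuntimeError(f"{shlex.join(argv)} failed: {proc.stderr.strip()}")
    return proc.stdout
```

The dry-run default pairs naturally with the tiered autonomy model. Risky commands can be shown to the user for confirmation before `run(..., dry_run=False)` is called.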

Model Assignment

  • Default: Orchestrator = Opus, Subagents = Sonnet
  • Override levels:
    1. Per-workflow: specify model in workflow YAML
    2. Per-step: specify model for individual workflow steps
    3. Dynamic: Orchestrator can downgrade/upgrade model per-delegation based on task complexity
  • Cost optimization: Orchestrator evaluates task complexity and selects appropriate model
    • Simple queries (get status, list) → Haiku
    • Standard operations (analyze, diagnose) → Sonnet
    • Complex reasoning (root cause, multi-factor decisions) → Opus
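The notes list the override levels but not how they combine. The sketch below assumes one possible precedence: an explicit per-step model wins, then the per-workflow model, then the orchestrator's dynamic complexity-based choice, then the Sonnet default for subagents. The complexity labels are hypothetical names for the three tiers above.

```python
from typing import Optional

# Complexity tiers from the notes, mapped to short model aliases.
COMPLEXITY_TO_MODEL = {
    "simple": "haiku",     # get status, list
    "standard": "sonnet",  # analyze, diagnose
    "complex": "opus",     # root cause, multi-factor decisions
}
DEFAULT_SUBAGENT_MODEL = "sonnet"


def select_model(step_model: Optional[str] = None,
                 workflow_model: Optional[str] = None,
                 complexity: Optional[str] = None) -> str:
    """Resolve the model for one delegation.

    Assumed precedence: per-step > per-workflow > dynamic (complexity) > default.
    """
    if step_model:
        return step_model
    if workflow_model:
        return workflow_model
    if complexity in COMPLEXITY_TO_MODEL:
        return COMPLEXITY_TO_MODEL[complexity]
    return DEFAULT_SUBAGENT_MODEL
```

Letting explicit configuration beat the dynamic choice keeps behavior predictable for workflows that were deliberately pinned to a model. The opposite ordering would favor cost savings over predictability.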