feat: Implement Phase 3 automation for K8s agent system

Automation components for scheduled and event-driven workflows:

Scheduler:
- scheduler.sh for cron-based workflow execution
- Logs workflow runs to ~/.claude/logs/workflows/
- Notifies dashboard on completion

Alertmanager Integration:
- webhook-receiver.sh for processing alerts
- Dashboard endpoint /api/webhooks/alertmanager
- Example alertmanager-config.yaml with routing rules
- Maps alerts to workflows (crashloop, node issues, resources)
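
The routing described above could look roughly like the following Alertmanager fragment. This is a minimal sketch, not the contents of the committed alertmanager-config.yaml: the receiver names and the dashboard host and port are assumptions, while the endpoint path and the two alert names come from this commit.

```yaml
# Sketch only: receiver names and dashboard.example.internal:8080 are assumed.
route:
  receiver: default
  routes:
    # Forward the ArgoCD sync alerts that trigger argocd-sync-failure.yaml.
    - matchers:
        - alertname =~ "ArgoCDAppOutOfSync|ArgoCDAppSyncFailed"
      receiver: k8s-agent-dashboard

receivers:
  - name: default
  - name: k8s-agent-dashboard
    webhook_configs:
      # Path taken from the commit message; host and port are placeholders.
      - url: http://dashboard.example.internal:8080/api/webhooks/alertmanager
        send_resolved: false
```

Only firing alerts are forwarded here (`send_resolved: false`), on the assumption that a resolved alert should not start an incident workflow.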

New Incident Workflows:
- node-issue-response.yaml: Handle NotReady/unreachable nodes
- resource-pressure-response.yaml: Respond to memory/CPU overcommit
- argocd-sync-failure.yaml: Investigate and fix sync failures

Dashboard Updates:
- POST /api/webhooks/alertmanager endpoint
- POST /api/workflows/{name}/complete endpoint
- Alerts create pending actions for visibility

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
110
workflows/incidents/argocd-sync-failure.yaml
Normal file
@@ -0,0 +1,110 @@
name: argocd-sync-failure
description: Investigate and resolve ArgoCD sync failures
version: "1.0"

trigger:
  - alert:
      match:
        alertname: ArgoCDAppOutOfSync
  - alert:
      match:
        alertname: ArgoCDAppSyncFailed
  - manual: true
    inputs:
      - name: app
        description: ArgoCD application name
        required: true

defaults:
  model: sonnet

steps:
  - name: get-app-status
    agent: argocd-operator
    model: haiku
    task: |
      Get detailed status of the application:
      - App name: {{ inputs.app | default(alert.labels.name) }}
      - Sync status and message
      - Health status
      - Last sync attempt and result
      - Current revision vs target revision
    output: app_status

  - name: check-diff
    agent: argocd-operator
    model: sonnet
    task: |
      Analyze the diff between desired and live state:
      - Run argocd app diff
      - Identify what resources differ
      - Check for drift vs intentional changes

      App: {{ steps.get-app-status.output.app_name }}
    output: diff_analysis

  - name: check-git
    agent: git-operator
    model: haiku
    task: |
      Check the GitOps repo for recent changes:
      - Recent commits to the app path
      - Any open PRs affecting this app
      - Validate manifest syntax

      App path: {{ steps.get-app-status.output.source_path }}
    output: git_status

  - name: check-resources
    agent: k8s-diagnostician
    model: haiku
    task: |
      Check related Kubernetes resources:
      - Pod status in the app namespace
      - Any pending resources
      - Events related to the app

      Namespace: {{ steps.get-app-status.output.namespace }}
    output: k8s_status

  - name: diagnose-and-fix
    agent: k8s-orchestrator
    model: sonnet
    task: |
      Diagnose sync failure and recommend fix:

      Evidence:
      - App status: {{ steps.get-app-status.output }}
      - Diff analysis: {{ steps.check-diff.output }}
      - Git status: {{ steps.check-git.output }}
      - K8s resources: {{ steps.check-resources.output }}

      Common causes:
      1. Resource conflict (another controller managing resource)
      2. Invalid manifest (syntax or semantic error)
      3. Missing dependencies (CRDs, secrets, configmaps)
      4. Resource quota exceeded
      5. Image pull failures

      Provide:
      - Root cause
      - Fix recommendation
      - Whether to retry sync or fix manifest first
    output: diagnosis

  - name: attempt-resync
    condition: "{{ steps.diagnose-and-fix.output.should_retry }}"
    agent: argocd-operator
    model: haiku
    task: |
      Attempt to resync the application:
      - Refresh application state
      - If diagnosis suggests, run sync with --force

      App: {{ steps.get-app-status.output.app_name }}
    output: resync_result
    confirm: true

outputs:
  - diagnosis
  - resync_result
||||