claude-code/workflows/incidents/argocd-sync-failure.yaml
OpenCode Test c14bae9a12 feat: Implement Phase 3 automation for K8s agent system
Automation components for scheduled and event-driven workflows:

Scheduler:
- scheduler.sh for cron-based workflow execution
- Logs workflow runs to ~/.claude/logs/workflows/
- Notifies dashboard on completion

Alertmanager Integration:
- webhook-receiver.sh for processing alerts
- Dashboard endpoint /api/webhooks/alertmanager
- Example alertmanager-config.yaml with routing rules
- Maps alerts to workflows (crashloop, node issues, resources)
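
A minimal sketch of the kind of routing rule described above, assuming the dashboard is reachable at the hypothetical address `dashboard.example.internal:8080` and that the receiver is named `k8s-agent-dashboard` (neither name comes from this commit). It uses only standard Alertmanager configuration keys and may differ from the actual alertmanager-config.yaml:

```yaml
# Sketch only: host, port, and receiver names are assumptions.
route:
  receiver: default
  routes:
    # Send the two ArgoCD alerts used by argocd-sync-failure.yaml to the dashboard webhook.
    - matchers:
        - 'alertname =~ "ArgoCDAppOutOfSync|ArgoCDAppSyncFailed"'
      receiver: k8s-agent-dashboard

receivers:
  - name: default
  - name: k8s-agent-dashboard
    webhook_configs:
      # The path matches the dashboard endpoint listed above.
      - url: http://dashboard.example.internal:8080/api/webhooks/alertmanager
        send_resolved: false
```

Setting `send_resolved: false` keeps resolved notifications from reaching the webhook, on the assumption that only firing alerts should start a workflow.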

New Incident Workflows:
- node-issue-response.yaml: Handle NotReady/unreachable nodes
- resource-pressure-response.yaml: Respond to memory/CPU overcommit
- argocd-sync-failure.yaml: Investigate and fix sync failures

Dashboard Updates:
- POST /api/webhooks/alertmanager endpoint
- POST /api/workflows/{name}/complete endpoint
- Alerts create pending actions for visibility

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-26 11:49:05 -08:00

111 lines
2.8 KiB
YAML

name: argocd-sync-failure
description: Investigate and resolve ArgoCD sync failures
version: "1.0"

trigger:
  - alert:
      match:
        alertname: ArgoCDAppOutOfSync
  - alert:
      match:
        alertname: ArgoCDAppSyncFailed
  - manual: true

inputs:
  - name: app
    description: ArgoCD application name
    required: true

defaults:
  model: sonnet

steps:
  - name: get-app-status
    agent: argocd-operator
    model: haiku
    task: |
      Get detailed status of the application:
      - App name: {{ inputs.app | default(alert.labels.name) }}
      - Sync status and message
      - Health status
      - Last sync attempt and result
      - Current revision vs target revision
    output: app_status

  - name: check-diff
    agent: argocd-operator
    model: sonnet
    task: |
      Analyze the diff between desired and live state:
      - Run argocd app diff
      - Identify what resources differ
      - Check for drift vs intentional changes

      App: {{ steps.get-app-status.output.app_name }}
    output: diff_analysis

  - name: check-git
    agent: git-operator
    model: haiku
    task: |
      Check the GitOps repo for recent changes:
      - Recent commits to the app path
      - Any open PRs affecting this app
      - Validate manifest syntax

      App path: {{ steps.get-app-status.output.source_path }}
    output: git_status

  - name: check-resources
    agent: k8s-diagnostician
    model: haiku
    task: |
      Check related Kubernetes resources:
      - Pod status in the app namespace
      - Any pending resources
      - Events related to the app

      Namespace: {{ steps.get-app-status.output.namespace }}
    output: k8s_status

  - name: diagnose-and-fix
    agent: k8s-orchestrator
    model: sonnet
    task: |
      Diagnose sync failure and recommend fix:

      Evidence:
      - App status: {{ steps.get-app-status.output }}
      - Diff analysis: {{ steps.check-diff.output }}
      - Git status: {{ steps.check-git.output }}
      - K8s resources: {{ steps.check-resources.output }}

      Common causes:
      1. Resource conflict (another controller managing resource)
      2. Invalid manifest (syntax or semantic error)
      3. Missing dependencies (CRDs, secrets, configmaps)
      4. Resource quota exceeded
      5. Image pull failures

      Provide:
      - Root cause
      - Fix recommendation
      - Whether to retry sync or fix manifest first
    output: diagnosis

  - name: attempt-resync
    condition: "{{ steps.diagnose-and-fix.output.should_retry }}"
    agent: argocd-operator
    model: haiku
    task: |
      Attempt to resync the application:
      - Refresh application state
      - If diagnosis suggests, run sync with --force

      App: {{ steps.get-app-status.output.app_name }}
    output: resync_result
    confirm: true

outputs:
  - diagnosis
  - resync_result
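
The workflow triggers on alerts named ArgoCDAppOutOfSync and ArgoCDAppSyncFailed and reads the application name from `alert.labels.name`. The following is a hedged sketch of Prometheus alerting rules that could produce such alerts. It is not part of this repository. It assumes the ArgoCD application controller metrics `argocd_app_info` and `argocd_app_sync_total` are being scraped. The group name, `for` durations, window, and severity labels are illustrative choices:

```yaml
# Sketch only: thresholds, durations, and severities are assumptions.
groups:
  - name: argocd-sync
    rules:
      - alert: ArgoCDAppOutOfSync
        # argocd_app_info carries the application name in its `name` label,
        # which is what the workflow reads as alert.labels.name.
        expr: argocd_app_info{sync_status="OutOfSync"} == 1
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "ArgoCD application {{ $labels.name }} is OutOfSync"

      - alert: ArgoCDAppSyncFailed
        # Fires when any sync ended in the Error or Failed phase in the last 10 minutes.
        expr: increase(argocd_app_sync_total{phase=~"Error|Failed"}[10m]) > 0
        labels:
          severity: critical
        annotations:
          summary: "ArgoCD application {{ $labels.name }} had a failed sync"
```

Both metrics expose a `name` label, so the `default(alert.labels.name)` fallback in the get-app-status step would resolve without any relabeling under these assumptions.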