Automation components for scheduled and event-driven workflows:
Scheduler:
- scheduler.sh for cron-based workflow execution
- Logs workflow runs to ~/.claude/logs/workflows/
- Notifies dashboard on completion
Alertmanager Integration:
- webhook-receiver.sh for processing alerts
- Dashboard endpoint /api/webhooks/alertmanager
- Example alertmanager-config.yaml with routing rules
- Maps alerts to workflows (crashloop, node issues, resources)
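
A routing rule for this integration might look like the sketch below. It is not the contents of the shipped alertmanager-config.yaml: the receiver names and the dashboard host and port (dashboard.example.internal:8080) are hypothetical placeholders, and only the /api/webhooks/alertmanager path and the two ArgoCD alert names come from this change.

```yaml
# Sketch only: receiver names and the dashboard URL are assumptions.
route:
  receiver: default
  group_by: [alertname, namespace]
  routes:
    # Forward the ArgoCD sync alerts to the dashboard webhook.
    - matchers:
        - 'alertname =~ "ArgoCDAppOutOfSync|ArgoCDAppSyncFailed"'
      receiver: workflow-dashboard

receivers:
  - name: default
  - name: workflow-dashboard
    webhook_configs:
      - url: http://dashboard.example.internal:8080/api/webhooks/alertmanager
        # Resolved notifications are skipped here on the assumption that
        # only firing alerts should start a workflow.
        send_resolved: false
```
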
New Incident Workflows:
- node-issue-response.yaml: Handle NotReady/unreachable nodes
- resource-pressure-response.yaml: Respond to memory/CPU overcommit
- argocd-sync-failure.yaml: Investigate and fix sync failures
Dashboard Updates:
- POST /api/webhooks/alertmanager endpoint
- POST /api/workflows/{name}/complete endpoint
- Alerts create pending actions for visibility
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
111 lines
2.8 KiB
YAML
name: argocd-sync-failure
description: Investigate and resolve ArgoCD sync failures
version: "1.0"

trigger:
  - alert:
      match:
        alertname: ArgoCDAppOutOfSync
  - alert:
      match:
        alertname: ArgoCDAppSyncFailed
  - manual: true
    inputs:
      - name: app
        description: ArgoCD application name
        required: true

defaults:
  model: sonnet

steps:
  - name: get-app-status
    agent: argocd-operator
    model: haiku
    task: |
      Get detailed status of the application:
      - App name: {{ inputs.app | default(alert.labels.name) }}
      - Sync status and message
      - Health status
      - Last sync attempt and result
      - Current revision vs target revision
    output: app_status

  - name: check-diff
    agent: argocd-operator
    model: sonnet
    task: |
      Analyze the diff between desired and live state:
      - Run argocd app diff
      - Identify what resources differ
      - Check for drift vs intentional changes

      App: {{ steps.get-app-status.output.app_name }}
    output: diff_analysis

  - name: check-git
    agent: git-operator
    model: haiku
    task: |
      Check the GitOps repo for recent changes:
      - Recent commits to the app path
      - Any open PRs affecting this app
      - Validate manifest syntax

      App path: {{ steps.get-app-status.output.source_path }}
    output: git_status

  - name: check-resources
    agent: k8s-diagnostician
    model: haiku
    task: |
      Check related Kubernetes resources:
      - Pod status in the app namespace
      - Any pending resources
      - Events related to the app

      Namespace: {{ steps.get-app-status.output.namespace }}
    output: k8s_status

  - name: diagnose-and-fix
    agent: k8s-orchestrator
    model: sonnet
    task: |
      Diagnose sync failure and recommend fix:

      Evidence:
      - App status: {{ steps.get-app-status.output }}
      - Diff analysis: {{ steps.check-diff.output }}
      - Git status: {{ steps.check-git.output }}
      - K8s resources: {{ steps.check-resources.output }}

      Common causes:
      1. Resource conflict (another controller managing resource)
      2. Invalid manifest (syntax or semantic error)
      3. Missing dependencies (CRDs, secrets, configmaps)
      4. Resource quota exceeded
      5. Image pull failures

      Provide:
      - Root cause
      - Fix recommendation
      - Whether to retry sync or fix manifest first
    output: diagnosis

  - name: attempt-resync
    condition: "{{ steps.diagnose-and-fix.output.should_retry }}"
    agent: argocd-operator
    model: haiku
    task: |
      Attempt to resync the application:
      - Refresh application state
      - If diagnosis suggests, run sync with --force

      App: {{ steps.get-app-status.output.app_name }}
    output: resync_result
    confirm: true

outputs:
  - diagnosis
  - resync_result
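
The two alert triggers above assume that Prometheus already fires alerts named ArgoCDAppOutOfSync and ArgoCDAppSyncFailed carrying a name label, since the first step falls back to alert.labels.name. The rules below are a sketch of how such alerts could be defined and are not part of this change. They assume the argocd_app_info and argocd_app_sync_total metrics exposed by the Argo CD application controller, and the durations, severities, and phase values are illustrative guesses to check against the deployed Argo CD version.

```yaml
# Sketch only: thresholds, durations, and severities are assumptions.
groups:
  - name: argocd-sync
    rules:
      - alert: ArgoCDAppOutOfSync
        # Fires when an application has reported OutOfSync for 15 minutes.
        expr: argocd_app_info{sync_status="OutOfSync"} == 1
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Application {{ $labels.name }} is out of sync"

      - alert: ArgoCDAppSyncFailed
        # Fires when any sync ended in a failed phase in the last 10 minutes.
        # Aggregating by name keeps the label the workflow reads.
        expr: sum by (name, namespace) (increase(argocd_app_sync_total{phase=~"Error|Failed"}[10m])) > 0
        labels:
          severity: critical
        annotations:
          summary: "Sync failed for application {{ $labels.name }}"
```

Keeping name in the aggregation matters because an alert without that label would leave the app name empty when the workflow is triggered by an alert rather than manually.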