name: argocd-sync-failure
description: Investigate and resolve ArgoCD sync failures
version: "1.0"

trigger:
  - alert:
      match:
        alertname: ArgoCDAppOutOfSync
  - alert:
      match:
        alertname: ArgoCDAppSyncFailed
  - manual: true

inputs:
  - name: app
    description: ArgoCD application name
    required: true

defaults:
  model: sonnet

steps:
  - name: get-app-status
    agent: argocd-operator
    model: haiku
    task: |
      Get detailed status of the application:
      - App name: {{ inputs.app | default(alert.labels.name) }}
      - Sync status and message
      - Health status
      - Last sync attempt and result
      - Current revision vs target revision
    output: app_status

  - name: check-diff
    agent: argocd-operator
    model: sonnet
    task: |
      Analyze the diff between desired and live state:
      - Run argocd app diff
      - Identify what resources differ
      - Check for drift vs intentional changes

      App: {{ steps.get-app-status.output.app_name }}
    output: diff_analysis

  - name: check-git
    agent: git-operator
    model: haiku
    task: |
      Check the GitOps repo for recent changes:
      - Recent commits to the app path
      - Any open PRs affecting this app
      - Validate manifest syntax

      App path: {{ steps.get-app-status.output.source_path }}
    output: git_status

  - name: check-resources
    agent: k8s-diagnostician
    model: haiku
    task: |
      Check related Kubernetes resources:
      - Pod status in the app namespace
      - Any pending resources
      - Events related to the app

      Namespace: {{ steps.get-app-status.output.namespace }}
    output: k8s_status

  - name: diagnose-and-fix
    agent: k8s-orchestrator
    model: sonnet
    task: |
      Diagnose sync failure and recommend fix:

      Evidence:
      - App status: {{ steps.get-app-status.output }}
      - Diff analysis: {{ steps.check-diff.output }}
      - Git status: {{ steps.check-git.output }}
      - K8s resources: {{ steps.check-resources.output }}

      Common causes:
      1. Resource conflict (another controller managing resource)
      2. Invalid manifest (syntax or semantic error)
      3. Missing dependencies (CRDs, secrets, configmaps)
      4. Resource quota exceeded
      5. Image pull failures

      Provide:
      - Root cause
      - Fix recommendation
      - Whether to retry sync or fix manifest first
    output: diagnosis

  - name: attempt-resync
    condition: "{{ steps.diagnose-and-fix.output.should_retry }}"
    agent: argocd-operator
    model: haiku
    task: |
      Attempt to resync the application:
      - Refresh application state
      - If diagnosis suggests, run sync with --force

      App: {{ steps.get-app-status.output.app_name }}
    output: resync_result
    confirm: true

outputs:
  - diagnosis
  - resync_result