543 lines
12 KiB
Markdown
543 lines
12 KiB
Markdown
---
|
|
name: kubernetes
|
|
description: |
|
|
Comprehensive Kubernetes and OpenShift cluster management skill covering operations, troubleshooting, manifest generation, security, and GitOps. Use this skill when:
|
|
(1) Cluster operations: upgrades, backups, node management, scaling, monitoring setup
|
|
(2) Troubleshooting: pod failures, networking issues, storage problems, performance analysis
|
|
(3) Creating manifests: Deployments, StatefulSets, Services, Ingress, NetworkPolicies, RBAC
|
|
(4) Security: audits, Pod Security Standards, RBAC, secrets management, vulnerability scanning
|
|
(5) GitOps: ArgoCD, Flux, Kustomize, Helm, CI/CD pipelines, progressive delivery
|
|
(6) OpenShift-specific: SCCs, Routes, Operators, Builds, ImageStreams
|
|
(7) Multi-cloud: AKS, EKS, GKE, ARO, ROSA operations
|
|
metadata:
|
|
author: cluster-skills
|
|
version: "1.0.0"
|
|
---
|
|
|
|
# Kubernetes & OpenShift Cluster Management
|
|
|
|
Comprehensive skill for Kubernetes and OpenShift clusters covering operations, troubleshooting, manifests, security, and GitOps.
|
|
|
|
## Current Versions (January 2026)
|
|
|
|
| Platform | Version | Documentation |
|
|
|----------|---------|---------------|
|
|
| **Kubernetes** | 1.31.x | https://kubernetes.io/docs/ |
|
|
| **OpenShift** | 4.17.x | https://docs.openshift.com/ |
|
|
| **EKS** | 1.31 | https://docs.aws.amazon.com/eks/ |
|
|
| **AKS** | 1.31 | https://learn.microsoft.com/azure/aks/ |
|
|
| **GKE** | 1.31 | https://cloud.google.com/kubernetes-engine/docs |
|
|
|
|
### Key Tools
|
|
|
|
| Tool | Version | Purpose |
|
|
|------|---------|---------|
|
|
| **ArgoCD** | v2.13.x | GitOps deployments |
|
|
| **Flux** | v2.4.x | GitOps toolkit |
|
|
| **Kustomize** | v5.5.x | Manifest customization |
|
|
| **Helm** | v3.16.x | Package management |
|
|
| **Velero** | 1.15.x | Backup/restore |
|
|
| **Trivy** | 0.58.x | Security scanning |
|
|
| **Kyverno** | 1.13.x | Policy engine |
|
|
|
|
## Command Convention
|
|
|
|
**IMPORTANT**: Use `kubectl` for standard Kubernetes. Use `oc` for OpenShift/ARO.
|
|
|
|
---
|
|
|
|
## 1. CLUSTER OPERATIONS
|
|
|
|
### Node Management
|
|
|
|
```bash
|
|
# View nodes
|
|
kubectl get nodes -o wide
|
|
|
|
# Drain node for maintenance
|
|
kubectl drain ${NODE} --ignore-daemonsets --delete-emptydir-data --grace-period=60
|
|
|
|
# Uncordon after maintenance
|
|
kubectl uncordon ${NODE}
|
|
|
|
# View node resources
|
|
kubectl top nodes
|
|
```
|
|
|
|
### Cluster Upgrades
|
|
|
|
**AKS:**
|
|
```bash
|
|
az aks get-upgrades -g ${RG} -n ${CLUSTER} -o table
|
|
az aks upgrade -g ${RG} -n ${CLUSTER} --kubernetes-version ${VERSION}
|
|
```
|
|
|
|
**EKS:**
|
|
```bash
|
|
aws eks update-cluster-version --name ${CLUSTER} --kubernetes-version ${VERSION}
|
|
```
|
|
|
|
**GKE:**
|
|
```bash
|
|
gcloud container clusters upgrade ${CLUSTER} --master --cluster-version ${VERSION}
|
|
```
|
|
|
|
**OpenShift:**
|
|
```bash
|
|
oc adm upgrade --to=${VERSION}
|
|
oc get clusterversion
|
|
```
|
|
|
|
### Backup with Velero
|
|
|
|
```bash
|
|
# Install Velero
|
|
velero install --provider ${PROVIDER} --bucket ${BUCKET} --secret-file ${CREDS}
|
|
|
|
# Create backup
|
|
velero backup create ${BACKUP_NAME} --include-namespaces ${NS}
|
|
|
|
# Restore
|
|
velero restore create --from-backup ${BACKUP_NAME}
|
|
```
|
|
|
|
---
|
|
|
|
## 2. TROUBLESHOOTING
|
|
|
|
### Health Assessment
|
|
|
|
Run the bundled script for comprehensive health check:
|
|
```bash
|
|
bash scripts/cluster-health-check.sh
|
|
```
|
|
|
|
### Pod Status Interpretation
|
|
|
|
| Status | Meaning | Action |
|
|
|--------|---------|--------|
|
|
| `Pending` | Scheduling issue | Check resources, nodeSelector, tolerations |
|
|
| `CrashLoopBackOff` | Container crashing | Check logs: `kubectl logs ${POD} --previous` |
|
|
| `ImagePullBackOff` | Image unavailable | Verify image name, registry access |
|
|
| `OOMKilled` | Out of memory | Increase memory limits |
|
|
| `Evicted` | Node pressure | Check node resources |
|
|
|
|
### Debugging Commands
|
|
|
|
```bash
|
|
# Pod logs (current and previous)
|
|
kubectl logs ${POD} -c ${CONTAINER} --previous
|
|
|
|
# Multi-pod logs with stern
|
|
stern ${LABEL_SELECTOR} -n ${NS}
|
|
|
|
# Exec into pod
|
|
kubectl exec -it ${POD} -- /bin/sh
|
|
|
|
# Pod events
|
|
kubectl describe pod ${POD} | grep -A 20 Events
|
|
|
|
# Cluster events (sorted by time)
|
|
kubectl get events -A --sort-by='.lastTimestamp' | tail -50
|
|
```
|
|
|
|
### Network Troubleshooting
|
|
|
|
```bash
|
|
# Test DNS
|
|
kubectl run -it --rm debug --image=busybox -- nslookup kubernetes.default
|
|
|
|
# Test service connectivity
|
|
kubectl run -it --rm debug --image=curlimages/curl -- curl -v http://${SVC}.${NS}:${PORT}
|
|
|
|
# Check endpoints
|
|
kubectl get endpoints ${SVC}
|
|
```
|
|
|
|
---
|
|
|
|
## 3. MANIFEST GENERATION
|
|
|
|
### Production Deployment Template
|
|
|
|
```yaml
|
|
apiVersion: apps/v1
|
|
kind: Deployment
|
|
metadata:
|
|
name: ${APP_NAME}
|
|
namespace: ${NAMESPACE}
|
|
labels:
|
|
app.kubernetes.io/name: ${APP_NAME}
|
|
app.kubernetes.io/version: "${VERSION}"
|
|
spec:
|
|
replicas: 3
|
|
strategy:
|
|
type: RollingUpdate
|
|
rollingUpdate:
|
|
maxSurge: 1
|
|
maxUnavailable: 0
|
|
selector:
|
|
matchLabels:
|
|
app.kubernetes.io/name: ${APP_NAME}
|
|
template:
|
|
metadata:
|
|
labels:
|
|
app.kubernetes.io/name: ${APP_NAME}
|
|
spec:
|
|
serviceAccountName: ${APP_NAME}
|
|
securityContext:
|
|
runAsNonRoot: true
|
|
runAsUser: 1000
|
|
fsGroup: 1000
|
|
seccompProfile:
|
|
type: RuntimeDefault
|
|
containers:
|
|
- name: ${APP_NAME}
|
|
image: ${IMAGE}:${TAG}
|
|
ports:
|
|
- name: http
|
|
containerPort: 8080
|
|
securityContext:
|
|
allowPrivilegeEscalation: false
|
|
readOnlyRootFilesystem: true
|
|
capabilities:
|
|
drop: ["ALL"]
|
|
resources:
|
|
requests:
|
|
cpu: 100m
|
|
memory: 128Mi
|
|
limits:
|
|
cpu: 500m
|
|
memory: 512Mi
|
|
livenessProbe:
|
|
httpGet:
|
|
path: /healthz
|
|
port: http
|
|
initialDelaySeconds: 10
|
|
periodSeconds: 10
|
|
readinessProbe:
|
|
httpGet:
|
|
path: /ready
|
|
port: http
|
|
initialDelaySeconds: 5
|
|
periodSeconds: 5
|
|
volumeMounts:
|
|
- name: tmp
|
|
mountPath: /tmp
|
|
volumes:
|
|
- name: tmp
|
|
emptyDir: {}
|
|
affinity:
|
|
podAntiAffinity:
|
|
preferredDuringSchedulingIgnoredDuringExecution:
|
|
- weight: 100
|
|
podAffinityTerm:
|
|
labelSelector:
|
|
matchLabels:
|
|
app.kubernetes.io/name: ${APP_NAME}
|
|
topologyKey: kubernetes.io/hostname
|
|
```
|
|
|
|
### Service & Ingress
|
|
|
|
```yaml
|
|
apiVersion: v1
|
|
kind: Service
|
|
metadata:
|
|
name: ${APP_NAME}
|
|
spec:
|
|
selector:
|
|
app.kubernetes.io/name: ${APP_NAME}
|
|
ports:
|
|
- name: http
|
|
port: 80
|
|
targetPort: http
|
|
---
|
|
apiVersion: networking.k8s.io/v1
|
|
kind: Ingress
|
|
metadata:
|
|
name: ${APP_NAME}
|
|
annotations:
|
|
nginx.ingress.kubernetes.io/ssl-redirect: "true"
|
|
spec:
|
|
ingressClassName: nginx
|
|
tls:
|
|
- hosts:
|
|
- ${HOST}
|
|
secretName: ${APP_NAME}-tls
|
|
rules:
|
|
- host: ${HOST}
|
|
http:
|
|
paths:
|
|
- path: /
|
|
pathType: Prefix
|
|
backend:
|
|
service:
|
|
name: ${APP_NAME}
|
|
port:
|
|
name: http
|
|
```
|
|
|
|
### OpenShift Route
|
|
|
|
```yaml
|
|
apiVersion: route.openshift.io/v1
|
|
kind: Route
|
|
metadata:
|
|
name: ${APP_NAME}
|
|
spec:
|
|
to:
|
|
kind: Service
|
|
name: ${APP_NAME}
|
|
port:
|
|
targetPort: http
|
|
tls:
|
|
termination: edge
|
|
insecureEdgeTerminationPolicy: Redirect
|
|
```
|
|
|
|
Use the bundled script for manifest generation:
|
|
```bash
|
|
bash scripts/generate-manifest.sh deployment myapp production
|
|
```
|
|
|
|
---
|
|
|
|
## 4. SECURITY
|
|
|
|
### Security Audit
|
|
|
|
Run the bundled script:
|
|
```bash
|
|
bash scripts/security-audit.sh [namespace]
|
|
```
|
|
|
|
### Pod Security Standards
|
|
|
|
```yaml
|
|
apiVersion: v1
|
|
kind: Namespace
|
|
metadata:
|
|
name: ${NAMESPACE}
|
|
labels:
|
|
pod-security.kubernetes.io/enforce: restricted
|
|
pod-security.kubernetes.io/audit: baseline
|
|
pod-security.kubernetes.io/warn: restricted
|
|
```
|
|
|
|
### NetworkPolicy (Zero Trust)
|
|
|
|
```yaml
|
|
apiVersion: networking.k8s.io/v1
|
|
kind: NetworkPolicy
|
|
metadata:
|
|
name: ${APP_NAME}-policy
|
|
spec:
|
|
podSelector:
|
|
matchLabels:
|
|
app.kubernetes.io/name: ${APP_NAME}
|
|
policyTypes:
|
|
- Ingress
|
|
- Egress
|
|
ingress:
|
|
- from:
|
|
- podSelector:
|
|
matchLabels:
|
|
app.kubernetes.io/name: frontend
|
|
ports:
|
|
- protocol: TCP
|
|
port: 8080
|
|
egress:
|
|
- to:
|
|
- podSelector:
|
|
matchLabels:
|
|
app.kubernetes.io/name: database
|
|
ports:
|
|
- protocol: TCP
|
|
port: 5432
|
|
# Allow DNS
|
|
- to:
|
|
- namespaceSelector: {}
|
|
podSelector:
|
|
matchLabels:
|
|
k8s-app: kube-dns
|
|
ports:
|
|
- protocol: UDP
|
|
port: 53
|
|
```
|
|
|
|
### RBAC Best Practices
|
|
|
|
```yaml
|
|
apiVersion: v1
|
|
kind: ServiceAccount
|
|
metadata:
|
|
name: ${APP_NAME}
|
|
---
|
|
apiVersion: rbac.authorization.k8s.io/v1
|
|
kind: Role
|
|
metadata:
|
|
name: ${APP_NAME}-role
|
|
rules:
|
|
- apiGroups: [""]
|
|
resources: ["configmaps"]
|
|
verbs: ["get", "list"]
|
|
---
|
|
apiVersion: rbac.authorization.k8s.io/v1
|
|
kind: RoleBinding
|
|
metadata:
|
|
name: ${APP_NAME}-binding
|
|
subjects:
|
|
- kind: ServiceAccount
|
|
name: ${APP_NAME}
|
|
roleRef:
|
|
apiGroup: rbac.authorization.k8s.io
|
|
kind: Role
|
|
name: ${APP_NAME}-role
|
|
```
|
|
|
|
### Image Scanning
|
|
|
|
```bash
|
|
# Scan image with Trivy
|
|
trivy image ${IMAGE}:${TAG}
|
|
|
|
# Scan with severity filter
|
|
trivy image --severity HIGH,CRITICAL ${IMAGE}:${TAG}
|
|
|
|
# Generate SBOM
|
|
trivy image --format spdx-json -o sbom.json ${IMAGE}:${TAG}
|
|
```
|
|
|
|
---
|
|
|
|
## 5. GITOPS
|
|
|
|
### ArgoCD Application
|
|
|
|
```yaml
|
|
apiVersion: argoproj.io/v1alpha1
|
|
kind: Application
|
|
metadata:
|
|
name: ${APP_NAME}
|
|
namespace: argocd
|
|
finalizers:
|
|
- resources-finalizer.argocd.argoproj.io
|
|
spec:
|
|
project: default
|
|
source:
|
|
repoURL: ${GIT_REPO}
|
|
targetRevision: main
|
|
path: k8s/overlays/${ENV}
|
|
destination:
|
|
server: https://kubernetes.default.svc
|
|
namespace: ${NAMESPACE}
|
|
syncPolicy:
|
|
automated:
|
|
prune: true
|
|
selfHeal: true
|
|
syncOptions:
|
|
- CreateNamespace=true
|
|
```
|
|
|
|
### Kustomize Structure
|
|
|
|
```
|
|
k8s/
|
|
├── base/
|
|
│ ├── kustomization.yaml
|
|
│ ├── deployment.yaml
|
|
│ └── service.yaml
|
|
└── overlays/
|
|
├── dev/
|
|
│ └── kustomization.yaml
|
|
├── staging/
|
|
│ └── kustomization.yaml
|
|
└── prod/
|
|
└── kustomization.yaml
|
|
```
|
|
|
|
**base/kustomization.yaml:**
|
|
```yaml
|
|
apiVersion: kustomize.config.k8s.io/v1beta1
|
|
kind: Kustomization
|
|
resources:
|
|
- deployment.yaml
|
|
- service.yaml
|
|
```
|
|
|
|
**overlays/prod/kustomization.yaml:**
|
|
```yaml
|
|
apiVersion: kustomize.config.k8s.io/v1beta1
|
|
kind: Kustomization
|
|
resources:
|
|
- ../../base
|
|
namePrefix: prod-
|
|
namespace: production
|
|
replicas:
|
|
- name: myapp
|
|
count: 5
|
|
images:
|
|
- name: myregistry/myapp
|
|
newTag: v1.2.3
|
|
```
|
|
|
|
### GitHub Actions CI/CD
|
|
|
|
```yaml
|
|
name: Build and Deploy
|
|
on:
|
|
push:
|
|
branches: [main]
|
|
|
|
jobs:
|
|
build:
|
|
runs-on: ubuntu-latest
|
|
steps:
|
|
- uses: actions/checkout@v4
|
|
|
|
- name: Build and push image
|
|
uses: docker/build-push-action@v5
|
|
with:
|
|
push: true
|
|
tags: ${{ secrets.REGISTRY }}/${{ github.event.repository.name }}:${{ github.sha }}
|
|
|
|
- name: Update Kustomize image
|
|
run: |
|
|
cd k8s/overlays/prod
|
|
kustomize edit set image myapp=${{ secrets.REGISTRY }}/${{ github.event.repository.name }}:${{ github.sha }}
|
|
|
|
- name: Commit and push
|
|
run: |
|
|
git config user.name "github-actions"
|
|
git config user.email "github-actions@github.com"
|
|
git add .
|
|
git commit -m "Update image to ${{ github.sha }}"
|
|
git push
|
|
```
|
|
|
|
Use the bundled script for ArgoCD sync:
|
|
```bash
|
|
bash scripts/argocd-app-sync.sh ${APP_NAME} --prune
|
|
```
|
|
|
|
---
|
|
|
|
## Helper Scripts
|
|
|
|
This skill includes automation scripts in the `scripts/` directory:
|
|
|
|
| Script | Purpose |
|
|
|--------|---------|
|
|
| `cluster-health-check.sh` | Comprehensive cluster health assessment with scoring |
|
|
| `security-audit.sh` | Security posture audit (privileged, root, RBAC, NetworkPolicy) |
|
|
| `node-maintenance.sh` | Safe node drain and maintenance prep |
|
|
| `pre-upgrade-check.sh` | Pre-upgrade validation checklist |
|
|
| `generate-manifest.sh` | Generate production-ready K8s manifests |
|
|
| `argocd-app-sync.sh` | ArgoCD application sync helper |
|
|
|
|
Run any script:
|
|
```bash
|
|
bash scripts/<script-name>.sh [arguments]
|
|
```
|