# Task 11 - Kubernetes Deployment Validation Report

## Configuration Review Summary

### ✅ Correctly Configured

#### 1. Tailscale Ingress

All three ingress resources are properly defined:

- **App** (`app.<tailnet>`) → web service port 3000
- **MinIO S3** (`minio.<tailnet>`) → MinIO port 9000
- **MinIO Console** (`minio-console.<tailnet>`) → MinIO console port 9001

Each ingress correctly:

- Uses the Tailscale ingress class
- Configures TLS with the appropriate hostname
- Routes to the correct service and port

#### 2. Tailscale Service Option (LoadBalancer)

An alternative exposure method via a Tailscale LoadBalancer is available:

- `helm/porthole/templates/service-minio-tailscale-s3.yaml.tpl` - S3 API at `minio.<tailnet>`
- `helm/porthole/templates/service-minio-tailscale-console.yaml.tpl` - Console at `minio-console.<tailnet>`

Currently disabled (`minio.tailscaleServiceS3.enabled: false` in values.yaml).

#### 3. Node Scheduling

All heavy workloads are configured with `schedulingClass: compute`:

- web (1Gi limit)
- worker (2Gi limit)
- postgres (2Gi limit)
- redis (512Mi limit)
- minio (2Gi limit)

The scheduling helper (`_helpers.tpl:40-46`) applies the `scheduling.compute.affinity` block, which steers these workloads onto nodes labeled `node-class=compute`.

#### 4. Longhorn PVCs

Both stateful workloads use Longhorn PVCs:

- Postgres: 20Gi storage
- MinIO: 200Gi storage

#### 5. Resource Limits

All workloads have resource requests and limits appropriate for Pi hardware:

- Web: 200m CPU / 256Mi → 1000m CPU / 1Gi
- Worker: 500m CPU / 1Gi → 2000m CPU / 2Gi
- Postgres: 500m CPU / 1Gi → 1500m CPU / 2Gi
- Redis: 50m CPU / 128Mi → 300m CPU / 512Mi
- MinIO: 250m CPU / 512Mi → 1500m CPU / 2Gi

#### 6. Cleanup CronJob

Staging cleanup is properly configured but disabled by default:

- Only targets the `staging/` prefix (safe, never touches `originals/`)
- Removes files older than 14 days
- Must be enabled manually: `cronjobs.cleanupStaging.enabled: true`

---

### ⚠️ Issues & Recommendations

#### 1. Node Affinity Now Uses "Required"

**Status:** ✅ Fixed - Affinity changed to `requiredDuringSchedulingIgnoredDuringExecution`. All heavy workloads now require `node-class=compute` nodes (Pi 5). The Pi 3 node is tainted with `capacity=low:NoExecute`, which provides an additional safeguard preventing any pods from being scheduled on it.

**Alternative:** Keep preferred affinity but add anti-affinity for the Pi 3 node (requires labeling Pi 3 with `node-class=tiny`):

```yaml
scheduling:
  compute:
    affinity:
      nodeAffinity:
        preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            preference:
              matchExpressions:
                - key: node-class
                  operator: In
                  values:
                    - compute
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
            - matchExpressions:
                - key: node-class
                  operator: NotIn
                  values:
                    - tiny
```
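For reference, the required form now in effect presumably looks like the following. This is a minimal sketch that mirrors the key layout of the alternative above; it is not copied from the live chart:

```yaml
# Sketch of the required-affinity values; layout borrowed from the alternative above.
scheduling:
  compute:
    affinity:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
            - matchExpressions:
                - key: node-class
                  operator: In
                  values:
                    - compute
```

With the required form, pods stay Pending rather than falling back to the Pi 3 node when no compute node is schedulable, which is the intended trade-off.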
#### 2. No Range Request Optimizations on Ingress

**Issue:** The Tailscale ingress resources (`ingress-tailscale.yaml.tpl`) have no annotations for the proxy timeout or buffer settings that matter for video streaming and Range requests.

**Risk:** Video seeking may be unreliable or fail for large files served through Tailscale Ingress.

**Recommendation 1 (Preferred):** Enable the Tailscale LoadBalancer Service for MinIO S3 instead of Ingress. This provides a more direct connection for streaming:

```yaml
# In values.yaml
minio:
  tailscaleServiceS3:
    enabled: true
    hostnameLabel: minio
```

This will:

- Create a LoadBalancer service accessible via `https://minio.<tailnet>`
- Provide more reliable Range request support
- Bypass potential ingress buffering issues

**Recommendation 2 (If using Ingress):** Add custom annotations for timeout/buffer optimization. Add to `values.yaml`:

```yaml
minio:
  ingressS3:
    extraAnnotations:
      nginx.ingress.kubernetes.io/proxy-body-size: "500m"
      nginx.ingress.kubernetes.io/proxy-request-buffering: "off"
      nginx.ingress.kubernetes.io/proxy-max-temp-file-size: "0"
      nginx.ingress.kubernetes.io/proxy-read-timeout: "600"
      nginx.ingress.kubernetes.io/proxy-send-timeout: "600"
```

Note: these annotations are specific to the nginx ingress controller. If using Tailscale ingress, check the Tailscale documentation for equivalent settings.

#### 3. Cleanup CronJob Disabled by Default

**Issue:** `cronjobs.cleanupStaging.enabled: false` in values.yaml means old staging files will accumulate indefinitely.

**Risk:** Staging files from failed or interrupted uploads will fill up the MinIO PVC over time.

**Recommendation:** Enable cleanup after initial testing:

```bash
helm upgrade --install porthole helm/porthole -f values.yaml \
  --set cronjobs.cleanupStaging.enabled=true
```

Or set in values.yaml:

```yaml
cronjobs:
  cleanupStaging:
    enabled: true
```
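For context, the kind of command such a cleanup job typically runs against the staging prefix looks roughly like the following. This is a sketch, not the chart's actual CronJob: the `mc` alias name (`porthole-minio`), the in-cluster endpoint, and the bucket name (`porthole`) are assumptions.

```bash
# Sketch only: alias name, endpoint, credentials, and bucket name are assumed.
mc alias set porthole-minio http://porthole-minio:9000 "$MINIO_ROOT_USER" "$MINIO_ROOT_PASSWORD"

# Remove staging objects older than 14 days; never touches originals/.
mc rm --recursive --force --older-than 14d porthole-minio/porthole/staging/
```

A dry run with `mc rm --fake` is a safe way to confirm the prefix match before enabling the CronJob.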
---

## Deployment Validation Commands

### 1. Verify Pod Scheduling

```bash
# Check all pods are on Pi 5 nodes (not Pi 3)
kubectl get pods -n porthole -o wide

# Expected: all pods except optional cronjobs should be on nodes with node-class=compute
```

### 2. Verify Tailscale Endpoints

```bash
# Check Tailscale ingress status
kubectl get ingress -n porthole

# If the LoadBalancer service is enabled:
kubectl get svc -n porthole -l app.kubernetes.io/component=minio
```

### 3. Verify PVCs

```bash
# Check Longhorn PVCs are created and bound
kubectl get pvc -n porthole

# Check PVC status
kubectl describe pvc -n porthole | grep -A 5 "Status:"
```

### 4. Verify Resource Usage

```bash
# Check current resource usage
kubectl top pods -n porthole

# Check resource requests/limits
kubectl get pods -n porthole -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{range .spec.containers[*]} {.name}: CPU={.resources.requests.cpu}→{.resources.limits.cpu}, MEM={.resources.requests.memory}→{.resources.limits.memory}{"\n"}{end}{"\n"}{end}'
```

### 5. Test Presigned URL (HTTPS)

```bash
# Get a presigned URL (replace <asset-id>)
curl -sS "https://app.<tailnet>/api/assets/<asset-id>/url?variant=original" | jq .url

# Expected: the URL starts with "https://minio.<tailnet>/..."
# NOT "http://..."
```

### 6. Test Range Request Support

```bash
# Get a presigned URL
URL=$(curl -sS "https://app.<tailnet>/api/assets/<asset-id>/url?variant=original" | jq -r .url)

# Test a Range request (request the first 1KB)
curl -sS -D- -H 'Range: bytes=0-1023' "$URL" -o /dev/null

# Expected: HTTP/1.1 206 Partial Content
# If you see 200 OK, Range requests are not working
```

### 7. Verify Worker Concurrency

```bash
# Check the BullMQ configuration in the worker
kubectl exec -n porthole deployment/porthole-worker -- cat /app/src/index.ts | grep -A 5 "concurrency"

# Expected: concurrency: 1 (or at most 2 for Pi hardware)
```

### 8. Test Timeline with Failed Assets

```bash
# Query the timeline with failed assets included
curl -sS "https://app.<tailnet>/api/tree?includeFailed=1" | jq '.nodes[] | select(.count_ready < .count_total)'

# Should return nodes where some assets have status != 'ready'
```

### 9. Database Verification

```bash
# Connect to Postgres
kubectl exec -it -n porthole statefulset/porthole-postgres -- psql -U porthole -d porthole
```

```sql
-- Check failed assets
SELECT id, media_type, status, error_message, date_confidence
FROM assets
WHERE status = 'failed'
LIMIT 10;

-- Check assets without a capture date (should not appear in the timeline)
SELECT COUNT(*) FROM assets WHERE capture_ts_utc IS NULL;

-- Verify external originals are not copied to canonical (should be 0)
SELECT COUNT(*) FROM assets
WHERE source_key LIKE 'originals/%' AND canonical_key IS NOT NULL;
```

---

## End-to-End Deployment Verification Checklist

### Pre-Deployment

- [ ] Label Pi 5 nodes: `kubectl label node <pi5-node-1> node-class=compute`
- [ ] Label Pi 5 nodes: `kubectl label node <pi5-node-2> node-class=compute`
- [ ] Verify Pi 3 has the taint: `kubectl taint node <pi3-node> capacity=low:NoExecute`
- [ ] Set `global.tailscale.tailnetFQDN` in values.yaml
- [ ] Set secret values (postgres password, minio credentials)
- [ ] Build and push multi-arch images to the registry

### Deployment

```bash
# Install the Helm chart
helm install porthole helm/porthole -f values.yaml --namespace porthole --create-namespace

# Wait for pods to be ready
kubectl wait --for=condition=ready pod -l app.kubernetes.io/name=porthole -n porthole --timeout=10m
```

### Post-Deployment Verification

- [ ] All pods are running on Pi 5 nodes (check `kubectl get pods -n porthole -o wide`)
- [ ] PVCs are created and bound (`kubectl get pvc -n porthole`)
- [ ] Tailscale endpoints are accessible:
  - [ ] `https://app.<tailnet>` - web UI loads
  - [ ] `https://minio.<tailnet>` - MinIO S3 accessible (`mc ls`; see the sketch after the risk table below)
  - [ ] `https://minio-console.<tailnet>` - MinIO console loads
- [ ] Presigned URLs use HTTPS and point to the tailnet hostname
- [ ] Range requests return 206 Partial Content
- [ ] Upload flow works: `/admin` → upload → asset appears in timeline
- [ ] Scan flow works: trigger scan → `originals/` indexed → timeline populated
- [ ] Failed assets show as placeholders without breaking the UI
- [ ] Video playback works for supported codecs; poster shown for unsupported
- [ ] Worker memory usage stays within the 2Gi limit during large-file processing
- [ ] No mixed-content warnings in the browser console

### Performance Validation

- [ ] Timeline tree loads and remains responsive
- [ ] Zoom/pan works smoothly on mobile (test touch)
- [ ] Video seeking works without stutter
- [ ] Worker processes the queue without OOM
- [ ] Postgres memory stays within 2Gi
- [ ] MinIO memory stays within 2Gi

---

## High-Risk Areas Summary

| Risk | Impact | Likelihood | Mitigation |
| --- | --- | --- | --- |
| Pi 3 node receives heavy pod | OOMKilled, cluster instability | Very Low | Required affinity + `capacity=low:NoExecute` taint prevent scheduling |
| Tailscale Ingress Range request issues | Video seeking broken, poor UX | Medium | Enable `tailscaleServiceS3.enabled: true` for MinIO |
| Worker OOM on large video processing | Worker crashes, queue stalls | Low | Concurrency=1 already set; monitor memory during testing |
| MinIO presigned URL expiration | Videos stop playing mid-session | Low | 900s TTL is reasonable; user can re-open the viewer |
| Staging files accumulate | Disk fills up | Medium | Enable `cleanupStaging.enabled: true` |
| Missing error boundaries | Component crashes show unhandled errors | Low | Error boundaries now implemented |
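As referenced in the post-deployment checklist, a quick way to confirm the MinIO S3 endpoint over the tailnet is with `mc`. This is a sketch under assumptions: the alias name (`porthole-tailnet`), the `<tailnet>` hostname, and the credential variables are placeholders, not values from the chart.

```bash
# Point an mc alias at the tailnet-exposed S3 endpoint (assumed hostname and credentials).
mc alias set porthole-tailnet https://minio.<tailnet> "$MINIO_ROOT_USER" "$MINIO_ROOT_PASSWORD"

# List buckets; success confirms TLS and S3 API reachability through Tailscale.
mc ls porthole-tailnet
```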
---

## Next Steps

1. **Update node affinity** to `required` for the compute class (already applied; see Issue 1) or add anti-affinity for the Pi 3 node
2. **Enable the Tailscale LoadBalancer service** for MinIO S3 for reliable Range requests
3. **Enable the cleanup CronJob** after initial testing: `--set cronjobs.cleanupStaging.enabled=true` (a combined upgrade command is sketched below)
4. **Deploy to the cluster** and run the validation commands
5. **Perform end-to-end testing** with real media (upload + scan)
6. **Monitor resource usage** during typical operations to confirm the limits are appropriate
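Steps 2-4 can be applied in a single upgrade. A minimal sketch, assuming the values keys quoted earlier in this report (`minio.tailscaleServiceS3.enabled`, `cronjobs.cleanupStaging.enabled`) and the namespace used in the deployment section:

```bash
# Apply the recommended toggles and roll the release in one pass.
helm upgrade --install porthole helm/porthole -f values.yaml \
  --namespace porthole \
  --set minio.tailscaleServiceS3.enabled=true \
  --set cronjobs.cleanupStaging.enabled=true

# Then confirm pod placement before re-running the validation commands above.
kubectl get pods -n porthole -o wide
```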