OpenCode Test 4e2ab7cdd8 task-11: complete QA + hardening with resilience fixes
- Created comprehensive QA checklist covering edge cases (missing EXIF, timezones, codecs, corrupt files)
- Added ErrorBoundary component wrapped around TimelineTree and MediaPanel
- Created global error.tsx page for unhandled errors
- Improved failed asset UX with red borders, warning icons, and inline error display
- Added loading skeletons to TimelineTree and MediaPanel
- Added retry button for failed media loads
- Created DEPLOYMENT_VALIDATION.md with validation commands and checklist
- Applied k8s recommendations:
  - Changed node affinity to required for compute nodes (Pi 5)
  - Enabled Tailscale LoadBalancer service for MinIO S3 (reliable Range requests)
  - Enabled cleanup CronJob for staging files

Task 11 - Kubernetes Deployment Validation Report

Configuration Review Summary

Correctly Configured

1. Tailscale Ingress

All three ingress resources are properly defined:

  • App (app.<tailnet-fqdn>) → web service port 3000
  • MinIO S3 (minio.<tailnet-fqdn>) → MinIO port 9000
  • MinIO Console (minio-console.<tailnet-fqdn>) → MinIO console port 9001

Each ingress correctly:

  • Uses Tailscale ingress class
  • Configures TLS with the appropriate hostname
  • Routes to the correct service and port
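
A quick way to confirm the hostnames, ingress class, and backends line up is to list and describe the ingresses (the grep below only filters the describe output; field names follow standard kubectl output):

# Confirm each ingress routes the expected hostname to the expected service/port
kubectl get ingress -n porthole
kubectl describe ingress -n porthole | grep -E 'Ingress Class|Host|Backends|Default backend|TLS'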

2. Tailscale Service Option (LoadBalancer)

An alternative exposure method via a Tailscale LoadBalancer service is available:

  • helm/porthole/templates/service-minio-tailscale-s3.yaml.tpl - S3 API at minio.<tailnet-fqdn>
  • helm/porthole/templates/service-minio-tailscale-console.yaml.tpl - Console at minio-console.<tailnet-fqdn>

Currently disabled (minio.tailscaleServiceS3.enabled: false in values.yaml).
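
To preview what enabling this option would create, without touching the cluster, the chart can be rendered locally (release name and values file path assumed to match the install commands later in this document):

# Render the chart with the S3 LoadBalancer enabled and inspect the resulting Service
helm template porthole helm/porthole -f values.yaml \
  --set minio.tailscaleServiceS3.enabled=true \
  | grep -B 5 -A 15 'type: LoadBalancer'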

3. Node Scheduling

All heavy workloads are configured with schedulingClass: compute:

  • web (1Gi limit)
  • worker (2Gi limit)
  • postgres (2Gi limit)
  • redis (512Mi limit)
  • minio (2Gi limit)

The scheduling helper (_helpers.tpl:40-46) applies scheduling.compute.affinity, which now requires nodes labeled node-class=compute (see item 1 under Issues & Recommendations).
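
To confirm the labels and the affinity that actually lands on a workload, the following checks can be used (the worker deployment name is assumed to follow the release-component pattern; adjust as needed):

# List nodes with their node-class label
kubectl get nodes -L node-class

# Inspect the rendered node affinity on a workload
kubectl get deployment porthole-worker -n porthole -o json \
  | jq '.spec.template.spec.affinity.nodeAffinity'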

4. Longhorn PVCs

Both stateful workloads use Longhorn PVCs:

  • Postgres: 20Gi storage
  • MinIO: 200Gi storage
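
Binding status and storage class can be confirmed in one listing:

# Confirm both PVCs are Bound and use the Longhorn storage class
kubectl get pvc -n porthole \
  -o custom-columns='NAME:.metadata.name,STATUS:.status.phase,CLASS:.spec.storageClassName,SIZE:.spec.resources.requests.storage'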

5. Resource Limits

All workloads have resource requests and limits sized for Pi hardware (listed below as requests → limits); a node-capacity check follows the list:

  • Web: 200m CPU / 256Mi → 1000m CPU / 1Gi
  • Worker: 500m CPU / 1Gi → 2000m CPU / 2Gi
  • Postgres: 500m CPU / 1Gi → 1500m CPU / 2Gi
  • Redis: 50m CPU / 128Mi → 300m CPU / 512Mi
  • MinIO: 250m CPU / 512Mi → 1500m CPU / 2Gi
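
The node-capacity check mentioned above sums what is already requested on each compute node, which shows whether these limits leave headroom:

# Compare requested resources and limits against each compute node's allocatable capacity
kubectl describe nodes -l node-class=compute | grep -A 8 'Allocated resources'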

6. Cleanup CronJob

Staging cleanup is properly configured but disabled by default:

  • Only targets staging/ prefix (safe, never touches originals/)
  • Removes files older than 14 days
  • Must be enabled manually: cronjobs.cleanupStaging.enabled: true
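
Once enabled, a run can be triggered immediately instead of waiting for the schedule (the CronJob name is an assumption; check the real name with kubectl get cronjobs -n porthole):

# Trigger one cleanup run manually and follow its logs
kubectl create job cleanup-staging-manual --from=cronjob/porthole-cleanup-staging -n porthole
kubectl logs -n porthole job/cleanup-staging-manual -f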

⚠️ Issues & Recommendations

1. Node Affinity Now Uses "Required"

Status: Fixed - Affinity changed to requiredDuringSchedulingIgnoredDuringExecution.

All heavy workloads now require node-class=compute nodes (Pi 5). The Pi 3 node also carries the capacity=low:NoExecute taint, an additional safeguard: pods without a matching toleration are never scheduled there and are evicted if they somehow land on it.

Alternative: keep the preferred affinity for compute but add a required NotIn rule that excludes the Pi 3 node (requires labeling the Pi 3 with node-class=tiny):

scheduling:
  compute:
    affinity:
      nodeAffinity:
        preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            preference:
              matchExpressions:
                - key: node-class
                  operator: In
                  values:
                    - compute
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
            - matchExpressions:
                - key: node-class
                  operator: NotIn
                  values:
                    - tiny
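
Whichever variant is chosen, the rendered affinity can be verified before deploying:

# Confirm the rendered manifests carry the intended node affinity
helm template porthole helm/porthole -f values.yaml \
  | grep -B 2 -A 10 'DuringSchedulingIgnoredDuringExecution'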

2. No Range Request Optimizations on Ingress

Issue: The Tailscale ingress resources (ingress-tailscale.yaml.tpl) don't have annotations for proxy timeout or buffer settings that are important for video streaming and Range requests.

Risk: Video seeking may be unreliable or fail for large files through Tailscale Ingress.

Recommendation 1 (Preferred): Enable Tailscale LoadBalancer Service for MinIO S3 instead of Ingress. This provides a more direct connection for streaming:

# In values.yaml
minio:
  tailscaleServiceS3:
    enabled: true
    hostnameLabel: minio

This will:

  • Create a LoadBalancer service accessible via https://minio.<tailnet-fqdn>
  • Provide more reliable Range request support
  • Bypass potential ingress buffering issues
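
After enabling it, the Service should report a tailnet address in its load balancer status (the service name below is a placeholder; list the services to find the real one):

# Check that the Tailscale LoadBalancer service is exposed on the tailnet
kubectl get svc -n porthole -o wide
kubectl get svc <minio-s3-tailscale-service> -n porthole \
  -o jsonpath='{.status.loadBalancer.ingress[0].hostname}'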

Recommendation 2 (If using Ingress): Add custom annotations for timeout/buffer optimization. Add to values.yaml:

minio:
  ingressS3:
    extraAnnotations:
      nginx.ingress.kubernetes.io/proxy-body-size: "500m"
      nginx.ingress.kubernetes.io/proxy-request-buffering: "off"
      nginx.ingress.kubernetes.io/proxy-max-temp-file-size: "0"
      nginx.ingress.kubernetes.io/proxy-read-timeout: "600"
      nginx.ingress.kubernetes.io/proxy-send-timeout: "600"

Note: These annotations are specific to the nginx ingress controller and will have no effect on the Tailscale ingress class used here; check the Tailscale operator documentation for equivalent settings before relying on this approach.

3. Cleanup CronJob Disabled by Default

Issue: cronjobs.cleanupStaging.enabled: false in values.yaml means old staging files will accumulate indefinitely.

Risk: Staging files from failed/interrupted uploads will fill up MinIO PVC over time.

Recommendation: Enable cleanup after initial testing:

helm upgrade --install porthole helm/porthole -f values.yaml \
  --set cronjobs.cleanupStaging.enabled=true

Or set in values.yaml:

cronjobs:
  cleanupStaging:
    enabled: true

Deployment Validation Commands

1. Verify Pod Scheduling

# Check all pods are on Pi 5 nodes (not Pi 3)
kubectl get pods -n porthole -o wide

# Expected: All pods except optional cronjobs should be on nodes with node-class=compute

2. Verify Tailscale Endpoints

# Check Tailscale ingress status
kubectl get ingress -n porthole

# If LoadBalancer service enabled:
kubectl get svc -n porthole -l app.kubernetes.io/component=minio

3. Verify PVCs

# Check Longhorn PVCs are created and bound
kubectl get pvc -n porthole

# Check PVC status
kubectl describe pvc -n porthole | grep -A 5 "Status:"

4. Verify Resource Usage

# Check current resource usage
kubectl top pods -n porthole

# Check resource requests/limits
kubectl get pods -n porthole -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{range .spec.containers[*]}  {.name}: CPU={.resources.requests.cpu}→{.resources.limits.cpu}, MEM={.resources.requests.memory}→{.resources.limits.memory}{"\n"}{end}{"\n"}{end}'

5. Test Presigned URL (HTTPS)

# Get presigned URL (replace <asset-id>)
curl -sS "https://app.<tailnet-fqdn>/api/assets/<asset-id>/url?variant=original" | jq .url

# Expected: URL starts with "https://minio.<tailnet-fqdn>..."
# NOT "http://..."

6. Test Range Request Support

# Get presigned URL
URL=$(curl -sS "https://app.<tailnet-fqdn>/api/assets/<asset-id>/url?variant=original" | jq -r .url)

# Test Range request (request first 1KB)
curl -sS -D- -H 'Range: bytes=0-1023' "$URL" -o /dev/null

# Expected: HTTP/1.1 206 Partial Content
# If you see 200 OK, Range requests are not working

7. Verify Worker Concurrency

# Check BullMQ configuration in worker
kubectl exec -n porthole deployment/porthole-worker -- cat /app/src/index.ts | grep -A 5 "concurrency"

# Expected: concurrency: 1 (or at most 2 for Pi hardware)

8. Test Timeline with Failed Assets

# Query timeline with failed assets included
curl -sS "https://app.<tailnet-fqdn>/api/tree?includeFailed=1" | jq '.nodes[] | select(.count_ready < .count_total)'

# Should return nodes where some assets have status != 'ready'

9. Database Verification

# Connect to Postgres
kubectl exec -it -n porthole statefulset/porthole-postgres -- psql -U porthole -d porthole

-- Check failed assets
SELECT id, media_type, status, error_message, date_confidence FROM assets WHERE status = 'failed' LIMIT 10;

-- Check assets without capture date (should not appear in timeline)
SELECT COUNT(*) FROM assets WHERE capture_ts_utc IS NULL;

-- Verify external originals not copied to canonical
SELECT COUNT(*) FROM assets WHERE source_key LIKE 'originals/%' AND canonical_key IS NOT NULL;
-- Should be 0

End-to-End Deployment Verification Checklist

Pre-Deployment

  • Label Pi 5 nodes: kubectl label node <pi5-node-1> node-class=compute
  • Label Pi 5 nodes: kubectl label node <pi5-node-2> node-class=compute
  • Verify Pi 3 has taint: kubectl taint node <pi3-node> capacity=low:NoExecute
  • Set global.tailscale.tailnetFQDN in values.yaml
  • Set secret values (postgres password, minio credentials)
  • Build and push multi-arch images to registry
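
A minimal multi-arch build sketch for the last item (registry, tags, and Dockerfile paths are placeholders; adjust to the repo layout):

# Build and push arm64 images for the Pi 5 nodes
docker buildx build --platform linux/arm64 -t <registry>/porthole-web:<tag> \
  -f apps/web/Dockerfile --push .
docker buildx build --platform linux/arm64 -t <registry>/porthole-worker:<tag> \
  -f apps/worker/Dockerfile --push .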

Deployment

# Install Helm chart
helm install porthole helm/porthole -f values.yaml --namespace porthole --create-namespace

# Wait for pods to be ready
kubectl wait --for=condition=ready pod -l app.kubernetes.io/name=porthole -n porthole --timeout=10m
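
If the wait times out, per-workload rollout status and recent events narrow down which component is stuck (deployment names assumed):

# Check rollout progress of individual workloads
kubectl rollout status deployment/porthole-web -n porthole --timeout=5m
kubectl rollout status deployment/porthole-worker -n porthole --timeout=5m
kubectl get events -n porthole --sort-by=.lastTimestamp | tail -20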

Post-Deployment Verification

  • All pods are running on Pi 5 nodes (check kubectl get pods -n porthole -o wide)
  • PVCs are created and bound (kubectl get pvc -n porthole)
  • Tailscale endpoints are accessible:
    • https://app.<tailnet-fqdn> - web UI loads
    • https://minio.<tailnet-fqdn> - MinIO S3 accessible (mc ls)
    • https://minio-console.<tailnet-fqdn> - MinIO console loads
  • Presigned URLs use HTTPS and point to tailnet hostname
  • Range requests return 206 Partial Content
  • Upload flow works: /admin → upload → asset appears in timeline
  • Scan flow works: trigger scan → originals/ indexed → timeline populated
  • Failed assets show as placeholders without breaking UI
  • Video playback works for supported codecs; poster shown for unsupported
  • Worker memory usage stays within 2Gi limit during large file processing
  • No mixed-content warnings in browser console
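
For the MinIO S3 item, one way to check with the MinIO client (the alias name and credential variables are placeholders):

# Verify S3 access over Tailscale with the MinIO client
mc alias set porthole-ts https://minio.<tailnet-fqdn> "$MINIO_ROOT_USER" "$MINIO_ROOT_PASSWORD"
mc ls porthole-ts/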

Performance Validation

  • Timeline tree loads and remains responsive
  • Zoom/pan works smoothly on mobile (test touch)
  • Video seeking works without stutter
  • Worker processes queue without OOM
  • Postgres memory stays within 2Gi
  • MinIO memory stays within 2Gi
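
Memory behavior during these checks can be watched while a large upload or scan runs (requires metrics-server):

# Watch per-pod usage during processing
watch -n 10 kubectl top pods -n porthole

# Per-container breakdown
kubectl top pods -n porthole --containers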

High-Risk Areas Summary

| Risk | Impact | Likelihood | Mitigation |
| --- | --- | --- | --- |
| Pi 3 node receives heavy pod | OOMKilled, cluster instability | Very Low | Required affinity + capacity=low:NoExecute taint prevent scheduling |
| Tailscale Ingress Range request issues | Video seeking broken, poor UX | Medium | Enable tailscaleServiceS3.enabled: true for MinIO |
| Worker OOM on large video processing | Worker crashes, queue stalls | Low | Concurrency=1 already set; monitor memory during testing |
| MinIO presigned URL expiration | Videos stop playing mid-session | Low | 900s TTL is reasonable; user can re-open viewer |
| Staging files accumulate | Disk fills up | Medium | Enable cleanupStaging.enabled: true |
| Missing error boundaries | Component crashes show unhandled errors | Low | Error boundaries now implemented |

Next Steps

  1. Confirm the required node affinity for the compute class is in place (already applied per issue 1), or switch to the Pi 3 NotIn alternative if more scheduling flexibility is needed
  2. Enable Tailscale LoadBalancer service for MinIO S3 for reliable Range requests
  3. Enable cleanup CronJob after initial testing: --set cronjobs.cleanupStaging.enabled=true
  4. Deploy to cluster and run validation commands
  5. Perform end-to-end testing with real media (upload + scan)
  6. Monitor resource usage during typical operations to confirm limits are appropriate