- Created comprehensive QA checklist covering edge cases (missing EXIF, timezones, codecs, corrupt files)
- Added `ErrorBoundary` component wrapped around `TimelineTree` and `MediaPanel`
- Created global `error.tsx` page for unhandled errors
- Improved failed asset UX with red borders, warning icons, and inline error display
- Added loading skeletons to `TimelineTree` and `MediaPanel`
- Added retry button for failed media loads
- Created `DEPLOYMENT_VALIDATION.md` with validation commands and checklist
- Applied k8s recommendations:
  - Changed node affinity to required for compute nodes (Pi 5)
  - Enabled Tailscale LoadBalancer service for MinIO S3 (reliable Range requests)
  - Enabled cleanup CronJob for staging files
# Task 11 - Kubernetes Deployment Validation Report

## Configuration Review Summary

### ✅ Correctly Configured

#### 1. Tailscale Ingress
All three ingress resources are properly defined:
- App (`app.<tailnet-fqdn>`) → web service port 3000
- MinIO S3 (`minio.<tailnet-fqdn>`) → MinIO port 9000
- MinIO Console (`minio-console.<tailnet-fqdn>`) → MinIO console port 9001
Each ingress correctly:
- Uses Tailscale ingress class
- Configures TLS with the appropriate hostname
- Routes to the correct service and port
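For orientation, the app ingress presumably renders to something like the sketch below; the resource and service names are illustrative assumptions, not copied from the chart.

```yaml
# Hypothetical rendering of the app ingress (names are illustrative).
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: porthole-web
  namespace: porthole
spec:
  ingressClassName: tailscale      # served on the tailnet by the Tailscale operator
  tls:
    - hosts:
        - app                      # becomes app.<tailnet-fqdn>
  defaultBackend:
    service:
      name: porthole-web
      port:
        number: 3000
```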
#### 2. Tailscale Service Option (LoadBalancer)
Alternative exposure method via Tailscale LoadBalancer is available:
- `helm/porthole/templates/service-minio-tailscale-s3.yaml.tpl` - S3 API at `minio.<tailnet-fqdn>`
- `helm/porthole/templates/service-minio-tailscale-console.yaml.tpl` - Console at `minio-console.<tailnet-fqdn>`

Currently disabled (`minio.tailscaleServiceS3.enabled: false` in `values.yaml`).
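When enabled, the S3 template presumably produces a service along these lines; the hostname annotation and selector follow standard Tailscale operator conventions and are assumptions here, not excerpts from the template.

```yaml
# Hypothetical shape of the MinIO S3 Tailscale LoadBalancer service.
apiVersion: v1
kind: Service
metadata:
  name: porthole-minio-tailscale-s3
  namespace: porthole
  annotations:
    tailscale.com/hostname: minio   # exposed as minio.<tailnet-fqdn>
spec:
  type: LoadBalancer
  loadBalancerClass: tailscale      # handled by the Tailscale operator
  selector:
    app.kubernetes.io/component: minio
  ports:
    - name: s3
      port: 9000
      targetPort: 9000
```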
#### 3. Node Scheduling
All heavy workloads are configured with `schedulingClass: compute`:
- web (1Gi limit)
- worker (2Gi limit)
- postgres (2Gi limit)
- redis (512Mi limit)
- minio (2Gi limit)
The scheduling helper (`_helpers.tpl:40-46`) applies `scheduling.compute.affinity`, which targets nodes labeled `node-class=compute` (now as a required rule; see issue 1 below).
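As a sketch, the affinity block the helper injects for `schedulingClass: compute` should look roughly like this (standard Kubernetes node-affinity syntax; the exact template output may differ):

```yaml
# Approximate output of the scheduling helper for compute-class workloads.
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: node-class
              operator: In
              values:
                - compute
```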
#### 4. Longhorn PVCs
Both stateful workloads use Longhorn PVCs:
- Postgres: 20Gi storage
- MinIO: 200Gi storage
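For reference, the MinIO claim presumably reduces to a PVC of this shape, assuming the storage class is named `longhorn`:

```yaml
# Hypothetical PVC shape; the storage class name is an assumption.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: porthole-minio
  namespace: porthole
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: longhorn
  resources:
    requests:
      storage: 200Gi
```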
#### 5. Resource Limits
All workloads have appropriate resource requests and limits for Pi hardware:
- Web: 200m CPU / 256Mi → 1000m CPU / 1Gi
- Worker: 500m CPU / 1Gi → 2000m CPU / 2Gi
- Postgres: 500m CPU / 1Gi → 1500m CPU / 2Gi
- Redis: 50m CPU / 128Mi → 300m CPU / 512Mi
- MinIO: 250m CPU / 512Mi → 1500m CPU / 2Gi
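As a concrete example, the worker row above corresponds to a container `resources` block like this sketch:

```yaml
# Worker requests/limits matching the figures listed above.
resources:
  requests:
    cpu: 500m
    memory: 1Gi
  limits:
    cpu: 2000m
    memory: 2Gi
```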
#### 6. Cleanup CronJob
Staging cleanup is properly configured but disabled by default:
- Only targets the `staging/` prefix (safe, never touches `originals/`)
- Removes files older than 14 days
- Must be enabled manually: `cronjobs.cleanupStaging.enabled: true`
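As an illustration only, such a CronJob might look like the sketch below, assuming it runs the MinIO client (`mc`); the image, schedule, in-cluster MinIO endpoint, bucket name, and credential secret are assumptions, not values from the chart.

```yaml
# Illustrative cleanup job; schedule, image, endpoint, bucket, and secret names are assumptions.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: porthole-cleanup-staging
  namespace: porthole
spec:
  schedule: "0 3 * * *"            # nightly at 03:00
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: cleanup
              image: minio/mc:latest
              envFrom:
                - secretRef:
                    name: porthole-minio-credentials   # assumed secret name
              command:
                - /bin/sh
                - -c
                - |
                  mc alias set local http://porthole-minio:9000 "$MINIO_ROOT_USER" "$MINIO_ROOT_PASSWORD"
                  # Only the staging/ prefix is touched; originals/ is never listed.
                  mc rm --recursive --force --older-than 14d local/porthole/staging/
```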
### ⚠️ Issues & Recommendations

#### 1. Node Affinity Now Uses "Required"
**Status:** ✅ Fixed - affinity changed to `requiredDuringSchedulingIgnoredDuringExecution`.

All heavy workloads now require `node-class=compute` nodes (Pi 5). The Pi 3 node is tainted with `capacity=low:NoExecute`, which provides an additional safeguard preventing any pods from being scheduled on it.
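For reference, that taint sits on the Node object roughly as follows (node name placeholder kept from the checklist later in this report):

```yaml
# The Pi 3 taint as it appears on the Node spec.
apiVersion: v1
kind: Node
metadata:
  name: <pi3-node>
spec:
  taints:
    - key: capacity
      value: low
      effect: NoExecute
```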
**Alternative:** keep the preferred affinity but add a required `NotIn` rule that excludes the Pi 3 node (requires labeling Pi 3 with `node-class=tiny`):
```yaml
scheduling:
  compute:
    affinity:
      nodeAffinity:
        preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            preference:
              matchExpressions:
                - key: node-class
                  operator: In
                  values:
                    - compute
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
            - matchExpressions:
                - key: node-class
                  operator: NotIn
                  values:
                    - tiny
```
#### 2. No Range Request Optimizations on Ingress
**Issue:** the Tailscale ingress resources (`ingress-tailscale.yaml.tpl`) have no proxy timeout or buffer annotations, which matter for video streaming and Range requests.

**Risk:** video seeking may be unreliable or fail for large files served through the Tailscale Ingress.
**Recommendation 1 (preferred):** enable the Tailscale LoadBalancer service for MinIO S3 instead of Ingress. This provides a more direct connection for streaming:

```yaml
# In values.yaml
minio:
  tailscaleServiceS3:
    enabled: true
    hostnameLabel: minio
```
This will:
- Create a LoadBalancer service accessible via `https://minio.<tailnet-fqdn>`
- Provide more reliable Range request support
- Bypass potential ingress buffering issues
**Recommendation 2 (if using Ingress):** add custom annotations for timeout/buffer optimization to `values.yaml`:

```yaml
minio:
  ingressS3:
    extraAnnotations:
      nginx.ingress.kubernetes.io/proxy-body-size: "500m"
      nginx.ingress.kubernetes.io/proxy-request-buffering: "off"
      nginx.ingress.kubernetes.io/proxy-max-temp-file-size: "0"
      nginx.ingress.kubernetes.io/proxy-read-timeout: "600"
      nginx.ingress.kubernetes.io/proxy-send-timeout: "600"
```

**Note:** these annotations are specific to the nginx ingress controller. If using the Tailscale ingress, check the Tailscale documentation for equivalent settings.
#### 3. Cleanup CronJob Disabled by Default
**Issue:** `cronjobs.cleanupStaging.enabled: false` in `values.yaml` means old staging files will accumulate indefinitely.

**Risk:** staging files from failed or interrupted uploads will fill up the MinIO PVC over time.

**Recommendation:** enable cleanup after initial testing:
```bash
helm upgrade --install porthole helm/porthole -f values.yaml \
  --set cronjobs.cleanupStaging.enabled=true
```
Or set it in `values.yaml`:

```yaml
cronjobs:
  cleanupStaging:
    enabled: true
```
## Deployment Validation Commands

### 1. Verify Pod Scheduling
```bash
# Check all pods are on Pi 5 nodes (not Pi 3)
kubectl get pods -n porthole -o wide
# Expected: All pods except optional cronjobs should be on nodes with node-class=compute
```
### 2. Verify Tailscale Endpoints
```bash
# Check Tailscale ingress status
kubectl get ingress -n porthole

# If LoadBalancer service enabled:
kubectl get svc -n porthole -l app.kubernetes.io/component=minio
```
### 3. Verify PVCs
```bash
# Check Longhorn PVCs are created and bound
kubectl get pvc -n porthole

# Check PVC status
kubectl describe pvc -n porthole | grep -A 5 "Status:"
```
### 4. Verify Resource Usage
```bash
# Check current resource usage
kubectl top pods -n porthole

# Check resource requests/limits
kubectl get pods -n porthole -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{range .spec.containers[*]} {.name}: CPU={.resources.requests.cpu}→{.resources.limits.cpu}, MEM={.resources.requests.memory}→{.resources.limits.memory}{"\n"}{end}{"\n"}{end}'
```
### 5. Test Presigned URL (HTTPS)
```bash
# Get presigned URL (replace <asset-id>)
curl -sS "https://app.<tailnet-fqdn>/api/assets/<asset-id>/url?variant=original" | jq .url
# Expected: URL starts with "https://minio.<tailnet-fqdn>..."
# NOT "http://..."
```
### 6. Test Range Request Support
```bash
# Get presigned URL
URL=$(curl -sS "https://app.<tailnet-fqdn>/api/assets/<asset-id>/url?variant=original" | jq -r .url)

# Test Range request (request first 1KB)
curl -sS -D- -H 'Range: bytes=0-1023' "$URL" -o /dev/null
# Expected: HTTP/1.1 206 Partial Content
# If you see 200 OK, Range requests are not working
```
### 7. Verify Worker Concurrency
```bash
# Check BullMQ configuration in worker
kubectl exec -n porthole deployment/porthole-worker -- cat /app/src/index.ts | grep -A 5 "concurrency"
# Expected: concurrency: 1 (or at most 2 for Pi hardware)
```
### 8. Test Timeline with Failed Assets
```bash
# Query timeline with failed assets included
curl -sS "https://app.<tailnet-fqdn>/api/tree?includeFailed=1" | jq '.nodes[] | select(.count_ready < .count_total)'
# Should return nodes where some assets have status != 'ready'
```
### 9. Database Verification
```bash
# Connect to Postgres
kubectl exec -it -n porthole statefulset/porthole-postgres -- psql -U porthole -d porthole
```

```sql
-- Check failed assets
SELECT id, media_type, status, error_message, date_confidence FROM assets WHERE status = 'failed' LIMIT 10;

-- Check assets without capture date (should not appear in timeline)
SELECT COUNT(*) FROM assets WHERE capture_ts_utc IS NULL;

-- Verify external originals not copied to canonical
SELECT COUNT(*) FROM assets WHERE source_key LIKE 'originals/%' AND canonical_key IS NOT NULL;
-- Should be 0
```
## End-to-End Deployment Verification Checklist

### Pre-Deployment
- Label Pi 5 nodes: `kubectl label node <pi5-node-1> node-class=compute`
- Label Pi 5 nodes: `kubectl label node <pi5-node-2> node-class=compute`
- Verify Pi 3 has taint: `kubectl taint node <pi3-node> capacity=low:NoExecute`
- Set `global.tailscale.tailnetFQDN` in `values.yaml`
- Set secret values (Postgres password, MinIO credentials)
- Build and push multi-arch images to the registry
### Deployment
```bash
# Install Helm chart
helm install porthole helm/porthole -f values.yaml --namespace porthole --create-namespace

# Wait for pods to be ready
kubectl wait --for=condition=ready pod -l app.kubernetes.io/name=porthole -n porthole --timeout=10m
```
### Post-Deployment Verification
- All pods are running on Pi 5 nodes (check `kubectl get pods -n porthole -o wide`)
- PVCs are created and bound (`kubectl get pvc -n porthole`)
- Tailscale endpoints are accessible:
  - `https://app.<tailnet-fqdn>` - web UI loads
  - `https://minio.<tailnet-fqdn>` - MinIO S3 accessible (`mc ls`)
  - `https://minio-console.<tailnet-fqdn>` - MinIO console loads
- Presigned URLs use HTTPS and point to the tailnet hostname
- Range requests return 206 Partial Content
- Upload flow works: `/admin` → upload → asset appears in timeline
- Scan flow works: trigger scan → `originals/` indexed → timeline populated
- Failed assets show as placeholders without breaking the UI
- Video playback works for supported codecs; poster shown for unsupported
- Worker memory usage stays within 2Gi limit during large file processing
- No mixed-content warnings in browser console
### Performance Validation
- Timeline tree loads and remains responsive
- Zoom/pan works smoothly on mobile (test touch)
- Video seeking works without stutter
- Worker processes queue without OOM
- Postgres memory stays within 2Gi
- MinIO memory stays within 2Gi
## High-Risk Areas Summary
| Risk | Impact | Likelihood | Mitigation |
|---|---|---|---|
| Pi 3 node receives heavy pod | OOMKilled, cluster instability | Very Low | Required affinity + capacity=low:NoExecute taint prevent scheduling |
| Tailscale Ingress Range request issues | Video seeking broken, poor UX | Medium | Enable tailscaleServiceS3.enabled: true for MinIO |
| Worker OOM on large video processing | Worker crashes, queue stalls | Low | Concurrency=1 already set; monitor memory during testing |
| MinIO presigned URL expiration | Videos stop playing mid-session | Low | 900s TTL is reasonable; user can re-open viewer |
| Staging files accumulate | Disk fills up | Medium | Enable cleanupStaging.enabled: true |
| Missing error boundaries | Component crashes show unhandled error | Low | Error boundaries now implemented |
## Next Steps
- Update node affinity to `required` for the compute class (or add anti-affinity for the Pi 3 node)
- Enable the Tailscale LoadBalancer service for MinIO S3 for reliable Range requests
- Enable the cleanup CronJob after initial testing: `--set cronjobs.cleanupStaging.enabled=true`
- Deploy to the cluster and run the validation commands
- Perform end-to-end testing with real media (upload + scan)
- Monitor resource usage during typical operations to confirm limits are appropriate