- Created comprehensive QA checklist covering edge cases (missing EXIF, timezones, codecs, corrupt files)
- Added ErrorBoundary component wrapped around TimelineTree and MediaPanel
- Created global error.tsx page for unhandled errors
- Improved failed asset UX with red borders, warning icons, and inline error display
- Added loading skeletons to TimelineTree and MediaPanel
- Added retry button for failed media loads
- Created DEPLOYMENT_VALIDATION.md with validation commands and checklist
- Applied k8s recommendations:
  - Changed node affinity to required for compute nodes (Pi 5)
  - Enabled Tailscale LoadBalancer service for MinIO S3 (reliable Range requests)
  - Enabled cleanup CronJob for staging files
# Task 11 - Kubernetes Deployment Validation Report
## Configuration Review Summary

### ✅ Correctly Configured

#### 1. Tailscale Ingress

All three ingress resources are properly defined:

- **App** (`app.<tailnet-fqdn>`) → web service port 3000
- **MinIO S3** (`minio.<tailnet-fqdn>`) → MinIO port 9000
- **MinIO Console** (`minio-console.<tailnet-fqdn>`) → MinIO console port 9001

Each ingress correctly:

- Uses the Tailscale ingress class
- Configures TLS with the appropriate hostname
- Routes to the correct service and port
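For reference, a minimal sketch of what one of these rendered Ingress resources might look like when using the Tailscale operator's ingress class; the resource and service names here are illustrative rather than taken from the chart:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: porthole-web            # illustrative name
  namespace: porthole
spec:
  ingressClassName: tailscale   # handled by the Tailscale Kubernetes operator
  tls:
    - hosts:
        - app                   # exposed as app.<tailnet-fqdn>
  defaultBackend:
    service:
      name: porthole-web        # assumed web Service name
      port:
        number: 3000
```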
#### 2. Tailscale Service Option (LoadBalancer)
An alternative exposure path via a Tailscale LoadBalancer service is available:

- `helm/porthole/templates/service-minio-tailscale-s3.yaml.tpl` - S3 API at `minio.<tailnet-fqdn>`
- `helm/porthole/templates/service-minio-tailscale-console.yaml.tpl` - Console at `minio-console.<tailnet-fqdn>`

Currently disabled (`minio.tailscaleServiceS3.enabled: false` in values.yaml).
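A hedged sketch of what such a Tailscale LoadBalancer Service typically looks like; the name, selector labels, and hostname label are illustrative, not copied from these templates:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: porthole-minio-ts-s3        # illustrative name
  namespace: porthole
  annotations:
    tailscale.com/hostname: minio   # exposed as minio.<tailnet-fqdn>
spec:
  type: LoadBalancer
  loadBalancerClass: tailscale      # provisioned by the Tailscale operator
  selector:
    app.kubernetes.io/component: minio
  ports:
    - name: s3
      port: 9000
      targetPort: 9000
```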
#### 3. Node Scheduling
All heavy workloads are configured with `schedulingClass: compute`:

- web (1Gi limit)
- worker (2Gi limit)
- postgres (2Gi limit)
- redis (512Mi limit)
- minio (2Gi limit)

The scheduling helper (`_helpers.tpl:40-46`) applies the `scheduling.compute.affinity`, which prefers nodes labeled with `node-class=compute`.
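As an illustration of how such a helper can work (a sketch only; the actual definition in `_helpers.tpl` may differ), it boils down to rendering the affinity block from values into the pod spec:

```yaml
{{/* Illustrative sketch, not the chart's actual helper */}}
{{- define "porthole.computeAffinity" -}}
affinity:
  {{- toYaml .Values.scheduling.compute.affinity | nindent 2 }}
{{- end -}}
```

Workload templates would then `include` this helper in their pod spec whenever `schedulingClass: compute` is set.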
#### 4. Longhorn PVCs
Both stateful workloads use Longhorn PVCs:

- Postgres: 20Gi storage
- MinIO: 200Gi storage
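A minimal sketch of the MinIO claim, assuming the standard `longhorn` storage class name (the chart's actual template may set names and labels differently):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: porthole-minio-data    # illustrative name
  namespace: porthole
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: longhorn
  resources:
    requests:
      storage: 200Gi
```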
#### 5. Resource Limits
All workloads have appropriate resource requests and limits for Pi hardware (listed as requests → limits):

- Web: 200m CPU / 256Mi → 1000m CPU / 1Gi
- Worker: 500m CPU / 1Gi → 2000m CPU / 2Gi
- Postgres: 500m CPU / 1Gi → 1500m CPU / 2Gi
- Redis: 50m CPU / 128Mi → 300m CPU / 512Mi
- MinIO: 250m CPU / 512Mi → 1500m CPU / 2Gi
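For example, the worker row maps onto a standard Kubernetes resources block along these lines (the exact values.yaml key layout is an assumption):

```yaml
worker:
  resources:
    requests:
      cpu: 500m
      memory: 1Gi
    limits:
      cpu: 2000m
      memory: 2Gi
```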
#### 6. Cleanup CronJob
Staging cleanup is properly configured but disabled by default:

- Only targets the `staging/` prefix (safe, never touches `originals/`)
- Removes files older than 14 days
- Must be enabled manually: `cronjobs.cleanupStaging.enabled: true`
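Conceptually, the job's work is equivalent to an `mc` invocation like the one below; the alias and bucket names are illustrative and this is not the actual CronJob spec:

```bash
# Delete staging objects older than 14 days; originals/ is never touched
mc rm --recursive --force --older-than 14d porthole/media/staging/
```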
---
### ⚠️ Issues & Recommendations
#### 1. Node Affinity Now Uses "Required"

**Status:** ✅ Fixed - Affinity changed to `requiredDuringSchedulingIgnoredDuringExecution`.

All heavy workloads now require `node-class=compute` nodes (Pi 5). The Pi 3 node is tainted with `capacity=low:NoExecute`, which provides an additional safeguard preventing any pods from being scheduled on it.

**Alternative:** Keep the preferred affinity but add a required rule that excludes the Pi 3 node (requires labeling the Pi 3 with `node-class=tiny`):

```yaml
scheduling:
  compute:
    affinity:
      nodeAffinity:
        preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            preference:
              matchExpressions:
                - key: node-class
                  operator: In
                  values:
                    - compute
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
            - matchExpressions:
                - key: node-class
                  operator: NotIn
                  values:
                    - tiny
```
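To confirm that the labels and taint this relies on are actually in place, the following checks can be run (node names omitted):

```bash
# Show the node-class label on every node
kubectl get nodes -L node-class

# Show taints per node; the Pi 3 should carry capacity=low:NoExecute
kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints
```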
#### 2. No Range Request Optimizations on Ingress
**Issue:** The Tailscale ingress resources (`ingress-tailscale.yaml.tpl`) don't have annotations for proxy timeout or buffer settings, which are important for video streaming and Range requests.

**Risk:** Video seeking may be unreliable or fail for large files served through the Tailscale Ingress.

**Recommendation 1 (Preferred):** Enable the Tailscale LoadBalancer service for MinIO S3 instead of the Ingress. This provides a more direct connection for streaming:

```yaml
# In values.yaml
minio:
  tailscaleServiceS3:
    enabled: true
    hostnameLabel: minio
```

This will:

- Create a LoadBalancer service accessible via `https://minio.<tailnet-fqdn>`
- Provide more reliable Range request support
- Bypass potential ingress buffering issues
**Recommendation 2 (If using Ingress):** Add custom annotations for timeout/buffer optimization to `values.yaml`:

```yaml
minio:
  ingressS3:
    extraAnnotations:
      nginx.ingress.kubernetes.io/proxy-body-size: "500m"
      nginx.ingress.kubernetes.io/proxy-request-buffering: "off"
      nginx.ingress.kubernetes.io/proxy-max-temp-file-size: "0"
      nginx.ingress.kubernetes.io/proxy-read-timeout: "600"
      nginx.ingress.kubernetes.io/proxy-send-timeout: "600"
```

Note: these annotations are specific to the nginx ingress controller. If using Tailscale ingress, check the Tailscale documentation for equivalent settings.
#### 3. Cleanup CronJob Disabled by Default
**Issue:** `cronjobs.cleanupStaging.enabled: false` in values.yaml means old staging files will accumulate indefinitely.

**Risk:** Staging files from failed/interrupted uploads will fill up the MinIO PVC over time.

**Recommendation:** Enable cleanup after initial testing:

```bash
helm upgrade --install porthole helm/porthole -f values.yaml \
  --set cronjobs.cleanupStaging.enabled=true
```

Or set it in values.yaml:

```yaml
cronjobs:
  cleanupStaging:
    enabled: true
```
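Once enabled, the CronJob can be checked and exercised without waiting for its schedule; the CronJob name below is an assumption, so substitute the name shown by `kubectl get cronjob`:

```bash
# Confirm the CronJob exists and is scheduled
kubectl get cronjob -n porthole

# Trigger a one-off run and watch its logs
kubectl create job --from=cronjob/porthole-cleanup-staging cleanup-test -n porthole
kubectl logs -n porthole job/cleanup-test -f
```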
---
## Deployment Validation Commands

### 1. Verify Pod Scheduling

```bash
# Check all pods are on Pi 5 nodes (not the Pi 3)
kubectl get pods -n porthole -o wide

# Expected: all pods except optional cronjobs are on nodes with node-class=compute
```
### 2. Verify Tailscale Endpoints
```bash
# Check Tailscale ingress status
kubectl get ingress -n porthole

# If the LoadBalancer service is enabled:
kubectl get svc -n porthole -l app.kubernetes.io/component=minio
```
### 3. Verify PVCs
```bash
# Check Longhorn PVCs are created and bound
kubectl get pvc -n porthole

# Check PVC status
kubectl describe pvc -n porthole | grep -A 5 "Status:"
```
### 4. Verify Resource Usage
```bash
# Check current resource usage
kubectl top pods -n porthole

# Check resource requests/limits
kubectl get pods -n porthole -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{range .spec.containers[*]} {.name}: CPU={.resources.requests.cpu}→{.resources.limits.cpu}, MEM={.resources.requests.memory}→{.resources.limits.memory}{"\n"}{end}{"\n"}{end}'
```
### 5. Test Presigned URL (HTTPS)
```bash
# Get a presigned URL (replace <asset-id>)
curl -sS "https://app.<tailnet-fqdn>/api/assets/<asset-id>/url?variant=original" | jq .url

# Expected: URL starts with "https://minio.<tailnet-fqdn>..."
# NOT "http://..."
```
### 6. Test Range Request Support
```bash
# Get a presigned URL
URL=$(curl -sS "https://app.<tailnet-fqdn>/api/assets/<asset-id>/url?variant=original" | jq -r .url)

# Test a Range request (request the first 1 KiB)
curl -sS -D- -H 'Range: bytes=0-1023' "$URL" -o /dev/null

# Expected: HTTP/1.1 206 Partial Content
# If you see 200 OK, Range requests are not working
```
### 7. Verify Worker Concurrency
```bash
# Check the BullMQ configuration in the worker
kubectl exec -n porthole deployment/porthole-worker -- cat /app/src/index.ts | grep -A 5 "concurrency"

# Expected: concurrency: 1 (or at most 2 for Pi hardware)
```
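This assumes the worker image ships its TypeScript source at `/app/src/index.ts`. If only compiled output is shipped, a rough fallback, assuming the chart also surfaces concurrency through an environment variable (an assumption about this chart), is:

```bash
# Look for a concurrency-related environment variable on the worker
kubectl exec -n porthole deployment/porthole-worker -- env | grep -i concurrency
```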
### 8. Test Timeline with Failed Assets
```bash
# Query the timeline with failed assets included
curl -sS "https://app.<tailnet-fqdn>/api/tree?includeFailed=1" | jq '.nodes[] | select(.count_ready < .count_total)'

# Should return nodes where some assets have status != 'ready'
```
### 9. Database Verification
Connect to Postgres, then run the checks below inside `psql`:

```bash
kubectl exec -it -n porthole statefulset/porthole-postgres -- psql -U porthole -d porthole
```

```sql
-- Check failed assets
SELECT id, media_type, status, error_message, date_confidence FROM assets WHERE status = 'failed' LIMIT 10;

-- Check assets without a capture date (should not appear in the timeline)
SELECT COUNT(*) FROM assets WHERE capture_ts_utc IS NULL;

-- Verify external originals were not copied to canonical storage (should be 0)
SELECT COUNT(*) FROM assets WHERE source_key LIKE 'originals/%' AND canonical_key IS NOT NULL;
```

---
## End-to-End Deployment Verification Checklist
### Pre-Deployment

- [ ] Label Pi 5 nodes: `kubectl label node <pi5-node-1> node-class=compute`
- [ ] Label Pi 5 nodes: `kubectl label node <pi5-node-2> node-class=compute`
- [ ] Verify Pi 3 has the taint: `kubectl taint node <pi3-node> capacity=low:NoExecute`
- [ ] Set `global.tailscale.tailnetFQDN` in values.yaml
- [ ] Set secret values (postgres password, MinIO credentials)
- [ ] Build and push multi-arch images to the registry (see the sketch below)
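The build step, sketched with Docker Buildx for the arm64 Pi 5 nodes; the registry, image name, tag, and build context are illustrative:

```bash
# Build and push an arm64 image (repeat for web and worker)
docker buildx build \
  --platform linux/arm64 \
  -t <registry>/porthole-web:<tag> \
  --push \
  .
```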
### Deployment
```bash
# Install the Helm chart
helm install porthole helm/porthole -f values.yaml --namespace porthole --create-namespace

# Wait for pods to be ready
kubectl wait --for=condition=ready pod -l app.kubernetes.io/name=porthole -n porthole --timeout=10m
```
### Post-Deployment Verification
- [ ] All pods are running on Pi 5 nodes (check `kubectl get pods -n porthole -o wide`)
- [ ] PVCs are created and bound (`kubectl get pvc -n porthole`)
- [ ] Tailscale endpoints are accessible:
  - [ ] `https://app.<tailnet-fqdn>` - web UI loads
  - [ ] `https://minio.<tailnet-fqdn>` - MinIO S3 accessible via `mc ls` (see the sketch below)
  - [ ] `https://minio-console.<tailnet-fqdn>` - MinIO console loads
- [ ] Presigned URLs use HTTPS and point to the tailnet hostname
- [ ] Range requests return 206 Partial Content
- [ ] Upload flow works: `/admin` → upload → asset appears in timeline
- [ ] Scan flow works: trigger scan → `originals/` indexed → timeline populated
- [ ] Failed assets show as placeholders without breaking the UI
- [ ] Video playback works for supported codecs; poster shown for unsupported
- [ ] Worker memory usage stays within the 2Gi limit during large file processing
- [ ] No mixed-content warnings in the browser console
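A quick way to exercise the S3 endpoint with the MinIO client; the alias name and credential placeholders are illustrative:

```bash
# Point an mc alias at the tailnet endpoint, then list buckets
mc alias set porthole https://minio.<tailnet-fqdn> <access-key> <secret-key>
mc ls porthole
```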
### Performance Validation
- [ ] Timeline tree loads and remains responsive
- [ ] Zoom/pan works smoothly on mobile (test touch)
- [ ] Video seeking works without stutter
- [ ] Worker processes the queue without OOM
- [ ] Postgres memory stays within 2Gi
- [ ] MinIO memory stays within 2Gi
---
## High-Risk Areas Summary

| Risk | Impact | Likelihood | Mitigation |
| --- | --- | --- | --- |
| Pi 3 node receives heavy pod | OOMKilled, cluster instability | Very Low | Required affinity + `capacity=low:NoExecute` taint prevent scheduling |
| Tailscale Ingress Range request issues | Video seeking broken, poor UX | Medium | Enable `tailscaleServiceS3.enabled: true` for MinIO |
| Worker OOM on large video processing | Worker crashes, queue stalls | Low | Concurrency=1 already set; monitor memory during testing |
| MinIO presigned URL expiration | Videos stop playing mid-session | Low | 900s TTL is reasonable; user can re-open viewer |
| Staging files accumulate | Disk fills up | Medium | Enable `cleanupStaging.enabled: true` |
| Missing error boundaries | Component crashes show unhandled error | Low | Error boundaries now implemented |
---
## Next Steps

1. **Update node affinity** to `required` for the compute class (or add a required rule excluding the Pi 3 node)
2. **Enable the Tailscale LoadBalancer service** for MinIO S3 for reliable Range requests
3. **Enable the cleanup CronJob** after initial testing: `--set cronjobs.cleanupStaging.enabled=true`
4. **Deploy to the cluster** and run the validation commands
5. **Perform end-to-end testing** with real media (upload + scan)
6. **Monitor resource usage** during typical operations to confirm limits are appropriate