# Task 11 - Kubernetes Deployment Validation Report
## Configuration Review Summary
### ✅ Correctly Configured
#### 1. Tailscale Ingress
All three ingress resources are properly defined:
- **App** (`app.<tailnet-fqdn>`) → web service port 3000
- **MinIO S3** (`minio.<tailnet-fqdn>`) → MinIO port 9000
- **MinIO Console** (`minio-console.<tailnet-fqdn>`) → MinIO console port 9001
Each ingress correctly:
- Uses Tailscale ingress class
- Configures TLS with the appropriate hostname
- Routes to the correct service and port
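A quick reachability probe for all three hostnames, as a sketch (substitute your actual tailnet FQDN and run from a device joined to the tailnet):
```bash
# Probe each Tailscale hostname and print the HTTP status code.
# Any status (even 403 from the MinIO S3 API root) confirms the endpoint is reachable over TLS.
for host in app minio minio-console; do
  printf '%s: ' "$host"
  curl -sS -o /dev/null -w '%{http_code}\n' "https://${host}.<tailnet-fqdn>/"
done
```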
#### 2. Tailscale Service Option (LoadBalancer)
An alternative exposure method via a Tailscale LoadBalancer is available:
- `helm/porthole/templates/service-minio-tailscale-s3.yaml.tpl` - S3 API at `minio.<tailnet-fqdn>`
- `helm/porthole/templates/service-minio-tailscale-console.yaml.tpl` - Console at `minio-console.<tailnet-fqdn>`
Currently disabled (`minio.tailscaleServiceS3.enabled: false` in values.yaml).
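To preview what these templates would render before enabling them, a hedged sketch (assuming the chart lives at `helm/porthole` and the templates render when the flag is set; nothing is applied to the cluster):
```bash
# Render only the Tailscale S3 LoadBalancer service with the flag enabled
helm template porthole helm/porthole -f values.yaml \
  --set minio.tailscaleServiceS3.enabled=true \
  --show-only templates/service-minio-tailscale-s3.yaml.tpl
```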
#### 3. Node Scheduling
All heavy workloads are configured with `schedulingClass: compute`:
- web (1Gi limit)
- worker (2Gi limit)
- postgres (2Gi limit)
- redis (512Mi limit)
- minio (2Gi limit)
The scheduling helper (`_helpers.tpl:40-46`) applies the `scheduling.compute.affinity` block, which now requires nodes labeled `node-class=compute` (see Issue 1 below).
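To confirm the helper emits the expected affinity block, the rendered manifests can be inspected locally (a sketch; adjust the release name and values file to your setup):
```bash
# Render the chart and show every nodeAffinity block the helper produces
helm template porthole helm/porthole -f values.yaml | grep -B 3 -A 12 'nodeAffinity:'
```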
#### 4. Longhorn PVCs
Both stateful workloads use Longhorn PVCs:
- Postgres: 20Gi storage
- MinIO: 200Gi storage
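A sketch for checking the storage side (the storage class name `longhorn` is an assumption; use whatever class the chart actually references):
```bash
# Confirm the Longhorn storage class exists
kubectl get storageclass longhorn
# Show each PVC with its storage class, requested size, and binding status
kubectl get pvc -n porthole \
  -o custom-columns=NAME:.metadata.name,CLASS:.spec.storageClassName,SIZE:.spec.resources.requests.storage,STATUS:.status.phase
```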
#### 5. Resource Limits
All workloads have appropriate resource requests and limits for Pi hardware:
- Web: 200m CPU / 256Mi → 1000m CPU / 1Gi
- Worker: 500m CPU / 1Gi → 2000m CPU / 2Gi
- Postgres: 500m CPU / 1Gi → 1500m CPU / 2Gi
- Redis: 50m CPU / 128Mi → 300m CPU / 512Mi
- MinIO: 250m CPU / 512Mi → 1500m CPU / 2Gi
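To sanity-check that the combined requests fit on the Pi 5 nodes, compare per-node allocatable capacity against what the scheduler has already placed (plain kubectl, no chart-specific assumptions):
```bash
# Per-node summary of currently allocated requests/limits
kubectl describe nodes | grep -A 8 'Allocated resources'
# Allocatable CPU and memory per node at a glance
kubectl get nodes -o custom-columns=NAME:.metadata.name,CPU:.status.allocatable.cpu,MEM:.status.allocatable.memory
```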
#### 6. Cleanup CronJob
Staging cleanup is properly configured but disabled by default:
- Only targets `staging/` prefix (safe, never touches `originals/`)
- Removes files older than 14 days
- Must be enabled manually: `cronjobs.cleanupStaging.enabled: true`
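Once enabled, a one-off run can be triggered from the CronJob to verify it only touches `staging/` (a sketch; the CronJob name below is hypothetical, check the real name first):
```bash
# List CronJobs to find the actual name
kubectl get cronjobs -n porthole
# Create a one-off Job from the CronJob (name is a placeholder) and follow its logs
kubectl create job --from=cronjob/porthole-cleanup-staging cleanup-staging-manual -n porthole
kubectl logs -n porthole job/cleanup-staging-manual -f
```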
---
### ⚠️ Issues & Recommendations
#### 1. Node Affinity Now Uses "Required"
**Status:** ✅ Fixed - Affinity changed to `requiredDuringSchedulingIgnoredDuringExecution`.
All heavy workloads now require `node-class=compute` nodes (Pi 5). The Pi 3 node is tainted with `capacity=low:NoExecute`, which provides an additional safeguard preventing any pods from being scheduled on it.
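To verify the cluster side of this, a sketch using the label and taint named above (node names are placeholders):
```bash
# Confirm which nodes carry the compute label
kubectl get nodes -L node-class
# Confirm the Pi 3 taint is present
kubectl describe node <pi3-node> | grep -i taints
```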
**Alternative:** Keep the preferred affinity for compute nodes and add a required `NotIn` rule that excludes the Pi 3 node (requires labeling Pi 3 with `node-class=tiny`):
```yaml
scheduling:
  compute:
    affinity:
      nodeAffinity:
        preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            preference:
              matchExpressions:
                - key: node-class
                  operator: In
                  values:
                    - compute
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
            - matchExpressions:
                - key: node-class
                  operator: NotIn
                  values:
                    - tiny
```
#### 2. No Range Request Optimizations on Ingress
**Issue:** The Tailscale ingress resources (`ingress-tailscale.yaml.tpl`) don't have annotations for proxy timeout or buffer settings that are important for video streaming and Range requests.
**Risk:** Video seeking may be unreliable or fail for large files through Tailscale Ingress.
**Recommendation 1 (Preferred):** Enable Tailscale LoadBalancer Service for MinIO S3 instead of Ingress. This provides a more direct connection for streaming:
```yaml
# In values.yaml
minio:
  tailscaleServiceS3:
    enabled: true
    hostnameLabel: minio
```
This will:
- Create a LoadBalancer service accessible via `https://minio.<tailnet-fqdn>`
- Provide more reliable Range request support
- Bypass potential ingress buffering issues
**Recommendation 2 (If using Ingress):** Add custom annotations for timeout/buffer tuning to `values.yaml`:
```yaml
minio:
  ingressS3:
    extraAnnotations:
      nginx.ingress.kubernetes.io/proxy-body-size: "500m"
      nginx.ingress.kubernetes.io/proxy-request-buffering: "off"
      nginx.ingress.kubernetes.io/proxy-max-temp-file-size: "0"
      nginx.ingress.kubernetes.io/proxy-read-timeout: "600"
      nginx.ingress.kubernetes.io/proxy-send-timeout: "600"
```
Note: These annotations are specific to nginx ingress. If using Tailscale ingress, check Tailscale documentation for equivalent settings.
#### 3. Cleanup CronJob Disabled by Default
**Issue:** `cronjobs.cleanupStaging.enabled: false` in values.yaml means old staging files will accumulate indefinitely.
**Risk:** Staging files from failed or interrupted uploads will fill up the MinIO PVC over time.
**Recommendation:** Enable cleanup after initial testing:
```bash
helm upgrade --install porthole helm/porthole -f values.yaml \
--set cronjobs.cleanupStaging.enabled=true
```
Or set in values.yaml:
```yaml
cronjobs:
  cleanupStaging:
    enabled: true
```
---
## Deployment Validation Commands
### 1. Verify Pod Scheduling
```bash
# Check all pods are on Pi 5 nodes (not Pi 3)
kubectl get pods -n porthole -o wide
# Expected: All pods except optional cronjobs should be on nodes with node-class=compute
```
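To summarise placement at a glance, group pods by the node they landed on (plain kubectl, no assumptions beyond the namespace):
```bash
# Count pods per node
kubectl get pods -n porthole -o custom-columns=NODE:.spec.nodeName --no-headers | sort | uniq -c
```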
### 2. Verify Tailscale Endpoints
```bash
# Check Tailscale ingress status
kubectl get ingress -n porthole
# If LoadBalancer service enabled:
kubectl get svc -n porthole -l app.kubernetes.io/component=minio
```
### 3. Verify PVCs
```bash
# Check Longhorn PVCs are created and bound
kubectl get pvc -n porthole
# Check PVC status
kubectl describe pvc -n porthole | grep -A 5 "Status:"
```
### 4. Verify Resource Usage
```bash
# Check current resource usage
kubectl top pods -n porthole
# Check resource requests/limits
kubectl get pods -n porthole -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{range .spec.containers[*]} {.name}: CPU={.resources.requests.cpu}→{.resources.limits.cpu}, MEM={.resources.requests.memory}→{.resources.limits.memory}{"\n"}{end}{"\n"}{end}'
```
### 5. Test Presigned URL (HTTPS)
```bash
# Get presigned URL (replace <asset-id>)
curl -sS "https://app.<tailnet-fqdn>/api/assets/<asset-id>/url?variant=original" | jq .url
# Expected: URL starts with "https://minio.<tailnet-fqdn>..."
# NOT "http://..."
```
### 6. Test Range Request Support
```bash
# Get presigned URL
URL=$(curl -sS "https://app.<tailnet-fqdn>/api/assets/<asset-id>/url?variant=original" | jq -r .url)
# Test Range request (request first 1KB)
curl -sS -D- -H 'Range: bytes=0-1023' "$URL" -o /dev/null
# Expected: HTTP/1.1 206 Partial Content
# If you see 200 OK, Range requests are not working
```
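Seeking in a video translates to mid-file Range requests, so it is worth testing a window beyond the first kilobyte as well (a sketch; assumes the object is larger than 1 MiB):
```bash
# Emulate a seek: request a 1 KiB window starting at the 1 MiB offset
curl -sS -D- -o /dev/null -H 'Range: bytes=1048576-1049599' "$URL" | grep -iE '^(HTTP/|content-range)'
# Expected: 206 with "Content-Range: bytes 1048576-1049599/<total>"
```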
### 7. Verify Worker Concurrency
```bash
# Check BullMQ configuration in worker
kubectl exec -n porthole deployment/porthole-worker -- cat /app/src/index.ts | grep -A 5 "concurrency"
# Expected: concurrency: 1 (or at most 2 for Pi hardware)
```
### 8. Test Timeline with Failed Assets
```bash
# Query timeline with failed assets included
curl -sS "https://app.<tailnet-fqdn>/api/tree?includeFailed=1" | jq '.nodes[] | select(.count_ready < .count_total)'
# Should return nodes where some assets have status != 'ready'
```
### 9. Database Verification
```bash
# Connect to Postgres
kubectl exec -it -n porthole statefulset/porthole-postgres -- psql -U porthole -d porthole
-- Check failed assets
SELECT id, media_type, status, error_message, date_confidence FROM assets WHERE status = 'failed' LIMIT 10;
-- Check assets without capture date (should not appear in timeline)
SELECT COUNT(*) FROM assets WHERE capture_ts_utc IS NULL;
-- Verify external originals not copied to canonical
SELECT COUNT(*) FROM assets WHERE source_key LIKE 'originals/%' AND canonical_key IS NOT NULL;
-- Should be 0
```
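The same checks can be run non-interactively, which is handy for scripting (a sketch using psql's `-Atc` flags for unaligned, tuples-only, one-shot output):
```bash
# One-shot query without opening an interactive psql session
kubectl exec -n porthole statefulset/porthole-postgres -- \
  psql -U porthole -d porthole -Atc "SELECT status, COUNT(*) FROM assets GROUP BY status;"
```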
---
## End-to-End Deployment Verification Checklist
### Pre-Deployment
- [ ] Label Pi 5 nodes: `kubectl label node <pi5-node-1> node-class=compute`
- [ ] Label Pi 5 nodes: `kubectl label node <pi5-node-2> node-class=compute`
- [ ] Ensure the Pi 3 node is tainted: `kubectl taint node <pi3-node> capacity=low:NoExecute`
- [ ] Set `global.tailscale.tailnetFQDN` in values.yaml
- [ ] Set secret values (postgres password, minio credentials)
- [ ] Build and push multi-arch images to registry
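For the image build step, a minimal Buildx sketch (the registry, image name, tag, and Dockerfile path are placeholders; the Pi 5 nodes are arm64, so that is the only required platform):
```bash
# Build and push an arm64 image (repeat for the web and worker images)
docker buildx build \
  --platform linux/arm64 \
  -f <dockerfile> \
  -t <registry>/porthole-web:<tag> \
  --push .
```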
### Deployment
```bash
# Install Helm chart
helm install porthole helm/porthole -f values.yaml --namespace porthole --create-namespace
# Wait for pods to be ready
kubectl wait --for=condition=ready pod -l app.kubernetes.io/name=porthole -n porthole --timeout=10m
```
### Post-Deployment Verification
- [ ] All pods are running on Pi 5 nodes (check `kubectl get pods -n porthole -o wide`)
- [ ] PVCs are created and bound (`kubectl get pvc -n porthole`)
- [ ] Tailscale endpoints are accessible:
  - [ ] `https://app.<tailnet-fqdn>` - web UI loads
  - [ ] `https://minio.<tailnet-fqdn>` - MinIO S3 accessible (mc ls)
  - [ ] `https://minio-console.<tailnet-fqdn>` - MinIO console loads
- [ ] Presigned URLs use HTTPS and point to tailnet hostname
- [ ] Range requests return 206 Partial Content
- [ ] Upload flow works: `/admin` → upload → asset appears in timeline
- [ ] Scan flow works: trigger scan → `originals/` indexed → timeline populated
- [ ] Failed assets show as placeholders without breaking UI
- [ ] Video playback works for supported codecs; poster shown for unsupported
- [ ] Worker memory usage stays within 2Gi limit during large file processing
- [ ] No mixed-content warnings in browser console
### Performance Validation
- [ ] Timeline tree loads and remains responsive
- [ ] Zoom/pan works smoothly on mobile (test touch)
- [ ] Video seeking works without stutter
- [ ] Worker processes queue without OOM
- [ ] Postgres memory stays within 2Gi
- [ ] MinIO memory stays within 2Gi
---
## High-Risk Areas Summary
| Risk | Impact | Likelihood | Mitigation |
| -------------------------------------- | -------------------------------------- | ---------- | ------------------------------------------------------------------- |
| Pi 3 node receives heavy pod | OOMKilled, cluster instability | Very Low | Required affinity + capacity=low:NoExecute taint prevent scheduling |
| Tailscale Ingress Range request issues | Video seeking broken, poor UX | Medium | Enable `tailscaleServiceS3.enabled: true` for MinIO |
| Worker OOM on large video processing | Worker crashes, queue stalls | Low | Concurrency=1 already set; monitor memory during testing |
| MinIO presigned URL expiration | Videos stop playing mid-session | Low | 900s TTL is reasonable; user can re-open viewer |
| Staging files accumulate | Disk fills up | Medium | Enable `cleanupStaging.enabled: true` |
| Missing error boundaries | Component crashes show unhandled error | Low | Error boundaries now implemented |
---
## Next Steps
1. **Node affinity** already updated to `required` for the compute class (see Issue 1); optionally add the `NotIn` exclusion for the Pi 3 node
2. **Enable Tailscale LoadBalancer service** for MinIO S3 for reliable Range requests
3. **Enable cleanup CronJob** after initial testing: `--set cronjobs.cleanupStaging.enabled=true`
4. **Deploy to cluster** and run validation commands
5. **Perform end-to-end testing** with real media (upload + scan)
6. **Monitor resource usage** during typical operations to confirm limits are appropriate