# Task 11 - Kubernetes Deployment Validation Report
## Configuration Review Summary
### ✅ Correctly Configured
#### 1. Tailscale Ingress
All three ingress resources are properly defined:
- **App** (`app.<tailnet-fqdn>`) → web service port 3000
- **MinIO S3** (`minio.<tailnet-fqdn>`) → MinIO port 9000
- **MinIO Console** (`minio-console.<tailnet-fqdn>`) → MinIO console port 9001
Each ingress correctly:
- Uses Tailscale ingress class
- Configures TLS with the appropriate hostname
- Routes to the correct service and port
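A quick reachability probe for all three hostnames, as a sketch (substitute your actual tailnet FQDN and run from a device joined to the tailnet):
```bash
# Probe each Tailscale hostname and print the HTTP status code.
# Any status (even 403 from the MinIO S3 API root) confirms the endpoint is reachable over TLS.
for host in app minio minio-console; do
  printf '%s: ' "$host"
  curl -sS -o /dev/null -w '%{http_code}\n' "https://${host}.<tailnet-fqdn>/"
done
```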
#### 2. Tailscale Service Option (LoadBalancer)
An alternative exposure method via a Tailscale LoadBalancer is available:
- `helm/porthole/templates/service-minio-tailscale-s3.yaml.tpl` - S3 API at `minio.<tailnet-fqdn>`
- `helm/porthole/templates/service-minio-tailscale-console.yaml.tpl` - Console at `minio-console.<tailnet-fqdn>`
Currently disabled (`minio.tailscaleServiceS3.enabled: false` in values.yaml).
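To preview what these templates would render before enabling them, a hedged sketch (assuming the chart lives at `helm/porthole` and the templates render when the flag is set; nothing is applied to the cluster):
```bash
# Render only the Tailscale S3 LoadBalancer service with the flag enabled
helm template porthole helm/porthole -f values.yaml \
  --set minio.tailscaleServiceS3.enabled=true \
  --show-only templates/service-minio-tailscale-s3.yaml.tpl
```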
#### 3. Node Scheduling
All heavy workloads are configured with `schedulingClass: compute`:
- web (1Gi limit)
- worker (2Gi limit)
- postgres (2Gi limit)
- redis (512Mi limit)
- minio (2Gi limit)
The scheduling helper (`_helpers.tpl:40-46`) applies the `scheduling.compute.affinity` block, which now requires nodes labeled `node-class=compute` (see Issue 1 below).
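To confirm the helper emits the expected affinity block, the rendered manifests can be inspected locally (a sketch; adjust the release name and values file to your setup):
```bash
# Render the chart and show every nodeAffinity block the helper produces
helm template porthole helm/porthole -f values.yaml | grep -B 3 -A 12 'nodeAffinity:'
```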
#### 4. Longhorn PVCs
Both stateful workloads use Longhorn PVCs:
- Postgres: 20Gi storage
- MinIO: 200Gi storage
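A sketch for checking the storage side (the storage class name `longhorn` is an assumption; use whatever class the chart actually references):
```bash
# Confirm the Longhorn storage class exists
kubectl get storageclass longhorn
# Show each PVC with its storage class, requested size, and binding status
kubectl get pvc -n porthole \
  -o custom-columns=NAME:.metadata.name,CLASS:.spec.storageClassName,SIZE:.spec.resources.requests.storage,STATUS:.status.phase
```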
#### 5. Resource Limits
All workloads have appropriate resource requests and limits for Pi hardware:
- Web: 200m CPU / 256Mi → 1000m CPU / 1Gi
- Worker: 500m CPU / 1Gi → 2000m CPU / 2Gi
- Postgres: 500m CPU / 1Gi → 1500m CPU / 2Gi
- Redis: 50m CPU / 128Mi → 300m CPU / 512Mi
- MinIO: 250m CPU / 512Mi → 1500m CPU / 2Gi
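To sanity-check that the combined requests fit on the Pi 5 nodes, compare per-node allocatable capacity against what the scheduler has already placed (plain kubectl, no chart-specific assumptions):
```bash
# Per-node summary of currently allocated requests/limits
kubectl describe nodes | grep -A 8 'Allocated resources'
# Allocatable CPU and memory per node at a glance
kubectl get nodes -o custom-columns=NAME:.metadata.name,CPU:.status.allocatable.cpu,MEM:.status.allocatable.memory
```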
#### 6. Cleanup CronJob
Staging cleanup is properly configured but disabled by default:
- Only targets `staging/` prefix (safe, never touches `originals/`)
- Removes files older than 14 days
- Must be enabled manually: `cronjobs.cleanupStaging.enabled: true`
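Once enabled, a one-off run can be triggered from the CronJob to verify it only touches `staging/` (a sketch; the CronJob name below is hypothetical, check the real name first):
```bash
# List CronJobs to find the actual name
kubectl get cronjobs -n porthole
# Create a one-off Job from the CronJob (name is a placeholder) and follow its logs
kubectl create job --from=cronjob/porthole-cleanup-staging cleanup-staging-manual -n porthole
kubectl logs -n porthole job/cleanup-staging-manual -f
```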
---
### ⚠️ Issues & Recommendations
#### 1. Node Affinity Now Uses "Required"
**Status:** ✅ Fixed - Affinity changed to `requiredDuringSchedulingIgnoredDuringExecution`.
All heavy workloads now require `node-class=compute` nodes (Pi 5). The Pi 3 node is tainted with `capacity=low:NoExecute`, which provides an additional safeguard preventing any pods from being scheduled on it.
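To verify the cluster side of this, a sketch using the label and taint named above (node names are placeholders):
```bash
# Confirm which nodes carry the compute label
kubectl get nodes -L node-class
# Confirm the Pi 3 taint is present
kubectl describe node <pi3-node> | grep -i taints
```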
**Alternative:** Keep the preferred affinity for compute nodes and add a required `NotIn` rule that excludes the Pi 3 node (requires labeling Pi 3 with `node-class=tiny`):
```yaml
scheduling:
  compute:
    affinity:
      nodeAffinity:
        preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            preference:
              matchExpressions:
                - key: node-class
                  operator: In
                  values:
                    - compute
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
            - matchExpressions:
                - key: node-class
                  operator: NotIn
                  values:
                    - tiny
```
#### 2. No Range Request Optimizations on Ingress
**Issue:** The Tailscale ingress resources (`ingress-tailscale.yaml.tpl`) don't have annotations for proxy timeout or buffer settings that are important for video streaming and Range requests.
**Risk:** Video seeking may be unreliable or fail for large files through Tailscale Ingress.
**Recommendation 1 (Preferred):** Enable Tailscale LoadBalancer Service for MinIO S3 instead of Ingress. This provides a more direct connection for streaming:
```yaml
# In values.yaml
minio:
  tailscaleServiceS3:
    enabled: true
    hostnameLabel: minio
```
This will:
- Create a LoadBalancer service accessible via `https://minio.<tailnet-fqdn>`
- Provide more reliable Range request support
- Bypass potential ingress buffering issues
**Recommendation 2 (If using Ingress):** Add custom annotations for timeout/buffer tuning to `values.yaml`:
```yaml
minio:
  ingressS3:
    extraAnnotations:
      nginx.ingress.kubernetes.io/proxy-body-size: "500m"
      nginx.ingress.kubernetes.io/proxy-request-buffering: "off"
      nginx.ingress.kubernetes.io/proxy-max-temp-file-size: "0"
      nginx.ingress.kubernetes.io/proxy-read-timeout: "600"
      nginx.ingress.kubernetes.io/proxy-send-timeout: "600"
```
Note: These annotations are specific to nginx ingress. If using Tailscale ingress, check Tailscale documentation for equivalent settings.
#### 3. Cleanup CronJob Disabled by Default
**Issue:** `cronjobs.cleanupStaging.enabled: false` in values.yaml means old staging files will accumulate indefinitely.
**Risk:** Staging files from failed or interrupted uploads will fill up the MinIO PVC over time.
**Recommendation:** Enable cleanup after initial testing:
```bash
helm upgrade --install porthole helm/porthole -f values.yaml \
--set cronjobs.cleanupStaging.enabled=true
```
Or set in values.yaml:
```yaml
cronjobs:
  cleanupStaging:
    enabled: true
```
---
## Deployment Validation Commands
### 1. Verify Pod Scheduling
```bash
# Check all pods are on Pi 5 nodes (not Pi 3)
kubectl get pods -n porthole -o wide
# Expected: All pods except optional cronjobs should be on nodes with node-class=compute
```
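To summarise placement at a glance, group pods by the node they landed on (plain kubectl, no assumptions beyond the namespace):
```bash
# Count pods per node
kubectl get pods -n porthole -o custom-columns=NODE:.spec.nodeName --no-headers | sort | uniq -c
```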
### 2. Verify Tailscale Endpoints
```bash
# Check Tailscale ingress status
kubectl get ingress -n porthole
# If LoadBalancer service enabled:
kubectl get svc -n porthole -l app.kubernetes.io/component=minio
```
### 3. Verify PVCs
```bash
# Check Longhorn PVCs are created and bound
kubectl get pvc -n porthole
# Check PVC status
kubectl describe pvc -n porthole | grep -A 5 "Status:"
```
### 4. Verify Resource Usage
```bash
# Check current resource usage
kubectl top pods -n porthole
# Check resource requests/limits
kubectl get pods -n porthole -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{range .spec.containers[*]} {.name}: CPU={.resources.requests.cpu}→{.resources.limits.cpu}, MEM={.resources.requests.memory}→{.resources.limits.memory}{"\n"}{end}{"\n"}{end}'
```
### 5. Test Presigned URL (HTTPS)
```bash
# Get presigned URL (replace <asset-id>)
curl -sS "https://app.<tailnet-fqdn>/api/assets/<asset-id>/url?variant=original" | jq .url
# Expected: URL starts with "https://minio.<tailnet-fqdn>..."
# NOT "http://..."
```
### 6. Test Range Request Support
```bash
# Get presigned URL
URL=$(curl -sS "https://app.<tailnet-fqdn>/api/assets/<asset-id>/url?variant=original" | jq -r .url)
# Test Range request (request first 1KB)
curl -sS -D- -H 'Range: bytes=0-1023' "$URL" -o /dev/null
# Expected: HTTP/1.1 206 Partial Content
# If you see 200 OK, Range requests are not working
```
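Seeking in a video translates to mid-file Range requests, so it is worth testing a window beyond the first kilobyte as well (a sketch; assumes the object is larger than 1 MiB):
```bash
# Emulate a seek: request a 1 KiB window starting at the 1 MiB offset
curl -sS -D- -o /dev/null -H 'Range: bytes=1048576-1049599' "$URL" | grep -iE '^(HTTP/|content-range)'
# Expected: 206 with "Content-Range: bytes 1048576-1049599/<total>"
```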
### 7. Verify Worker Concurrency
```bash
# Check BullMQ configuration in worker
kubectl exec -n porthole deployment/porthole-worker -- cat /app/src/index.ts | grep -A 5 "concurrency"
# Expected: concurrency: 1 (or at most 2 for Pi hardware)
```
### 8. Test Timeline with Failed Assets
```bash
# Query timeline with failed assets included
curl -sS "https://app.<tailnet-fqdn>/api/tree?includeFailed=1" | jq '.nodes[] | select(.count_ready < .count_total)'
# Should return nodes where some assets have status != 'ready'
```
### 9. Database Verification
```bash
# Connect to Postgres
kubectl exec -it -n porthole statefulset/porthole-postgres -- psql -U porthole -d porthole
-- Check failed assets
SELECT id, media_type, status, error_message, date_confidence FROM assets WHERE status = 'failed' LIMIT 10;
-- Check assets without capture date (should not appear in timeline)
SELECT COUNT(*) FROM assets WHERE capture_ts_utc IS NULL;
-- Verify external originals not copied to canonical
SELECT COUNT(*) FROM assets WHERE source_key LIKE 'originals/%' AND canonical_key IS NOT NULL;
-- Should be 0
```
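The same checks can be run non-interactively, which is handy for scripting (a sketch using psql's `-Atc` flags for unaligned, tuples-only, one-shot output):
```bash
# One-shot query without opening an interactive psql session
kubectl exec -n porthole statefulset/porthole-postgres -- \
  psql -U porthole -d porthole -Atc "SELECT status, COUNT(*) FROM assets GROUP BY status;"
```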
---
## End-to-End Deployment Verification Checklist
### Pre-Deployment
- [ ] Label Pi 5 nodes: `kubectl label node <pi5-node-1> node-class=compute`
- [ ] Label Pi 5 nodes: `kubectl label node <pi5-node-2> node-class=compute`
- [ ] Ensure the Pi 3 node is tainted: `kubectl taint node <pi3-node> capacity=low:NoExecute`
- [ ] Set `global.tailscale.tailnetFQDN` in values.yaml
- [ ] Set secret values (postgres password, minio credentials)
- [ ] Build and push multi-arch images to registry
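For the image build step, a minimal Buildx sketch (the registry, image name, tag, and Dockerfile path are placeholders; the Pi 5 nodes are arm64, so that is the only required platform):
```bash
# Build and push an arm64 image (repeat for the web and worker images)
docker buildx build \
  --platform linux/arm64 \
  -f <dockerfile> \
  -t <registry>/porthole-web:<tag> \
  --push .
```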
### Deployment
```bash
# Install Helm chart
helm install porthole helm/porthole -f values.yaml --namespace porthole --create-namespace
# Wait for pods to be ready
kubectl wait --for=condition=ready pod -l app.kubernetes.io/name=porthole -n porthole --timeout=10m
```
### Post-Deployment Verification
- [ ] All pods are running on Pi 5 nodes (check `kubectl get pods -n porthole -o wide`)
- [ ] PVCs are created and bound (`kubectl get pvc -n porthole`)
- [ ] Tailscale endpoints are accessible:
  - [ ] `https://app.<tailnet-fqdn>` - web UI loads
  - [ ] `https://minio.<tailnet-fqdn>` - MinIO S3 accessible (mc ls)
  - [ ] `https://minio-console.<tailnet-fqdn>` - MinIO console loads
- [ ] Presigned URLs use HTTPS and point to tailnet hostname
- [ ] Range requests return 206 Partial Content
- [ ] Upload flow works: `/admin` → upload → asset appears in timeline
- [ ] Scan flow works: trigger scan → `originals/` indexed → timeline populated
- [ ] Failed assets show as placeholders without breaking UI
- [ ] Video playback works for supported codecs; poster shown for unsupported
- [ ] Worker memory usage stays within 2Gi limit during large file processing
- [ ] No mixed-content warnings in browser console
### Performance Validation
- [ ] Timeline tree loads and remains responsive
- [ ] Zoom/pan works smoothly on mobile (test touch)
- [ ] Video seeking works without stutter
- [ ] Worker processes queue without OOM
- [ ] Postgres memory stays within 2Gi
- [ ] MinIO memory stays within 2Gi
---
## High-Risk Areas Summary
| Risk | Impact | Likelihood | Mitigation |
| -------------------------------------- | -------------------------------------- | ---------- | ------------------------------------------------------------------- |
| Pi 3 node receives heavy pod | OOMKilled, cluster instability | Very Low | Required affinity + capacity=low:NoExecute taint prevent scheduling |
| Tailscale Ingress Range request issues | Video seeking broken, poor UX | Medium | Enable `tailscaleServiceS3.enabled: true` for MinIO |
| Worker OOM on large video processing | Worker crashes, queue stalls | Low | Concurrency=1 already set; monitor memory during testing |
| MinIO presigned URL expiration | Videos stop playing mid-session | Low | 900s TTL is reasonable; user can re-open viewer |
| Staging files accumulate | Disk fills up | Medium | Enable `cleanupStaging.enabled: true` |
| Missing error boundaries | Component crashes show unhandled error | Low | Error boundaries now implemented |
---
## Next Steps
1. **Node affinity** already updated to `required` for the compute class (see Issue 1); optionally add the `NotIn` exclusion for the Pi 3 node
2. **Enable Tailscale LoadBalancer service** for MinIO S3 for reliable Range requests
3. **Enable cleanup CronJob** after initial testing: `--set cronjobs.cleanupStaging.enabled=true`
4. **Deploy to cluster** and run validation commands
5. **Perform end-to-end testing** with real media (upload + scan)
6. **Monitor resource usage** during typical operations to confirm limits are appropriate