feat(heartbeat): add provider error-rate spike check

William Valentin
2026-02-16 13:52:40 -08:00
parent 07340ff0af
commit 71af3b5a42
8 changed files with 120 additions and 6 deletions
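
The diff below documents two options, `provider_error_rate_threshold` and `provider_error_min_calls`, but this page does not show the check itself. As a rough illustration of how the two might interact, here is a minimal Python sketch. It assumes per-provider call and error counters are already available. `ProviderStats` and `find_error_rate_breaches` are hypothetical names, not taken from the commit. The `>=` comparison and the minimum-call gate follow the config comments in the diff; the counting window and the actual implementation may differ.

```python
from dataclasses import dataclass


@dataclass
class ProviderStats:
    """Hypothetical per-provider counters; not the project's real type."""
    calls: int
    errors: int


def find_error_rate_breaches(stats, rate_threshold=0.5, min_calls=5):
    """Return {provider: error_rate} for providers at or above the threshold.

    Providers with fewer than `min_calls` calls are skipped, so a handful
    of early failures does not trigger a warning.
    """
    breaches = {}
    for provider, s in stats.items():
        # Skip providers without enough samples (also avoids dividing by zero).
        if s.calls == 0 or s.calls < min_calls:
            continue
        rate = s.errors / s.calls
        # The config comment says "Warn when provider error rate >= threshold".
        if rate >= rate_threshold:
            breaches[provider] = rate
    return breaches


if __name__ == "__main__":
    sample = {
        "provider_a": ProviderStats(calls=10, errors=5),  # exactly at threshold
        "provider_b": ProviderStats(calls=4, errors=4),   # too few calls to judge
        "provider_c": ProviderStats(calls=10, errors=1),  # healthy
    }
    print(find_error_rate_breaches(sample))
```

With the documented defaults, a provider that fails every one of four calls is ignored, while one that fails five of ten is reported, because the minimum-call gate is applied before the rate is compared.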
+6 -1
@@ -652,7 +652,7 @@ automation:
heartbeat:
enabled: true
interval: "5m" # Check every 5 minutes
-checks: [gateway, model, channels, memory, disk, process_memory, backup]
+checks: [gateway, model, channels, memory, disk, process_memory, backup, provider_errors]
notify:
channel: telegram
peer: "123456789"
@@ -660,6 +660,8 @@ automation:
disk_threshold_mb: 100 # Warn when <100MB free
process_memory_threshold_mb: 1500 # Warn when RSS memory exceeds threshold
backup_failure_threshold: 1 # Warn when backup failures meet threshold
+provider_error_rate_threshold: 0.5 # Warn when provider error rate >= threshold
+provider_error_min_calls: 5 # Minimum model calls per provider before evaluation
```
### Heartbeat Checks
@@ -673,6 +675,7 @@ automation:
| `disk` | Free disk space exceeds threshold |
| `process_memory` | Flynn process RSS memory usage stays under threshold |
| `backup` | Backup scheduler consecutive failures stay under threshold |
+| `provider_errors` | Model provider error rates stay below threshold |
The monitor sends a notification when failures reach the configured threshold and a recovery notification when all checks pass again.
@@ -689,6 +692,8 @@ The monitor sends a notification when failures reach the configured threshold an
| `disk_threshold_mb` | no | Disk space warning threshold in MB (default: `100`) |
| `process_memory_threshold_mb` | no | RSS memory threshold in MB for `process_memory` check (default: `1500`) |
| `backup_failure_threshold` | no | Consecutive backup failures threshold for `backup` check (default: `1`) |
+| `provider_error_rate_threshold` | no | Error-rate threshold (0..1) for `provider_errors` check (default: `0.5`) |
+| `provider_error_min_calls` | no | Minimum provider calls before applying error-rate threshold (default: `5`) |
## Gmail Pub/Sub Watcher