feat(heartbeat): add provider error-rate spike check
@@ -652,7 +652,7 @@ automation:
 heartbeat:
 enabled: true
 interval: "5m" # Check every 5 minutes
-checks: [gateway, model, channels, memory, disk, process_memory, backup]
+checks: [gateway, model, channels, memory, disk, process_memory, backup, provider_errors]
 notify:
 channel: telegram
 peer: "123456789"
@@ -660,6 +660,8 @@ automation:
 disk_threshold_mb: 100 # Warn when <100MB free
 process_memory_threshold_mb: 1500 # Warn when RSS memory exceeds threshold
 backup_failure_threshold: 1 # Warn when backup failures meet threshold
+provider_error_rate_threshold: 0.5 # Warn when provider error rate >= threshold
+provider_error_min_calls: 5 # Minimum model calls per provider before evaluation
 ```
 
 ### Heartbeat Checks
@@ -673,6 +675,7 @@ automation:
 | `disk` | Free disk space exceeds threshold |
 | `process_memory` | Flynn process RSS memory usage stays under threshold |
 | `backup` | Backup scheduler consecutive failures stay under threshold |
+| `provider_errors` | Model provider error rates stay below threshold |
 
 The monitor sends a notification when failures reach the configured threshold and a recovery notification when all checks pass again.
 
@@ -689,6 +692,8 @@ The monitor sends a notification when failures reach the configured threshold an
 | `disk_threshold_mb` | no | Disk space warning threshold in MB (default: `100`) |
 | `process_memory_threshold_mb` | no | RSS memory threshold in MB for `process_memory` check (default: `1500`) |
 | `backup_failure_threshold` | no | Consecutive backup failures threshold for `backup` check (default: `1`) |
+| `provider_error_rate_threshold` | no | Error-rate threshold (0..1) for `provider_errors` check (default: `0.5`) |
+| `provider_error_min_calls` | no | Minimum provider calls before applying error-rate threshold (default: `5`) |
 
 ## Gmail Pub/Sub Watcher
 
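
Note on how the two new settings might interact: the sketch below is only an illustration based on the documented defaults (`0.5` and `5`), not the code from this commit. `ProviderStats`, `failing_providers`, and the provider names are hypothetical, and it assumes the check has per-provider call and error counters available. A provider with fewer than `provider_error_min_calls` calls is skipped, and the comparison is `>=`, matching the config comment.

```python
from dataclasses import dataclass


@dataclass
class ProviderStats:
    """Hypothetical per-provider counters (not taken from the Flynn codebase)."""
    calls: int
    errors: int


def failing_providers(stats, rate_threshold=0.5, min_calls=5):
    """Return the names of providers whose error rate is at or above the threshold.

    Providers with fewer than `min_calls` calls are not evaluated, so a single
    failed call on a rarely used provider does not trigger a warning.
    """
    failing = []
    for name, s in sorted(stats.items()):
        # Skip providers without enough traffic (also avoids dividing by zero).
        if s.calls == 0 or s.calls < min_calls:
            continue
        if s.errors / s.calls >= rate_threshold:
            failing.append(name)
    return failing


stats = {
    "provider_a": ProviderStats(calls=10, errors=5),  # rate 0.5, meets threshold
    "provider_b": ProviderStats(calls=4, errors=4),   # below min_calls, skipped
    "provider_c": ProviderStats(calls=20, errors=3),  # rate 0.15, healthy
}
print(failing_providers(stats))
```

Under these assumptions only `provider_a` is reported: `provider_b` fails every call but has not reached the minimum call count, which is the case `provider_error_min_calls` is meant to filter out.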