Runbooks
Each PrometheusRule alert shipped by the chart links here through the runbook_url annotation. Open the matching playbook before opening the dashboard — every page follows the same shape so you can read it under page-pressure:
- What it means — one paragraph in plain English.
- Likely causes — ranked by frequency.
- Diagnosis — copy-pasteable kubectl/psql snippets.
- Fixes — quick-fix first, nuke-and-restore last.
- Escalation — when to wake somebody else up.
The bundle is opt-in through the chart's observability.alerts.enabled toggle. See Helm chart values for the full list of overrides — every alert can be silenced individually, every threshold tuned per environment.
The ten alerts
Critical (page immediately)
| Alert | Triggers when | Runbook |
|---|---|---|
| OrbitalRegAPIDown | API target absent from Prometheus for 3 min | open |
| OrbitalRegDBDown | Postgres target absent from Prometheus for 2 min | open |
| OrbitalRegS3MirrorFailing | Backup S3 endpoint has pending retry queue for 30 min | open |
| OrbitalRegHighErrorRate | API 5xx ratio > 5% for 5 min | open |
| OrbitalRegSAMLDown | SAML IdP unreachable for 5 min | open |
Warning (Slack during business hours)
| Alert | Triggers when | Runbook |
|---|---|---|
| OrbitalRegBackupStale | No successful backup verification in 8 days | open |
| OrbitalRegScanQueueBacklog | Detection queue depth > 1000 for 10 min | open |
| OrbitalRegCertExpiring | An operator-managed cert is < 14 days from expiry | open |
| OrbitalRegLicenseExpiring | The active license envelope is < 30 days from expiry | open |
Info (dashboard only)
| Alert | Triggers when | Runbook |
|---|---|---|
| OrbitalRegHighDiskUsage | An OrbitalReg PVC > 85% full for 10 min | open |
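
To make the "Triggers when" column concrete, here is a rough sketch of what a rule such as OrbitalRegHighDiskUsage could look like once rendered. It is not the chart's actual template: the object name, group name, namespace selector, summary text, and runbook URL are placeholders, and the expression assumes the standard kubelet volume metrics are being scraped.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: orbitalreg-alerts-sketch        # hypothetical object name
spec:
  groups:
    - name: orbitalreg.storage          # hypothetical group name
      rules:
        - alert: OrbitalRegHighDiskUsage
          # Ratio of used to total bytes per PVC, from kubelet volume stats.
          # The namespace selector is an assumption; adjust to your install.
          expr: |
            kubelet_volume_stats_used_bytes{namespace="orbitalreg"}
              / kubelet_volume_stats_capacity_bytes{namespace="orbitalreg"}
              > 0.85
          for: 10m                      # "for 10 min" in the table above
          labels:
            severity: info              # dashboard only
          annotations:
            summary: 'PVC {{ $labels.persistentvolumeclaim }} is over 85% full'
            # Placeholder URL; the shipped rule points at its own runbook page.
            runbook_url: https://docs.example.com/runbooks/high-disk-usage
```

The `for` clause and the `severity` label correspond to the duration and tier in the tables, and the `runbook_url` annotation is what links the alert back to its playbook.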
Severity policy
The severities ship as Prometheus labels (severity: critical | warning | info); they're meant to feed AlertManager routing rules the operator owns, not to dictate them. A typical mapping:
    route:
      routes:
        - matchers: [severity="critical"]
          receiver: pagerduty-onCall
        - matchers: [severity="warning"]
          receiver: slack-platform
        - matchers: [severity="info"]
          receiver: slack-platform-noisy

Disabling individual alerts
Every alert has an enabled toggle:
    observability:
      alerts:
        enabled: true
        rules:
          backupStale:
            enabled: false      # the dedicated backup-verify PrometheusRule
                                # already covers this
          highDiskUsage:
            enabled: true
            thresholdPct: 90    # tune per cluster

backupStale ships disabled by default in the bundle because the chart's separate backup-verify-prometheusrule.yaml (gated by backupVerification.alert.enabled) already publishes the equivalent rule with the same expression. Flip the bundle's copy on if you run the verifier in a sibling cluster that scrapes this Prometheus.
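
For the sibling-cluster case, a values override might look roughly like the sketch below. The file name is hypothetical, and turning off backupVerification.alert.enabled is an assumption made here to avoid publishing the same rule twice; leave it on if this release still needs the dedicated rule.

```yaml
# values-sibling-verifier.yaml (hypothetical file name)
observability:
  alerts:
    enabled: true
    rules:
      backupStale:
        enabled: true     # use the bundle's copy of the rule

backupVerification:
  alert:
    enabled: false        # assumption: the dedicated rule is not wanted here
```

Enabling both copies at once would presumably fire two alerts for the same condition, since the text above says they share one expression.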