OrbitalRegBackupStale
Severity: warning · For: 15m · Runbook owner: platform on-call
What it means
The bundled backup-verification CronJob has not produced a status='pass' row for more than staleAfterSeconds (default 8 days, chosen to cover one missed weekly tick plus a grace window).
The job restores the latest base backup into an ephemeral CNPG cluster and runs a smoke-test suite against it; if it has not run, an unverified backup is on disk. That's not a backup, that's a hope.
This alert ships disabled by default in the alerts bundle — the chart's separate backup-verify-prometheusrule.yaml (gated by backupVerification.alert.enabled) already publishes the equivalent rule. It's here for clusters that scrape verification metrics from a sibling cluster.
Likely causes
- The CronJob is failing (image pull, RBAC, expired admin token).
- The ephemeral Cluster CR cannot be created (CNPG operator down, ResourceQuota exhausted).
- Restore from Barman is timing out — S3 throughput dropped, base backup is too large for
activeDeadlineSeconds. - Smoke tests are failing on the restored DB — the schema migrated but the seed-test data didn't.
Diagnose
# Most recent CronJob runs
kubectl -n orbitalreg get jobs -l app.kubernetes.io/component=backup-verification \
--sort-by=.metadata.creationTimestamp | tail -5
# Logs from the last failed run
kubectl -n orbitalreg logs job/<job-name> --tail=200
# Token validity (replace with real secret)
kubectl -n orbitalreg get secret orbitalreg-backup-verify-token -o jsonpath='{.data.token}' \
| base64 -d | head -c 40 ; echo
# Verification history from the API
curl -sf -H "Authorization: Bearer $TOKEN" \
https://orbitalreg.example.com/api/admin/backup-verifications/history \
| jq '.[0:5]'The Admin → Backups page renders the same history with a green/red badge per run, plus the Barman base-backup metadata.
Fix
- Token expired — mint a fresh org-admin token, rotate the
backup-verify-tokenSecret, the next CronJob tick picks it up. - Restore timeout — bump
backupVerification.activeDeadlineSecondsandtimeoutSeconds. Default 1h covers a 30 GiB DB; larger DBs need proportionally more. - Schedule slipped —
kubectl create job --from=cronjob/<name> manual-verify-$(date +%s)to re-run immediately. If it passes, the gauge bumps and the alert clears. - Smoke test regression — see logs for the failing assertion. The smoke-test scripts live in the chart's
backup-verify-scripts.yamlConfigMap.
Escalate
- Three consecutive scheduled runs failed with the same error → page the platform-DB team. Likely a Barman corruption or S3 path issue that requires inspecting the upstream backup directly.
- The verification table has a
status='pass'row but the gauge hasn't bumped → restart the API; the gauge primes from the table on boot.