Skip to content

OrbitalRegBackupStale

Severity: warning · For: 15m · Runbook owner: platform on-call

What it means

The bundled backup-verification CronJob has not produced a status='pass' row for more than staleAfterSeconds (default 8 days, chosen to cover one missed weekly tick plus a grace window).

The job restores the latest base backup into an ephemeral CNPG cluster and runs a smoke-test suite against it; if it has not run, an unverified backup is on disk. That's not a backup, that's a hope.

This alert ships disabled by default in the alerts bundle — the chart's separate backup-verify-prometheusrule.yaml (gated by backupVerification.alert.enabled) already publishes the equivalent rule. It's here for clusters that scrape verification metrics from a sibling cluster.

Likely causes

  1. The CronJob is failing (image pull, RBAC, expired admin token).
  2. The ephemeral Cluster CR cannot be created (CNPG operator down, ResourceQuota exhausted).
  3. Restore from Barman is timing out — S3 throughput dropped, base backup is too large for activeDeadlineSeconds.
  4. Smoke tests are failing on the restored DB — the schema migrated but the seed-test data didn't.

Diagnose

bash
# Most recent CronJob runs
kubectl -n orbitalreg get jobs -l app.kubernetes.io/component=backup-verification \
  --sort-by=.metadata.creationTimestamp | tail -5

# Logs from the last failed run
kubectl -n orbitalreg logs job/<job-name> --tail=200

# Token validity (replace with real secret)
kubectl -n orbitalreg get secret orbitalreg-backup-verify-token -o jsonpath='{.data.token}' \
  | base64 -d | head -c 40 ; echo

# Verification history from the API
curl -sf -H "Authorization: Bearer $TOKEN" \
  https://orbitalreg.example.com/api/admin/backup-verifications/history \
  | jq '.[0:5]'

The Admin → Backups page renders the same history with a green/red badge per run, plus the Barman base-backup metadata.

Fix

  1. Token expired — mint a fresh org-admin token, rotate the backup-verify-token Secret, the next CronJob tick picks it up.
  2. Restore timeout — bump backupVerification.activeDeadlineSeconds and timeoutSeconds. Default 1h covers a 30 GiB DB; larger DBs need proportionally more.
  3. Schedule slippedkubectl create job --from=cronjob/<name> manual-verify-$(date +%s) to re-run immediately. If it passes, the gauge bumps and the alert clears.
  4. Smoke test regression — see logs for the failing assertion. The smoke-test scripts live in the chart's backup-verify-scripts.yaml ConfigMap.

Escalate

  • Three consecutive scheduled runs failed with the same error → page the platform-DB team. Likely a Barman corruption or S3 path issue that requires inspecting the upstream backup directly.
  • The verification table has a status='pass' row but the gauge hasn't bumped → restart the API; the gauge primes from the table on boot.

Released under the Apache-2.0 License.