Backup verification
A backup that hasn't been restored isn't a backup. OrbitalReg ships a weekly CronJob that:
- Pulls the most recent base + WAL backup from the Barman target
- Provisions an ephemeral CNPG cluster in a side namespace
- Bootstraps it from the backup (
bootstrap.recovery) - Runs a smoke suite that exercises every API read path
- Writes the result into
backup_verification_runs - Tears the ephemeral cluster down
The result feeds:
- A Prometheus metric (
orbitalreg_backup_verification_age_seconds) - An alert that fires when the metric goes stale (default: > 8 days)
- An admin UI card under Admin → System → Backup health
What's verified
The smoke suite walks the high-value tables and asserts:
| Table | Smoke check |
|---|---|
projects | Row count > 0 |
repositories | Row count matches the production cluster within ±1% |
artifacts | Random sample of 100 rows has matching S3 keys (HEAD round-trip) |
scan_findings | Row count > 0 (sanity) |
users + service_accounts | Auth round-trip succeeds against the restored DB |
webhook_subscriptions | URL + has_secret bit reads match production |
retention_audit | Latest row's created_at is no older than the configured backup window |
Failures are written to backup_verification_runs.failure_reasons as a JSONB array so the admin UI can render which check tripped.
Alert routing
The default PrometheusRule ships with the chart:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
name: orbitalreg-backup-stale
spec:
groups:
- name: orbitalreg.backup
rules:
- alert: OrbitalRegBackupStale
expr: |
time() - orbitalreg_backup_verification_last_success_timestamp > 8 * 24 * 3600
for: 30m
labels:
severity: warning
annotations:
summary: "OrbitalReg backup verification has not succeeded in over 8 days"
runbook_url: "https://docs.orbitalreg.io/operations/backup-verification"Tuning
values.backupVerification.schedule defaults to 0 4 * * 1 (Mondays at 04:00 UTC). The verification cluster's resource shape matches the production cluster by default — override under values.backupVerification.resources if you'd rather run a smaller verification cluster.
When verification fails
A failed run is not automatically retried. The signal is the alert; the human action is to investigate which check tripped:
kubectl exec -n orbitalreg deploy/orbitalreg-api -- \
/bin/orbitalreg-cli admin backup-verification last-failureCommon failures:
HEAD on S3 key 404— the backup is older than the artifact in question, or S3 retention pruned the key. Review the S3 bucket's lifecycle policy.schema mismatch— the migration set in the backup is not the same as the migration set in the test image. Re-run withverification.image.tagpinned to the same tag the cluster is running.row count diverges by > 1%— usually a Barman / S3 hiccup; inspect the WAL stream for gaps.
Disabling
For tiny installs where the verification cost outweighs the value:
backupVerification:
enabled: falseThe admin UI card then renders a banner suggesting an external check (e.g. an external monitoring service that pings /health/backup before paging).