Skip to content

Backup verification

A backup that hasn't been restored isn't a backup. OrbitalReg ships a weekly CronJob that:

  1. Pulls the most recent base + WAL backup from the Barman target
  2. Provisions an ephemeral CNPG cluster in a side namespace
  3. Bootstraps it from the backup (bootstrap.recovery)
  4. Runs a smoke suite that exercises every API read path
  5. Writes the result into backup_verification_runs
  6. Tears the ephemeral cluster down

The result feeds:

  • A Prometheus metric (orbitalreg_backup_verification_age_seconds)
  • An alert that fires when the metric goes stale (default: > 8 days)
  • An admin UI card under Admin → System → Backup health

What's verified

The smoke suite walks the high-value tables and asserts:

TableSmoke check
projectsRow count > 0
repositoriesRow count matches the production cluster within ±1%
artifactsRandom sample of 100 rows has matching S3 keys (HEAD round-trip)
scan_findingsRow count > 0 (sanity)
users + service_accountsAuth round-trip succeeds against the restored DB
webhook_subscriptionsURL + has_secret bit reads match production
retention_auditLatest row's created_at is no older than the configured backup window

Failures are written to backup_verification_runs.failure_reasons as a JSONB array so the admin UI can render which check tripped.

Alert routing

The default PrometheusRule ships with the chart:

yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: orbitalreg-backup-stale
spec:
  groups:
    - name: orbitalreg.backup
      rules:
        - alert: OrbitalRegBackupStale
          expr: |
            time() - orbitalreg_backup_verification_last_success_timestamp > 8 * 24 * 3600
          for: 30m
          labels:
            severity: warning
          annotations:
            summary: "OrbitalReg backup verification has not succeeded in over 8 days"
            runbook_url: "https://docs.orbitalreg.io/operations/backup-verification"

Tuning

values.backupVerification.schedule defaults to 0 4 * * 1 (Mondays at 04:00 UTC). The verification cluster's resource shape matches the production cluster by default — override under values.backupVerification.resources if you'd rather run a smaller verification cluster.

When verification fails

A failed run is not automatically retried. The signal is the alert; the human action is to investigate which check tripped:

bash
kubectl exec -n orbitalreg deploy/orbitalreg-api -- \
  /bin/orbitalreg-cli admin backup-verification last-failure

Common failures:

  • HEAD on S3 key 404 — the backup is older than the artifact in question, or S3 retention pruned the key. Review the S3 bucket's lifecycle policy.
  • schema mismatch — the migration set in the backup is not the same as the migration set in the test image. Re-run with verification.image.tag pinned to the same tag the cluster is running.
  • row count diverges by > 1% — usually a Barman / S3 hiccup; inspect the WAL stream for gaps.

Disabling

For tiny installs where the verification cost outweighs the value:

yaml
backupVerification:
  enabled: false

The admin UI card then renders a banner suggesting an external check (e.g. an external monitoring service that pings /health/backup before paging).

Released under the Apache-2.0 License.