Backup verification

A backup that hasn't been restored isn't a backup. OrbitalReg ships a weekly CronJob that:

Pulls the most recent base + WAL backup from the Barman target
Provisions an ephemeral CNPG cluster in a side namespace
Bootstraps it from the backup (bootstrap.recovery)
Runs a smoke suite that exercises every API read path
Writes the result into backup_verification_runs
Tears the ephemeral cluster down

The result feeds:

A Prometheus metric (orbitalreg_backup_verification_age_seconds)
An alert that fires when the metric goes stale (default: > 8 days)
An admin UI card under Admin → System → Backup health

What's verified

The smoke suite walks the high-value tables and asserts:

Table	Smoke check
`projects`	Row count > 0
`repositories`	Row count matches the production cluster within ±1%
`artifacts`	Random sample of 100 rows has matching S3 keys (HEAD round-trip)
`scan_findings`	Row count > 0 (sanity)
`users` + `service_accounts`	Auth round-trip succeeds against the restored DB
`webhook_subscriptions`	URL + has_secret bit reads match production
`retention_audit`	Latest row's `created_at` is no older than the configured backup window

Failures are written to backup_verification_runs.failure_reasons as a JSONB array so the admin UI can render which check tripped.

Alert routing

The default PrometheusRule ships with the chart:

yaml

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: orbitalreg-backup-stale
spec:
  groups:
    - name: orbitalreg.backup
      rules:
        - alert: OrbitalRegBackupStale
          expr: |
            time() - orbitalreg_backup_verification_last_success_timestamp > 8 * 24 * 3600
          for: 30m
          labels:
            severity: warning
          annotations:
            summary: "OrbitalReg backup verification has not succeeded in over 8 days"
            runbook_url: "https://docs.orbitalreg.io/operations/backup-verification"

Tuning

values.backupVerification.schedule defaults to 0 4 * * 1 (Mondays at 04:00 UTC). The verification cluster's resource shape matches the production cluster by default — override under values.backupVerification.resources if you'd rather run a smaller verification cluster.

When verification fails

A failed run is not automatically retried. The signal is the alert; the human action is to investigate which check tripped:

bash

kubectl exec -n orbitalreg deploy/orbitalreg-api -- \
  /bin/orbitalreg-cli admin backup-verification last-failure

Common failures:

HEAD on S3 key 404 — the backup is older than the artifact in question, or S3 retention pruned the key. Review the S3 bucket's lifecycle policy.
schema mismatch — the migration set in the backup is not the same as the migration set in the test image. Re-run with verification.image.tag pinned to the same tag the cluster is running.
row count diverges by > 1% — usually a Barman / S3 hiccup; inspect the WAL stream for gaps.

Disabling

For tiny installs where the verification cost outweighs the value:

yaml

backupVerification:
  enabled: false

The admin UI card then renders a banner suggesting an external check (e.g. an external monitoring service that pings /health/backup before paging).

Backup verification ​

What's verified ​

Alert routing ​

Tuning ​

When verification fails ​

Disabling ​

Related docs ​