Disaster recovery

This page is the externally-rendered companion to docs/operations/disaster-recovery.md. The in-repo runbook is the canonical engineering source — keep it open during an actual incident, since it carries the full copy-pasteable command sequences. This page summarises the model so an evaluating engineer can read what's in the box without cloning the repo.

What's protected

OrbitalReg's persistent state is split across three failure domains:

  1. Postgres — every governance row, scan finding, retention audit, service account, license policy.
  2. S3-compatible bucket — every artifact byte. The Docker manifests live here too, although their indices are derivable from Postgres.
  3. (Optional) Redis — session state, rate-limit counters, metadata cache. Not backed up by design — losing Redis logs users out and re-warms a cold cache, but doesn't lose data.

The DR runbook covers Postgres + S3. Redis is intentionally volatile.

Backup mechanism

| Component | Tool | Schedule | Retention |
|---|---|---|---|
| Postgres | CloudNativePG + Barman | Continuous WAL + daily base | 30 days (PITR) |
| S3 | Provider replication | Per-write | 30 days |

The Postgres backup target is itself an S3 bucket, kept separate from the artifact bucket. Keeping both buckets in the same region is fine for small installs; for production we recommend placing the backup bucket in a different region so that a region-level outage doesn't take out both.

Three scenarios

1. Postgres compromised, S3 healthy

This is the most common scenario. Symptoms:

  • /health/ready returns 503 with db: down
  • CNPG cluster reports phase: Failed
  • relation "<table>" does not exist errors in API logs
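A monitoring hook could test the first symptom mechanically. The sketch below is a hypothetical helper, not part of OrbitalReg. It assumes the readiness body contains the literal text db: down; the real payload format may differ, so adjust the pattern to match what your deployment returns.

```bash
# Hypothetical helper: does a readiness probe match the "Postgres down" symptom?
# Assumption: the response body contains the literal text "db: down".
is_db_down() {
  # $1 = HTTP status code, $2 = response body
  [ "$1" = "503" ] || return 1
  case "$2" in
    *"db: down"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Example wiring (BASE_URL is a placeholder for your API address):
# status=$(curl -s -o /tmp/ready.body -w '%{http_code}' "$BASE_URL/health/ready")
# is_db_down "$status" "$(cat /tmp/ready.body)" && echo "candidate for scenario 1"
```

A match only narrows the diagnosis; confirm against the CNPG cluster phase and the API logs before starting a restore.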

Action: run a point-in-time recovery (PITR) of the CNPG cluster to the last known-good point. The scripts/orbital-restore.sh wrapper handles the common case:

```bash
./scripts/orbital-restore.sh \
  --scenario db-only \
  --target-time "2026-04-30T14:32:00Z" \
  --namespace orbitalreg
```
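A restore is slow to redo, so it may be worth validating the timestamp before invoking the wrapper. The following is a hypothetical pre-flight helper, assuming the wrapper expects a UTC timestamp in the same shape as the example above. It checks the shape only, not calendar validity (February 30 would pass).

```bash
# Hypothetical pre-flight check for --target-time.
# Assumption: the wrapper expects the form YYYY-MM-DDTHH:MM:SSZ (UTC).
valid_target_time() {
  # $1 = candidate timestamp
  printf '%s\n' "$1" | grep -Eq \
    '^[0-9]{4}-(0[1-9]|1[0-2])-(0[1-9]|[12][0-9]|3[01])T([01][0-9]|2[0-3]):[0-5][0-9]:[0-5][0-9]Z$'
}

# Example wiring:
# t="2026-04-30T14:32:00Z"
# valid_target_time "$t" || { echo "bad --target-time: $t" >&2; exit 1; }
```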

Smoke-test post-restore:

```bash
./scripts/orbital-restore.sh --smoke --namespace orbitalreg
```

The smoke suite touches each format adapter's read path and verifies that artifact metadata round-trips against S3: no orphan rows referencing deleted blobs, and no blobs without a matching row.

2. S3 compromised, Postgres healthy

Action: fail over to the replica bucket via the chart's s3.endpoint / s3.bucket knobs. The API picks up the new endpoint on rollout. Postgres rows are preserved; any artifacts uploaded during the outage window need to be re-pushed (the scan_jobs.queued table shows which).
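As a rough illustration, and assuming the chart is deployed with Helm, the failover could look like the command printed by the sketch below. The release name, chart reference, endpoint, and bucket name are placeholders; only the s3.endpoint and s3.bucket keys come from this page. The helper prints the command rather than running it, so it can be reviewed first.

```bash
# Hypothetical helper: print (do not run) a Helm command that repoints the
# API at the replica bucket. All arguments are placeholders for your install.
s3_failover_cmd() {
  # $1 = release, $2 = chart, $3 = namespace, $4 = replica endpoint, $5 = replica bucket
  printf 'helm upgrade %s %s --namespace %s --reuse-values --set s3.endpoint=%s --set s3.bucket=%s\n' \
    "$1" "$2" "$3" "$4" "$5"
}

# Example:
# s3_failover_cmd orbitalreg ./chart orbitalreg https://s3.dr.example.com artifacts-replica
```

Using --reuse-values keeps every other chart value as deployed, so the change is limited to the two S3 keys.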

3. Total loss

Both Postgres and S3 are gone. Action sequence:

  1. Restore S3 from the replica or from the last-known-good snapshot
  2. Restore Postgres from CNPG backup
  3. Reconcile — the smoke suite flags any rows referencing missing S3 keys; either re-fetch from upstream (for remote-cached repos) or delete the orphan rows
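The reconcile step amounts to a two-way set difference between the object keys Postgres references and the keys present in the bucket. The sketch below illustrates the idea with standard tools; it is not how the smoke suite is implemented. It assumes each input file holds one object key per line, and leaves the export of those lists (a SQL query, a bucket listing) to the runbook.

```bash
# Hypothetical helper: two-way diff between keys referenced in Postgres
# and keys present in S3. Input files: one object key per line.
reconcile_keys() {
  # $1 = keys referenced by Postgres rows, $2 = keys listed in the bucket
  db_sorted=$(mktemp) && s3_sorted=$(mktemp) || return 1
  LC_ALL=C sort -u "$1" > "$db_sorted"
  LC_ALL=C sort -u "$2" > "$s3_sorted"
  # Keys only in Postgres: rows pointing at missing blobs.
  LC_ALL=C comm -23 "$db_sorted" "$s3_sorted" | sed 's/^/missing-in-s3 /'
  # Keys only in S3: blobs with no matching row.
  LC_ALL=C comm -13 "$db_sorted" "$s3_sorted" | sed 's/^/missing-in-db /'
  rm -f "$db_sorted" "$s3_sorted"
}
```

Lines tagged missing-in-s3 are the candidates for re-fetching from upstream or deleting the orphan row.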

RPO / RTO targets

| Metric | Target | Notes |
|---|---|---|
| RPO | ≤ 5 minutes (PITR) | CNPG WAL streams continuously |
| RTO | ≤ 30 minutes (DB) | New CNPG cluster bootstrap from backup is the long pole |
| RTO | ≤ 5 minutes (S3) | Bucket failover is endpoint-config + a rollout |
| RTO | ≤ 60 minutes (total) | Sequential S3-then-DB restore plus smoke-suite verification |
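The RPO target can be checked as simple arithmetic: the gap between the last archived WAL and the moment of the incident should not exceed 300 seconds. The sketch below assumes both timestamps are supplied as Unix epoch seconds; how the last archived WAL time is obtained is not covered here, and the function names are illustrative.

```bash
# Hypothetical helpers: compare recovery-point lag with the 5-minute RPO target.
# Assumption: both timestamps are Unix epoch seconds.
RPO_TARGET_SECONDS=300

rpo_lag_seconds() {
  # $1 = epoch of last archived WAL, $2 = epoch of the incident (or "now")
  echo $(( $2 - $1 ))
}

rpo_within_target() {
  [ "$(rpo_lag_seconds "$1" "$2")" -le "$RPO_TARGET_SECONDS" ]
}

# Example:
# rpo_within_target "$last_wal_epoch" "$(date +%s)" || echo "RPO target exceeded" >&2
```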

Pre-incident checklist

A DR drill should run quarterly. Each drill:

  1. Restore the most recent successful base backup into an ephemeral cluster
  2. Run the smoke suite
  3. Compare row counts on key tables vs. the production cluster
  4. Document the wall-clock time and any deviations

The Backup verification job automates steps 1 and 2 weekly; quarterly drills are about exercising the human runbook under pressure.
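Step 3 of the drill can be scripted once row counts have been exported from both clusters. The sketch below is one possible approach, assuming each input file has a table name and a count per line, separated by whitespace; the export query and the choice of key tables are left to you.

```bash
# Hypothetical helper: report tables whose row counts differ between
# production and the restored cluster. Input lines: "<table_name> <count>".
compare_row_counts() {
  # $1 = production counts file, $2 = restored counts file
  prod_sorted=$(mktemp) && rest_sorted=$(mktemp) || return 1
  LC_ALL=C sort -k1,1 "$1" > "$prod_sorted"
  LC_ALL=C sort -k1,1 "$2" > "$rest_sorted"
  # -a 1 -a 2 keeps tables present on only one side; -e fills the absent count.
  LC_ALL=C join -a 1 -a 2 -e MISSING -o 0,1.2,2.2 "$prod_sorted" "$rest_sorted" |
    awk '$2 != $3 { print $1 ": prod=" $2 " restored=" $3 }'
  rm -f "$prod_sorted" "$rest_sorted"
}
```

Production keeps accepting writes after the backup point, so some drift is expected; treat the output as a prompt for review rather than a pass/fail gate.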

Released under the Apache-2.0 License.