Disaster recovery

This page is the externally-rendered companion to docs/operations/disaster-recovery.md. The in-repo runbook is the canonical engineering source — keep it open during an actual incident, since it carries the full copy-pasteable command sequences. This page summarises the model so an evaluating engineer can read what's in the box without cloning the repo.

What's protected

OrbitalReg's persistent state is split across three failure domains:

Postgres — every governance row, scan finding, retention audit, service account, license policy.
S3-compatible bucket — every artifact byte. The Docker manifests live here too, although their indices are derivable from Postgres.
(Optional) Redis — session state, rate-limit counters, metadata cache. Not backed up by design — losing Redis logs users out and re-warms a cold cache, but doesn't lose data.

The DR runbook covers Postgres + S3. Redis is intentionally volatile.

Backup mechanism

Component	Tool	Schedule	Retention
Postgres	CloudNativePG + Barman	Continuous WAL + daily base	30 days (PITR)
S3	Provider replication	Per-write	30 days

The Postgres backup target is itself an S3 bucket, kept separate from the artifact bucket. Keeping both buckets in the same region is fine for small installs; for production we recommend placing the backup bucket in a different region so that a region-level outage doesn't take out both.

Three scenarios

1. Postgres compromised, S3 healthy

Most common. Symptoms:

/health/ready returns 503 with db: down
CNPG cluster reports phase: Failed
relation "<table>" does not exist errors in API logs
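The first symptom above can be checked mechanically. The following is a minimal sketch, assuming the readiness body contains the literal substring `db: down` (the exact response format is not specified on this page); `classify_ready` and `ORBITAL_URL` are hypothetical names introduced for illustration.

```shell
# classify_ready STATUS BODY
# Maps a readiness probe result to "ready", "db-down", or "unknown".
# Assumption: a database outage shows up as HTTP 503 with "db: down" in the body.
classify_ready() {
  status="$1"
  body="$2"
  case "$status" in
    200) echo "ready" ;;
    503)
      case "$body" in
        *"db: down"*) echo "db-down" ;;
        *) echo "unknown" ;;
      esac
      ;;
    *) echo "unknown" ;;
  esac
}

# Example wiring (not executed here; ORBITAL_URL is a placeholder):
# status=$(curl -s -o /tmp/ready.body -w '%{http_code}' "$ORBITAL_URL/health/ready")
# classify_ready "$status" "$(cat /tmp/ready.body)"
```

A `db-down` result only suggests scenario 1; the CNPG cluster phase and the API logs should still be checked before starting a restore.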

Action: PITR the CNPG cluster to the last-known-good point. The scripts/orbital-restore.sh wrapper handles the common case:

```bash
./scripts/orbital-restore.sh \
  --scenario db-only \
  --target-time "2026-04-30T14:32:00Z" \
  --namespace orbitalreg
```
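Before choosing a `--target-time`, it may help to confirm that the timestamp falls inside the 30-day PITR retention window from the backup table. The sketch below is not part of the wrapper script; `within_pitr_window` is a hypothetical helper, and it assumes GNU `date` (the `-d` flag), so BSD/macOS `date` would need a different invocation.

```shell
# Retention window from the backup table, in days.
RETENTION_DAYS=30

# within_pitr_window TARGET_ISO8601 [NOW_EPOCH]
# Exit 0 when TARGET is not in the future and not older than RETENTION_DAYS.
# Exit 1 when it is outside the window, 2 when the timestamp cannot be parsed.
# Assumption: GNU date is available for parsing ISO 8601 timestamps.
within_pitr_window() {
  target_epoch=$(date -u -d "$1" +%s) || return 2
  now_epoch="${2:-$(date -u +%s)}"
  oldest=$((now_epoch - RETENTION_DAYS * 86400))
  [ "$target_epoch" -le "$now_epoch" ] && [ "$target_epoch" -ge "$oldest" ]
}

# Example: guard the restore call (not executed here).
# within_pitr_window "2026-04-30T14:32:00Z" || echo "target outside PITR window" >&2
```

The second argument exists only so the check can be evaluated against a fixed "now" when rehearsing a drill.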

Smoke-test post-restore:

```bash
./scripts/orbital-restore.sh --smoke --namespace orbitalreg
```

The smoke suite touches each format adapter's read path and verifies that artifact metadata round-trips against S3: no rows referencing deleted blobs, and no blobs left without a referencing row.

2. S3 compromised, Postgres healthy

Action: fail over to the replica bucket via the chart's s3.endpoint / s3.bucket knobs. The API picks up the new endpoint on rollout. Postgres rows are preserved; any artifacts uploaded during the outage window need to be re-pushed (the scan_jobs.queued table reveals which).
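As a rough illustration of the failover step, the sketch below only prints a `helm upgrade` command rather than running it. The `s3.endpoint` and `s3.bucket` values come from the text above; the release name, chart reference, endpoint URL, and bucket name are placeholders, and `failover_cmd` is a hypothetical helper. The in-repo runbook remains the source for the real command.

```shell
# failover_cmd ENDPOINT BUCKET [NAMESPACE]
# Prints a helm command that repoints the API at the replica bucket.
# Assumptions: RELEASE and CHART default to placeholder values; only
# s3.endpoint and s3.bucket are taken from the documentation.
failover_cmd() {
  endpoint="$1"
  bucket="$2"
  namespace="${3:-orbitalreg}"
  printf 'helm upgrade %s %s --namespace %s --reuse-values --set s3.endpoint=%s --set s3.bucket=%s\n' \
    "${RELEASE:-orbitalreg}" "${CHART:-./chart}" "$namespace" "$endpoint" "$bucket"
}

# Example (prints the command for review before anyone runs it):
# failover_cmd "https://s3.replica.example" "artifacts-replica"
```

Printing the command first gives the on-call engineer a chance to review the endpoint and bucket before triggering the rollout.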

3. Total loss

Both Postgres and S3 are gone. Action sequence:

1. Restore S3 from the replica or from the last-known-good snapshot
2. Restore Postgres from CNPG backup
3. Reconcile: the smoke suite flags any rows referencing missing S3 keys; either re-fetch from upstream (for remote-cached repos) or delete the orphan rows
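The reconcile step is essentially a set difference between the keys Postgres references and the keys present in the restored bucket. The sketch below shows that idea on two plain-text key lists (one key per line); it is not the smoke suite's implementation, and `orphan_rows` / `orphan_blobs` are hypothetical helpers. How the two lists are exported is left to the runbook.

```shell
# orphan_rows ROWS_FILE KEYS_FILE
# ROWS_FILE: S3 keys referenced by Postgres rows, one per line.
# KEYS_FILE: S3 keys actually present in the restored bucket, one per line.
# Prints keys that Postgres references but the bucket lacks.
orphan_rows() {
  tmp=$(mktemp -d) || return 1
  LC_ALL=C sort -u "$1" > "$tmp/rows"
  LC_ALL=C sort -u "$2" > "$tmp/keys"
  # comm -23 keeps lines that appear only in the first (sorted) file.
  LC_ALL=C comm -23 "$tmp/rows" "$tmp/keys"
  rm -rf "$tmp"
}

# orphan_blobs ROWS_FILE KEYS_FILE
# The reverse direction: blobs in the bucket that no row references.
orphan_blobs() {
  orphan_rows "$2" "$1"
}
```

Keys printed by `orphan_rows` are the candidates for re-fetching from upstream or for row deletion, as described in the reconcile step.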

RPO / RTO targets

Metric	Target	Notes
RPO	≤ 5 minutes (PITR)	CNPG WAL streams continuously
RTO	≤ 30 minutes (DB)	New CNPG cluster bootstrap from backup is the long pole
RTO	≤ 5 minutes (S3)	Bucket failover is endpoint-config + a rollout
RTO	≤ 60 minutes (total)	Sequential S3-then-DB restore (as in scenario 3) plus smoke-suite verification
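The RTO targets in the table can be turned into a simple pass/fail check on a measured recovery time. This is a minimal sketch: `db-only` matches the wrapper's `--scenario` value shown earlier, while `s3-only` and `total` are labels assumed here for illustration, as are the helper names.

```shell
# rto_target_minutes SCENARIO
# Prints the RTO target in minutes, taken from the table above.
# Assumption: "s3-only" and "total" are illustrative scenario labels.
rto_target_minutes() {
  case "$1" in
    db-only) echo 30 ;;
    s3-only) echo 5 ;;
    total)   echo 60 ;;
    *) return 1 ;;
  esac
}

# within_rto SCENARIO ELAPSED_SECONDS
# Exit 0 if the measured recovery time meets the target, 1 if it does not,
# 2 if the scenario label is unknown.
within_rto() {
  target=$(rto_target_minutes "$1") || return 2
  [ "$2" -le $((target * 60)) ]
}

# Example: a 25-minute database restore (not executed here).
# within_rto db-only 1500 && echo "within RTO"
```

Note that the DB and S3 targets sum to 35 minutes, which leaves about 25 minutes of the 60-minute total for reconciliation and smoke-suite verification.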

Pre-incident checklist

A DR drill should run quarterly. Each drill:

1. Restore the last successful weekly backup into an ephemeral cluster
2. Run the smoke suite
3. Compare row counts on key tables vs. the production cluster
4. Document the wall-clock time and any deviations

The Backup verification job automates steps 1 and 2 weekly; the quarterly drill exists to exercise the human runbook under pressure.
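The row-count comparison in the drill can be sketched as a small diff over two files of `table count` pairs, one exported from production and one from the ephemeral cluster. `compare_counts` is a hypothetical helper, and how the counts are exported is not covered here. The sketch assumes exact equality; a restored backup will normally lag production by recent writes, so a real drill may want a tolerance.

```shell
# compare_counts PROD_FILE DRILL_FILE
# Each file holds "table count" pairs, one per line.
# Prints "table prod_count drill_count" for every mismatch ("missing" when a
# table appears on only one side) and exits 1 if any mismatch was found.
compare_counts() {
  awk '
    NR == FNR { prod[$1] = $2; next }
    { drill[$1] = $2 }
    END {
      bad = 0
      for (t in prod) {
        if (!(t in drill)) {
          print t, prod[t], "missing"; bad = 1
        } else if (prod[t] != drill[t]) {
          print t, prod[t], drill[t]; bad = 1
        }
      }
      for (t in drill) {
        if (!(t in prod)) { print t, "missing", drill[t]; bad = 1 }
      }
      exit bad
    }
  ' "$1" "$2"
}
```

The printed mismatches can go straight into the drill write-up alongside the wall-clock time.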

Related docs

Backup verification: weekly automated restore into an ephemeral cluster
Postgres on CloudNativePG: the recommended production Postgres shape
docs/operations/disaster-recovery.md: full runbook with every command
