Disaster recovery
This page is the externally-rendered companion to docs/operations/disaster-recovery.md. The in-repo runbook is the canonical engineering source — keep it open during an actual incident, since it carries the full copy-pasteable command sequences. This page summarises the model so an evaluating engineer can read what's in the box without cloning the repo.
What's protected
OrbitalReg's persistent state is split across three failure domains:
- Postgres — every governance row, scan finding, retention audit, service account, license policy.
- S3-compatible bucket — every artifact byte. The Docker manifests live here too, although their indices are derivable from Postgres.
- (Optional) Redis — session state, rate-limit counters, metadata cache. Not backed up by design — losing Redis logs users out and re-warms a cold cache, but doesn't lose data.
The DR runbook covers Postgres + S3. Redis is intentionally volatile.
Backup mechanism
| Component | Tool | Schedule | Retention |
|---|---|---|---|
| Postgres | CloudNativePG + Barman | Continuous WAL + daily base | 30 days (PITR) |
| S3 | Provider replication | Per-write | 30 days |
The Postgres backup target is itself an S3 bucket, kept separate from the artifact bucket. Keeping both buckets in the same region is fine for small installs; for production we recommend placing the backup bucket in a different region so a region-level outage doesn't take out both.
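As a rough illustration of the Postgres row in the table above, the sketch below prints a CloudNativePG `Cluster` backup stanza with a 30-day retention policy plus a daily `ScheduledBackup`. It assumes the in-tree `barmanObjectStore` configuration; the cluster name, bucket path, Secret name, instance count, and storage size are hypothetical placeholders, not values taken from the OrbitalReg chart.

```shell
#!/bin/sh
# Sketch only: render CloudNativePG backup manifests to stdout.
# All names passed in (cluster, bucket path) are hypothetical placeholders.
render_backup_manifests() (
  cluster="$1"; bucket_path="$2"
  cat <<EOF
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: ${cluster}
spec:
  instances: 3
  storage:
    size: 10Gi
  backup:
    retentionPolicy: "30d"            # 30-day PITR window
    barmanObjectStore:
      destinationPath: ${bucket_path} # backup bucket, separate from artifacts
      s3Credentials:
        accessKeyId:
          name: backup-s3-creds       # hypothetical Secret
          key: ACCESS_KEY_ID
        secretAccessKey:
          name: backup-s3-creds
          key: ACCESS_SECRET_KEY
---
apiVersion: postgresql.cnpg.io/v1
kind: ScheduledBackup
metadata:
  name: ${cluster}-daily
spec:
  schedule: "0 0 2 * * *"             # six-field cron, seconds first: 02:00 daily
  backupOwnerReference: self
  cluster:
    name: ${cluster}
EOF
)
```

Piping the output to `kubectl apply -f -` would apply it; WAL archiving is continuous once `barmanObjectStore` is configured, so only the base backup needs a schedule.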
Three scenarios
1. Postgres compromised, S3 healthy
Most common. Symptoms:
- /health/ready returns 503 with db: down
- CNPG cluster reports phase: Failed
- relation "<table>" does not exist errors in API logs
Action: PITR the CNPG cluster to the last-known-good point. The scripts/orbital-restore.sh wrapper handles the common case:
./scripts/orbital-restore.sh \
--scenario db-only \
--target-time "2026-04-30T14:32:00Z" \
--namespace orbitalreg
Smoke-test post-restore:
./scripts/orbital-restore.sh --smoke --namespace orbitalreg
The smoke suite touches each format adapter's read path and verifies that artifact metadata round-trips against S3 (no orphan rows referencing deleted blobs and vice versa).
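For readers unfamiliar with how a CloudNativePG point-in-time recovery is expressed, the sketch below prints the kind of `Cluster` manifest a PITR typically boils down to: a new cluster bootstrapped from the backup object store with a `recoveryTarget.targetTime`. Whether `orbital-restore.sh` generates exactly this is an assumption; the cluster names, bucket path, and Secret name are hypothetical.

```shell
#!/bin/sh
# Sketch only: render a CloudNativePG recovery cluster manifest to stdout.
# Arguments: new cluster name, source cluster name, backup bucket path,
# RFC 3339 target time. All example values are hypothetical.
render_pitr_cluster() (
  new_cluster="$1"; source_cluster="$2"; bucket_path="$3"; target_time="$4"
  cat <<EOF
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: ${new_cluster}
spec:
  instances: 3
  storage:
    size: 10Gi
  bootstrap:
    recovery:
      source: ${source_cluster}
      recoveryTarget:
        targetTime: "${target_time}"  # replay WAL up to this instant
  externalClusters:
    - name: ${source_cluster}         # also used as the Barman server name
      barmanObjectStore:
        destinationPath: ${bucket_path}
        s3Credentials:
          accessKeyId:
            name: backup-s3-creds     # hypothetical Secret
            key: ACCESS_KEY_ID
          secretAccessKey:
            name: backup-s3-creds
            key: ACCESS_SECRET_KEY
EOF
)
```

A call such as `render_pitr_cluster orbitalreg-db-restored orbitalreg-db s3://orbitalreg-pg-backups/ "2026-04-30T14:32:00Z"` mirrors the `--target-time` value used above. Recovery creates a new cluster rather than rewinding the failed one, which is why the API has to be pointed at the restored cluster afterwards.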
2. S3 compromised, Postgres healthy
Action: fail over to the replica bucket via the chart's s3.endpoint / s3.bucket knobs. The API picks up the new endpoint on rollout. Postgres rows are preserved; any artifacts uploaded during the outage window need to be re-pushed (the scan_jobs.queued table reveals which).
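A minimal sketch of the bucket failover, assuming the install is managed with Helm: the function below only prints the `helm upgrade` command so it can be reviewed before running. The release name, chart reference, endpoint, and bucket name are hypothetical; only the `s3.endpoint` and `s3.bucket` value keys come from this page.

```shell
#!/bin/sh
# Sketch only: print the helm command that repoints the API at a replica bucket.
# Arguments: release, chart reference, namespace, new endpoint, new bucket.
s3_failover_cmd() (
  release="$1"; chart="$2"; namespace="$3"; endpoint="$4"; bucket="$5"
  # --reuse-values keeps every other chart value as currently deployed;
  # --wait blocks until the rolled-out pods report ready.
  printf 'helm upgrade %s %s --namespace %s --reuse-values --set s3.endpoint=%s --set s3.bucket=%s --wait\n' \
    "$release" "$chart" "$namespace" "$endpoint" "$bucket"
)
```

Printing rather than executing is a deliberate choice for an incident: the operator can paste the command into the incident log before running it.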
3. Total loss
Both Postgres and S3 are gone. Action sequence:
- Restore S3 from the replica or from the last-known-good snapshot
- Restore Postgres from CNPG backup
- Reconcile — the smoke suite flags any rows referencing missing S3 keys; either re-fetch from upstream (for remote-cached repos) or delete the orphan rows
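The reconcile step is essentially a set difference between the S3 keys that Postgres rows reference and the keys actually present in the bucket. The sketch below shows that comparison by hand, assuming both lists have already been exported to plain text files with one key per line; how those lists are produced is left out, and this is not the smoke suite's implementation.

```shell
#!/bin/sh
# Sketch only: diff the keys referenced by Postgres against the keys in S3.
# $1: file of keys referenced by database rows, one per line
# $2: file of keys listed from the bucket, one per line
find_orphans() (
  db_keys="$1"; s3_keys="$2"
  export LC_ALL=C                 # sort and comm must agree on ordering
  tmp="$(mktemp -d)"
  sort -u "$db_keys" > "$tmp/db"
  sort -u "$s3_keys" > "$tmp/s3"
  # Keys only in the database: rows pointing at blobs that no longer exist
  comm -23 "$tmp/db" "$tmp/s3" | sed 's/^/missing-blob /'
  # Keys only in the bucket: blobs that no row references
  comm -13 "$tmp/db" "$tmp/s3" | sed 's/^/orphan-blob /'
  rm -rf "$tmp"
)
```

Each `missing-blob` line is a candidate for re-fetching from upstream or deleting the row, as described in the reconcile step above.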
RPO / RTO targets
| Metric | Target | Notes |
|---|---|---|
| RPO | ≤ 5 minutes (PITR) | CNPG WAL streams continuously |
| RTO | ≤ 30 minutes (DB) | New CNPG cluster bootstrap from backup is the long pole |
| RTO | ≤ 5 minutes (S3) | Bucket failover is endpoint-config + a rollout |
| RTO | ≤ 60 minutes (total) | Sequential S3 and DB restores plus smoke-suite verification |
Pre-incident checklist
A DR drill should run quarterly. Each drill:
- Restore the last successful weekly backup into an ephemeral cluster
- Run the smoke suite
- Compare row counts on key tables vs. the production cluster
- Document the wall-clock time and any deviations
The Backup verification job automates the first two steps weekly; quarterly drills are about exercising the human runbook under pressure.
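The row-count comparison in the drill can be sketched as a join over two small text files, one per cluster, each holding `table count` pairs. The file format and the table names in the usage example are assumptions for illustration; how the counts are collected (for example with `psql`) is not shown.

```shell
#!/bin/sh
# Sketch only: report tables whose row counts differ between two clusters.
# $1: production counts, $2: restored-cluster counts.
# Each file holds one "table_name row_count" pair per line.
compare_row_counts() (
  export LC_ALL=C                       # join needs the same ordering sort used
  tmp="$(mktemp -d)"
  sort -k1,1 "$1" > "$tmp/prod"
  sort -k1,1 "$2" > "$tmp/restored"
  # join pairs lines by table name; awk prints only the mismatches
  join "$tmp/prod" "$tmp/restored" |
    awk '$2 != $3 { printf "%s prod=%s restored=%s delta=%d\n", $1, $2, $3, $3 - $2 }'
  rm -rf "$tmp"
)
```

With files containing `scan_findings 40` for production and `scan_findings 38` for the restored cluster, the function prints `scan_findings prod=40 restored=38 delta=-2`. Small negative deltas are expected when production has kept taking writes since the backup was cut; large or positive deltas are the ones worth documenting.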
Related docs
- Backup verification — weekly automated restore-into-ephemeral-cluster
- Postgres on CloudNativePG — the recommended production Postgres shape
- docs/operations/disaster-recovery.md — full runbook with every command