OrbitalRegS3MirrorFailing
Severity: critical · For: 30m · Runbook owner: platform on-call
What it means
The optional dual-write S3 mirror (configured via storage.backupEndpoint plus its own credentials) is failing to copy objects from the primary bucket to the backup bucket. The retry queue has had at least one failure pending for over 30 minutes.
Primary writes still succeed — customer uploads aren't failing — but the disaster-recovery property the mirror buys you (lose primary S3, fall over to backup) no longer holds.
The metric orbitalreg_s3_backup_pending_failures is exported by the storage subsystem; it counts objects that have failed at least once and have not yet been replicated successfully.
Likely causes
- Backup endpoint credentials rotated upstream without the chart's Secret being updated.
- Backup bucket is out of quota, or its lifecycle policy is deleting objects the mirror is trying to write.
- Network partition between the cluster and the backup region.
- TLS chain mismatch — backup endpoint presents a cert the API's trust store doesn't validate.
- The primary write succeeded, but a Content-MD5 header mismatch is causing the backup put to fail.
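As a rough triage aid, mirror errors can be bucketed by the S3 error code they carry. The sketch below is a hypothetical helper (classify_mirror_error is not part of OrbitalReg). It assumes the API log lines include the standard S3 error code returned by the backup endpoint; the log format is not documented here, so treat the mapping as a starting point rather than a diagnosis.

```shell
# Hypothetical helper: map one log line to a likely cause from the list above.
# Assumes the line contains a standard S3 error code; anything else is "unknown".
classify_mirror_error() {
  case "$1" in
    *InvalidAccessKeyId*|*SignatureDoesNotMatch*|*AccessDenied*)
      echo "credentials" ;;
    *BadDigest*|*InvalidDigest*)
      echo "content-md5" ;;
    *)
      echo "unknown" ;;
  esac
}
```

For example, classify_mirror_error "put failed: BadDigest" prints content-md5. Network and TLS failures happen before any S3 response is returned, so they fall through to unknown and need the log text itself.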
Diagnose
# Pending-failure count (gauge)
curl -sf https://orbitalreg.example.com/metrics \
| grep ^orbitalreg_s3_backup_pending_failures
# Last 50 mirror errors from the API logs
kubectl -n orbitalreg logs -l app.kubernetes.io/component=api --tail=500 \
| grep -i 'backup mirror' | tail -50
# Manual probe of the backup endpoint with the chart-mounted creds
kubectl -n orbitalreg exec deploy/orbitalreg-api -- env | grep S3_BACKUP
Admin → Settings → Storage exposes a "Test backup endpoint" button that runs a one-shot HeadBucket against the configured backup credentials.
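For scripting the first check, the gauge value can be pulled out of the metrics output and turned into an exit code. This is a minimal sketch: pending_failures is a hypothetical helper name, and it assumes the gauge is exported without labels in the standard Prometheus text format (a single line of the form "name value").

```shell
# Hypothetical helper: read Prometheus text-format metrics on stdin and print
# the value of orbitalreg_s3_backup_pending_failures.
# Assumes an unlabelled gauge; prints nothing and returns 1 if it is absent.
pending_failures() {
  awk '$1 == "orbitalreg_s3_backup_pending_failures" { print $2; found = 1 }
       END { exit found ? 0 : 1 }'
}
```

Usage: curl -sf https://orbitalreg.example.com/metrics | pending_failures. HELP and TYPE comment lines are skipped because their first field is "#".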
Fix
- Bad creds: rotate the Secret referenced by s3.backupExistingSecret. The mirror picks up the new value on the next retry tick (no API restart needed).
- Network partition: operate without the mirror until the region recovers. Set storage.backupEndpoint: "" and redeploy; the queue drains to "skipped" and the alert clears.
- TLS mismatch: mount the backup endpoint's CA into the API's trust store via api.extraEnv / SSL_CERT_FILE, or upload a CA bundle through Admin → Upstream trust roots.
- Lifecycle deleting writes: adjust the backup bucket's lifecycle to exclude the OrbitalReg prefix.
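The first two fixes might look like the following on the command line. This is a sketch under assumptions: the Secret name, its key names (accessKey, secretKey), the Helm release name, and the chart reference are all placeholders, since the real names depend on how the chart was installed. Check the chart's values before running either function.

```shell
# Hypothetical helpers; every name passed in is a placeholder for your install.

# Replace the backup credentials Secret in place.
# Usage: rotate_backup_secret <secret-name> <access-key> <secret-key>
# The key names accessKey/secretKey are assumed, not taken from the chart.
rotate_backup_secret() {
  kubectl -n orbitalreg create secret generic "$1" \
    --from-literal=accessKey="$2" \
    --from-literal=secretKey="$3" \
    --dry-run=client -o yaml \
    | kubectl -n orbitalreg apply -f -
}

# Turn the mirror off by blanking storage.backupEndpoint and redeploying.
# Usage: disable_backup_mirror <release> <chart>
disable_backup_mirror() {
  helm -n orbitalreg upgrade "$1" "$2" \
    --reuse-values \
    --set-string storage.backupEndpoint=
}
```

Passing credentials as arguments leaves them in shell history; reading them from a file or environment variable is safer in practice. Remember to restore storage.backupEndpoint once the region recovers, otherwise the mirror stays off.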
Escalate
- Pending-failure count growing despite the mirror being marked "Healthy" in Admin → Storage → escalate to the OrbitalReg backend team; likely a metric / queue divergence that needs a DB inspection.
- All retries returning 403 SignatureDoesNotMatch → the backup endpoint may have switched to virtual-hosted style addressing; flip s3.backupPathStyle: false and retry.
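The two addressing styles differ only in where the bucket name goes: in the URL path, or in the hostname. The sketch below prints both forms for comparison. s3_object_url is a hypothetical helper, and the endpoint host, bucket, and key in the usage note are made-up values.

```shell
# Hypothetical helper: print the object URL for a given addressing style.
# Usage: s3_object_url <path|virtual> <endpoint-host> <bucket> <key>
s3_object_url() {
  case "$1" in
    path)    echo "https://$2/$3/$4" ;;
    virtual) echo "https://$3.$2/$4" ;;
    *)       echo "unknown style: $1" >&2; return 1 ;;
  esac
}
```

With s3.backup.example.com as the endpoint and orbitalreg-backup as the bucket, the path form is https://s3.backup.example.com/orbitalreg-backup/blobs/abc and the virtual form is https://orbitalreg-backup.s3.backup.example.com/blobs/abc. Virtual-hosted style also needs DNS and a certificate that cover the bucket subdomain.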