`OrbitalRegS3MirrorFailing`

Severity: critical · For: 30m · Runbook owner: platform on-call

What it means

The optional dual-write S3 mirror (configured via storage.backupEndpoint plus its own credentials) is failing to copy objects from the primary bucket to the backup bucket. The retry queue has had at least one failure pending for over 30 minutes.

Primary writes still succeed — customer uploads aren't failing — but the disaster-recovery property the mirror buys you (lose primary S3, fall over to backup) no longer holds.

The metric orbitalreg_s3_backup_pending_failures is exported by the storage subsystem; it counts objects that have failed at least once and have not yet been replicated successfully.

Likely causes

Backup endpoint credentials rotated upstream without the chart's Secret being updated.
Backup bucket out of quota / lifecycle policy is deleting objects the mirror is trying to write.
Network partition between the cluster and the backup region.
TLS chain mismatch — backup endpoint presents a cert the API's trust store doesn't validate.
The primary write succeeded but a Content-MD5 header mismatch is causing the backup put to fail.

Diagnose

bash

# Pending-failure count (gauge)
curl -sf https://orbitalreg.example.com/metrics \
  | grep ^orbitalreg_s3_backup_pending_failures

# Last 50 mirror errors from the API logs
kubectl -n orbitalreg logs -l app.kubernetes.io/component=api --tail=500 \
  | grep -i 'backup mirror' | tail -50

# Manual probe of the backup endpoint with the chart-mounted creds
kubectl -n orbitalreg exec deploy/orbitalreg-api -- env | grep S3_BACKUP

Admin → Settings → Storage exposes a "Test backup endpoint" button that runs a one-shot HeadBucket against the configured backup credentials.

Fix

Bad creds — rotate the Secret referenced by s3.backupExistingSecret. The mirror picks up the new value on the next retry tick (no API restart needed).
Network partition — operate without the mirror until the region recovers. Set storage.backupEndpoint: "" and redeploy; the queue drains to "skipped" and the alert clears.
TLS mismatch — mount the backup endpoint's CA into the API's trust store via api.extraEnv / SSL_CERT_FILE or upload a CA bundle through Admin → Upstream trust roots.
Lifecycle deleting writes — adjust the backup bucket's lifecycle to exclude the OrbitalReg prefix.

Escalate

Pending-failure count growing despite the mirror being marked "Healthy" in Admin → Storage → escalate to the OrbitalReg backend team; likely a metric / queue divergence that needs a DB inspection.
All retries returning 403 SignatureDoesNotMatch → the backup endpoint may have switched to virtual-hosted style addressing; flip s3.backupPathStyle: false and retry.

OrbitalRegS3MirrorFailing ​

What it means ​

Likely causes ​

Diagnose ​

Fix ​

Escalate ​

`OrbitalRegS3MirrorFailing`

What it means

Likely causes

Diagnose

Fix

Escalate