OrbitalRegDBDown
Severity: critical · For: 2m · Runbook owner: platform on-call
What it means
Prometheus has not received a successful scrape from the OrbitalReg Postgres target for two minutes. In CNPG mode this means the primary exporter / Postgres pod is unhealthy; in legacy StatefulSet mode it means the singleton Pod is gone.
The API can survive a brief DB blip (connection-pool retries) but sustained outage cascades into OrbitalRegAPIDown within a minute or two.
Likely causes
- CNPG primary failover in flight — a planned switchover or pod eviction. Standby is being promoted.
- PVC at 100% —
pg_walcannot write, Postgres refuses connections. - Operator-issued
pg_ctl stopor maintenance window. - CNPG operator itself is down — no reconciliation happening.
- CrashLoopBackOff after a config-drift error (bad
postgresql.parametersfrom the Cluster CR).
Diagnose
bash
# Cluster status (CNPG mode)
kubectl -n orbitalreg get cluster orbitalreg-postgres -o wide
kubectl -n orbitalreg describe cluster orbitalreg-postgres
# Pod status
kubectl -n orbitalreg get pods -l cnpg.io/cluster=orbitalreg-postgres
kubectl -n orbitalreg logs <pg-pod> --tail=200
# PVC fill
kubectl -n orbitalreg get pvc
# Operator health
kubectl -n cnpg-system get podsFor the legacy StatefulSet path:
bash
kubectl -n orbitalreg get pods -l app.kubernetes.io/component=postgres
kubectl -n orbitalreg logs orbitalreg-postgres-0 --tail=200Fix
- Failover in flight — wait 60-120s; the CNPG operator will re-elect a primary and
up=1returns. If it doesn't, escalate. - PVC full — expand the volume:bashStorageClass must be
kubectl -n orbitalreg patch pvc orbitalreg-postgres-1 -p \ '{"spec":{"resources":{"requests":{"storage":"100Gi"}}}}'allowVolumeExpansion: true. Postgres reclaims WAL on the next checkpoint. - Bad parameters — revert the offending Helm values, redeploy. CNPG rolls a fresh pod with the previous config.
- Operator down —
helm rollback cnpg-systemin thecnpg-systemnamespace.
If recovery requires a restore from backup, follow Disaster recovery.
Escalate
- CNPG Cluster shows
Phase: Setting up primaryfor more than 5m with no progress → page the platform-DB team. Likely a Barman recovery is hung on S3 throughput. - StatefulSet pod is stuck in
Pendingbecause of a missing PV → cluster-level storage issue, page the platform-storage team.