Skip to content

OrbitalRegDBDown

Severity: critical · For: 2m · Runbook owner: platform on-call

What it means

Prometheus has not received a successful scrape from the OrbitalReg Postgres target for two minutes. In CNPG mode this means the primary exporter / Postgres pod is unhealthy; in legacy StatefulSet mode it means the singleton Pod is gone.

The API can survive a brief DB blip (connection-pool retries) but sustained outage cascades into OrbitalRegAPIDown within a minute or two.

Likely causes

  1. CNPG primary failover in flight — a planned switchover or pod eviction. Standby is being promoted.
  2. PVC at 100% — pg_wal cannot write, Postgres refuses connections.
  3. Operator-issued pg_ctl stop or maintenance window.
  4. CNPG operator itself is down — no reconciliation happening.
  5. CrashLoopBackOff after a config-drift error (bad postgresql.parameters from the Cluster CR).

Diagnose

bash
# Cluster status (CNPG mode)
kubectl -n orbitalreg get cluster orbitalreg-postgres -o wide
kubectl -n orbitalreg describe cluster orbitalreg-postgres

# Pod status
kubectl -n orbitalreg get pods -l cnpg.io/cluster=orbitalreg-postgres
kubectl -n orbitalreg logs <pg-pod> --tail=200

# PVC fill
kubectl -n orbitalreg get pvc

# Operator health
kubectl -n cnpg-system get pods

For the legacy StatefulSet path:

bash
kubectl -n orbitalreg get pods -l app.kubernetes.io/component=postgres
kubectl -n orbitalreg logs orbitalreg-postgres-0 --tail=200

Fix

  1. Failover in flight — wait 60-120s; the CNPG operator will re-elect a primary and up=1 returns. If it doesn't, escalate.
  2. PVC full — expand the volume:
    bash
    kubectl -n orbitalreg patch pvc orbitalreg-postgres-1 -p \
      '{"spec":{"resources":{"requests":{"storage":"100Gi"}}}}'
    StorageClass must be allowVolumeExpansion: true. Postgres reclaims WAL on the next checkpoint.
  3. Bad parameters — revert the offending Helm values, redeploy. CNPG rolls a fresh pod with the previous config.
  4. Operator downhelm rollback cnpg-system in the cnpg-system namespace.

If recovery requires a restore from backup, follow Disaster recovery.

Escalate

  • CNPG Cluster shows Phase: Setting up primary for more than 5m with no progress → page the platform-DB team. Likely a Barman recovery is hung on S3 throughput.
  • StatefulSet pod is stuck in Pending because of a missing PV → cluster-level storage issue, page the platform-storage team.

Released under the Apache-2.0 License.