`OrbitalRegDBDown`

Severity: critical · For: 2m · Runbook owner: platform on-call

What it means

Prometheus has not received a successful scrape from the OrbitalReg Postgres target for two minutes. In CNPG mode this means the primary exporter / Postgres pod is unhealthy; in legacy StatefulSet mode it means the singleton Pod is gone.

The API can survive a brief DB blip (connection-pool retries) but sustained outage cascades into OrbitalRegAPIDown within a minute or two.

Likely causes

CNPG primary failover in flight — a planned switchover or pod eviction. Standby is being promoted.
PVC at 100% — pg_wal cannot write, Postgres refuses connections.
Operator-issued pg_ctl stop or maintenance window.
CNPG operator itself is down — no reconciliation happening.
CrashLoopBackOff after a config-drift error (bad postgresql.parameters from the Cluster CR).

Diagnose

bash

# Cluster status (CNPG mode)
kubectl -n orbitalreg get cluster orbitalreg-postgres -o wide
kubectl -n orbitalreg describe cluster orbitalreg-postgres

# Pod status
kubectl -n orbitalreg get pods -l cnpg.io/cluster=orbitalreg-postgres
kubectl -n orbitalreg logs <pg-pod> --tail=200

# PVC fill
kubectl -n orbitalreg get pvc

# Operator health
kubectl -n cnpg-system get pods

For the legacy StatefulSet path:

bash

kubectl -n orbitalreg get pods -l app.kubernetes.io/component=postgres
kubectl -n orbitalreg logs orbitalreg-postgres-0 --tail=200

Fix

Failover in flight — wait 60-120s; the CNPG operator will re-elect a primary and up=1 returns. If it doesn't, escalate.

PVC full — expand the volume:

bash

kubectl -n orbitalreg patch pvc orbitalreg-postgres-1 -p \
  '{"spec":{"resources":{"requests":{"storage":"100Gi"}}}}'

StorageClass must be allowVolumeExpansion: true. Postgres reclaims WAL on the next checkpoint.

Bad parameters — revert the offending Helm values, redeploy. CNPG rolls a fresh pod with the previous config.
Operator down — helm rollback cnpg-system in the cnpg-system namespace.

If recovery requires a restore from backup, follow Disaster recovery.

Escalate

CNPG Cluster shows Phase: Setting up primary for more than 5m with no progress → page the platform-DB team. Likely a Barman recovery is hung on S3 throughput.
StatefulSet pod is stuck in Pending because of a missing PV → cluster-level storage issue, page the platform-storage team.

OrbitalRegDBDown ​

What it means ​

Likely causes ​

Diagnose ​

Fix ​

Escalate ​

`OrbitalRegDBDown`

What it means

Likely causes

Diagnose

Fix

Escalate