OrbitalRegAPIDown
Severity: critical · For: 3m · Runbook owner: platform on-call
What it means
Prometheus has not received a successful scrape from the OrbitalReg API /metrics endpoint for three minutes. Either:
- the api Deployment is unhealthy / crashlooping,
- the Service / Endpoints have no ready Pods,
- a network policy is blocking the Prometheus → API path, or
- the api Pods are running but /metrics itself returns non-200.
For end users, every package pull, push, and admin click is failing.
Likely causes
- A bad release rolled out — readinessProbe is failing on every pod.
- The CNPG Postgres primary went away, the API failed its DB liveness probe, and Kubelet killed the pods.
- Cluster-level disruption — node drain, overcommit, OOM eviction.
- Image pull failure after a tag change (pullPolicy: Always plus a registry outage).
- NetworkPolicy regression that blocks the Prom-operator namespace.
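The causes above tend to leave different fingerprints in the STATUS column of kubectl get pods. As a rough triage aid, the sketch below maps a status string to the cause it most often suggests. The function name likely_cause and the mapping itself are illustrative assumptions, not part of the OrbitalReg tooling; treat the output as a hint about where to look first, not as a diagnosis.

```shell
# likely_cause STATUS
# Maps a pod STATUS string (third column of `kubectl get pods`) to the
# cause from the list above that it most commonly points at.
# Hypothetical helper: the mapping is a heuristic, not an authoritative diagnosis.
likely_cause() {
  case "$1" in
    ImagePullBackOff|ErrImagePull)
      echo "image pull failure" ;;
    CrashLoopBackOff|Error)
      echo "bad release or database unavailable" ;;
    Evicted|OOMKilled|Pending|Terminating)
      echo "cluster-level disruption" ;;
    Running)
      echo "pods running: check readiness, /metrics, and NetworkPolicy" ;;
    *)
      echo "unknown: inspect pod events" ;;
  esac
}

# Example: classify every api pod (requires cluster access).
# kubectl -n orbitalreg get pods -l app.kubernetes.io/component=api --no-headers \
#   | while read -r name ready status rest; do
#       echo "$name: $(likely_cause "$status")"
#     done
```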
Diagnose
bash
# Pods + recent events
kubectl -n orbitalreg get pods -l app.kubernetes.io/component=api
kubectl -n orbitalreg describe pod -l app.kubernetes.io/component=api \
| sed -n '/Events:/,$p'
# Are scrapes resolving any endpoints?
kubectl -n orbitalreg get endpoints -l app.kubernetes.io/component=api
# Last 200 lines from one pod (replace POD)
kubectl -n orbitalreg logs <POD> --tail=200
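# Does /metrics return locally? The port-forward runs in the background.
kubectl -n orbitalreg port-forward svc/orbitalreg-api 8080:8080 &
PF_PID=$!
sleep 2  # give the port-forward time to bind before curling
curl -sf http://localhost:8080/metrics | head
kill "$PF_PID"  # stop the background port-forward

In the Prometheus UI, the missing scrape shows up as up{job=~"orbitalreg.*api.*"} == 0. The "Targets" page tells you which pod is unreachable and the last error string.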
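The same check can be made from a terminal through the Prometheus HTTP API, which is handy when the UI is not reachable. This is a minimal sketch, assuming Prometheus is reachable at the URL held in a PROM_URL variable (a hypothetical name, defaulting here to a local port-forward on 9090); the function name prom_up_down is likewise illustrative.

```shell
# Assumed Prometheus base URL; adjust to your environment.
PROM_URL="${PROM_URL:-http://localhost:9090}"

# prom_up_down
# Asks Prometheus for OrbitalReg API targets whose last scrape failed.
# Uses the instant-query endpoint /api/v1/query; -G makes curl send the
# URL-encoded query as a GET parameter.
prom_up_down() {
  curl -sfG "$PROM_URL/api/v1/query" \
    --data-urlencode 'query=up{job=~"orbitalreg.*api.*"} == 0'
}

# Example (requires network access to Prometheus):
# prom_up_down
```

An empty result array in the JSON response means that, from Prometheus's point of view, no API target is currently down.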
Fix
- Bad release: helm rollback orbitalreg <previous-revision>.
- DB went away: investigate OrbitalRegDBDown first; the API recovers automatically once the primary is back.
- Image-pull failure: fix the pull-secret / registry; new pods come back online without a re-deploy.
- NetworkPolicy: make sure the Prom-operator namespace has the app.kubernetes.io/name: kube-prometheus-stack label that the chart's NetworkPolicy expects.
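To find a value for <previous-revision>, helm history lists the release's revisions oldest first. The sketch below reads that table and prints the revision just before the latest one. previous_revision is a hypothetical helper, and the example assumes the release lives in the orbitalreg namespace. The revision it prints is not guaranteed to be the last known-good release, so check the STATUS and DESCRIPTION columns before rolling back.

```shell
# previous_revision
# Reads `helm history` table output on stdin and prints the revision
# number on the second-to-last row, i.e. the one before the latest.
# Prints nothing if the release has only one revision.
# Hypothetical helper, not part of Helm or the OrbitalReg chart.
previous_revision() {
  awk 'NR > 1 && $1 ~ /^[0-9]+$/ { prev = cur; cur = $1 }
       END { if (prev != "") print prev }'
}

# Example (requires cluster access; the namespace is an assumption):
# rev=$(helm -n orbitalreg history orbitalreg | previous_revision)
# helm -n orbitalreg rollback orbitalreg "$rev"
```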
Escalate
- All of the above checked, pods are Ready, /metrics returns 200 locally, but Prometheus still says up=0 → escalate to the platform-Prometheus team. This is probably a Prometheus-side scrape config issue.
- Pods are CrashLoopBackOff with no obvious cause in logs → file an incident with the OrbitalReg backend team.