Skip to content

OrbitalRegAPIDown

Severity: critical · For: 3m · Runbook owner: platform on-call

What it means

Prometheus has not received a successful scrape from the OrbitalReg API /metrics endpoint for three minutes. Either:

  • the api Deployment is unhealthy / crashlooping,
  • the Service / Endpoints have no ready Pods,
  • a network policy is blocking the Prometheus → API path, or
  • the api Pods are running but /metrics itself returns non-200.

For end users, every package pull, push, and admin click is failing.

Likely causes

  1. A bad release rolled out — readinessProbe is failing on every pod.
  2. The CNPG Postgres primary went away, the API failed its DB liveness probe, and Kubelet killed the pods.
  3. Cluster-level disruption — node drain, overcommit, OOM eviction.
  4. Image pull failure after a tag change (pullPolicy: Always plus a registry outage).
  5. NetworkPolicy regression that blocks the Prom-operator namespace.

Diagnose

bash
# Pods + recent events
kubectl -n orbitalreg get pods -l app.kubernetes.io/component=api
kubectl -n orbitalreg describe pod -l app.kubernetes.io/component=api \
  | sed -n '/Events:/,$p'

# Are scrapes resolving any endpoints?
kubectl -n orbitalreg get endpoints -l app.kubernetes.io/component=api

# Last 200 lines from one pod (replace POD)
kubectl -n orbitalreg logs <POD> --tail=200

# Does /metrics return locally?
kubectl -n orbitalreg port-forward svc/orbitalreg-api 8080:8080 &
curl -sf http://localhost:8080/metrics | head

In the Prom UI, the missing scrape shows up as up{job=~"orbitalreg.*api.*"} == 0. The "Targets" page tells you which pod is unreachable and the last error string.

Fix

  1. Bad releasehelm rollback orbitalreg <previous-revision>.
  2. DB went away — investigate OrbitalRegDBDown first; the API recovers automatically once the primary is back.
  3. Image-pull failure — fix the pull-secret / registry; new pods come back online without a re-deploy.
  4. NetworkPolicy — make sure the Prom-operator namespace has the app.kubernetes.io/name: kube-prometheus-stack label that the chart's NetworkPolicy expects.

Escalate

  • All of the above checked, pods are Ready, /metrics returns 200 locally, but Prometheus still says up=0 → escalate to the platform-Prometheus team. Probably a Prometheus-side scrape config issue.
  • Pods are CrashLoopBackOff with no obvious cause in logs → file an incident on the OrbitalReg backend team.

Released under the Apache-2.0 License.