`OrbitalRegAPIDown`

Severity: critical · For: 3m · Runbook owner: platform on-call

What it means

Prometheus has not received a successful scrape from the OrbitalReg API /metrics endpoint for three minutes. Either:

the api Deployment is unhealthy / crashlooping,
the Service / Endpoints have no ready Pods,
a network policy is blocking the Prometheus → API path, or
the api Pods are running but /metrics itself returns non-200.

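For end users, every package pull, push, and admin click is likely failing; only in the last two cases (a blocked scrape path or a broken /metrics) can the API still be serving traffic.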

Likely causes

A bad release rolled out — readinessProbe is failing on every pod.
The CNPG Postgres primary went away, the API failed its DB liveness probe, and the kubelet restarted the containers.
Cluster-level disruption — node drain, overcommit, OOM eviction.
Image pull failure after a tag change (pullPolicy: Always plus a registry outage).
NetworkPolicy regression that blocks traffic from the Prom-operator namespace.
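
To narrow these causes down quickly, it can help to look at the container state reasons Kubernetes reports for the API pods. The following is only a sketch: api_pod_reasons and cause_hint are hypothetical helper names that do not ship with OrbitalReg, and the mapping from reason to cause is a rough heuristic based on the list above.

```shell
# Hypothetical helpers, not part of the OrbitalReg chart.

# Print one tab-separated line per API pod:
# pod name, current waiting reason, last termination reason.
api_pod_reasons() {
  kubectl -n orbitalreg get pods -l app.kubernetes.io/component=api \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].state.waiting.reason}{"\t"}{.status.containerStatuses[*].lastState.terminated.reason}{"\n"}{end}'
}

# Map a container state reason to the most likely cause (heuristic only).
cause_hint() {
  case "$1" in
    ImagePullBackOff|ErrImagePull) echo "image pull failure" ;;
    CrashLoopBackOff)              echo "bad release or DB outage: check logs" ;;
    OOMKilled)                     echo "OOM: check memory limits and node pressure" ;;
    *)                             echo "no pod-level hint: check Endpoints and NetworkPolicy" ;;
  esac
}
```

Run api_pod_reasons first, then pass the second or third field of a line to cause_hint, for example cause_hint CrashLoopBackOff.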

Diagnose

bash

# Pods + recent events
kubectl -n orbitalreg get pods -l app.kubernetes.io/component=api
kubectl -n orbitalreg describe pod -l app.kubernetes.io/component=api \
  | sed -n '/Events:/,$p'

# Does the Service have any ready endpoints for Prometheus to scrape?
kubectl -n orbitalreg get endpoints -l app.kubernetes.io/component=api

# Last 200 lines from one pod (replace POD)
kubectl -n orbitalreg logs <POD> --tail=200

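# Does /metrics return locally?
kubectl -n orbitalreg port-forward svc/orbitalreg-api 8080:8080 &
PF_PID=$!
sleep 2  # give the port-forward time to bind before curling
curl -sf http://localhost:8080/metrics | head
kill "$PF_PID"  # stop the background port-forward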

In the Prometheus UI, the missing scrape shows up as up{job=~"orbitalreg.*api.*"} == 0. If service discovery finds no endpoints at all, the up series is absent rather than 0. The "Targets" page shows which pod is unreachable and the last error string.
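
The same check can be run from a terminal against the Prometheus HTTP API. This is a minimal sketch: api_down_targets is a hypothetical helper name, and PROM_URL is an assumption that must point at your Prometheus base URL (the default below only works after a local port-forward to Prometheus).

```shell
# Hypothetical helper. PROM_URL is an assumption: set it to your Prometheus
# base URL; it defaults to a locally port-forwarded instance.
api_down_targets() {
  # -G sends the url-encoded query as a GET query string.
  curl -sfG "${PROM_URL:-http://localhost:9090}/api/v1/query" \
    --data-urlencode 'query=up{job=~"orbitalreg.*api.*"} == 0'
}
```

The response is the standard Prometheus JSON; each entry in data.result is a target currently reporting 0. An empty data.result means no matching target reports 0, which includes the case where the target was never discovered.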

Fix

Bad release — helm rollback orbitalreg <previous-revision>.
DB went away — investigate OrbitalRegDBDown first; the API recovers automatically once the primary is back.
Image-pull failure — fix the pull-secret / registry; new pods come back online without a re-deploy.
NetworkPolicy — make sure the Prom-operator namespace has the app.kubernetes.io/name: kube-prometheus-stack label that the chart's NetworkPolicy expects.
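
The rollback and NetworkPolicy fixes can be sketched as two small helpers. Both names are hypothetical, and the sketch assumes that the Helm release lives in the orbitalreg namespace and that the Deployment is named orbitalreg-api like the Service; adjust both if your install differs.

```shell
# Hypothetical helpers. Assumptions: the Helm release is installed in the
# orbitalreg namespace and the Deployment is named orbitalreg-api.

# Roll back to the given revision, then wait for the API Deployment to settle.
rollback_api() {
  helm -n orbitalreg rollback orbitalreg "$1" --wait &&
    kubectl -n orbitalreg rollout status deployment/orbitalreg-api --timeout=5m
}

# List namespaces that carry the label the chart's NetworkPolicy expects.
prom_namespace_labelled() {
  kubectl get namespaces -l app.kubernetes.io/name=kube-prometheus-stack
}
```

Pick the revision from helm -n orbitalreg history orbitalreg, then run rollback_api <revision>. If prom_namespace_labelled lists no namespaces, the label is missing from the Prometheus namespace.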

Escalate

All of the above checked, pods are Ready, /metrics returns 200 locally, but Prometheus still says up=0 → escalate to the platform-Prometheus team. Probably a Prometheus-side scrape config issue.
Pods are CrashLoopBackOff with no obvious cause in logs → file an incident on the OrbitalReg backend team.
