OrbitalRegSAMLDown

Severity: critical · For: 5m · Runbook owner: platform on-call

What it means

The OrbitalReg API has been unable to fetch the federation metadata of the configured SAML IdP (Azure Entra by default) for 5 minutes. The orbitalreg_saml_idp_up gauge flips to 0 when the federation metadata URL stops responding or the response fails to parse; the alert fires once the gauge has stayed at 0 for the full 5m.
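To confirm the gauge value by hand, one option is to read it out of the Prometheus text exposition output. The helper below is a minimal sketch: the function name saml_idp_up_value is hypothetical, and where you obtain the metrics text (a /metrics endpoint, a port-forward, a saved scrape) is an assumption this runbook does not specify.

```bash
#!/usr/bin/env bash
# saml_idp_up_value: read Prometheus exposition text on stdin and print the
# sample value(s) of orbitalreg_saml_idp_up. Hypothetical helper, not part of
# OrbitalReg itself.
saml_idp_up_value() {
  awk '
    # Match the sample line only; "# HELP" and "# TYPE" lines start with "#".
    /^orbitalreg_saml_idp_up([{ ]|$)/ {
      sub(/\{.*\}/, "")   # drop the label set, if any
      print $2            # field 1 is the metric name, field 2 the value
    }'
}

# Usage (the metrics source is an assumption):
#   curl -s "$METRICS_URL" | saml_idp_up_value
```

A printed 0 matches the firing condition described above; 1 means the probe currently succeeds.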

Customer impact:

  • New logins fail: the redirect to the IdP lands on a browser error page.
  • Existing sessions stay valid: a session lasts auth.sessionTTL (default 8h), so users who are already signed in keep working until their session expires.
  • Service-account tokens are unaffected: they don't go through SAML.
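
The session TTL also gives a rough deadline for the outage: any session created before the IdP went down has expired by outage start plus auth.sessionTTL. The arithmetic can be sketched as follows. The function name sessions_expire_by is hypothetical, and the sketch assumes sessions are not extended on activity, which this runbook does not state.

```bash
#!/usr/bin/env bash
# sessions_expire_by OUTAGE_START_EPOCH [TTL_HOURS]
# Prints the Unix timestamp by which every session created before the outage
# has expired. TTL_HOURS defaults to 8, the auth.sessionTTL default.
# Assumption: a fixed TTL, with no sliding renewal.
sessions_expire_by() {
  local start=$1 ttl_hours=${2:-8}
  echo $(( start + ttl_hours * 3600 ))
}

# Usage (GNU date syntax):
#   date -u -d "@$(sessions_expire_by "$(date -u -d '10:00' +%s)")"
```

Users who signed in shortly before the outage are the last to be locked out; users whose sessions were already old lose access sooner.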

The local break-glass admin (configured via ORBITALREG_BOOTSTRAP_ADMIN_EMAIL / password) still works; this is the path on-call uses to fix the IdP configuration from inside the platform if needed.

Likely causes

  1. IdP itself is in a maintenance window or actual outage (Azure Entra has them; check the Azure status page first).
  2. Egress to the IdP is blocked: air-gapped mode (item 24) was toggled on without the IdP host being added to the allow-list.
  3. NetworkPolicy regression that blocks the API → IdP path.
  4. SAML federation metadata URL changed upstream (tenant migration, app re-registration).
  5. SP cert expired: the cert in saml.existingSecret has passed its validity period, so the IdP rejects the SAML request signature.

Diagnose

```bash
# Probe the federation URL from inside the API pod.
# sh -c makes the variable expand inside the pod, not in your local shell;
# --max-time keeps a blocked egress path from hanging the probe.
kubectl -n orbitalreg exec deploy/orbitalreg-api -- sh -c \
  'curl -sSf --max-time 10 "$ORBITALREG_SAML_IDP_METADATA_URL" -o /dev/null \
     -w "http_code=%{http_code} time=%{time_total}\n"'

# Recent SAML errors from the API logs
kubectl -n orbitalreg logs -l app.kubernetes.io/component=api --tail=500 \
  | grep -i saml | tail -40

# SP cert expiry (substitute the secret named in saml.existingSecret if it differs)
kubectl -n orbitalreg get secret orbitalreg-saml -o jsonpath='{.data.tls\.crt}' \
  | base64 -d | openssl x509 -noout -enddate

# Air-gapped allow-list status ($TOKEN must hold a token with access to the admin API)
curl -sf -H "Authorization: Bearer $TOKEN" \
  https://orbitalreg.example.com/api/admin/network | jq '.air_gapped_mode, .allowed_egress_hosts'
```
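
The exit code of the curl probe narrows down which of the likely causes applies. The mapping below is a rough sketch, not an authoritative triage: the function name classify_probe is hypothetical, the exit codes are the ones documented in the curl man page, and the suggested causes are this runbook's list applied heuristically. It also assumes kubectl exec passes the remote command's exit code through.

```bash
#!/usr/bin/env bash
# classify_probe EXIT_CODE
# Map a curl exit code from the metadata probe to the most likely cause.
# Hypothetical helper; treat the output as a hint, not a diagnosis.
classify_probe() {
  case "$1" in
    0)     echo "reachable: metadata URL answers; check the SP cert and SAML logs" ;;
    6)     echo "dns: IdP host did not resolve; check the metadata URL and cluster DNS egress" ;;
    7)     echo "connect: connection failed; check the air-gapped allow-list and NetworkPolicy" ;;
    22)    echo "http: IdP returned HTTP >= 400; the metadata URL may have changed" ;;
    28)    echo "timeout: no answer in time; check egress rules and the IdP status page" ;;
    35|60) echo "tls: TLS handshake or certificate verification failed" ;;
    *)     echo "unknown: curl exit $1; see the EXIT CODES section of the curl man page" ;;
  esac
}

# Usage: run the probe, then pass its exit status:
#   kubectl -n orbitalreg exec deploy/orbitalreg-api -- sh -c '...'; classify_probe $?
```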

Fix

  1. IdP outage: wait and monitor the cloud-provider status page. No action is needed once it recovers.
  2. Air-gapped mode: Admin → Settings → Network → "Allowed egress hosts" → add the IdP's host. Alternatively, temporarily set allow_service_pings=true.
  3. NetworkPolicy: the chart's NetworkPolicy permits egress to * by default for the api Deployment; if you've tightened it, make sure the IdP host is on the allow-list.
  4. Metadata URL changed: update saml.idpMetadataURL and redeploy. After the rollout, SSO works on the next login from a new tab.
  5. SP cert expired: rotate saml.existingSecret. The chart's saml-secret template re-renders the SP metadata XML; upload the new metadata to the IdP-side enterprise app.
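
For the SP cert step, it can help to check the validity window before and after the rotation instead of reading the end date by eye. The sketch below uses openssl x509 -checkend, which reports whether a certificate is still valid a given number of seconds from now. The function name cert_expires_within and the 30-day window in the usage line are illustrative assumptions.

```bash
#!/usr/bin/env bash
# cert_expires_within PEM_FILE DAYS
# Returns 0 if the certificate in PEM_FILE expires within DAYS days, has
# already expired, or cannot be parsed; returns 1 if it stays valid longer.
# Hypothetical helper built on the openssl CLI.
cert_expires_within() {
  local pem_file=$1 days=$2
  # -checkend N exits 0 when the cert is still valid N seconds from now.
  if openssl x509 -in "$pem_file" -noout -checkend "$(( days * 86400 ))" >/dev/null 2>&1; then
    return 1   # still valid past the window
  fi
  return 0     # inside the window, expired, or unreadable
}

# Usage: save the decoded tls.crt from the Diagnose step to sp.pem, then
#   cert_expires_within sp.pem 30 && echo "rotate the SP cert"
```

Running the same check against the new certificate after the rotation confirms that the secret now holds a cert with a fresh validity window.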

If users are blocked and the fix needs DB access, log in via the local break-glass path: https://orbitalreg.example.com/login/local.

Escalate

  • IdP responds but SAML auth still fails with a signature mismatch → escalate to the IdP team; the SP-side cert in Entra likely needs to be re-uploaded.
  • Probe times out from one cluster but works from another → cluster-egress regression, page the platform-network team.

Released under the Apache-2.0 License.