OrbitalRegSAMLDown
Severity: critical · For: 5m · Runbook owner: platform on-call
What it means
The OrbitalReg API has been unable to probe the configured SAML IdP (Azure Entra by default) for 5 minutes. The orbitalreg_saml_idp_up gauge flips to 0 when the federation metadata URL stops responding or the response fails to parse.
Customer impact:
- New logins fail — the redirect to the IdP returns a chrome error page.
- Existing sessions stay valid: a session lasts auth.sessionTTL (default 8h), so users already in a tab keep working.
- Service-account tokens unaffected: they don't go through SAML.
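Since existing sessions only survive until their TTL runs out, the time left before every user is locked out can be estimated from the outage start. A minimal sketch, assuming auth.sessionTTL is an absolute lifetime that is not extended by activity (this runbook does not say either way); the helper name sessions_expire_by is hypothetical:

```bash
# Hypothetical helper: latest epoch at which a pre-outage session can still
# be valid. Assumes the TTL is absolute (not refreshed by activity), which
# this runbook does not confirm.
# usage: sessions_expire_by <outage_start_epoch> [ttl_hours]
sessions_expire_by() {
  start=$1
  ttl_hours=${2:-8}   # 8h is the documented default for auth.sessionTTL
  echo $(( start + ttl_hours * 3600 ))
}

# Example: outage started at epoch 1700000000, default TTL
# sessions_expire_by 1700000000
```

A session created just before the outage lasts the longest, so the result is an upper bound; sessions that started earlier expire sooner.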
The local break-glass admin (configured via ORBITALREG_BOOTSTRAP_ADMIN_EMAIL / password) still works; this is the path on-call uses to fix the IdP from inside the platform if needed.
Likely causes
- IdP itself is in a maintenance window or actual outage (Azure Entra has them; check the Azure status page first).
- Egress to the IdP is blocked — air-gapped mode (item 24) was toggled on without the IdP being added to the allow-list.
- NetworkPolicy regression that blocks the API → IdP path.
- SAML federation metadata URL changed upstream (tenant migration, app re-registration).
- SP cert expired: the cert in saml.existingSecret has passed its validity period, so the IdP rejects the SAML request signature.
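One way to tell these causes apart is by how a curl probe of the metadata URL fails. The sketch below maps curl's exit code (6 could not resolve host, 7 failed to connect, 22 HTTP error with -f, 28 timeout) and the HTTP status to the most likely cause. The mapping is a rule of thumb, not OrbitalReg behaviour, and the helper name classify_probe is hypothetical:

```bash
# Hypothetical triage helper: maps a curl exit code and HTTP status to the
# most likely cause from the list above. The mapping is a heuristic.
# usage: classify_probe <curl_exit_code> <http_code>
classify_probe() {
  case "$1" in
    0)      echo "reachable: check SP cert and metadata parsing" ;;
    6|7|28) echo "egress blocked or IdP down: check allow-list, NetworkPolicy, status page" ;;
    22)
      case "$2" in
        404|410) echo "metadata URL changed" ;;
        5??)     echo "IdP outage" ;;
        *)       echo "HTTP error $2: inspect response" ;;
      esac ;;
    *)      echo "unknown: inspect curl output" ;;
  esac
}

# Example: curl -f exited 22 and reported http_code=404
# classify_probe 22 404
```

A DNS failure, refused connection, or timeout cannot by itself distinguish blocked egress from an IdP outage, which is why checking the provider status page still comes first.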
Diagnose
```bash
# Probe the federation URL from inside the API pod.
# sh -c makes the variable expand inside the pod, not in your local shell.
kubectl -n orbitalreg exec deploy/orbitalreg-api -- sh -c \
  'curl -sSf "$ORBITALREG_SAML_IDP_METADATA_URL" -o /dev/null \
     -w "http_code=%{http_code} time=%{time_total}\n"'

# Recent SAML errors from the API logs
kubectl -n orbitalreg logs -l app.kubernetes.io/component=api --tail=500 \
  | grep -i saml | tail -40

# SP cert expiry
kubectl -n orbitalreg get secret orbitalreg-saml -o jsonpath='{.data.tls\.crt}' \
  | base64 -d | openssl x509 -noout -enddate

# Air-gapped allow-list status
curl -sf -H "Authorization: Bearer $TOKEN" \
  https://orbitalreg.example.com/api/admin/network | jq '.air_gapped_mode, .allowed_egress_hosts'
```
Fix
- IdP outage: wait and monitor the cloud-provider status page. No action needed once it recovers.
- Air-gapped mode: Admin → Settings → Network → "Allowed egress hosts" → add the IdP's host. Or temporarily flip allow_service_pings=true.
- NetworkPolicy: the chart's NetworkPolicy permits egress to * by default for the api Deployment; if you've tightened it, make sure the IdP host is on the allow-list.
- SP cert expired: rotate saml.existingSecret. The chart's saml-secret template re-renders the SP metadata XML; upload the new metadata to the IdP-side enterprise app.
- Metadata URL changed: update saml.idpMetadataURL and redeploy. Open a new tab; SSO works on the next login.
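For the allow-list fixes, the host to add is the one in the metadata URL. The sketch below extracts it and checks it against a list of allowed hosts. Both helper names are hypothetical, and exact-match comparison is an assumption: this runbook does not say whether the allow-list supports wildcards.

```bash
# Hypothetical helpers; names are illustrative, not part of OrbitalReg.
# url_host: strip scheme, path, userinfo and port from a URL.
# Does not handle IPv6 literals.
url_host() {
  h=${1#*://}     # drop scheme
  h=${h%%/*}      # drop path
  h=${h##*@}      # drop userinfo, if any
  h=${h%%:*}      # drop port, if any
  echo "$h"
}

# host_allowed <host> <allowed-host>...: exit 0 if host is listed verbatim.
# Assumes exact matching; wildcard entries are not interpreted.
host_allowed() {
  want=$1; shift
  for a in "$@"; do
    [ "$a" = "$want" ] && return 0
  done
  return 1
}

# Example, with TENANT_ID as a placeholder:
# url_host "https://login.microsoftonline.com/TENANT_ID/federationmetadata/2007-06/federationmetadata.xml"
```

The allowed hosts can be taken from the .allowed_egress_hosts field returned by the admin network endpoint in the Diagnose section.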
If users are blocked and the fix needs DB access, log in via the local break-glass path: https://orbitalreg.example.com/login/local.
Escalate
- IdP responds but SAML auth still fails with signature mismatch → escalate to the IdP team; likely the SP-side cert in Entra needs to be re-uploaded.
- Probe times out from one cluster but works from another → cluster-egress regression; page the platform-network team.
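The cross-cluster comparison above can be sketched as a loop over kubectl contexts. This assumes each cluster runs the same orbitalreg namespace and orbitalreg-api Deployment, that ORBITALREG_SAML_IDP_METADATA_URL is set inside the pod, and that the image ships sh and curl. The function names and the 10-second timeout are illustrative:

```bash
# Hypothetical sketch: run the metadata probe in several clusters and
# summarise which ones fail. Context names are whatever
# `kubectl config get-contexts` lists for your clusters.
probe_context() {
  kubectl --context "$1" -n orbitalreg exec deploy/orbitalreg-api -- sh -c \
    'curl -sSf --max-time 10 "$ORBITALREG_SAML_IDP_METADATA_URL" -o /dev/null'
}

# usage: compare_contexts <context>...
compare_contexts() {
  ok=""; bad=""
  for ctx in "$@"; do
    if probe_context "$ctx" >/dev/null 2>&1; then
      ok="$ok $ctx"
    else
      bad="$bad $ctx"
    fi
  done
  if [ -z "$bad" ]; then
    echo "all clusters reach the IdP"
  elif [ -z "$ok" ]; then
    echo "no cluster reaches the IdP: suspect the IdP or a shared egress path"
  else
    echo "egress regression suspected in:$bad"
  fi
}
```

Only a mixed result points at a single cluster's egress; if every cluster fails, the IdP itself or a shared network path is the more likely culprit.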