OrbitalRegCertExpiring
Severity: warning · For: 1h · Runbook owner: platform on-call
What it means
An operator-configured certificate visible to the API has fewer than 14 days of validity remaining. Sources of cert_expiry_days{kind, repo_id}:
- kind=client_cert: the client cert OrbitalReg presents to a remote upstream registry (mutual TLS).
- kind=ca_bundle: a CA in the trust bundle for verifying upstream TLS or webhook endpoints.
- kind=signing_key: an X.509 cert wrapping a signing key in the Sigstore / cosign or CMS trust chain.
This is a leading indicator, not an outage. The point of alerting at 14 days is that some of these renewals require coordination with external owners (CA / IdP team, vendor support), which takes time.
The SAML SP cert is covered separately by the bundle's OrbitalRegSAMLDown alert, because an expired SP cert means an immediate auth outage and deserves a critical page.
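To make the alert condition concrete, the sketch below filters a metrics scrape for cert_expiry_days series under a threshold and lists the soonest expiry first. The function name expiring_certs is hypothetical, and the parsing assumes plain Prometheus text exposition with no trailing timestamps and no spaces inside label values.

```bash
#!/usr/bin/env bash
# Print cert_expiry_days series whose value is below a threshold (in days),
# sorted so the cert closest to expiry comes first.
# Reads Prometheus text exposition on stdin; $1 is the threshold (default 14).
# Assumption: each sample line is "<series> <value>" with no timestamp.
expiring_certs() {
  local threshold="${1:-14}"
  awk -v t="$threshold" '
    /^cert_expiry_days[{ ]/ {
      value = $NF                       # sample value is the last field
      if (value + 0 < t) print value, $1
    }' | sort -n
}

# Usage (URL taken from the Diagnose section):
#   curl -sf https://orbitalreg.example.com/metrics | expiring_certs 14
```

Each output line is the days remaining followed by the series, so the kind and repo_id labels identify which certificate needs attention.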
Likely causes
- Routine renewal cycle came due (90-day Let's Encrypt, 1-year internal CA).
- cert-manager is failing to renew automatically (rate limit, ACME challenge regression, Issuer broken).
- A vendor changed their CA chain and the new chain hasn't been uploaded to OrbitalReg yet.
Diagnose
```bash
# Which cert is closest to expiry?
curl -sf https://orbitalreg.example.com/metrics \
  | awk '/^cert_expiry_days/' | sort -t'}' -k2 -n | head -10

# cert-manager Certificate resources (if used)
kubectl -n orbitalreg get certificates
kubectl -n orbitalreg describe certificate <name>

# Inspect the cert an upstream presents, as seen from inside the API pod
kubectl -n orbitalreg exec deploy/orbitalreg-api -- \
  openssl s_client -connect <upstream>:443 -showcerts < /dev/null 2>/dev/null \
  | openssl x509 -noout -enddate
```

The Admin → Upstream → Trust roots page renders the same data with a "days remaining" column.
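The -enddate output is a notAfter= line, not a day count, so comparing it with the 14-day threshold takes a small conversion. A minimal sketch, assuming GNU date (for -d); the helper name days_remaining is hypothetical:

```bash
#!/usr/bin/env bash
# Convert an `openssl x509 -noout -enddate` line into whole days remaining.
# $1: the line, e.g. "notAfter=Jan 15 00:00:00 2030 GMT"
# $2: optional "now" in epoch seconds (defaults to the current time)
# Assumption: GNU date is available; BSD/macOS date does not support -d.
days_remaining() {
  local not_after="${1#notAfter=}"      # strip the openssl prefix
  local now="${2:-$(date +%s)}"
  local expiry
  expiry="$(date -u -d "$not_after" +%s)" || return 1
  echo $(( (expiry - now) / 86400 ))    # whole days
}

# Usage:
#   days_remaining "$(openssl x509 -noout -enddate -in cert.pem)"
```

A result below 14 corresponds to the condition this alert fires on.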
Fix
- cert-manager-managed: run cmctl renew <name> -n orbitalreg to force a renewal cycle (this assumes the cmctl CLI is installed). The cert-manager.io/issue-temporary-certificate annotation does not trigger a renewal; it only requests a temporary placeholder certificate. Investigate the underlying Issuer if it loops.
- Manually-uploaded: Admin → Upstream → Trust roots → Replace for that repo. The old cert remains valid in parallel until you delete it.
- Vendor chain rotation: fetch the new chain from the vendor's docs portal, upload it via Admin → Upstream → Trust roots, smoke-test one upstream pull, then delete the old chain.
- Signing-key cert: coordinate with the security team. If the key itself is rotating, do that first via Admin → Signing → "Rotate active key" before replacing the cert that wraps it.
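For the cert-manager-managed case, it can help to confirm that a renewal actually landed. The sketch below waits for the Certificate to report Ready and prints the expiry recorded in its status. The function name verify_renewal and the KUBECTL override are hypothetical. A Certificate that is still valid may already be Ready before the renewal completes, so compare the printed notAfter with the value seen before renewing.

```bash
#!/usr/bin/env bash
# Hypothetical helper: wait for a cert-manager Certificate to become Ready,
# then print the notAfter timestamp from its status.
KUBECTL="${KUBECTL:-kubectl}"   # override for dry runs or testing

verify_renewal() {
  local name="$1" ns="${2:-orbitalreg}"
  # Block until the Ready condition is met, or give up after 5 minutes.
  # Progress messages go to stderr so stdout carries only the timestamp.
  "$KUBECTL" -n "$ns" wait --for=condition=Ready "certificate/$name" \
    --timeout=5m >&2 || return 1
  # status.notAfter holds the expiry of the currently issued certificate.
  "$KUBECTL" -n "$ns" get certificate "$name" -o jsonpath='{.status.notAfter}'
  echo
}

# Usage:
#   verify_renewal <name>
```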
Escalate
- Certificate is cert-manager-managed but stuck in Issuing for 24h → escalate to the platform-cert team; likely an ACME challenge regression.
- Renewal succeeded in cert-manager but the metric still shows < 14 days → API trust store cache hasn't refreshed; restart the api pods (graceful rolling restart picks up the new cert).
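Both escalation checks can be sketched in shell. The helper names below are hypothetical, hours_since assumes GNU date, and the deployment name is taken from the Diagnose commands. cert-manager keeps the Issuing condition on the Certificate after issuance finishes, so confirm its status is True before treating the age as time spent stuck.

```bash
#!/usr/bin/env bash
# Hours elapsed since an RFC 3339 timestamp. Assumes GNU date (-d).
# $2 optionally overrides "now" (epoch seconds).
hours_since() {
  local ts="$1" now="${2:-$(date +%s)}" start
  start="$(date -u -d "$ts" +%s)" || return 1
  echo $(( (now - start) / 3600 ))
}

# When the Issuing condition of a cert-manager Certificate last changed.
issuing_since() {
  kubectl -n orbitalreg get certificate "$1" \
    -o jsonpath='{.status.conditions[?(@.type=="Issuing")].lastTransitionTime}'
}

# Graceful rolling restart of the API pods so they reload the trust store.
restart_api() {
  kubectl -n orbitalreg rollout restart deploy/orbitalreg-api &&
    kubectl -n orbitalreg rollout status deploy/orbitalreg-api --timeout=5m
}

# Usage:
#   [ "$(hours_since "$(issuing_since <name>)")" -ge 24 ] && echo "escalate"
#   restart_api
```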