Skip to content

OrbitalRegCertExpiring

Severity: warning · For: 1h · Runbook owner: platform on-call

What it means

An operator-configured certificate visible to the API has fewer than 14 days of validity remaining. Sources of cert_expiry_days{kind, repo_id}:

  • kind=client_cert — the client cert OrbitalReg presents to a remote upstream registry (mutual-TLS).
  • kind=ca_bundle — a CA in the trust bundle for verifying upstream TLS or webhook endpoints.
  • kind=signing_key — an X.509 cert wrapping a signing key in the Sigstore / cosign or CMS trust chain.

This is a leading indicator, not an outage. The point of paging at 14 days is that some of these renewals require coordination with external owners (CA / IdP team, vendor support) which takes time.

The bundle's OrbitalRegSAMLDown covers the SAML SP cert separately because expired-SP-cert means immediate auth outage and deserves a critical pager.

Likely causes

  1. Routine renewal cycle came due (90-day Let's Encrypt, 1-year internal CA).
  2. cert-manager is failing to renew automatically (rate limit, ACME challenge regression, Issuer broken).
  3. A vendor changed their CA chain and the new chain hasn't been uploaded to OrbitalReg yet.

Diagnose

bash
# Which cert is closest to expiry?
curl -sf https://orbitalreg.example.com/metrics \
  | awk '/^cert_expiry_days/' | sort -k2 -t'}' -n | head -10

# cert-manager Certificate resources (if used)
kubectl -n orbitalreg get certificates
kubectl -n orbitalreg describe certificate <name>

# Inspect the cert directly from the trust store
kubectl -n orbitalreg exec deploy/orbitalreg-api -- \
  openssl s_client -connect <upstream>:443 -showcerts < /dev/null 2>/dev/null \
  | openssl x509 -noout -enddate

The Admin → Upstream → Trust roots page renders the same data with a "days remaining" column.

Fix

  1. cert-manager-managedkubectl annotate certificate <name> cert-manager.io/issue-temporary-certificate=true to force a renewal cycle. Investigate the underlying Issuer if it loops.
  2. Manually-uploaded — Admin → Upstream → Trust roots → Replace for that repo. Old cert remains valid in parallel until you delete it.
  3. Vendor chain rotation — fetch the new chain from the vendor's docs portal, upload via Admin → Upstream → Trust roots, smoke- test one upstream pull, then delete the old chain.
  4. Signing-key cert — coordinate with the security team. If the key itself is rotating, do that first via Admin → Signing → "Rotate active key" before replacing the cert that wraps it.

Escalate

  • Certificate is cert-manager-managed but stuck in Issuing for

    24h → escalate to the platform-cert team; likely an ACME challenge regression.

  • Renewal succeeded in cert-manager but the metric still shows < 14 days → API trust store cache hasn't refreshed; restart the api pods (graceful rolling restart picks up the new cert).

Released under the Apache-2.0 License.