Runbooks

Each PrometheusRule alert shipped by the chart links here through the runbook_url annotation. Open the matching runbook before opening the dashboard. Every page follows the same shape so you can read it under page-pressure:

What it means — one paragraph in plain English.
Likely causes — ranked by frequency.
Diagnosis — copy-pasteable kubectl / psql snippets.
Fixes — quick-fix first, nuke-and-restore last.
Escalation — when to wake somebody else up.

The bundle is opt-in through the chart's observability.alerts.enabled toggle. See Helm chart values for the full list of overrides: every alert can be disabled individually, and every threshold can be tuned per environment.

The ten alerts

Critical (page immediately)

Alert	Triggers when	Runbook
`OrbitalRegAPIDown`	API target absent from Prometheus for 3 min	open
`OrbitalRegDBDown`	Postgres target absent from Prometheus for 2 min	open
`OrbitalRegS3MirrorFailing`	Backup S3 endpoint has had a pending retry queue for 30 min	open
`OrbitalRegHighErrorRate`	API 5xx ratio > 5% for 5 min	open
`OrbitalRegSAMLDown`	SAML IdP unreachable for 5 min	open
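
As a rough illustration of how one row of this table could map onto a rule, a PrometheusRule entry for `OrbitalRegHighErrorRate` might look like the sketch below. The metric name `http_requests_total`, the `job` and `code` labels, the resource and group names, and the runbook URL are all assumptions made for illustration; the expression the chart actually ships may differ.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: orbitalreg-alerts              # hypothetical resource name
spec:
  groups:
    - name: orbitalreg.critical        # hypothetical group name
      rules:
        - alert: OrbitalRegHighErrorRate
          # Share of 5xx responses among all API requests, measured over
          # a 5-minute rate window. Metric and label names are assumed,
          # not taken from the chart.
          expr: |
            sum(rate(http_requests_total{job="orbitalreg-api", code=~"5.."}[5m]))
              /
            sum(rate(http_requests_total{job="orbitalreg-api"}[5m]))
              > 0.05
          for: 5m                      # the "for 5 min" column in the table
          labels:
            severity: critical         # consumed by Alertmanager routing
          annotations:
            # Placeholder URL; the real annotation points at this page.
            runbook_url: https://docs.example.com/runbooks/high-error-rate
```

The `for:` clause is what turns "ratio above 5%" into "ratio above 5% for 5 min": the alert stays pending until the expression has been true for the whole duration.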

Warning (Slack during business hours)

Alert	Triggers when	Runbook
`OrbitalRegBackupStale`	No successful backup verification in 8 days	open
`OrbitalRegScanQueueBacklog`	Detection queue depth > 1000 for 10 min	open
`OrbitalRegCertExpiring`	An operator-managed cert is < 14 days from expiry	open
`OrbitalRegLicenseExpiring`	The active license envelope is < 30 days from expiry	open

Info (dashboard only)

Alert	Triggers when	Runbook
`OrbitalRegHighDiskUsage`	An OrbitalReg PVC > 85% full for 10 min	open
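
A disk-usage rule of this kind is commonly built from the kubelet's volume stats metrics. The fragment below is a sketch of a single entry in a rule group's `rules:` list under that assumption; the `namespace="orbitalreg"` selector is hypothetical, and the chart's own expression is not shown in this document.

```yaml
- alert: OrbitalRegHighDiskUsage
  # Used bytes divided by capacity, per PVC, as reported by the kubelet.
  # The namespace selector is an assumption for illustration.
  expr: |
    kubelet_volume_stats_used_bytes{namespace="orbitalreg"}
      /
    kubelet_volume_stats_capacity_bytes{namespace="orbitalreg"}
      > 0.85
  for: 10m                             # "for 10 min" in the table
  labels:
    severity: info                     # dashboard only, no notification expected
```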

Severity policy

The severities ship as Prometheus labels (severity: critical | warning | info); they're meant to feed Alertmanager routing rules the operator owns, not to dictate them. A typical mapping:

```yaml
route:
  routes:
    - matchers: [severity="critical"]
      receiver: pagerduty-onCall
    - matchers: [severity="warning"]
      receiver: slack-platform
    - matchers: [severity="info"]
      receiver: slack-platform-noisy
```
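
The mapping above is only the routing fragment. Alertmanager also expects a default receiver on the root route and a definition for every receiver name. The sketch below fills those in and adds a time interval to approximate "Slack during business hours". The choice of default receiver, the channel names, the 09:00 to 17:00 window, and the credential placeholders are all assumptions to adapt, not values the chart provides.

```yaml
route:
  receiver: slack-platform-noisy       # assumed catch-all for unmatched alerts
  routes:
    - matchers: [severity="critical"]
      receiver: pagerduty-onCall
    - matchers: [severity="warning"]
      receiver: slack-platform
      active_time_intervals: [business-hours]   # notify only inside the interval
    - matchers: [severity="info"]
      receiver: slack-platform-noisy

time_intervals:
  - name: business-hours               # hypothetical interval; times are UTC unless a location is set
    time_intervals:
      - weekdays: ['monday:friday']
        times:
          - start_time: '09:00'
            end_time: '17:00'

receivers:
  - name: pagerduty-onCall
    pagerduty_configs:
      - routing_key: "REPLACE_WITH_INTEGRATION_KEY"   # placeholder secret
  - name: slack-platform
    slack_configs:
      - api_url: "https://hooks.slack.com/services/REPLACE/ME"   # placeholder webhook
        channel: '#platform'           # hypothetical channel
  - name: slack-platform-noisy
    slack_configs:
      - api_url: "https://hooks.slack.com/services/REPLACE/ME"   # placeholder webhook
        channel: '#platform-noisy'     # hypothetical channel
```

Running `amtool check-config` against the file before rolling it out catches a missing receiver or a malformed matcher.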

Disabling individual alerts

Every alert has an enabled toggle:

```yaml
observability:
  alerts:
    enabled: true
    rules:
      backupStale:
        enabled: false   # the dedicated backup-verify PrometheusRule
                         # already covers this
      highDiskUsage:
        enabled: true
        thresholdPct: 90 # tune per cluster
```

backupStale ships disabled by default in the bundle because the chart's separate backup-verify-prometheusrule.yaml (gated by backupVerification.alert.enabled) already publishes the equivalent rule with the same expression. Flip the bundle's copy on if you run the verifier in a sibling cluster that scrapes this Prometheus.
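
For the sibling-cluster case, a values override might look like the sketch below. Both key paths come from this page. Turning `backupVerification.alert.enabled` off in the same release is an assumption, made so that the two equivalent rules do not both fire. Keep it on if the verifier also runs in this cluster.

```yaml
backupVerification:
  alert:
    enabled: false       # assumption: the verifier is not deployed in this cluster

observability:
  alerts:
    enabled: true
    rules:
      backupStale:
        enabled: true    # use the bundle's copy of the rule instead
```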
