Runbooks

Each PrometheusRule alert shipped by the chart links here through its runbook_url annotation. Open the matching playbook before opening the dashboard. Every page follows the same shape, so you can read it quickly while you are being paged:

  1. What it means — one paragraph in plain English.
  2. Likely causes — ranked by frequency.
  3. Diagnosis — copy-pasteable kubectl / psql snippets.
  4. Fixes — quick-fix first, nuke-and-restore last.
  5. Escalation — when to wake somebody else up.
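
For orientation, the link from an alert to its playbook might look like the rule below. This is only a sketch: the resource and group names, the job label value, the summary text, and the runbook URL are illustrative assumptions, not the chart's actual manifest. Only the alert name, the 3-minute window, and the severity come from this page.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: orbitalreg-alerts            # hypothetical resource name
spec:
  groups:
    - name: orbitalreg.critical      # hypothetical group name
      rules:
        - alert: OrbitalRegAPIDown
          # Fires when no "up" series exists for the API scrape job.
          # The job label value is an assumption.
          expr: absent(up{job="orbitalreg-api"})
          for: 3m
          labels:
            severity: critical
          annotations:
            summary: OrbitalReg API target absent from Prometheus
            # Placeholder URL; the real one points at the matching playbook.
            runbook_url: https://docs.example.com/runbooks/api-down
```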

The bundle is opt-in through the chart's observability.alerts.enabled toggle. See Helm chart values for the full list of overrides: every alert can be disabled individually, and every threshold can be tuned per environment.

The ten alerts

Critical (page immediately)

| Alert | Triggers when | Runbook |
| --- | --- | --- |
| OrbitalRegAPIDown | API target absent from Prometheus for 3 min | open |
| OrbitalRegDBDown | Postgres target absent from Prometheus for 2 min | open |
| OrbitalRegS3MirrorFailing | Backup S3 endpoint has pending retry queue for 30 min | open |
| OrbitalRegHighErrorRate | API 5xx ratio > 5% for 5 min | open |
| OrbitalRegSAMLDown | SAML IdP unreachable for 5 min | open |
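
As an illustration of how a ratio trigger such as OrbitalRegHighErrorRate can be expressed, here is a sketch of a single rule entry (an item under spec.groups[].rules of a PrometheusRule). The metric name http_requests_total and its job and status labels are assumptions; substitute whatever the API actually exports.

```yaml
- alert: OrbitalRegHighErrorRate
  # Share of requests answered with a 5xx status over the last 5 minutes.
  # Metric and label names are assumed, not taken from the chart.
  expr: |
    sum(rate(http_requests_total{job="orbitalreg-api", status=~"5.."}[5m]))
      /
    sum(rate(http_requests_total{job="orbitalreg-api"}[5m]))
      > 0.05
  for: 5m
  labels:
    severity: critical
```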

Warning (Slack during business hours)

| Alert | Triggers when | Runbook |
| --- | --- | --- |
| OrbitalRegBackupStale | No successful backup verification in 8 days | open |
| OrbitalRegScanQueueBacklog | Detection queue depth > 1000 for 10 min | open |
| OrbitalRegCertExpiring | An operator-managed cert is < 14 days from expiry | open |
| OrbitalRegLicenseExpiring | The active license envelope is < 30 days from expiry | open |

Info (dashboard only)

| Alert | Triggers when | Runbook |
| --- | --- | --- |
| OrbitalRegHighDiskUsage | An OrbitalReg PVC > 85% full for 10 min | open |

Severity policy

The severities ship as Prometheus labels (severity: critical | warning | info). They are meant to feed the Alertmanager routing rules that the operator owns, not to dictate them. A typical mapping:

```yaml
route:
  routes:
    - matchers: [severity="critical"]
      receiver: pagerduty-onCall
    - matchers: [severity="warning"]
      receiver: slack-platform
    - matchers: [severity="info"]
      receiver: slack-platform-noisy
```
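
The mapping above is a fragment. Alertmanager also expects a receiver on the root route and a definition for each receiver it references. A fuller sketch follows, which additionally limits warning notifications to business hours, as the Warning tier suggests. The business-hours interval, its days and times, and the choice of fallback receiver are assumptions, and the receiver integration settings are deliberately left out.

```yaml
route:
  receiver: slack-platform-noisy     # root fallback; assumed choice
  routes:
    - matchers: [severity="critical"]
      receiver: pagerduty-onCall
    - matchers: [severity="warning"]
      receiver: slack-platform
      # Notify only inside the named interval (assumed hours below).
      active_time_intervals: [business-hours]
    - matchers: [severity="info"]
      receiver: slack-platform-noisy

time_intervals:
  - name: business-hours
    time_intervals:
      - weekdays: ['monday:friday']
        times:
          - start_time: '09:00'
            end_time: '17:00'

receivers:                            # integration settings omitted
  - name: pagerduty-onCall
  - name: slack-platform
  - name: slack-platform-noisy
```

With this shape, a warning that fires outside the interval is not notified until the interval becomes active again, so it should not be used for anything that needs an immediate response.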

Disabling individual alerts

Every alert has an enabled toggle:

```yaml
observability:
  alerts:
    enabled: true
    rules:
      backupStale:
        enabled: false   # the dedicated backup-verify PrometheusRule
                         # already covers this
      highDiskUsage:
        enabled: true
        thresholdPct: 90 # tune per cluster
```

backupStale ships disabled by default in the bundle because the chart's separate backup-verify-prometheusrule.yaml (gated by backupVerification.alert.enabled) already publishes the equivalent rule with the same expression. Flip the bundle's copy on if you run the verifier in a sibling cluster that scrapes this Prometheus.
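
For that sibling-cluster case, an override might look like the sketch below. The key paths are the ones named on this page. Turning off backupVerification.alert.enabled in the same release is an assumption, made here only so the equivalent rule is not published twice.

```yaml
observability:
  alerts:
    enabled: true
    rules:
      backupStale:
        enabled: true    # turn on the bundle's copy
backupVerification:
  alert:
    enabled: false       # assumption: avoid a duplicate of the same rule
```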

Released under the Apache-2.0 License.