OrbitalRegHighDiskUsage

Severity: info · For: 10m · Runbook owner: platform on-call

What it means

A persistent volume claim whose name matches pvcSelector (default regex: .*orbitalreg.*) has been above 85% utilisation for 10 minutes. Utilisation is computed from the kubelet metrics kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes.
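
To check a volume against the threshold by hand, the same ratio can be computed from the two gauge values. This is a minimal sketch, assuming you have already read the used and capacity byte counts (for example from your metrics store); pvc_util_pct is a hypothetical helper name, not part of OrbitalReg.

```bash
# Hypothetical helper: utilisation percentage from used and capacity bytes.
# Args: used_bytes capacity_bytes
# Prints the percentage with one decimal place; returns non-zero if capacity <= 0.
pvc_util_pct() {
  awk -v u="$1" -v c="$2" 'BEGIN {
    if (c <= 0) { exit 1 }
    printf "%.1f\n", (u / c) * 100
  }'
}
```

Any result above 85.0 that persists for 10 minutes corresponds to the alert condition described above.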

This is a heads-up, not an emergency. At 85% you have hours to days of runway depending on growth rate. At 100% Postgres stops writing WAL (OrbitalRegDBDown) and the API stops accepting uploads.
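
The "hours to days" estimate can be made concrete with a linear projection. The sketch below assumes a constant growth rate, which real workloads rarely have, so treat the result as a rough bound; pvc_runway_hours is a hypothetical helper name, and the growth rate is something you would read off the Admin → Storage sparklines or your own metrics.

```bash
# Hypothetical helper: hours until the volume is full at a constant growth rate.
# Args: used capacity growth_per_hour (all three in the same unit, e.g. bytes or GiB)
# Prints "inf" when the volume is not growing.
pvc_runway_hours() {
  awk -v u="$1" -v c="$2" -v g="$3" 'BEGIN {
    if (g <= 0) { print "inf"; exit }
    printf "%.1f\n", (c - u) / g
  }'
}
```

For example, pvc_runway_hours 42.5 50 0.25 models a 50 GiB volume at 85% that grows by 0.25 GiB per hour.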

Likely causes

  1. PVC was sized for early-deployment data volumes; the instance now carries real load.
  2. Retention policies aren't running — orphaned artifacts piling up.
  3. WAL retention window in CNPG (default 30d) is longer than the volume can support.
  4. Trash isn't being purged — Admin → Trash → "Empty" hasn't been clicked.
  5. Detection scratch directory isn't being cleaned between scans.

Diagnose

bash
# Per-PVC capacity and status (kubectl does not report usage; fill level comes from the kubelet metrics)
kubectl -n orbitalreg get pvc

# Top contributors inside the Postgres PVC (CNPG mode)
# The glob must expand inside the container, so run du through sh -c.
kubectl -n orbitalreg exec orbitalreg-postgres-1 -- \
  sh -c 'du -sh /var/lib/postgresql/data/pgdata/* 2>/dev/null' | sort -h

# Size of `trash` table (S3 dedup-aware)
kubectl -n orbitalreg exec deploy/orbitalreg-api -- sh -c '
  psql "$ORBITALREG_PG_DSN" -c "
    SELECT pg_size_pretty(SUM(size_bytes)) AS total,
           COUNT(*) FROM trash WHERE purged_at IS NULL;"
'

# Pending retention work
curl -sf -H "Authorization: Bearer $TOKEN" \
  https://orbitalreg.example.com/api/admin/retention/status | jq

The Admin → Storage page renders the same data with growth-rate sparklines.

Fix

  1. Resize the PVC:
    bash
    kubectl -n orbitalreg patch pvc <name> -p \
      '{"spec":{"resources":{"requests":{"storage":"100Gi"}}}}'
    StorageClass must have allowVolumeExpansion: true.
  2. Drain the trash — Admin → Trash → "Empty" (purges to S3 garbage-collected state, frees the rows but not S3 bytes; S3 lifecycle handles those).
  3. Run retention now — Admin → Retention → "Run all policies".
  4. Tighten WAL retention: lower the CNPG Cluster's spec.backup.retentionPolicy from 30d to 14d. The trade-off is a smaller PITR window.
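
Before attempting the resize in step 1, it can help to confirm that the PVC's StorageClass permits expansion. This is a minimal sketch, assuming kubectl access to the cluster; pvc_can_expand is a hypothetical helper name, not an OrbitalReg or kubectl command.

```bash
# Hypothetical helper: prints "true" if the PVC's StorageClass has
# allowVolumeExpansion: true, otherwise "false".
# Args: namespace pvc_name
pvc_can_expand() {
  # Look up which StorageClass the claim uses.
  sc=$(kubectl -n "$1" get pvc "$2" -o jsonpath='{.spec.storageClassName}') || return 1
  # allowVolumeExpansion is a top-level field; it prints empty when unset.
  allowed=$(kubectl get storageclass "$sc" -o jsonpath='{.allowVolumeExpansion}') || return 1
  if [ "$allowed" = "true" ]; then echo true; else echo false; fi
}
```

Run it as pvc_can_expand orbitalreg <name>. If it prints false, the resize request is expected to be rejected and the Escalate path applies.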

Escalate

  • PVC at 95% and allowVolumeExpansion: false on the StorageClass → escalate to the platform-storage team; needs a coordinated migration to a larger volume.
  • Disk usage still growing after trash and retention have been run → likely a detection scratch leak; file an issue with the OrbitalReg backend team and attach du -h --max-depth=2 output from the Postgres or API PVC.

Released under the Apache-2.0 License.