Skip to content

OrbitalRegScanQueueBacklog

Severity: warning · For: 10m · Runbook owner: platform on-call

What it means

The Detection subsystem queues a scan job for every artifact upload plus every upstream pull. The metric orbitalreg_scan_jobs_pending reflects rows in scan_jobs with status='pending'. Sustained backlog above 1000 means the scanner workers are running slower than the upload rate — promotion gates that wait on scan results will time out.

Unlike API/DB-down, this is rarely a customer-impacting incident; it's a capacity-planning signal. But it is a leading indicator for OrbitalRegHighErrorRate since the upload path eventually starts rejecting once the queue cap is hit.

Likely causes

  1. A scanner is failing — Trivy / Grype / Syft DB pull is broken, every job for that scanner-type is failing-and-retrying.
  2. A bulk import dumped 5000 packages at once (item 8 of the roadmap, mass-import phase 2).
  3. Worker pods are at their concurrency cap and CPU-bound.
  4. A poison-pill artifact is stuck at the head of the queue, getting retried and timing out repeatedly.

Diagnose

bash
# Queue depth + age of the oldest pending job
kubectl -n orbitalreg exec -it deploy/orbitalreg-api -- sh -c '
  psql "$ORBITALREG_PG_DSN" -c "
    SELECT scanner, status, COUNT(*),
           MIN(created_at) AS oldest
      FROM scan_jobs
     GROUP BY 1, 2 ORDER BY 1, 2;"
'

# Scanner failure rate from the metric
curl -sf https://orbitalreg.example.com/metrics \
  | grep ^scanner_jobs_failed_total | sort

# Recent worker logs
kubectl -n orbitalreg logs -l app.kubernetes.io/component=api --tail=200 \
  | grep -i 'scan' | tail -50

The Admin → Detection page shows the same data with per-scanner charts.

Fix

  1. Failing scanner — check the offending scanner's vuln-DB pull. Trivy / Grype refresh from upstream every 6h; an air-gapped install needs the mirror configured. Re-run the DB pull manually:
    bash
    kubectl -n orbitalreg exec deploy/orbitalreg-api -- \
      /usr/local/bin/orbital-detect refresh-db --scanner=trivy
  2. Mass import in flight — the queue will drain on its own. If it's eating worker time impacting interactive scans, throttle the import: Admin → Imports → pause active job.
  3. Capacity — bump api.replicas and the per-pod scan-worker concurrency (SCAN_WORKERS_PER_POD env). Each worker is ~250 MiB, ~0.5 CPU under load.
  4. Poison pill — find it via SELECT * FROM scan_jobs WHERE status='pending' ORDER BY attempts DESC LIMIT 5 and either purge the artifact or quarantine the job:
    sql
    UPDATE scan_jobs SET status='quarantined' WHERE id=$1;

Escalate

  • Queue still growing 30 min after capacity bump → escalate to the OrbitalReg backend team. Possibly a deadlock between the worker pool and the DB pool.
  • All scanners failing with vuln-db checksum mismatch → upstream feed corruption, worth posting in the OrbitalReg user channel before paging.

Released under the Apache-2.0 License.