OrbitalRegScanQueueBacklog
Severity: warning · For: 10m · Runbook owner: platform on-call
What it means
The Detection subsystem queues a scan job for every artifact upload plus every upstream pull. The metric orbitalreg_scan_jobs_pending reflects rows in scan_jobs with status='pending'. Sustained backlog above 1000 means the scanner workers are running slower than the upload rate — promotion gates that wait on scan results will time out.
Unlike API/DB-down, this is rarely a customer-impacting incident; it's a capacity-planning signal. But it is a leading indicator for OrbitalRegHighErrorRate since the upload path eventually starts rejecting once the queue cap is hit.
Likely causes
- A scanner is failing — Trivy / Grype / Syft DB pull is broken, every job for that scanner-type is failing-and-retrying.
- A bulk import dumped 5000 packages at once (item 8 of the roadmap, mass-import phase 2).
- Worker pods are at their concurrency cap and CPU-bound.
- A poison-pill artifact is stuck at the head of the queue, getting retried and timing out repeatedly.
Diagnose
bash
# Queue depth + age of the oldest pending job
kubectl -n orbitalreg exec -it deploy/orbitalreg-api -- sh -c '
psql "$ORBITALREG_PG_DSN" -c "
SELECT scanner, status, COUNT(*),
MIN(created_at) AS oldest
FROM scan_jobs
GROUP BY 1, 2 ORDER BY 1, 2;"
'
# Scanner failure rate from the metric
curl -sf https://orbitalreg.example.com/metrics \
| grep ^scanner_jobs_failed_total | sort
# Recent worker logs
kubectl -n orbitalreg logs -l app.kubernetes.io/component=api --tail=200 \
| grep -i 'scan' | tail -50The Admin → Detection page shows the same data with per-scanner charts.
Fix
- Failing scanner — check the offending scanner's vuln-DB pull. Trivy / Grype refresh from upstream every 6h; an air-gapped install needs the mirror configured. Re-run the DB pull manually:bash
kubectl -n orbitalreg exec deploy/orbitalreg-api -- \ /usr/local/bin/orbital-detect refresh-db --scanner=trivy - Mass import in flight — the queue will drain on its own. If it's eating worker time impacting interactive scans, throttle the import: Admin → Imports → pause active job.
- Capacity — bump
api.replicasand the per-pod scan-worker concurrency (SCAN_WORKERS_PER_PODenv). Each worker is ~250 MiB, ~0.5 CPU under load. - Poison pill — find it via
SELECT * FROM scan_jobs WHERE status='pending' ORDER BY attempts DESC LIMIT 5and either purge the artifact or quarantine the job:sqlUPDATE scan_jobs SET status='quarantined' WHERE id=$1;
Escalate
- Queue still growing 30 min after capacity bump → escalate to the OrbitalReg backend team. Possibly a deadlock between the worker pool and the DB pool.
- All scanners failing with
vuln-db checksum mismatch→ upstream feed corruption, worth posting in the OrbitalReg user channel before paging.