Skip to content

OrbitalRegHighErrorRate

Severity: critical · For: 5m · Runbook owner: platform on-call

What it means

More than 5% of API responses over the last 5 minutes returned 5xx. Customers see failed pulls / pushes / admin-page loads. Unlike OrbitalRegAPIDown the API is up — it's returning quickly, just with errors.

Computed from the http_request_duration_seconds_count histogram counter, filtered by status=~"5..".

Likely causes

  1. Database is up but slow — pool exhaustion, lock contention, sequential scan from a missing index. 5xx = pool wait timeout.
  2. S3 backend is rate-limiting (HTTP 503 from MinIO under load, from AWS during a regional event).
  3. Downstream upstream cache / proxy returning 5xx the API can't recover from.
  4. A bad release — new code path returning 500 on a specific handler.

Diagnose

bash
# Top failing handlers in the last 5m
curl -sf https://orbitalreg.example.com/metrics \
  | grep '^http_request_duration_seconds_count.*status="5' \
  | sort -t'"' -k4

# Same data filtered to the last reset (Prom UI):
#   topk(5, sum by (route) (
#     rate(http_request_duration_seconds_count{status=~"5..",
#          job=~"orbitalreg.*api.*"}[5m])
#   ))

# DB pool saturation
kubectl -n orbitalreg exec deploy/orbitalreg-api -- sh -c '
  psql "$ORBITALREG_PG_DSN" -c "
    SELECT state, COUNT(*) FROM pg_stat_activity
     WHERE datname=current_database() GROUP BY 1;"
'

# Recent slow queries (requires pg_stat_statements)
kubectl -n orbitalreg exec deploy/orbitalreg-api -- sh -c '
  psql "$ORBITALREG_PG_DSN" -c "
    SELECT calls, mean_exec_time, query
      FROM pg_stat_statements ORDER BY mean_exec_time DESC LIMIT 10;"
'

If OTel tracing is enabled (item 27 of the roadmap), grab a trace ID from a 5xx response (X-Trace-Id header) and pull the span tree from your tracing backend — usually pinpoints the slow stage in seconds.

Fix

  1. Pool exhaustion — bump api.replicas (each pod has its own pool) or raise ORBITALREG_DB_POOL_MAX (per-pod). Watch CNPG's max_connections ceiling.
  2. S3 slow — check the S3 endpoint's metrics. If MinIO, restart the bottlenecked node. If AWS, set s3.presignTTL higher to reduce re-signing churn.
  3. Bad releasehelm rollback orbitalreg <previous>.
  4. Specific handler regression — quarantine via NetworkPolicy if it's an admin-only path; otherwise, rollback.

Escalate

  • 5xx rate still above threshold 10m after rollback → escalate to the OrbitalReg backend team. Likely the rollback didn't drain in-flight bad state (e.g., a corrupted Redis key); a coordinated flush + restart may be required.
  • Tracing shows 5xx originating in the S3 client with no obvious pattern → may be a transient regional event, file with the cloud provider before paging internally.

Released under the Apache-2.0 License.