OrbitalRegHighErrorRate
Severity: critical · For: 5m · Runbook owner: platform on-call
What it means
More than 5% of API responses over the last 5 minutes returned 5xx. Customers see failed pulls, pushes, and admin-page loads. Unlike OrbitalRegAPIDown, the API is up and responding quickly, just with errors.
Computed from the http_request_duration_seconds_count histogram counter, filtered by status=~"5..".
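As a rough cross-check of the ratio the alert evaluates, the sketch below sums the counter series from a raw /metrics scrape and prints the share that carries a 5xx status. It assumes the status label appears in the exposition text as status="<code>". It also reads cumulative counters since process start, so it only loosely approximates the alert's 5-minute window. The helper name error_ratio is made up for illustration.

```bash
# error_ratio: read Prometheus text exposition on stdin and print the
# percentage of http_request_duration_seconds_count samples with a 5xx status.
# Hypothetical helper; counts are cumulative since process start, not a 5m window.
error_ratio() {
  awk '
    /^http_request_duration_seconds_count/ {
      total += $NF                                   # sample value is the last field
      if ($0 ~ /status="5[0-9][0-9]"/) errors += $NF
    }
    END {
      if (total == 0) { print "0.00"; exit }         # no samples seen
      printf "%.2f\n", 100 * errors / total
    }
  '
}

# Usage (hostname taken from the Diagnose section):
#   curl -sf https://orbitalreg.example.com/metrics | error_ratio
```

A value well above 5 here suggests the errors are long-running rather than a recent burst; a value below 5 while the alert fires suggests the problem started recently.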
Likely causes
- Database is up but slow: pool exhaustion, lock contention, or a sequential scan from a missing index. Requests that time out waiting for a pooled connection surface as 5xx.
- S3 backend is rate-limiting (HTTP 503 from MinIO under load, from AWS during a regional event).
- An upstream cache or proxy that the API depends on is returning 5xx the API can't recover from.
- A bad release — new code path returning 500 on a specific handler.
Diagnose
bash
# Failing handlers by cumulative 5xx count since the last process restart, highest first
curl -sf https://orbitalreg.example.com/metrics \
  | grep '^http_request_duration_seconds_count.*status="5' \
  | awk '{print $NF, $0}' | sort -rn | head -n 10
# Top failing handlers over the last 5m (Prom UI; rate() accounts for counter resets):
#   topk(5, sum by (route) (
#     rate(http_request_duration_seconds_count{status=~"5..",
#       job=~"orbitalreg.*api.*"}[5m])
#   ))
# DB pool saturation
kubectl -n orbitalreg exec deploy/orbitalreg-api -- sh -c '
  psql "$ORBITALREG_PG_DSN" -c "
    SELECT state, COUNT(*) FROM pg_stat_activity
    WHERE datname=current_database() GROUP BY 1;"
'
# Recent slow queries (requires pg_stat_statements)
kubectl -n orbitalreg exec deploy/orbitalreg-api -- sh -c '
  psql "$ORBITALREG_PG_DSN" -c "
    SELECT calls, mean_exec_time, query
    FROM pg_stat_statements ORDER BY mean_exec_time DESC LIMIT 10;"
'
If OTel tracing is enabled (item 27 of the roadmap), grab a trace ID from a 5xx response (X-Trace-Id header) and pull the span tree from your tracing backend; this usually pinpoints the slow stage in seconds.
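One way to grab that trace ID from the command line is sketched below. It reads raw response headers and prints the X-Trace-Id value only when the status line is 5xx. The helper name trace_id_from_headers and the request path in the usage comment are hypothetical; substitute a route that is actually failing.

```bash
# trace_id_from_headers: read raw HTTP response headers on stdin and print the
# X-Trace-Id value when the response status is 5xx. Hypothetical helper.
# Exits non-zero if no trace ID was printed.
trace_id_from_headers() {
  tr -d '\r' | awk '
    NR == 1 { is_5xx = ($2 ~ /^5[0-9][0-9]$/) }      # status line, e.g. HTTP/1.1 503 ...
    is_5xx && tolower($1) == "x-trace-id:" { print $2; found = 1 }
    END { exit(found ? 0 : 1) }
  '
}

# Usage: -D - writes the response headers to stdout, -o /dev/null drops the body.
#   curl -s -o /dev/null -D - "https://orbitalreg.example.com/<failing-path>" \
#     | trace_id_from_headers
```

Because failures may be intermittent, it can take a few requests before a 5xx response (and therefore a trace ID) comes back.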
Fix
- Pool exhaustion: bump api.replicas (each pod has its own pool) or raise ORBITALREG_DB_POOL_MAX (per-pod). Either change raises the total connection count, so watch CNPG's max_connections ceiling.
- S3 slow: check the S3 endpoint's metrics. If MinIO, restart the bottlenecked node. If AWS, set s3.presignTTL higher to reduce re-signing churn.
- Bad release: helm rollback orbitalreg <previous>.
- Specific handler regression: quarantine via NetworkPolicy if it's an admin-only path; otherwise, roll back.
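Before changing api.replicas or ORBITALREG_DB_POOL_MAX, it can help to check the worst-case connection count against the Postgres ceiling. The sketch below does that arithmetic, assuming every pod can fill its pool at the same time. The helper name pool_headroom and the optional "reserved" argument (connections held by anything other than the API pods) are illustrative, and the numbers in the usage comment are invented.

```bash
# pool_headroom REPLICAS POOL_MAX MAX_CONNECTIONS [RESERVED]
# Prints how many Postgres connections remain if every pod fills its pool.
# A negative result means the planned settings would exceed max_connections.
# Hypothetical helper; RESERVED defaults to 0.
pool_headroom() {
  replicas=$1
  pool_max=$2
  max_conns=$3
  reserved=${4:-0}
  echo $(( max_conns - reserved - replicas * pool_max ))
}

# Usage (illustrative numbers only):
#   pool_headroom 6 20 200 10     # 6 pods x 20 connections against a ceiling of 200
#   pool_headroom 12 20 200 10    # doubling replicas with the same pool size
```

If the result is negative, raising the pool size or replica count would trade pool-wait timeouts for connection refusals at the database.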
Escalate
- 5xx rate still above threshold 10m after rollback → escalate to the OrbitalReg backend team. Likely the rollback didn't drain in-flight bad state (e.g., a corrupted Redis key); a coordinated flush + restart may be required.
- Tracing shows 5xx originating in the S3 client with no obvious pattern → may be a transient regional event; file a ticket with the cloud provider before paging internally.