`OrbitalRegHighErrorRate`

Severity: critical · For: 5m · Runbook owner: platform on-call

What it means

More than 5% of API responses over the last 5 minutes returned 5xx. Customers see failed pulls / pushes / admin-page loads. Unlike OrbitalRegAPIDown the API is up — it's returning quickly, just with errors.

Computed from the http_request_duration_seconds_count histogram counter, filtered by status=~"5..".

Likely causes

Database is up but slow — pool exhaustion, lock contention, sequential scan from a missing index. 5xx = pool wait timeout.
S3 backend is rate-limiting (HTTP 503 from MinIO under load, from AWS during a regional event).
Downstream upstream cache / proxy returning 5xx the API can't recover from.
A bad release — new code path returning 500 on a specific handler.

Diagnose

bash

# Top failing handlers in the last 5m
curl -sf https://orbitalreg.example.com/metrics \
  | grep '^http_request_duration_seconds_count.*status="5' \
  | sort -t'"' -k4

# Same data filtered to the last reset (Prom UI):
#   topk(5, sum by (route) (
#     rate(http_request_duration_seconds_count{status=~"5..",
#          job=~"orbitalreg.*api.*"}[5m])
#   ))

# DB pool saturation
kubectl -n orbitalreg exec deploy/orbitalreg-api -- sh -c '
  psql "$ORBITALREG_PG_DSN" -c "
    SELECT state, COUNT(*) FROM pg_stat_activity
     WHERE datname=current_database() GROUP BY 1;"
'

# Recent slow queries (requires pg_stat_statements)
kubectl -n orbitalreg exec deploy/orbitalreg-api -- sh -c '
  psql "$ORBITALREG_PG_DSN" -c "
    SELECT calls, mean_exec_time, query
      FROM pg_stat_statements ORDER BY mean_exec_time DESC LIMIT 10;"
'

If OTel tracing is enabled (item 27 of the roadmap), grab a trace ID from a 5xx response (X-Trace-Id header) and pull the span tree from your tracing backend — usually pinpoints the slow stage in seconds.

Fix

Pool exhaustion — bump api.replicas (each pod has its own pool) or raise ORBITALREG_DB_POOL_MAX (per-pod). Watch CNPG's max_connections ceiling.
S3 slow — check the S3 endpoint's metrics. If MinIO, restart the bottlenecked node. If AWS, set s3.presignTTL higher to reduce re-signing churn.
Bad release — helm rollback orbitalreg <previous>.
Specific handler regression — quarantine via NetworkPolicy if it's an admin-only path; otherwise, rollback.

Escalate

5xx rate still above threshold 10m after rollback → escalate to the OrbitalReg backend team. Likely the rollback didn't drain in-flight bad state (e.g., a corrupted Redis key); a coordinated flush + restart may be required.
Tracing shows 5xx originating in the S3 client with no obvious pattern → may be a transient regional event, file with the cloud provider before paging internally.

OrbitalRegHighErrorRate ​

What it means ​

Likely causes ​

Diagnose ​

Fix ​

Escalate ​

`OrbitalRegHighErrorRate`

What it means

Likely causes

Diagnose

Fix

Escalate