
Monitoring

OrbitalReg exposes a Prometheus-compatible /metrics endpoint and ships two Grafana dashboards plus a Prometheus alert + recording-rule bundle so a customer-side observability stack can be wired up in a single afternoon. This page is the operator reference: the metric catalogue, the import recipes, the Alertmanager routing examples, and the PromQL snippets the on-call rotation reaches for first.

The high-level Observability page covers logs and traces alongside metrics; this page is the metrics deep-dive.

What ships in the box

| Asset | Path | Purpose |
| --- | --- | --- |
| /metrics endpoint | API process, served by prometheus/promhttp.Handler | Pull-mode scrape target for Prometheus |
| Metric registry | api/internal/metrics/ | Single source of truth for every series name + label set |
| Overview dashboard | docs/grafana/orbitalreg-overview.json | 12-panel RED + USE summary for an SRE morning glance |
| Deep-dive dashboard | docs/grafana/orbitalreg-deep-dive.json | 18-panel per-route / per-op / per-peer breakdown |
| Alert + recording rules | docs/prometheus/orbitalreg.alerts.yaml | Ten on-call alerts plus seven pre-aggregated recording rules |
| Dashboard CI guard | make grafana-dashboard-test | Refuses any panel that references an undeclared metric |
| Alert-rule CI guard | make prometheus-alerts-test | Refuses any alert that references an undeclared metric |

The CI guards make renaming a metric in the registry a compile-time-style error: the dashboard or rule file fails its test in the same commit that drops the underlying series, so a stale asset never reaches a customer install.

Scrape configuration

Point Prometheus at the API process. The endpoint is served on the same listener as the rest of the API (no separate port-bind today) and does not require authentication, so it can be scraped with the customer's existing service-discovery rules.

yaml
# prometheus.yml — minimal scrape job
scrape_configs:
  - job_name: orbitalreg-api
    metrics_path: /metrics
    scrape_interval: 30s
    scrape_timeout: 10s
    static_configs:
      - targets:
          - orbitalreg-api.orbitalreg.svc.cluster.local:8080

Kubernetes deployments using the kube-prometheus-stack get a ServiceMonitor selector for free: match on the app.kubernetes.io/name=orbitalreg-api label that the chart already sets. Keep the scrape interval at 30 seconds or shorter: the histogram_quantile() math in the bundled dashboards needs at least 4 scrapes inside each 5-minute window to be stable, and a 30-second interval yields 10.
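
For clusters where a ServiceMonitor has to be written by hand, the label match described above could look roughly like the sketch below. It is not shipped with OrbitalReg: the namespace is taken from the static target earlier on this page, and the release label and the port name http are assumptions that must be adapted to your Prometheus Operator install and to the Service the chart creates.

```yaml
# servicemonitor.yaml: illustrative sketch, not part of the bundle
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: orbitalreg-api
  namespace: orbitalreg                # assumption: same namespace as the API Service
  labels:
    release: kube-prometheus-stack     # assumption: must match your serviceMonitorSelector
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: orbitalreg-api
  endpoints:
    - port: http                       # assumption: name of the Service port exposing 8080
      path: /metrics
      interval: 30s
      scrapeTimeout: 10s
```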

Metric reference

Every series below is declared in api/internal/metrics/. Cardinality-sensitive labels are noted; everything else is bounded by a finite enum at the call site.

HTTP layer (RED)

| Metric | Type | Labels | Cardinality |
| --- | --- | --- | --- |
| http_requests_total | counter | method, route, status | bounded: route is the chi RoutePattern (e.g. /api/v1/repos/{id}), not the raw URL |
| http_request_duration_seconds | histogram | method, route, status | same as above; default Prometheus buckets |
| http_requests_inflight | gauge | (none) | single series |

The chi middleware in api/internal/middleware/metrics.go wraps every inbound request, so adding a new handler picks up RED coverage automatically.

Database (USE)

| Metric | Type | Labels | Notes |
| --- | --- | --- | --- |
| pg_pool_acquired_connections | gauge | (none) | Sampled every 10s by metrics.RunPgPoolPoller |
| pg_pool_idle_connections | gauge | (none) | Sampled every 10s |
| pg_pool_max_connections | gauge | (none) | Pool max from pgxpool.Stat() |
| pg_pool_total_connections | gauge | (none) | acquired + idle + constructing |

The recorded ratio pg_pool:saturation:ratio = pg_pool_acquired_connections / pg_pool_max_connections is what the OrbitalRegPgPoolNearExhausted alert fires on.

S3 / MinIO

| Metric | Type | Labels | Notes |
| --- | --- | --- | --- |
| s3_requests_total | counter | op | op is one of put/get/stat/delete/copy/presign/list/put_by_key |
| s3_request_duration_seconds | histogram | op | latency buckets tuned for object-store calls (5ms → 30s) |
| s3_request_errors_total | counter | op, err_class | err_class is one of notfound/auth/server/client/timeout/canceled/network/other |

These metrics are recorded inside startS3Span in internal/storage/s3.go, so every MinIO call records latency and outcome without per-call-site plumbing.

Scan dispatcher

| Metric | Type | Labels | Notes |
| --- | --- | --- | --- |
| scan_queue_depth | gauge | state | state is one of pending/claimed/running/done/failed/skipped; sampled every 15s |
| scan_jobs_processed_total | counter | outcome | outcome is one of done/failed/skipped |
| scan_job_duration_seconds | histogram | scanner | per-scanner wall-clock; bucket range 10ms → 5min |

Retention sweeper

| Metric | Type | Labels | Notes |
| --- | --- | --- | --- |
| retention_sweep_duration_seconds | histogram | (none) | one full sweep cycle, regardless of policy mix |
| retention_artifacts_deleted_total | counter | repo_id, dry_run | repo-keyed for SRE pivots; complements the operator-named retention_runs_total{policy_name,outcome} series |

Geo-Sync replication

| Metric | Type | Labels | Notes |
| --- | --- | --- | --- |
| replication_lag_seconds | gauge | peer | seconds since last successful push; zeroed on success |
| replication_outbox_depth | gauge | peer | geosync_outbox rows ahead of the peer's cursor |
| replication_push_errors_total | counter | peer, err_class | err_class is bucketed via ClassifyReplicationErr (outbox_read / marshal / request / throttle / deliver / peer_5xx / peer_4xx / other) |

peer is the user-defined peer ID; cardinality matches the row count in geosync_peer_state (single-digit on every install we've seen).

License

| Metric | Type | Labels | Notes |
| --- | --- | --- | --- |
| license_checks_total | counter | outcome | outcome is one of trial_active/trial_expired/licensed_active/licensed_expired/invalid/derive_error |
| license_expires_in_seconds | gauge | (none) | negative once expired; drives the renewal alerts |

Build identity

| Metric | Type | Labels | Notes |
| --- | --- | --- | --- |
| orbitalreg_build_info | gauge=1 | version, commit, go_version, built_at | The "what's running here" beacon; pin to a Grafana table panel |
| orbitalreg_uptime_seconds | gauge | (none) | Ticks once per second; resets on restart |

Log volume (item 71 Phase G)

Sampled by the applog metrics handler that wraps the JSON encoder (api/internal/applog/metrics.go); both counters increment per emitted record.

| Metric | Type | Labels | Notes |
| --- | --- | --- | --- |
| log_lines_emitted_total | counter | component, level | One increment per slog record; cardinality bounded by the canonical applog component enum × {debug,info,warn,error}. component=unknown is the fallback bucket for records emitted before any applog.Component tag. |
| log_bytes_emitted_total | counter | component | Approximate JSON-serialised wire bytes (post-redaction); use for SIEM ingestion-budget forecasts. |

The OrbitalRegLogVolumeSpike alert fires (severity warning) on a 5× spike over the prior-hour baseline; see the Log volume spike runbook below.

Recording rules

The bundled orbitalreg.alerts.yaml ships seven recording rules so the dashboards (and the alerts themselves) don't pay the histogram-quantile + division cost on every evaluation:

| Recording rule | Underlying expression |
| --- | --- |
| route:http_requests:rate5m | per-route request rate (sum by (route, method) (rate(http_requests_total[5m]))) |
| route:http_requests:error_rate5m | per-route 5xx ratio |
| route:http_request_duration_seconds:p99_5m | per-route p99 latency (5m window) |
| cluster:http_requests:error_rate5m | cluster-wide 5xx ratio (5m fast burn) |
| cluster:http_requests:error_rate1h | cluster-wide 5xx ratio (1h slow burn) |
| pg_pool:saturation:ratio | acquired / max, the saturation indicator |
| s3:requests:error_rate10m | per-op S3 error ratio (10m window) |

Recording-rule names follow the documented Prometheus convention level:metric:operation so a dashboard reader can read what the series is without opening the YAML.
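
To make the level:metric:operation naming concrete, here is a sketch of how four of these rules could be written. Only the route:http_requests:rate5m and pg_pool:saturation:ratio expressions are stated on this page; the other two expressions, the group name, and the evaluation interval are assumptions for illustration. The bundled orbitalreg.alerts.yaml remains the authoritative definition.

```yaml
groups:
  - name: orbitalreg-recording-sketch     # hypothetical group name
    interval: 30s                          # assumption: evaluate once per scrape
    rules:
      # Expression as documented in the table above.
      - record: route:http_requests:rate5m
        expr: sum by (route, method) (rate(http_requests_total[5m]))

      # Assumed shape: 5xx requests divided by all requests, cluster-wide.
      - record: cluster:http_requests:error_rate5m
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            /
          sum(rate(http_requests_total[5m]))

      # Assumed shape: p99 from the duration histogram, aggregated per route.
      - record: route:http_request_duration_seconds:p99_5m
        expr: |
          histogram_quantile(0.99,
            sum by (le, route) (rate(http_request_duration_seconds_bucket[5m])))

      # Expression as documented in the Database (USE) section.
      - record: pg_pool:saturation:ratio
        expr: pg_pool_acquired_connections / pg_pool_max_connections
```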

Grafana — importing the dashboards

The dashboards live in docs/grafana/ and use a templated ${DS_PROMETHEUS} datasource so the import wizard binds cleanly to whatever Prometheus instance you point them at.

  1. Open Grafana → Dashboards → Import.
  2. Click Upload JSON file and select either orbitalreg-overview.json or orbitalreg-deep-dive.json. (You can also paste the file contents into the "Import via panel json" text area.)
  3. On the next screen pick the Prometheus datasource that scrapes /metrics (the ${DS_PROMETHEUS} placeholder is bound here).
  4. Click Import.

To version-control the import, keep the JSON files alongside your Grafana provisioning config:

yaml
# grafana-provisioning/dashboards/orbitalreg.yaml
apiVersion: 1
providers:
  - name: orbitalreg
    folder: OrbitalReg
    type: file
    options:
      path: /var/lib/grafana/dashboards/orbitalreg

…and drop the two JSON files under /var/lib/grafana/dashboards/orbitalreg/. Grafana picks them up at startup and re-syncs whenever the file changes.

Dashboard layout

  • Overview is the morning glance: RPS, 5xx rate, latency p50/p95/p99, inflight, pg-pool saturation, S3 latency p99 per-op, S3 error rate, scan-queue depth per state, retention sweep p99, replication lag per peer, license expires-in days, build-info table.
  • Deep-dive is the triage view: per-route topk RPS / p99 / 5xx %, status-code distribution, per-method, pool saturation %, per-op S3 throughput + p95 + error class, scan-queue per state, per-scanner duration p99, outcome counters, retention deletes per repo, per-peer replication lag / outbox / push errors, license checks per outcome, uptime.

The deep-dive dashboard's per-route panels use the route:http_requests:rate5m and route:http_request_duration_seconds:p99_5m recording rules — they will only populate once the rule file is loaded into Prometheus.

Prometheus — loading the alert rules

Drop orbitalreg.alerts.yaml into your Prometheus rule directory and reference it from the main config:

yaml
# prometheus.yml
rule_files:
  - /etc/prometheus/rules/orbitalreg.alerts.yaml

Reload Prometheus (POST /-/reload, which requires the --web.enable-lifecycle flag, or SIGHUP) and confirm under Status → Rules that all seven recording rules and ten alert rules show up green. The alert-rule CI guard (make prometheus-alerts-test) parses the same file and refuses any PromQL that references an undeclared metric, so a clean test run is a strong signal the file will load against any Prometheus that scrapes the API.

Severity convention

Every alert carries two routing labels:

  • severity: page (operator must wake up; SLO burning now) or warning (investigate within business hours).
  • subsystem: api / postgres / storage / scan / geosync / license / retention / logging. Lets Alertmanager route per team without inspecting the alert name.
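
As an illustration of the convention, an alert carrying both routing labels might be declared as follows. The alert name, threshold, for: window, severity, and subsystem are taken from the Pg pool runbook on this page; the group name, annotation wording, and runbook URL are placeholders, not the contents of the bundled file.

```yaml
groups:
  - name: orbitalreg-alerts-sketch         # hypothetical group name
    rules:
      - alert: OrbitalRegPgPoolNearExhausted
        expr: pg_pool:saturation:ratio > 0.8
        for: 5m
        labels:
          severity: warning                # routing label 1
          subsystem: postgres              # routing label 2
        annotations:
          summary: Postgres connection pool is above 80% saturation
          description: 'Pool saturation is {{ $value | humanizePercentage }} on {{ $labels.instance }}.'
          # Placeholder URL; the real annotation points at the runbook section.
          runbook_url: https://docs.example.com/monitoring#pg-pool-near-exhausted
```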

Alertmanager routing examples

The bundled rules expose the labels Alertmanager needs to route without any extra wrapping. Three minimal recipes follow — pick the one that matches how your team already does on-call.

PagerDuty (page-severity → wake the operator)

yaml
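# alertmanager.yml: PagerDuty for page, Slack for warning.
# ${...} values are placeholders. Alertmanager does not expand environment
# variables, so render them at deploy time (for example with envsubst).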
route:
  receiver: default
  group_by: ['alertname', 'subsystem']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - matchers:
        - severity = page
      receiver: pagerduty-orbitalreg
      continue: false
    - matchers:
        - severity = warning
      receiver: slack-orbitalreg
      continue: false

receivers:
  - name: default
    # silently absorb anything that doesn't match the routes above.

  - name: pagerduty-orbitalreg
    pagerduty_configs:
      - service_key: ${PAGERDUTY_INTEGRATION_KEY}
        description: '{{ .CommonAnnotations.summary }}'
        details:
          subsystem: '{{ .CommonLabels.subsystem }}'
          firing:    '{{ .Alerts.Firing | len }}'
          runbook:   '{{ .CommonAnnotations.runbook_url }}'

  - name: slack-orbitalreg
    slack_configs:
      - api_url: ${SLACK_WEBHOOK_URL}
        channel: '#orbitalreg-alerts'
        title: '{{ .CommonAnnotations.summary }}'
        text: |
          {{ range .Alerts -}}
          *{{ .Labels.alertname }}* — {{ .Annotations.description }}
          <{{ .Annotations.runbook_url }}|Runbook>
          {{ end }}

Slack-only (warnings + pages both go to chat)

For early-stage installs that don't yet have a paging contract:

yaml
route:
  receiver: slack-orbitalreg
  group_by: ['alertname', 'subsystem']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h

receivers:
  - name: slack-orbitalreg
    slack_configs:
      - api_url: ${SLACK_WEBHOOK_URL}
        channel: '#orbitalreg-alerts'
        send_resolved: true
        title: '[{{ .CommonLabels.severity | toUpper }}] {{ .CommonAnnotations.summary }}'
        text: |
          {{ range .Alerts -}}
          *{{ .Labels.alertname }}* — {{ .Annotations.description }}
          <{{ .Annotations.runbook_url }}|Runbook>
          {{ end }}

Email-only (air-gapped, no SaaS notifier)

The customer profile that runs OrbitalReg in-cluster typically also runs Postfix or a corporate SMTP relay. Email is the lowest-common-denominator channel and works without any external connectivity:

yaml
global:
  smtp_smarthost: smtp.internal.example.com:587
  smtp_from: alertmanager@example.com
  smtp_auth_username: alertmanager
  smtp_auth_password_file: /etc/alertmanager/smtp.password

route:
  receiver: email-orbitalreg
  group_by: ['alertname', 'subsystem']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - matchers:
        - severity = page
      receiver: email-orbitalreg-page
      continue: false

receivers:
  - name: email-orbitalreg
    email_configs:
      - to: orbitalreg-warning@example.com
        send_resolved: true

  - name: email-orbitalreg-page
    email_configs:
      - to: orbitalreg-oncall@example.com
        send_resolved: true
        # Most corporate mail servers throttle bursts; group_interval: 5m
        # on the route keeps follow-up notifications for the same group
        # at least 5 minutes apart.
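
The subsystem label also allows per-team routing on top of any of the recipes above. The following is a minimal sketch under stated assumptions: the team receivers and channel names are hypothetical, and the ${SLACK_WEBHOOK_URL} placeholder has to be substituted before Alertmanager loads the file.

```yaml
route:
  receiver: slack-orbitalreg               # fallback for every other subsystem
  group_by: ['alertname', 'subsystem']
  routes:
    - matchers:
        - 'subsystem =~ "postgres|storage"'
      receiver: slack-data-platform        # hypothetical team receiver
    - matchers:
        - subsystem = geosync
      receiver: slack-replication          # hypothetical team receiver

receivers:
  - name: slack-orbitalreg
    slack_configs:
      - api_url: ${SLACK_WEBHOOK_URL}
        channel: '#orbitalreg-alerts'
  - name: slack-data-platform
    slack_configs:
      - api_url: ${SLACK_WEBHOOK_URL}
        channel: '#data-platform-alerts'   # hypothetical channel
  - name: slack-replication
    slack_configs:
      - api_url: ${SLACK_WEBHOOK_URL}
        channel: '#replication-alerts'     # hypothetical channel
```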

Sample PromQL — the five most common triage questions

Open the deep-dive dashboard first; these are the queries to paste into Prometheus's expression browser when a panel doesn't tell the whole story.

1. Where are my errors coming from?

Top routes by 5xx rate over the last 15 minutes:

text
topk(5,
  sum by (route) (rate(http_requests_total{status=~"5.."}[15m]))
)

Compare against the per-route 5xx ratio (filters out hot routes that are simply busy):

text
topk(5, route:http_requests:error_rate5m > 0)

2. Why is my latency spiking?

The five worst routes by p99 latency, recording-rule pre-aggregated:

text
topk(5, route:http_request_duration_seconds:p99_5m)

Cross-correlate with backend backpressure to confirm the spike is downstream of the API layer:

text
histogram_quantile(0.99,
  sum by (le, op) (rate(s3_request_duration_seconds_bucket[5m])))
text
pg_pool:saturation:ratio

3. Is the scan dispatcher draining?

Pending + claimed + running depth — should hover near zero on a healthy install:

text
sum(scan_queue_depth{state=~"pending|claimed|running"})

Failure-rate share over the last 15 minutes (≥0.2 trips OrbitalRegScanJobsFailing):

text
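# increase() counts jobs over the window, so clamp_min(..., 1) only guards
# against an empty window instead of distorting the share at low job rates.
sum(increase(scan_jobs_processed_total{outcome="failed"}[15m]))
  /
clamp_min(sum(increase(scan_jobs_processed_total[15m])), 1)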

Per-scanner p99 to find the misbehaving backend:

text
histogram_quantile(0.99,
  sum by (le, scanner) (rate(scan_job_duration_seconds_bucket[15m])))

4. Which peer is replication-lagging?

text
topk(3, replication_lag_seconds)

Confirm the producer is healthy by checking the outbox depth — a deep outbox + low push-error rate means the consumer is slow, not the producer:

text
replication_outbox_depth
text
sum by (peer, err_class) (rate(replication_push_errors_total[15m]))

5. How healthy is the database pool?

The headline ratio:

text
pg_pool:saturation:ratio

Connection accounting, useful when the ratio looks fine but pool acquisitions are stalling on a leak:

text
# Four separate queries; run each on its own.
pg_pool_acquired_connections
pg_pool_idle_connections
pg_pool_max_connections
pg_pool_total_connections

Alert runbook

Each section below is the operator-facing triage guide referenced by the runbook_url annotation on the matching alert in orbitalreg.alerts.yaml. Section anchors are stable so a future rule rename only needs to update the YAML.

High error rate

  • Alert: OrbitalRegHighErrorRate (severity page, subsystem api).
  • Trips when: cluster 5xx rate > 5% over 5m and > 1% over 1h (multi-window burn-rate).
  • Triage:
    1. Open the deep-dive dashboard's "Top routes by 5xx" panel; one route is almost always the source.
    2. Pull the JSON-log line for a recent 5xx via Loki / Splunk filtered on level=error component=api plus the offending route — the canonical schema (item 63) carries the full request identity.
    3. Cross-reference pg-pool saturation and S3 error rate — a 5xx spike is usually downstream.
  • Resolve when: cluster:http_requests:error_rate5m falls below 0.05 for at least the alert's for: window.
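
The multi-window condition can be expressed as a single rule over the two cluster-level recording rules. This is a sketch of one way to write it, using the thresholds stated above; the group name and the for: value are assumptions, and the bundled rule may differ in detail.

```yaml
groups:
  - name: orbitalreg-error-rate-sketch     # hypothetical group name
    rules:
      - alert: OrbitalRegHighErrorRate
        # Fast window confirms the burn is happening now; slow window
        # confirms it is not a momentary blip.
        expr: |
          cluster:http_requests:error_rate5m > 0.05
            and
          cluster:http_requests:error_rate1h > 0.01
        for: 2m                            # assumption: real window is set in the bundled file
        labels:
          severity: page
          subsystem: api
```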

High latency

  • Alert: OrbitalRegHighLatency (severity warning, subsystem api).
  • Trips when: any route's p99 latency stays above 1s for 10 minutes.
  • Triage:
    1. Use the "p99 by route" panel on the deep-dive dashboard to find the slow route(s).
    2. Check s3_request_duration_seconds — 95% of latency regressions are S3 backpressure.
    3. Check pg_pool:saturation:ratio — the second-most-common cause is a slow query holding pool connections.
  • Resolve when: the offending route's p99 returns below 1s.

Pg pool near exhausted

  • Alert: OrbitalRegPgPoolNearExhausted (severity warning, subsystem postgres).
  • Trips when: pool saturation > 80% for 5 minutes.
  • Triage:
    1. Inspect pg_pool_acquired_connections vs pg_pool_max_connections — confirm the gauge isn't a stale scrape.
    2. Identify long-running queries via pg_stat_activity (filter on state = 'active' and query_start < now() - interval '30s').
    3. If the pool is genuinely too small for traffic, raise ORBITALREG_PG_MAX_CONNS (and the matching CNPG pooler limit if you're on cloudnativepg).
  • Resolve when: saturation drops below 0.7.

S3 high error rate

  • Alert: OrbitalRegS3HighErrorRate (severity page, subsystem storage).
  • Trips when: any S3 op's error rate > 5% for 10 minutes.
  • Triage:
    1. Break down by err_class: sum by (op, err_class) (rate(s3_request_errors_total[10m])). A spike of auth is credentials; network / timeout is the backend; notfound on head / stat is usually a stale manifest pointer (rare).
    2. Check the MinIO server's own /minio/v2/metrics/cluster endpoint for the upstream view.
    3. Confirm the bucket policy + IAM still match — mc admin policy ls on the cluster identity.
  • Resolve when: S3 error rate drops below 5% on every op.
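
For reference, the per-op error ratio and the alert condition could be written along these lines. The metric names, threshold, and window come from this page; the exact recording expression and the group name are assumptions rather than a copy of the bundled rules.

```yaml
groups:
  - name: orbitalreg-s3-sketch             # hypothetical group name
    rules:
      # Assumed shape: errors divided by requests, per op, over 10 minutes.
      - record: s3:requests:error_rate10m
        expr: |
          sum by (op) (rate(s3_request_errors_total[10m]))
            /
          sum by (op) (rate(s3_requests_total[10m]))

      - alert: OrbitalRegS3HighErrorRate
        expr: s3:requests:error_rate10m > 0.05
        for: 10m
        labels:
          severity: page
          subsystem: storage
```

An op with no recorded errors produces no series on the left-hand side of the division, so it simply drops out of the result rather than reporting 0.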

Scan queue deep

  • Alert: OrbitalRegScanQueueDeep (severity warning, subsystem scan).
  • Trips when: pending + claimed + running > 1000 for 30 minutes.
  • Triage:
    1. Confirm the dispatcher is alive: filter logs on component=scan_dispatcher for activity in the last minute.
    2. Look at scan_jobs_processed_total{outcome="failed"} — a crash-looping scanner can saturate the claim queue.
    3. Per-scanner duration p99 isolates a single backend that's gone slow.
  • Resolve when: the depth falls below 500.

Replication lag high

  • Alert: OrbitalRegReplicationLagHigh (severity warning, subsystem geosync).
  • Trips when: any peer's replication_lag_seconds > 60 for 5 minutes.
  • Triage:
    1. replication_push_errors_total by err_class for the slow peer — peer_5xx / peer_4xx means the remote refused; deliver / throttle means the local pusher couldn't keep up.
    2. replication_outbox_depth confirms whether the producer or the consumer is the bottleneck.
    3. The pusher zeroes the lag gauge on a successful push, so a transient peer outage clears the alert without a config change once recovery completes.
  • Resolve when: lag falls below 60s for the alert's for: window.

License expiring soon

  • Alert: OrbitalRegLicenseExpiringSoon (severity warning, subsystem license).
  • Trips when: license_expires_in_seconds drops below 30 days (2,592,000 seconds).
  • Triage:
    1. Confirm the renewal is in flight with the account team.
    2. Once a fresh license envelope is available, upload via Admin → License → Upload.
    3. Verify the gauge has reset: license_expires_in_seconds / 86400 > 30.

License expired

  • Alert: OrbitalRegLicenseExpired (severity page, subsystem license).
  • Trips when: license_expires_in_seconds ≤ 0 for 5 minutes.
  • Triage: same upload path as above. Most write operations begin refusing once the grace window ends, so this alert pages rather than warns.
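
Both license alerts reduce to a threshold on the same gauge, with the day count converted to seconds. A sketch under that assumption (the group name is hypothetical, and the for: value on the first rule is not stated on this page, so it is omitted):

```yaml
groups:
  - name: orbitalreg-license-sketch        # hypothetical group name
    rules:
      - alert: OrbitalRegLicenseExpiringSoon
        expr: license_expires_in_seconds < 30 * 86400   # 30 days in seconds
        labels:
          severity: warning
          subsystem: license

      - alert: OrbitalRegLicenseExpired
        expr: license_expires_in_seconds <= 0            # gauge goes negative once expired
        for: 5m
        labels:
          severity: page
          subsystem: license
```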

Retention sweep stuck

  • Alert: OrbitalRegRetentionSweepStuck (severity warning, subsystem retention).
  • Trips when: no retention_sweep_duration_seconds observation in the last 24 hours.
  • Triage:
    1. Filter logs on component=retention_sweeper and look for the last successful tick.
    2. The sweeper runs hourly inside the API process; an absent metric means the goroutine has exited or wedged.
    3. If logs show repeated panics, restart the API; the sweeper will re-arm on next boot.
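
Detecting the absence of observations is usually done on the histogram's _count series. The rule below is one possible formulation, not necessarily the bundled one: the group name and the for: value are assumptions, and the absent() branch covers the case where the series is missing altogether.

```yaml
groups:
  - name: orbitalreg-retention-sketch      # hypothetical group name
    rules:
      - alert: OrbitalRegRetentionSweepStuck
        # No new observations in 24h, or the series does not exist at all.
        expr: |
          sum(increase(retention_sweep_duration_seconds_count[24h])) == 0
            or
          absent(retention_sweep_duration_seconds_count)
        for: 1h                            # assumption: lets a restarted process finish one hourly sweep
        labels:
          severity: warning
          subsystem: retention
```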

Scan jobs failing

  • Alert: OrbitalRegScanJobsFailing (severity warning, subsystem scan).
  • Trips when: the failure-outcome share of terminal scan jobs exceeds 20% over 15 minutes, sustained for 30 minutes.
  • Triage:
    1. Per-scanner duration p99 isolates the misbehaving backend.
    2. Filter logs on component=scan_dispatcher and level=error for the crash details.
    3. If a single scanner is the culprit, disable it via Admin → Detection → Scanners until the upstream image is fixed.
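
The trip condition can be sketched as a ratio of job counts. This version assumes increase() rather than rate(), so that clamp_min(..., 1) only protects against division by zero in an empty window; the group name is hypothetical and the bundled expression may be written differently.

```yaml
groups:
  - name: orbitalreg-scan-sketch           # hypothetical group name
    rules:
      - alert: OrbitalRegScanJobsFailing
        # Share of terminal jobs that failed in the last 15 minutes.
        expr: |
          sum(increase(scan_jobs_processed_total{outcome="failed"}[15m]))
            /
          clamp_min(sum(increase(scan_jobs_processed_total[15m])), 1)
            > 0.2
        for: 30m
        labels:
          severity: warning
          subsystem: scan
```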

Log volume spike

  • Alert: OrbitalRegLogVolumeSpike (severity warning, subsystem logging).
  • Trips when: the aggregate log_lines_emitted_total rate over the last 5 minutes is more than 5× the baseline, sustained for 10 minutes. The baseline is the same counter's rate over a 1h window shifted back with offset 1h, so steady-state growth (organic load) does not fire the alert; a sudden spike does.
  • Common causes:
    • A handler stuck in a retry loop emitting one error per retry.
    • A scanner re-emitting the same error per artifact (typically component=scan_dispatcher).
    • A frontend client-error storm (item 71 Phase E — component=frontend) — usually a render loop or unhandled promise rejection introduced by a recent deploy.
    • A new background worker added without a sensible log level (e.g. an importer logging one line per processed item at info).
  • Triage:
    1. PromQL topk(5, sum by (component, level) (rate(log_lines_emitted_total[5m]))) to find the noisy source.
    2. Pivot the SIEM to that component / level — the offending records will already be in the index.
    3. If component=frontend, the rate limit on the /api/v1/client-logs endpoint will eventually shed the spam, but a deploy revert is faster than waiting for the bucket to drain.
    4. For server-side spam, drop the offender to warn via ORBITALREG_LOG_LEVEL only if the rate is actively burning downstream SIEM ingestion budget — otherwise fix the loop and ship.
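
The trip condition described above translates into a comparison between a short-window rate and an offset baseline. The following sketch follows that description; the group name is hypothetical and the bundled rule may aggregate or guard the expression differently.

```yaml
groups:
  - name: orbitalreg-logging-sketch        # hypothetical group name
    rules:
      - alert: OrbitalRegLogVolumeSpike
        # Current 5m rate versus 5x the 1h rate measured one hour earlier.
        expr: |
          sum(rate(log_lines_emitted_total[5m]))
            > 5 *
          sum(rate(log_lines_emitted_total[1h] offset 1h))
        for: 10m
        labels:
          severity: warning
          subsystem: logging
```

Because the baseline window ends an hour before now, the spike does not inflate its own baseline during the first hour.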

Verifying the bundle locally

The Makefile targets work from a fresh checkout — useful when operators want to verify the assets before importing them into a locked-down production stack:

bash
# Validates docs/grafana/*.json: schema version, metric refs,
# panel shape. Fails if any panel target references a metric not
# declared in api/internal/metrics/.
make grafana-dashboard-test

# Validates docs/prometheus/orbitalreg.alerts.yaml: alert shape,
# severity / summary / description / runbook_url presence, metric
# refs against the same canonical registry, sanity floors.
make prometheus-alerts-test

Both run inside CI on every change to either the registry or the asset files, so a metric rename can never silently desync the bundled dashboards or alerts.

Out of scope (today)

  • Push-mode export. Prometheus is pull-only; an OTLP push path via the OpenTelemetry Collector lands later, when customer demand surfaces.
  • Per-tenant metrics. Single-tenant assumption holds; a multi-tenant deployment would need a tenant_id label retrofit across the registry.
  • Saturation forecasting. No baked-in capacity-planner panels — the Grafana built-in trend overlays are sufficient until a customer asks for predictive saturation.
  • StatsD / Graphite export. Prometheus only; an in-house exporter is the recommended bridge for legacy observability stacks.

Released under the Apache-2.0 License.