Observability

OrbitalReg emits all three OpenTelemetry signals (metrics, logs, and traces) through a single OTLP/HTTP endpoint, plus a fourth browser-side stream (Real-User Monitoring) that joins the backend traces through W3C trace-context propagation. The Prometheus /metrics text endpoint remains in place; the OTel pipe is a parallel bridge, not a replacement, so a customer running Prometheus today loses nothing by switching the OTel-export flag on.

Air-gapped default

Every OTel surface (logs, metrics, traces, RUM) is default-OFF. The bridge stays no-op as long as OTEL_EXPORTER_OTLP_ENDPOINT is empty, and even with the env var set the egress is blocked by the air-gapped EgressGate until the operator opens Admin → System → Egress allowlist → opentelemetry. Both gates must be open; either one closes the pipe.
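The two-gate rule is a plain conjunction, which the sketch below spells out. It is illustrative TypeScript rather than the shipped backend code: the ExportGates shape and the otlpExportEnabled name are hypothetical, and only the env var name and the opentelemetry allowlist entry come from this page.

```typescript
// Hypothetical model of the two gates described above.
interface ExportGates {
  // Value of OTEL_EXPORTER_OTLP_ENDPOINT ("" when unset).
  otlpEndpoint: string;
  // Entries the operator enabled under Admin → System → Egress allowlist.
  egressAllowlist: string[];
}

// The pipe is open only when BOTH gates are open.
function otlpExportEnabled(gates: ExportGates): boolean {
  const endpointSet = gates.otlpEndpoint.trim() !== "";
  const egressOpen = gates.egressAllowlist.indexOf("opentelemetry") !== -1;
  return endpointSet && egressOpen;
}
```

Setting the env var alone, or opening the allowlist alone, leaves the function returning false, which matches the default-OFF posture.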

The four signals

| Signal | Source | Transport | Default state | Sample env |
| --- | --- | --- | --- | --- |
| Metrics | api/internal/metrics/* (Prom) | Prom /metrics text + OTLP bridge | Prom on, OTLP off | OTEL_EXPORTER_OTLP_METRICS_ENDPOINT |
| Logs | api/internal/applog (slog) | stdout JSON + OTLP bridge | stdout on, OTLP off | OTEL_EXPORTER_OTLP_ENDPOINT |
| Traces | otelhttp + pgx + S3 + adapters | OTLP/HTTP | OTLP off | OTEL_EXPORTER_OTLP_ENDPOINT, OTEL_TRACES_SAMPLER |
| RUM (browser) | frontend/src/lib/telemetry.ts | OTLP/HTTP/JSON | OTLP off | VITE_OTEL_EXPORTER_OTLP_ENDPOINT |

The first three converge on the same backend OTLP endpoint. RUM uses a separate URL because the browser cannot reach an in-cluster collector — operators usually front it with an nginx /otel-collector location that proxies into the cluster.

Metrics

The API exposes Prometheus metrics at /metrics. The chart ships a ServiceMonitor resource (the CRD itself is installed by the Prometheus Operator); a scrape interval of 30 s is sufficient for production.

When OTEL_EXPORTER_OTLP_ENDPOINT (or the metrics-specific override OTEL_EXPORTER_OTLP_METRICS_ENDPOINT) is set and egress is allowed, the same series are bridged onto the OTLP pipe by an internal prometheus → OTel MetricProducer running on a 30 s PeriodicReader. No call-site change is required and the Prom text endpoint stays live, so the two surfaces dual-write until a customer chooses to decommission the Prom scrape (see Migrating from Prometheus to OTel metrics).

For the full metric catalogue (every series declared in api/internal/metrics/, with labels and cardinality notes), the Grafana import recipe, three Alertmanager routing examples, sample PromQL for the five most common triage questions, and a section-per-alert runbook bound to the bundled runbook_url annotations, see Monitoring.

Key metrics

| Metric | Type | Notes |
| --- | --- | --- |
| orbitalreg_http_requests_total{handler,status} | counter | Per chi handler, per status class |
| orbitalreg_http_request_duration_seconds{handler} | histogram | p50 / p95 / p99 derivable |
| orbitalreg_scan_jobs_pending | gauge | Detection queue depth |
| orbitalreg_scan_jobs_duration_seconds{scanner} | histogram | Per-scanner latency |
| orbitalreg_artifact_uploads_total{format,outcome} | counter | Upload throughput per format |
| orbitalreg_artifact_downloads_total{format,outcome} | counter | Download throughput per format |
| orbitalreg_security_block_hits_total{block_id} | counter | Which blocks are firing |
| orbitalreg_retention_deletions_total{policy_id} | counter | What retention is pruning |
| orbitalreg_db_query_duration_seconds{query} | histogram | Top-N labelled queries |
| orbitalreg_backup_verification_last_success_timestamp | gauge | Used by the BackupStale alert |
| orbitalreg_webhook_delivery_duration_seconds | histogram | p99 for outbound webhook deliveries |
| orbitalreg_webhook_delivery_failures_total | counter | Subscriptions in trouble |
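The "p50 / p95 / p99 derivable" note relies on the usual way a quantile is estimated from cumulative histogram buckets. The sketch below shows that interpolation in TypeScript so the arithmetic is visible; it follows the general approach of PromQL's histogram_quantile() but is a simplified illustration, and the Bucket type and estimateQuantile name are hypothetical. The bucket boundaries OrbitalReg actually uses are not listed on this page.

```typescript
// One cumulative bucket as exposed in the Prometheus text format:
// `le` is the inclusive upper bound, `count` the cumulative observation count.
interface Bucket {
  le: number;
  count: number;
}

// Estimate the q-quantile (0 < q < 1) by linear interpolation inside the
// first bucket whose cumulative count reaches the target rank.
function estimateQuantile(q: number, buckets: Bucket[]): number {
  const sorted = buckets.slice().sort((a, b) => a.le - b.le);
  if (sorted.length === 0) return NaN;
  const total = sorted[sorted.length - 1].count;
  if (total === 0) return NaN;

  const rank = q * total;
  let prevLe = 0; // lower bound of the first bucket is assumed to be 0
  let prevCount = 0;
  for (const b of sorted) {
    if (b.count >= rank) {
      // The +Inf bucket has no finite upper bound, so fall back to the
      // highest finite boundary instead of interpolating.
      if (b.le === Infinity) return prevLe;
      const fraction = (rank - prevCount) / (b.count - prevCount);
      return prevLe + (b.le - prevLe) * fraction;
    }
    prevLe = b.le;
    prevCount = b.count;
  }
  return prevLe;
}
```

For example, with cumulative counts of 50, 90, 99, and 100 at boundaries 0.1, 0.5, 1, and +Inf, the p95 estimate falls in the 0.5 to 1 bucket and comes out at about 0.78 s. In a real query the bucket counts would first be turned into rates over a time window.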

Grafana dashboards

The chart ships three Grafana dashboards as ConfigMaps:

  • OrbitalReg / API overview — request rate, error rate, p99
  • OrbitalReg / Detection — scan-queue depth, finding count by severity
  • OrbitalReg / Storage — S3 byte counts, dedup ratio, retention deletions

Set monitoring.grafanaDashboards.enabled=true and label them so your sidecar picks them up.

Logs

The API logs JSON to stdout. Every line carries level, ts, msg, and a structured payload. The fields you'll most often filter on:

  • request_id — UUID issued at request entry, propagated through DB queries
  • user_id / service_account_id — the principal making the request
  • project_id / repo_id — for artifact-scoped operations
  • trace_id — when OTel is on, ties to the corresponding span

A typical Loki query:

text
{app="orbitalreg-api"} |= "scan failed" | json | scanner="trivy"
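When the logs have already been pulled to a workstation (for example from kubectl logs), the same correlation can be done offline. The sketch below filters JSON log lines by trace_id; the field names come from the list above, while the LogLine type and linesForTrace function are hypothetical helpers, not part of OrbitalReg.

```typescript
// Fields named in the list above; any other keys on a line are kept as-is.
interface LogLine {
  level: string;
  ts: string;
  msg: string;
  request_id?: string;
  trace_id?: string;
  [key: string]: unknown;
}

// Return every parsed log line that belongs to the given trace.
function linesForTrace(rawLines: string[], traceId: string): LogLine[] {
  const hits: LogLine[] = [];
  for (const raw of rawLines) {
    let parsed: unknown;
    try {
      parsed = JSON.parse(raw);
    } catch (_err) {
      continue; // skip anything that is not a JSON log line
    }
    if (
      typeof parsed === "object" &&
      parsed !== null &&
      (parsed as LogLine).trace_id === traceId
    ) {
      hits.push(parsed as LogLine);
    }
  }
  return hits;
}
```

Lines without a trace_id (which is every line while OTel is off) never match, so the helper is only useful once the export is enabled.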

When OTEL_EXPORTER_OTLP_ENDPOINT is set and egress is allowed, the same slog records are mirrored as OTLP LogRecords by the applog.OTelBridgeHandler (see Item 106 Phase A in the roadmap). The mirror keeps WithAttrs / WithGroup preambles intact, applies the canonical slog→OTel severity mapping, and is a no-op while the global provider is the OTel-spec Noop default. See Structured logs for the per-field schema, the SIEM recipes (Loki / Elasticsearch / Splunk), and the client-error forwarder.
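This page does not spell out the severity mapping. A reasonable assumption is that it matches the fixed offset used by the upstream otelslog bridge, where slog's Info (0) lines up with OTel's INFO (9). The sketch below illustrates that arithmetic in TypeScript; the function names are hypothetical, and the clamp to 1..24 is an added safety assumption rather than documented behaviour.

```typescript
// slog levels: Debug = -4, Info = 0, Warn = 4, Error = 8.
// OTel SeverityNumber: TRACE 1-4, DEBUG 5-8, INFO 9-12, WARN 13-16,
// ERROR 17-20, FATAL 21-24.
const SLOG_TO_OTEL_OFFSET = 9;

// Map a slog level to an OTel SeverityNumber (assumed fixed-offset mapping).
function otelSeverityNumber(slogLevel: number): number {
  const shifted = slogLevel + SLOG_TO_OTEL_OFFSET;
  return Math.min(24, Math.max(1, shifted)); // clamp is an assumption
}

// Short name of the OTel severity range a number falls into.
function otelSeverityText(severity: number): string {
  const names = ["TRACE", "DEBUG", "INFO", "WARN", "ERROR", "FATAL"];
  return names[Math.floor((severity - 1) / 4)];
}
```

Under this mapping the four standard slog levels land on 5, 9, 13, and 17, and an intermediate level such as Info+2 becomes 11, which is still inside the INFO range.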

Traces

otelhttp wraps every inbound request in a span named after the matched chi route pattern (so GET /api/v1/projects/{id}, not GET /api/v1/projects/abc-…). pgx and the S3 client are wired through the same provider, but their client-level spans are off by default; opt in with OTEL_INSTRUMENTATION_PGX=true / OTEL_INSTRUMENTATION_S3=true. Format-adapter spans (maven.upsert, docker.put_blob, npm.read_blob, …) ride on the shared formatutil.StartSpan helper and need no per-adapter env flag.

Enable by setting:

bash
OTEL_EXPORTER_OTLP_ENDPOINT=https://otel-collector.example.com:4318
OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf
OTEL_SERVICE_NAME=orbitalreg-api
OTEL_TRACES_SAMPLER=parentbased_traceidratio
OTEL_TRACES_SAMPLER_ARG=0.1

The sampler in front of the env-driven ratio adds two always-on overrides: /health/* probes and PUT/POST /api/v1/artifacts/* uploads are always recorded regardless of the ratio. See Trace sampling for the full decision table and tuning recipes.
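The decision order can be sketched as a pure function. This is an illustration in TypeScript, not the shipped sampler: all names are hypothetical, the overrides are assumed to be checked before the parent decision, and the ratio check uses only the low 32 bits of the trace-id as a simplification of what a production trace-id-ratio sampler does.

```typescript
interface SamplingInput {
  method: string; // HTTP method, upper case
  path: string; // request path
  traceId: string; // 32 lowercase hex characters
  parentSampled?: boolean; // undefined when the request has no parent span
}

function hasPrefix(value: string, prefix: string): boolean {
  return value.slice(0, prefix.length) === prefix;
}

// The two always-on overrides named above.
function alwaysRecorded(method: string, path: string): boolean {
  if (hasPrefix(path, "/health/")) return true;
  const isUpload = method === "PUT" || method === "POST";
  return isUpload && hasPrefix(path, "/api/v1/artifacts/");
}

// Deterministic ratio check: the same trace-id always gets the same answer.
function ratioDecision(traceId: string, ratio: number): boolean {
  const low32 = parseInt(traceId.slice(-8), 16);
  return low32 < ratio * 0x100000000;
}

function shouldSample(input: SamplingInput, ratio: number): boolean {
  if (alwaysRecorded(input.method, input.path)) return true;
  // parentbased_*: follow the caller's decision when there is one.
  if (input.parentSampled !== undefined) return input.parentSampled;
  return ratioDecision(input.traceId, ratio);
}
```

With OTEL_TRACES_SAMPLER_ARG=0.1, a root GET request is kept for roughly one trace-id in ten, while a probe on /health/ready is kept every time.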

trace_id correlation in the database

Migration 105_scan_findings_trace_id and its companion add a nullable trace_id TEXT column plus a partial index (WHERE trace_id IS NOT NULL) to both scan_findings and artifact_pulls. The write path captures the W3C trace-id synchronously from the request context, so an operator with the OTLP backend in one tab and psql in the other can jump bidirectionally:

sql
-- From a backend trace, find every Detection finding it produced
SELECT artifact_id, severity, cve_id
  FROM scan_findings
 WHERE trace_id = '0af7651916cd43dd8448eb211c80319c';

-- From a pull row, jump back to the originating trace
SELECT trace_id
  FROM artifact_pulls
 WHERE artifact_id = $1
   AND pulled_at > now() - interval '15 minutes';

The columns hold NULL when egress is off (the OTel SDK never started a span), so the partial index stays small for the typical air-gapped install.

Frontend RUM

frontend/src/lib/telemetry.ts is a hand-rolled W3C trace-context propagator + OTLP/HTTP/JSON exporter. It wraps window.fetch to stamp traceparent (and optional tracestate) on every same-origin outbound call so a single trace runs end-to-end from a button click through the chi handler, the pgx query, and the S3 PutObject.
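For readers unfamiliar with the header, the sketch below builds and parses a version-00 W3C traceparent value. It is not the contents of telemetry.ts: every name is hypothetical, and Math.random stands in for whatever random source the real module uses.

```typescript
// traceparent = version "-" trace-id "-" parent-id "-" trace-flags
const TRACEPARENT_RE = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;
const ALL_ZERO_RE = /^0+$/;

interface TraceContext {
  traceId: string; // 16 bytes, 32 hex characters
  spanId: string; // 8 bytes, 16 hex characters
  sampled: boolean; // lowest bit of trace-flags
}

function randomHex(byteCount: number): string {
  let out = "";
  for (let i = 0; i < byteCount; i++) {
    const byte = Math.floor(Math.random() * 256);
    out += (byte < 16 ? "0" : "") + byte.toString(16);
  }
  return out;
}

function newTraceContext(sampled: boolean): TraceContext {
  let traceId = randomHex(16);
  let spanId = randomHex(8);
  // All-zero ids are invalid in the W3C format, so draw again if they occur.
  while (ALL_ZERO_RE.test(traceId)) traceId = randomHex(16);
  while (ALL_ZERO_RE.test(spanId)) spanId = randomHex(8);
  return { traceId, spanId, sampled };
}

function formatTraceparent(ctx: TraceContext): string {
  return `00-${ctx.traceId}-${ctx.spanId}-${ctx.sampled ? "01" : "00"}`;
}

// Returns null for anything that is not a valid version-00 header.
function parseTraceparent(header: string): TraceContext | null {
  const m = TRACEPARENT_RE.exec(header.trim());
  if (!m) return null;
  if (ALL_ZERO_RE.test(m[1]) || ALL_ZERO_RE.test(m[2])) return null;
  return {
    traceId: m[1],
    spanId: m[2],
    sampled: (parseInt(m[3], 16) & 1) === 1,
  };
}
```

A fetch wrapper would call formatTraceparent for each same-origin request and set the result as the traceparent header; the 32-character trace-id in the middle is the value that ends up in the trace_id columns and log fields described earlier.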

The hand-roll exists because the four @opentelemetry/sdk-trace-web contrib packages bundle to ~120 KiB gz after tree-shaking, which is half the entire app-shell budget. We only need the two W3C interop standards (trace-context propagation + OTLP/HTTP/JSON export), so the module is small enough to maintain and the file header documents how to swap to the SDK later without touching call sites.

Spans land in a 64-slot capped queue and flush on three triggers: half-full, an opportunistic tick, and pagehide (with keepalive: true so the last batch survives a tab close).
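The queue behaviour can be sketched as a small class. This is an illustrative model only: the SpanQueue name, the drop-oldest policy when the cap is hit, and the convention that send returns false to keep the batch for a later attempt are all assumptions, not details taken from telemetry.ts.

```typescript
type FlushReason = "half-full" | "tick" | "pagehide";

class SpanQueue<T> {
  private items: T[] = [];
  private readonly capacity: number;
  private readonly send: (batch: T[], reason: FlushReason) => boolean;

  constructor(capacity: number, send: (batch: T[], reason: FlushReason) => boolean) {
    this.capacity = capacity;
    this.send = send;
  }

  push(span: T): void {
    // Capped queue: when full, make room by dropping the oldest span.
    if (this.items.length >= this.capacity) this.items.shift();
    this.items.push(span);
    if (this.items.length >= this.capacity / 2) this.flush("half-full");
  }

  // Hand the current batch to `send`; keep it queued if the hand-off fails.
  flush(reason: FlushReason): void {
    if (this.items.length === 0) return;
    if (this.send(this.items.slice(), reason)) this.items = [];
  }

  size(): number {
    return this.items.length;
  }
}
```

In a browser, the other two triggers would be wired from outside the class: a timer calling flush("tick"), and a pagehide listener calling flush("pagehide") with a send function that uses fetch with keepalive: true.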

Enable by setting at build time:

bash
VITE_OTEL_EXPORTER_OTLP_ENDPOINT=https://otel.example.com/v1/traces
VITE_OTEL_SERVICE_NAME=orbitalreg-portal

…and proxying the path through nginx so the browser never speaks to the in-cluster collector directly:

nginx
location /otel-collector/ {
    proxy_pass         http://otel-collector.svc.cluster.local:4318/;
    proxy_set_header   Host $host;
    proxy_read_timeout 10s;
    client_max_body_size 1m;
}

Because proxy_pass ends in a slash, nginx strips the /otel-collector/ prefix: a browser POST to /otel-collector/v1/traces reaches the collector as /v1/traces.

When VITE_OTEL_EXPORTER_OTLP_ENDPOINT is empty (the default), the module installs the fetch wrapper as a no-op — traceparent is not stamped, no spans are queued, and no network calls are made. This keeps an air-gapped portal posture-clean.

Setup recipes — four common backends

The same OTLP/HTTP endpoint receives traces, metrics, and logs. The recipes below cover the four backends the OrbitalReg field team has seen most often. Each block uses the shared env-var set; the metrics-specific override is shown where it matters for a customer who wants metrics on a separate endpoint (e.g., Prometheus remote-write vs. a managed APM).

Honeycomb

bash
# Traces + logs + metrics via the same Honeycomb OTLP endpoint
OTEL_EXPORTER_OTLP_ENDPOINT=https://api.honeycomb.io
OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf
OTEL_EXPORTER_OTLP_HEADERS="x-honeycomb-team=$HONEYCOMB_API_KEY,x-honeycomb-dataset=orbitalreg-api"
OTEL_SERVICE_NAME=orbitalreg-api
OTEL_TRACES_SAMPLER=parentbased_traceidratio
OTEL_TRACES_SAMPLER_ARG=0.1

Honeycomb auto-derives metrics datasets from the OTLP metric stream and shows logs alongside the trace timeline. Apart from the API-key header, the dataset header is the only Honeycomb-specific configuration.

Grafana Tempo + Loki + Mimir (or Grafana Cloud)

bash
OTEL_EXPORTER_OTLP_ENDPOINT=https://otlp-gateway-prod.grafana.net/otlp
OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf
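# printf avoids echo's portability quirks; tr strips the line wraps base64 adds to long input
OTEL_EXPORTER_OTLP_HEADERS="Authorization=Basic $(printf '%s' "$INSTANCE_ID:$API_KEY" | base64 | tr -d '\n')"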
OTEL_SERVICE_NAME=orbitalreg-api
OTEL_RESOURCE_ATTRIBUTES="service.namespace=registry,deployment.environment=prod"

Grafana's OTLP gateway fans the signals out to Tempo (traces), Loki (logs), and Mimir (metrics) automatically. For self-hosted clusters, replace the gateway URL with your own collector endpoint and drop the auth header.

Datadog

bash
# Datadog's OTLP/HTTP endpoint sits inside the Datadog Agent. Run the
# Agent as a DaemonSet with the OTLP/HTTP receiver enabled.
OTEL_EXPORTER_OTLP_ENDPOINT=http://$NODE_IP:4318
OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf
OTEL_SERVICE_NAME=orbitalreg-api
OTEL_RESOURCE_ATTRIBUTES="env=prod,version=$ORBITALREG_VERSION"

The Datadog Agent terminates the OTLP/HTTP stream and re-emits to the Datadog backend with the tenant's API key. The Agent's own config turns OTLP receivers on:

yaml
# datadog.yaml
otlp_config:
  receiver:
    protocols:
      http:
        endpoint: 0.0.0.0:4318
  traces:
    enabled: true
  logs:
    enabled: true
  metrics:
    enabled: true

GitLab Premium (Distributed Tracing + Observability)

bash
OTEL_EXPORTER_OTLP_ENDPOINT=https://observe.gitlab.com/v3/$PROJECT_ID/ingest/otlp
OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf
OTEL_EXPORTER_OTLP_HEADERS="private-token=$GITLAB_TOKEN"
OTEL_SERVICE_NAME=orbitalreg-api

GitLab's Observability product accepts OTLP/HTTP for all three signals; traces appear under Monitor → Tracing, logs under Monitor → Logs, and metrics under Monitor → Metrics.

Verify before promoting to prod

After flipping the egress allowlist, fire a known-shape request (curl -i https://registry/health/ready) and confirm the trace lands in the backend within 10 s. The /health/* paths are sampled at 100 % by the multi-tier sampler, so a missing trace points at the egress gate, not at sampling drop-out.

Health endpoints

| Endpoint | Purpose |
| --- | --- |
| /health/live | Liveness: passes if the process is responsive |
| /health/ready | Readiness: DB + Redis + S3 reachable, migrations applied |
| /health/backup | Backup health: last verification within window |

/health/live and /health/ready are unauthenticated; /health/backup requires an admin token.

Alerts

A PrometheusRule ships with the chart covering the canonical 10 incidents:

  • OrbitalRegAPIDown
  • OrbitalRegDBDown
  • OrbitalRegBackupStale
  • OrbitalRegS3MirrorFailing
  • OrbitalRegScanQueueBacklog
  • OrbitalRegHighErrorRate
  • OrbitalRegCertExpiring
  • OrbitalRegHighDiskUsage
  • OrbitalRegSAMLDown
  • OrbitalRegLicenseExpiring

Each alert links to a runbook in this docs site under /operations/runbooks/<alert>.

Released under the Apache-2.0 License.