Migrating from Prometheus-only metrics to OpenTelemetry metrics

OrbitalReg has shipped a Prometheus /metrics text endpoint since day one. Item 106 Phase B added a second, OTLP-native pipe for the same series. This page is the playbook for switching the customer's metrics backend from Prometheus to OTel — without dropping a single sample, and without committing to the OTel side until the new pipe is proven.

You probably do not need to migrate

Prometheus on /metrics is a first-class long-term surface, not a deprecated one. Migrate only if the customer's observability stack is already OTLP-native (Honeycomb, Tempo + Mimir, Datadog OTLP, GitLab Observability) and the per-team duplication of scrape configs is becoming a maintenance burden. A single-stack customer that uses Prometheus + Grafana has no functional reason to switch.

What changes — and what does not

| Surface | Before | During dual-write | After |
| --- | --- | --- | --- |
| GET /metrics text endpoint | Source of truth | Live (Prom scrape continues) | Optional (can be left on or shut off) |
| OTLP /v1/metrics push | Off | Live (30 s PeriodicReader) | Source of truth |
| ServiceMonitor CRD | Required | Required | Optional |
| Grafana / Alertmanager rules | Bound to Prom queries | Both backends queryable | Rewritten against OTel-derived series |
| OrbitalRegBackupStale etc. | Fire from Prom recording rules | Fire from Prom | Fire from the customer's OTel-backend alerting layer |

The series themselves are unchanged. Counter values, label sets, and cardinality are identical on both surfaces because the OTel pipe is a producer over the same prometheus.DefaultGatherer that backs /metrics.

The bridge — what is actually doing the work

When OTEL_EXPORTER_OTLP_ENDPOINT (or its metrics-specific override OTEL_EXPORTER_OTLP_METRICS_ENDPOINT) is set, telemetry.InitMetrics boots an OTel SDK MeterProvider and wires the upstream go.opentelemetry.io/contrib/bridges/prometheus MetricProducer onto a 30 s PeriodicReader. The producer scrapes the in-process Prom registry and re-emits every MetricFamily as an OTel ScopeMetric over OTLP/HTTP.

No call site changes. No per-metric duplication. The Prom text endpoint stays live for as long as the operator wants it. The upstream contrib package treats the bridge as a stable interop layer (the bridge and the SDK are both maintained under the OpenTelemetry project, so the wire shape is not expected to drift).

Pre-flight checklist

Before flipping any env var:

  1. Inventory dashboards. List every Grafana dashboard that reads OrbitalReg series and tag each one dashboard-prom. Switch the tag to dashboard-otel once the dashboard has been rewritten. The Storage and Detection dashboards bundled with the chart are the load-bearing ones, so start with those.

  2. Inventory alert rules. Walk the bundled PrometheusRule plus any customer-side custom rules. Every rule needs a parallel expression in the destination backend's query language (PromQL stays in Mimir / Grafana Cloud; Honeycomb needs derived columns; Datadog uses its own monitor query DSL).

  3. Pick the metrics endpoint. Use the metrics-specific override if metrics go to a different backend than traces + logs:

    bash
    # Traces + logs → Honeycomb, metrics → Grafana Mimir
    OTEL_EXPORTER_OTLP_ENDPOINT=https://api.honeycomb.io
    # A signal-specific endpoint is used as-is, so it must carry the full /v1/metrics path
    OTEL_EXPORTER_OTLP_METRICS_ENDPOINT=https://otlp-gateway-prod.grafana.net/otlp/v1/metrics
    OTEL_EXPORTER_OTLP_METRICS_HEADERS="Authorization=Basic $MIMIR_AUTH"

  4. Open the egress allowlist. In Admin → System → Egress allowlist, tick opentelemetry. The bridge stays a no-op until the env var is set and the allowlist entry is ticked. Leaving either gate closed keeps the customer's air-gapped posture intact.
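
The endpoint precedence in step 3 can be checked before any pod is restarted. The sketch below assumes telemetry.InitMetrics defers to the standard OTLP/HTTP exporter rules, in which the generic variable is a base URL that gets v1/metrics appended while the metrics-specific override is used exactly as written. `resolve_metrics_endpoint` is a hypothetical helper for illustration, not something OrbitalReg ships.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the URL an OTLP/HTTP metrics exporter is
# expected to post to, given the two endpoint variables from step 3.
resolve_metrics_endpoint() {
  local generic="${OTEL_EXPORTER_OTLP_ENDPOINT:-}"
  local specific="${OTEL_EXPORTER_OTLP_METRICS_ENDPOINT:-}"

  if [ -n "$specific" ]; then
    # Signal-specific endpoint wins and is used as-is (no path appended).
    printf '%s\n' "$specific"
  elif [ -n "$generic" ]; then
    # Generic endpoint is a base URL; the metrics path is appended.
    printf '%s/v1/metrics\n' "${generic%/}"
  else
    # Neither variable set: the bridge would stay off.
    return 1
  fi
}
```

Sourcing this next to the planned env values and calling `resolve_metrics_endpoint` shows whether metrics would land on the same host as traces and logs or on the override.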

Phase 1 is the dual-write window: enable the OTel pipe while leaving the Prom scrape in place.

yaml
# values.yaml — Helm
extraEnv:
  - name: OTEL_EXPORTER_OTLP_ENDPOINT
    value: https://api.honeycomb.io
  - name: OTEL_EXPORTER_OTLP_PROTOCOL
    value: http/protobuf
  - name: OTEL_EXPORTER_OTLP_HEADERS
    valueFrom:
      secretKeyRef: { name: honeycomb-creds, key: headers }
  - name: OTEL_SERVICE_NAME
    value: orbitalreg-api

Verify the bridge is up:

bash
kubectl logs deploy/orbitalreg-api | grep "opentelemetry metrics bridge"
# → "opentelemetry metrics bridge enabled (endpoint=https://api.honeycomb.io interval=30s)"
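
A plain grep shows that the bridge logged something, but not which state came last. The sketch below reads a log stream and reports the most recent bridge state. It assumes the enabled and disabled messages appear exactly as quoted on this page; `bridge_state` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Hypothetical helper: read logs on stdin and report the bridge's latest state.
# Prints "enabled", "disabled", or "unknown"; exits 0 only for "enabled".
bridge_state() {
  awk '
    /opentelemetry metrics bridge enabled/  { state = "enabled" }
    /opentelemetry metrics bridge disabled/ { state = "disabled" }
    END {
      print (state == "" ? "unknown" : state)
      exit (state == "enabled" ? 0 : 1)
    }
  '
}

# Usage (needs cluster access, so it is left as a comment):
#   kubectl logs deploy/orbitalreg-api | bridge_state
```

Because the exit code is non-zero for anything other than enabled, the helper can gate a later step in a rollout script.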

Spot-check that a known counter (orbitalreg_artifact_uploads_total) has the same value on both surfaces:

bash
# Prom
curl -s http://orbitalreg-api:8080/metrics | grep -E '^orbitalreg_artifact_uploads_total\{'

# OTel backend (Honeycomb shown; replace with the destination's query API)
curl -s -H "x-honeycomb-team: $HONEYCOMB_API_KEY" \
  "https://api.honeycomb.io/1/queries/orbitalreg-api" \
  -d '{"breakdowns": [], "calculations": [{"op": "SUM", "column": "orbitalreg_artifact_uploads_total"}], "time_range": 60}'

The two surfaces should agree to within one 30 s export interval, because the PeriodicReader pushes a fresh snapshot every 30 s.
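
The Prom side of the spot-check can be reduced to a single number that is easier to compare with the backend's total. The sketch below sums every series of one counter from the text exposition and checks two readings against an allowed drift. Both function names are hypothetical. The parser is deliberately simple and assumes no label value contains a closing brace followed by a space. The acceptable drift depends on the upload rate, so it is passed in rather than guessed.

```bash
#!/usr/bin/env bash
# Sum every series of one counter from Prometheus text exposition on stdin.
# Comment lines and metrics that merely share the prefix are ignored.
prom_counter_sum() {
  local name="$1"
  sed -n -E "s/^${name}(\{.*\})? ([^ ]+).*$/\2/p" |
    awk '{ sum += $1 } END { printf "%.0f\n", sum }'
}

# Succeed when two readings differ by at most the allowed drift.
within_drift() {
  awk -v a="$1" -v b="$2" -v max="$3" 'BEGIN {
    d = a - b
    if (d < 0) d = -d
    exit (d <= max ? 0 : 1)
  }'
}

# Usage (left as comments because it needs a live endpoint):
#   prom=$(curl -s http://orbitalreg-api:8080/metrics |
#            prom_counter_sum orbitalreg_artifact_uploads_total)
#   within_drift "$prom" "$otel_total" "$max_drift" && echo "surfaces agree"
```

Here `otel_total` and `max_drift` stand for values the operator supplies: the total read from the destination backend and the number of uploads expected within one export interval.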

Rebuild dashboards + alerts

During the dual-write window, walk each dashboard-prom and produce its OTel counterpart side-by-side. Run them in parallel — same panels, two data sources — and watch for drift.

For each alert rule:

  1. Translate the PromQL expression into the destination backend's language.
  2. Set the new rule to silent (notify a triage channel only) for one full alerting cycle (usually 24–48 h).
  3. Compare firing patterns. The Prom rule and the OTel rule should alert on the same incidents within seconds.
  4. Promote the OTel rule to paging once a clean cycle has passed.

Phase 2 is the cut-over. Once dashboards + alerts are validated:

  1. Cut over the on-call routing. Point PagerDuty / Slack integrations at the OTel-backed alerts. The Prom rules stay active but route to an archive-only sink.
  2. Disable the Prom recording rules on the customer-side Prometheus / Mimir / Thanos cluster. Keep the raw scrape — it is the only thing the Prom-only rollback path depends on.
  3. Update runbook URLs. The bundled runbooks (/operations/runbooks/orbitalreg-api-down etc.) are query-language-agnostic and stay correct. Customer-authored runbooks that contain PromQL snippets need a parallel OTel-flavoured paragraph.

Phase 3 — Decommission Prometheus (optional)

This phase is the customer's call, not an OrbitalReg requirement. Reasons to keep /metrics on:

  • A second team scrapes the endpoint for capacity planning.
  • The customer wants a local-disk failover in case the OTel backend has an outage.
  • A compliance auditor wants the metrics endpoint behind the same network ACL as the rest of the API surface (Prom scrape is a single GET; OTel push is harder to model in some firewall stacks).

To decommission cleanly:

  1. Remove the ServiceMonitor CRD instance that targets /metrics. The chart's monitoring.serviceMonitor.enabled value toggles this.
  2. Leave the /metrics handler mounted. It costs ~50 KB of resident memory and is queried by /health/ready for self-check. Turning it off requires a code change, which is out of scope for this migration.
  3. Tear down the customer-side Prom stack (Prom server, Alertmanager, Pushgateway) only after the alerting cut-over has survived one paging cycle on the OTel side. Most customers keep Prom in cold-standby for two release cycles.

Rollback — what to do if the OTel side wobbles

The dual-write window is the safety net. If the OTel pipe drops samples, mis-aggregates, or hits a backend-side rate limit:

  1. Re-promote the Prom rules to paging. They never stopped working — only the routing changed.
  2. Unset OTEL_EXPORTER_OTLP_ENDPOINT (or set the metrics-specific override to empty). The bridge logs opentelemetry metrics bridge disabled and the SDK shuts down cleanly. No restart required for graceful drain; a rolling restart is preferred to release the SDK's PeriodicReader goroutine.
  3. Close the egress allowlist for opentelemetry if the failure was an upstream-backend incident, to make the rollback visible in the audit log.
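
Step 2 can be scripted so that it is rehearsed before it is needed. The sketch below is a hypothetical helper that assumes the variables were set directly on the deploy/orbitalreg-api Deployment. If they come from Helm values, remove them there instead, or the next upgrade will put them back. It defaults to a dry run that only prints the kubectl commands.

```bash
#!/usr/bin/env bash
# Hypothetical rollback helper. DRY_RUN defaults to 1, which only prints commands.
DEPLOY="${DEPLOY:-deploy/orbitalreg-api}"

run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '+ %s\n' "$*"
  else
    "$@"
  fi
}

rollback_otel_bridge() {
  # A trailing "-" tells kubectl set env to remove the variable.
  # Changing the pod template this way triggers a rolling restart.
  run kubectl set env "$DEPLOY" \
    OTEL_EXPORTER_OTLP_ENDPOINT- \
    OTEL_EXPORTER_OTLP_METRICS_ENDPOINT-
  # Wait for the new pods, which start without the bridge.
  run kubectl rollout status "$DEPLOY"
}
```

Calling `rollback_otel_bridge` prints the plan; `DRY_RUN=0 rollback_otel_bridge` executes it. Afterwards the logs should contain the opentelemetry metrics bridge disabled message mentioned in step 2.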

Operator-driven checklist

text
[ ] Dashboards rewritten (Storage, Detection, API overview, customer-side)
[ ] Alert rules translated, running silent for ≥ 24 h
[ ] OTel backend acknowledged a known counter within 30 s of Prom
[ ] PagerDuty / Slack integrations cut over
[ ] Prom recording rules disabled on customer-side Prom cluster
[ ] Runbooks updated with OTel-flavoured queries
[ ] ServiceMonitor resource removed (or kept on, customer choice)
[ ] One full paging cycle survived on OTel-side rules

Released under the Apache-2.0 License.