Migrating from Prometheus-only metrics to OpenTelemetry metrics

OrbitalReg has shipped a Prometheus /metrics text endpoint since day one. Item 106 Phase B added a second, OTLP-native pipe for the same series. This page is the playbook for switching the customer's metrics backend from Prometheus to OTel — without dropping a single sample, and without committing to the OTel side until the new pipe is proven.

You probably do not need to migrate

Prometheus on /metrics is a first-class long-term surface, not a deprecated one. Migrate only if the customer's observability stack is already OTLP-native (Honeycomb, Tempo + Mimir, Datadog OTLP, GitLab Observability) and the per-team duplication of scrape configs is becoming a maintenance burden. A single-stack customer that uses Prometheus + Grafana has no functional reason to switch.

What changes — and what does not

Surface	Before	During dual-write	After
`GET /metrics` text endpoint	Source of truth	Live (Prom scrape continues)	Optional: scraping can continue or stop (the handler itself stays mounted)
OTLP `/v1/metrics` push	Off	Live (30 s `PeriodicReader`)	Source of truth
ServiceMonitor CRD	Required	Required	Optional
Grafana / Alertmanager rules	Bound to Prom queries	Both backends queryable	Rewritten against OTel-derived series
`OrbitalRegBackupStale` etc.	Fire from Prom recording rules	Fire from Prom	Fire from the customer's OTel-backend alerting layer

The series themselves are unchanged. Counter values, label sets, and cardinality are identical on both surfaces because the OTel pipe is a producer over the same prometheus.DefaultGatherer that backs /metrics.

The bridge — what is actually doing the work

When OTEL_EXPORTER_OTLP_ENDPOINT (or its metrics-specific override OTEL_EXPORTER_OTLP_METRICS_ENDPOINT) is set, telemetry.InitMetrics boots an OTel SDK MeterProvider and wires the upstream go.opentelemetry.io/contrib/bridges/prometheus MetricProducer onto a 30 s PeriodicReader. On each export cycle the producer gathers from the in-process Prom registry and re-emits every MetricFamily as OTel metric data, wrapped in a ScopeMetrics, over OTLP/HTTP.

No call site changes. No per-metric duplication. The Prom text endpoint stays live for as long as the operator wants it. The upstream contrib package treats the bridge as a stable interop layer (both sides are governed by the OTel project, so the wire shape is not going to drift).
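The two endpoint variables named above follow the OpenTelemetry exporter specification: the metrics-specific variable takes precedence and is used verbatim, while the generic variable is treated as a base URL to which `/v1/metrics` is appended for OTLP/HTTP. A minimal Python sketch of that rule, for illustration only. `resolve_metrics_url` is a hypothetical helper, not part of OrbitalReg or any SDK, and how `telemetry.InitMetrics` treats edge cases is an assumption this page does not confirm.

```python
from typing import Mapping, Optional


def resolve_metrics_url(env: Mapping[str, str]) -> Optional[str]:
    """Return the OTLP/HTTP metrics URL implied by env, or None if neither variable is set.

    Hypothetical helper that mirrors the OTel spec's precedence rule.
    """
    specific = env.get("OTEL_EXPORTER_OTLP_METRICS_ENDPOINT", "").strip()
    if specific:
        # Signal-specific endpoints are used as-is; no path is appended.
        return specific
    generic = env.get("OTEL_EXPORTER_OTLP_ENDPOINT", "").strip()
    if generic:
        # The generic endpoint is a base URL; the per-signal path is appended.
        return generic.rstrip("/") + "/v1/metrics"
    return None
```

Passing `os.environ` to the helper before a rollout is a quick way to see which URL the exporter is expected to push to.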

Pre-flight checklist

Before flipping any env var:

Inventory dashboards. List every Grafana dashboard that reads OrbitalReg series and tag each one as dashboard-prom or dashboard-otel once it has been rewritten. The Storage and Detection dashboards bundled with the chart are the load-bearing ones — start with those.
Inventory alert rules. Walk the bundled PrometheusRule plus any customer-side custom rules. Every rule needs a parallel expression in the destination backend's query language (PromQL stays in Mimir / Grafana Cloud; Honeycomb needs derived columns; Datadog uses its own monitor query DSL).

Pick the metrics endpoint. Use the metrics-specific override if metrics go to a different backend than traces + logs:

bash

# Traces + logs → Honeycomb, metrics → Grafana Mimir
OTEL_EXPORTER_OTLP_ENDPOINT=https://api.honeycomb.io
# A signal-specific endpoint is used as-is, so it carries the full /v1/metrics path
OTEL_EXPORTER_OTLP_METRICS_ENDPOINT=https://otlp-gateway-prod.grafana.net/otlp/v1/metrics
OTEL_EXPORTER_OTLP_METRICS_HEADERS=Authorization=Basic\ $MIMIR_AUTH

Open the egress allowlist. In Admin → System → Egress allowlist, tick opentelemetry. The bridge stays no-op until both the env var and the allowlist are open — either gate alone keeps the customer's air-gapped posture intact.
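The two-gate behaviour described in the last step can be modelled as a simple conjunction. The sketch below is illustrative only: `bridge_should_export` is a hypothetical name, and representing the egress allowlist as a collection of strings is an assumption about how the Admin setting might be exposed.

```python
from typing import Iterable, Mapping

# Either variable counts as "endpoint configured".
ENDPOINT_VARS = (
    "OTEL_EXPORTER_OTLP_METRICS_ENDPOINT",
    "OTEL_EXPORTER_OTLP_ENDPOINT",
)


def bridge_should_export(env: Mapping[str, str], allowlist: Iterable[str]) -> bool:
    """True only when an endpoint is configured AND egress to opentelemetry is allowed."""
    endpoint_set = any(env.get(name, "").strip() for name in ENDPOINT_VARS)
    egress_open = "opentelemetry" in set(allowlist)
    # Either gate alone leaves the bridge a no-op.
    return endpoint_set and egress_open
```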

Phase 1 — Dual-write window (recommended: 2 weeks)

Enable the OTel pipe while leaving the Prom scrape in place.

yaml

# values.yaml — Helm
extraEnv:
  - name: OTEL_EXPORTER_OTLP_ENDPOINT
    value: https://api.honeycomb.io
  - name: OTEL_EXPORTER_OTLP_PROTOCOL
    value: http/protobuf
  - name: OTEL_EXPORTER_OTLP_HEADERS
    valueFrom:
      secretKeyRef: { name: honeycomb-creds, key: headers }
  - name: OTEL_SERVICE_NAME
    value: orbitalreg-api

Verify the bridge is up:

bash

kubectl logs deploy/orbitalreg-api | grep "opentelemetry metrics bridge"
# → "opentelemetry metrics bridge enabled (endpoint=https://api.honeycomb.io interval=30s)"

Spot-check that a known counter (orbitalreg_artifact_uploads_total) has the same value on both surfaces:

bash

# Prom
curl -s http://orbitalreg-api:8080/metrics | grep -E '^orbitalreg_artifact_uploads_total\{'

# OTel backend (Honeycomb shown; replace with the destination's query API)
curl -s -H "x-honeycomb-team: $HONEYCOMB_API_KEY" \
  "https://api.honeycomb.io/1/queries/orbitalreg-api" \
  -d '{"breakdowns": [], "calculations": [{"op": "SUM", "column": "orbitalreg_artifact_uploads_total"}], "time_range": 60}'

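The two values should differ by no more than the increments recorded during one 30 s export interval, because the OTel side can lag the live /metrics value by up to one PeriodicReader cycle.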
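The Prom half of the spot-check can be scripted. The sketch below sums a counter across all label sets in a `/metrics` text response and checks the difference against a tolerance. Both function names are hypothetical, `max_rate_per_s` is an operator-supplied estimate of the counter's peak rate, and fetching the OTel-side total is left to the destination backend's own query API.

```python
def sum_counter(exposition: str, name: str) -> float:
    """Sum every sample of `name` in Prometheus text exposition output."""
    total = 0.0
    for raw in exposition.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if line.startswith(name + "{"):
            # Label values may contain spaces, so split after the closing brace.
            rest = line[line.rindex("}") + 1:]
        elif line.startswith(name + " "):
            rest = line[len(name):]
        else:
            continue  # a different metric (including longer names sharing the prefix)
        # The first field is the sample value; an optional timestamp may follow.
        total += float(rest.split()[0])
    return total


def within_one_interval(prom_total: float, otel_total: float,
                        max_rate_per_s: float, interval_s: float = 30.0) -> bool:
    """True if the totals differ by at most one export interval's worth of increments."""
    return abs(prom_total - otel_total) <= max_rate_per_s * interval_s
```

Feeding the output of the `curl` command above into `sum_counter` gives the Prom-side total to compare against the backend's figure.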

Rebuild dashboards + alerts

During the dual-write window, walk each dashboard-prom and produce its OTel counterpart side-by-side. Run them in parallel — same panels, two data sources — and watch for drift.
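When walking a dashboard-prom, it can help to list which OrbitalReg series each panel actually queries. The sketch below reads a Grafana dashboard JSON export and collects `orbitalreg_`-prefixed names from each target's `expr` field, which is where Grafana's Prometheus data source stores the query. The function names are hypothetical, and the `orbitalreg_` prefix is inferred from the counter named earlier on this page.

```python
import json
import re
from typing import Dict, Iterator, List, Set

# Matches metric names that start with the orbitalreg_ prefix.
SERIES_RE = re.compile(r"\borbitalreg_[A-Za-z0-9_]+")


def iter_panels(panels: List[dict]) -> Iterator[dict]:
    """Yield panels depth-first, including panels nested inside collapsed rows."""
    for panel in panels:
        yield panel
        yield from iter_panels(panel.get("panels", []))


def orbitalreg_series(dashboard_json: str) -> Dict[str, Set[str]]:
    """Map panel title -> set of OrbitalReg series referenced by its queries."""
    dashboard = json.loads(dashboard_json)
    found: Dict[str, Set[str]] = {}
    for panel in iter_panels(dashboard.get("panels", [])):
        for target in panel.get("targets", []):
            names = set(SERIES_RE.findall(target.get("expr") or ""))
            if names:
                found.setdefault(panel.get("title", "<untitled>"), set()).update(names)
    return found
```

Panels missing from the result do not read OrbitalReg series and can be skipped during the rewrite.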

For each alert rule:

Translate the PromQL expression into the destination backend's language.
Set the new rule to silent (notify a triage channel only) for one full alerting cycle (usually 24–48 h).
Compare firing patterns. The Prom rule and the OTel rule should alert on the same incidents within seconds.
Promote the OTel rule to paging once a clean cycle has passed.
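The comparison step above can be made mechanical once both alerting layers export their firing start times. The sketch below greedily pairs each Prom firing with the nearest unmatched OTel firing inside a tolerance window. `compare_firings` is a hypothetical helper, timestamps are assumed to be epoch seconds, and the 60 s default tolerance is an assumption to be tuned against the 30 s export interval plus each backend's rule evaluation interval.

```python
from typing import List, Sequence, Tuple


def compare_firings(
    prom: Sequence[float],
    otel: Sequence[float],
    tolerance_s: float = 60.0,
) -> Tuple[List[Tuple[float, float]], List[float], List[float]]:
    """Return (matched pairs, Prom-only firings, OTel-only firings)."""
    otel_left = sorted(otel)
    matched: List[Tuple[float, float]] = []
    prom_only: List[float] = []
    for t in sorted(prom):
        # Nearest OTel firing that has not been paired yet, if any remain.
        best = min(otel_left, key=lambda u: abs(u - t), default=None)
        if best is not None and abs(best - t) <= tolerance_s:
            matched.append((t, best))
            otel_left.remove(best)
        else:
            prom_only.append(t)
    # Whatever is left fired only on the OTel side.
    return matched, prom_only, otel_left
```

A clean cycle in this model is one where both the Prom-only and OTel-only lists come back empty.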

Phase 2 — Promote the OTel pipe (recommended: 1 week)

Once dashboards + alerts are validated:

Cut over the on-call routing. Point PagerDuty / Slack integrations at the OTel-backed alerts. The Prom rules stay active but route to an archive-only sink.
Disable the Prom recording rules on the customer-side Prometheus / Mimir / Thanos cluster. Keep the raw scrape — it is the only thing the Prom-only rollback path depends on.
Update runbook URLs. The bundled runbooks (/operations/runbooks/orbitalreg-api-down etc.) are query-language-agnostic and stay correct. Customer-authored runbooks that contain PromQL snippets need a parallel OTel-flavoured paragraph.

Phase 3 — Decommission Prometheus (optional)

This phase is the customer's call, not an OrbitalReg requirement. Reasons to keep /metrics on:

A second team scrapes the endpoint for capacity planning.
The customer wants a local-disk failover in case the OTel backend has an outage.
A compliance auditor wants the metrics endpoint behind the same network ACL as the rest of the API surface (Prom scrape is a single GET; OTel push is harder to model in some firewall stacks).

To decommission cleanly:

Remove the ServiceMonitor CRD instance that targets /metrics. The chart's monitoring.serviceMonitor.enabled value toggles this.
Leave the /metrics handler mounted. It costs ~50 KB of resident memory and is queried by /health/ready for self-check. Turning it off requires a code change which is out of scope for this migration.
Tear down the customer-side Prom stack (Prom server, Alertmanager, Pushgateway) only after the alerting cut-over has survived one paging cycle on the OTel side. Most customers keep Prom in cold-standby for two release cycles.

Rollback — what to do if the OTel side wobbles

The dual-write window is the safety net. If the OTel pipe drops samples, mis-aggregates, or hits a backend-side rate limit:

Re-promote the Prom rules to paging. They never stopped working — only the routing changed.
Unset OTEL_EXPORTER_OTLP_ENDPOINT (or set the metrics-specific override to empty). The bridge logs "opentelemetry metrics bridge disabled" and the SDK shuts down cleanly. No restart is required for a graceful drain, but a rolling restart is preferred because it releases the SDK's PeriodicReader goroutine.
Close the egress allowlist for opentelemetry if the failure was an upstream-backend incident, to make the rollback visible in the audit log.

Operator-driven checklist

text

[ ] Dashboards rewritten (Storage, Detection, API overview, customer-side)
[ ] Alert rules translated, running silent for ≥ 24 h
[ ] OTel backend acknowledged a known counter within 30 s of Prom
[ ] PagerDuty / Slack integrations cut over
[ ] Prom recording rules disabled on customer-side Prom cluster
[ ] Runbooks updated with OTel-flavoured queries
[ ] ServiceMonitor CRD removed (or kept on, customer choice)
[ ] One full paging cycle survived on OTel-side rules
