Migrating from Prometheus-only metrics to OpenTelemetry metrics
OrbitalReg has shipped a Prometheus /metrics text endpoint since day one. Item 106 Phase B added a second, OTLP-native pipe for the same series. This page is the playbook for switching the customer's metrics backend from Prometheus to OTel — without dropping a single sample, and without committing to the OTel side until the new pipe is proven.
You probably do not need to migrate
Prometheus on /metrics is a first-class long-term surface, not a deprecated one. Migrate only if the customer's observability stack is already OTLP-native (Honeycomb, Tempo + Mimir, Datadog OTLP, GitLab Observability) and the per-team duplication of scrape configs is becoming a maintenance burden. A single-stack customer that uses Prometheus + Grafana has no functional reason to switch.
What changes — and what does not
| Surface | Before | During dual-write | After |
|---|---|---|---|
| GET /metrics text endpoint | Source of truth | Live (Prom scrape continues) | Optional (can be left on or shut off) |
| OTLP /v1/metrics push | Off | Live (30 s PeriodicReader) | Source of truth |
| ServiceMonitor CRD | Required | Required | Optional |
| Grafana / Alertmanager rules | Bound to Prom queries | Both backends queryable | Rewritten against OTel-derived series |
| OrbitalRegBackupStale etc. | Fire from Prom recording rules | Fire from Prom | Fire from the customer's OTel-backend alerting layer |
The series themselves are unchanged. Counter values, label sets, and cardinality are identical on both surfaces because the OTel pipe is a producer over the same prometheus.DefaultGatherer that backs /metrics.
The bridge — what is actually doing the work
When OTEL_EXPORTER_OTLP_ENDPOINT (or its metrics-specific override OTEL_EXPORTER_OTLP_METRICS_ENDPOINT) is set, telemetry.InitMetrics boots an OTel SDK MeterProvider and wires the upstream go.opentelemetry.io/contrib/bridges/prometheus MetricProducer onto a 30 s PeriodicReader. The producer scrapes the in-process Prom registry and re-emits every MetricFamily as an OTel ScopeMetric over OTLP/HTTP.
No call site changes. No per-metric duplication. The Prom text endpoint stays live for as long as the operator wants it. The upstream contrib package treats the bridge as a stable interop layer (both sides are governed by the OTel project, so the wire shape is not going to drift).
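Which URL the pipe pushes to depends on which of the two variables is set. The following is a minimal shell sketch of the resolution rule in the OpenTelemetry OTLP exporter specification. It assumes telemetry.InitMetrics applies the upstream rule unchanged, which this page does not confirm, and resolve_metrics_endpoint is a hypothetical helper name.

```shell
# Sketch of the OTLP/HTTP endpoint rule from the OpenTelemetry exporter spec.
# Assumption: OrbitalReg follows the upstream rule as-is.
# resolve_metrics_endpoint is a hypothetical helper, not part of OrbitalReg.
resolve_metrics_endpoint() {
  if [ -n "${OTEL_EXPORTER_OTLP_METRICS_ENDPOINT:-}" ]; then
    # A signal-specific endpoint is used exactly as written.
    printf '%s\n' "$OTEL_EXPORTER_OTLP_METRICS_ENDPOINT"
  elif [ -n "${OTEL_EXPORTER_OTLP_ENDPOINT:-}" ]; then
    # The generic endpoint is a base URL; the signal path is appended.
    printf '%s/v1/metrics\n' "${OTEL_EXPORTER_OTLP_ENDPOINT%/}"
  else
    # Neither variable is set, so there is nothing to push to.
    return 1
  fi
}
```

With only OTEL_EXPORTER_OTLP_ENDPOINT=https://api.honeycomb.io set, the sketch prints https://api.honeycomb.io/v1/metrics. With the metrics-specific override set, it prints the override untouched.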
Pre-flight checklist
Before flipping any env var:
- Inventory dashboards. List every Grafana dashboard that reads OrbitalReg series and tag each one as dashboard-prom, or as dashboard-otel once it has been rewritten. The Storage and Detection dashboards bundled with the chart are the load-bearing ones, so start with those.
- Inventory alert rules. Walk the bundled PrometheusRule plus any customer-side custom rules. Every rule needs a parallel expression in the destination backend's query language (PromQL stays in Mimir / Grafana Cloud; Honeycomb needs derived columns; Datadog uses its own monitor query DSL).
- Pick the metrics endpoint. Use the metrics-specific override if metrics go to a different backend than traces + logs:

  ```bash
  # Traces + logs → Honeycomb, metrics → Grafana Mimir
  OTEL_EXPORTER_OTLP_ENDPOINT=https://api.honeycomb.io
  # Per the OTLP exporter spec, a signal-specific endpoint is used as-is (no
  # /v1/metrics appended); check the path the destination backend expects.
  OTEL_EXPORTER_OTLP_METRICS_ENDPOINT=https://otlp-gateway-prod.grafana.net/otlp
  OTEL_EXPORTER_OTLP_METRICS_HEADERS=Authorization=Basic\ $MIMIR_AUTH
  ```

- Open the egress allowlist. In Admin → System → Egress allowlist, tick opentelemetry. The bridge stays no-op until both the env var and the allowlist are open; closing either gate on its own keeps the customer's air-gapped posture intact.
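For the alert-rule inventory in the checklist above, it can help to dump every alert name next to its PromQL expression before translating anything. The following is a rough sketch, not an OrbitalReg tool: list_alert_exprs is a hypothetical helper that reads PrometheusRule YAML on stdin and assumes each rule lists alert: before a single-line expr:. Block-scalar expressions (expr: |) are not handled.

```shell
# list_alert_exprs: print "<alert name><TAB><PromQL expression>" for each
# alerting rule found in PrometheusRule YAML on stdin.
# Hypothetical helper; assumes "alert:" precedes a single-line "expr:".
list_alert_exprs() {
  awk '
    # Remember the alert name; the value is everything after the first colon.
    /^[ \t]*-?[ \t]*alert:/ { sub(/^[^:]*:[ \t]*/, ""); name = $0; next }
    # Pair the next expr with the remembered name, then reset it.
    /^[ \t]*-?[ \t]*expr:/ {
      sub(/^[^:]*:[ \t]*/, "")
      if (name != "") print name "\t" $0
      name = ""
    }
  '
}
```

One way to feed it is `kubectl get prometheusrule -o yaml | list_alert_exprs`, assuming the Prometheus Operator CRDs are installed in the cluster. Recording rules (record: entries) are skipped, so they need a separate pass.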
Phase 1 — Dual-write window (recommended: 2 weeks)
Enable the OTel pipe while leaving the Prom scrape in place.
```yaml
# values.yaml (Helm)
extraEnv:
  - name: OTEL_EXPORTER_OTLP_ENDPOINT
    value: https://api.honeycomb.io
  - name: OTEL_EXPORTER_OTLP_PROTOCOL
    value: http/protobuf
  - name: OTEL_EXPORTER_OTLP_HEADERS
    valueFrom:
      secretKeyRef: { name: honeycomb-creds, key: headers }
  - name: OTEL_SERVICE_NAME
    value: orbitalreg-api
```

Verify the bridge is up:
```bash
kubectl logs deploy/orbitalreg-api | grep "opentelemetry metrics bridge"
# → "opentelemetry metrics bridge enabled (endpoint=https://api.honeycomb.io interval=30s)"
```

Spot-check that a known counter (orbitalreg_artifact_uploads_total) has the same value on both surfaces:
```bash
# Prom
curl -s http://orbitalreg-api:8080/metrics | grep -E '^orbitalreg_artifact_uploads_total\{'
# OTel backend (Honeycomb shown; replace with the destination's query API)
curl -s -H "x-honeycomb-team: $HONEYCOMB_API_KEY" \
  "https://api.honeycomb.io/1/queries/orbitalreg-api" \
  -d '{"breakdowns": [], "calculations": [{"op": "SUM", "column": "orbitalreg_artifact_uploads_total"}], "time_range": 60}'
```

The two surfaces should agree to within one 30 s scrape interval.
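The Prom side of the spot-check returns one line per label set, so the values have to be summed before they can be compared with a single backend-side total. A small sketch of that arithmetic follows. The helper names sum_counter and within_delta are hypothetical, and the parser assumes the exposition lines carry no trailing timestamp, so the last field is the sample value.

```shell
# sum_counter NAME: sum every sample of counter NAME from Prometheus text
# exposition on stdin. Assumes no timestamps, so $NF is the sample value.
sum_counter() {
  awk -v name="$1" '
    # Match "name{...} value" or "name value"; "# HELP" / "# TYPE" lines and
    # longer metric names sharing the prefix do not start this way.
    index($0, name "{") == 1 || index($0, name " ") == 1 { total += $NF }
    END { printf "%.0f\n", total }
  '
}

# within_delta A B MAX: succeed when |A - B| <= MAX.
within_delta() {
  awk -v a="$1" -v b="$2" -v max="$3" \
    'BEGIN { d = a - b; if (d < 0) d = -d; exit !(d <= max) }'
}
```

For example, `curl -s http://orbitalreg-api:8080/metrics | sum_counter orbitalreg_artifact_uploads_total` yields the Prom-side total. A reasonable MAX for within_delta is the number of uploads expected in one 30 s interval, which is for the operator to estimate.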
Rebuild dashboards + alerts
During the dual-write window, walk each dashboard-prom and produce its OTel counterpart side-by-side. Run them in parallel — same panels, two data sources — and watch for drift.
For each alert rule:
- Translate the PromQL expression into the destination backend's language.
- Set the new rule to silent (notify a triage channel only) for one full alerting cycle (usually 24–48 h).
- Compare firing patterns. The Prom rule and the OTel rule should alert on the same incidents within seconds.
- Promote the OTel rule to paging once a clean cycle has passed.
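The firing-pattern comparison in the steps above can be approximated offline once both backends' alert histories are exported. The following is a sketch only. It assumes two hypothetical text files with one firing time in epoch seconds per line; the export step depends on the backend and is not shown.

```shell
# unmatched_firings PROM_FILE OTEL_FILE MAX_SKEW
# Print every Prom firing time that has no OTel firing within MAX_SKEW
# seconds. Both input files are hypothetical exports: one epoch second per line.
unmatched_firings() {
  awk -v skew="$3" '
    # First file on the command line: collect the OTel firing times.
    FILENAME == ARGV[1] { otel[++n] = $1; next }
    # Second file: look for an OTel firing close enough to each Prom firing.
    {
      hit = 0
      for (i = 1; i <= n; i++) {
        d = $1 - otel[i]; if (d < 0) d = -d
        if (d <= skew) { hit = 1; break }
      }
      if (!hit) print $1
    }
  ' "$2" "$1"
}
```

Empty output means every Prom firing had an OTel counterpart inside the window. Swapping the two file arguments checks the other direction, that is, OTel firings that Prom never raised.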
Phase 2 — Promote the OTel pipe (recommended: 1 week)
Once dashboards + alerts are validated:
- Cut over the on-call routing. Point PagerDuty / Slack integrations at the OTel-backed alerts. The Prom rules stay active but route to an archive-only sink.
- Disable the Prom recording rules on the customer-side Prometheus / Mimir / Thanos cluster. Keep the raw scrape — it is the only thing the Prom-only rollback path depends on.
- Update runbook URLs. The bundled runbooks (/operations/runbooks/orbitalreg-api-down etc.) are query-language-agnostic and stay correct. Customer-authored runbooks that contain PromQL snippets need a parallel OTel-flavoured paragraph.
Phase 3 — Decommission Prometheus (optional)
This phase is the customer's call, not an OrbitalReg requirement. Reasons to keep /metrics on:
- A second team scrapes the endpoint for capacity planning.
- The customer wants a local-disk failover in case the OTel backend has an outage.
- A compliance auditor wants the metrics endpoint behind the same network ACL as the rest of the API surface (Prom scrape is a single GET; OTel push is harder to model in some firewall stacks).
To decommission cleanly:
- Remove the ServiceMonitor CRD instance that targets /metrics. The chart's monitoring.serviceMonitor.enabled value toggles this.
- Leave the /metrics handler mounted. It costs ~50 KB of resident memory and is queried by /health/ready for self-check. Turning it off requires a code change, which is out of scope for this migration.
- Tear down the customer-side Prom stack (Prom server, Alertmanager, Pushgateway) only after the alerting cut-over has survived one paging cycle on the OTel side. Most customers keep Prom in cold-standby for two release cycles.
Rollback — what to do if the OTel side wobbles
The dual-write window is the safety net. If the OTel pipe drops samples, mis-aggregates, or hits a backend-side rate limit:
- Re-promote the Prom rules to paging. They never stopped working — only the routing changed.
- Unset OTEL_EXPORTER_OTLP_ENDPOINT (or set the metrics-specific override to empty). The bridge logs opentelemetry metrics bridge disabled and the SDK shuts down cleanly. No restart is required for a graceful drain, but a rolling restart is preferred because it releases the SDK's PeriodicReader goroutine.
- Close the egress allowlist for opentelemetry if the failure was an upstream-backend incident, to make the rollback visible in the audit log.
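The environment-variable step of the rollback can be scripted. The sketch below is a dry run by default and only prints the commands. It assumes the Deployment is deploy/orbitalreg-api, as in the log example earlier on this page, and that the variable was set directly on the Deployment. The helper names run and rollback_otel_bridge are hypothetical.

```shell
# Rollback sketch. With DRY_RUN=1 (the default here) commands are only printed.
# Assumption: the variable lives on the Deployment itself. If it came from
# Helm values, remove it from extraEnv instead so the next upgrade does not
# put it back. If the metrics-specific override was used, remove that too.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then printf '%s\n' "$*"; else "$@"; fi
}

rollback_otel_bridge() {
  # A trailing "-" tells kubectl to remove the variable; changing the pod
  # template also triggers a rolling restart of the Deployment.
  run kubectl set env deploy/orbitalreg-api OTEL_EXPORTER_OTLP_ENDPOINT- &&
  # Block until the restarted pods are ready.
  run kubectl rollout status deploy/orbitalreg-api
}
```

Running `DRY_RUN=0 rollback_otel_bridge` executes the commands for real. Afterwards, grepping the pod logs for opentelemetry metrics bridge disabled confirms the bridge is off. The allowlist change stays a manual step in the Admin UI.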
Operator-driven checklist
[ ] Dashboards rewritten (Storage, Detection, API overview, customer-side)
[ ] Alert rules translated, running silent for ≥ 24 h
[ ] OTel backend acknowledged a known counter within 30 s of Prom
[ ] PagerDuty / Slack integrations cut over
[ ] Prom recording rules disabled on customer-side Prom cluster
[ ] Runbooks updated with OTel-flavoured queries
[ ] ServiceMonitor CRD removed (or kept on, customer choice)
[ ] One full paging cycle survived on OTel-side rules
See also
- Observability — signals, defaults, and egress gating
- Monitoring (metrics + alerts) — full metric catalogue and alert reference
- Air-gapped operations — egress allowlist mechanics