Observability
OrbitalReg emits all three OpenTelemetry signals — metrics, logs, and traces — out of one OTLP/HTTP endpoint, plus a fourth browser-side stream (Real-User-Monitoring) that joins to the backend traces by W3C trace-context propagation. The Prometheus /metrics text endpoint remains in place; the OTel pipe is a parallel bridge, not a replacement, so a customer running Prometheus today does not lose anything by switching the OTel-export flag on.
Air-gapped default
Every OTel surface (logs, metrics, traces, RUM) is default-OFF. The bridge stays no-op as long as OTEL_EXPORTER_OTLP_ENDPOINT is empty, and even with the env var set the egress is blocked by the air-gapped EgressGate until the operator opens Admin → System → Egress allowlist → opentelemetry. Both gates must be open; either one closes the pipe.
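The two-gate rule can be pictured as a single boolean AND. The sketch below is illustrative only: `exportEnabled` and the `egressAllowed` flag are hypothetical names, and how the real bridge consults the EgressGate is not shown in this page.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// exportEnabled reports whether telemetry may leave the process.
// Gate 1: the endpoint env var is non-empty.
// Gate 2: the operator has allowlisted "opentelemetry" egress.
func exportEnabled(endpoint string, egressAllowed bool) bool {
	return strings.TrimSpace(endpoint) != "" && egressAllowed
}

func main() {
	endpoint := os.Getenv("OTEL_EXPORTER_OTLP_ENDPOINT")
	// Hypothetical: a real build would read this from the egress allowlist.
	egressAllowed := false
	fmt.Println("export enabled:", exportEnabled(endpoint, egressAllowed))
}
```

With either input closed the function returns false, which matches the default air-gapped posture.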
The four signals
| Signal | Source | Transport | Default state | Sample env |
|---|---|---|---|---|
| Metrics | api/internal/metrics/* (Prom) | Prom /metrics text + OTLP bridge | Prom on, OTLP off | OTEL_EXPORTER_OTLP_METRICS_ENDPOINT |
| Logs | api/internal/applog (slog) | stdout JSON + OTLP bridge | stdout on, OTLP off | OTEL_EXPORTER_OTLP_ENDPOINT |
| Traces | otelhttp + pgx + S3 + adapters | OTLP/HTTP | OTLP off | OTEL_EXPORTER_OTLP_ENDPOINT, OTEL_TRACES_SAMPLER |
| RUM (browser) | frontend/src/lib/telemetry.ts | OTLP/HTTP/JSON | OTLP off | VITE_OTEL_EXPORTER_OTLP_ENDPOINT |
The first three converge on the same backend OTLP endpoint. RUM uses a separate URL because the browser cannot reach an in-cluster collector — operators usually front it with an nginx /otel-collector location that proxies into the cluster.
Metrics
The API exposes Prometheus metrics at /metrics. The chart ships a ServiceMonitor CRD; a scrape interval of 30 s is sufficient for production.
When OTEL_EXPORTER_OTLP_ENDPOINT (or the metrics-specific override OTEL_EXPORTER_OTLP_METRICS_ENDPOINT) is set and egress is allowed, the same series are bridged onto the OTLP pipe by an internal prometheus → OTel MetricProducer running on a 30 s PeriodicReader. No call-site change is required and the Prom text endpoint stays live, so the two surfaces dual-write until a customer chooses to decommission the Prom scrape (see Migrating from Prometheus to OTel metrics).
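A minimal sketch of the endpoint precedence described above, assuming the bridge follows the usual OTLP exporter convention: a signal-specific endpoint is used as-is, while the generic endpoint gets `/v1/metrics` appended. The function name `metricsEndpoint` is hypothetical, and the exact resolution inside OrbitalReg may differ.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// metricsEndpoint resolves where OTLP metrics would be sent.
// The second return value is false when no endpoint is configured,
// in which case the bridge would stay a no-op.
func metricsEndpoint(getenv func(string) string) (string, bool) {
	// The metrics-specific override wins and is taken verbatim.
	if v := getenv("OTEL_EXPORTER_OTLP_METRICS_ENDPOINT"); v != "" {
		return v, true
	}
	// Otherwise derive the per-signal URL from the shared base endpoint.
	if v := getenv("OTEL_EXPORTER_OTLP_ENDPOINT"); v != "" {
		return strings.TrimRight(v, "/") + "/v1/metrics", true
	}
	return "", false
}

func main() {
	url, ok := metricsEndpoint(os.Getenv)
	fmt.Println(url, ok)
}
```

Passing `getenv` as a parameter keeps the resolution testable without touching the process environment.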
For the full metric catalogue (every series declared in api/internal/metrics/, with labels and cardinality notes), the Grafana import recipe, three Alertmanager routing examples, sample PromQL for the five most common triage questions, and a section-per-alert runbook bound to the bundled runbook_url annotations, see Monitoring.
Key metrics
| Metric | Type | Notes |
|---|---|---|
| orbitalreg_http_requests_total{handler,status} | counter | Per chi handler, per status class |
| orbitalreg_http_request_duration_seconds{handler} | histogram | p50 / p95 / p99 derivable |
| orbitalreg_scan_jobs_pending | gauge | Detection queue depth |
| orbitalreg_scan_jobs_duration_seconds{scanner} | histogram | Per-scanner latency |
| orbitalreg_artifact_uploads_total{format,outcome} | counter | Upload throughput per format |
| orbitalreg_artifact_downloads_total{format,outcome} | counter | Download throughput per format |
| orbitalreg_security_block_hits_total{block_id} | counter | Which blocks are firing |
| orbitalreg_retention_deletions_total{policy_id} | counter | What retention is pruning |
| orbitalreg_db_query_duration_seconds{query} | histogram | Top-N labelled queries |
| orbitalreg_backup_verification_last_success_timestamp | gauge | Used by the BackupStale alert |
| orbitalreg_webhook_delivery_duration_seconds | histogram | p99 for outbound webhook deliveries |
| orbitalreg_webhook_delivery_failures_total | counter | Subscriptions in trouble |
Grafana dashboards
The chart ships three Grafana dashboards as ConfigMaps:
- OrbitalReg / API overview — request rate, error rate, p99
- OrbitalReg / Detection — scan-queue depth, finding count by severity
- OrbitalReg / Storage — S3 byte counts, dedup ratio, retention deletions
Set monitoring.grafanaDashboards.enabled=true and label them so your sidecar picks them up.
Logs
The API logs JSON to stdout. Every line carries level, ts, msg, and a structured payload. The fields you'll most often filter on:
- request_id: UUID issued at request entry, propagated through DB queries
- user_id / service_account_id: the principal making the request
- project_id / repo_id: for artifact-scoped operations
- trace_id: when OTel is on, ties to the corresponding span
A typical Loki query:
{app="orbitalreg-api"} |= "scan failed" | json | scanner="trivy"
When OTEL_EXPORTER_OTLP_ENDPOINT is set and egress is allowed, the same slog records are mirrored as OTLP LogRecords by the applog.OTelBridgeHandler (see Item 106 Phase A in the roadmap). The mirror keeps WithAttrs / WithGroup preambles intact, applies the canonical slog→OTel severity mapping, and is a no-op while the global provider is the OTel-spec Noop default. See Structured logs for the per-field schema, the SIEM recipes (Loki / Elasticsearch / Splunk), and the client-error forwarder.
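As an illustration of the severity mapping mentioned above, the sketch below uses the fixed-offset approach commonly used for slog-to-OTel bridges: slog's Debug/Info/Warn/Error levels (-4, 0, 4, 8) line up with OTel severity numbers 5, 9, 13, 17. The function name and the clamping to the 1..24 range are assumptions, not taken from applog.OTelBridgeHandler.

```go
package main

import (
	"fmt"
	"log/slog"
)

// severityNumber converts a slog level to an OTel log SeverityNumber.
// slog.LevelDebug is -4 and OTel DEBUG is 5, hence the offset of 9.
func severityNumber(l slog.Level) int {
	n := int(l) + 9
	// Assumption: out-of-range custom levels are clamped to 1..24.
	if n < 1 {
		n = 1
	}
	if n > 24 {
		n = 24
	}
	return n
}

func main() {
	for _, l := range []slog.Level{slog.LevelDebug, slog.LevelInfo, slog.LevelWarn, slog.LevelError} {
		fmt.Printf("%-5s -> %d\n", l, severityNumber(l))
	}
}
```

Because slog levels are plain integers, intermediate levels such as `slog.LevelInfo + 2` map to the intermediate OTel severities without a lookup table.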
Traces
otelhttp wraps every inbound request in a span named after the matched chi route pattern (so GET /api/v1/projects/{id}, not GET /api/v1/projects/abc-…). Pgx and the S3 client are wired through the same provider — adapter-level spans are off by default and opt-in via OTEL_INSTRUMENTATION_PGX=true / OTEL_INSTRUMENTATION_S3=true. Format-adapter spans (maven.upsert, docker.put_blob, npm.read_blob, …) ride on the shared formatutil.StartSpan helper and need no per-adapter env flag.
Enable by setting:
OTEL_EXPORTER_OTLP_ENDPOINT=https://otel-collector.example.com:4318
OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf
OTEL_SERVICE_NAME=orbitalreg-api
OTEL_TRACES_SAMPLER=parentbased_traceidratio
OTEL_TRACES_SAMPLER_ARG=0.1
The sampler in front of the env-driven ratio adds two always-on overrides: /health/* probes and PUT/POST /api/v1/artifacts/* uploads are always recorded regardless of the ratio. See Trace sampling for the full decision table and tuning recipes.
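The decision order can be sketched as "overrides first, ratio second". The code below is a simplified model, not the shipped sampler: the function names are hypothetical, the parent-based branch is left out, and the ratio check uses a trace-id threshold comparison of the kind ratio samplers typically use.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"strings"
)

// alwaysSample covers the two overrides named in the text.
func alwaysSample(method, path string) bool {
	if strings.HasPrefix(path, "/health/") {
		return true
	}
	isUpload := method == "PUT" || method == "POST"
	return isUpload && strings.HasPrefix(path, "/api/v1/artifacts/")
}

// ratioSample keeps roughly `ratio` of all trace ids by comparing
// the low 8 bytes of the id against a threshold.
func ratioSample(traceID [16]byte, ratio float64) bool {
	if ratio >= 1 {
		return true
	}
	if ratio <= 0 {
		return false
	}
	bound := uint64(ratio * (1 << 63))
	x := binary.BigEndian.Uint64(traceID[8:16]) >> 1
	return x < bound
}

// shouldSample applies the overrides before the env-driven ratio.
func shouldSample(method, path string, traceID [16]byte, ratio float64) bool {
	return alwaysSample(method, path) || ratioSample(traceID, ratio)
}

func main() {
	var id [16]byte
	for i := 8; i < 16; i++ {
		id[i] = 0xFF // an id that a 10 % ratio would drop
	}
	fmt.Println(shouldSample("GET", "/health/ready", id, 0.1))
	fmt.Println(shouldSample("GET", "/api/v1/projects/abc", id, 0.1))
}
```

Deriving the decision from the trace id rather than a random draw means every service that sees the same id reaches the same verdict.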
trace_id correlation in the database
Migration 105_scan_findings_trace_id and its companion add a nullable trace_id TEXT column plus a partial index (WHERE trace_id IS NOT NULL) to both scan_findings and artifact_pulls. The write path captures the W3C trace-id synchronously from the request context, so an operator with the OTLP backend in one tab and psql in the other can jump bidirectionally:
-- From a backend trace, find every Detection finding it produced
SELECT artifact_id, severity, cve_id
FROM scan_findings
WHERE trace_id = '0af7651916cd43dd8448eb211c80319c';
-- From a pull row, jump back to the originating trace
SELECT trace_id
FROM artifact_pulls
WHERE artifact_id = $1
AND pulled_at > now() - interval '15 minutes';
The columns hold NULL when egress is off (the OTel SDK never started a span), so the partial index stays small for the typical air-gapped install.
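To show what ends up in the trace_id column, here is a standard-library-only sketch that pulls the trace-id out of a W3C traceparent header and yields SQL NULL when no valid id is present. The real write path reads the id from the request context; the function name `traceIDFromTraceparent` and the header-parsing route are assumptions made to keep the example self-contained.

```go
package main

import (
	"database/sql"
	"encoding/hex"
	"fmt"
	"strings"
)

// traceIDFromTraceparent returns the 32-hex-digit trace-id, or an
// invalid NullString (stored as NULL) when the header is absent or malformed.
func traceIDFromTraceparent(header string) sql.NullString {
	// Format: version "-" trace-id "-" parent-id "-" flags
	parts := strings.Split(header, "-")
	if len(parts) != 4 || len(parts[0]) != 2 || len(parts[1]) != 32 ||
		len(parts[2]) != 16 || len(parts[3]) != 2 {
		return sql.NullString{}
	}
	id := parts[1]
	// The spec requires lowercase hex and forbids the all-zero id.
	if id != strings.ToLower(id) || id == strings.Repeat("0", 32) {
		return sql.NullString{}
	}
	if _, err := hex.DecodeString(id); err != nil {
		return sql.NullString{}
	}
	return sql.NullString{String: id, Valid: true}
}

func main() {
	h := "00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01"
	fmt.Printf("%+v\n", traceIDFromTraceparent(h))
	fmt.Printf("%+v\n", traceIDFromTraceparent(""))
}
```

A `sql.NullString` can be passed directly as a query argument, so rows written without an active trace never enter the partial index.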
Frontend RUM
frontend/src/lib/telemetry.ts is a hand-rolled W3C trace-context propagator + OTLP/HTTP/JSON exporter. It wraps window.fetch to stamp traceparent (and optional tracestate) on every same-origin outbound call so a single trace runs end-to-end from a button click through the chi handler, the pgx query, and the S3 PutObject.
The hand-roll exists because the four @opentelemetry/sdk-trace-web contrib packages bundle to ~120 KiB gz after tree-shaking, which is half the entire app-shell budget. We only need the two W3C interop standards (trace-context propagation + OTLP/HTTP/JSON export), so the module is small enough to maintain and the file header documents how to swap to the SDK later without touching call sites.
Spans land in a 64-slot capped queue and flush on three triggers: half-full, an opportunistic tick, and pagehide (with keepalive: true so the last batch survives a tab close).
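The queue policy can be modelled in a few lines. The browser module itself is TypeScript; the sketch below uses Go only to illustrate the capped-queue and half-full trigger. The type names, the drop-newest behaviour when the queue is full, and the retain-on-failure behaviour are all assumptions.

```go
package main

import (
	"errors"
	"fmt"
)

// errCollectorDown is a stand-in failure for demonstrations.
var errCollectorDown = errors.New("collector unreachable")

type span struct{ Name string }

type spanQueue struct {
	capacity int
	items    []span
	flush    func([]span) error // exporter callback
}

// push enqueues a span and reports false if it had to be dropped.
func (q *spanQueue) push(s span) bool {
	if len(q.items) >= q.capacity {
		return false // assumed policy: drop the newest span when full
	}
	q.items = append(q.items, s)
	if len(q.items) >= q.capacity/2 {
		q.drain() // half-full trigger
	}
	return true
}

// drain is what the tick and pagehide triggers would also call.
func (q *spanQueue) drain() {
	if len(q.items) == 0 {
		return
	}
	batch := append([]span(nil), q.items...)
	if err := q.flush(batch); err != nil {
		return // keep the spans for the next trigger
	}
	q.items = q.items[:0]
}

func main() {
	sent := 0
	q := &spanQueue{capacity: 64, flush: func(b []span) error {
		sent += len(b)
		return nil
	}}
	for i := 0; i < 40; i++ {
		q.push(span{Name: fmt.Sprintf("click-%d", i)})
	}
	q.drain() // simulate pagehide
	fmt.Println("spans exported:", sent)
}
```

The cap bounds memory when the collector is unreachable, while the half-full trigger keeps batches small enough for a keepalive request.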
Enable by setting at build time:
VITE_OTEL_EXPORTER_OTLP_ENDPOINT=https://otel.example.com/v1/traces
VITE_OTEL_SERVICE_NAME=orbitalreg-portal
…and proxying the path through nginx so the browser never speaks to the in-cluster collector directly:
location /otel-collector/ {
    proxy_pass http://otel-collector.svc.cluster.local:4318/;
    proxy_set_header Host $host;
    proxy_read_timeout 10s;
    client_max_body_size 1m;
}
When VITE_OTEL_EXPORTER_OTLP_ENDPOINT is empty (the default), the module installs the fetch wrapper as a no-op: traceparent is not stamped, no spans are queued, and no network calls are made. This keeps an air-gapped portal posture-clean.
Setup recipes — four common backends
The same OTLP/HTTP endpoint receives traces, metrics, and logs. The recipes below cover the four backends the OrbitalReg field team has seen most often. Each block uses the shared env-var set; the metrics-specific override is shown where it matters for a customer who wants metrics on a separate endpoint (e.g., Prometheus remote-write vs. a managed APM).
Honeycomb
# Traces + logs + metrics via the same Honeycomb OTLP endpoint
OTEL_EXPORTER_OTLP_ENDPOINT=https://api.honeycomb.io
OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf
OTEL_EXPORTER_OTLP_HEADERS="x-honeycomb-team=$HONEYCOMB_API_KEY,x-honeycomb-dataset=orbitalreg-api"
OTEL_SERVICE_NAME=orbitalreg-api
OTEL_TRACES_SAMPLER=parentbased_traceidratio
OTEL_TRACES_SAMPLER_ARG=0.1
Honeycomb auto-derives metrics datasets from the OTLP metric stream and shows logs alongside the trace timeline. The dataset header is the only Honeycomb-specific configuration.
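OTEL_EXPORTER_OTLP_HEADERS is a comma-separated list of key=value pairs. A rough sketch of how such a value breaks down into HTTP headers follows; `parseOTLPHeaders` is a hypothetical helper, and a real exporter may be stricter about encoding than this approximation.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// parseOTLPHeaders splits "k1=v1,k2=v2" into a header map.
// Only the first "=" separates key from value, so values that
// themselves contain "=" (for example base64 padding) survive.
func parseOTLPHeaders(raw string) map[string]string {
	out := map[string]string{}
	for _, pair := range strings.Split(raw, ",") {
		k, v, found := strings.Cut(pair, "=")
		k = strings.TrimSpace(k)
		if !found || k == "" {
			continue // skip malformed entries
		}
		// Values may be percent-encoded; decode them.
		val, err := url.PathUnescape(strings.TrimSpace(v))
		if err != nil {
			continue
		}
		out[k] = val
	}
	return out
}

func main() {
	h := parseOTLPHeaders("x-honeycomb-team=abc123,x-honeycomb-dataset=orbitalreg-api")
	fmt.Println(h["x-honeycomb-team"], h["x-honeycomb-dataset"])
}
```

The placeholder key `abc123` stands in for a real API key and is not a working credential.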
Grafana Tempo + Loki + Mimir (or Grafana Cloud)
OTEL_EXPORTER_OTLP_ENDPOINT=https://otlp-gateway-prod.grafana.net/otlp
OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf
OTEL_EXPORTER_OTLP_HEADERS="Authorization=Basic $(printf '%s' "$INSTANCE_ID:$API_KEY" | base64 | tr -d '\n')"
OTEL_SERVICE_NAME=orbitalreg-api
OTEL_RESOURCE_ATTRIBUTES="service.namespace=registry,deployment.environment=prod"
Grafana's OTLP gateway fans the signals out to Tempo (traces), Loki (logs), and Mimir (metrics) automatically. For self-hosted clusters, replace the gateway URL with your own collector endpoint and drop the auth header.
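The Authorization value in the Grafana recipe is the word Basic followed by the base64 encoding of INSTANCE_ID:API_KEY. If the header is generated from a deployment tool rather than a shell, a sketch like the following produces the same string without line wrapping; `basicAuthHeader` is a hypothetical helper name.

```go
package main

import (
	"encoding/base64"
	"fmt"
)

// basicAuthHeader builds one OTEL_EXPORTER_OTLP_HEADERS entry
// carrying HTTP Basic credentials.
func basicAuthHeader(instanceID, apiKey string) string {
	creds := instanceID + ":" + apiKey
	return "Authorization=Basic " + base64.StdEncoding.EncodeToString([]byte(creds))
}

func main() {
	// Placeholder credentials, for illustration only.
	fmt.Println(basicAuthHeader("user", "pass"))
}
```

Some `base64` command-line builds wrap long output across lines, which would corrupt the header; encoding in code sidesteps that.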
Datadog
# Datadog's OTLP/HTTP endpoint sits inside the Datadog Agent. Run the
# Agent as a DaemonSet with the OTLP/HTTP receiver enabled.
OTEL_EXPORTER_OTLP_ENDPOINT=http://$NODE_IP:4318
OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf
OTEL_SERVICE_NAME=orbitalreg-api
OTEL_RESOURCE_ATTRIBUTES="env=prod,version=$ORBITALREG_VERSION"
The Datadog Agent terminates the OTLP/HTTP stream and re-emits to the Datadog backend with the tenant's API key. The Agent's own config turns OTLP receivers on:
# datadog.yaml
otlp_config:
  receiver:
    protocols:
      http:
        endpoint: 0.0.0.0:4318
  traces:
    enabled: true
  logs:
    enabled: true
  metrics:
    enabled: true
GitLab Premium (Distributed Tracing + Observability)
OTEL_EXPORTER_OTLP_ENDPOINT=https://observe.gitlab.com/v3/$PROJECT_ID/ingest/otlp
OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf
OTEL_EXPORTER_OTLP_HEADERS="private-token=$GITLAB_TOKEN"
OTEL_SERVICE_NAME=orbitalreg-api
GitLab's Observability product accepts OTLP/HTTP for all three signals; traces appear under Monitor → Tracing, logs under Monitor → Logs, and metrics under Monitor → Metrics.
Verify before promoting to prod
After flipping the egress allowlist, fire a known-shape request (curl -i https://registry/health/ready) and confirm the trace lands in the backend within 10 s. The /health/* paths are sampled at 100 % by the multi-tier sampler, so a missing trace points at the egress gate, not at sampling drop-out.
Health endpoints
| Endpoint | Purpose |
|---|---|
| /health/live | Liveness: passes if the process is responsive |
| /health/ready | Readiness: DB + Redis + S3 reachable, migrations applied |
| /health/backup | Backup health: last verification within window |
Both live and ready are unauthenticated; backup requires an admin token.
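A readiness endpoint of this shape usually runs each dependency check and answers 503 if any of them fails. The sketch below is an assumption about structure, not the shipped handler: the `check` type, the JSON body, and the helper names are all hypothetical.

```go
package main

import (
	"context"
	"encoding/json"
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// check is one dependency probe (DB, Redis, S3, migrations, ...).
type check struct {
	Name string
	Run  func(ctx context.Context) error
}

// readyHandler returns 200 only when every check passes.
func readyHandler(checks []check) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		failed := map[string]string{}
		for _, c := range checks {
			if err := c.Run(r.Context()); err != nil {
				failed[c.Name] = err.Error()
			}
		}
		w.Header().Set("Content-Type", "application/json")
		if len(failed) > 0 {
			w.WriteHeader(http.StatusServiceUnavailable)
		}
		_ = json.NewEncoder(w).Encode(map[string]any{"ready": len(failed) == 0, "failed": failed})
	}
}

// Stand-in probes for the demonstration.
func healthy(context.Context) error     { return nil }
func unreachable(context.Context) error { return errors.New("unreachable") }

// probe exercises the handler in-process and returns the status code.
func probe(checks []check) int {
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodGet, "/health/ready", nil)
	readyHandler(checks).ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	fmt.Println(probe([]check{{"db", healthy}, {"redis", healthy}, {"s3", healthy}}))
	fmt.Println(probe([]check{{"db", healthy}, {"s3", unreachable}}))
}
```

Reporting which check failed in the body lets an operator tell a database outage from an S3 outage without reading logs.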
Alerts
A PrometheusRule ships with the chart covering the canonical 10 incidents:
- OrbitalRegAPIDown
- OrbitalRegDBDown
- OrbitalRegBackupStale
- OrbitalRegS3MirrorFailing
- OrbitalRegScanQueueBacklog
- OrbitalRegHighErrorRate
- OrbitalRegCertExpiring
- OrbitalRegHighDiskUsage
- OrbitalRegSAMLDown
- OrbitalRegLicenseExpiring
Each alert links to a runbook in this docs site under /operations/runbooks/<alert>.
See also
- Trace sampling — per-path decision table and tuning
- Monitoring (metrics + alerts) — full metric catalogue
- Structured logs — slog schema and SIEM recipes
- Migrating from Prometheus to OTel metrics — dual-write window, cut-over checklist
- Air-gapped operations — egress allowlist mechanics