Monitoring

OrbitalReg exposes a Prometheus-compatible /metrics endpoint and ships two Grafana dashboards plus a Prometheus alert + recording-rule bundle so a customer-side observability stack can be wired up in a single afternoon. This page is the operator reference: the metric catalogue, the import recipes, the Alertmanager routing examples, and the PromQL snippets the on-call rotation reaches for first.

The high-level Observability page covers logs and traces alongside metrics; this page is the metrics deep-dive.

What ships in the box

Asset	Path	Purpose
`/metrics` endpoint	API process, served by `prometheus/promhttp.Handler`	Pull-mode scrape target for Prometheus
Metric registry	`api/internal/metrics/`	Single source of truth for every series name + label set
Overview dashboard	`docs/grafana/orbitalreg-overview.json`	12-panel RED + USE summary for an SRE morning glance
Deep-dive dashboard	`docs/grafana/orbitalreg-deep-dive.json`	18-panel per-route / per-op / per-peer breakdown
Alert + recording rules	`docs/prometheus/orbitalreg.alerts.yaml`	Ten on-call alerts plus seven pre-aggregated recording rules
Dashboard CI guard	`make grafana-dashboard-test`	Refuses any panel that references an undeclared metric
Alert-rule CI guard	`make prometheus-alerts-test`	Refuses any alert that references an undeclared metric

The CI guards turn a metric rename in the registry into a compile-time-style error: the dashboard or rule file fails its test in the same commit that drops the underlying series, so a stale asset never reaches a customer install.

Scrape configuration

Point Prometheus at the API process. The endpoint is served on the same listener as the rest of the API (there is no separate port binding today) and does not require authentication, so it can be scraped with the customer's existing service-discovery rules.

yaml

# prometheus.yml — minimal scrape job
scrape_configs:
  - job_name: orbitalreg-api
    metrics_path: /metrics
    scrape_interval: 30s
    scrape_timeout: 10s
    static_configs:
      - targets:
          - orbitalreg-api.orbitalreg.svc.cluster.local:8080

Kubernetes deployments using the kube-prometheus-stack get a ServiceMonitor selector for free: match on the app.kubernetes.io/name=orbitalreg-api label that the chart already sets. Keep the scrape interval at 30 seconds or shorter: histogram buckets need at least 4 scrapes inside a 5-minute window for the histogram_quantile() math in the bundled dashboards to be stable, and a 30-second interval yields 10.
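A minimal ServiceMonitor sketch for that label match follows. The port name (http) and the release label are assumptions, not values confirmed by the chart; check the Service definition and your Prometheus serviceMonitorSelector before applying. The namespace is inferred from the scrape target shown above.

```yaml
# Hypothetical ServiceMonitor; port name and release label are assumptions.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: orbitalreg-api
  namespace: orbitalreg
  labels:
    release: kube-prometheus-stack   # must match your Prometheus serviceMonitorSelector
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: orbitalreg-api
  endpoints:
    - port: http                     # assumed Service port name
      path: /metrics
      interval: 30s
      scrapeTimeout: 10s
```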

Metric reference

Every series below is declared in api/internal/metrics/. Cardinality-sensitive labels are noted; everything else is bounded by a finite enum at the call site.

HTTP layer (RED)

Metric	Type	Labels	Cardinality
`http_requests_total`	counter	`method`, `route`, `status`	bounded — `route` is the chi `RoutePattern` (e.g. `/api/v1/repos/{id}`), not the raw URL
`http_request_duration_seconds`	histogram	`method`, `route`, `status`	same as above; default Prometheus buckets
`http_requests_inflight`	gauge	—	single series

The chi middleware in api/internal/middleware/metrics.go wraps every inbound request, so adding a new handler picks up RED coverage automatically.

Database (USE)

Metric	Type	Labels	Notes
`pg_pool_acquired_connections`	gauge	—	Sampled every 10s by `metrics.RunPgPoolPoller`
`pg_pool_idle_connections`	gauge	—	Sampled every 10s
`pg_pool_max_connections`	gauge	—	Pool max from `pgxpool.Stat()`
`pg_pool_total_connections`	gauge	—	acquired + idle + constructing

The recorded ratio pg_pool:saturation:ratio = pg_pool_acquired_connections / pg_pool_max_connections is what the OrbitalRegPgPoolNearExhausted alert fires on.

S3 / MinIO

Metric	Type	Labels	Notes
`s3_requests_total`	counter	`op`	`op` ∈ `put`/`get`/`stat`/`delete`/`copy`/`presign`/`list`/`put_by_key`
`s3_request_duration_seconds`	histogram	`op`	latency buckets tuned for object-store calls (5ms → 30s)
`s3_request_errors_total`	counter	`op`, `err_class`	`err_class` ∈ `notfound`/`auth`/`server`/`client`/`timeout`/`canceled`/`network`/`other`

These metrics are recorded inside startS3Span in internal/storage/s3.go, so every MinIO call records latency and outcome without per-callsite plumbing.

Scan dispatcher

Metric	Type	Labels	Notes
`scan_queue_depth`	gauge	`state`	`state` ∈ `pending`/`claimed`/`running`/`done`/`failed`/`skipped` — sampled every 15s
`scan_jobs_processed_total`	counter	`outcome`	`outcome` ∈ `done`/`failed`/`skipped`
`scan_job_duration_seconds`	histogram	`scanner`	per-scanner wall-clock; bucket range 10ms → 5min

Retention sweeper

Metric	Type	Labels	Notes
`retention_sweep_duration_seconds`	histogram	—	one full sweep cycle, regardless of policy mix
`retention_artifacts_deleted_total`	counter	`repo_id`, `dry_run`	repo-keyed for SRE pivots; complements the operator-named `retention_runs_total{policy_name,outcome}` series

Geo-Sync replication

Metric	Type	Labels	Notes
`replication_lag_seconds`	gauge	`peer`	seconds since last successful push; zeroed on success
`replication_outbox_depth`	gauge	`peer`	`geosync_outbox` rows ahead of the peer's cursor
`replication_push_errors_total`	counter	`peer`, `err_class`	`err_class` is bucketed via `ClassifyReplicationErr` (outbox_read / marshal / request / throttle / deliver / peer_5xx / peer_4xx / other)

peer is the user-defined peer ID; cardinality matches the row count in geosync_peer_state (single-digit on every install we've seen).

License

Metric	Type	Labels	Notes
`license_checks_total`	counter	`outcome`	`outcome` ∈ `trial_active`/`trial_expired`/`licensed_active`/`licensed_expired`/`invalid`/`derive_error`
`license_expires_in_seconds`	gauge	—	negative once expired; drives the renewal alerts

Build identity

Metric	Type	Labels	Notes
`orbitalreg_build_info`	gauge=1	`version`, `commit`, `go_version`, `built_at`	The "what's running here" beacon — pin to a Grafana table panel
`orbitalreg_uptime_seconds`	gauge	—	Ticks once per second; resets on restart

Log volume (item 71 Phase G)

Sampled by the applog metrics handler that wraps the JSON encoder (api/internal/applog/metrics.go); both counters increment per emitted record.

Metric	Type	Labels	Notes
`log_lines_emitted_total`	counter	`component`, `level`	One increment per slog record; cardinality bounded by the canonical applog component enum × `{debug,info,warn,error}`. `component=unknown` is the fallback bucket for records emitted before any `applog.Component` tag.
`log_bytes_emitted_total`	counter	`component`	Approximate JSON-serialised wire bytes (post-redaction); use for SIEM ingestion-budget forecasts.

The OrbitalRegLogVolumeSpike alert pages on a 5× spike over the prior-hour baseline — see the Log volume spike runbook below.

Recording rules

The bundled orbitalreg.alerts.yaml ships seven recording rules so the dashboards (and the alerts themselves) don't pay the histogram-quantile + division cost on every evaluation:

Recording rule	Underlying expression
`route:http_requests:rate5m`	per-route request rate (`sum by (route, method) (rate(http_requests_total[5m]))`)
`route:http_requests:error_rate5m`	per-route 5xx ratio
`route:http_request_duration_seconds:p99_5m`	per-route p99 latency (5m window)
`cluster:http_requests:error_rate5m`	cluster-wide 5xx ratio (5m fast burn)
`cluster:http_requests:error_rate1h`	cluster-wide 5xx ratio (1h slow burn)
`pg_pool:saturation:ratio`	acquired / max — the saturation indicator
`s3:requests:error_rate10m`	per-op S3 error ratio (10m window)

Recording-rule names follow the documented Prometheus convention level:metric:operation so a dashboard reader can read what the series is without opening the YAML.
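To make the naming convention concrete, here is a sketch of how three of these rules might be written. The first two expressions are the ones quoted on this page; the third is an assumed form of the cluster-wide 5xx ratio, and the group name is hypothetical. The bundled orbitalreg.alerts.yaml remains the authoritative definition.

```yaml
# Sketch only; the shipped rule file may differ in grouping and expressions.
groups:
  - name: orbitalreg-recording-sketch   # hypothetical group name
    rules:
      # Expression quoted in the recording-rule table.
      - record: route:http_requests:rate5m
        expr: sum by (route, method) (rate(http_requests_total[5m]))
      # acquired / max, as described in the Database (USE) section.
      - record: pg_pool:saturation:ratio
        expr: pg_pool_acquired_connections / pg_pool_max_connections
      # Assumed form of the cluster-wide 5xx ratio.
      - record: cluster:http_requests:error_rate5m
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            /
          sum(rate(http_requests_total[5m]))
```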

Grafana — importing the dashboards

The dashboards live in docs/grafana/ and use a templated ${DS_PROMETHEUS} datasource so the import wizard binds cleanly to whatever Prometheus instance you point them at.

1. Open Grafana → Dashboards → Import.
2. Click Upload JSON file and select either orbitalreg-overview.json or orbitalreg-deep-dive.json. (You can also paste the file contents into the "Import via panel json" text area.)
3. On the next screen, pick the Prometheus datasource that scrapes /metrics (the ${DS_PROMETHEUS} placeholder is bound here).
4. Click Import.

To version-control the import, keep the JSON files alongside your Grafana provisioning config:

yaml

# grafana-provisioning/dashboards/orbitalreg.yaml
apiVersion: 1
providers:
  - name: orbitalreg
    folder: OrbitalReg
    type: file
    options:
      path: /var/lib/grafana/dashboards/orbitalreg

…and drop the two JSON files under /var/lib/grafana/dashboards/orbitalreg/. Grafana picks them up at startup and re-syncs whenever the file changes.

Dashboard layout

Overview is the morning glance: RPS, 5xx rate, latency p50/p95/p99, inflight, pg-pool saturation, S3 latency p99 per-op, S3 error rate, scan-queue depth per state, retention sweep p99, replication lag per peer, license expires-in days, build-info table.
Deep-dive is the triage view: per-route topk RPS / p99 / 5xx %, status-code distribution, per-method, pool saturation %, per-op S3 throughput + p95 + error class, scan-queue per state, per-scanner duration p99, outcome counters, retention deletes per repo, per-peer replication lag / outbox / push errors, license checks per outcome, uptime.

The deep-dive dashboard's per-route panels use the route:http_requests:rate5m and route:http_request_duration_seconds:p99_5m recording rules — they will only populate once the rule file is loaded into Prometheus.

Prometheus — loading the alert rules

Drop orbitalreg.alerts.yaml into your Prometheus rule directory and reference it from the main config:

yaml

# prometheus.yml
rule_files:
  - /etc/prometheus/rules/orbitalreg.alerts.yaml

Reload Prometheus (POST /-/reload or SIGHUP) and confirm under Status → Rules that all seven recording rules and ten alert rules show up green. The alert-rule CI guard (make prometheus-alerts-test) parses the same file and refuses any PromQL that references an undeclared metric, so a clean test run is a strong signal the file will load against any Prometheus that scrapes the API.

Severity convention

Every alert carries two routing labels:

severity: page (operator must wake up; SLO burning now) or warning (investigate within business hours).
subsystem: api / postgres / storage / scan / geosync / license / retention / logging. Lets Alertmanager route per team without inspecting the alert name.
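As an illustration of per-team routing on the subsystem label, the fragment below sends database alerts to one receiver and storage and replication alerts to another. The receiver names are hypothetical and their notifier configs are omitted; only the matcher shape is the point.

```yaml
# Hypothetical team routing keyed on the subsystem label.
route:
  receiver: default
  group_by: ['alertname', 'subsystem']
  routes:
    - matchers:
        - subsystem = postgres
      receiver: dba-team              # illustrative receiver name
    - matchers:
        - subsystem =~ "storage|geosync"
      receiver: platform-team         # illustrative receiver name

receivers:
  - name: default
  - name: dba-team
  - name: platform-team
```

Because routing keys off the label rather than the alert name, new alerts in an existing subsystem reach the right team without a routing change.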

Alertmanager routing examples

The bundled rules expose the labels Alertmanager needs to route without any extra wrapping. Three minimal recipes follow — pick the one that matches how your team already does on-call.

PagerDuty (page-severity → wake the operator)

yaml

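# alertmanager.yml: PagerDuty for page, Slack for warning.
# The ${...} values are placeholders. Alertmanager does not expand
# environment variables in its config, so substitute them before loading.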
route:
  receiver: default
  group_by: ['alertname', 'subsystem']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - matchers:
        - severity = page
      receiver: pagerduty-orbitalreg
      continue: false
    - matchers:
        - severity = warning
      receiver: slack-orbitalreg
      continue: false

receivers:
  - name: default
    # silently absorb anything that doesn't match the routes above.

  - name: pagerduty-orbitalreg
    pagerduty_configs:
      - service_key: ${PAGERDUTY_INTEGRATION_KEY}
        description: '{{ .CommonAnnotations.summary }}'
        details:
          subsystem: '{{ .CommonLabels.subsystem }}'
          firing:    '{{ .Alerts.Firing | len }}'
          runbook:   '{{ .CommonAnnotations.runbook_url }}'

  - name: slack-orbitalreg
    slack_configs:
      - api_url: ${SLACK_WEBHOOK_URL}
        channel: '#orbitalreg-alerts'
        title: '{{ .CommonAnnotations.summary }}'
        text: |
          {{ range .Alerts -}}
          *{{ .Labels.alertname }}* — {{ .Annotations.description }}
          <{{ .Annotations.runbook_url }}|Runbook>
          {{ end }}

Slack-only (warnings + pages both go to chat)

For early-stage installs that don't yet have a paging contract:

yaml

route:
  receiver: slack-orbitalreg
  group_by: ['alertname', 'subsystem']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h

receivers:
  - name: slack-orbitalreg
    slack_configs:
      - api_url: ${SLACK_WEBHOOK_URL}
        channel: '#orbitalreg-alerts'
        send_resolved: true
        title: '[{{ .CommonLabels.severity | toUpper }}] {{ .CommonAnnotations.summary }}'
        text: |
          {{ range .Alerts -}}
          *{{ .Labels.alertname }}* — {{ .Annotations.description }}
          <{{ .Annotations.runbook_url }}|Runbook>
          {{ end }}

Email-only (air-gapped, no SaaS notifier)

The customer profile that runs OrbitalReg in-cluster typically also runs Postfix or a corporate SMTP relay. Email is the lowest-common-denominator channel and works without any external connectivity:

yaml

global:
  smtp_smarthost: smtp.internal.example.com:587
  smtp_from: alertmanager@example.com
  smtp_auth_username: alertmanager
  smtp_auth_password_file: /etc/alertmanager/smtp.password

route:
  receiver: email-orbitalreg
  group_by: ['alertname', 'subsystem']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - matchers:
        - severity = page
      receiver: email-orbitalreg-page
      continue: false

receivers:
  - name: email-orbitalreg
    email_configs:
      - to: orbitalreg-warning@example.com
        send_resolved: true

  - name: email-orbitalreg-page
    email_configs:
      - to: orbitalreg-oncall@example.com
        send_resolved: true
        # Most corporate mail servers throttle bursts. The route-level
        # group_interval (5m above) keeps follow-up notifications for a
        # group at least 5 minutes apart. Raise it if the relay still throttles.

Sample PromQL — the five most common triage questions

Open the deep-dive dashboard first; these are the queries to paste into Prometheus's expression browser when a panel doesn't tell the whole story.

1. Where are my errors coming from?

Top routes by 5xx rate over the last 15 minutes:

text

topk(5,
  sum by (route) (rate(http_requests_total{status=~"5.."}[15m]))
)

Compare against the per-route 5xx ratio (filters out hot routes that are simply busy):

text

topk(5, route:http_requests:error_rate5m > 0)

2. Why is my latency spiking?

The five worst routes by p99 latency, recording-rule pre-aggregated:

text

topk(5, route:http_request_duration_seconds:p99_5m)

Cross-correlate with backend backpressure to confirm the spike is downstream of the API layer:

text

histogram_quantile(0.99,
  sum by (le, op) (rate(s3_request_duration_seconds_bucket[5m])))

text

pg_pool:saturation:ratio

3. Is the scan dispatcher draining?

Pending + claimed + running depth — should hover near zero on a healthy install:

text

sum(scan_queue_depth{state=~"pending|claimed|running"})

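Failure share over the last 15 minutes (≥0.2 trips OrbitalRegScanJobsFailing). The counts use increase() so that the clamp_min(..., 1) floor reads as "at least one job" and only guards against division by zero; applied to a per-second rate(), the same floor would distort the ratio on any install processing fewer than one job per second:

text

sum(increase(scan_jobs_processed_total{outcome="failed"}[15m]))
  /
clamp_min(sum(increase(scan_jobs_processed_total[15m])), 1)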

Per-scanner p99 to find the misbehaving backend:

text

histogram_quantile(0.99,
  sum by (le, scanner) (rate(scan_job_duration_seconds_bucket[15m])))

4. Which peer is replication-lagging?

text

topk(3, replication_lag_seconds)

Confirm the producer is healthy by checking the outbox depth — a deep outbox + low push-error rate means the consumer is slow, not the producer:

text

replication_outbox_depth

text

sum by (peer, err_class) (rate(replication_push_errors_total[15m]))

5. How healthy is the database pool?

The headline ratio:

text

pg_pool:saturation:ratio

Connection accounting, useful when the ratio looks fine but pool acquisitions are stalling on a leak:

text

pg_pool_acquired_connections
pg_pool_idle_connections
pg_pool_max_connections
pg_pool_total_connections

Alert runbook

Each section below is the operator-facing triage guide referenced by the runbook_url annotation on the matching alert in orbitalreg.alerts.yaml. Section anchors are stable so a future rule rename only needs to update the YAML.

High error rate

Alert: OrbitalRegHighErrorRate (severity page, subsystem api).
Trips when: cluster 5xx rate > 5% over 5m and > 1% over 1h (multi-window burn-rate).
Triage:
1. Open the deep-dive dashboard's "Top routes by 5xx" panel; one route is almost always the source.
2. Pull the JSON-log line for a recent 5xx via Loki / Splunk filtered on level=error component=api plus the offending route — the canonical schema (item 63) carries the full request identity.
3. Cross-reference pg-pool saturation and S3 error rate — a 5xx spike is usually downstream.
Resolve when: cluster:http_requests:error_rate5m falls below 0.05 for at least the alert's for: window.
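The multi-window condition can be sketched as an alert rule built on the two cluster-level recording rules. The thresholds, severity, and subsystem come from this runbook entry; the for: duration, group name, annotation text, and runbook URL are assumptions, and the bundled rule may be written differently.

```yaml
# Sketch of a two-window burn-rate alert; not the shipped rule.
groups:
  - name: orbitalreg-alerts-sketch      # hypothetical group name
    rules:
      - alert: OrbitalRegHighErrorRate
        # Both windows must agree: the 5m ratio catches the fast burn,
        # the 1h ratio confirms it is not a momentary blip.
        expr: |
          cluster:http_requests:error_rate5m > 0.05
            and
          cluster:http_requests:error_rate1h > 0.01
        for: 5m                         # assumed; not stated on this page
        labels:
          severity: page
          subsystem: api
        annotations:
          summary: Cluster-wide 5xx ratio is elevated
          description: 5xx ratio is above 5% over 5m and above 1% over 1h.
          runbook_url: "https://docs.example.com/monitoring#high-error-rate"   # placeholder URL
```

Requiring both windows keeps a short error burst from paging while still reacting within minutes to a sustained one.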

High latency

Alert: OrbitalRegHighLatency (severity warning, subsystem api).
Trips when: any route's p99 latency stays above 1s for 10 minutes.
Triage:
1. Use the "p99 by route" panel on the deep-dive dashboard to find the slow route(s).
2. Check s3_request_duration_seconds — 95% of latency regressions are S3 backpressure.
3. Check pg_pool:saturation:ratio — the second-most-common cause is a slow query holding pool connections.
Resolve when: the offending route's p99 returns below 1s.

Pg pool near exhausted

Alert: OrbitalRegPgPoolNearExhausted (severity warning, subsystem postgres).
Trips when: pool saturation > 80% for 5 minutes.
Triage:
1. Inspect pg_pool_acquired_connections vs pg_pool_max_connections — confirm the gauge isn't a stale scrape.
2. Identify long-running queries via pg_stat_activity (filter on state = 'active' and query_start < now() - interval '30s').
3. If the pool is genuinely too small for traffic, raise ORBITALREG_PG_MAX_CONNS (and the matching CNPG pooler limit if you're on cloudnativepg).
Resolve when: saturation drops below 0.7.

S3 high error rate

Alert: OrbitalRegS3HighErrorRate (severity page, subsystem storage).
Trips when: any S3 op's error rate > 5% for 10 minutes.
Triage:
1. Break down by err_class: sum by (op, err_class) (rate(s3_request_errors_total[10m])). A spike of auth points at credentials; network / timeout points at the backend; notfound on stat is usually a stale manifest pointer (rare).
2. Check the MinIO server's own /minio/v2/metrics/cluster endpoint for the upstream view.
3. Confirm the bucket policy + IAM still match — mc admin policy ls on the cluster identity.
Resolve when: S3 error rate drops below 5% on every op.

Scan queue deep

Alert: OrbitalRegScanQueueDeep (severity warning, subsystem scan).
Trips when: pending + claimed + running > 1000 for 30 minutes.
Triage:
1. Confirm the dispatcher is alive: filter logs on component=scan_dispatcher for activity in the last minute.
2. Look at scan_jobs_processed_total{outcome="failed"} — a crash-looping scanner can saturate the claim queue.
3. Per-scanner duration p99 isolates a single backend that's gone slow.
Resolve when: the depth falls below 500.

Replication lag high

Alert: OrbitalRegReplicationLagHigh (severity warning, subsystem geosync).
Trips when: any peer's replication_lag_seconds > 60 for 5 minutes.
Triage:
1. replication_push_errors_total by err_class for the slow peer — peer_5xx / peer_4xx means the remote refused; deliver / throttle means the local pusher couldn't keep up.
2. replication_outbox_depth confirms whether the producer or the consumer is the bottleneck.
3. The pusher zeroes the lag gauge on a successful push, so a transient peer outage clears the alert without a config change once recovery completes.
Resolve when: lag falls below 60s for the alert's for: window.

License expiring soon

Alert: OrbitalRegLicenseExpiringSoon (severity warning, subsystem license).
Trips when: license_expires_in_seconds < 2592000 (30 days).
Triage:
1. Confirm the renewal is in flight with the account team.
2. Once a fresh license envelope is available, upload via Admin → License → Upload.
3. Verify the gauge has reset: license_expires_in_seconds / 86400 > 30.

License expired

Alert: OrbitalRegLicenseExpired (severity page, subsystem license).
Trips when: license_expires_in_seconds ≤ 0 for 5 minutes.
Triage: same upload path as above. Most write operations begin refusing once the grace window ends, so this alert pages rather than warns.
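The two license thresholds differ only in units and severity, which a rule sketch makes explicit. The gauge is in seconds, so 30 days is 30 × 86400 = 2592000. The group name is hypothetical and annotations are omitted for brevity; the shipped rules carry summary, description, and runbook_url.

```yaml
# Sketch of the two license alerts; not the shipped rules.
groups:
  - name: orbitalreg-license-sketch     # hypothetical group name
    rules:
      - alert: OrbitalRegLicenseExpiringSoon
        # 30 days * 86400 s/day = 2592000 s
        expr: license_expires_in_seconds < 30 * 86400
        labels:
          severity: warning
          subsystem: license
      - alert: OrbitalRegLicenseExpired
        # The gauge goes negative once the license has expired.
        expr: license_expires_in_seconds <= 0
        for: 5m
        labels:
          severity: page
          subsystem: license
```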

Retention sweep stuck

Alert: OrbitalRegRetentionSweepStuck (severity warning, subsystem retention).
Trips when: no retention_sweep_duration_seconds observation in the last 24 hours.
Triage:
1. Filter logs on component=retention_sweeper and look for the last successful tick.
2. The sweeper runs hourly inside the API process; an absent metric means the goroutine has exited or wedged.
3. If logs show repeated panics, restart the API; the sweeper will re-arm on next boot.
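One way to express "no observation in the last 24 hours" in PromQL is to watch the histogram's _count series, which increments once per recorded sweep. The sketch below is an assumed formulation with a hypothetical group name; the bundled rule may detect the condition differently.

```yaml
# Sketch of an absence-of-activity alert; not the shipped rule.
groups:
  - name: orbitalreg-retention-sketch   # hypothetical group name
    rules:
      - alert: OrbitalRegRetentionSweepStuck
        # First clause: the series exists but recorded no sweep in 24h.
        # Second clause: the series is missing entirely (for example,
        # the target is not being scraped).
        expr: |
          sum(increase(retention_sweep_duration_seconds_count[24h])) == 0
            or
          absent(retention_sweep_duration_seconds_count)
        labels:
          severity: warning
          subsystem: retention
```

The absent() clause matters because a comparison against a series that does not exist returns nothing, so the first clause alone would stay silent in exactly the case where the API is not exporting the metric.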

Scan jobs failing

Alert: OrbitalRegScanJobsFailing (severity warning, subsystem scan).
Trips when: the failure-outcome share of terminal scan jobs exceeds 20% over 15 minutes, sustained for 30 minutes.
Triage:
1. Per-scanner duration p99 isolates the misbehaving backend.
2. Filter logs on component=scan_dispatcher and level=error for the crash details.
3. If a single scanner is the culprit, disable it via Admin → Detection → Scanners until the upstream image is fixed.

Log volume spike

Alert: OrbitalRegLogVolumeSpike (severity warning, subsystem logging).
Trips when: the aggregate log_lines_emitted_total rate over the last 5 minutes is more than 5× the rolling baseline from the previous hour, sustained for 10 minutes. The baseline is the same counter rated over the prior 1h with offset 1h so steady-state growth (organic load) does not page; a sudden spike does.
Common causes:
- A handler stuck in a retry loop emitting one error per retry.
- A scanner re-emitting the same error per artifact (typically component=scan_dispatcher).
- A frontend client-error storm (item 71 Phase E — component=frontend) — usually a render loop or unhandled promise rejection introduced by a recent deploy.
- A new background worker added without a sensible log level (e.g. an importer logging one line per processed item at info).
Triage:
1. PromQL topk(5, sum by (component, level) (rate(log_lines_emitted_total[5m]))) to find the noisy source.
2. Pivot the SIEM to that component / level — the offending records will already be in the index.
3. If component=frontend, the rate limit on the /api/v1/client-logs endpoint will eventually shed the spam, but reverting the deploy is faster than waiting for the bucket to drain.
4. For server-side spam, drop the offender to warn via ORBITALREG_LOG_LEVEL only if the rate is actively burning downstream SIEM ingestion budget — otherwise fix the loop and ship.
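The spike condition described under "Trips when" can be sketched as a comparison between the current 5-minute rate and five times the rate over a one-hour window shifted back by one hour. This is an assumed reconstruction from that description, with a hypothetical group name, not a copy of the shipped rule.

```yaml
# Sketch of a baseline-relative spike alert; not the shipped rule.
groups:
  - name: orbitalreg-logging-sketch     # hypothetical group name
    rules:
      - alert: OrbitalRegLogVolumeSpike
        # Left side: current aggregate rate over 5m.
        # Right side: 5x the rate over a 1h window shifted back by 1h.
        expr: |
          sum(rate(log_lines_emitted_total[5m]))
            >
          5 * sum(rate(log_lines_emitted_total[1h] offset 1h))
        for: 10m
        labels:
          severity: warning
          subsystem: logging
```

Until Prometheus holds samples older than one hour, the baseline side is empty and the comparison returns nothing, so a freshly deployed install cannot trip this rule.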

Verifying the bundle locally

The Makefile targets work from a fresh checkout — useful when operators want to verify the assets before importing them into a locked-down production stack:

bash

# Validates docs/grafana/*.json: schema version, metric refs,
# panel shape. Fails if any panel target references a metric not
# declared in api/internal/metrics/.
make grafana-dashboard-test

# Validates docs/prometheus/orbitalreg.alerts.yaml: alert shape,
# severity / summary / description / runbook_url presence, metric
# refs against the same canonical registry, sanity floors.
make prometheus-alerts-test

Both run inside CI on every change to either the registry or the asset files, so a metric rename can never silently desync the bundled dashboards or alerts.

Out of scope (today)

Push-mode export. Prometheus is pull-only; an OTLP push path via the OpenTelemetry Collector lands later, when customer demand surfaces.
Per-tenant metrics. Single-tenant assumption holds; a multi-tenant deployment would need a tenant_id label retrofit across the registry.
Saturation forecasting. No baked-in capacity-planner panels — the Grafana built-in trend overlays are sufficient until a customer asks for predictive saturation.
StatsD / Graphite export. Prometheus only; an in-house exporter is the recommended bridge for legacy observability stacks.
