Skip to content
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
84 changes: 77 additions & 7 deletions content/integrate/redis-data-integration/observability.md
Original file line number Diff line number Diff line change
Expand Up @@ -25,8 +25,12 @@ to query the metrics and plot simple graphs or with
[Grafana](https://grafana.com/) to produce more complex visualizations and
dashboards.

RDI exposes two endpoints, one for [CDC collector metrics](#collector-metrics) and
another for [stream processor metrics](#stream-processor-metrics).
RDI exposes metrics for two main components:
- [CDC collector](#collector-metrics)
- [Stream processor](#stream-processor-metrics)
- [Operator](#operator-metrics)


The sections below explain these sets of metrics in more detail.
See the
[architecture overview]({{< relref "/integrate/redis-data-integration/architecture#overview" >}})
Expand All @@ -37,9 +41,67 @@ RDI metrics with the RDI monitoring screen in Redis Insight or with the
[`redis-di status`]({{< relref "/integrate/redis-data-integration/reference/cli/redis-di-status" >}})
command from the CLI.{{< /note >}}

## Collector metrics
## Accessing the metrics

The way you access the metrics endpoints depends on whether you are using a VM installation or a Helm installation for RDI. The sections below describe the correct approach for each installation type.

### VM Installation

For VM installations, the metrics are available by default on the following endpoints:
- Collector metrics: `https://<RDI_HOST>/collector-source/metrics`
- Stream processor metrics: `https://<RDI_HOST>/metrics`
- Operator metrics: `https://<RDI_HOST>/operator/metrics`

Please note that for RDI versions prior to 1.16.0 the collector metrics are not accessible.

### Helm installation

For Helm installations, the metrics are available via autodiscovery in the K8s cluster. Follow the steps below to use them:
1. Make sure you have the Prometheus Operator installed in your K8s cluster (see the
[Prometheus Operator installation guide](https://prometheus-operator.dev/docs/getting-started/installation/) for more information about this).

2. Update your values.yaml file to enable metrics for the operator, collector and stream processor components.

- For the collector, update the `collector` section, under the `dataPlane` section:
```yaml
dataPlane:
collector:
# Enable service monitor
serviceMonitor:
enabled: true

# Make sure to label the ServiceMonitor so that Prometheus can discover it
labels:
release: prometheus
```

The endpoint for the collector metrics is `https://<RDI_HOST>/metrics/collector-source`
- For the stream processor, update the `rdiMetricsExporter` section:
```yaml
rdiMetricsExporter:
# Enable service monitor
serviceMonitor:
enabled: true

# Make sure to label the ServiceMonitor so that Prometheus can discover it
labels:
release: prometheus
```

- For the operator, update the `operator` section:
```yaml
operator:
prometheus:
enabled: true
labels:
release: prometheus
metrics:
enabled: true
```

{{< note >}}The Prometheus service discovery loop runs at regular intervals. This means that after deploying or updating RDI with the above configuration, it may take a few minutes for Prometheus to discover the new ServiceMonitors and start scraping metrics from the RDI components.
{{< /note >}}

## Collector metrics

These metrics are divided into three groups:

Expand Down Expand Up @@ -89,10 +151,8 @@ The following table lists all collector metrics and their descriptions:
{{< note >}}
Many metrics include context labels that specify the phase (`snapshot` or `streaming`), database name, and other contextual information. Metrics with a value of `-1` typically indicate that the measurement is not applicable in the current state.
{{< /note >}}

## Stream processor metrics

The endpoint for the stream processor metrics is `https://<RDI_HOST>/metrics/rdi`
## Stream processor metrics

RDI reports metrics during the two main phases of the ingest pipeline, the *snapshot*
phase and the *change data capture (CDC)* phase. (See the
Expand Down Expand Up @@ -145,6 +205,16 @@ RDI reports with their descriptions.
- **Last batch metrics**: Show real-time performance data for the most recently processed batch
{{< /note >}}

## Operator metrics
Most of the metrics exposed by the RDI operator are standard controller-runtime [metrics](https://book.kubebuilder.io/reference/metrics-reference).
The metrics that are relevant for RDI operations are listed in the table below:

| Metric Name | Metric Type | Metric Description | Alerting Recommendations |
|-------------|-------------|--------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------|
| `rdi_operator_is_leader` | Gauge | Current leadership status (1 = leader, 0 = not leader) | Informational - may be used for alerting there is no leader for prolonged period of time. |
| `rdi_operator_pipeline_phase` | Gauge | Current Pipeline phase | Informational - may be used for alerting if operator is stuck in the `Resetting` phase for a prolonged period of time |


## Recommended alerting strategy

The alerting strategy described in the sections below focuses on system failures and data integrity issues that require immediate attention. Most other metrics are informational, so you should monitor them for trends rather than trigger alerts.
Expand Down