Skip to content

Commit

Permalink
Replica lag settings and guide (#6671)
Browse files Browse the repository at this point in the history
* initial pass at redoing replica lag

* fixed pipe

---------

Co-authored-by: Carrie Warner (Mattermost) <[email protected]>
  • Loading branch information
coltoneshaw and cwarnermm authored Oct 12, 2023
1 parent 5c04aac commit ab0b9ee
Show file tree
Hide file tree
Showing 3 changed files with 103 additions and 19 deletions.
122 changes: 103 additions & 19 deletions source/configure/database-configuration-settings.rst
Original file line number Diff line number Diff line change
Expand Up @@ -376,7 +376,7 @@ Replica lag settings
| String array input consists of: | |
| | |
| - ``DataSource``: The database credentials to connect | |
| to the replica instance. | |
| to the database instance. | |
| - ``QueryAbsoluteLag``: A plain SQL query that must | |
| return a single row. The first column must be the | |
| node value of the Prometheus metric, and the second | |
Expand All @@ -388,27 +388,111 @@ Replica lag settings
| column must be the value of the lag used to measure | |
| the time lag. | |
+--------------------------------------------------------+----------------------------------------------------------------------------------+
| Examples: |
| **Notes**: |
| |
| For AWS Aurora instances, ``QueryAbsoluteLag`` can be: |
| |
| .. code-block:: sql |
| |
| select server_id, highest_lsn_rcvd-durable_lsn as bindiff from aurora_global_db_instance_status() where server_id=<> |
| |
| And for AWS Aurora instances, ``QueryTimeLag`` can be: |
| |
| .. code-block:: sql |
| |
| select server_id, visibility_lag_in_msec from aurora_global_db_instance_status() where server_id=<> |
| |
| For MySQL Group Replication, the absolute lag can be measured from the number of pending transactions in the applier queue: |
| |
| .. code-block:: sql |
| |
| select member_id, count_transactions_remote_in_applier_queue FROM performance_schema.replication_group_member_stats where member_id=<> |
| - The ``QueryAbsoluteLag`` and ``QueryTimeLag`` queries must return a single row. |
| - To properly monitor this you must setup `performance monitoring </scale/performance-monitoring.html>`__ for Mattermost. |
+--------------------------------------------------------+----------------------------------------------------------------------------------+

1. Configure the replica lag metric based on your database type. See the following tabs for details on configuring this for each database type.

.. tabs::

.. tab:: AWS Aurora

Add the configuration highlighted below to your ``SqlSettings.ReplicaLagSettings`` array. You only need to add this once because replication statistics for AWS Aurora nodes are visible across all server instances that are members of the cluster. Be sure to change the ``DataSource`` to point to a single node in the group.

For more information on Aurora replication stats, see the `AWS Aurora documentaion <https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/aurora_global_db_instance_status.html>`__.

**Example:**

.. code-block:: json
:emphasize-lines: 4,5,6,7,8
{
"SqlSettings": {
"ReplicaLagSettings": [
{
"DataSource": "replica-1",
"QueryAbsoluteLag": "select server_id, highest_lsn_rcvd-durable_lsn as bindiff from aurora_global_db_instance_status() where server_id=<>",
"QueryTimeLag": "select server_id, visibility_lag_in_msec from aurora_global_db_instance_status() where server_id=<>"
}
]
}
}
.. tab:: MySQL Group Replication

Add the configuration highlighted below to your ``SqlSettings.ReplicaLagSettings`` array. You only need to add this once because replication statistics for all nodes are shared across all server instances that are members of the MySQL replication group. Be sure to change the ``DataSource`` to point to a single node in the group.

For more information on group replication stats, see the `MySQL documentation <https://dev.mysql.com/doc/refman/8.0/en/group-replication-replication-group-member-stats.html>`__.

**Example:**

.. code-block:: json
:emphasize-lines: 4,5,6,7,8
{
"SqlSettings": {
"ReplicaLagSettings": [
{
"DataSource": "replica-1",
"QueryAbsoluteLag": "select member_id, count_transactions_remote_in_applier_queue FROM performance_schema.replication_group_member_stats where member_id=<>",
"QueryTimeLag": ""
}
]
}
}
.. tab:: PostgreSQL replication slots

1. Add the configuration highlighted below to your ``SqlSettings.ReplicaLagSettings`` array. This query should run against the **primary** node in your cluster, to do this change the ``DataSource`` to match the `SqlSettings.DataSource <#data-source>`__ setting you have configured.

For more information on pg_stat_replication, see the `PostgreSQL documentation <https://www.postgresql.org/docs/current/monitoring-stats.html#MONITORING-PG-STAT-REPLICATION-VIEW>`__.

**Example:**

.. code-block:: json
:emphasize-lines: 4,5,6,7,8
{
"SqlSettings": {
"ReplicaLagSettings": [
{
"DataSource": "postgres://mmuser:password@localhost:5432/mattermost_test?sslmode=disable&connect_timeout=10.",
"QueryAbsoluteLag": "select usename, pg_wal_lsn_diff(pg_current_wal_lsn(),replay_lsn) as metric from pg_stat_replication;",
"QueryTimeLag": ""
}
]
}
}
2. Grant permissions to the database user for ``pg_monitor``. This user should be the same user configured above in the ``DataSource`` string.

For more information on roles, see the `PostgreSQL documentation <https://www.postgresql.org/docs/10/default-roles.html>`__.

.. code-block:: bash
sudo -u postgres psql
postgres=# GRANT pg_monitor TO mmuser;
2. Save the config and restart all Mattermost nodes.
3. Navigate to your Grafana instance monitoring Mattermost and open the `Mattermost Performance Monitoring v2 <https://grafana.com/grafana/dashboards/15582-mattermost-performance-monitoring-v2/>`__ dashboard.
4. The ``QueryTimeLag`` chart is already setup for you utilizing the existing ``Replica Lag`` chart. If using ``QueryAbsoluteLag`` metric clone the ``Replica Lag`` chart and edit the query to use the below absolute lag metrics and modify the title to be ``Replica Lag Absolute``.

.. code-block:: none
mattermost_db_replica_lag_abs{instance=~"$server"}
.. image:: ../../source/images/database-configuration-settings-replica-lag-grafana-1.jpg
:alt: A screenshot showing how to clone a chart within Grafana


.. image:: ../../source/images/database-configuration-settings-replica-lag-grafana-2.jpg
:alt: A screenshot showing the specific edits to make to the cloned grafana chart.


.. config:setting:: database-replicamonitorintervalseconds
:displayname: Replica monitor interval (Database)
:systemconsole: N/A
Expand Down
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.

0 comments on commit ab0b9ee

Please sign in to comment.