Add translations for disaster recovery
Meggielqk committed Sep 4, 2024
1 parent 3e1f60b commit b9f7eed
Showing 6 changed files with 83 additions and 29 deletions.
24 changes: 14 additions & 10 deletions en_US/durability/management.md
@@ -4,22 +4,26 @@ This document provides references and instructions for configuring, managing, an

## Configuration Parameters

MQTT Durable Sessions configuration is divided into 2 main categories:
MQTT Durable Sessions configuration is divided into two main categories:

- `durable_sessions`: Contains settings related to MQTT clients' sessions, including how they consume data from durable storage and data retention parameters.
- `durable_storage`: Manages the settings of the durable storage system holding the MQTT message data.

### Durable Sessions Configuration

| Parameter | Description |
| ------------------------------------------- | ------------------------------------------------------------ |
| `durable_sessions.enable` | Enables session durability. Note: Restart of the EMQX node is required for changes to take effect. |
| `durable_sessions.batch_size` | Controls the maximum size of message batches consumed from the storage by durable sessions. |
| `durable_sessions.idle_poll_interval` | Controls the frequency of querying the storage for new messages by durable sessions. If new messages are found, the next batch is retrieved immediately if the client's in-flight queue has space. |
| `durable_sessions.heartbeat_interval` | Specifies the interval for saving session metadata. |
| `durable_sessions.renew_streams_interval` | Defines how often sessions query the storage for new streams. |
| `durable_sessions.session_gc_interval` | Specifies the interval for sweeping through sessions and deleting expired ones. |
| `durable_sessions.message_retention_period` | Defines the retention period of MQTT messages in durable sessions. Note: this parameter is global. |
You can configure the parameters for durable sessions in the Dashboard. Click **Management** -> **MQTT Settings** in the left menu of the Dashboard, and then select the **Durable Session** tab to configure the parameters.

<img src="./assets/dashboard_session_config.png" alt="dashboard_session_config" style="zoom:67%;" />

| Parameter | Dashboard UI | Description |
| ------------------------------------------- | -------------------------- | ------------------------------------------------------------ |
| `durable_sessions.enable` | Enable Durable Sessions | Enables session durability. This configuration item cannot be modified through hot configuration; you need to set it in the configuration file. Note: Restart of the EMQX node is required for changes to take effect. |
| `durable_sessions.message_retention_period` | Message Retention Period | Defines the retention period of MQTT messages in durable sessions. Note: this parameter is global. |
| `durable_sessions.batch_size` | Message Query Batch Size | Controls the maximum size of message batches consumed from the storage by durable sessions. |
| `durable_sessions.idle_poll_interval` | Idle Poll Interval | Controls the frequency of querying the storage for new messages by durable sessions. If new messages are found, the next batch is retrieved immediately if the client's in-flight queue has space. |
| `durable_sessions.heartbeat_interval` | Session Heartbeat Interval | Specifies the interval for saving session metadata. |
| `durable_sessions.renew_streams_interval` | - | Defines how often sessions query the storage for new streams. |
| `durable_sessions.session_gc_interval` | Session GC Interval | Specifies the interval for sweeping through sessions and deleting expired ones. |


The following parameters can be overridden per [zone](../configuration/configuration.md#zone-override):
22 changes: 12 additions & 10 deletions en_US/durability/managing-replication.md
@@ -89,20 +89,22 @@ This approach minimizes the volume of data transferred between sites, while ensu

## Recover from Disasters

When things go extremely wrong it's important to know how to recover efficiently. This section provides guidance on how to recover from common disaster scenarios.
When disasters occur, knowing how to efficiently recover is crucial to maintaining service continuity. This section provides guidance on recovering from common disaster scenarios.

### Complete Loss of a Node

Probably the most common disaster scenario is losing a node completely, due to a unrecoverable hardware failure, disk corruption or plain human mistake.
One of the most common disaster scenarios is the complete loss of a node, which can occur due to unrecoverable hardware failure, disk corruption, or even human error.

1. Once a node is completely lost, availability is partially compromised. Hence, it's probably a good idea to first restore desired availability, by moving the lost node's shards to other sites.
1. Restore availability by reallocating shards.

If a node is completely lost, the cluster's availability is compromised to some extent. The first step is to restore availability by reallocating the lost node’s shards to other nodes in the cluster.

Usual `leave` command should be enough to achieve this. It works even if the node is not reachable. However, in this case transitions may take longer time to complete.
You can use the standard `leave` command to achieve this. This command can still function even if the lost node is unreachable, although the transition may take longer to complete.
```shell
$ emqx ctl ds leave messages 5C6028D6CE9459C7 # Here, 5C6028D6CE9459C7 is the lost node's Site ID
```

2. Watch the cluster status, transitions should eventually complete.
$ emqx ctl ds leave messages 5C6028D6CE9459C7 # Here, 5C6028D6CE9459C7 is the lost node's Site ID
```
2. Monitor the cluster status and wait for all shard transitions to complete successfully. Ensure there are no more transitions before proceeding to the next step.

```shell
$ emqx ctl ds info
@@ -119,10 +121,10 @@ Probably the most common disaster scenario is losing a node completely, due to a
<...>
```

3. Once there are no more transitions, it's time to tell the cluster that the lost node is not coming back.
3. Once all shard transitions are complete, you need to inform the cluster that the lost node will not be returning.

```shell
$ emqx ctl ds forget messages 5C6028D6CE9459C7
```

It's very important to perform this step if the plan is to replace the lost node with a new one, preserving the original node name. Otherwise, the cluster will have the same node name known under two different Site IDs, which will cause a lot of confusion down the road.
This step is crucial if you plan to replace the lost node with a new one using the original node name. Failing to do so could result in the cluster recognizing the same node name under two different Site IDs, leading to significant confusion and potential issues.
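
The three steps above can be chained in a script. The following is a minimal sketch, not an official tool: `has_pending_transitions` is a hypothetical helper name, the sample text is modeled on the `emqx ctl ds info` output from step 2 with placeholder node names, and the check assumes that a finished transition disappears from the `REPLICA TRANSITIONS` section and that the Site ID does not appear in any section printed after it.

```shell
#!/bin/sh
# Hypothetical helper (not part of EMQX): reads `emqx ctl ds info` output on
# stdin and exits 0 while the given Site ID is still listed after the
# "REPLICA TRANSITIONS:" header, 1 otherwise.
has_pending_transitions() {
  awk -v site="$1" '
    /REPLICA TRANSITIONS:/ { in_section = 1; next }
    in_section && index($0, site) > 0 { found = 1 }
    END { exit (found ? 0 : 1) }
  '
}

# Sample modeled on the output in step 2; node names are placeholders.
sample_info="SITES:
D8894F95DC86DFDB 'emqx@node-a.example' up
5C6028D6CE9459C7 'emqx@node-b.example' (x) down
REPLICA TRANSITIONS:
Shard Transitions
messages/0 -5C6028D6CE9459C7 +D8894F95DC86DFDB"

if printf '%s\n' "$sample_info" | has_pending_transitions 5C6028D6CE9459C7; then
  echo "transitions pending"
else
  echo "no transitions pending"
fi
```

Against a live cluster, the same check could gate step 3, for example `while emqx ctl ds info | has_pending_transitions "$SITE"; do sleep 10; done` before running `emqx ctl ds forget messages "$SITE"`. The 10-second poll interval is arbitrary, and the loop never ends if a transition cannot complete, so a timeout may be worth adding.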
23 changes: 14 additions & 9 deletions zh_CN/durability/management.md
@@ -11,17 +11,22 @@

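### Durable Sessions Configuration

| Parameter | Description |
| ------------------------------------------- | ------------------------------------------------------------ |
| `durable_sessions.enable` | Enables session durability. Note: The EMQX node must be restarted for changes to take effect. |
| `durable_sessions.batch_size` | Controls the maximum size of message batches consumed from the storage by durable sessions. |
| `durable_sessions.idle_poll_interval` | Controls how often durable sessions query for new messages. If new messages are found, the next batch is retrieved from the storage immediately, provided the client's in-flight queue has space. |
| `durable_sessions.heartbeat_interval` | Specifies the interval for saving session metadata. |
| `durable_sessions.renew_streams_interval` | Defines how often sessions query the storage for new streams. |
| `durable_sessions.session_gc_interval` | Specifies the interval for sweeping through sessions and deleting expired ones. |
| `durable_sessions.message_retention_period` | Defines the retention period of MQTT messages in durable sessions. Note: this parameter is global. |
You can configure the parameters for durable sessions in the Dashboard. Click **Management** -> **MQTT Settings** in the left menu of the Dashboard, and then select the **Durable Session** tab to configure the parameters.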

<img src="./assets/dashboard_session_config.png" alt="dashboard_session_config" style="zoom:67%;" />

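| Parameter | Dashboard UI | Description |
| ------------------------------------------- | -------------------------- | ------------------------------------------------------------ |
| `durable_sessions.enable` | Enable Durable Sessions | Enables session durability. This setting cannot be changed through hot configuration; set it to enabled or disabled in the configuration file. Note: The EMQX node must be restarted for changes to take effect. |
| `durable_sessions.message_retention_period` | Message Retention Period | Defines the retention period of MQTT messages in durable sessions. Note: this parameter is global. |
| `durable_sessions.batch_size` | Message Query Batch Size | Controls the maximum size of message batches consumed from the storage by durable sessions. |
| `durable_sessions.idle_poll_interval` | Idle Poll Interval | Controls how often durable sessions query for new messages. If new messages are found, the next batch is retrieved from the storage immediately, provided the client's in-flight queue has space. |
| `durable_sessions.heartbeat_interval` | Session Heartbeat Interval | Specifies the interval for saving session metadata. |
| `durable_sessions.renew_streams_interval` | - | Defines how often sessions query the storage for new streams. |
| `durable_sessions.session_gc_interval` | Session GC Interval | Specifies the interval for sweeping through sessions and deleting expired ones. |

The following parameters can be overridden at the [zone](../configuration/configuration.md#zone-override) level: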

- `durable_sessions.enable`
- `durable_sessions.batch_size`
- `durable_sessions.idle_poll_interval`
43 changes: 43 additions & 0 deletions zh_CN/durability/managing-replication.md
@@ -81,3 +81,46 @@ $ emqx ctl ds set_replicas messages <Site ID 1> <Site ID 2> ...
```

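This approach minimizes the volume of data transferred between sites, while ensuring the replication factor is maintained as far as possible.

## Recover from Disasters

When disasters occur, knowing how to recover nodes efficiently is crucial to maintaining service continuity. This section provides guidance on recovering nodes from common disaster scenarios.

### Complete Loss of a Node

One of the most common disaster scenarios is the complete loss of a node, which can be caused by unrecoverable hardware failure, disk corruption, or human error.

1. Restore availability by reallocating shards.

If a node is completely lost, the cluster's availability is compromised to some extent. The first step is to restore availability by reallocating the lost node's shards to other nodes in the cluster.

You can use the standard `leave` command to achieve this. The command still works even if the lost node is unreachable, although the transitions may take longer to complete.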

```shell
$ emqx ctl ds leave messages 5C6028D6CE9459C7 # Here, 5C6028D6CE9459C7 is the lost node's Site ID
```

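2. Monitor the cluster status and wait for all shard transitions to complete successfully. Ensure there are no more transitions before proceeding to the next step.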

```shell
$ emqx ctl ds info
<...>
SITES:
D8894F95DC86DFDB '[email protected]' up
5C6028D6CE9459C7 '[email protected]' (x) down
<...>
REPLICA TRANSITIONS:
Shard Transitions
messages/0 -5C6028D6CE9459C7 +D8894F95DC86DFDB
<...>
```

3. Once all shard transitions are complete, you need to inform the cluster that the lost node will not be returning.

```shell
$ emqx ctl ds forget messages 5C6028D6CE9459C7
```

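This step is crucial if you plan to replace the lost node with a new one using the original node name. Otherwise, the cluster may recognize the same node name under two different Site IDs, leading to serious confusion and potential issues.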
