Commit 68c53e6 · github-actions committed Aug 24, 2023 · 1 parent e9db05c

Preview PR pingcap/docs-tidb-operator#2406 and this preview is trigge…

Showing 4 changed files with 32 additions and 24 deletions.

@@ -9,14 +9,6 @@ This document describes how to back up the data of a TiDB cluster deployed acros…

The backup method described in this document is implemented based on CustomResourceDefinition (CRD) in [BR Federation](br-federation-architecture.md#br-federation-architecture-and-processes) and TiDB Operator. [BR](https://docs.pingcap.com/tidb/stable/backup-and-restore-overview) (Backup & Restore) is a command-line tool for distributed backup and recovery of the TiDB cluster data. For the underlying implementation, BR gets the backup data of the TiDB cluster, and then sends the data to the AWS storage.

> **Note:**
>
> > storage blocks on volumes that were created from snapshots must be initialized (pulled down from Amazon S3 and written to the volume) before you can access the block. This preliminary action takes time and can cause a significant increase in the latency of an I/O operation the first time each block is accessed. Volume performance is achieved after all blocks have been downloaded and written to the volume.
>
> From AWS documentation, the EBS volume restored from snapshot may have high latency before it's initialized, which can result in big performance hit of restored TiDB cluster. See details in [ebs create volume from snapshot](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-creating-volume.html#ebs-create-volume-from-snapshot).
>
> To initialize the restored volume more efficiently, you should **separate WAL and raft log to a dedicated small volume from TiKV data**. So that we can improve write performance of restored TiDB cluster by full initializing the volume of WAL and raft log.
## Usage scenarios

If you have the following requirements when backing up TiDB cluster data, you can use TiDB Operator to back up the data using volume snapshots and metadata to Amazon S3:
@@ -26,6 +18,14 @@ If you have the following requirements when backing up TiDB cluster data, you ca…

If you have any other requirements, refer to [Backup and Restore Overview](backup-restore-overview.md) and select an appropriate backup method.

## Prerequisites

> storage blocks on volumes that were created from snapshots must be initialized (pulled down from Amazon S3 and written to the volume) before you can access the block. This preliminary action takes time and can cause a significant increase in the latency of an I/O operation the first time each block is accessed. Volume performance is achieved after all blocks have been downloaded and written to the volume.

As stated in the preceding AWS documentation, an EBS volume restored from a snapshot might have high latency before it is initialized, which can cause a significant performance degradation of the restored TiDB cluster. For details, see [ebs create volume from snapshot](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-creating-volume.html#ebs-create-volume-from-snapshot).

To initialize the restored volumes more efficiently, it is recommended to **separate the WAL and Raft log into a dedicated small volume, apart from the TiKV data volume**. In this way, the smaller WAL and Raft log volume can be fully initialized quickly, which improves the write performance of the restored TiDB cluster.
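
The following fragment of a `TidbCluster` spec sketches one possible layout: the Raft log and RocksDB WAL are redirected onto a small dedicated volume declared under `spec.tikv.storageVolumes`. The volume name, size, and mount path are illustrative and should be adapted to your deployment.

```yaml
# Sketch: keep the Raft log and RocksDB WAL on a dedicated small volume,
# separate from the main TiKV data volume. Names, sizes, and paths are examples.
apiVersion: pingcap.com/v1alpha1
kind: TidbCluster
metadata:
  name: basic
spec:
  # ... other cluster fields (version, pd, tidb, and so on)
  tikv:
    requests:
      storage: "500Gi"                    # main TiKV data volume
    config: |
      [raft-engine]
        dir = "/var/lib/raft-wal/raft"    # Raft log on the dedicated volume
      [rocksdb]
        wal-dir = "/var/lib/raft-wal/wal" # RocksDB WAL on the dedicated volume
    storageVolumes:
    - name: raft-wal
      storageSize: "50Gi"
      mountPath: "/var/lib/raft-wal"
```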

## Limitations

- Snapshot backup is applicable to TiDB Operator v1.5.1 or later versions, and TiDB v6.5.4 or later versions.

@@ -5,30 +5,33 @@ summary: Learn about the common questions and solutions for EBS snapshot backup…

# FAQs on EBS Snapshot Backup and Restore across Multiple Kubernetes

This document addresses common questions and solutions related to EBS snapshot backup and restore across multiple Kubernetes environments.

## New tags on snapshots and restored volumes

Symptom: Some tags are automatically added to generated snapshots and restored EBS volumes
**Symptom:** Some tags are automatically added to generated snapshots and restored EBS volumes.

Explanation: Those new tags are added for traceability. Snapshots will inherit all tags from individual source EBS volumes, and restored EBS volumes inherit tags from source snapshots but prefix keys with `snapshot\`. Besides that, new tags like <TiDBCluster-BR: true>, <snapshot/createdFromSnapshotId, {source-snapshot-id}> are added to restored EBS volumes.
**Explanation:** The new tags are added for traceability. Snapshots inherit all tags from the individual source EBS volumes, while restored EBS volumes inherit tags from the source snapshots but prefix keys with `snapshot\`. Additionally, new tags such as `<TiDBCluster-BR: true>`, `<snapshot/createdFromSnapshotId, {source-snapshot-id}>` are added to restored EBS volumes.

## Backup initialization failed

Symptom: You get the error that contains `GC safepoint 443455494791364608 exceed TS 0` when backup are initializing.
**Symptom:** You get the error that contains `GC safepoint 443455494791364608 exceed TS 0` when the backup is initializing.

**Solution:** This issue might occur if you have disabled the feature of "resolved ts" in TiKV or PD. Check the configuration of TiKV and PD:

Solution: Probably you have forbidden the feature of "resolved ts" in TiKV or PD, so you should check the configuration of TiKV and PD.
For TiKV configuration, confirm if you set `resolved-ts.enable = false` or `raftstore.report-min-resolved-ts-interval = "0s"`. If you set, please remove the configuration.
For PD configuration, confirm if you set `pd-server.min-resolved-ts-persistence-interval = "0s"`. If you set, please remove the configuration.
- For TiKV, confirm if you set `resolved-ts.enable = false` or `raftstore.report-min-resolved-ts-interval = "0s"`. If so, remove these configurations.
- For PD, confirm if you set `pd-server.min-resolved-ts-persistence-interval = "0s"`. If so, remove this configuration.
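
For reference, the following sketch shows where these settings live when TiKV and PD are configured through a `TidbCluster` CR. It only illustrates the configuration keys mentioned above; the surrounding fields are placeholders.

```yaml
# Sketch: keep the "resolved ts" feature enabled so that EBS snapshot backup
# can initialize. Remove the disabling settings if they are present in your CR.
spec:
  tikv:
    config: |
      [resolved-ts]
        enable = true                               # must not be set to false
      # [raftstore]
      # report-min-resolved-ts-interval = "0s"      # remove this setting if present
  pd:
    config: |
      # [pd-server]
      # min-resolved-ts-persistence-interval = "0s" # remove this setting if present
```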

## Backup failed due to execution twice

**Issue:** [#5143](https://github.com/pingcap/tidb-operator/issues/5143)

Symptom: You get the error that contains `backup meta file exists`, and you can find the backup pod is scheduled twice.
**Symptom:** You get the error that contains `backup meta file exists`, and the backup pod is scheduled twice.

Solution: Probably the first backup pod is evicted by Kubernetes due to node resource pressure. You can configure `PriorityClass` and `ResourceRequirements` to reduce the possibility of eviction. Please refer to the [comment of issue](https://github.com/pingcap/tidb-operator/issues/5143#issuecomment-1654916830).
**Solution:** This issue might occur if the first backup pod is evicted by Kubernetes due to node resource pressure. You can configure `PriorityClass` and `ResourceRequirements` to reduce the possibility of eviction. For more details, refer to the [comment of issue](https://github.com/pingcap/tidb-operator/issues/5143#issuecomment-1654916830).
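
As an illustration, the following `Backup` CR fragment shows where such settings could go, assuming that your TiDB Operator version exposes `priorityClassName` and `resources` in the Backup spec (check the CRD reference for your version). The PriorityClass name and resource values are placeholders.

```yaml
# Sketch: reduce the chance that the backup pod is evicted under node resource
# pressure. Field availability depends on the TiDB Operator version.
apiVersion: pingcap.com/v1alpha1
kind: Backup
metadata:
  name: demo-backup
spec:
  priorityClassName: high-priority-backup   # a PriorityClass you have created beforehand
  resources:
    requests:
      cpu: "1"
      memory: 2Gi
    limits:
      cpu: "2"
      memory: 4Gi
  # ... other backup fields (cluster reference, storage provider, and so on)
```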

## Save time for backup by controlling snapshot size calculation level

Symptom: Scheduled backup can't be finished in expected window due to the cost of snapshot size calculation.
**Symptom:** Scheduled backup can't be completed in the expected window due to the cost of snapshot size calculation.

Solution: By default, both full size and incremental size are calculated by calling AWS service. And the calculation might cost minutes of time. You can set `spec.template.calcSizeLevel` to `full` to skip incremental size calculation, set the value to `incremental` to skip full size calculation, and set the value to `none` to skip both.
**Solution:** By default, both full size and incremental size are calculated by calling the AWS service, which might take several minutes. You can set `spec.template.calcSizeLevel` to `full` to skip incremental size calculation, set it to `incremental` to skip full size calculation, and set it to `none` to skip both calculations.
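
For example, the following sketch sets the field on a BR Federation `VolumeBackup` CR. The kind, apiVersion, and metadata shown here are assumptions; verify them against the CRD reference for your TiDB Operator version.

```yaml
# Sketch: skip snapshot size calculation entirely to shorten the backup window.
apiVersion: federation.pingcap.com/v1alpha1
kind: VolumeBackup
metadata:
  name: demo-volume-backup
spec:
  template:
    calcSizeLevel: none   # "full" skips incremental size; "incremental" skips full size
    # ... other backup template fields
```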

@@ -25,11 +25,11 @@ BR Federation coordinates `Backup` and `Restore` Custom Resources (CRs) in the d…

The backup process in the data plane consists of three phases:

1. **Phase One:** Request PD to pause region scheduling and Garbage Collection (GC). As each TiKV instance might take snapshots at different times, pausing scheduling and GC can avoid data inconsistencies between TiKV instances during snapshot taking. Since the TiDB components are interconnected across multiple Kubernetes clusters, executing this operation in one Kubernetes cluster affects the entire TiDB cluster.
1. **Phase One:** TiDB Operator schedules a backup pod to request PD to pause region scheduling and Garbage Collection (GC). As each TiKV instance might take snapshots at different times, pausing scheduling and GC can avoid data inconsistencies between TiKV instances during snapshot taking. Since the TiDB components are interconnected across multiple Kubernetes clusters, executing this operation in one Kubernetes cluster affects the entire TiDB cluster.

2. **Phase Two:** Collect meta information such as `TidbCluster` CR and EBS volumes, and then request AWS API to create EBS snapshots. This phase must be executed in each Kubernetes cluster.
2. **Phase Two:** TiDB Operator collects meta information such as the `TidbCluster` CR and EBS volumes, and then schedules another backup pod to request the AWS API to create EBS snapshots. This phase must be executed in each Kubernetes cluster.

3. **Phase Three:** After EBS snapshots are completed, resume region scheduling and GC for the TiDB cluster. This operation is required only in the Kubernetes cluster where Phase One was executed.
3. **Phase Three:** After EBS snapshots are completed, TiDB Operator deletes the first backup pod to resume region scheduling and GC for the TiDB cluster. This operation is required only in the Kubernetes cluster where Phase One was executed.

![backup process in data plane](/media/volume-backup-process-data-plane.png)

@@ -45,11 +45,11 @@ The orchestration process of `Backup` from the control plane to the data plane i…

The restore process in the data plane consists of three phases:

1. **Phase One:** Call the AWS API to restore the EBS volumes using EBS snapshots based on the backup information. The volumes are then mounted onto the TiKV nodes, and TiKV instances are started in recovery mode. This phase must be executed in each Kubernetes cluster.
1. **Phase One:** TiDB Operator schedules a restore pod to request the AWS API to restore the EBS volumes using EBS snapshots based on the backup information. The volumes are then mounted onto the TiKV nodes, and TiKV instances are started in recovery mode. This phase must be executed in each Kubernetes cluster.

2. **Phase Two:** Use BR to restore all raft logs and KV data in TiKV instances to a consistent state, and then instructs TiKV instances to exit recovery mode. As TiKV instances are interconnected across multiple Kubernetes clusters, this operation can restore all TiKV data and only needs to be executed in one Kubernetes cluster.
2. **Phase Two:** TiDB Operator schedules another restore pod to restore all raft logs and KV data in TiKV instances to a consistent state, and then instructs TiKV instances to exit recovery mode. As TiKV instances are interconnected across multiple Kubernetes clusters, this operation can restore all TiKV data and only needs to be executed in one Kubernetes cluster.

3. **Phase Three:** Restart all TiKV instances to run in normal mode, and start TiDB finally. This phase must be executed in each Kubernetes cluster.
3. **Phase Three:** TiDB Operator restarts all TiKV instances to run in normal mode, and finally starts TiDB. This phase must be executed in each Kubernetes cluster.

![restore process in data plane](/media/volume-restore-process-data-plane.png)


@@ -28,6 +28,11 @@ Before restoring a TiDB cluster across multiple Kubernetes clusters from EBS vol…
- Deploy a TiDB cluster across multiple Kubernetes clusters that you want to restore data to. For detailed steps, refer to [Deploy a TiDB Cluster across Multiple Kubernetes Clusters](deploy-tidb-cluster-across-multiple-kubernetes.md).
- When deploying the TiDB cluster, add the `recoveryMode: true` field to the spec of `TidbCluster`.

> **Note:**
>
> An EBS volume restored from a snapshot might have high latency before it is initialized, which can cause a significant performance degradation of the restored TiDB cluster. For details, see [ebs create volume from snapshot](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-creating-volume.html#ebs-create-volume-from-snapshot).
>
> Therefore, it is recommended that you configure `spec.template.warmup: sync` to initialize the TiKV volumes automatically during the restoration process, as shown in the sketch below.

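The following sketch combines the two settings mentioned above. Apart from `recoveryMode` and `spec.template.warmup`, all names are placeholders, and the `VolumeRestore` kind and apiVersion are assumptions to verify against your TiDB Operator version.

```yaml
# Sketch: a cluster deployed in recovery mode, and a federation restore that
# warms up (initializes) the restored TiKV volumes synchronously.
apiVersion: pingcap.com/v1alpha1
kind: TidbCluster
metadata:
  name: basic
spec:
  recoveryMode: true      # required when restoring from EBS volume snapshots
  # ... other cluster fields
---
apiVersion: federation.pingcap.com/v1alpha1
kind: VolumeRestore
metadata:
  name: demo-volume-restore
spec:
  template:
    warmup: sync          # initialize restored TiKV volumes during the restore
    # ... other restore template fields
```
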
## Restore process

### Step 1. Set up the environment for EBS volume snapshot restore in every data plane