sync-diff-inspector: Use md5 func replace crc32 for checksum #707
@lance6716 @Leavrth PTAL, thanks.
Hi, I think MD5 is much slower than CRC32. Can you test their performance?
@@ -770,24 +770,21 @@ func GetCountAndCRC32Checksum(ctx context.Context, db *sql.DB, schemaName, table
 		columnIsNull = append(columnIsNull, fmt.Sprintf("ISNULL(%s)", name))
 	}

-	query := fmt.Sprintf("SELECT COUNT(*) as CNT, BIT_XOR(CAST(CRC32(CONCAT_WS(',', %s, CONCAT(%s)))AS UNSIGNED)) as CHECKSUM FROM %s WHERE %s;",
-		strings.Join(columnNames, ", "), strings.Join(columnIsNull, ", "), dbutil.TableName(schemaName, tableName), limitRange)
+	query := fmt.Sprintf("SELECT COUNT(*) as CNT, BIT_XOR(CAST(CONV(SUBSTRING(MD5(CONCAT_WS(',', %s, CONCAT(%s))), 1, 16), 16, 10) AS UNSIGNED)) LMD5, BIT_XOR(CAST(CONV(SUBSTRING(MD5(CONCAT_WS(',', %s, CONCAT(%s))), 17, 16), 16, 10) AS UNSIGNED)) RMD5 FROM %s WHERE %s;",
😭 Unfortunately, it will take twice as long.
Currently TiDB does not optimize this kind of SQL: the subexpression MD5(CONCAT_WS(',', %s, CONCAT(%s))) will be calculated twice.
pingcap/tidb#39576
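To make the cost discussed above concrete, here is a minimal client-side sketch in Go of what the LMD5 and RMD5 columns represent: one MD5 digest per row, split into two 64-bit halves, each XOR-ed across rows. It is an illustration only, not how sync-diff-inspector computes the checksum (that happens in SQL). The names `rowDigest` and `chunkChecksum` are hypothetical, and NULL handling (the `ISNULL` suffix and `CONCAT_WS` skipping NULLs) is deliberately left out.

```go
package main

import (
	"crypto/md5"
	"encoding/binary"
	"strings"
)

// rowDigest (hypothetical helper) hashes the comma-joined column values once
// and returns the two 64-bit halves of the 128-bit MD5 digest. The left half
// corresponds to CONV(SUBSTRING(md5hex, 1, 16), 16, 10) and the right half to
// CONV(SUBSTRING(md5hex, 17, 16), 16, 10).
// Assumption: no column is NULL, so a plain join stands in for CONCAT_WS.
func rowDigest(cols []string) (left, right uint64) {
	sum := md5.Sum([]byte(strings.Join(cols, ",")))
	return binary.BigEndian.Uint64(sum[:8]), binary.BigEndian.Uint64(sum[8:])
}

// chunkChecksum (hypothetical helper) mirrors COUNT(*) plus the two BIT_XOR
// aggregates: it XORs each half over all rows, so the result does not depend
// on row order.
func chunkChecksum(rows [][]string) (count int, lmd5, rmd5 uint64) {
	for _, row := range rows {
		l, r := rowDigest(row)
		lmd5 ^= l
		rmd5 ^= r
		count++
	}
	return count, lmd5, rmd5
}
```

The point of the sketch is that a single digest per row is enough to produce both halves; the SQL in the diff has to spell out the MD5 call twice only because each aggregate column needs its own expression.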
I use sync-diff-inspector to check 2 TB of data in my production environment.
How many columns do the test tables have? It shows that the speed of
There are 93 tables. The small one has 5 columns and the big one has 25 columns.
mysql> SELECT COUNT(1) AS CNT, BIT_XOR(CAST(CRC32(CONCAT_WS(',', `s_i_id`, `s_w_id`, `s_quantity`, `s_dist_01`, `s_dist_02`, `s_dist_03`, `s_dist_04`, `s_dist_05`, `s_dist_06`, `s_dist_07`, `s_dist_08`, `s_dist_09`, `s_dist_10`, `s_ytd`, `s_order_cnt`, `s_remote_cnt`, `s_data`, CONCAT(ISNULL(`s_i_id`), ISNULL(`s_w_id`))))AS UNSIGNED)) as BIT_XOR_CHECKSUM FROM test.stock;
+----------+------------------+
| CNT      | BIT_XOR_CHECKSUM |
+----------+------------------+
| 12800000 |        385643470 |
+----------+------------------+
1 row in set (5.97 sec)
mysql> SELECT COUNT(1) as CNT, BIT_XOR(CAST(CONV(SUBSTRING(@crc, 1, 16), 16, 10) AS UNSIGNED)) LMD5, BIT_XOR(CAST(CONV(SUBSTRING(@crc :=MD5(CONCAT_WS(',', `s_i_id`, `s_w_id`, `s_quantity`, `s_dist_01`, `s_dist_02`, `s_dist_03`, `s_dist_04`, `s_dist_05`, `s_dist_06`, `s_dist_07`, `s_dist_08`, `s_dist_09`, `s_dist_10`, `s_ytd`, `s_order_cnt`, `s_remote_cnt`, `s_data`, CONCAT(ISNULL(`s_i_id`), ISNULL(`s_w_id`)))), 17, 16), 16, 10) AS UNSIGNED)) RMD5 FROM `test`.`stock` WHERE ((TRUE) AND (TRUE));
+----------+----------------------+----------------------+
| CNT      | LMD5                 | RMD5                 |
+----------+----------------------+----------------------+
| 12800000 | 17055727021754144759 | 17409312425610765406 |
+----------+----------------------+----------------------+
1 row in set (1 min 37.94 sec)
mysql> SELECT COUNT(1) as CNT, BIT_XOR(CAST(CONV(SUBSTRING(MD5(CONCAT_WS(',', `s_i_id`, `s_w_id`, `s_quantity`, `s_dist_01`, `s_dist_02`, `s_dist_03`, `s_dist_04`, `s_dist_05`, `s_dist_06`, `s_dist_07`, `s_dist_08`, `s_dist_09`, `s_dist_10`, `s_ytd`, `s_order_cnt`, `s_remote_cnt`, `s_data`, CONCAT(ISNULL(`s_i_id`), ISNULL(`s_w_id`)))), 1, 16), 16, 10) AS UNSIGNED)) LMD5, BIT_XOR(CAST(CONV(SUBSTRING(MD5(CONCAT_WS(',', `s_i_id`, `s_w_id`, `s_quantity`, `s_dist_01`, `s_dist_02`, `s_dist_03`, `s_dist_04`, `s_dist_05`, `s_dist_06`, `s_dist_07`, `s_dist_08`, `s_dist_09`, `s_dist_10`, `s_ytd`, `s_order_cnt`, `s_remote_cnt`, `s_data`, CONCAT(ISNULL(`s_i_id`), ISNULL(`s_w_id`)))), 17, 16), 16, 10) AS UNSIGNED)) RMD5 FROM `test`.`stock` WHERE ((TRUE) AND (TRUE));
+----------+----------------------+----------------------+
| CNT      | LMD5                 | RMD5                 |
+----------+----------------------+----------------------+
| 12800000 | 18443984006065217657 | 17409312425610765406 |
+----------+----------------------+----------------------+
1 row in set (11.99 sec)
mysql> SELECT COUNT(1) as CNT, BIT_XOR(CAST(CONV(SUBSTRING(MD5(CONCAT_WS(',', `s_i_id`, `s_w_id`, `s_quantity`, `s_dist_01`, `s_dist_02`, `s_dist_03`, `s_dist_04`, `s_dist_05`, `s_dist_06`, `s_dist_07`, `s_dist_08`, `s_dist_09`, `s_dist_10`, `s_ytd`, `s_order_cnt`, `s_remote_cnt`, `s_data`, CONCAT(ISNULL(`s_i_id`), ISNULL(`s_w_id`)))), 1, 16), 16, 10) AS UNSIGNED)) LMD5 FROM `test`.`stock` WHERE ((TRUE) AND (TRUE));
+----------+----------------------+
| CNT      | LMD5                 |
+----------+----------------------+
| 12800000 | 18443984006065217657 |
+----------+----------------------+
1 row in set (7.52 sec)
If use
the data is generated by
Closed by #787.
What problem does this PR solve?
Issue Number: close #703 #634
What is changed and how it works?
Use the MD5 function instead of CRC32 for the checksum, so that the data check is more accurate.
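One plausible reading of "more accurate" is simply digest width: CRC32 yields 32 bits per row, while MD5 yields 128 bits, so two different rows are far less likely to share a value. The following Go sketch only illustrates the two widths using the standard library; `crc32Row` and `md5Row` are hypothetical names, and the sketch makes no claim about the specific failures reported in the linked issues.

```go
package main

import (
	"crypto/md5"
	"hash/crc32"
)

// crc32Row (hypothetical helper) returns the 32-bit IEEE CRC of a row's
// concatenated text.
func crc32Row(row string) uint32 {
	return crc32.ChecksumIEEE([]byte(row))
}

// md5Row (hypothetical helper) returns the 128-bit (16-byte) MD5 digest of
// the same text.
func md5Row(row string) [md5.Size]byte {
	return md5.Sum([]byte(row))
}
```

With a 32-bit value there are only 2^32 possible per-row checksums, whereas the LMD5 and RMD5 pair together carries all 128 bits of the digest.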
Check List
Tests
Code changes
Side effects
Related changes