Can't restore from mysql 8.0 logical dump if charset is utf8 #31790

Closed
morgo opened this issue Jan 19, 2022 · 7 comments · Fixed by #44655
Labels
compatibility-mysql8 This is a compatibility issue with MySQL 8.0(but NOT 5.7) feature/discussing This feature request is discussing among product managers sig/sql-infra SIG: SQL Infra type/enhancement The issue or PR belongs to an enhancement.

Comments

@morgo
Contributor

morgo commented Jan 19, 2022

Bug Report

Please answer these questions before submitting your issue. Thanks!

1. Minimal reproduce step (Required)

In MySQL 8.0, the charset utf8 is deprecated in favor of utf8mb4. A table previously created with 'utf8' still uses the mb3 version for compatibility, but the 'show create table' output now reports the charset as utf8mb3.

In MySQL 8.0:

DROP TABLE IF EXISTS t1;
CREATE TABLE t1 (a int) DEFAULT CHARSET=utf8;
SHOW CREATE TABLE t1\G

The output of SHOW CREATE TABLE t1 cannot be restored on TiDB if new collations are enabled (update: this also affects the default config):

mysql [localhost:8027] {msandbox} (test) > SHOW CREATE TABLE t1\G
*************************** 1. row ***************************
       Table: t1
Create Table: CREATE TABLE `t1` (
  `a` int DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb3
1 row in set (0.01 sec)

2. What did you expect to see? (Required)

Success

3. What did you see instead (Required)

tidb> CREATE TABLE `t1` (
    ->   `a` int DEFAULT NULL
    -> ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb3
    -> ;
ERROR 1115 (42000): Unknown character set: 'utf8mb3'

It would be nice if this instead internally remapped to utf8mb4 since that makes sense.

4. What is your TiDB version? (Required)

tidb_version(): Release Version: v5.5.0-alpha-136-g50704075a
Edition: Community
Git Commit Hash: 50704075afa7c0e3f2aa1fc9a66f440884a8f3fe
Git Branch: master
UTC Build Time: 2022-01-18 00:29:17
GoVersion: go1.16.9
Race Enabled: false
TiKV Min Version: v3.0.0-60965b006877ca7234adaced7890d7b029ed1306
Check Table Before Drop: false
1 row in set (0.00 sec)
@morgo morgo added the type/bug The issue is confirmed as a bug. label Jan 19, 2022
@djshow832 djshow832 self-assigned this Jan 19, 2022
@jebter jebter added affects-5.0 This bug affects 5.0.x versions. affects-5.1 This bug affects 5.1.x versions. affects-5.2 This bug affects 5.2.x versions. affects-5.3 This bug affects 5.3.x versions. affects-5.4 This bug affects 5.4.x versions. labels Jan 19, 2022
@bb7133
Member

bb7133 commented Jan 19, 2022

I don't think it is a bug since:

  1. We have not explicitly announced MySQL 8 support.
  2. utf8mb3 was never implemented (it was simply not in the plan): https://docs.pingcap.com/tidb/stable/character-set-and-collation

@bb7133 bb7133 added type/enhancement The issue or PR belongs to an enhancement. and removed type/bug The issue is confirmed as a bug. severity/major affects-5.0 This bug affects 5.0.x versions. affects-5.1 This bug affects 5.1.x versions. affects-5.2 This bug affects 5.2.x versions. affects-5.3 This bug affects 5.3.x versions. affects-5.4 This bug affects 5.4.x versions. labels Jan 19, 2022
@bb7133
Member

bb7133 commented Jan 19, 2022

However, I think we need to support utf8mb3 for MySQL 8 compatibility. My preference is to add this issue to the 'Essential' section of #7968.

@morgo @djshow832 WDYT?

@djshow832
Contributor

utf8mb4 can be considered a superset of utf8mb3, so implicitly mapping utf8mb3 to utf8mb4 won't block users from migrating from MySQL.

My concerns are:

  • Users may be confused when they see that the charset has changed to utf8mb4.
  • The maximum length of each character becomes 4 bytes (once users insert supplementary characters), so users may need to reconsider the maximum length of string types.

@morgo
Contributor Author

morgo commented Jan 19, 2022

  • Users may be confused when they see that the charset has changed to utf8mb4.

utf8mb3 is deprecated in MySQL, so changing it is a feature. MySQL itself can't change it automatically because the collations would differ. In our case, our utf8mb4 collations are not strictly compatible anyway.

  • The maximum length of each character becomes 4 bytes (once users insert supplementary characters), so users may need to reconsider the maximum length of string types.

This is a minor issue. All the existing data (assuming it is imported from a dump file) would still be at most 3 bytes per character. The difference would only matter for newly inserted 4-byte characters. The original motivation for MySQL using 3 bytes was that many in-memory internal buffers were fixed length, so utf8mb4 performed worse (and at the time utf8 was up to 6 bytes). This is no longer true, so there is no longer a use case for 3-byte utf8.
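
To make the byte-length point concrete, here is a small Python sketch (not TiDB or MySQL code). The helper names utf8_width and worst_case_bytes are made up for illustration. It shows that characters in the Basic Multilingual Plane need at most 3 bytes in UTF-8, that supplementary characters need 4, and how the worst-case size of an N-character string grows when the per-character maximum goes from 3 to 4.

```python
def utf8_width(ch: str) -> int:
    """Number of bytes the single character `ch` occupies in UTF-8."""
    return len(ch.encode("utf-8"))


def worst_case_bytes(n_chars: int, max_bytes_per_char: int) -> int:
    """Upper bound on the encoded size of a string of `n_chars` characters.

    Hypothetical helper: pass 3 for a 3-byte charset and 4 for a 4-byte one.
    """
    return n_chars * max_bytes_per_char


if __name__ == "__main__":
    # ASCII, Latin-1 supplement, CJK (all in the BMP), then an emoji (supplementary).
    for ch in ("a", "\u00e9", "\u4e2d", "\U0001F600"):
        print(f"U+{ord(ch):04X} takes {utf8_width(ch)} byte(s)")
    print(worst_case_bytes(255, 3), worst_case_bytes(255, 4))
```

Data that came out of a 3-byte column keeps the same encoded size after the switch. Only the worst-case bound changes.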

@morgo
Contributor Author

morgo commented Jan 19, 2022

I thought this was because of new_collations_enabled_on_first_bootstrap = true, but it appears not. It applies to all configurations.

MySQL considers the two create table statements identical:

CREATE TABLE t1 (a int) DEFAULT CHARSET=utf8mb3;
CREATE TABLE t1 (a int) DEFAULT CHARSET=utf8;

TiDB only permits the second statement. MySQL 8.0 changed the show create table output so that a table created with the second statement is displayed using the first, which is a compatibility issue.
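
Since MySQL treats the two spellings as identical, one possible stopgap on the user side is to rewrite the charset name in the dump's DDL before loading it into TiDB. The following Python sketch is an assumption-laden illustration, not a supported tool. rewrite_utf8mb3 is a hypothetical helper; it only touches CHARSET / CHARACTER SET clauses, leaves collation names such as utf8mb3_bin untouched, and would also rewrite matching text that happens to appear inside string literals.

```python
import re

# Matches "CHARSET=utf8mb3", "CHARSET utf8mb3" and "CHARACTER SET utf8mb3"
# (case-insensitive). The trailing \b keeps utf8mb4 and utf8mb3_* names out.
_CHARSET_CLAUSE = re.compile(
    r"(?P<kw>\b(?:CHARSET|CHARACTER\s+SET)\s*=?\s*)utf8mb3\b",
    re.IGNORECASE,
)


def rewrite_utf8mb3(ddl: str) -> str:
    """Replace the charset name utf8mb3 with utf8 in charset clauses.

    Hypothetical helper: a plain text rewrite, not a SQL parser.
    """
    return _CHARSET_CLAUSE.sub(lambda m: m.group("kw") + "utf8", ddl)


if __name__ == "__main__":
    ddl = (
        "CREATE TABLE `t1` (\n"
        "  `a` int DEFAULT NULL\n"
        ") ENGINE=InnoDB DEFAULT CHARSET=utf8mb3"
    )
    print(rewrite_utf8mb3(ddl))
```

This rewrites to utf8 rather than utf8mb4, so it sidesteps the question of whether the charset should be silently upgraded.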

@djshow832 djshow832 removed their assignment Jan 20, 2022
@bb7133 bb7133 added the compatibility-mysql8 This is a compatibility issue with MySQL 8.0(but NOT 5.7) label Jan 21, 2022
@bb7133
Member

bb7133 commented Jan 21, 2022

It would be nice if this instead internally remapped to utf8mb4 since that makes sense.

I'm afraid that cannot be applied. In the very early stages of TiDB, utf8 was treated as utf8mb4, since we thought it was a superset of MySQL's utf8 (that is, utf8mb3) without any performance issue. However, it caused a lot of trouble for users migrating data from TiDB to MySQL (which happens often).

So TiDB finally decided to make utf8 behave exactly the same as in MySQL, and this is why check-mb4-value-in-utf8 and treat-old-version-utf8-as-utf8mb4 were introduced.
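
To illustrate the migration problem described above, here is a rough Python sketch of the kind of check one could run on exported values before loading them into a 3-byte utf8 column. It is not how TiDB implements check-mb4-value-in-utf8. find_supplementary is a hypothetical helper, and it simply flags characters above U+FFFF, which are the ones that need 4 bytes in UTF-8.

```python
from typing import Iterable, List, Tuple

BMP_MAX = 0xFFFF  # highest code point that fits in 3 UTF-8 bytes


def find_supplementary(values: Iterable[str]) -> List[Tuple[int, int, str]]:
    """Return (value_index, char_index, char) for every character above the BMP.

    Hypothetical helper: any hit is a value that a 3-byte utf8 column
    could not store.
    """
    hits = []
    for i, value in enumerate(values):
        for j, ch in enumerate(value):
            if ord(ch) > BMP_MAX:
                hits.append((i, j, ch))
    return hits


if __name__ == "__main__":
    rows = ["abc", "x\U0001F600y", "\u4e2d"]
    for row_idx, char_idx, ch in find_supplementary(rows):
        print(f"value {row_idx}, position {char_idx}: U+{ord(ch):X}")
```

An empty result means every value fits in 3-byte utf8, which is the situation for data that originally came from a MySQL utf8 column.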

@tiancaiamao tiancaiamao added the feature/discussing This feature request is discussing among product managers label Jan 24, 2022
@dveeden
Contributor

dveeden commented Aug 31, 2022

I think #26226 is the same basic issue.

7 participants