MySQL: Try to store 4 byte character in a 3 Byte UTF8-Field #51
Comments
If I switch the mysql_charset on a MySQL 5.6 server, another message occurs:
To fix this, we have to get rid of strings as (primary) keys. |
It seems to me to be a limitation in MySQL. I think #48 would make things more complex without any gain. |
Indeed, it's a limitation in MySQL: See http://bugs.mysql.com/bug.php?id=4541
|
Possible solutions:
Possibly 191 is still quite long for a URL. Why 191? Because 767/4 = 191.75. It's kind of funny, and sad at the same time, that two limitations in MySQL are exposed here: MySQL calling something UTF8 when it was not (or only a subset of UTF-8), and then the limit on the number of bytes used for indexes. |
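The arithmetic behind the 191 figure can be sketched as follows. This is only an illustration: it assumes InnoDB's 767-byte limit on a single-column index prefix, and the names `INDEX_PREFIX_LIMIT_BYTES`, `MAX_BYTES_PER_CHAR` and `max_indexable_chars` are made up for this example.

```python
# Assumption: InnoDB's 767-byte limit for a single-column index prefix.
INDEX_PREFIX_LIMIT_BYTES = 767

# Maximum number of bytes MySQL reserves per character for each charset.
MAX_BYTES_PER_CHAR = {"latin1": 1, "utf8": 3, "utf8mb4": 4}


def max_indexable_chars(charset, limit_bytes=INDEX_PREFIX_LIMIT_BYTES):
    """Return how many characters of a column fit into the index."""
    return limit_bytes // MAX_BYTES_PER_CHAR[charset]


for charset in ("utf8", "utf8mb4"):
    print(charset, max_indexable_chars(charset))
```

With these assumptions a utf8 column can be indexed up to 255 characters, while a utf8mb4 column fits only 191, which is why a `VARCHAR(255)` key stops working after the charset switch.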
Thanks @gpoo for having a look. Of course, this limitation is really bad :( At the moment I prefer to shorten the mailing_list_url column to 191 chars, because this is a really fast and pragmatic solution. BUT there is another thing. If we switch to utf8mb4, we have to raise the minimum requirement for MySQL to version 5.5.3. In lower versions utf8mb4 is not supported. |
Shortening the column to 191 seems fine to me. 5.5.3 was released on March 24th, 2010. It seems safe to bump the requirement. However, it might be necessary to add a note in the README or somewhere else to point out how to upgrade the Unicode support in already existing tables. Like http://dev.mysql.com/doc/refman/5.5/en/charset-unicode-upgrading.html I don't know if @sduenas has a different opinion, though. |
I've checked some Linux / Unix distributions:
So I think this is enough to prove that MySQL 5.5.3 can be set as a requirement. But until now I was not successful. |
I created a PR (#52) for this, because I finally fixed this for MySQL. I tried to test this with PostgreSQL as well, but I failed; the same error message occurs with master as well: `$ ./mlstats --db-driver 'postgres' --db-hostname 'localhost' --db-user 'operator' --db-password '' --db-name 'mlstats' 'https://lists.libresoft.es/pipermail/metrics-grimoire/'`
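A minimal sketch of how the version requirement could be checked before choosing the charset. The helper `supports_utf8mb4` is hypothetical and not part of mlstats; it assumes the version string has the shape returned by `SELECT VERSION()`, e.g. `5.6.20-log`.

```python
import re

# utf8mb4 is available starting with MySQL 5.5.3.
MIN_UTF8MB4_VERSION = (5, 5, 3)


def supports_utf8mb4(version_string):
    """Hypothetical helper: True if the server version is at least 5.5.3."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version_string)
    if match is None:
        raise ValueError("unrecognised version string: %r" % version_string)
    version = tuple(int(part) for part in match.groups())
    return version >= MIN_UTF8MB4_VERSION


for candidate in ("5.1.73", "5.5.3-m3", "5.6.20-log"):
    charset = "utf8mb4" if supports_utf8mb4(candidate) else "utf8"
    print(candidate, "->", charset)
```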
|
@andygrunwald, sorry to be late on this. I think shortening the indexes to 191 characters is a good solution. My only concern is with the message_ID field, but I think it will be extremely rare to find something longer than 191. |
Thanks @sduenas. I had a look at RFC 2822. Sadly, there is no maximum length defined for this field. I suggest we merge this change. 191 characters is a lot of space. |
Fix #51: Add support for 4 byte UTF-8 character in MySQL
Mailing list post: https://lists.libresoft.es/pipermail/metrics-grimoire/2014-September/002389.html |
I opened one of my databases and I ran: I got 30 rows, one of which has a length of 180. I would prefer to have some data (aka. some sampling) regarding the |
Some samples below. As you will see, there are two messages on the Xen mailing lists whose message ID is longer than 191 characters:
I used the following query to get the results.
|
Maybe a different approach could be used. We know the problem with MySQL happens when the index is longer than a fixed number of bytes, not in the columns themselves. According to the index options for MySQL in SQLAlchemy, it seems possible to use something like:
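The snippet originally posted here is not preserved, so the following is only a sketch of what such a definition could look like, not the code from the comment. The table and index names are hypothetical; `mysql_length` is the SQLAlchemy MySQL dialect option that limits how many characters of a column go into an index.

```python
from sqlalchemy import Column, Index, MetaData, String, Table
from sqlalchemy.dialects import mysql
from sqlalchemy.schema import CreateIndex

metadata = MetaData()

# Hypothetical table: the column keeps its full length of 255 characters.
messages = Table(
    "messages_sketch",
    metadata,
    Column("message_id", String(255), nullable=False),
    mysql_charset="utf8mb4",
)

# Only the first 191 characters (191 * 4 = 764 bytes) go into the index.
message_id_idx = Index(
    "ix_messages_sketch_message_id",
    messages.c.message_id,
    mysql_length=191,
)

# Render the DDL for MySQL without connecting to a server.
ddl = str(CreateIndex(message_id_idx).compile(dialect=mysql.dialect()))
print(ddl)
```

The column itself is not shortened in this variant; only the indexed prefix is. Whether this also covers the primary key columns discussed above is not shown by this sketch.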
|
What do you think about this other approach? |
Sorry for my late reply. To sum up, we have two possibilities:
If we use solution 1, we assume that 191 chars are enough for a PK to be unique. Currently the complete message-id is used as index / key / PK. If a message-id is longer than 191 chars, only the first 191 chars are used as the unique index. If we use solution 2, we accept that one table contains two different charsets, because as we learned in this PR, utf8 != utf8 (utf8mb4) in MySQL. I think this is a possible solution, but a kind of "bad practice". On the other hand, a mail client vendor should not use UTF-8 characters (3 or 4 bytes, it does not matter) in a message-id property. But afaik this is not strictly defined. It is a mail header, though, and this should be ASCII. It is hard to decide which solution should be used, because both have advantages and disadvantages. Another idea is to represent the message-id as pure bytes, not as a string. |
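To illustrate the risk of using only the first 191 characters as the unique key, here is a small sketch that looks for message-ids which become indistinguishable after truncation. The function name and the sample IDs are invented for this example and are not real data.

```python
from collections import defaultdict


def prefix_collisions(message_ids, prefix_length=191):
    """Group distinct message-ids that share the same truncated prefix."""
    groups = defaultdict(set)
    for message_id in message_ids:
        groups[message_id[:prefix_length]].add(message_id)
    # Keep only the prefixes shared by more than one distinct message-id.
    return {
        prefix: sorted(ids)
        for prefix, ids in groups.items()
        if len(ids) > 1
    }


# Invented sample data: two long IDs that differ only after character 191.
long_a = "<" + "x" * 195 + "a@example.org>"
long_b = "<" + "x" * 195 + "b@example.org>"
short = "<1@example.org>"

collisions = prefix_collisions([long_a, long_b, short])
print(len(collisions), "colliding prefix(es)")
```

Running such a check against an existing database would give the kind of sampling asked for earlier in the thread.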
My opinion goes more in the way of solution 2 but using |
@sduenas +1 |
Fine for me, as well. |
First: I rolled back this change in a68bcc0.
comes only from a non-decoded email address. To test this UTF8MB4 thingy, I sent two emails to the metrics grimoire mailing list: The first post with emoji can't be displayed correctly. |
Instead of sending an email to a mailing list, I think it would be better to add a proper test case with a cooked mbox file :-) Like the ones in Nevertheless, this still can be done and it can be good to check when this is done and to avoid regressions in the future. I have not looked at the actual error yet :-) |
If you execute:
you will get
This is because you cannot store 4-byte characters in MySQL with the utf8 character set.
Since MySQL 5.5, the 4-byte UTF-8 Unicode encoding is supported.
For detailed information, see:
Here is a related django ticket: Use utf8mb4 encoding with MySQL 5.5.
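To make the 3-byte versus 4-byte distinction concrete, the sketch below lists the characters of a string that need four bytes in UTF-8 and therefore do not fit into MySQL's utf8 charset. It assumes Python 3 string semantics, and `four_byte_chars` is a name made up for this example.

```python
def four_byte_chars(text):
    """Return the characters of `text` that need 4 bytes in UTF-8."""
    return [ch for ch in text if len(ch.encode("utf-8")) == 4]


# e-acute (2 bytes), euro sign (3 bytes) and an emoji (4 bytes).
sample = "caf\u00e9 \u20ac \U0001F600"

for ch in sample:
    print("U+%04X" % ord(ch), len(ch.encode("utf-8")), "byte(s)")

print("needs utf8mb4:", bool(four_byte_chars(sample)))
```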
Solution:
Use the utf8mb4 charset for MySQL if the MySQL server in use supports it.