Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MySQL: Try to store 4 byte character in a 3 Byte UTF8-Field #51

Open
andygrunwald opened this issue Sep 7, 2014 · 21 comments
Open

MySQL: Try to store 4 byte character in a 3 Byte UTF8-Field #51

andygrunwald opened this issue Sep 7, 2014 · 21 comments
Labels

Comments

@andygrunwald
Copy link
Contributor

If you execute:

/mlstats --no-report --db-driver 'mysql' --db-hostname 'localhost' --db-user 'root' --db-password '' --db-name 'mlstats' 'http://lists.typo3.org/pipermail/typo3-ug-denmark/'

you will get

Analyzing /Users/agrunwald/.mlstats/compressed/lists.typo3.org/pipermail/typo3-ug-denmark/2006-October.txt.gz
Analyzing /Users/agrunwald/.mlstats/compressed/lists.typo3.org/pipermail/typo3-ug-denmark/2006-November.txt.gz
Analyzing /Users/agrunwald/.mlstats/compressed/lists.typo3.org/pipermail/typo3-ug-denmark/2006-December.txt.gz
Analyzing /Users/agrunwald/.mlstats/compressed/lists.typo3.org/pipermail/typo3-ug-denmark/2007-January.txt.gz
Traceback (most recent call last):
  File "./mlstats", line 38, in <module>
    pymlstats.start()
  File "/Users/agrunwald/Development/MailingListStats.git/pymlstats/__init__.py", line 166, in start
    quiet, force, web_user, web_password)
  File "/Users/agrunwald/Development/MailingListStats.git/pymlstats/main.py", line 173, in __init__
    t, s, np = self.__analyze_mailing_list(mailing_list)
  File "/Users/agrunwald/Development/MailingListStats.git/pymlstats/main.py", line 225, in __analyze_mailing_list
    total, stored, non_parsed = self.__analyze_list_of_files(mailing_list, archives_to_analyze)
  File "/Users/agrunwald/Development/MailingListStats.git/pymlstats/main.py", line 393, in __analyze_list_of_files
    mailing_list.location)
  File "/Users/agrunwald/Development/MailingListStats.git/pymlstats/db/session.py", line 154, in store_messages
    self.insert_people(name, email)
  File "/Users/agrunwald/Development/MailingListStats.git/pymlstats/db/session.py", line 81, in insert_people
    self.session.commit()
  File "/Library/Python/2.7/site-packages/sqlalchemy/orm/session.py", line 768, in commit
    self.transaction.commit()
  File "/Library/Python/2.7/site-packages/sqlalchemy/orm/session.py", line 370, in commit
    self._prepare_impl()
  File "/Library/Python/2.7/site-packages/sqlalchemy/orm/session.py", line 350, in _prepare_impl
    self.session.flush()
  File "/Library/Python/2.7/site-packages/sqlalchemy/orm/session.py", line 1907, in flush
    self._flush(objects)
  File "/Library/Python/2.7/site-packages/sqlalchemy/orm/session.py", line 2025, in _flush
    transaction.rollback(_capture_exception=True)
  File "/Library/Python/2.7/site-packages/sqlalchemy/util/langhelpers.py", line 57, in __exit__
    compat.reraise(exc_type, exc_value, exc_tb)
  File "/Library/Python/2.7/site-packages/sqlalchemy/orm/session.py", line 1989, in _flush
    flush_context.execute()
  File "/Library/Python/2.7/site-packages/sqlalchemy/orm/unitofwork.py", line 371, in execute
    rec.execute(self)
  File "/Library/Python/2.7/site-packages/sqlalchemy/orm/unitofwork.py", line 524, in execute
    uow
  File "/Library/Python/2.7/site-packages/sqlalchemy/orm/persistence.py", line 64, in save_obj
    mapper, table, insert)
  File "/Library/Python/2.7/site-packages/sqlalchemy/orm/persistence.py", line 568, in _emit_insert_statements
    execute(statement, multiparams)
  File "/Library/Python/2.7/site-packages/sqlalchemy/engine/base.py", line 727, in execute
    return meth(self, multiparams, params)
  File "/Library/Python/2.7/site-packages/sqlalchemy/sql/elements.py", line 322, in _execute_on_connection
    return connection._execute_clauseelement(self, multiparams, params)
  File "/Library/Python/2.7/site-packages/sqlalchemy/engine/base.py", line 824, in _execute_clauseelement
    compiled_sql, distilled_params
  File "/Library/Python/2.7/site-packages/sqlalchemy/engine/base.py", line 954, in _execute_context
    context)
  File "/Library/Python/2.7/site-packages/sqlalchemy/engine/base.py", line 1116, in _handle_dbapi_exception
    exc_info
  File "/Library/Python/2.7/site-packages/sqlalchemy/util/compat.py", line 189, in raise_from_cause
    reraise(type(exception), exception, tb=exc_tb)
  File "/Library/Python/2.7/site-packages/sqlalchemy/engine/base.py", line 947, in _execute_context
    context)
  File "/Library/Python/2.7/site-packages/sqlalchemy/engine/default.py", line 435, in do_execute
    cursor.execute(statement, parameters)
  File "/Library/Python/2.7/site-packages/MySQLdb/cursors.py", line 205, in execute
    self.errorhandler(self, exc, value)
  File "/Library/Python/2.7/site-packages/MySQLdb/connections.py", line 36, in defaulterrorhandler
    raise errorclass, errorvalue
sqlalchemy.exc.OperationalError: (OperationalError) (1366, "Incorrect string value: '\\xE6\\xF8\\xE5' for column 'email_address' at row 1") 'INSERT INTO people (email_address, name, username, domain_name, top_level_domain) VALUES (%s, %s, %s, %s, %s)' ('none@none.\xe6\xf8\xe5', 'Lars Bonnesen', 'none', 'none.\xe6\xf8\xe5', '\xe6\xf8\xe5')

This is because you cannot store 4-byte characters in MySQL with the utf-8 character set.
Since MySQL 5.5 4-Byte UTF-8 Unicode Encoding is supported.
See for detail information:

Here is a related django ticket: Use utf8mb4 encoding with MySQL 5.5.

Solution:
Use utf8mb4 charset for MySQL if the used mysql server supports this.

@andygrunwald
Copy link
Contributor Author

If i switch the mysql_charset with a MySQL-Server 5.6 another message occur:

Traceback (most recent call last):
  File "./mlstats", line 38, in <module>
    pymlstats.start()
  File "/Users/agrunwald/Development/MailingListStats.git/pymlstats/__init__.py", line 166, in start
    quiet, force, web_user, web_password)
  File "/Users/agrunwald/Development/MailingListStats.git/pymlstats/main.py", line 149, in __init__
    Database.create_tables(engine, checkfirst=True)
  File "/Users/agrunwald/Development/MailingListStats.git/pymlstats/db/session.py", line 50, in create_tables
    db.Base.metadata.create_all(engine, checkfirst=checkfirst)
  File "/Library/Python/2.7/site-packages/sqlalchemy/sql/schema.py", line 3317, in create_all
    tables=tables)
  File "/Library/Python/2.7/site-packages/sqlalchemy/engine/base.py", line 1555, in _run_visitor
    conn._run_visitor(visitorcallable, element, **kwargs)
  File "/Library/Python/2.7/site-packages/sqlalchemy/engine/base.py", line 1202, in _run_visitor
    **kwargs).traverse_single(element)
  File "/Library/Python/2.7/site-packages/sqlalchemy/sql/visitors.py", line 119, in traverse_single
    return meth(obj, **kw)
  File "/Library/Python/2.7/site-packages/sqlalchemy/sql/ddl.py", line 709, in visit_metadata
    self.traverse_single(table, create_ok=True)
  File "/Library/Python/2.7/site-packages/sqlalchemy/sql/visitors.py", line 119, in traverse_single
    return meth(obj, **kw)
  File "/Library/Python/2.7/site-packages/sqlalchemy/sql/ddl.py", line 728, in visit_table
    self.connection.execute(CreateTable(table))
  File "/Library/Python/2.7/site-packages/sqlalchemy/engine/base.py", line 727, in execute
    return meth(self, multiparams, params)
  File "/Library/Python/2.7/site-packages/sqlalchemy/sql/ddl.py", line 67, in _execute_on_connection
    return connection._execute_ddl(self, multiparams, params)
  File "/Library/Python/2.7/site-packages/sqlalchemy/engine/base.py", line 781, in _execute_ddl
    compiled
  File "/Library/Python/2.7/site-packages/sqlalchemy/engine/base.py", line 954, in _execute_context
    context)
  File "/Library/Python/2.7/site-packages/sqlalchemy/engine/base.py", line 1116, in _handle_dbapi_exception
    exc_info
  File "/Library/Python/2.7/site-packages/sqlalchemy/util/compat.py", line 189, in raise_from_cause
    reraise(type(exception), exception, tb=exc_tb)
  File "/Library/Python/2.7/site-packages/sqlalchemy/engine/base.py", line 947, in _execute_context
    context)
  File "/Library/Python/2.7/site-packages/sqlalchemy/engine/default.py", line 435, in do_execute
    cursor.execute(statement, parameters)
  File "/Library/Python/2.7/site-packages/MySQLdb/cursors.py", line 205, in execute
    self.errorhandler(self, exc, value)
  File "/Library/Python/2.7/site-packages/MySQLdb/connections.py", line 36, in defaulterrorhandler
    raise errorclass, errorvalue
sqlalchemy.exc.OperationalError: (OperationalError) (1071, 'Specified key was too long; max key length is 767 bytes') '\nCREATE TABLE mailing_lists (\n\tmailing_list_url VARCHAR(255) NOT NULL, \n\tmailing_list_name VARCHAR(255), \n\tproject_name VARCHAR(255), \n\tlast_analysis DATETIME, \n\tPRIMARY KEY (mailing_list_url)\n)ENGINE=InnoDB CHARSET=utf8mb4\n\n' ()

To fix this we have to get rid of the Strings as (Primary) Keys.
This seems to be related to #48

@gpoo
Copy link
Member

gpoo commented Sep 10, 2014

It seems to me a limitation in MySQL. I find #48 would make things more complex without gain.

@gpoo
Copy link
Member

gpoo commented Sep 10, 2014

Indeed, it's a limitation in MySQL: See http://bugs.mysql.com/bug.php?id=4541
or the documentation in http://dev.mysql.com/doc/refman/5.1/en/create-index.html (emphasis mine):

Prefix support and lengths of prefixes (where supported) are storage engine dependent. For example, a prefix can be up to 1000 bytes long for MyISAM tables, and 767 bytes for InnoDB tables.

@gpoo
Copy link
Member

gpoo commented Sep 10, 2014

Possible solutions:

  • Make the columns shorter (e.g. mailing_list_url(191) instead of 255)
  • Use partial indexes (i.e. don't use the whole column, but a subset of each column, like 191 out of 255)

Possibly 191 is still quite long for an URL. Why 192, because 791/4=191.75.

It's kind of funny --and sad at the same time--, that here there are exposed 2 limitations in MySQL: MySQL calling something UTF8 when it was not (or a subset of UTF8), and then the limitation on bytes used for indexes.

@andygrunwald
Copy link
Contributor Author

Thanks @gpoo for having a look.
Regarding #48 i will prepare some benchmarks about this. But i agree with you that this is a more complex thing.

Of course, i this limitation is really bad :(

At the moment i prefer to shorten the mailing_list_url column to 191 chars, because this is a really fast and pragmatic solution.
If this is okay for you i will do this.

BUT there is another thing. If we switch to utf8mb4 we have to raise the minimum requirement for MySQL to version 5.5.3. In lower versions utf8mb4 is not supported.

@gpoo
Copy link
Member

gpoo commented Sep 10, 2014

Shortening the column 191 seems fine to me.

5.5.3 was released on March 24th, 2010. It seems safe to bump the requirement. However, it might be necessary to add a note in the README or somewhere else, to point out how to upgrade the unicode support in already existing tables. Like http://dev.mysql.com/doc/refman/5.5/en/charset-unicode-upgrading.html

I don't know if @sduenas has a different opinion, though.

@andygrunwald
Copy link
Contributor Author

I`ve checked some Linux / Unix distris:

  • Debian:
    • squeeze (oldstable): 5.1.73-1
    • wheezy (stable): 5.5.38-0+wheezy1
    • jessie (testing): 5.5.39-1
    • sid (unstable): 5.5.39-1
  • FreeBSD: 5.5.39
  • Ubuntu:
    • Lucid: 5.1.73-0ubuntu0.10.04.1
    • Precise, raring, Saucy, Trusty, Utopic: >= 5.5.34-0ubuntu0.13.04
  • Redhat: 5.5.39

So i think this is enough to prove that MySQL 5.5.3 can be set as a requirement.
Currently i am working on a solution.
See here: https://github.com/andygrunwald/MailingListStats/commit/3ff9d1241804992f7dc760c3a7d9893bb536606b

But until now i was not successful.
I am fighting with an
sqlalchemy.exc.OperationalError: (OperationalError) (1054, "Unknown column 'mailing_list_url' in 'field list'") 'INSERT INTO messages_people (type_of_recipient, message_id, mailing_list_url, email_address) VALUES (%s, %s, %s, %s)' ('From', '<mailman.16.1085344651.25321.typo3-ug-denmark@lists.netfielders.de>', 'http://lists.typo3.org/pipermail/typo3-ug-denmark', '[email protected]') error.
But maybe the next days :)

@andygrunwald
Copy link
Contributor Author

I created a PR (#52) for this, because i finally fixed this for MySQL.
Afaik PostgreSQL is not affected. Can you please have a look at it ? :)

I tried to tested this with PostgreSQL as well, but i failed, but the same error message occur with master as well
The current master seems not to work with PostgreSQL 9.3:

`$ ./mlstats --db-driver 'postgres' --db-hostname 'localhost' --db-user 'operator' --db-password '' --db-name 'mlstats' 'https://lists.libresoft.es/pipermail/metrics-grimoire/'``

Retrieving https://lists.libresoft.es/pipermail/metrics-grimoire/2006-December.txt.gz...
...
Retrieving https://lists.libresoft.es/pipermail/metrics-grimoire/2014-July.txt...
Analyzing /Users/agrunwald/.mlstats/compressed/lists.libresoft.es/pipermail/metrics-grimoire/2006-December.txt.gz
   ***WARNING: 2 messages (out of 4) parsed but not stored (2 duplicate, 0 errors)***
Analyzing /Users/agrunwald/.mlstats/compressed/lists.libresoft.es/pipermail/metrics-grimoire/2007-January.txt.gz
Traceback (most recent call last):
  File "./mlstats", line 38, in <module>
    pymlstats.start()
  File "/Users/agrunwald/Development/MailingListStats.git/pymlstats/__init__.py", line 166, in start
    quiet, force, web_user, web_password)
  File "/Users/agrunwald/Development/MailingListStats.git/pymlstats/main.py", line 173, in __init__
    t, s, np = self.__analyze_mailing_list(mailing_list)
  File "/Users/agrunwald/Development/MailingListStats.git/pymlstats/main.py", line 225, in __analyze_mailing_list
    total, stored, non_parsed = self.__analyze_list_of_files(mailing_list, archives_to_analyze)
  File "/Users/agrunwald/Development/MailingListStats.git/pymlstats/main.py", line 393, in __analyze_list_of_files
    mailing_list.location)
  File "/Users/agrunwald/Development/MailingListStats.git/pymlstats/db/session.py", line 155, in store_messages
    self.insert_people(name, email)
  File "/Users/agrunwald/Development/MailingListStats.git/pymlstats/db/session.py", line 81, in insert_people
    self.session.commit()
  File "/Library/Python/2.7/site-packages/sqlalchemy/orm/session.py", line 768, in commit
    self.transaction.commit()
  File "/Library/Python/2.7/site-packages/sqlalchemy/orm/session.py", line 370, in commit
    self._prepare_impl()
  File "/Library/Python/2.7/site-packages/sqlalchemy/orm/session.py", line 350, in _prepare_impl
    self.session.flush()
  File "/Library/Python/2.7/site-packages/sqlalchemy/orm/session.py", line 1907, in flush
    self._flush(objects)
  File "/Library/Python/2.7/site-packages/sqlalchemy/orm/session.py", line 2025, in _flush
    transaction.rollback(_capture_exception=True)
  File "/Library/Python/2.7/site-packages/sqlalchemy/util/langhelpers.py", line 57, in __exit__
    compat.reraise(exc_type, exc_value, exc_tb)
  File "/Library/Python/2.7/site-packages/sqlalchemy/orm/session.py", line 1989, in _flush
    flush_context.execute()
  File "/Library/Python/2.7/site-packages/sqlalchemy/orm/unitofwork.py", line 371, in execute
    rec.execute(self)
  File "/Library/Python/2.7/site-packages/sqlalchemy/orm/unitofwork.py", line 524, in execute
    uow
  File "/Library/Python/2.7/site-packages/sqlalchemy/orm/persistence.py", line 64, in save_obj
    mapper, table, insert)
  File "/Library/Python/2.7/site-packages/sqlalchemy/orm/persistence.py", line 568, in _emit_insert_statements
    execute(statement, multiparams)
  File "/Library/Python/2.7/site-packages/sqlalchemy/engine/base.py", line 727, in execute
    return meth(self, multiparams, params)
  File "/Library/Python/2.7/site-packages/sqlalchemy/sql/elements.py", line 322, in _execute_on_connection
    return connection._execute_clauseelement(self, multiparams, params)
  File "/Library/Python/2.7/site-packages/sqlalchemy/engine/base.py", line 824, in _execute_clauseelement
    compiled_sql, distilled_params
  File "/Library/Python/2.7/site-packages/sqlalchemy/engine/base.py", line 954, in _execute_context
    context)
  File "/Library/Python/2.7/site-packages/sqlalchemy/engine/base.py", line 1119, in _handle_dbapi_exception
    util.reraise(*exc_info)
  File "/Library/Python/2.7/site-packages/sqlalchemy/engine/base.py", line 947, in _execute_context
    context)
  File "/Library/Python/2.7/site-packages/sqlalchemy/engine/default.py", line 435, in do_execute
    cursor.execute(statement, parameters)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in position 21: ordinal not in range(128)

@sduenas
Copy link
Member

sduenas commented Sep 23, 2014

@andygrunwald, sorry to be late on this. I think to short the indexes to 191 characters is a good solution. My only concern is with message_ID field but I think it will be extremely rare to find something longer than 191.

@andygrunwald
Copy link
Contributor Author

Thanks @sduenas.
I understand your concern.
To shorten the message id for messages_people.message_id and messages.message_ID is necessary, because both fields were used as index (foreign key + primary key).

I had a look at RFC 2822. Sadly there is no max. length defined of this field.

I suggest we will merge this change. 191 character is a lot of space.
If some mailinglist needs more space, we will have a look how to fix this, but we are not sure if this is a real case.

andygrunwald added a commit that referenced this issue Sep 24, 2014
Fix #51: Add support for 4 byte UTF-8 character in MySQL
@andygrunwald
Copy link
Contributor Author

@gpoo
Copy link
Member

gpoo commented Sep 24, 2014

I opened one of my databases and I ran:
select length(message_id) from messages
where length(message_id) > 150

I got 30 rows, one of them whose length is 180.

I would prefer to have some data (aka. some sampling) regarding to the message_id length from multiple mailing lists. The message_id is generated by the email client, and it could take anything to make it.

@gpoo gpoo reopened this Sep 24, 2014
@sduenas
Copy link
Member

sduenas commented Sep 25, 2014

Some samples below. As you will see, there are two messages on Xen mailing lists which their message id is larger than 191 characters:

|<!&!GAAAAAAAAACfwOfEi/CuRo1/DwEE//OdwoAAABgAAAAAAAAAn8DnxIvwrkaNfw8BBP/znWT4IQAAAAAAEAAAAD6Zxj3z1rBDrUqmH53m4RI1AAAARG9lcyBYZW4gMy4xIHVzZSBOQVQgZm9yIHZpcnR1YWwgbmV0d29yayBieSBkZWZhdWx0PwA=@126.com> |
| Xen user discussion <xen-users.lists.xensource.com> | 2007-05-29 14:23:07 |
[Xen-users] Recall: Does Xen 3.1 use NAT for virtual network by default? |

| <!&!GAAAAAAAAACfwOfEi/CuRo1/DwEE//OdwoAAABgAAAAAAAAAn8DnxIvwrkaNfw8BBP/znWT4IQAAAAAAEAAAAD6Zxj3z1rBDrUqmH53m4RI1AAAARG9lcyBYZW4gMy4xIHVzZSBOQVQgZm9yIHZpcnR1YWwgbmV0d29yayBieSBkZWZhdWx0PwA=@act.buaa.edu.cn> |
| Xen user discussion <xen-users.lists.xensource.com> | 2007-05-29 14:23:20 | 
| [Xen-users] Does Xen 3.1 use NAT for virtual network by default? |

I've used the next query to get the results.

SELECT COUNT(message_id) n,
SUM(IF(LENGTH(message_id) <= 32, 1, 0)) AS "=< 32",
SUM(IF(LENGTH(message_id) > 32 AND LENGTH(message_id) <= 64, 1, 0)) AS "=< 64", 
SUM(IF(LENGTH(message_id) > 64 AND LENGTH(message_id) <= 128, 1, 0)) AS "=< 128", 
SUM(IF(LENGTH(message_id) > 128 AND LENGTH(message_id) <= 150, 1, 0)) AS "=< 150", 
SUM(IF(LENGTH(message_id) > 150 AND LENGTH(message_id) <= 191, 1, 0)) AS "=< 191",
SUM(IF(LENGTH(message_id) > 191, 1, 0)) AS "> 191"
FROM messages;
  • Ceph
+-------+-------+-------+--------+--------+--------+-------+
| n     | =< 32 | =< 64 | =< 128 | =< 150 | =< 191 | > 191 |
+-------+-------+-------+--------+--------+--------+-------+
| 34093 |  8613 | 13945 |  11535 |      0 |      0 |     0 |
+-------+-------+-------+--------+--------+--------+-------+
  • Xen
+--------+-------+--------+--------+--------+--------+-------+
| n      | =< 32 | =< 64  | =< 128 | =< 150 | =< 191 | > 191 |
+--------+-------+--------+--------+--------+--------+-------+
| 327187 | 56985 | 229477 |  40722 |      0 |      1 |     2 |
+--------+-------+--------+--------+--------+--------+-------+
  • Linux kernel
+---------+--------+---------+--------+--------+--------+-------+
| n       | =< 32  | =< 64   | =< 128 | =< 150 | =< 191 | > 191 |
+---------+--------+---------+--------+--------+--------+-------+
| 1718226 | 340363 | 1256518 | 121327 |     15 |      3 |     0 |
+---------+--------+---------+--------+--------+--------+-------+
  • OpenStack
+-------+-------+-------+--------+--------+--------+-------+
| n     | =< 32 | =< 64 | =< 128 | =< 150 | =< 191 | > 191 |
+-------+-------+-------+--------+--------+--------+-------+
| 75594 | 16911 | 23588 |  35092 |      0 |      3 |     0 |
+-------+-------+-------+--------+--------+--------+-------+

@gpoo
Copy link
Member

gpoo commented Sep 25, 2014

Maybe a different approach could be used.

We know the problem with MySQL happens when the index is longer than a fixed number of bytes, not in the columns themselves. According to the index options for MySQL in SQLAlchemy, it seems possible to use something like:

Index('my_index', my_table.c.data, mysql_length=10)
Index('a_b_idx', my_table.c.a, my_table.c.b, mysql_length={'a': 4,
                                                           'b': 9})

@sduenas
Copy link
Member

sduenas commented Sep 26, 2014

What do you think about this other approach?

http://dev.mysql.com/doc/refman/5.0/en/charset-column.html

@andygrunwald
Copy link
Contributor Author

Sorry for my late reply.
At first i want to publish more length examples of message_id`s, but i think the examples by @sduenas are enough, because he already got two examples which are larger than 191 characters.
And to be honest this is enough to prove that this is not the final solution.

To sum up, we got two possibilities:

  1. Change the amount of possible letters per message_id column back and only limi the num of chars in the index
  2. Change only the message_id field back to utf8 instead of utf8mb4

If we use 1. we assume that 191 chars are enough for a PK to be unique enough. Currently the complete message-id is used as index / key / pk. If a message-id is longer than 191 chars we only use the first 191 chars as unique index.
Assuming that in the posted examples only 2 message_ids are bigger than 191 characters this might be ok / acceptable.

If we use solution 2. we accept that in 1 table there are two different charsets, because as we learned in this PR utf8 != utf8 (utf8mb4) in MySQL. I think this is a possible solution, but a kind of "bad practice". On the other hand a mail client vendor should not use UTF8-Character (3 or 4 byte, does not matter) in a message-id property. But afaik this i not strictly defined. But it is a mail header and this should be ASCII.

Hard to decide which solution should be used, because both has advantages and disadvantages.
Anyway. If we can agree on a solution i would be able to make the change and fix this "regression".

Another idea is to reflect the message-id as pure bytes not as string.

@gpoo / @sduenas:

@sduenas
Copy link
Member

sduenas commented Sep 29, 2014

My opinion goes more in the way of solution 2 but using utf8mb4 only for fields that could include these characters (i.e subject or message_body) and keeping utf8 on primary and foreign keys or on indexes. Following this approach, we will only have one table (messages) with two different character sets.

@gpoo
Copy link
Member

gpoo commented Sep 29, 2014

@sduenas +1

@andygrunwald
Copy link
Contributor Author

Fine for me, as well.
I will rollback commit fa0e19b and will make the new change (solution 2 with utf8mb4 fields for only fields which (maybe) includes this.
Of course there is a possibility that message-id include this characters as well, but the chance is really low.
I will link the commit here so that you got the change to have a look at it as well.

Thanks @gpoo and @sduenas for your opinions!

@andygrunwald
Copy link
Contributor Author

First: I had rollbacked this change in a68bcc0.
Second: After i had a look at it again i detected that for the error message, which was mentioned in the first post:

...
  File "/Library/Python/2.7/site-packages/sqlalchemy/engine/default.py", line 435, in do_execute
    cursor.execute(statement, parameters)
  File "/Library/Python/2.7/site-packages/MySQLdb/cursors.py", line 205, in execute
    self.errorhandler(self, exc, value)
  File "/Library/Python/2.7/site-packages/MySQLdb/connections.py", line 36, in defaulterrorhandler
    raise errorclass, errorvalue
sqlalchemy.exc.OperationalError: (OperationalError) (1366, "Incorrect string value: '\\xE6\\xF8\\xE5' for column 'email_address' at row 1") 'INSERT INTO people (email_address, name, username, domain_name, top_level_domain) VALUES (%s, %s, %s, %s, %s)' ('none@none.\xe6\xf8\xe5', 'Lars Bonnesen', 'none', 'none.\xe6\xf8\xe5', '\xe6\xf8\xe5')

comes only from a non decoded email address.
The email address is none@none.æøå. Because this was not decoded, it can`t be stored in MySQL.
This was fixed in 11a5b04.

To test this UTF8MB4 thingy i sent two emails to the metrics grimoire mailing list:

The first post with emojii can't be displayed correctly.
The second post can be displayed, but with MySQL it can be only stored as foo?bar, because if Mailman / MLStats will download it it replaced the UTF8 4 Byte letter as ? see https://lists.libresoft.es/pipermail/metrics-grimoire/2014-September.txt.

Any ideas @gpoo + @sduenas?

@andygrunwald andygrunwald reopened this Sep 29, 2014
@gpoo
Copy link
Member

gpoo commented Aug 7, 2015

Instead of sending an email to a mailing list, I think it would be better to add a proper test case with a cooked mbox file :-) Like the ones in pymlstast/tests and pymlstast/tests/data

Nevertheless, this still can be done and it can be good to check when this is done and to avoid regressions in the future.

I have not looked at the actual error yet :-)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants