Encodings reported by chardetng-py don't always match up to python's decoding #11
Actually, it might be even more complicated than that: https://bugs.python.org/issue25416. I think it's possible that encoding_rs and the Python codec will have different output even for legacy single-byte encodings.
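To make that single-byte mismatch concrete, here is a small illustration (not from the thread itself): byte 0x81 is one of five bytes that the WHATWG windows-1252 mapping decodes to C1 control characters but that Python's `cp1252` codec leaves undefined.

```python
# WHATWG's windows-1252 decodes every byte (0x81 maps to U+0081), while
# Python's cp1252 leaves 0x81, 0x8D, 0x8F, 0x90, and 0x9D undefined and
# raises UnicodeDecodeError in strict mode.
try:
    b"\x81".decode("cp1252")
    print("decoded")
except UnicodeDecodeError:
    print("Python's cp1252 rejects 0x81; WHATWG's windows-1252 gives U+0081")
```

So even when chardetng correctly reports windows-1252, a strict Python decode can fail on bytes a browser would accept.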
For what it’s worth, I went through and:
These mismatches are definitely not ideal, but I think they are also small enough not to be a problem (at least for the use case here: detecting the encoding of some bytes, not actually decoding/encoding them). None of these bytes show up in chardetng's frequency data. On the other hand, the …
Ah, some historical details on the ones that differ between WHATWG and Unicode, if you're curious:
Thanks for looking into that. So it looks like if you use Python to decode, it might not be correct. I think it's probably best to just document this and move on.
I'm going to keep this issue open because it's lacking documentation, but it looks like the functional problems are well addressed.
Yes, good point! I should not have marked that PR as solving this. I started looking into the multibyte encodings (I think I covered the single-byte ones well), and it seems like the situation is a bit more messy. I'm leaving some notes here so I or someone else can document based on them. Multibyte codecs that Chardetng detects:

### GBK

See #149 for details. Basically, gb18030 is a superset of gbk, which is a superset of gb2312. Chardetng expects decoding with encoding_rs, which treats all of them the same (as gb18030). They are all different decoders in Python, though, so we should (and now do) transform GBK → gb18030, which will work as a decoder in Python for all [correctly detected] cases.

### Shift_JIS

Oh dear, this is messy and complicated. Shift_JIS turns out to be a really old and limited spec that has been extended in a multitude of ways that are incompatible with each other. The most popular versions of Shift_JIS are:

WHATWG (and Chardetng) treats Windows-932 as Shift_JIS. So technically, if Chardetng detects … So if Chardetng returns …

(Caveat: if you want/need to behave like a web browser, skip step 2.) In an ideal world, I think that behavior would be bundled up into the …

I didn't have a chance to dive into the other multibyte ones yet. Quick notes on EUC-KR and Big5, but these need more investigation:

### EUC-KR

WHATWG's and encoding_rs's definition of EUC-KR is actually UHC/Windows-949, which is an extension of the original EUC-KR. I'm not sure whether that's a strict superset (in which case this is straightforward and we should swap the name like we did for GBK/gb18030) or not, or whether there are any idiosyncrasies in how Python treats the two (they are at least separate codecs in Python).

### Big5

WHATWG's and encoding_rs's definition of Big5 is actually Big5-HKSCS. I'm not sure how strict a superset this is, or anything else here. Needs more research.

### ISO-2022-JP

Haven't looked into this at all yet.

### EUC-JP

Haven't looked into this at all yet.
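As a quick sanity check of the GBK → gb18030 remapping described above, here's a sketch using Python's built-in codecs (the sample text is illustrative): since gb18030 is a superset of gbk, bytes that decode as gbk decode identically as gb18030.

```python
# gb18030 is a superset of gbk (which is a superset of gb2312), so decoding
# GBK-detected bytes with Python's gb18030 codec is safe.
data = "汉字".encode("gbk")  # two common Hanzi, valid in all three encodings
assert data.decode("gb18030") == "汉字"
assert data.decode("gbk") == data.decode("gb18030")
```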
Hm, I think writing up a condensed version of this (something like "these encodings tend to be difficult in general") and putting it in the docs might be sufficient. It's obviously pretty tricky, and I appreciate you looking into it. If someone really wants to do what a browser might do, they should probably take the binary stream and pass it directly to …
That's fair. Given that, I think we should probably also (in addition to the docs) rename …

I'm still hoping to write some docs for this, but want to do the research on the other remaining multibyte codecs here first.
More research:

### ISO-2022-JP

ℹ️ TL;DR: None of Python's built-in encodings is quite equivalent to the WHATWG version of this, but ISO-2022-JP has a similarly complicated backstory to GBK, with incompatible families of extensions.

ISO-2022 is a structure for making a codec that is stateful and switches between different sub-encodings when it encounters certain escape sequences (e.g. when it encounters the bytes …).

Note: JIS X NNNN is an encoding standard, and JIS X NNNN-YYYY is the version of that standard published in year YYYY.
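The stateful escape-sequence switching can be seen with Python's built-in `iso2022_jp` codec (a small illustration, not from the thread: ESC $ B switches the decoder into JIS X 0208 mode, and ESC ( B switches back to ASCII).

```python
# Encoding Japanese text with iso2022_jp produces escape sequences that
# switch the decoder between sub-encodings.
encoded = "日本".encode("iso2022_jp")
assert encoded.startswith(b"\x1b$B")  # ESC $ B: switch to JIS X 0208
assert encoded.endswith(b"\x1b(B")    # ESC ( B: switch back to ASCII
assert encoded.decode("iso2022_jp") == "日本"
```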
So in Python, nothing is quite equivalent to the WHATWG version (which is what Chardetng is guessing when it says …).

If we are going ahead with remapping names for compatibility, it probably makes sense to map …

### EUC-JP

ℹ️ TL;DR: As far as decoding goes, WHATWG/chardet/encoding_rs's concept of …

EUC encodings are like ISO-2022 and Shift-JIS in that they contain several sub-encodings. Like ISO-2022, they support more and are a bit more flexible than Shift-JIS, but they use a switching mechanism based on what bytes are allowed where, like Shift-JIS, rather than statefully switching modes like ISO-2022. (If you're familiar with UTF-8, it's similar.) For Japanese, there are three popular versions:

Neither the WHATWG standard nor encoding_rs/chardetng use anything from JIS X 0213, and they pretty much exactly match behavior. We don't need to do anything special for this family of encodings.

Still remaining:
That all makes sense to me. I wonder if documenting it is sufficient, or if we should emit a warning of some kind. By putting the mapping logic directly into the rust code, we have created a minor problem if the user wants to know what chardetng actually outputs (for instance, if they want to pass the output to a binding of encoding_rs). Additionally, if we want to emit a warning using the Python …

I think perhaps deviating very slightly from the rust struct and adding a …
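One way the warning idea could look, as a purely hypothetical sketch assuming Python's standard `warnings` module (the mapping table and function name here are illustrative, not chardetng-py's actual API):

```python
import warnings

# Hypothetical remap table based on the discussion above; not the library's API.
ENCODING_REMAP = {"GBK": "gb18030"}

def python_encoding_name(detected: str) -> str:
    """Return a Python-compatible codec name, warning if it was remapped."""
    remapped = ENCODING_REMAP.get(detected, detected)
    if remapped != detected:
        warnings.warn(
            f"chardetng detected {detected!r}; using {remapped!r} for Python decoding",
            UnicodeWarning,
            stacklevel=2,
        )
    return remapped
```

An attribute preserving chardetng's original output could accompany something like this, so users who want to hand the name to an encoding_rs binding still can.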
So my feel on this is:
(Side note: FWIW, after having read up and thought about it more, I think …)

Some ideas for splitting the difference here…
(Edit: updated with a note on idea 2 and added idea 3, which I originally added as a separate comment: #11 (comment). Also fixed some typos where I wrote ….)
Interesting, either way. I'll need to think about it for a while.
Oops, forgot a third: have …
Alright, wrapping up with the final two:

### EUC-KR

ℹ️ TL;DR: As far as decoding goes, WHATWG/chardet/encoding_rs's concept of EUC-KR is equivalent to Python's …

EUC-KR is structured like EUC-JP, but just uses different sub-encodings. The original version had some major drawbacks, so both Microsoft and Apple developed their own extended versions:

This one is ultimately pretty simple. WHATWG's …

### Big5

ℹ️ TL;DR: As far as decoding goes, WHATWG/chardet/encoding_rs's concept of Big5 is a mix of the Big5-HKSCS extension (Python: …) …

The backstory on Big5 is pretty complex! Big5 was created in a somewhat ad-hoc way by various computer manufacturers in Taiwan, and was eventually standardized for interoperability. It was still pretty limited, though, so there is a whole mess of more specialized encodings that extend it. Some popular branches of the tree of extensions:

Since both Windows-950 and HKSCS were quite popular, WHATWG wound up standardizing on a combination of the two. It seems like it is basically HKSCS-2016 plus any additional byte sequences that don't work in it but do in Windows-950. This basically works out to HKSCS + 12 characters:

Unfortunately, there is nothing like this built into Python. First off, the …

Anyway!

(Edit on 2023-11-13: Rewrote the section on Big5 when I ran into some edge cases today. It's now much more accurate and detailed.)
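The superset relationships above can be poked at with Python's built-in codecs. A sketch, not from the thread: `cp949` is Python's codec for UHC/Windows-949, `big5hkscs` is its Big5-HKSCS codec, and the specific extension bytes below are illustrative.

```python
# EUC-KR: cp949 (UHC) decodes everything plain euc_kr does.
hangul = "한글".encode("euc_kr")
assert hangul.decode("cp949") == "한글"  # shared characters decode identically

# UHC uses lead bytes below 0xA1 that plain EUC-KR rejects.
uhc_only = b"\x81\x41"
try:
    uhc_only.decode("euc_kr")
except UnicodeDecodeError:
    print("euc_kr rejects UHC extension bytes:", uhc_only.decode("cp949"))

# Big5: big5hkscs decodes everything plain big5 does, plus HKSCS extensions.
hanzi = "中".encode("big5")
assert hanzi.decode("big5hkscs") == "中"
```

Note that, per the discussion above, neither `big5` nor `big5hkscs` exactly matches the WHATWG Big5 table, so this only demonstrates the superset direction, not full equivalence.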
Overall summary of encodings and their support/similarities/differences between WHATWG/encoding_rs and Python’s built-in codecs:
Notes:
(Edit on 2023-11-13: Updated the entry for Big5. I wound up tripping over the edge cases on it today and discovered it's the weirdest one here; WHATWG just sort of slammed two similar encodings together for it. I've updated the earlier, detailed comment for Big5 with more info.)
This is overkill in the best possible sense. I definitely think it makes sense to document this behavior. I would almost immediately accept a PR that updates the docs with some information on the more difficult charsets.
I'm happy to try and wrap this up into something more condensed and readable, but I think it would be good to first figure out the actual strategy for which (if any) names get remapped under what circumstances (it doesn't need to be implemented yet).
Quick update: I ran into some weird behavior with Big5 today and had more energy to dive into the details. It turns out to be kinda weird! I updated my detailed comments above if you want to know more, but the short version is: it's the only one where there is not a clear equivalent/safe superset in Python's built-in codecs.
In a previous version, I just passed the entire buffer to encoding_rs and had it handle decoding entirely in rust, but that was a little confusing.
Need more robust aliases