Should String Elements and UTF-8 Elements be handled differently? #134
Comments
Accepting is nearly OK; writing, on the other hand, should be avoided if we had a strict mode. @robUx4 do you have opinions? What do you do in the reference implementation?
Just my opinion, but I prefer being compliant with the specification in order to have a clear and unambiguous implementation. Is converting bytes into UTF-8 or ASCII a big effort in terms of memory and time consumption? Perhaps I'm wrong, but we need these conversions to write a compliant Matroska file. Since we have to do that for muxing, can't we apply the same ones for parsing?
You do not have to convert anything; you have to reject non-ASCII if your target is an ASCII string.
Is this spec-compliant? If we find an ASCII string, I suppose we should save it somewhere. If it is non-ASCII where there should be an ASCII string, doesn't that mean we have a corrupted input? Just posing questions to see if I've understood the problem correctly.
Given that ASCII and UTF-8 have the same representation for everything that is in ASCII, in order to always produce valid files we have to either reject non-ASCII input for ASCII strings or put a placeholder in place of each non-ASCII character, depending on how lenient we want to be. On consumption, I would only reject non-UTF-8, no matter the source.
You should not be able to write UTF-8 data in an ASCII string. When reading, you may also want to check that the string is really ASCII, although I'm not aware of any reader doing so. However, since this is Rust, it may not be nice to pass around UTF-8 data pretending to be ASCII.
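To make the two write-side options concrete, here is a minimal sketch. The function names `encode_strict` and `encode_lenient`, the `?` placeholder, and the printable range 0x20 to 0x7E are assumptions for illustration, not anything the library currently provides.

```rust
/// Strict policy (hypothetical helper): refuse to write anything that is
/// not printable ASCII, assumed here to be the range 0x20..=0x7E.
fn encode_strict(s: &str) -> Option<Vec<u8>> {
    if s.bytes().all(|b| (0x20..=0x7E).contains(&b)) {
        Some(s.as_bytes().to_vec())
    } else {
        None
    }
}

/// Lenient policy (hypothetical helper): replace every character outside
/// the printable ASCII range with a `?` placeholder.
fn encode_lenient(s: &str) -> Vec<u8> {
    s.chars()
        .map(|c| if (' '..='~').contains(&c) { c as u8 } else { b'?' })
        .collect()
}
```

The lenient variant replaces per character rather than per byte, so a multi-byte UTF-8 character becomes a single placeholder.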
As a note to the person implementing this (maybe me soon):
@FreezyLemon |
EBML differentiates between String Elements and UTF-8 Elements.
They are almost exactly the same, except for this:
Since UTF-8 is compatible with ASCII, we can just treat both Elements as UTF-8 without any parsing failures. This is what the library currently does:
matroska/src/ebml/parse.rs
Lines 80 to 84 in 7f15b7c
However, this is technically not spec-compliant as we should reject non-ASCII (or terminator) values if we have a String Element, while they may be allowed for UTF-8 Elements.
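A rough sketch of what separate read paths could look like. It assumes that String Elements are limited to printable ASCII (0x20 to 0x7E) and that both element kinds may be padded with NUL terminators, which is my reading of the EBML spec. `TextError`, `parse_string_element` and `parse_utf8_element` are made-up names, not the library's API.

```rust
/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq)]
pub enum TextError {
    /// A byte outside the assumed printable ASCII range, with its offset.
    NotAscii { offset: usize },
    /// The payload is not valid UTF-8.
    NotUtf8,
}

/// Cut the payload at the first NUL byte, treating it as a terminator.
fn trim_at_terminator(bytes: &[u8]) -> &[u8] {
    let end = bytes.iter().position(|&b| b == 0).unwrap_or(bytes.len());
    &bytes[..end]
}

/// String Element: reject anything that is not printable ASCII.
pub fn parse_string_element(bytes: &[u8]) -> Result<String, TextError> {
    let data = trim_at_terminator(bytes);
    if let Some(offset) = data.iter().position(|b| !(0x20..=0x7E).contains(b)) {
        return Err(TextError::NotAscii { offset });
    }
    // Every byte is ASCII at this point, so a byte-to-char mapping is lossless.
    Ok(data.iter().map(|&b| char::from(b)).collect())
}

/// UTF-8 Element: only reject invalid UTF-8.
pub fn parse_utf8_element(bytes: &[u8]) -> Result<String, TextError> {
    std::str::from_utf8(trim_at_terminator(bytes))
        .map(str::to_owned)
        .map_err(|_| TextError::NotUtf8)
}
```

With this split, `"café"` parses as a UTF-8 Element but is rejected as a String Element.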
A small overview:
| | Current | Spec-compliant |
| --- | --- | --- |
| Type(s) | `String` | `String` + new type |
| Parsing | `from_utf8` (std) | `from_utf8` (std) + `from_ascii` (custom, maybe faster?) |

Honestly, I can't think of many advantages to implementing this. We won't even "save" any memory, because Rust strings are all UTF-8 internally (so a String made up of only ASCII will use only one byte per character regardless). But I wanted to at least put this information somewhere. What do you think?
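For reference, the "new type" option could be as small as the wrapper below. This is only a sketch: `AsciiString` is a hypothetical name, and the accepted range (0x20 to 0x7E) is an assumption about what a String Element may contain.

```rust
/// Hypothetical newtype: a `String` that is known to hold only printable ASCII.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct AsciiString(String);

impl AsciiString {
    /// Wraps `s` if every byte is printable ASCII; otherwise hands the
    /// original string back unchanged as the error value.
    pub fn new(s: String) -> Result<Self, String> {
        if s.bytes().all(|b| (0x20..=0x7E).contains(&b)) {
            Ok(AsciiString(s))
        } else {
            Err(s)
        }
    }

    /// Borrows the contents as a regular `&str`.
    pub fn as_str(&self) -> &str {
        &self.0
    }
}
```

The wrapper stores a plain `String`, so it costs one byte per character either way; the only gain is that the type records that the ASCII check has already happened.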