Proposal for better compression with non-ascii characters #11
Comments
Hello Michael,

that is correct, and if I recall correctly, I think I considered

Do you have a patch? I'd be happy to review it!

Cheers,

P.S. I'll answer your other mails shortly!

On Sat, 14 Mar 2015 at 4:15, Michael Deardeuff wrote:
`shoco_decompress` can take a null-terminated ASCII input string and return the same string. If that can be relaxed a little, then I see an opportunity to better handle Unicode characters.

The current scheme encodes non-ASCII characters by prefixing each byte with the `0x00` code. This doubles the length of the character.

But many of the characters below `0x20` are non-printable and uncommon, except the whitespace characters in the range `0x09`–`0x0D` inclusive. If these uncommon characters are treated like the non-ASCII characters, it gives room for encoding a length for subsequent special characters.

This is especially useful for UTF-8, where the non-ASCII characters are multiple bytes.
Take the example UTF-8 strings "μ", "μδ" and "😁":

The length code gives the length of the following bytes less 1.
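To make the "length less 1" rule concrete, here is a minimal sketch that frames a run of non-ASCII bytes as a single length code followed by the raw bytes, next to the current `0x00`-per-byte escaping. It assumes the length code is stored directly as the byte value `len - 1`. Which byte values would actually serve as length codes is not settled in this issue, so that mapping and both function names are hypothetical.

```python
def escape_current(raw: bytes) -> bytes:
    """Current scheme: prefix every byte of the run with 0x00."""
    out = bytearray()
    for b in raw:
        out += bytes([0x00, b])
    return bytes(out)


def escape_proposed(raw: bytes) -> bytes:
    """Proposed scheme (assumed layout): one length code, then the run.

    The code holds the number of following bytes less 1.
    """
    if not raw:
        raise ValueError("run must contain at least one byte")
    return bytes([len(raw) - 1]) + raw


for s in ("μ", "μδ", "😁"):
    raw = s.encode("utf-8")
    # Columns: string, UTF-8 length, current size, proposed size
    print(s, len(raw), len(escape_current(raw)), len(escape_proposed(raw)))
```

Under these assumptions "μ" goes from 4 bytes to 3, and "μδ" and "😁" each go from 8 bytes to 5.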