
Proposal for better compression with non-ascii characters #11

Open
grignaak opened this issue Mar 14, 2015 · 1 comment

@grignaak (Contributor)

shoco_decompress can take a null-terminated ASCII input string and return the same string. If that invariant can be relaxed a little, I see an opportunity to handle Unicode characters better.

The current scheme encodes non-ASCII characters by prefixing each byte with the 0x00 code, which doubles the length of the character.

But most bytes below 0x20 are non-printable and uncommon—except the whitespace characters in the range 0x09–0x0D inclusive. If those uncommon bytes are treated the same way as non-ASCII bytes, the freed-up codes can encode a length for a run of subsequent special bytes.

This is especially useful for UTF-8, where non-ASCII characters span multiple bytes.

Take the example UTF-8 strings "μ", "μδ" and "😁":

μ
Decoded_both....... cebc
Encoded_current.... 00ce 00bc
Encoded_proposed... 01ce bc

μδ
Decoded_both....... cebc ceb4
Encoded_current.... 00ce 00bc 00ce 00b4
Encoded_proposed... 03ce bcce b4

😁
Decoded_both....... f09f 9881
Encoded_current.... 00f0 009f 0098 0081
Encoded_proposed... 03f0 9f98 81

The length code gives the number of raw bytes that follow, less one.

@Ed-von-Schleck (Owner)

Hello Michael,

that is correct, and if I recall correctly, I considered
something like that at some point. However, I haven't used shoco for
non-ASCII text – actually, I haven't really used it for
anything useful so far :) – and I rather like the fact that any
ASCII string is valid shoco-encoded text. Still, you are
probably right that this little property isn't really useful for
anyone.

Do you have a patch? I'd be happy to review it!

Cheers,
Christian

P.S. I'll answer your other mails shortly!

On Sat, 14 Mar 2015 at 4:15, Michael Deardeuff
[email protected] wrote:

