
Proposal for better compression with non-ascii characters #11

Open
grignaak opened this issue Mar 14, 2015 · 1 comment

@grignaak (Contributor)

shoco_decompress can take a null-terminated ASCII input string and return the same string. If that invariant can be relaxed a little, I see an opportunity to handle Unicode characters better.

The current scheme encodes non-ASCII characters by prefixing each byte with the 0x00 code, which doubles the length of the character.

But most bytes below 0x20 are non-printable and uncommon—except the whitespace characters in the range 0x09–0x0D inclusive. If those uncommon bytes are treated the same way as non-ASCII bytes, the freed-up codes can encode a length for a run of subsequent special bytes.

This is especially useful for UTF-8, where non-ASCII characters span multiple bytes.

Take the example UTF-8 strings "μ", "μδ" and "😁":

μ
Decoded_both....... cebc
Encoded_current.... 00ce 00bc
Encoded_proposed... 01ce bc

μδ
Decoded_both....... cebc ceb4
Encoded_current.... 00ce 00bc 00ce 00b4
Encoded_proposed... 03ce bcce b4

😁
Decoded_both....... f09f 9881
Encoded_current.... 00f0 009f 0098 0081
Encoded_proposed... 03f0 9f98 81

The length code gives the number of raw bytes that follow, less one.

@Ed-von-Schleck (Owner)

Hello Michael,

that is correct, and if I recall correctly, I considered
something like that at some point. However, I haven't used shoco for
non-ASCII text – actually, I haven't really used it for
anything useful so far :) – and I rather like the fact that any
ASCII string is valid shoco-encoded text. Still, you are
probably right that this little property isn't really useful for
anyone.

Do you have a patch? I'd be happy to review it!

Cheers,
Christian

P.S. I'll answer your other mails shortly!

On Sat, 14 Mar 2015 at 4:15, Michael Deardeuff
[email protected] wrote:

