FemtoZipCompressionModel.encodeSubstring encodes a four-byte offset #5

ehrmann · 2012-12-14T19:42:57Z

Why does this method encode a four-byte offset when offset is only two bytes?

public void encodeSubstring(int offset, int length, Object context) {
        ...
        offset = -offset;
        if (offset < 1 || offset > (2<<15)-1) {
            throw new IllegalArgumentException("Offset " + offset + " out of range [1, 65535]");
        }
        encoder.encodeSymbol(offset & 0xf);
        encoder.encodeSymbol((offset >> 4) & 0xf);
        encoder.encodeSymbol((offset >> 8) & 0xf);
        encoder.encodeSymbol((offset >> 12) & 0xf);
        ...
}

The text was updated successfully, but these errors were encountered:

gtoubassi · 2012-12-14T21:52:43Z

Hi Dave! Thanks for the question. It is encoding the 2 byte offset as 4 symbols, one per nibble (half byte). The encoder creates a huffman coding table for each nibble, and it turns out that scheme is more efficient (at least experimentally determined by me) then encoding it as 2 single byte values, each with its own huffman table. I believe the reason for this is that most of the length values fit in a nibble, so separating that high nibble out which is often a zero is high value.

So in short, the values are not written out directly, they are written as symbols compressed with huffman coding, and you have to make a decision as to how to map an integer into a huffman encoding (do you have 1 huffman table for the entire 4 bytes? break it into 2 bytes and have 2 tables, 1 for each part? Break it into 4 parts and have 4 tables)?

Please let me know if that doesn't make sense and I can explain further.

ehrmann · 2012-12-17T00:09:57Z

I think that mostly makes sense, but I'm still a little surprised that huffman encoding beats a varint.

Have you gotten a chance to look at some of my pull requests? I think I had a few performance tweaks, a build fix, and a bug fix for the Java version.

gtoubassi · 2012-12-26T16:34:00Z

I assume when you say varint you are referring to an ad hoc encoding that tries to encode a small number as a single byte, and then uses a sentinel value to indicate that the subsequent bytes represent the full number? Is there a specific implementation you have in mind that I could contemplate? One thing to keep in mind is that the numbers being encoded are often very small (like < 16). So I guess the basic question is what is the best variable encoding for numbers that are often very very small but can range up to 65k. Huffman encoding should always beat it since its a similar encoding, but optimized for the data set at hand. In fact assuming you have suitable sample data to generate the model, the huffman coding is provably optimal for a symbol to symbol encoding (vs say arithmetic coding, which I found empirically to not be much better then huffman coding and comes with significant performance cost and lots of IP fud).

Perhaps whats more surprising is that I found breaking the "short" into 4 encoded nibbles was best.

Regarding your pull requests. All the changes look good to me although I was wondering about the intro of the "Match" object (which I commented on). I'm inclined to believe you but for such a change want to know there was something empirical backing it.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

FemtoZipCompressionModel.encodeSubstring encodes a four-byte offset #5

FemtoZipCompressionModel.encodeSubstring encodes a four-byte offset #5

ehrmann commented Dec 14, 2012

gtoubassi commented Dec 14, 2012

ehrmann commented Dec 17, 2012

gtoubassi commented Dec 26, 2012

FemtoZipCompressionModel.encodeSubstring encodes a four-byte offset #5

FemtoZipCompressionModel.encodeSubstring encodes a four-byte offset #5

Comments

ehrmann commented Dec 14, 2012

gtoubassi commented Dec 14, 2012

ehrmann commented Dec 17, 2012

gtoubassi commented Dec 26, 2012