deflate_compress: always generate complete Huffman codes
Fix compatibility with some DEFLATE decompressors that don't correctly
handle some edge cases in the DEFLATE specification.

Fixes #323
ebiggers committed Sep 10, 2023
1 parent 5753355 commit 6a7b5bb
Showing 1 changed file with 28 additions and 23 deletions.
51 changes: 28 additions & 23 deletions lib/deflate_compress.c
@@ -1340,31 +1340,36 @@ deflate_make_huffman_code(unsigned num_syms, unsigned max_codeword_len,
 	 * symbol and a nonzero frequency packed into a 32-bit integer.
 	 */
 
-	/*
-	 * Handle special cases where only 0 or 1 symbols were used (had nonzero
-	 * frequency).
-	 */
-
-	if (unlikely(num_used_syms == 0)) {
+	/* Handle the case where fewer than 2 symbols were used. */
+	if (unlikely(num_used_syms < 2)) {
 		/*
-		 * Code is empty. sort_symbols() already set all lengths to 0,
-		 * so there is nothing more to do.
-		 */
-		return;
-	}
-
-	if (unlikely(num_used_syms == 1)) {
-		/*
-		 * Only one symbol was used, so we only need one codeword. But
-		 * two codewords are needed to form the smallest complete
-		 * Huffman code, which uses codewords 0 and 1. Therefore, we
-		 * choose another symbol to which to assign a codeword. We use
-		 * 0 (if the used symbol is not 0) or 1 (if the used symbol is
-		 * 0). In either case, the lesser-valued symbol must be
-		 * assigned codeword 0 so that the resulting code is canonical.
+		 * The DEFLATE RFC allows the offset code to contain fewer than
+		 * 2 codewords. The format itself can send 1 or more offset
+		 * codeword lengths. In the case where only 1 is sent, the RFC
+		 * explicitly allows the codeword to be of length 0 (to send an
+		 * empty code) or length 1 (to send a code with a single
+		 * codeword). The RFC does not explicitly say whether the
+		 * litlen and pre codes can be incomplete in a similar way, even
+		 * though an empty block uses only 1 litlen symbol.
+		 *
+		 * Regardless, some DEFLATE decompressors have a bug where they
+		 * don't support these cases and require all Huffman codes to
+		 * contain at least 2 codewords. This bug existed in zlib
+		 * 1.2.1 and earlier, and it apparently still exists in Windows
+		 * Explorer (probably due to forking zlib and never updating
+		 * it). See https://github.com/ebiggers/libdeflate/issues/323.
+		 *
+		 * Other DEFLATE encoders, including zlib's, always send at
+		 * least 2 codewords in order to make a complete code.
+		 * Therefore, this is a case where practice does not entirely
+		 * match the specification. We follow practice by
+		 * sending 2 codewords of length 1: codeword '0' for symbol 0,
+		 * and codeword '1' for another symbol -- the used symbol if it
+		 * exists and is not symbol 0, otherwise symbol 1. This does
+		 * worsen the compression ratio, but only by a very tiny amount,
+		 * and only when 'num_used_syms < 2', which is very rare anyway.
 		 */
-
-		unsigned sym = A[0] & SYMBOL_MASK;
+		unsigned sym = num_used_syms ? (A[0] & SYMBOL_MASK) : 0;
 		unsigned nonzero_idx = sym ? sym : 1;
 
 		codewords[0] = 0;
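
The symbol-selection rule described in the new comment can be tried in isolation. The following is a minimal standalone sketch, not the libdeflate code itself: the function name make_two_codeword_code() and its parameters are hypothetical, and the used symbol is passed in directly instead of being unpacked from A[0] with SYMBOL_MASK.

```c
#include <assert.h>

/*
 * Hypothetical helper (not part of libdeflate): build the 2-codeword code
 * used when fewer than 2 symbols have nonzero frequency.  'used_sym' is
 * ignored when num_used_syms == 0.
 */
static void
make_two_codeword_code(unsigned num_used_syms, unsigned used_sym,
		       unsigned char lens[], unsigned codewords[],
		       unsigned num_syms)
{
	/* With no used symbol, fall back to symbol 0. */
	unsigned sym = num_used_syms ? used_sym : 0;
	/* The second codeword goes to the used symbol, or to 1 if that is 0. */
	unsigned nonzero_idx = sym ? sym : 1;
	unsigned i;

	assert(num_used_syms < 2);
	assert(num_syms >= 2 && nonzero_idx < num_syms);

	/* Every symbol starts out without a codeword. */
	for (i = 0; i < num_syms; i++) {
		lens[i] = 0;
		codewords[i] = 0;
	}

	/* Symbol 0 takes codeword '0' so that the code stays canonical. */
	codewords[0] = 0;
	lens[0] = 1;
	codewords[nonzero_idx] = 1;
	lens[nonzero_idx] = 1;
}
```

Under these assumptions, a call with no used symbols or with symbol 0 as the only used symbol assigns length-1 codewords to symbols 0 and 1, while a call with symbol 3 as the only used symbol assigns them to symbols 0 and 3.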

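For readers unfamiliar with the term, a Huffman code is "complete" when its codeword lengths use up the whole code space, i.e. the Kraft sum of 2^-len over all used symbols equals exactly 1. The check below is an illustrative sketch of that definition only; huffman_code_is_complete() and MAX_LEN are hypothetical names, and this is not how any particular decompressor is known to validate its input.

```c
#include <assert.h>
#include <stdbool.h>

/* DEFLATE codeword lengths are at most 15 bits. */
#define MAX_LEN 15

/*
 * Hypothetical helper: return true if the codeword lengths form a complete
 * prefix code.  A length of 0 means the symbol has no codeword.
 */
static bool
huffman_code_is_complete(const unsigned char lens[], unsigned num_syms)
{
	unsigned long sum = 0;
	unsigned i;

	for (i = 0; i < num_syms; i++) {
		assert(lens[i] <= MAX_LEN);
		/* Each codeword of length n covers 2^(MAX_LEN - n) slots. */
		if (lens[i] != 0)
			sum += 1UL << (MAX_LEN - lens[i]);
	}
	/* Complete means every one of the 2^MAX_LEN slots is covered. */
	return sum == (1UL << MAX_LEN);
}
```

With this definition, a code containing a single length-1 codeword covers only half of the code space and is incomplete, whereas the two length-1 codewords sent after this commit form the smallest complete code.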