Update Zstd decompression for "unknown decompressed size" when streaming API was used for compression #116

mkitti opened this issue May 18, 2024 · 4 comments


mkitti commented May 18, 2024

Introduction

The Zstandard plugin for HDF5 should be modified to allow for an unknown decompressed size in the frame header.

Currently, the Zstd decompression scheme, following the original implementation, uses ZSTD_getDecompressedSize to obtain the size of the decompressed buffer. The returned value is not validated and is passed directly to malloc:

size_t decompSize = ZSTD_getDecompressedSize(*buf, origSize);
if (NULL == (outbuf = malloc(decompSize)))

ZSTD_getDecompressedSize returns 0 if the decompressed data is empty, the size is unknown, or an error has occurred. If malloc is asked to allocate 0 bytes, it may return NULL, which the filter then treats as an error condition. This is an incorrect result when the decompressed data is actually empty or the size is merely unknown and no real error has occurred. ZSTD_getDecompressedSize is obsolete.

ZSTD_getFrameContentSize should replace the use of ZSTD_getDecompressedSize. ZSTD_getFrameContentSize distinguishes between the empty, unknown, and error states: the unknown and error states are indicated by return values of ZSTD_CONTENTSIZE_UNKNOWN and ZSTD_CONTENTSIZE_ERROR, respectively.
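A minimal sketch of what the replacement check could look like, reusing buf, origSize, and outbuf from the snippet above (the exact error-handling convention here is an assumption for illustration, not the filter's actual code):

unsigned long long contentSize = ZSTD_getFrameContentSize(*buf, origSize);
if (ZSTD_CONTENTSIZE_ERROR == contentSize)
    return 0;                        /* not a valid Zstandard frame */
if (ZSTD_CONTENTSIZE_UNKNOWN == contentSize) {
    /* size not stored in the frame header; fall back to streaming
       decompression (see the sketch under Tasks below) */
} else if (0 == contentSize) {
    /* legitimately empty payload, not an error */
} else {
    if (NULL == (outbuf = malloc((size_t)contentSize)))
        return 0;                    /* genuine allocation failure */
}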

The unknown-size state is common. It occurs when compression is done via the streaming API, i.e. ZSTD_compressStream or ZSTD_compressStream2. ZSTD_compressStream2 in particular only stores the frame content size when either ZSTD_e_end is provided on the initial call or ZSTD_CCtx_setPledgedSrcSize is used beforehand.
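For illustration, here is a sketch of how such a frame can arise; src, srcSize, dst, and dstCapacity are hypothetical, and error checks are omitted for brevity:

ZSTD_CCtx *cctx = ZSTD_createCCtx();
/* No ZSTD_CCtx_setPledgedSrcSize() call and no single-shot ZSTD_e_end,
   so the frame header will not record the content size and decoders
   will later see ZSTD_CONTENTSIZE_UNKNOWN. */
ZSTD_inBuffer  in  = { src, srcSize, 0 };
ZSTD_outBuffer out = { dst, dstCapacity, 0 };
while (in.pos < in.size)
    ZSTD_compressStream2(cctx, &out, &in, ZSTD_e_continue);
while (ZSTD_compressStream2(cctx, &out, &in, ZSTD_e_end) != 0)
    ;                                /* flush until the frame is done */
ZSTD_freeCCtx(cctx);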

Tasks

  • Use ZSTD_getFrameContentSize instead of the obsolete ZSTD_getDecompressedSize to correctly distinguish between the empty, unknown, and error states when determining the decompressed size.
  • Recognize the unknown-size state by checking the return value of ZSTD_getFrameContentSize against ZSTD_CONTENTSIZE_UNKNOWN.
  • Address the unknown-size state by growing the output buffer as needed or by using the streaming decompression API via ZSTD_decompressStream (see the sketch after this list).
    • The HDF5 library should know the expected number of bytes for a chunk; we should not be relying on the filter to figure this out.
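
A minimal sketch of the third task, assuming a single Zstandard frame whose header reports ZSTD_CONTENTSIZE_UNKNOWN; the function and variable names are illustrative, not proposed filter code:

#include <stdlib.h>
#include <zstd.h>

static void *decompress_unknown_size(const void *src, size_t srcSize, size_t *dstSize)
{
    ZSTD_DStream *dctx = ZSTD_createDStream();
    size_t cap = ZSTD_DStreamOutSize();     /* recommended output step */
    char *dst = malloc(cap);
    ZSTD_inBuffer  in  = { src, srcSize, 0 };
    ZSTD_outBuffer out = { dst, cap, 0 };
    while (dst != NULL && in.pos < in.size) {
        if (out.pos == out.size) {          /* output full: grow it */
            char *tmp = realloc(dst, cap * 2);
            if (NULL == tmp) { free(dst); dst = NULL; break; }
            dst = tmp; cap *= 2;
            out.dst = dst; out.size = cap;
        }
        size_t ret = ZSTD_decompressStream(dctx, &out, &in);
        if (ZSTD_isError(ret)) { free(dst); dst = NULL; break; }
    }
    ZSTD_freeDStream(dctx);
    *dstSize = (dst != NULL) ? out.pos : 0;
    return dst;                             /* caller frees; NULL on failure */
}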

References

[1] https://facebook.github.io/zstd/zstd_manual.html

@derobins derobins added this to the HDF5 1.14.5 Release milestone Aug 15, 2024
@derobins derobins added the Type - Improvement and Priority - 1. High labels Aug 15, 2024

byrnHDF commented Nov 18, 2024

The ZSTD_getFrameContentSize change has been implemented, but it does not check for ZSTD_CONTENTSIZE_UNKNOWN.


mkitti commented Nov 18, 2024

@byrnHDF If it is helpful, I recently implemented ZSTD_getFrameContentSize with a check for ZSTD_CONTENTSIZE_UNKNOWN in C++ for numcodecs.js via WASM. The code is mostly just plain C though:

https://github.com/manzt/numcodecs.js/blob/fa8a86065ce42ca024ecee8d376a440bdd304ac1/codecs/zstd/zstd_codec.cpp#L52-L128


byrnHDF commented Nov 18, 2024

That does help to see the logic. However, wouldn't this only be a problem if someone created an HDF5 file outside of the HDF5 library? Because the HDF5 filter wouldn't be using the streaming version?


mkitti commented Nov 18, 2024

The HDF5 C library does offer multiple APIs to read and write raw chunks. In particular, one could write a raw chunk via H5Dwrite_chunk. In this case, a third party could employ Zstandard compression, perhaps even in parallel across multiple threads, and then use H5Dwrite_chunk to write the chunks from a single thread.
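For concreteness, a sketch of that path, assuming dset_id is an open chunked dataset whose creation property list registered the Zstd filter, and that compressed/compressedSize hold a frame produced by an external (possibly streaming) compressor:

#include <hdf5.h>

hsize_t offset[2] = { 0, 0 };     /* logical coordinates of the chunk */
/* Filter mask 0 records that all pipeline filters were applied, even
   though the library never ran them; the bytes are written verbatim. */
herr_t status = H5Dwrite_chunk(dset_id, H5P_DEFAULT, 0,
                               offset, compressedSize, compressed);

If the external compressor used the streaming API without pledging a source size, the stored frame will report ZSTD_CONTENTSIZE_UNKNOWN when this plugin later decompresses it.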

There are multiple instances where this approach is being taken by those interested in accelerated compression:

  1. @FrancescAlted, for example, has presented on using a multithreaded, parallel approach for Blosc2:
    https://www.hdfgroup.org/wp-content/uploads/2022/05/Blosc2-and-HDF5-European-HUG2022.pdf
  2. @ImarisRyder has used this approach in ImarisWriter for GZIP and LZ4 compression:
    https://github.com/imaris/ImarisWriter/blob/b128e6e7d1a147261e9d5caf24ebc6b5c9c63779/writer/bpWriterHDF5.cxx#L541
  3. The Julia package JLD2.jl writes HDF5 files independently of the C library, including Zstandard compression:
    Add Zstd compression JuliaIO/JLD2.jl#560

Thus, we cannot assume that all chunks in an HDF5 container were compressed by the compression routine present in this repository. It is possible that streaming compression was used to encode a chunk and that the decompressed size was not encoded in the frame header.

I have encountered this specific issue in real-world data when working with Zarr or N5 datasets. I anticipate that it may occur with HDF5 datasets at some point, and I hope that this reference implementation will be robust enough to deal with any Zstandard-compressed data, given the format's increasing ubiquity.

By the way, here are some additional examples in the Zstandard repository itself:
https://github.com/facebook/zstd/blob/dev/examples/streaming_memory_usage.c
https://github.com/facebook/zstd/blob/dev/examples/streaming_decompression.c
