[Feature] Enable the capability to specify zstd and lz4 segment compression via config #14008
Codecov Report
Attention: Patch coverage is
@@             Coverage Diff              @@
##             master   #14008      +/-   ##
============================================
+ Coverage     61.75%   65.07%    +3.31%
- Complexity      207     1533     +1326
============================================
  Files          2436     2564      +128
  Lines        133233   140771     +7538
  Branches      20636    21609      +973
============================================
+ Hits          82274    91600     +9326
+ Misses        44911    42426     -2485
- Partials       6048     6745      +697
Flags with carried forward coverage won't be shown.
if (outputFile.getName().endsWith(supportedCompressorExtension)) {
  compressorName = COMPRESSOR_NAME_BY_FILE_EXTENSIONS.get(supportedCompressorExtension);
  createCompressedTarFile(inputFiles, outputFile, compressorName);
}
Should we break out of the loop inside this if block?
for (String supportedCompressorExtension : COMPRESSOR_NAME_BY_FILE_EXTENSIONS.keySet()) {
  if (outputFile.getName().endsWith(supportedCompressorExtension)) {
    compressorName = COMPRESSOR_NAME_BY_FILE_EXTENSIONS.get(supportedCompressorExtension);
    createCompressedTarFile(inputFiles, outputFile, compressorName);
We can move the common call createCompressedTarFile(inputFiles, outputFile, compressorName) to after the precondition check.
  createCompressedTarFile(inputFiles, outputFile, _defaultCompressorName);
} else {
  for (String supportedCompressorExtension : COMPRESSOR_NAME_BY_FILE_EXTENSIONS.keySet()) {
    if (outputFile.getName().endsWith(supportedCompressorExtension)) {
Can outputFile end with ".tar.compressed", which is not a supported compressor file extension?
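For reference, the suggestions in this thread (stop at the first matching extension, keep a single compression call after the check, and treat .tar.compressed separately) could be sketched roughly as below. This is a hypothetical, JDK-only illustration and not the code in this PR: the class name, the resolveCompressorName helper, the extension-to-codec mapping, and the choice to throw on an unknown extension are all assumptions.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: names mirror the diff under review, but the mapping,
// the helper method, and the error handling are assumptions for illustration.
class CompressorSelectionSketch {
  static final String TAR_COMPRESSED_FILE_EXTENSION = ".tar.compressed";

  // Assumed extension-to-codec mapping; the real map lives in the PR.
  static final Map<String, String> COMPRESSOR_NAME_BY_FILE_EXTENSIONS = new LinkedHashMap<>();
  static {
    COMPRESSOR_NAME_BY_FILE_EXTENSIONS.put(".tar.gz", "gzip");
    COMPRESSOR_NAME_BY_FILE_EXTENSIONS.put(".tar.zst", "zstd");
    COMPRESSOR_NAME_BY_FILE_EXTENSIONS.put(".tar.lz4", "lz4");
  }

  // Stand-in for the configurable default compressor.
  static String defaultCompressorName = "gzip";

  // Resolves the codec once, so the caller needs only one compression call.
  static String resolveCompressorName(String fileName) {
    // The generic extension always maps to the current default compressor.
    if (fileName.endsWith(TAR_COMPRESSED_FILE_EXTENSION)) {
      return defaultCompressorName;
    }
    for (Map.Entry<String, String> entry : COMPRESSOR_NAME_BY_FILE_EXTENSIONS.entrySet()) {
      if (fileName.endsWith(entry.getKey())) {
        // Returning here plays the role of the suggested break.
        return entry.getValue();
      }
    }
    // Assumption: an unrecognized extension is rejected rather than defaulted.
    throw new IllegalArgumentException("Unsupported tar extension: " + fileName);
  }

  public static void main(String[] args) {
    System.out.println(resolveCompressorName("segment_0.tar.zst"));
    System.out.println(resolveCompressorName("segment_0.tar.compressed"));
  }
}
```

With the codec resolved up front, the caller would make one createCompressedTarFile(inputFiles, outputFile, compressorName) call instead of one per branch.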
 * appropriate compressor at run-time based on the file's magic number irrespective of the file extension.
 * Compression uses the default compressor automatically if this generic extension is used.
 */
public static final String TAR_COMPRESSED_FILE_EXTENSION = ".tar.compressed";
Mostly good. We are able to compress and decompress different segment formats (tar.gz, tar.zst, tar.compressed) even when they appear in one table.
Curious: if we expose and update the value of _defaultCompressorName (line 86), how can we make sure existing .tar.compressed files can still be decompressed after the default compressor changes?
In other words, do we have a test covering the scenario where the default compressor is changed, to make sure existing .tar.compressed segments can still be decompressed?
The unit test is here in the Apache Commons library. You can name the file extension whatever you want, such as .tar.deemoliu, and you'd still be able to decompress the segment. Decompression does not rely on the file extension to figure out which compressor to use.
Sounds good, so we are using the first bytes to identify which decompressor to use, and the Apache library initializes the correct one for us.
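To make the "first bytes" point concrete, here is a minimal JDK-only sketch of magic-number sniffing. It is not how Apache Commons Compress is implemented and not code from this PR; the class and method names are hypothetical. The byte values are the leading magic bytes defined by the gzip, Zstandard frame, and LZ4 frame formats.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

// Hypothetical sketch: identifies a codec from leading bytes only,
// ignoring whatever file extension the data was stored under.
class MagicNumberSketch {
  // Leading bytes as they appear on disk for each format.
  private static final int[] GZIP_MAGIC = {0x1F, 0x8B};
  private static final int[] ZSTD_MAGIC = {0x28, 0xB5, 0x2F, 0xFD};
  private static final int[] LZ4_FRAME_MAGIC = {0x04, 0x22, 0x4D, 0x18};

  static String detectCodec(byte[] data) {
    if (startsWith(data, GZIP_MAGIC)) {
      return "gzip";
    }
    if (startsWith(data, ZSTD_MAGIC)) {
      return "zstd";
    }
    if (startsWith(data, LZ4_FRAME_MAGIC)) {
      return "lz4";
    }
    return "unknown";
  }

  private static boolean startsWith(byte[] data, int[] magic) {
    if (data.length < magic.length) {
      return false;
    }
    for (int i = 0; i < magic.length; i++) {
      // Mask to compare the byte as an unsigned value.
      if ((data[i] & 0xFF) != magic[i]) {
        return false;
      }
    }
    return true;
  }

  // Produces real gzip output with the JDK so the sniffer has something to inspect.
  static byte[] gzip(String text) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
      gz.write(text.getBytes(StandardCharsets.UTF_8));
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return out.toByteArray();
  }

  public static void main(String[] args) {
    // The bytes could be stored as .tar.compressed or .tar.deemoliu; detection is the same.
    System.out.println(detectCodec(gzip("segment bytes")));
  }
}
```

Because detection depends only on content, changing the default compressor later should not affect reading older .tar.compressed files, as long as the codec that wrote them writes one of these headers.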
Summary
This PR builds on #13782 by adding support for specifying segment compression using Zstandard and LZ4 via configuration. By default, segments are compressed with GZIP. However, users can configure the compression codec using the pinot.tar.compression.codec.name property. Currently, we support gzip, zstd, and lz4, but adding other codecs (e.g., LZMA or Snappy) can be done with a single-line change.
Note that this PR currently brings the capability only to a well-tested portion of the Pinot server code path: compression during mutable-to-immutable segment generation, and fetching segments from peers and deep store. The plan is to progressively enable the functionality on more Pinot components in the future.
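As a rough illustration of the configuration behavior described above, the sketch below resolves a codec name from a properties object and falls back to gzip when the key is absent. Only the property key and the three codec names come from this description; the class name, the resolveCodec helper, the case-insensitive matching, and the rejection of unknown values are assumptions, and the PR's actual wiring may differ.

```java
import java.util.Locale;
import java.util.Properties;
import java.util.Set;

// Hypothetical sketch of reading the codec setting; not the PR's implementation.
class CodecConfigSketch {
  static final String CODEC_KEY = "pinot.tar.compression.codec.name";
  static final String DEFAULT_CODEC = "gzip";
  static final Set<String> SUPPORTED_CODECS = Set.of("gzip", "zstd", "lz4");

  static String resolveCodec(Properties props) {
    String raw = props.getProperty(CODEC_KEY);
    // An unset or blank property keeps the GZIP default.
    if (raw == null || raw.trim().isEmpty()) {
      return DEFAULT_CODEC;
    }
    // Assumption: names are matched case-insensitively.
    String codec = raw.trim().toLowerCase(Locale.ROOT);
    // Assumption: unknown codec names fail fast instead of silently defaulting.
    if (!SUPPORTED_CODECS.contains(codec)) {
      throw new IllegalArgumentException("Unsupported codec: " + raw);
    }
    return codec;
  }

  public static void main(String[] args) {
    Properties props = new Properties();
    props.setProperty(CODEC_KEY, "zstd");
    System.out.println(resolveCodec(props));
  }
}
```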
Core Concepts
The key concept introduced in this PR is the .tar.compressed file extension, which replaces hard-coded extensions like .tar.gz, .tar.zst, .tar.lz4, etc. This prevents mismatches between hard-coded extensions during compression and decompression. When this extension is used, the default compressor (configurable at runtime, with GZIP as the default) is applied during compression. For decompression, widely used compressors (supported by the Apache Commons library) embed magic numbers at the start of the file, allowing Apache Commons and many other compression libraries to automatically detect the correct decompression method from the compressed content itself. Therefore, the file extension doesn’t matter during decompression, and .tar.compressed serves as a generic placeholder for tar archives compressed with any codec.
The rest of the PR revolves around this concept and makes the following general changes:
- Replace .tar.gz strings where appropriate, and references to the TarCompressionUtils.TAR_GZ_FILE_EXTENSION static variable, with TarCompressionUtils.TAR_COMPRESSED_FILE_EXTENSION.
- In BaseServerStarter.java, check whether pinot.tar.compression.codec.name is specified and, if so, set the default tar compression codec accordingly.
Compatibility
Note that this PR maintains backward compatibility with existing Pinot code: segments and configs generated by previous versions will work with the updated code. However, there is no forward compatibility, as older Pinot versions cannot handle the new .tar.compressed file extension.
Important Files
This PR touches a lot of files. The important source files to start the code review with are the following: