Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Feature] Enable the capability to specify zstd and lz4 segment compression via config #14008

Merged
merged 4 commits into from
Oct 21, 2024

Conversation

jackluo923
Copy link
Contributor

@jackluo923 jackluo923 commented Sep 16, 2024

Summary

This PR builds on #13782 by adding support for specifying segment compression using Zstandard and LZ4 via configuration. By default, segments are compressed with GZIP. However, users can configure the compression codec using the pinot.tar.compression.codec.name property. Currently, we support gzip, zstd, lz4, but adding other codecs (e.g., LZMA or Snappy) can be done with a single-line change.

Note that currently, this PR only brings the capability to a portion of well-tested Pinot server code path only. Specifically, compression during mutable to immutable segment generation and fetching segments from peer & deepstore. In the future, there is plan to progressively enable the functionality on more Pinot components.

Core Concepts

The key concept introduced in this PR is the .tar.compressed file extension, replacing hard-coded extensions like .tar.gz, .tar.zst, .tar.lz4, etc. This is especially useful and convenient to prevent mismatch of various hard-coded extensions during compression and decompression. When this extension is used, the default compressor (configurable at runtime, with GZIP as the default) is applied during compression. For decompression, widely used compressors (supported by the Apache Commons library) embed magic numbers at the file start, allowing Apache commons and many other compression libraries to automatic detect the correct decompression method from the compressed content itself. Therefore, the file extension doesn’t matter during decompression, and .tar.compressed serves as a generic placeholder for tar archives compressed with any codec.

The rest of the PR revolve around this concept and makes the following general changes:

  1. Change hard-coded .tar.gz strings when appropriate and references to TarCompressionUtils.TAR_GZ_FILE_EXTENSION static variable to TarCompressionUtils.TAR_COMPRESSED_FILE_EXTENSION
  2. Check if pinot.tar.compression.codec.name is specified in BaseServerStarter.java and if so set the default Tar compression codec accordingly.

Compatibility

Note that this PR maintains backward compatibility with existing Pinot code—segments and configs generated by previous versions will work with the updated code. However, there is no forward compatibility, as older Pinot versions cannot handle the new .tar.compressed file extension.

Important Files

This PR touches a lot of files. Important source code files to start code review is the following:

  1. TarCompressionUtils.java
  2. CommonConstants.java

@codecov-commenter
Copy link

codecov-commenter commented Sep 16, 2024

Codecov Report

Attention: Patch coverage is 43.47826% with 13 lines in your changes missing coverage. Please review.

Project coverage is 65.07%. Comparing base (59551e4) to head (1022788).
Report is 1079 commits behind head on master.

Files with missing lines Patch % Lines
...apache/pinot/common/utils/TarCompressionUtils.java 56.25% 6 Missing and 1 partial ⚠️
.../pinot/server/starter/helix/BaseServerStarter.java 0.00% 4 Missing ⚠️
.../pinot/core/data/manager/BaseTableDataManager.java 50.00% 1 Missing ⚠️
...a/manager/realtime/RealtimeSegmentDataManager.java 0.00% 1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #14008      +/-   ##
============================================
+ Coverage     61.75%   65.07%   +3.31%     
- Complexity      207     1533    +1326     
============================================
  Files          2436     2564     +128     
  Lines        133233   140771    +7538     
  Branches      20636    21609     +973     
============================================
+ Hits          82274    91600    +9326     
+ Misses        44911    42426    -2485     
- Partials       6048     6745     +697     
Flag Coverage Δ
custom-integration1 100.00% <ø> (+99.99%) ⬆️
integration 100.00% <ø> (+99.99%) ⬆️
integration1 100.00% <ø> (+99.99%) ⬆️
integration2 0.00% <ø> (ø)
java-11 64.98% <43.47%> (+3.27%) ⬆️
java-21 64.95% <43.47%> (+3.33%) ⬆️
skip-bytebuffers-false 65.00% <43.47%> (+3.25%) ⬆️
skip-bytebuffers-true 64.93% <43.47%> (+37.20%) ⬆️
temurin 65.07% <43.47%> (+3.31%) ⬆️
unittests 65.06% <43.47%> (+3.31%) ⬆️
unittests1 56.51% <52.63%> (+9.62%) ⬆️
unittests2 35.06% <26.08%> (+7.33%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

if (outputFile.getName().endsWith(supportedCompressorExtension)) {
compressorName = COMPRESSOR_NAME_BY_FILE_EXTENSIONS.get(supportedCompressorExtension);
createCompressedTarFile(inputFiles, outputFile, compressorName);
}
Copy link
Contributor

@deemoliu deemoliu Sep 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

break in if loop?

for (String supportedCompressorExtension : COMPRESSOR_NAME_BY_FILE_EXTENSIONS.keySet()) {
if (outputFile.getName().endsWith(supportedCompressorExtension)) {
compressorName = COMPRESSOR_NAME_BY_FILE_EXTENSIONS.get(supportedCompressorExtension);
createCompressedTarFile(inputFiles, outputFile, compressorName);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can move common code createCompressedTarFile(inputFiles, outputFile, compressorName) after precondition check.

createCompressedTarFile(inputFiles, outputFile, _defaultCompressorName);
} else {
for (String supportedCompressorExtension : COMPRESSOR_NAME_BY_FILE_EXTENSIONS.keySet()) {
if (outputFile.getName().endsWith(supportedCompressorExtension)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can outputFile endswith ".tar.compressed" which is not a supported compressor file extension?

* appropriate compressor at run-time based on the file's magic number irrespective of the file extension.
* Compression uses the default compressor automatically if this generic extension is used.
*/
public static final String TAR_COMPRESSED_FILE_EXTENSION = ".tar.compressed";
Copy link
Contributor

@deemoliu deemoliu Oct 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mostly good. we are able to compress and decompress different segment format (tar.gz, tar.zst, tar.compressed) even they appear in one table.

Curious, if we exposed and updated the value of _defaultCompressorName (line 86), how can we make sure the .tar.compressed files can still be decompress by updated compressor?

in another word, do we have test covered the scenarios for changing the default compressor and make sure the existing segments with (.tar.compressed) can be decompressed?

Copy link
Contributor Author

@jackluo923 jackluo923 Oct 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The unit test is here in Apache commons library. You can name the file extension to whatever you want, such as .tar.deemoliu and you'd still be able to decompress the segment. Decompression does not rely on the file extension to figure out the compressor to use for decompression.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sounds good, so we using the first bytes to identify which decompressor to use. the apache library initializes the correct compressor for us.

@deemoliu deemoliu changed the title [WIP] [Feature] Enable the capability to specify zstd and lz4 segment compression via config [Feature] Enable the capability to specify zstd and lz4 segment compression via config Oct 16, 2024
@deemoliu deemoliu merged commit 1f8fd63 into apache:master Oct 21, 2024
21 checks passed
@deemoliu deemoliu added the incompatible Indicate PR that introduces backward incompatibility label Oct 31, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
incompatible Indicate PR that introduces backward incompatibility ingestion
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants