PARQUET-2373: Improve I/O performance with bloom_filter_length #1184

Merged 1 commit into apache:master on Jan 27, 2024

Conversation

zhangjiashen
Contributor

@zhangjiashen zhangjiashen commented Nov 4, 2023

The spec change in PARQUET-2257 added the bloom_filter_length field so that a reader can load a bloom filter in a single read. This PR changes the reader to use 'bloom_filter_length' to load the whole bloom filter (header and bitset) at once, which improves I/O scheduling.
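To illustrate the idea (this is a minimal sketch, not the code from this PR), the example below compares the two read paths on an in-memory "file". Everything in it is hypothetical: `CountingSource`, `readBitset`, `buildToyFile`, and the 4-byte toy header are stand-ins invented for illustration. The real BloomFilterHeader is Thrift-encoded and has no fixed size, which is why a reader without `bloom_filter_length` cannot know up front how many bytes to fetch.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

public class BloomFilterReadSketch {
    // ASSUMPTION: toy header = 4-byte big-endian bitset size.
    // The real header is a Thrift-encoded BloomFilterHeader of variable size.
    static final int TOY_HEADER_SIZE = 4;

    /** Hypothetical in-memory "file" that counts how many reads are issued. */
    static class CountingSource {
        final byte[] data;
        int reads = 0;

        CountingSource(byte[] data) {
            this.data = data;
        }

        byte[] read(long offset, int length) {
            reads++;
            return Arrays.copyOfRange(data, (int) offset, (int) offset + length);
        }
    }

    /**
     * Returns the bloom filter bitset stored at the given offset.
     * A negative length means bloom_filter_length is absent from the metadata.
     */
    static byte[] readBitset(CountingSource src, long offset, int length) {
        if (length >= 0) {
            // Length known: fetch header and bitset together in one read.
            byte[] all = src.read(offset, length);
            int numBytes = ByteBuffer.wrap(all, 0, TOY_HEADER_SIZE).getInt();
            return Arrays.copyOfRange(all, TOY_HEADER_SIZE, TOY_HEADER_SIZE + numBytes);
        }
        // Length unknown: read the header first, then the bitset it describes.
        byte[] header = src.read(offset, TOY_HEADER_SIZE);
        int numBytes = ByteBuffer.wrap(header).getInt();
        return src.read(offset + TOY_HEADER_SIZE, numBytes);
    }

    /** Builds a toy file with the header and bitset placed at the given offset. */
    static byte[] buildToyFile(int offset, byte[] bitset) {
        ByteBuffer buf = ByteBuffer.allocate(offset + TOY_HEADER_SIZE + bitset.length);
        buf.position(offset);
        buf.putInt(bitset.length);
        buf.put(bitset);
        return buf.array();
    }

    public static void main(String[] args) {
        byte[] bitset = {1, 2, 3, 4, 5, 6, 7, 8};
        byte[] file = buildToyFile(16, bitset);

        CountingSource withLength = new CountingSource(file);
        readBitset(withLength, 16, TOY_HEADER_SIZE + bitset.length);
        System.out.println("with length: " + withLength.reads + " read(s)");

        CountingSource withoutLength = new CountingSource(file);
        readBitset(withoutLength, 16, -1);
        System.out.println("without length: " + withoutLength.reads + " read(s)");
    }
}
```

Both paths return the same bitset; the difference is only in the number of reads issued, which matters most when each read is a remote request.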

Jira

Tests

  • Unit tests and existing tests in TestParquetFileWriter, TestParquetMetadataConverter, ParquetRewriterTest etc.

Commits

  • My commits all reference Jira issues in their subject lines. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Documentation

  • In case of new functionality, my PR adds documentation that describes how to use it.
    • All the public functions and classes in the PR contain Javadoc that explains what they do

@zhangjiashen zhangjiashen changed the title PARQUET-2373: Improve performance with bloom_filter_length PARQUET-2373: Improve I/O performance with bloom_filter_length Nov 4, 2023
Member

@wgtmac wgtmac left a comment

Thanks for the change! I have left some comments.

@wgtmac
Member

wgtmac commented Nov 24, 2023

@zhangjiashen This can be rebased to adopt parquet-format 2.10.0

@zhangjiashen
Contributor Author

> @zhangjiashen This can be rebased to adopt parquet-format 2.10.0

@wgtmac I just rebased onto the master branch. Could you please take a look when you get a chance?

@mapleFU
Member

mapleFU commented Nov 27, 2023

FYI, I've updated a BloomFilter with length for testing: apache/parquet-testing#43

@wgtmac
Member

wgtmac commented Dec 5, 2023

@zhangjiashen This needs to be rebased again.

@wgtmac wgtmac merged commit eface4d into apache:master Jan 27, 2024
9 checks passed