PARQUET-2373: Improve I/O performance with bloom_filter_length #1184

zhangjiashen · 2023-11-04T19:08:48Z

The spec change PARQUET-2257 added the bloom_filter_length field so that a reader can load the bloom filter in a single shot. This PR changes the reader to use the bloom_filter_length field to load the whole bloom filter (header and bitset) in one read, which improves I/O scheduling.
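For readers unfamiliar with the idea, here is a minimal sketch of the difference between the two read strategies. It is not the code in this PR: the real header is a Thrift-encoded BloomFilterHeader, whereas the sketch assumes a toy 4-byte length header, and every name in it (BloomFilterReadSketch, CountingFile, readBitsetTwoStep, readBitsetSingleShot, toyFile) is hypothetical.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

/** Illustrative only: a toy layout, NOT the real Thrift-encoded BloomFilterHeader. */
final class BloomFilterReadSketch {
    /** Assumption: a fixed-size header holding a 4-byte big-endian bitset length. */
    static final int TOY_HEADER_SIZE = 4;

    /** Wraps a byte array and counts positioned reads so the strategies can be compared. */
    static final class CountingFile {
        private final byte[] data;
        int reads = 0;

        CountingFile(byte[] data) {
            this.data = data;
        }

        byte[] read(long offset, int length) {
            reads++;
            return Arrays.copyOfRange(data, (int) offset, (int) offset + length);
        }
    }

    /** Length unknown: read the header first, then issue a second read for the bitset. */
    static byte[] readBitsetTwoStep(CountingFile f, long offset) {
        int numBytes = ByteBuffer.wrap(f.read(offset, TOY_HEADER_SIZE)).getInt();
        return f.read(offset + TOY_HEADER_SIZE, numBytes);
    }

    /** Total length (header + bitset) known: one read, then parse the header in memory. */
    static byte[] readBitsetSingleShot(CountingFile f, long offset, int totalLength) {
        byte[] block = f.read(offset, totalLength);
        int numBytes = ByteBuffer.wrap(block, 0, TOY_HEADER_SIZE).getInt();
        return Arrays.copyOfRange(block, TOY_HEADER_SIZE, TOY_HEADER_SIZE + numBytes);
    }

    /** Builds a toy file: zero padding, then the toy header, then the bitset. */
    static byte[] toyFile(int padding, byte[] bitset) {
        ByteBuffer buf = ByteBuffer.allocate(padding + TOY_HEADER_SIZE + bitset.length);
        buf.position(padding);
        buf.putInt(bitset.length).put(bitset);
        return buf.array();
    }

    public static void main(String[] args) {
        byte[] bitset = {1, 2, 3, 4, 5, 6, 7, 8};
        CountingFile f = new CountingFile(toyFile(16, bitset));

        byte[] twoStep = readBitsetTwoStep(f, 16);
        int twoStepReads = f.reads;

        f.reads = 0;
        byte[] singleShot = readBitsetSingleShot(f, 16, TOY_HEADER_SIZE + bitset.length);

        System.out.println("same bitset: " + Arrays.equals(twoStep, singleShot));
        System.out.println("reads: " + twoStepReads + " (two-step) vs " + f.reads + " (single shot)");
    }
}
```

In the sketch, the two-step path needs the header before it knows how many bytes to request next, so the two reads cannot be issued together. Once the total length is available from the metadata, a single request covers both parts, and it could in principle be scheduled alongside other reads.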

Jira

My PR addresses the following Parquet Jira issues and references them in the PR title. For example, "PARQUET-1234: My Parquet PR"
- https://issues.apache.org/jira/browse/PARQUET-2373
- relates to: https://issues.apache.org/jira/browse/PARQUET-2257, PARQUET-41: Add bloom filter #757

Tests

Covered by new unit tests and by existing tests in TestParquetFileWriter, TestParquetMetadataConverter, ParquetRewriterTest, etc.

Commits

My commits all reference Jira issues in their subject lines. In addition, my commits follow the guidelines from "How to write a good git commit message":
1. Subject is separated from body by a blank line
2. Subject is limited to 50 characters (not including Jira issue reference)
3. Subject does not end with a period
4. Subject uses the imperative mood ("add", not "adding")
5. Body wraps at 72 characters
6. Body explains "what" and "why", not "how"

Documentation

In case of new functionality, my PR adds documentation that describes how to use it.
- All public functions and classes in the PR have Javadoc that explains what they do

wgtmac

Thanks for the change! I have left some comments.

parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java

parquet-hadoop/src/main/java/org/apache/parquet/hadoop/metadata/ColumnChunkMetaData.java

wgtmac · 2023-11-24T02:41:43Z

@zhangjiashen This can be rebased to adopt parquet-format 2.10.0

zhangjiashen · 2023-11-27T06:36:44Z

> @zhangjiashen This can be rebased to adopt parquet-format 2.10.0

@wgtmac I just rebased onto the master branch. Could you please take a look when you get a chance?

mapleFU · 2023-11-27T06:58:48Z

FYI, I've updated a BloomFilter file with length for testing: apache/parquet-testing#43

wgtmac · 2023-12-05T15:59:05Z

@zhangjiashen This needs to be rebased again.

parquet-hadoop/src/test/java/org/apache/parquet/hadoop/TestDataIndexBloomEncodingStats.java

parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java

zhangjiashen changed the title ~~PARQUET-2373: Improve performance with bloom_filter_length~~ PARQUET-2373: Improve I/O performance with bloom_filter_length Nov 4, 2023

wgtmac requested changes Nov 5, 2023

zhangjiashen force-pushed the bloom_filter_length branch from 78e7941 to 9f2738f on November 27, 2023 06:34

zhangjiashen force-pushed the bloom_filter_length branch from 9f2738f to 83a9777 on December 9, 2023 21:47

wgtmac reviewed Dec 28, 2023

Improve performance with bloom_filter_length

2b1f135

zhangjiashen force-pushed the bloom_filter_length branch from c36e236 to 2b1f135 on January 13, 2024 21:54

wgtmac approved these changes Jan 14, 2024

wgtmac merged commit eface4d into apache:master Jan 27, 2024
9 checks passed
