Skip to content

Commit

Permalink
Fix read parquet for different encodings across row groups (facebooki…
Browse files Browse the repository at this point in the history
…ncubator#9129)

Summary:
When the parquet file contains a column that uses both plain and dictionary encodings across row groups, the output after filter can be wrong. The root cause is that `returnReaderNulls_` in `SelectiveColumnReader` may not be reset when the column encoding switches from dictionary encoding to plain encoding.

Here is the parquet metadata for the test file to reproduce this issue. `n_2` is dictionary encoded in row group 2 and is plain encoded in row group 3.

```
file:        file:/home/sparkuser/github/oap-project/velox/velox/dwio/parquet/tests/examples/different_encodings_with_filter.parquet
creator:     parquet-cpp-arrow version 15.0.0
extra:       ARROW:schema = /////9gAAAAQAAAAAAAKAAwABgAFAAgACgAAAAABBAAMAAAACAAIAAAABAAIAAAABAAAAAMAAABwAAAAMAAAAAQAAACs////AAABBRAAAAAYAAAABAAAAAAAAAADAAAAbl8yAAQABAAEAAAA1P///wAAAQIQAAAAFAAAAAQAAAAAAAAAAwAAAG5fMQDE////AAAAASAAAAAQABQACAAGAAcADAAAABAAEAAAAAAAAQIQAAAAHAAAAAQAAAAAAAAAAwAAAG5fMAAIAAwACAAHAAgAAAAAAAABIAAAAAAAAAA=

file schema: schema
--------------------------------------------------------------------------------
n_0:         OPTIONAL INT32 R:0 D:1
n_1:         OPTIONAL INT32 R:0 D:1
n_2:         OPTIONAL BINARY O:UTF8 R:0 D:1

row group 1: RC:5 TS:185 OFFSET:4
--------------------------------------------------------------------------------
n_0:          INT32 SNAPPY DO:0 FPO:4 SZ:72/71/0.99 VC:5 ENC:RLE,PLAIN ST:[min: 1, max: 5, num_nulls: 0]
n_1:          INT32 SNAPPY DO:0 FPO:143 SZ:65/71/1.09 VC:5 ENC:RLE,PLAIN ST:[min: 1, max: 2, num_nulls: 0]
n_2:          BINARY SNAPPY DO:0 FPO:276 SZ:45/43/0.96 VC:5 ENC:RLE,PLAIN ST:[min: A, max: B, num_nulls: 3]

row group 2: RC:5 TS:202 OFFSET:369
--------------------------------------------------------------------------------
n_0:          INT32 SNAPPY DO:0 FPO:369 SZ:72/71/0.99 VC:5 ENC:RLE,PLAIN ST:[min: 1, max: 5, num_nulls: 0]
n_1:          INT32 SNAPPY DO:0 FPO:509 SZ:65/71/1.09 VC:5 ENC:RLE,PLAIN ST:[min: 1, max: 2, num_nulls: 0]
n_2:          BINARY SNAPPY DO:642 FPO:668 SZ:64/60/0.94 VC:5 ENC:RLE,RLE_DICTIONARY,PLAIN ST:[min: A, max: B, num_nulls: 3]

row group 3: RC:5 TS:195 OFFSET:766
--------------------------------------------------------------------------------
n_0:          INT32 SNAPPY DO:0 FPO:766 SZ:73/71/0.97 VC:5 ENC:RLE,PLAIN ST:[min: 6, max: 10, num_nulls: 0]
n_1:          INT32 SNAPPY DO:0 FPO:907 SZ:72/71/0.99 VC:5 ENC:RLE,PLAIN ST:[min: 1, max: 3, num_nulls: 0]
n_2:          BINARY SNAPPY DO:0 FPO:1047 SZ:55/53/0.96 VC:5 ENC:RLE,PLAIN ST:[min: F, max: J, num_nulls: 1]

```

Pull Request resolved: facebookincubator#9129

Reviewed By: mbasmanova

Differential Revision: D55018054

Pulled By: Yuhta

fbshipit-source-id: 83dbab02575321930262628ea60cd4ad3e7b8892
  • Loading branch information
marin-ma authored and facebook-github-bot committed Mar 21, 2024
1 parent 86f12b7 commit 94f52ec
Show file tree
Hide file tree
Showing 3 changed files with 27 additions and 0 deletions.
1 change: 1 addition & 0 deletions velox/dwio/common/SelectiveColumnReader.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -111,6 +111,7 @@ void SelectiveColumnReader::prepareNulls(
}
}
}
returnReaderNulls_ = false;
if (resultNulls_ && resultNulls_->unique() &&
resultNulls_->capacity() >= bits::nbytes(numRows) + simd::kPadding) {
// Clear whole capacity because future uses could hit
Expand Down
Binary file not shown.
26 changes: 26 additions & 0 deletions velox/dwio/parquet/tests/reader/ParquetReaderTest.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -765,6 +765,32 @@ TEST_F(ParquetReaderTest, varcharFilters) {
"nation.parquet", rowType, std::move(filters), expected);
}

TEST_F(ParquetReaderTest, readDifferentEncodingsWithFilter) {
FilterMap filters;
filters.insert({"n_1", exec::equal(1)});
auto rowType = ROW({"n_0", "n_1", "n_2"}, {INTEGER(), INTEGER(), VARCHAR()});
auto expected = makeRowVector({
makeFlatVector<int32_t>({1, 2, 3, 4, 1, 2, 3, 4, 6, 9}),
makeFlatVector<int32_t>(10, [](auto /*row*/) { return 1; }),
makeNullableFlatVector<std::string>(
{"A",
"B",
std::nullopt,
std::nullopt,
"A",
"B",
std::nullopt,
std::nullopt,
"F",
std::nullopt}),
});
assertReadWithFilters(
"different_encodings_with_filter.parquet",
rowType,
std::move(filters),
expected);
}

// This test is to verify filterRowGroups() doesn't throw the fileOffset Velox
// check failure
TEST_F(ParquetReaderTest, filterRowGroups) {
Expand Down

0 comments on commit 94f52ec

Please sign in to comment.