[Enhancement]parquet reader supports low cardinality optimization #55167

zombee0 · 2025-01-16T12:07:18Z

Why I'm doing:

What I'm doing:

Fixes #issue

What type of PR is this:

Does this PR entail a change in behavior?

Yes, this PR will result in a change in behavior.
No, this PR will not result in a change in behavior.

If yes, please specify the type of change:

Interface/UI changes: syntax, type conversion, expression evaluation, display information
Parameter changes: default values, similar parameters but with different default values
Policy changes: use new policy to replace old one, functionality automatically enabled
Feature removed
Miscellaneous: upgrade & downgrade compatibility, etc.

Checklist:

I have added test cases for my bug fix or my new feature
This pr needs user documentation (for new or modified features or behaviors)
- I have added documentation for my new feature or new function
This is a backport pr

Bugfix cherry-pick branch check:

Signed-off-by: zombee0 <[email protected]>

trueeyu · 2025-01-17T06:11:34Z

be/src/formats/parquet/scalar_column_reader.cpp

+
+    if (_tmp_column == nullptr) {
+        _tmp_column = ColumnHelper::create_column(TypeDescriptor::from_logical_type(TYPE_VARCHAR), true);
+    }


Use TYPE_VARCHAR_DESC instead of TypeDescriptor::from_logical_type(TYPE_VARCHAR)?

Will do.

trueeyu · 2025-01-17T06:12:42Z

be/src/formats/parquet/scalar_column_reader.cpp

+        nullable_dst->set_has_null(nullable_codes->has_null());
+    }
+
+    src->reset_column();


Why reset_column?

src is used as a temp column to store the intermediate result. Calling fill_dst_column means we have got a full chunk of data; the next time we call get_next, we will append data to src again, so it is reset here.
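
A minimal sketch of the pattern described in this reply, using simplified stand-in types rather than the actual StarRocks classes. `CodeColumn`, `SketchReader`, and the `next_code` counter are hypothetical names for illustration only; only `reset_column`, `get_next`, and `fill_dst_column` echo names from the thread, and their real signatures may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a column of dictionary codes (not the StarRocks Column API).
struct CodeColumn {
    std::vector<int32_t> codes;

    void append(int32_t code) { codes.push_back(code); }
    // Drop the contents but keep the object so it can be reused for the next chunk.
    void reset_column() { codes.clear(); }
    std::size_t size() const { return codes.size(); }
};

// Hypothetical reader: get_next() appends into a reusable temp column,
// fill_dst_column() hands the accumulated chunk to dst and resets the temp column.
struct SketchReader {
    CodeColumn tmp;         // plays the role of `src` in the review thread
    int32_t next_code = 0;  // fake data source, for illustration only

    void get_next(std::size_t n) {
        for (std::size_t i = 0; i < n; ++i) {
            tmp.append(next_code++);
        }
    }

    void fill_dst_column(CodeColumn& dst) {
        dst.codes.insert(dst.codes.end(), tmp.codes.begin(), tmp.codes.end());
        // Without this reset, the next get_next() would append after stale rows
        // and the following chunk would contain the previous chunk's data again.
        tmp.reset_column();
        assert(tmp.size() == 0);
    }
};
```

In this sketch, calling `get_next(3)`, `fill_dst_column(a)`, `get_next(2)`, `fill_dst_column(b)` leaves `a` with three rows and `b` with two, because the temp column starts empty for each chunk.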

trueeyu · 2025-01-17T07:41:19Z

be/src/formats/parquet/complex_column_reader.cpp

+        array_column_dst = down_cast<ArrayColumn*>(nullable_column_dst->mutable_data_column());
+        NullColumn* null_column_dst = nullable_column_dst->mutable_null_column();
+        null_column_dst->swap_column(*null_column_src);
+        nullable_column_src->update_has_null();


Why not directly swap the flag of src_nullable_column and dst_nullable_column?

For the data column, we fill it with the reader.

trueeyu · 2025-01-17T07:43:54Z

be/src/formats/parquet/complex_column_reader.cpp

+        array_column_src = down_cast<ArrayColumn*>(src.get());
+        array_column_dst = down_cast<ArrayColumn*>(dst.get());
+    }
+    array_column_dst->offsets_column()->swap_column(*(array_column_src->offsets_column()));


Why not use ArrayColumn::swap_column?

We only swap the offsets and the null column; the element column is filled by element_reader.
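
A rough sketch of why a whole-column swap would not fit here, assuming a simplified array layout. `SketchArrayColumn` and `swap_structure` are hypothetical names and are not the StarRocks `ArrayColumn` API; the point is only that the structural parts (offsets, null flags) move from src to dst while the destination's elements stay in place.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical simplified array column: offsets + null flags + flattened elements.
struct SketchArrayColumn {
    std::vector<uint32_t> offsets{0};  // offsets[i]..offsets[i+1] delimit row i
    std::vector<uint8_t> nulls;        // 1 = row is null
    std::vector<int32_t> elements;     // flattened element values
};

// Move only the structural parts from src to dst.
// dst.elements is deliberately left alone: in the scenario from the thread it is
// filled directly by a separate element reader, so swapping it would discard that data.
inline void swap_structure(SketchArrayColumn& src, SketchArrayColumn& dst) {
    std::swap(src.offsets, dst.offsets);
    std::swap(src.nulls, dst.nulls);
}
```

If `dst.elements` already holds three values written by the element reader and `src.offsets` is `{0, 2, 3}`, then after `swap_structure(src, dst)` the destination has both the offsets and its original three elements, whereas a full swap would have replaced the elements with the (empty) ones from src.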

github-actions · 2025-01-17T11:12:37Z

[Java-Extensions Incremental Coverage Report]

✅ pass : 0 / 0 (0%)

github-actions · 2025-01-17T11:12:41Z

[FE Incremental Coverage Report]

✅ pass : 0 / 0 (0%)

github-actions · 2025-01-17T11:23:14Z

[BE Incremental Coverage Report]

✅ pass : 175 / 195 (89.74%)

file detail

	path	covered_line	new_line	coverage	not_covered_line_detail
🔵	be/src/formats/parquet/complex_column_reader.cpp	18	23	78.26%	[124, 125, 126, 127, 128]
🔵	be/src/connector/hive_connector.cpp	14	17	82.35%	[594, 598, 599]
🔵	be/src/formats/parquet/scalar_column_reader.h	23	28	82.14%	[189, 241, 257, 260, 261]
🔵	be/src/formats/parquet/column_reader_factory.cpp	14	15	93.33%	[170]
🔵	be/src/formats/parquet/scalar_column_reader.cpp	96	102	94.12%	[487, 491, 492, 589, 590, 591]
🔵	be/src/formats/parquet/complex_column_reader.h	1	1	100.00%	[]
🔵	be/src/formats/parquet/file_reader.cpp	2	2	100.00%	[]
🔵	be/src/exec/hdfs_scanner.cpp	3	3	100.00%	[]
🔵	be/src/formats/parquet/group_reader.cpp	4	4	100.00%	[]

zombee0 requested review from a team as code owners January 16, 2025 12:07

mergify bot assigned zombee0 Jan 16, 2025

zombee0 force-pushed the parquet_global_dict branch from 8ff2748 to a834e45 Compare January 16, 2025 13:17

[Enhancement]parquet reader supports low cardinality optimization

4227e55

Signed-off-by: zombee0 <[email protected]>

zombee0 force-pushed the parquet_global_dict branch from a834e45 to 4227e55 Compare January 17, 2025 03:37

trueeyu reviewed Jan 17, 2025

address comment

d2fe948

Signed-off-by: zombee0 <[email protected]>
