[Enhancement] Parquet reader supports low cardinality optimization #55167

Open
wants to merge 2 commits into
base: main
from

Conversation

zombee0
Contributor

@zombee0 zombee0 commented Jan 16, 2025

Why I'm doing:

What I'm doing:

Fixes #issue

What type of PR is this:

  • BugFix
  • Feature
  • Enhancement
  • Refactor
  • UT
  • Doc
  • Tool

Does this PR entail a change in behavior?

  • Yes, this PR will result in a change in behavior.
  • No, this PR will not result in a change in behavior.

If yes, please specify the type of change:

  • Interface/UI changes: syntax, type conversion, expression evaluation, display information
  • Parameter changes: default values, similar parameters but with different default values
  • Policy changes: use new policy to replace old one, functionality automatically enabled
  • Feature removed
  • Miscellaneous: upgrade & downgrade compatibility, etc.

Checklist:

  • I have added test cases for my bug fix or my new feature
  • This pr needs user documentation (for new or modified features or behaviors)
    • I have added documentation for my new feature or new function
  • This is a backport pr

Bugfix cherry-pick branch check:

  • I have checked the version labels which the pr will be auto-backported to the target branch
    • 3.4
    • 3.3
    • 3.2
    • 3.1
    • 3.0

@zombee0 zombee0 requested review from a team as code owners January 16, 2025 12:07
@zombee0 zombee0 force-pushed the parquet_global_dict branch from 8ff2748 to a834e45 Compare January 16, 2025 13:17
@zombee0 zombee0 force-pushed the parquet_global_dict branch from a834e45 to 4227e55 Compare January 17, 2025 03:37

if (_tmp_column == nullptr) {
_tmp_column = ColumnHelper::create_column(TypeDescriptor::from_logical_type(TYPE_VARCHAR), true);
}
Contributor

Use TYPE_VARCHAR_DESC instead of TypeDescriptor::from_logical_type(TYPE_VARCHAR)?

Contributor Author

Will do.

nullable_dst->set_has_null(nullable_codes->has_null());
}

src->reset_column();
Contributor

Why reset_column?

Contributor Author

src is a temporary column that holds intermediate results. A call to fill_dst_column means we already have a full chunk of data, so we reset src; the next get_next call then appends data to it again.
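
The buffer-then-reset pattern described in this reply can be sketched roughly as follows. This is only an illustration under assumptions: `SketchColumn`, `SketchReader`, `read_rows` and `buffered_rows` are hypothetical stand-ins and not the StarRocks classes touched by this PR; only the names `fill_dst_column` and `reset_column` are borrowed from the discussion.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a column; not the StarRocks Column class.
struct SketchColumn {
    std::vector<std::string> values;

    void append(const std::string& v) { values.push_back(v); }
    std::size_t size() const { return values.size(); }
    // Drop all buffered rows so the column can be reused for the next chunk.
    void reset_column() { values.clear(); }
};

// Hypothetical reader that buffers decoded rows in a temporary column.
class SketchReader {
public:
    // Simulates one read call: decoded rows are appended to the temp column.
    void read_rows(const std::vector<std::string>& rows) {
        for (const auto& r : rows) _tmp.append(r);
    }

    // Hands the buffered chunk to dst, then resets the temp column so the
    // next read starts from an empty buffer instead of appending to old rows.
    void fill_dst_column(SketchColumn& dst) {
        for (auto& v : _tmp.values) dst.append(std::move(v));
        _tmp.reset_column();
        assert(_tmp.size() == 0);
    }

    std::size_t buffered_rows() const { return _tmp.size(); }

private:
    SketchColumn _tmp;
};
```

Without the reset, rows from the first chunk would still be in the temporary column when the second chunk is read, and they would be handed to the destination twice.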

array_column_dst = down_cast<ArrayColumn*>(nullable_column_dst->mutable_data_column());
NullColumn* null_column_dst = nullable_column_dst->mutable_null_column();
null_column_dst->swap_column(*null_column_src);
nullable_column_src->update_has_null();
Contributor

Why not directly swap the flag of src_nullable_column and dst_nullable_column?

Contributor Author

The data column is filled by the reader, so we do not swap it.
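
A rough sketch of why the null flag is recomputed rather than swapped: only the null bytes move between the two columns, so each side's flag has to be derived from whatever null bytes it ends up holding. `SketchNullableColumn` and `move_nulls` are hypothetical names for illustration only, and recomputing the flag on both sides is an assumption of this sketch, not necessarily what the PR does.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical, simplified nullable column: not the StarRocks NullableColumn.
struct SketchNullableColumn {
    std::vector<uint8_t> nulls;  // 1 = row is null
    std::vector<int32_t> data;   // filled separately (stands in for the reader)
    bool has_null = false;

    // Recompute the cached flag from the null bytes currently held.
    void update_has_null() {
        has_null = std::any_of(nulls.begin(), nulls.end(),
                               [](uint8_t n) { return n != 0; });
    }
};

// Swaps only the null bytes; the data vectors are intentionally untouched.
// Each flag is then recomputed so it matches the null bytes that column now holds.
inline void move_nulls(SketchNullableColumn& src, SketchNullableColumn& dst) {
    assert(&src != &dst);
    std::swap(src.nulls, dst.nulls);
    src.update_has_null();
    dst.update_has_null();
}
```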

array_column_src = down_cast<ArrayColumn*>(src.get());
array_column_dst = down_cast<ArrayColumn*>(dst.get());
}
array_column_dst->offsets_column()->swap_column(*(array_column_src->offsets_column()));
Contributor

Why not use ArrayColumn::swap_column?

Contributor Author

We only swap the offsets and the null column; the element column is filled by element_reader.
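
As a minimal sketch of this partial swap, assuming a simplified array layout: `SketchArrayColumn`, `swap_structure_only` and `fill_elements` are hypothetical names and do not reflect the real ArrayColumn API. The point is that a whole-column swap would also move the elements, which in this design are written into the destination by a separate element reader.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical, simplified array column: not the StarRocks ArrayColumn.
struct SketchArrayColumn {
    std::vector<uint32_t> offsets;  // row i spans elements [offsets[i], offsets[i+1])
    std::vector<uint8_t> nulls;     // 1 = row is null
    std::vector<int32_t> elements;  // flattened element values
};

// Moves only the structural parts (offsets and nulls) from src to dst.
// The elements are deliberately left alone: in this sketch they are written
// into dst separately, standing in for the element reader.
inline void swap_structure_only(SketchArrayColumn& src, SketchArrayColumn& dst) {
    std::swap(src.offsets, dst.offsets);
    std::swap(src.nulls, dst.nulls);
}

// Stand-in for the element reader filling dst's elements directly.
inline void fill_elements(SketchArrayColumn& dst, const std::vector<int32_t>& decoded) {
    dst.elements.insert(dst.elements.end(), decoded.begin(), decoded.end());
    // Once both parts are in place, the last offset must equal the element count.
    assert(dst.offsets.empty() || dst.offsets.back() == dst.elements.size());
}
```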

Signed-off-by: zombee0 <[email protected]>

[Java-Extensions Incremental Coverage Report]

pass : 0 / 0 (0%)


[FE Incremental Coverage Report]

pass : 0 / 0 (0%)


[BE Incremental Coverage Report]

pass : 175 / 195 (89.74%)

file detail

| path | covered_line | new_line | coverage | not_covered_line_detail |
|---|---|---|---|---|
| 🔵 be/src/formats/parquet/complex_column_reader.cpp | 18 | 23 | 78.26% | [124, 125, 126, 127, 128] |
| 🔵 be/src/connector/hive_connector.cpp | 14 | 17 | 82.35% | [594, 598, 599] |
| 🔵 be/src/formats/parquet/scalar_column_reader.h | 23 | 28 | 82.14% | [189, 241, 257, 260, 261] |
| 🔵 be/src/formats/parquet/column_reader_factory.cpp | 14 | 15 | 93.33% | [170] |
| 🔵 be/src/formats/parquet/scalar_column_reader.cpp | 96 | 102 | 94.12% | [487, 491, 492, 589, 590, 591] |
| 🔵 be/src/formats/parquet/complex_column_reader.h | 1 | 1 | 100.00% | [] |
| 🔵 be/src/formats/parquet/file_reader.cpp | 2 | 2 | 100.00% | [] |
| 🔵 be/src/exec/hdfs_scanner.cpp | 3 | 3 | 100.00% | [] |
| 🔵 be/src/formats/parquet/group_reader.cpp | 4 | 4 | 100.00% | [] |

2 participants