[enhancement](orc) Reduce the read amplification caused by the ORC tiny stripe optimization. #50675
run buildall
2801c8d to 3befcf5
run buildall
TPC-H: Total hot run time: 33873 ms
TPC-DS: Total hot run time: 193233 ms
ClickBench: Total hot run time: 29.83 s
range_finder);
// The first stripe in the read file is selected to calculate an approximate
// read amplification factor using the tiny stripe optimization.
std::vector<bool> selectedColumns;
Suggested change:
- std::vector<bool> selectedColumns;
+ std::vector<bool> selected_columns;
3befcf5 to c3d4c9b
run buildall
TPC-H: Total hot run time: 33616 ms
TPC-DS: Total hot run time: 192540 ms
ClickBench: Total hot run time: 29.62 s
run buildall
TPC-H: Total hot run time: 33939 ms
TPC-DS: Total hot run time: 192989 ms
ClickBench: Total hot run time: 29.71 s
run buildall
TPC-H: Total hot run time: 33607 ms
TPC-DS: Total hot run time: 192368 ms
ClickBench: Total hot run time: 29.7 s
LGTM
… tiny stripe optimization.
4516147 to 41221b8
run buildall
TPC-H: Total hot run time: 33799 ms
TPC-DS: Total hot run time: 194098 ms
ClickBench: Total hot run time: 29.86 s
What problem does this PR solve?
Related PR: #42004
orc submodule pr: apache/doris-thirdparty#313
Problem Summary:
The previous PR #42004 introduced the ORC tiny stripe optimization. However, if an ORC file has many columns and a query uses only a few of them, the tiny stripe optimization causes serious read amplification.
This PR introduces the `orc_tiny_stripe_amplification_factor` session variable to control when the ORC tiny stripe optimization is applied. The optimization is used only when the bytes that actually need to be read account for a larger proportion of the entire stripe than this parameter. The default value of this parameter is 0.4, and the minimum value is 0.
Release note
None
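The threshold rule from the problem summary can be sketched roughly as follows. This is an illustration only, not the code in this PR: `ColumnChunk`, `selected_bytes_ratio`, and `use_tiny_stripe_optimization` are hypothetical names, and the strict `>` comparison is an assumption drawn from the wording "larger proportion ... than this parameter".

```cpp
#include <cstdint>
#include <vector>

// Hypothetical summary of one column's footprint inside a stripe.
struct ColumnChunk {
    uint64_t bytes;  // on-disk bytes of this column in the stripe
    bool selected;   // true if the query reads this column
};

// Fraction of the stripe's bytes that the query actually needs (0.0 to 1.0).
double selected_bytes_ratio(const std::vector<ColumnChunk>& stripe) {
    uint64_t total = 0;
    uint64_t needed = 0;
    for (const ColumnChunk& chunk : stripe) {
        total += chunk.bytes;
        if (chunk.selected) {
            needed += chunk.bytes;
        }
    }
    if (total == 0) {
        return 0.0;  // empty stripe: nothing to amplify
    }
    return static_cast<double>(needed) / static_cast<double>(total);
}

// Assumed rule: use the tiny stripe optimization only when the needed bytes
// make up a larger share of the stripe than the configured factor.
bool use_tiny_stripe_optimization(const std::vector<ColumnChunk>& first_stripe,
                                  double amplification_factor) {
    return selected_bytes_ratio(first_stripe) > amplification_factor;
}
```

Under these assumptions, a query that needs 100 of 1000 stripe bytes has a ratio of 0.1 and would skip the optimization at the default factor of 0.4, while a factor of 0 would keep the optimization enabled whenever any selected bytes exist.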
Check List (For Author)
Test
Behavior changed:
Does this need documentation?
Check List (For Reviewer who merge this PR)