GH-4878 optimise sub-select #4879

Merged
merged 4 commits into from
Jan 25, 2024

Contributor

@hmottestad hmottestad commented Jan 23, 2024

GitHub issue resolved: #4878

Briefly describe the changes proposed in this PR:


PR Author Checklist (see the contributor guidelines for more details):

  • my pull request is self-contained
  • I've added tests for the changes I made
  • I've applied code formatting (you can use mvn process-resources to format from the command line)
  • I've squashed my commits where necessary
  • every commit message starts with the issue number (GH-xxxx) followed by a meaningful description of the change

@hmottestad hmottestad force-pushed the GH-4878-optimise-sub-select branch from 909870f to a8b468f Compare January 23, 2024 15:11
@hmottestad hmottestad force-pushed the GH-4878-optimise-sub-select branch from a8b468f to c1d042b Compare January 23, 2024 21:05
@hmottestad
Contributor Author

hmottestad commented Jan 23, 2024

Develop branch

Benchmark                                                     Mode  Cnt    Score    Error  Units
QueryBenchmark.complexQuery                                   avgt    5    1.026 ±  0.024  ms/op
QueryBenchmark.different_datasets_with_similar_distributions  avgt    5    0.475 ±  0.003  ms/op
QueryBenchmark.groupByQuery                                   avgt    5    0.591 ±  0.002  ms/op
QueryBenchmark.long_chain                                     avgt    5  165.798 ±  2.056  ms/op
QueryBenchmark.lots_of_optional                               avgt    5   42.455 ±  0.284  ms/op
QueryBenchmark.minus                                          avgt    5  896.323 ± 26.089  ms/op
QueryBenchmark.nested_optionals                               avgt    5   62.486 ±  0.219  ms/op
QueryBenchmark.pathExpressionQuery1                           avgt    5    5.174 ±  0.081  ms/op
QueryBenchmark.pathExpressionQuery2                           avgt    5    0.491 ±  0.003  ms/op
QueryBenchmark.query_distinct_predicates                      avgt    5   51.896 ±  0.587  ms/op
QueryBenchmark.simple_filter_not                              avgt    5    2.072 ±  1.333  ms/op

This branch

Benchmark                                                     Mode  Cnt    Score    Error  Units
QueryBenchmark.complexQuery                                   avgt    5    1.061 ±  0.004  ms/op
QueryBenchmark.different_datasets_with_similar_distributions  avgt    5    0.472 ±  0.002  ms/op
QueryBenchmark.groupByQuery                                   avgt    5    0.606 ±  0.003  ms/op
QueryBenchmark.long_chain                                     avgt    5  172.897 ± 18.938  ms/op
QueryBenchmark.lots_of_optional                               avgt    5   44.313 ±  3.020  ms/op
QueryBenchmark.minus                                          avgt    5  914.117 ± 23.632  ms/op
QueryBenchmark.nested_optionals                               avgt    5   64.688 ±  2.071  ms/op
QueryBenchmark.pathExpressionQuery1                           avgt    5    5.386 ±  0.596  ms/op
QueryBenchmark.pathExpressionQuery2                           avgt    5    0.478 ±  0.027  ms/op
QueryBenchmark.query_distinct_predicates                      avgt    5   53.481 ±  1.733  ms/op
QueryBenchmark.simple_filter_not                              avgt    5    1.878 ±  0.519  ms/op

@hmottestad
Contributor Author

@JervenBolleman Any chance you could take a look at this PR? I made a test, but I haven't made a benchmark query yet. The benchmark run I did at least shows no apparent performance degradation.

Contributor

@JervenBolleman JervenBolleman left a comment

I think this is a good improvement and worth committing even without a benchmark query. A one-off diff test is fine by me.

+ LINE_SEP,
q.getTupleExpr().toString());
assertThat(q.getTupleExpr().toString()).isEqualToNormalizingNewlines("QueryRoot\n" +
" Projection\n" +
Contributor

Should we not have + LINE_SEP instead of \n to be consistent?

Contributor Author

isEqualToNormalizingNewlines fixes the new line separators for us

Contributor Author

But I might change it back actually, since it's the only place that tests the line separation aspect of the query plan.
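
To illustrate the point about newline normalization in this thread, here is a minimal plain-Java sketch of the idea: Windows-style `\r\n` is collapsed to `\n` on both sides before comparing, so an expected string written with `\n` still matches output produced with the platform separator. The class and helper names are hypothetical, and this is not AssertJ's actual implementation of `isEqualToNormalizingNewlines`.

```java
public class NewlineNormalizingCompare {

    // Hypothetical helper: collapse Windows-style "\r\n" into "\n".
    static String normalizeNewlines(String s) {
        return s.replace("\r\n", "\n");
    }

    // Compare two strings after normalizing newlines on both sides.
    static boolean equalsNormalizingNewlines(String actual, String expected) {
        return normalizeNewlines(actual).equals(normalizeNewlines(expected));
    }

    public static void main(String[] args) {
        String expected = "QueryRoot\n" + "   Projection\n";
        // Simulates a plan rendered with a Windows line separator.
        String actualOnWindows = "QueryRoot\r\n" + "   Projection\r\n";

        // Normalized comparison ignores the separator difference.
        System.out.println(equalsNormalizingNewlines(actualOnWindows, expected)); // prints "true"
        // A strict comparison does not.
        System.out.println(actualOnWindows.equals(expected)); // prints "false"
    }
}
```

The trade-off raised in the thread follows from this: a normalizing comparison can no longer detect whether the query plan is rendered with the expected line separator.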

@@ -170,16 +170,16 @@ public void setTotalTimeNanosActual(long totalTimeNanosActual) {
/**
* @return Human readable number. Eg. 12.1M for 1212213.4 and UNKNOWN for -1.
*/
- static String toHumanReadbleNumber(double number) {
+ static String toHumanReadableNumber(double number) {
Contributor

Thanks for catching this typo. Should we extract this into a utility at some point?

Contributor Author

Could be moved to a more common module, but I don't think I want to do that now.

+
" │ o: Var (name=d)\n" +
" └── StatementPattern (resultSizeActual=2) [right]\n" +
" │ ╚══ LeftJoin (new scope) (BadlyDesignedLeftJoinIterator) (costEstimate=6.61, resultSizeEstimate=12, resultSizeActual=4) [right]\n"
Contributor

This looks like a nice improvement in the query plan.

@hmottestad hmottestad force-pushed the GH-4878-optimise-sub-select branch from 5a31ca3 to c0aedf9 Compare January 24, 2024 22:17
@hmottestad
Contributor Author

@JervenBolleman I actually found out that the QueryJoinOptimizer is able to optimise sub-selects, but not if there are multiple sub-selects or if there are any BIND clauses anywhere. I've made two benchmarks that will make sure that we don't break that optimisation later.
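
For readers unfamiliar with the query shapes mentioned in the comment above, the following sketch holds two illustrative SPARQL queries as Java strings: one with a single sub-select, and one with multiple sub-selects plus a BIND clause. These are not the benchmark queries added in this PR; the `ex:` vocabulary, variable names, and class name are all made up for illustration.

```java
public class SubSelectQueryShapes {

    // Illustrative only: the ex: namespace and its terms are hypothetical.
    static final String PREFIX = "PREFIX ex: <http://example.org/>\n";

    // A single sub-select joined with one statement pattern, no BIND.
    static final String SINGLE_SUB_SELECT = PREFIX +
            "SELECT * WHERE {\n" +
            "  { SELECT ?a WHERE { ?a a ex:Person } }\n" +
            "  ?a ex:knows ?b .\n" +
            "}";

    // Two sub-selects plus a BIND: the shape the comment describes
    // as previously not optimised.
    static final String MULTIPLE_SUB_SELECTS_WITH_BIND = PREFIX +
            "SELECT * WHERE {\n" +
            "  { SELECT ?a WHERE { ?a a ex:Person } }\n" +
            "  { SELECT ?b WHERE { ?b a ex:Company } }\n" +
            "  ?a ex:worksFor ?b .\n" +
            "  BIND(STR(?b) AS ?label)\n" +
            "}";

    public static void main(String[] args) {
        System.out.println(SINGLE_SUB_SELECT);
        System.out.println();
        System.out.println(MULTIPLE_SUB_SELECTS_WITH_BIND);
    }
}
```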

@hmottestad hmottestad enabled auto-merge (squash) January 25, 2024 10:55
@hmottestad hmottestad merged commit 5f67425 into develop Jan 25, 2024
8 checks passed
@hmottestad hmottestad deleted the GH-4878-optimise-sub-select branch January 25, 2024 11:05
2 participants