Optimize scalar string functions #14833
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## master #14833 +/- ##
=============================================
- Coverage 61.75% 34.03% -27.72%
- Complexity 207 669 +462
=============================================
Files 2436 2720 +284
Lines 133233 151994 +18761
Branches 20636 23470 +2834
=============================================
- Hits 82274 51738 -30536
- Misses 44911 96085 +51174
+ Partials 6048 4171 -1877
I suggested that Bolek introduce new functions so that the performance difference is easy to compare. At the same time, it is not clear to us whether users can run these queries with non-constant arguments. SSE has a list of functions whose arguments must be literals, but it isn't easy to know whether that list is complete. @Jackie-Jiang, what do you think? Should we replace the older functions with the faster implementation suggested by Bolek?
I've checked the Calcite code and it seems there are at least two options:
 * @return trim spaces from left side of the string
 */
@ScalarFunction
public static String ltrim(String input) {
How is the new implementation of LTRIM better than this implementation?
The new implementation shares the matcher, instead of allocating a new one per call.
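To make the difference concrete, here is a minimal sketch of the two approaches. It assumes ltrim is regex-based; the class name, field names, and the exact pattern are hypothetical and not taken from the PR.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration; names are not taken from the PR.
final class LtrimSketch {
  private static final Pattern LEADING_WS = Pattern.compile("^\\s+");

  // One Matcher per instance, reused across calls. Matcher is not thread-safe,
  // so each instance must be confined to a single thread.
  private final Matcher _matcher = LEADING_WS.matcher("");

  // Allocates a new Matcher on every call.
  static String ltrimPerCall(String input) {
    return LEADING_WS.matcher(input).replaceFirst("");
  }

  // Re-targets the existing Matcher at the new input instead of allocating one.
  String ltrimShared(String input) {
    return _matcher.reset(input).replaceFirst("");
  }
}
```

Both variants return the same result; the shared one only avoids the per-call Matcher allocation, at the cost of requiring one instance per thread.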
When we say "constant", are we referring to (1) the ability to distinguish an identifier from a literal, or (2) multiple invocations of the function with the same input, which saves the initialization cost?
I don't think there is a need to introduce separate functions for the CONSTANT flavor. We should go down the polymorphism route.
By 'constant' I meant that the function is called with a literal as the pattern argument and not, e.g., a column name. I think we could either implement it with polymorphic functions or add an init()/newInstance() method that would allow choosing an implementation based on the actual arguments and get rid of the conditional initialization in the method body.
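A rough sketch of what the newInstance() idea could look like, assuming the engine can tell the factory whether the pattern argument is a literal. The interface, class, and parameter names are hypothetical and do not correspond to any existing Pinot API.

```java
import java.util.regex.Pattern;

// Hypothetical function shape: (input, regex, replacement) -> result.
interface RegexpReplaceFn {
  String apply(String input, String regex, String replacement);
}

// Hypothetical factory; not an existing Pinot class.
final class RegexpReplaceSketch {
  // literalRegexOrNull is the pattern text when the argument is a literal,
  // or null when it is a column or expression evaluated per row.
  static RegexpReplaceFn newInstance(String literalRegexOrNull) {
    if (literalRegexOrNull != null) {
      // Constant pattern: compile once, ignore the per-call regex argument.
      Pattern compiled = Pattern.compile(literalRegexOrNull);
      return (input, regex, replacement) -> compiled.matcher(input).replaceAll(replacement);
    }
    // Variable pattern: compile on every call so each row uses its own regex.
    return (input, regex, replacement) -> Pattern.compile(regex).matcher(input).replaceAll(replacement);
  }
}
```

The choice is made once, when the function is instantiated, so the method body itself no longer needs an "is it initialized yet" branch.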
…ctions # Conflicts: # pinot-perf/src/main/java/org/apache/pinot/perf/BenchmarkQueries.java
Approved, but please update the PR description. Given the clear performance increase and the fact that these functions must be used with literal arguments in SSE, we decided to change the implementation instead of creating the _const alternative.
This PR optimizes a number of string scalar functions, including:
changes other functions, assuming that the pattern is constant:
while keeping the existing generic implementations with a _var suffix:
All of the functions mentioned above have been changed to initialize temporary objects once and clear/reuse them on each call.
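As an illustration of the clear/reuse pattern, here is a minimal sketch using a reused StringBuilder. The class and the choice of a reverse function are hypothetical and only meant to show the shape of the change, not the actual code in this PR.

```java
// Hypothetical illustration of "initialize once, clear and reuse per call".
final class ReverseSketch {
  // Allocated once per instance; not thread-safe, so confine to one thread.
  private final StringBuilder _buffer = new StringBuilder();

  String reverse(String input) {
    _buffer.setLength(0); // clear the previous call's content, keep the capacity
    return _buffer.append(input).reverse().toString();
  }
}
```

The result string is still allocated on each call; only the intermediate buffer is shared.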
As can be seen in the following benchmark output, this change can speed up a raw function call by more than 4x.
If query processing is dominated by the function call, the effect on actual query performance is similar:
NOTE: the reason I added the _const functions is that there is currently no way for the engine to choose an implementation based on whether a function argument is constant or variable. If we change, e.g., regexp_replace, it will start returning wrong results when the regular expression is variable, without raising an error or warning.
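The failure mode described in the note can be reproduced with a small sketch. It assumes an implementation that compiles the pattern on the first call and reuses it afterwards; the class name is hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical illustration of the hazard; not the PR's actual code.
final class CachedPatternHazard {
  private Pattern _pattern; // compiled lazily from the first regex seen

  String regexpReplace(String input, String regex, String replacement) {
    if (_pattern == null) {
      _pattern = Pattern.compile(regex);
    }
    // Later calls ignore the regex argument and reuse the cached pattern.
    return _pattern.matcher(input).replaceAll(replacement);
  }
}
```

With one instance, calling regexpReplace("abc", "a", "x") returns "xbc" as expected, but a following regexpReplace("abc", "b", "x") also returns "xbc" instead of "axc", and nothing signals the mismatch.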