-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Optimize LIKE with custom escape char #7730
Optimize LIKE with custom escape char #7730
Conversation
✅ Deploy Preview for meta-velox canceled.
|
c4f1f23
to
0bb581d
Compare
@mbasmanova Do you have time review this one? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@xumingming Thank you for working on this. I wonder if you'd be interested in helping with #7685 as well. Also, curious how much faster optimized version is? Would you add a benchmark or extend an existing one? See velox/benchmarks/ExpressionBenchmarkBuilder.h
@mbasmanova Seems I have some misunderstanding here, I thought this PR would fix #7685 , for pattern But anyway, support LIKE optimization when the escape char is specified is valuable, right? |
@xumingming James, we are seeing users write queries like this: Here, LIKE pattern appears as a prefix pattern, but when 'b' contains underscores the pattern is actually not a simple prefix patter. The user intent is of course for a prefix pattern, but we cannot assume that. The underscores happen often when a is a URL. Hence, it would be nice to optimize for prefix, suffix and substring patterns that contain underscores (wildchar).
Theoretically, yes, but I haven't seen these in practice. |
Added a benchmark to test the like optimization when escape char is specified, before this optimization:
after:
|
Hi @laithsakka @mbasmanova , can you help review this one? I will implement #7685 base on this PR, should be relatively easy. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some questions.
velox/functions/lib/Re2Functions.h
Outdated
std::string fixedPattern = ""; | ||
// Contains the unescaped fixed pattern in patterns of kind kFixed, kPrefix, | ||
// kSuffix and kSubstring. | ||
std::string unescapedPattern = ""; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we could keep the original name 'fixedPattern'. The new name causes lots of changes that make it hard to review this PR and also is confusing to me as I need to think what does 'unescaped' mean.
velox/functions/lib/Re2Functions.h
Outdated
vector_size_t start, | ||
vector_size_t end, | ||
std::optional<char> escapeChar); | ||
std::string unescape(StringView pattern, std::optional<char> escapeChar); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add empty line between functions
velox/functions/lib/Re2Functions.h
Outdated
|
||
/// Return the unescaped string for the specified string range, if escape char | ||
/// is not specified just return the corresponding substring. | ||
std::string unescape( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we move this function to .cpp file?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left it here to facilitate testing, will move it.
velox/functions/lib/Re2Functions.h
Outdated
|
||
/// An Iterator that provides methods(hasNext, next) to iterate through a | ||
/// pattern string. | ||
class PatternStringIterator { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we move this class to .cpp file?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left it here to facilitate testing, will move it.
velox/functions/lib/Re2Functions.h
Outdated
} | ||
|
||
private: | ||
StringView pattern_; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
const here and for escapeChar_
velox/functions/lib/Re2Functions.cpp
Outdated
return true; | ||
}; | ||
|
||
while (iterator.hasNext()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not following this logic. Would you explain what it does?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Function determinePatternKind
has two parts:
- The first part collect stats about the pattern: e.g. where the first wildcard occurs, where the first fixedPattern occurs etc.
- The second part deduce the optimized pattern based on the stats collected in the first part.
The change I made is mainly to the first part, I introduced a dedicated struct: PatternStringIterator
, the reason is : in the previous code, we advance the cursor
by simply i++
, now to support escape char we have some state to update when advancing the cursor:
/// Advance the cursor to next char.
void next() {
currentIndex_++;
previousCharKind_ = charKind_;
auto currentChar = pattern_.data()[currentIndex_];
if (previousCharKind_ != CharKind::kEscape && currentChar == escapeChar_) {
charKind_ = CharKind::kEscape;
} else if (
previousCharKind_ != CharKind::kEscape && currentChar != escapeChar_ &&
(currentChar == '_' || currentChar == '%')) {
charKind_ = CharKind::kWildcard;
} else {
charKind_ = CharKind::kNormal;
}
}
Introducing PatternStringIterator
will make the logic easier to read & maintain.
velox/functions/lib/Re2Functions.cpp
Outdated
auto unescapedPattern = unescape(pattern, 0, patternLength, escapeChar); | ||
return PatternMetadata{ | ||
PatternKind::kFixed, | ||
(vector_size_t)unescapedPattern.size(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
vector_size_t should be used for row numbers and vector sizes, but not for size of a string. Why is it used here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently PatternMetadata::length
is of type vector_size_t
, will change it to size_t
.
velox/functions/lib/Re2Functions.cpp
Outdated
return PatternMetadata{ | ||
PatternKind::kSuffix, patternLength - fixedPatternStart}; | ||
PatternKind::kSuffix, | ||
(vector_size_t)unescapedPattern.size(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need to pass both unescapedPattern and its size?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Previously, when we do not support escape char, we don't need to pass pattern
to PatternMetadata
, because we can deduce the suffix pattern
from the original pattern
and size
, take %ab
as an example:
pattern = %ab, size = 2 -> suffix_pattern = ab
now we support escape char, with only size, we can not deduce the actual suffix_pattern
, take %a\_b
as example(\
is the escape char), since the unescaped pattern is a_b
, the size is 3
, but if we get the last 3 letters from the original pattern, it is \_b
, which is wrong, so we need to pass both unescaped pattern and size.
// Prepare data which contains/prefix with/suffix with the string 'a_b_c' | ||
for (int i = 0; i < vectorSize; i++) { | ||
substringInput->set( | ||
i, StringView::makeInline(fmt::format("{}a_b_c{}", i, i))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The benchmark shows 40x speed up. Is this right? I noticed you use only inline strings. Will results change at all if strings are longer?
.addExpression("like_prefix", R"(like (col1, 'a\_b\_c%', '\'))") | ||
.addExpression("like_suffix", R"(like (col2, '%a\_b\_c', '\'))") | ||
.addExpression("like_generic", R"(like (col3, '%a%b%c'))") | ||
.withIterations(100) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this needed?
Addressed all review comments, updated the benchmark to use longer strings, will paste the new benchmark result later. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some follow-up questions.
velox/functions/lib/Re2Functions.cpp
Outdated
@@ -483,7 +483,7 @@ class OptimizedLike final : public VectorFunction { | |||
} | |||
|
|||
private: | |||
StringView pattern_; | |||
std::string pattern_; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
const
velox/functions/lib/Re2Functions.cpp
Outdated
@@ -427,12 +427,12 @@ bool matchSubstringPattern( | |||
template <PatternKind P> | |||
class OptimizedLike final : public VectorFunction { | |||
public: | |||
OptimizedLike(StringView pattern, vector_size_t reducedPatternLength) | |||
OptimizedLike(std::string pattern, vector_size_t reducedPatternLength) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use std::move to avoid extra copying
pattern_{std::move(pattern)},
velox/functions/lib/Re2Functions.cpp
Outdated
} | ||
PatternMetadata patternMetadata = | ||
determinePatternKind(pattern, escapeChar); | ||
vector_size_t reducedLength = patternMetadata.length; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
const auto
velox/functions/lib/Re2Functions.cpp
Outdated
PatternMetadata patternMetadata = | ||
determinePatternKind(pattern, escapeChar); | ||
vector_size_t reducedLength = patternMetadata.length; | ||
auto fixedPattern = patternMetadata.fixedPattern; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
const auto& to avoid extra copy
velox/functions/lib/Re2Functions.cpp
Outdated
|
||
std::ostringstream os; | ||
// Append the specified range to the stream. | ||
auto appendNormalCharsFn = [&](const char* start, const char* end) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
appendNormalCharsFn -> appendChars (no need for "normal" and "fn")
velox/functions/lib/Re2Functions.cpp
Outdated
os << current; | ||
} else { | ||
// Escape char not found, append all the non-escape chars. | ||
appendNormalCharsFn(previous, cursor); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: perhaps, use endCursor add 'break' here for readability
appendChars(previous, endCursor);
break;
velox/functions/lib/Re2Functions.cpp
Outdated
if (iterator.previousState() == PatternStringIterator::CharKind::kEscape) { | ||
// The char follows escapeChar can only be one of (%, _, escapeChar). | ||
if (current != '%' && current != '_' && current != escapeChar) { | ||
fallbackToGeneric = true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we reach here, does it mean that the pattern is not valid? If so, should we fail?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, but seems we can not throw exception here, exception need to be set to the context? which we don't have access to here, so here we will fallback to generic and let generic implementation handle the exception.
auto error = std::make_exception_ptr(std::invalid_argument(
"Escape character must be followed by '%', '_' or the escape character itself"));
context.setErrors(rows, error);
velox/functions/lib/Re2Functions.cpp
Outdated
iterator.next(); | ||
auto current = iterator.current(); | ||
|
||
if (iterator.previousState() == PatternStringIterator::CharKind::kEscape) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The iterator abstraction is supposed to hide the details of how escape characters are processed. I'd expect it to allow us to write code as if we don't know anything about escape characters. However, I see that we are handling escape chars both in the iterator and in the code that uses the iterator. Is this intentional?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In my initial design, PatternStringIterator
does not encapsulate the escape
state, but I do agree with you that let it encapsulate escape
state will make its user much easier.
before:
after:
|
Hi @mbasmanova , can you help review again? comments addressed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@xumingming Thank you for iterating on this PR. I'm still a bit confused about the code and have some suggestions to refactor for clarity.
velox/functions/lib/Re2Functions.cpp
Outdated
std::ostringstream os; | ||
// Append the specified range to the stream. | ||
auto appendChars = [&](const char* start, const char* end) { | ||
for (auto it = start; it < end; it++) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not os.write(start, end - start)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch!
velox/functions/lib/Re2Functions.cpp
Outdated
CharKind userPreviousCharKind_ = CharKind::kNormal; | ||
|
||
/// Advance the cursor to next char. | ||
void nextInternal() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
move methods before member variables
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
velox/functions/lib/Re2Functions.cpp
Outdated
// previousCharKind_ is kEscape, while userPreviousCharKind_ is kNormal. | ||
CharKind userPreviousCharKind_ = CharKind::kNormal; | ||
|
||
/// Advance the cursor to next char. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use // for comments in .cpp files
use /// for comments on public methods in .h files
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
velox/functions/lib/Re2Functions.cpp
Outdated
size_t singleCharacterWildcardCount = 0; | ||
|
||
PatternStringIterator iterator{pattern, escapeChar}; | ||
bool fallbackToGeneric = false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of falling back to 'generic', we should return exec::AlwaysFailingVectorFunction.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Current function determinePatternKind returns PatternKind, it has no knowledge about AlwaysFailingVectorFunction?
velox/functions/lib/Re2Functions.cpp
Outdated
// Kind of current char. | ||
CharKind charKind_ = CharKind::kNormal; | ||
// Kind of previous char(literally). | ||
CharKind previousCharKind_ = CharKind::kNormal; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can remove this member variable. See https://github.com/facebookincubator/velox/compare/main...mbasmanova:velox-1:optimize_like_when_user_specified_escape_char?expand=1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The link seems broken, we need previousCharKind_
to track escape state?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm... Can you try #7933 ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mbasmanova Applied your change.
velox/functions/lib/Re2Functions.cpp
Outdated
// The difference between previousCharKind_ and realPreviousCharKind_ can be | ||
// described by an example: for pattern string 'a\_b', if the cursor is at '_' | ||
// previousCharKind_ is kEscape, while userPreviousCharKind_ is kNormal. | ||
CharKind userPreviousCharKind_ = CharKind::kNormal; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems we can simplify some more. We can replace this with a bool isPreviousWildcard_; remove nextInternal; and remove CharKind::kEscape.
See last commit in #7933
|
||
benchmarkBuilder | ||
.addBenchmarkSet( | ||
"like", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we rename the file to LikeBenchmark.cpp so we can add benchmarks for like without escape char in a follow-up?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just in case you didn't notice that there is another LikeFunctionsBenchmark.cpp
, but I find it not easy to change it to add like + escape
benchmark into it, so I started this new benchmark.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the pointer. I didn't realize that. Looks like that other benchmark is targeting LIKE patterns found in TPC-H benchmark. Maybe we could rename it to LikeTpchBenchmark and use LikeBenchmark name for the new more generic benchmark you have started. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good to me, renamed.
return StringView(temp); | ||
}, | ||
nullptr); | ||
auto suffixInput = vectorMaker.flatVector<facebook::velox::StringView>( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use flatVectorstd::string
I’m not sure if there is something wrong with the design of the benchmark, the improvement seems huge. Before(goes
After(goes corresponding optimization):
|
return std::string("x", row); | ||
} | ||
}, | ||
nullptr); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no need to pass nullptr as it is the default value
.addExpression("prefix", R"(like (col1, 'a\_b\_c%', '\'))") | ||
.addExpression("suffix", R"(like (col2, '%a\_b\_c', '\'))") | ||
.addExpression("generic", R"(like (col0, '%a%b%c'))") | ||
.withIterations(10) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why 10?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually no reason, this statement can be removed to use the default value: 1000.
@mbasmanova Do you have further comments? |
@xumingming James, have you been able to resolve the below concern?
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@xumingming Looks great % a few nits. Please, update PR description to explain all the changes in this PR and benchmark results.
|
||
auto substringInput = | ||
vectorMaker.flatVector<std::string>(vectorSize, [&](auto row) { | ||
// Only when the number is even we make a string contains a substring |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: // Strings in even rows contain a_b_c.
|
||
auto prefixInput = | ||
vectorMaker.flatVector<std::string>(vectorSize, [&](auto row) { | ||
// Only when the number is even we make a string starts with a_b_c. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: // Strings in even rows start with a_b_c
|
||
auto suffixInput = | ||
vectorMaker.flatVector<std::string>(vectorSize, [&](auto row) { | ||
// Only when the number is even we make a string ends with a_b_c. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: // Strings in even rows end with a_b_c.
const vector_size_t vectorSize = 1000; | ||
auto vectorMaker = benchmarkBuilder.vectorMaker(); | ||
|
||
auto substringInput = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a good amount of repetition here. Consider extracting helper function:
auto substringInput = makeInput(vectorMaker, vectorSize, [] (auto row) { return ...a_b_c...;});
auto prefixInput = makeInput(vectorMaker, vectorSize, [] (auto row) { return a_b_c...;});
auto suffixInput = makeInput(vectorMaker, vectorSize, [] (auto row) { return ...a_b_c;});
or use a pair of booleans to indicate whether to pad strings at the beginning and at end.
auto substringInput = makeInput(vectorMaker, vectorSize, true, true);
auto prefixInput = makeInput(vectorMaker, vectorSize, false, true);
auto suffixInput = makeInput(vectorMaker, vectorSize, true, false);
velox/functions/lib/Re2Functions.cpp
Outdated
return os.str(); | ||
} | ||
|
||
// An iterator to iterate through a pattern string. It will handle escaping |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: "It will handle" -> "Handles escaping transparently"
or
// Iterates through a pattern string. Transparently handles escape sequences.
escapeChar_(escapeChar), | ||
lastIndex_{pattern_.size() - 1} {} | ||
|
||
// Advance the cursor to next char, escape char is automatically handled. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: perhaps, clarify that this method returns 'false' if at end.
return pattern_.data()[currentIndex_]; | ||
} | ||
|
||
const StringView pattern_; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel we'll have to take another pass, maybe in a follow-up PR, to replace StringView with std::string_view. StringView is really meant to store strings in vectors and is not a replacement for light-weight std::string_view. For example, for short strings we end up copying strings when passing StringView by value and for these strings it is not safe to store a pointer to data().
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, I will submit a dedicated a PR for StringView change.
velox/functions/lib/Re2Functions.cpp
Outdated
const size_t lastIndex_; | ||
|
||
int32_t currentIndex_ = -1; | ||
CharKind charKind_ = CharKind::kNormal; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use {} consistently
@mbasmanova After some experiments, I think there are two reasons for the huge performance improvement:
And we can confirm the speedup is reasonable from the comparison between our optmizations and the simple scalar function strpos, starts_with, ends_with(as you suggested, really good idea :)), the performance nums are quite close(see the PR description for more details). |
@xumingming Thank you for digging in and sharing your findings. These make sense. Let's add this context to the PR description to make it easily available for future readers. Awesome job! |
.addExpression("like_suffix", R"(like(col2, '%a\_b\_c', '\'))") | ||
.addExpression("like_generic", R"(like(col0, '%a%b%c'))") | ||
.addExpression("strpos", R"(strpos(col0, 'a_b_c'))") | ||
.addExpression("starts_with", R"(starts_with(col1, 'a_b_c'))") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder how does this work? Don't we need to split these benchmarks into 3 different sets?
// This class represents a set of expressions to benchmark those expressions
// uses the same input vector type, and are expected to have the same result
// if testing is not disabled for the set.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CastBenchmark.cpp
has similar usage? is the comment you refered to is out of date? (Have not look into the implementation details of ExpressionBenchmarkSet
yet.)
benchmarkBuilder
.addBenchmarkSet(
"cast",
vectorMaker.rowVector(
{"valid",
"empty",
"nan",
"decimal",
"short_decimal",
"long_decimal",
"timestamp"},
{validInput,
invalidInput,
nanInput,
decimalInput,
shortDecimalInput,
longDecimalInput,
timestampInput}))
.addExpression("try_cast_invalid_empty_input", "try_cast (empty as int) ")
.addExpression(
"tryexpr_cast_invalid_empty_input", "try (cast (empty as int))")
.addExpression("try_cast_invalid_nan", "try_cast (nan as int)")
.addExpression("tryexpr_cast_invalid_nan", "try (cast (nan as int))")
.addExpression("try_cast_valid", "try_cast (valid as int)")
.addExpression("tryexpr_cast_valid", "try (cast (valid as int))")
.addExpression("cast_valid", "cast(valid as int)")
.addExpression(
"cast_decimal_to_inline_string", "cast (decimal as varchar)")
.addExpression("cast_short_decimal", "cast (short_decimal as varchar)")
.addExpression("cast_long_decimal", "cast (long_decimal as varchar)")
.addExpression("cast_timestamp", "cast (timestamp as varchar)")
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CastBenchmark has .disableTesting();
CC: @laithsakka
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mbasmanova After digging into the details of ExpressionBenchmarkBuilder, I find my understanding of it was not quite right, updated the benchmark.
Updated the PR description. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks.
@mbasmanova has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
@xumingming Would you rebase this PR so I can try to merge? |
Currently we have optimization for LIKE only when user does not specify escape char, this commit provides optimization(kPrefix, kSuffix, kFixed, kSubstring) when user specified escape char.
293c487
to
0d17015
Compare
@mbasmanova has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
// Strings in even rows contain/start with/end with a_b_c depends on | ||
// value of padAtHead && padAtTail. | ||
if (row % 2 == 0) { | ||
auto padding = std::string("x", row / 2 + 1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this should be std::string (n, 'x')
I'll fix. Just FYI.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mbasmanova Sorry for this issue, I re-runed the benchmark and update the benchmark results in the PR description, now the speedup is about 10x to 20x, more resonable :)
return std::string("a_b_c"); | ||
} | ||
} else { | ||
return std::string("x", row); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same
FYI, I get the following results on the benchmark
|
size_t end, | ||
std::optional<char> escapeChar) { | ||
if (!escapeChar) { | ||
return std::string(pattern.data(), start, end - start); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This needs to be std::string(pattern.data() + start, end - start); Otherwise, lots of ASAN failures.
=================================================================
==293875==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x611000e5b4c0 at pc 0x000000319269 bp 0x7ffff14e7e60 sp 0x7ffff14e7620
READ of size 129 at 0x611000e5b4c0 thread T0
SCARINESS: 26 (multi-byte-read-heap-buffer-overflow)
#0 0x319268 in __interceptor_strlen.part.0 ubsan.c
#1 0x7f069dbf009d in std::char_traits<char>::length(char const*) fbcode/third-party-buck/platform010/build/libgcc/include/c++/trunk/bits/char_traits.h:371
#2 0x7f069dbeff08 in std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>::basic_string<std::allocator<char>>(char const*, std::allocator<char> const&) fbcode/third-party-buck/platform010/build/libgcc/include/c++/trunk/bits/basic_string.h:536
#3 0x7f069ddac65c in facebook::velox::functions::unescape[abi:cxx11](facebook::velox::StringView, unsigned long, unsigned long, std::optional<char>) fbcode/velox/functions/lib/Re2Functions.cpp:979
#4 0x7f069ddaf13f in facebook::velox::functions::determinePatternKind(facebook::velox::StringView, std::optional<char>) fbcode/velox/functions/lib/Re2Functions.cpp:1203
#5 0x7f069ddbca05 in facebook::velox::functions::makeLike(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::vector<facebook::velox::exec::VectorFunctionArg, std::allocator<facebook::velox::exec::VectorFunctionArg>> const&, facebook::velox::core::QueryConfig const&) fbcode/velox/functions/lib/Re2Functions.cpp:1278
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this warning a false positive or the original expression is indeed wrong?
@mbasmanova merged this pull request in 1779e82. |
Conbench analyzed the 1 benchmark run on commit There were no benchmark performance regressions. 🎉 The full Conbench report has more details. |
…#8040) Summary: Expression: ```cpp like(R"(abcdef\abcdef)", R"(%\\abc%)", '\\') ``` Error: ``` Error Source: USER Error Code: INVALID_ARGUMENT Reason: Escape character must be followed by '%', '_' or the escape character itself Retriable: False Expression: current == escapeChar || current == '_' || current == '%' Function: unescape File: ../../velox/functions/lib/Re2Functions.cpp ``` The bug is a regression caused by #7730 Pull Request resolved: #8040 Reviewed By: Yuhta Differential Revision: D52160961 Pulled By: mbasmanova fbshipit-source-id: 9a777222ee0dc3be9ac0323ee5ca47ab95016f63
Currently we optimize LIKE operation only if escape char is not specified,
this PR adds the ability to apply the optimization even if user specifies
escape char. We introduced a PatternStringIterator which handles escaping
transparently, so existing optimizations(kPrefix, kSuffix, kSubstring etc)
now work for patterns with escape char transparently, and future
optimizations will have effect for escaped pattern transparently too.
The benchmark result before this optimization:
After:
In Summary:
We can confirm the speedup is reasonable from the comparison between our
optmizations and the simple scalar function strpos, starts_with, ends_with, the
performance numbers are quite close(see the like##strpos/starts_with/ends_with
in the benchmark result for more details).