Optimize LIKE with custom escape char (#7730)
Summary:
Previously the LIKE optimizations applied only when no escape char was
specified. This PR applies them when the user specifies an escape char as
well. It introduces a PatternStringIterator that handles escaping
transparently, so the existing optimizations (kPrefix, kSuffix, kSubstring,
etc.) now work for patterns with an escape char, and future optimizations
will apply to escaped patterns too.
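
As an illustration of the idea only, here is a minimal sketch of an escape-aware pattern iterator. It is not the PatternStringIterator from this commit: the names EscapedPatternIterator, PatternChar, CharKind, and literalText are hypothetical, and the escape handling is simplified (an escape char makes the next char a literal, and a trailing escape char is kept as a literal).

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>

// Hypothetical sketch; NOT the PatternStringIterator added by this commit.
enum class CharKind { kLiteral, kAnyCharsWildcard, kSingleCharWildcard };

struct PatternChar {
  CharKind kind;
  char value;
};

class EscapedPatternIterator {
 public:
  EscapedPatternIterator(std::string_view pattern, std::optional<char> escape)
      : pattern_(pattern), escape_(escape) {}

  // Returns the next logical pattern character, or std::nullopt at the end.
  std::optional<PatternChar> next() {
    if (pos_ >= pattern_.size()) {
      return std::nullopt;
    }
    const char c = pattern_[pos_++];
    // Assumption: an escape char turns the following char into a literal.
    // A trailing escape char falls through and is kept as a literal.
    if (escape_.has_value() && c == *escape_ && pos_ < pattern_.size()) {
      return PatternChar{CharKind::kLiteral, pattern_[pos_++]};
    }
    if (c == '%') {
      return PatternChar{CharKind::kAnyCharsWildcard, c};
    }
    if (c == '_') {
      return PatternChar{CharKind::kSingleCharWildcard, c};
    }
    return PatternChar{CharKind::kLiteral, c};
  }

 private:
  std::string_view pattern_;
  std::optional<char> escape_;
  std::size_t pos_{0};
};

// Example consumer: collects the literal characters of a pattern with
// escapes resolved, skipping wildcards.
std::string literalText(std::string_view pattern, std::optional<char> escape) {
  EscapedPatternIterator it(pattern, escape);
  std::string out;
  while (auto ch = it.next()) {
    if (ch->kind == CharKind::kLiteral) {
      out.push_back(ch->value);
    }
  }
  return out;
}
```

Because the consumer only sees logical characters, code that classifies a pattern as prefix, suffix, or substring does not need to know whether an escape char was in use.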

The benchmark result before this optimization:

```
============================================================================
[...]hmarks/ExpressionBenchmarkBuilder.cpp     relative  time/iter   iters/s
============================================================================
like_generic##like_generic                                   4.14s   241.44m
----------------------------------------------------------------------------
----------------------------------------------------------------------------
like_prefix##like_prefix                                     1.20s   833.70m
like_prefix##starts_with                                    2.92ms    342.44
like_substring##like_substring                               4.22s   236.77m
like_substring##strpos                                      6.98ms    143.27
like_suffix##like_suffix                                     3.09s   323.90m
like_suffix##ends_with                                      3.02ms    331.11
```

After:

```
============================================================================
[...]hmarks/ExpressionBenchmarkBuilder.cpp     relative  time/iter   iters/s
============================================================================
like_generic##like_generic                                   3.86s   258.97m
----------------------------------------------------------------------------
----------------------------------------------------------------------------
like_prefix##like_prefix                                    4.18ms    239.24
like_prefix##starts_with                                    2.76ms    362.05
like_substring##like_substring                              7.71ms    129.75
like_substring##strpos                                      6.67ms    149.90
like_suffix##like_suffix                                    4.20ms    237.85
like_suffix##ends_with                                      2.90ms    344.93
```

In summary (time/iter before -> after):

- Speedup of kSubstring is about 550x (4.22s -> 7.71ms).
- Speedup of kPrefix is about 290x (1.20s -> 4.18ms).
- Speedup of kSuffix is about 740x (3.09s -> 4.20ms).
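
The ratios can be recomputed from the time/iter columns of the two tables above. A small sketch, where Timing, likeTimings, and roundedSpeedup are hypothetical helper names and the numbers are copied from the benchmark output:

```cpp
#include <cmath>
#include <string>
#include <vector>

// Hypothetical helper type: one benchmark row, times in seconds.
struct Timing {
  std::string name;
  double beforeSeconds;
  double afterSeconds;
};

// time/iter values copied from the "before" and "after" tables.
std::vector<Timing> likeTimings() {
  return {
      {"kSubstring", 4.22, 7.71e-3},
      {"kPrefix", 1.20, 4.18e-3},
      {"kSuffix", 3.09, 4.20e-3},
  };
}

// Speedup as before/after, rounded to the nearest integer.
long roundedSpeedup(const Timing& timing) {
  return std::lround(timing.beforeSeconds / timing.afterSeconds);
}
```

With these inputs roundedSpeedup gives 547 for kSubstring, 287 for kPrefix, and 736 for kSuffix.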

Why is the speedup so large? There are two reasons:

- RE2 is much slower than the optimized paths: even for short input strings (10 bytes), RE2 is about 100x slower.
- When the input strings get longer (10 bytes -> 1000 bytes), the performance of the optimized paths does not change much, while RE2 becomes about 10x slower.

We can confirm that the speedup is reasonable by comparing the optimized
LIKE paths with the simple scalar functions strpos, starts_with, and
ends_with: the numbers are quite close (see the like_substring##strpos,
like_prefix##starts_with, and like_suffix##ends_with rows in the benchmark
results).
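
A minimal sketch of why the optimized paths land close to starts_with, ends_with, and strpos: once escapes are resolved, a pattern with wildcards only at its ends reduces to a fixed literal plus a plain string comparison. The kPrefix/kSuffix/kSubstring names follow the summary above; parsePattern, fastMatch, ParsedPattern, and the kExact/kGeneric cases are hypothetical, and the escape handling is simplified. This is not the Velox implementation.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>

enum class PatternKind { kExact, kPrefix, kSuffix, kSubstring, kGeneric };

struct ParsedPattern {
  PatternKind kind;
  std::string literal; // Unescaped fixed text; empty for kGeneric.
};

// Classifies a LIKE pattern. Anything other than "literal with optional
// leading/trailing '%'" is reported as kGeneric.
ParsedPattern parsePattern(
    std::string_view pattern,
    std::optional<char> escape) {
  std::string literal;
  bool leadingWildcard = false;
  bool trailingWildcard = false;
  for (std::size_t i = 0; i < pattern.size(); ++i) {
    char c = pattern[i];
    // Assumption: escape + any char yields that char as a literal.
    const bool escaped =
        escape.has_value() && c == *escape && i + 1 < pattern.size();
    if (escaped) {
      c = pattern[++i];
    } else if (c == '%') {
      if (literal.empty() && !trailingWildcard) {
        leadingWildcard = true;
      } else {
        trailingWildcard = true;
      }
      continue;
    } else if (c == '_') {
      return {PatternKind::kGeneric, ""};
    }
    if (trailingWildcard) {
      // Literal text after an inner '%', e.g. '%a%b%c'.
      return {PatternKind::kGeneric, ""};
    }
    literal.push_back(c);
  }
  if (leadingWildcard && trailingWildcard) {
    return {PatternKind::kSubstring, literal};
  }
  if (leadingWildcard) {
    return {PatternKind::kSuffix, literal};
  }
  if (trailingWildcard) {
    return {PatternKind::kPrefix, literal};
  }
  return {PatternKind::kExact, literal};
}

// Returns the match result for the fast paths, or std::nullopt when the
// pattern needs a generic matcher (for example a regex engine).
std::optional<bool> fastMatch(const ParsedPattern& p, std::string_view input) {
  const std::string_view lit = p.literal;
  const bool fits = input.size() >= lit.size();
  switch (p.kind) {
    case PatternKind::kExact:
      return input == lit;
    case PatternKind::kPrefix:
      return fits && input.compare(0, lit.size(), lit) == 0;
    case PatternKind::kSuffix:
      return fits &&
          input.compare(input.size() - lit.size(), lit.size(), lit) == 0;
    case PatternKind::kSubstring:
      return input.find(lit) != std::string_view::npos;
    case PatternKind::kGeneric:
      return std::nullopt;
  }
  return std::nullopt;
}
```

For the benchmark pattern '%a\_b\_c%' with escape char '\', parsePattern yields kSubstring with the literal a_b_c, so matching is a single find over the input, which is essentially what strpos does.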

Pull Request resolved: #7730

Reviewed By: pedroerp

Differential Revision: D52077250

Pulled By: mbasmanova

fbshipit-source-id: 39703ddcc7f4f2044460d93866670f730e139120
xumingming authored and facebook-github-bot committed Dec 13, 2023
1 parent ddde53b commit 1779e82
Showing 6 changed files with 503 additions and 139 deletions.
9 changes: 7 additions & 2 deletions velox/benchmarks/basic/CMakeLists.txt
@@ -63,10 +63,15 @@ add_executable(velox_benchmark_basic_preproc Preproc.cpp)
 target_link_libraries(velox_benchmark_basic_preproc ${velox_benchmark_deps}
                       velox_functions_prestosql velox_vector_test_lib)

-add_executable(velox_like_functions_benchmark LikeFunctionsBenchmark.cpp)
-target_link_libraries(velox_like_functions_benchmark ${velox_benchmark_deps}
+add_executable(velox_like_tpch_benchmark LikeTpchBenchmark.cpp)
+target_link_libraries(velox_like_tpch_benchmark ${velox_benchmark_deps}
                       velox_functions_lib velox_tpch_gen velox_vector_test_lib)

+add_executable(velox_like_benchmark LikeBenchmark.cpp)
+target_link_libraries(
+  velox_like_benchmark ${velox_benchmark_deps} velox_functions_lib
+  velox_functions_prestosql velox_vector_test_lib)
+
 add_executable(velox_benchmark_basic_vector_fuzzer VectorFuzzer.cpp)
 target_link_libraries(velox_benchmark_basic_vector_fuzzer
                       ${velox_benchmark_deps} velox_vector_test_lib)
96 changes: 96 additions & 0 deletions velox/benchmarks/basic/LikeBenchmark.cpp
@@ -0,0 +1,96 @@
/*
 * Copyright (c) Facebook, Inc. and its affiliates.
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

#include <folly/Benchmark.h>
#include <folly/init/Init.h>

#include "velox/benchmarks/ExpressionBenchmarkBuilder.h"
#include "velox/functions/lib/Re2Functions.h"
#include "velox/functions/prestosql/registration/RegistrationFunctions.h"

using namespace facebook;
using namespace facebook::velox;
using namespace facebook::velox::functions;
using namespace facebook::velox::functions::test;
using namespace facebook::velox::memory;

int main(int argc, char** argv) {
  folly::Init init(&argc, &argv);

  // Register the optimized LIKE implementation under test.
  exec::registerStatefulVectorFunction("like", likeSignatures(), makeLike);
  // Register the scalar functions.
  prestosql::registerAllScalarFunctions("");

  ExpressionBenchmarkBuilder benchmarkBuilder;
  const vector_size_t vectorSize = 1000;
  auto vectorMaker = benchmarkBuilder.vectorMaker();

  auto makeInput =
      [&](vector_size_t vectorSize, bool padAtHead, bool padAtTail) {
        return vectorMaker.flatVector<std::string>(vectorSize, [&](auto row) {
          // Strings in even rows contain, start with, or end with "a_b_c",
          // depending on the values of padAtHead and padAtTail. Strings in
          // odd rows consist only of 'x' and never match.
          if (row % 2 == 0) {
            auto padding = std::string(row / 2 + 1, 'x');
            if (padAtHead && padAtTail) {
              return fmt::format("{}a_b_c{}", padding, padding);
            } else if (padAtHead) {
              return fmt::format("{}a_b_c", padding);
            } else if (padAtTail) {
              return fmt::format("a_b_c{}", padding);
            } else {
              return std::string("a_b_c");
            }
          } else {
            return std::string(row, 'x');
          }
        });
      };

  auto substringInput = makeInput(vectorSize, true, true);
  auto prefixInput = makeInput(vectorSize, false, true);
  auto suffixInput = makeInput(vectorSize, true, false);

  benchmarkBuilder
      .addBenchmarkSet(
          "like_substring", vectorMaker.rowVector({"col0"}, {substringInput}))
      .addExpression("like_substring", R"(like(col0, '%a\_b\_c%', '\'))")
      .addExpression("strpos", R"(strpos(col0, 'a_b_c') > 0)");

  benchmarkBuilder
      .addBenchmarkSet(
          "like_prefix", vectorMaker.rowVector({"col0"}, {prefixInput}))
      .addExpression("like_prefix", R"(like(col0, 'a\_b\_c%', '\'))")
      .addExpression("starts_with", R"(starts_with(col0, 'a_b_c'))");

  benchmarkBuilder
      .addBenchmarkSet(
          "like_suffix", vectorMaker.rowVector({"col0"}, {suffixInput}))
      .addExpression("like_suffix", R"(like(col0, '%a\_b\_c', '\'))")
      .addExpression("ends_with", R"(ends_with(col0, 'a_b_c'))");

  benchmarkBuilder
      .addBenchmarkSet(
          "like_generic", vectorMaker.rowVector({"col0"}, {substringInput}))
      .addExpression("like_generic", R"(like(col0, '%a%b%c'))");

  benchmarkBuilder.registerBenchmarks();
  benchmarkBuilder.testBenchmarks();
  folly::runBenchmarks();
  return 0;
}