-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add str_to_map Spark function #5842
Conversation
✅ Deploy Preview for meta-velox canceled.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we describe the basic semantics of this function? We also need to add Spark's impl. to the PR description. Is there a impl. in presto sql, and what's the difference if there is?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we update relevant doc for this function? https://github.com/facebookincubator/velox/tree/main/velox/docs/functions/spark
const char* end = pos + current.size(); | ||
const char* pair; | ||
const char* kv; | ||
do { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe adding some comments for below branches would be better.
205c654
to
8462b33
Compare
@@ -14,8 +14,18 @@ | |||
* limitations under the License. | |||
*/ | |||
|
|||
#include <iostream> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't include iostream
{{"a", "3"}, {"b", "2"}} | ||
}, | ||
}; | ||
// clang-format off |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
clang-format on?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Refactored this part. Then no need to turn on/off clang-format now.
}; | ||
|
||
std::vector<std::shared_ptr<exec::FunctionSignature>> strToMapSignatures() { | ||
// varchar, varchar -> array(varchar) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dot in the end of line?
@@ -154,6 +154,14 @@ Unless specified otherwise, all functions return NULL if at least one of the arg | |||
SELECT startswith('js SQL', 'SQL'); -- false | |||
SELECT startswith('js SQL', null); -- NULL | |||
|
|||
.. spark:function:: str_to_map(text[, pairDelim[, keyValueDelim]]) -> map(string, string) | |||
Create a map after splitting the ``text`` into key/value pairs using delimiters. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you explain the constraint for delimiter? Does ;;
supported?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the current impl., the constraint is only single character is accepted as delimiter. So ;
is allowed.
std::vector<StringView>, | ||
std::vector<std::pair<StringView, std::optional<StringView>>>>>{ | ||
{ | ||
{"a:1,b:2,c:3"}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you test this delimiter?
spark-sql> SELECT str_to_map('a::1,b::2,c::3', ',', '::');
{"a":"1","b":"2","c":"3"}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And the test:
spark-sql> SELECT str_to_map('a::1,b::2,c::3', ',', 'c');
{"":"::3","a::1":null,"b::2":null}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just added the 2nd test case. For the first one, as this PR just implements basic single-character delimiter, let's cover multi-character in separate PR. Thanks!
{"a:1_b:2_c:3", "_"}, | ||
{{"a", "1"}, {"b", "2"}, {"c", "3"}} | ||
}, | ||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you also add this test suite? https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/StringFunctionsSuite.scala#L699
@@ -134,10 +143,148 @@ std::vector<std::shared_ptr<exec::FunctionSignature>> signatures() { | |||
.build()}; | |||
} | |||
|
|||
// str_to_map(expr [, pairDelim [, keyValueDelim] ] ) | |||
// Currently only supports single-character pairDelim & keyValueDelim. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @mbasmanova, I am following up this work.
I just noted that Spark accepts multi-character as delimiter for this function and the delimiter can also be regex pattern. But single-character delimiters are more common in the usage. Can we firstly propose this basic single-character impl.?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@PHILO-HE Sure.
b930240
to
d46f0bb
Compare
@rui-mo, please take a review further. Thanks! |
Update for PR description.
Just noted presto's split_to_map was implemented in velox two months ago. See link.
This part of code change is reverted as it is not required for current implementation. |
|
||
private: | ||
char entryDelim_; | ||
char keyValueDelim_; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
const char?
const std::vector<exec::VectorFunctionArg>& inputArgs, | ||
const core::QueryConfig& /*config*/) { | ||
char entryDelim; | ||
char keyValueDelim; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe remove these pre-declarations.
VELOX_USER_CHECK( | ||
inputArgs.size() == 3, "Expected 3 args for StringToMap function."); | ||
BaseVector* constantVector = inputArgs[1].constantValue.get(); | ||
VELOX_USER_CHECK( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
VELOX_USER_CHECK_NOT_NULL
constantVector != nullptr, | ||
"StringToMap function only allows constant entry delimiter."); | ||
auto constantStringView = | ||
constantVector->as<ConstantVector<StringView>>()->valueAt(0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please ensure the constant is not null. Same for the inputArgs[2].
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for you review! Fixed.
const std::vector<std::pair<StringView, std::optional<StringView>>>& | ||
expect) { | ||
std::string expr; | ||
expr = fmt::format("str_to_map(c0, '{}', '{}')", inputs[1], inputs[2]); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
std::string expr = fmt::format("str_to_map(c0, '{}', '{}')", inputs[1], inputs[2]);
resultWriter.init(*result->as<MapVector>()); | ||
|
||
folly::F14FastMap<StringView, vector_size_t> keyToIdx; | ||
rows.applyToSelected([&](vector_size_t row) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe use context.applyToSelectedNoThrow
instead.
entryEnd = std::find(pos, end, entryDelim_); | ||
keyEnd = std::find(pos, entryEnd, keyValueDelim_); | ||
auto key = StringView(pos, keyEnd - pos); | ||
auto iter = keyToIdx.find(key); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
const auto?
keyEnd = std::find(pos, entryEnd, keyValueDelim_); | ||
auto key = StringView(pos, keyEnd - pos); | ||
auto iter = keyToIdx.find(key); | ||
VELOX_USER_CHECK( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
VELOX_USER_CHECK_EQ
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Compiler complains type mismatched if we use it. So keep it unchanged.
const std::vector<exec::VectorFunctionArg>& inputArgs, | ||
const core::QueryConfig& /*config*/) { | ||
VELOX_USER_CHECK( | ||
inputArgs.size() == 3, "Expected 3 args for StringToMap function."); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Expected -> Expects?
VELOX_USER_CHECK_NOT_NULL( | ||
constantVector, | ||
"StringToMap function requires constant entry/key-value delimiter."); | ||
auto constantStringView = constantVector->as<ConstantVector<StringView>>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
const auto?
"StringToMap function requires constant entry/key-value delimiter."); | ||
auto constantStringView = constantVector->as<ConstantVector<StringView>>(); | ||
if (constantStringView->isNullAt(0)) { | ||
return std::nullopt; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it valid in Spark if this argument is null constant?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Spark returns null result if any input is null. Aligned.
exec::EvalCtx& context, | ||
VectorPtr& result) const override { | ||
exec::DecodedArgs decodedArgs(rows, args, context); | ||
DecodedVector* inputString = decodedArgs.at(0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any fast path from constant vector and flat vector?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @mbasmanova, do I need to add fast path optimization for constant/flat vector here? Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The best way to answer this question is to implement the fast paths and benchmark these against generic path. See velox/benchmarks/ExpressionBenchmarkBuilder.h for an easy way to write benchmarks for expressions.
testStringToMap({"", ",", "="}, {{"", std::nullopt}}); | ||
testStringToMap( | ||
{"a::1,b::2,c::3", ",", "c"}, | ||
{{"", "::3"}, {"a::1", std::nullopt}, {"b::2", std::nullopt}}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we adjust the output order to align with input order for readability?
Returns a map by splitting ``string`` into entries with ``entryDelim`` and splitting | ||
each entry into key/value with ``keyValueDelim``. | ||
Only supports constant single-character ``entryDelim`` and ``keyValueDelim``. Throws | ||
exception when duplicate map keys are found for single row's result, consistent with |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we add tests for cases that throw?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Already covered in test.
|
||
namespace facebook::velox::functions::sparksql { | ||
|
||
template <typename TExecCtx> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TExecCtx -> T
std::string_view entryDelimiter, | ||
std::string_view keyValueDelimiter) const { | ||
size_t pos = 0; | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove this empty line?
expect) { | ||
std::vector<VectorPtr> row; | ||
row.emplace_back(makeFlatVector<StringView>({inputs[0]})); | ||
std::string expr = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
const
row.emplace_back(makeFlatVector<StringView>({inputs[0]})); | ||
std::string expr = | ||
fmt::format("str_to_map(c0, '{}', '{}')", inputs[1], inputs[2]); | ||
auto result = evaluate<MapVector>(expr, makeRowVector(row)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe makeRowVector(row) -> makeRowVector({makeFlatVector({input})}) and remove row.
{{"", "::3"}, {"a::1", std::nullopt}, {"b::2", std::nullopt}}); | ||
testStringToMap( | ||
{"a:1_b:2_c:3", "_", ":"}, {{"a", "1"}, {"b", "2"}, {"c", "3"}}); | ||
// Same delimiters. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Empty line above.
testStringToMap( | ||
{"a:1_b:2_c:3", "_", "_"}, | ||
{{"a:1", std::nullopt}, {"b:2", std::nullopt}, {"c:3", std::nullopt}}); | ||
// Exception for duplicated keys. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Empty line above.
TEST_F(StringToMapTest, Basics) { | ||
testStringToMap( | ||
{"a:1,b:2,c:3", ",", ":"}, {{"a", "1"}, {"b", "2"}, {"c", "3"}}); | ||
testStringToMap({"a: ,b:2", ",", ":"}, {{"a", " "}, {"b", "2"}}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe also add test case for {"a:,b:2", ",", ":"}
.
Returns a map by splitting ``string`` into entries with ``entryDelim`` and splitting | ||
each entry into key/value with ``keyValueDelim``. | ||
Only supports constant single-character ``entryDelim`` and ``keyValueDelim``. Throws |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is any single-character allowed to be entry and key-value delimeter in Spark? Do we have limitations for their content?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not found the requirement in spark document. Assume there is no special requirement.
bb19e08
to
c8fe082
Compare
*/ | ||
#include "velox/common/base/tests/GTestUtils.h" | ||
#include "velox/functions/sparksql/tests/SparkFunctionBaseTest.h" | ||
#include "velox/type/Type.h" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Type.h seems to be not needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, just fixed. Thanks!
Hi @mbasmanova, I just fixed all comments from Rui. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Returns a map by splitting ``string`` into entries with ``entryDelim`` and splitting | ||
each entry into key/value with ``keyValueDelim``. | ||
Only supports constant single-character ``entryDelim`` and ``keyValueDelim``. Throws |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: entryDelim
and keyValueDelim
must be constant single-character strings.
const arg_type<Varchar>& input, | ||
const arg_type<Varchar>& entryDelimiter, | ||
const arg_type<Varchar>& keyValueDelimiter) { | ||
VELOX_USER_CHECK(!entryDelimiter.empty(), "entryDelimiter is empty."); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we also check that their length is 1 character? Do we allow unicode characters?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, we should. Just updated. Actually, we only allow ascii character. Just clarified this in document.
std::string_view keyValueDelimiter, | ||
folly::F14FastSet<std::string_view>& keys) const { | ||
const auto delimiterPos = entry.find(keyValueDelimiter, 0); | ||
// Allows keyValue delimiter not found. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be nice to describe this behavior in the documentation.
const auto key = std::string_view(entry.data(), delimiterPos); | ||
VELOX_USER_CHECK( | ||
keys.insert(key).second, | ||
"Duplicate keys are not allowed: ('{}').", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need for both parens and quotes. Quotes should be sufficient.
@mbasmanova has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
const arg_type<Varchar>& input, | ||
const arg_type<Varchar>& entryDelimiter, | ||
const arg_type<Varchar>& keyValueDelimiter) { | ||
VELOX_USER_CHECK( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
VELOX_USER_CHECK_EQ
} | ||
}; | ||
|
||
TEST_F(StringToMapTest, Basics) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
naming: basic
VELOX_ASSERT_THROW( | ||
evaluateStringToMap({"a:1,b:2,a:3", ",", ":"}), | ||
"Duplicate keys are not allowed: 'a'."); | ||
VELOX_ASSERT_THROW( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would you also check for empty, non-ASCII, non-singlechar delimiters?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just added some test cases. Thanks!
@mbasmanova has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
@laithsakka Laith, could you check these comments?
|
A bit clarification to avoid possible misleading. This mentioned change was included in the initial draft of Ma-Jian1 (the PR author). But I found it is not required at least for this PR actually. So the latest patch doesn't change this piece of code. As I have no permission to modify the PR description, these comments are not removed. @mbasmanova, maybe we can ignore this issue. |
I wonder why this is the case? Next time, if you find this problem, would you try creating your own PR rather than working on someoneelses? |
@mbasmanova merged this pull request in af7d1ef. |
Yes, makes sense. Thanks for your review! |
…ookincubator#5842) This commit integrates the community-provided implementation of str_to_map from PR facebookincubator#5842. The change addresses the issue where the previous in-house version of the function occasionally returned null in certain scenarios. Changes made: - Replaced the existing str_to_map function with the community version. - Conducted thorough testing to ensure compatibility and correct behavior. - Updated relevant documentation and comments to reflect the new source. The newly adopted implementation resolves the null-return bug and is expected to improve the reliability of the str_to_map utility. The community-driven approach also brings the additional benefit of shared maintenance and potential future enhancements.
Spark implementation.: https://github.com/apache/spark/blob/v3.2.2/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala#L531
It's similar to
split_to_map
in prestosql which is not implemented.