Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add str_to_map Spark function #5842

Closed
wants to merge 4 commits into from

Conversation

majian4work
Copy link
Contributor

@majian4work majian4work commented Jul 26, 2023

@netlify
Copy link

netlify bot commented Jul 26, 2023

Deploy Preview for meta-velox canceled.

Name Link
🔨 Latest commit e3d4647
🔍 Latest deploy log https://app.netlify.com/sites/meta-velox/deploys/6570323f39537f0009ed6c84

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jul 26, 2023
@majian4work
Copy link
Contributor Author

@rui-mo

Copy link
Collaborator

@rui-mo rui-mo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we describe the basic semantics of this function? We also need to add Spark's impl. to the PR description. Is there a impl. in presto sql, and what's the difference if there is?

velox/functions/sparksql/SplitFunctions.cpp Outdated Show resolved Hide resolved
velox/functions/sparksql/SplitFunctions.cpp Outdated Show resolved Hide resolved
velox/functions/sparksql/tests/SplitFunctionsTest.cpp Outdated Show resolved Hide resolved
Copy link
Collaborator

@rui-mo rui-mo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

velox/functions/sparksql/SplitFunctions.cpp Outdated Show resolved Hide resolved
const char* end = pos + current.size();
const char* pair;
const char* kv;
do {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe adding some comments for below branches would be better.

velox/expression/ComplexWriterTypes.h Outdated Show resolved Hide resolved
@majian4work majian4work force-pushed the str_to_map branch 4 times, most recently from 205c654 to 8462b33 Compare August 1, 2023 02:13
@majian4work majian4work changed the title support spark str_to_map Add str_to_map Spark function Aug 8, 2023
@@ -14,8 +14,18 @@
* limitations under the License.
*/

#include <iostream>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't include iostream

{{"a", "3"}, {"b", "2"}}
},
};
// clang-format off
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clang-format on?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Refactored this part. Then no need to turn on/off clang-format now.

};

std::vector<std::shared_ptr<exec::FunctionSignature>> strToMapSignatures() {
// varchar, varchar -> array(varchar)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dot in the end of line?

@@ -154,6 +154,14 @@ Unless specified otherwise, all functions return NULL if at least one of the arg
SELECT startswith('js SQL', 'SQL'); -- false
SELECT startswith('js SQL', null); -- NULL

.. spark:function:: str_to_map(text[, pairDelim[, keyValueDelim]]) -> map(string, string)
Create a map after splitting the ``text`` into key/value pairs using delimiters.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain the constraint for delimiter? Does ;; supported?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the current impl., the constraint is only single character is accepted as delimiter. So ; is allowed.

std::vector<StringView>,
std::vector<std::pair<StringView, std::optional<StringView>>>>>{
{
{"a:1,b:2,c:3"},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you test this delimiter?

spark-sql> SELECT str_to_map('a::1,b::2,c::3', ',', '::');
{"a":"1","b":"2","c":"3"}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And the test:

spark-sql> SELECT str_to_map('a::1,b::2,c::3', ',', 'c');
{"":"::3","a::1":null,"b::2":null}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just added the 2nd test case. For the first one, as this PR just implements basic single-character delimiter, let's cover multi-character in separate PR. Thanks!

{"a:1_b:2_c:3", "_"},
{{"a", "1"}, {"b", "2"}, {"c", "3"}}
},
{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@ -134,10 +143,148 @@ std::vector<std::shared_ptr<exec::FunctionSignature>> signatures() {
.build()};
}

// str_to_map(expr [, pairDelim [, keyValueDelim] ] )
// Currently only supports single-character pairDelim & keyValueDelim.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @mbasmanova, I am following up this work.
I just noted that Spark accepts multi-character as delimiter for this function and the delimiter can also be regex pattern. But single-character delimiters are more common in the usage. Can we firstly propose this basic single-character impl.?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@PHILO-HE Sure.

@PHILO-HE PHILO-HE force-pushed the str_to_map branch 3 times, most recently from b930240 to d46f0bb Compare November 9, 2023 05:38
@PHILO-HE
Copy link
Contributor

PHILO-HE commented Nov 9, 2023

@rui-mo, please take a review further. Thanks!

@PHILO-HE
Copy link
Contributor

PHILO-HE commented Nov 9, 2023

Update for PR description.

Spark implementation.: https://github.com/apache/spark/blob/v3.2.2/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala#L531

It's similar to split_to_map in prestosql which is not implemented.

Just noted presto's split_to_map was implemented in velox two months ago. See link.
We are still proposing this separate impl. for spark due to there are semantic differences between spark and presto.
Known differences: 1) Spark allows same entry/key-value delimiter, but presto not. 2) Spark allows regexp pattern as delimiter (can be implemented based on this PR), but presto not.

I'm not sure why there's a need to check std_interface in operator[] for K=Varchar, just comment it out for overwrite value in the map. The emplace function also seems to have the same issue for K=Varchar, but I've found that I can get around it by using add_item or add_null. Overall, it seems that the behavior is not very consistent.

  std::tuple<PrimitiveWriter<K, false>, PrimitiveWriter<V>> operator[](
      vector_size_t index) {
    // static_assert(std_interface, "operator [] not allowed for this map");

This part of code change is reverted as it is not required for current implementation.


private:
char entryDelim_;
char keyValueDelim_;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

const char?

const std::vector<exec::VectorFunctionArg>& inputArgs,
const core::QueryConfig& /*config*/) {
char entryDelim;
char keyValueDelim;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe remove these pre-declarations.

VELOX_USER_CHECK(
inputArgs.size() == 3, "Expected 3 args for StringToMap function.");
BaseVector* constantVector = inputArgs[1].constantValue.get();
VELOX_USER_CHECK(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

VELOX_USER_CHECK_NOT_NULL

constantVector != nullptr,
"StringToMap function only allows constant entry delimiter.");
auto constantStringView =
constantVector->as<ConstantVector<StringView>>()->valueAt(0);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please ensure the constant is not null. Same for the inputArgs[2].

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for you review! Fixed.

const std::vector<std::pair<StringView, std::optional<StringView>>>&
expect) {
std::string expr;
expr = fmt::format("str_to_map(c0, '{}', '{}')", inputs[1], inputs[2]);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

std::string expr = fmt::format("str_to_map(c0, '{}', '{}')", inputs[1], inputs[2]);

resultWriter.init(*result->as<MapVector>());

folly::F14FastMap<StringView, vector_size_t> keyToIdx;
rows.applyToSelected([&](vector_size_t row) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe use context.applyToSelectedNoThrow instead.

entryEnd = std::find(pos, end, entryDelim_);
keyEnd = std::find(pos, entryEnd, keyValueDelim_);
auto key = StringView(pos, keyEnd - pos);
auto iter = keyToIdx.find(key);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

const auto?

keyEnd = std::find(pos, entryEnd, keyValueDelim_);
auto key = StringView(pos, keyEnd - pos);
auto iter = keyToIdx.find(key);
VELOX_USER_CHECK(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

VELOX_USER_CHECK_EQ

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Compiler complains type mismatched if we use it. So keep it unchanged.

const std::vector<exec::VectorFunctionArg>& inputArgs,
const core::QueryConfig& /*config*/) {
VELOX_USER_CHECK(
inputArgs.size() == 3, "Expected 3 args for StringToMap function.");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Expected -> Expects?

VELOX_USER_CHECK_NOT_NULL(
constantVector,
"StringToMap function requires constant entry/key-value delimiter.");
auto constantStringView = constantVector->as<ConstantVector<StringView>>();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

const auto?

"StringToMap function requires constant entry/key-value delimiter.");
auto constantStringView = constantVector->as<ConstantVector<StringView>>();
if (constantStringView->isNullAt(0)) {
return std::nullopt;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it valid in Spark if this argument is null constant?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Spark returns null result if any input is null. Aligned.

exec::EvalCtx& context,
VectorPtr& result) const override {
exec::DecodedArgs decodedArgs(rows, args, context);
DecodedVector* inputString = decodedArgs.at(0);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any fast path from constant vector and flat vector?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @mbasmanova, do I need to add fast path optimization for constant/flat vector here? Thanks!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The best way to answer this question is to implement the fast paths and benchmark these against generic path. See velox/benchmarks/ExpressionBenchmarkBuilder.h for an easy way to write benchmarks for expressions.

testStringToMap({"", ",", "="}, {{"", std::nullopt}});
testStringToMap(
{"a::1,b::2,c::3", ",", "c"},
{{"", "::3"}, {"a::1", std::nullopt}, {"b::2", std::nullopt}});
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we adjust the output order to align with input order for readability?

Returns a map by splitting ``string`` into entries with ``entryDelim`` and splitting
each entry into key/value with ``keyValueDelim``.
Only supports constant single-character ``entryDelim`` and ``keyValueDelim``. Throws
exception when duplicate map keys are found for single row's result, consistent with
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we add tests for cases that throw?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Already covered in test.


namespace facebook::velox::functions::sparksql {

template <typename TExecCtx>
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TExecCtx -> T

std::string_view entryDelimiter,
std::string_view keyValueDelimiter) const {
size_t pos = 0;

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove this empty line?

expect) {
std::vector<VectorPtr> row;
row.emplace_back(makeFlatVector<StringView>({inputs[0]}));
std::string expr =
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

const

row.emplace_back(makeFlatVector<StringView>({inputs[0]}));
std::string expr =
fmt::format("str_to_map(c0, '{}', '{}')", inputs[1], inputs[2]);
auto result = evaluate<MapVector>(expr, makeRowVector(row));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe makeRowVector(row) -> makeRowVector({makeFlatVector({input})}) and remove row.

{{"", "::3"}, {"a::1", std::nullopt}, {"b::2", std::nullopt}});
testStringToMap(
{"a:1_b:2_c:3", "_", ":"}, {{"a", "1"}, {"b", "2"}, {"c", "3"}});
// Same delimiters.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Empty line above.

testStringToMap(
{"a:1_b:2_c:3", "_", "_"},
{{"a:1", std::nullopt}, {"b:2", std::nullopt}, {"c:3", std::nullopt}});
// Exception for duplicated keys.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Empty line above.

TEST_F(StringToMapTest, Basics) {
testStringToMap(
{"a:1,b:2,c:3", ",", ":"}, {{"a", "1"}, {"b", "2"}, {"c", "3"}});
testStringToMap({"a: ,b:2", ",", ":"}, {{"a", " "}, {"b", "2"}});
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe also add test case for {"a:,b:2", ",", ":"}.

Returns a map by splitting ``string`` into entries with ``entryDelim`` and splitting
each entry into key/value with ``keyValueDelim``.
Only supports constant single-character ``entryDelim`` and ``keyValueDelim``. Throws
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is any single-character allowed to be entry and key-value delimeter in Spark? Do we have limitations for their content?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not found the requirement in spark document. Assume there is no special requirement.

@PHILO-HE PHILO-HE force-pushed the str_to_map branch 4 times, most recently from bb19e08 to c8fe082 Compare December 6, 2023 06:09
*/
#include "velox/common/base/tests/GTestUtils.h"
#include "velox/functions/sparksql/tests/SparkFunctionBaseTest.h"
#include "velox/type/Type.h"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Type.h seems to be not needed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, just fixed. Thanks!

@PHILO-HE
Copy link
Contributor

PHILO-HE commented Dec 6, 2023

Hi @mbasmanova, I just fixed all comments from Rui.

Copy link
Contributor

@mbasmanova mbasmanova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@PHILO-HE Looks good to me % a couple nits.

@rui-mo Rui, thank you for your help reviewing.

Returns a map by splitting ``string`` into entries with ``entryDelim`` and splitting
each entry into key/value with ``keyValueDelim``.
Only supports constant single-character ``entryDelim`` and ``keyValueDelim``. Throws
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: entryDelim and keyValueDelim must be constant single-character strings.

const arg_type<Varchar>& input,
const arg_type<Varchar>& entryDelimiter,
const arg_type<Varchar>& keyValueDelimiter) {
VELOX_USER_CHECK(!entryDelimiter.empty(), "entryDelimiter is empty.");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we also check that their length is 1 character? Do we allow unicode characters?

Copy link
Contributor

@PHILO-HE PHILO-HE Dec 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we should. Just updated. Actually, we only allow ascii character. Just clarified this in document.

std::string_view keyValueDelimiter,
folly::F14FastSet<std::string_view>& keys) const {
const auto delimiterPos = entry.find(keyValueDelimiter, 0);
// Allows keyValue delimiter not found.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice to describe this behavior in the documentation.

const auto key = std::string_view(entry.data(), delimiterPos);
VELOX_USER_CHECK(
keys.insert(key).second,
"Duplicate keys are not allowed: ('{}').",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need for both parens and quotes. Quotes should be sufficient.

@facebook-github-bot
Copy link
Contributor

@mbasmanova has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

const arg_type<Varchar>& input,
const arg_type<Varchar>& entryDelimiter,
const arg_type<Varchar>& keyValueDelimiter) {
VELOX_USER_CHECK(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

VELOX_USER_CHECK_EQ

}
};

TEST_F(StringToMapTest, Basics) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

naming: basic

VELOX_ASSERT_THROW(
evaluateStringToMap({"a:1,b:2,a:3", ",", ":"}),
"Duplicate keys are not allowed: 'a'.");
VELOX_ASSERT_THROW(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you also check for empty, non-ASCII, non-singlechar delimiters?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just added some test cases. Thanks!

@facebook-github-bot
Copy link
Contributor

@mbasmanova has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@mbasmanova
Copy link
Contributor

@laithsakka Laith, could you check these comments?

I'm not sure why there's a need to check std_interface in operator[] for K=Varchar, just comment it out for overwrite value in the map. The emplace function also seems to have the same issue for K=Varchar, but I've found that I can get around it by using add_item or add_null. Overall, it seems that the behavior is not very consistent.
cpp
  std::tuple<PrimitiveWriter<K, false>, PrimitiveWriter<V>> operator[](
      vector_size_t index) {
    // static_assert(std_interface, "operator [] not allowed for this map");
    VELOX_DCHECK_LT(index, length_, "out of bound access");
    return {
        PrimitiveWriter<K, false>{keysVector_, innerOffset_ + index},
        PrimitiveWriter<V>{valuesVector_, innerOffset_ + index}};
  }

@PHILO-HE
Copy link
Contributor

PHILO-HE commented Dec 6, 2023

@laithsakka Laith, could you check these comments?

I'm not sure why there's a need to check std_interface in operator[] for K=Varchar, just comment it out for overwrite value in the map. The emplace function also seems to have the same issue for K=Varchar, but I've found that I can get around it by using add_item or add_null. Overall, it seems that the behavior is not very consistent.
cpp
  std::tuple<PrimitiveWriter<K, false>, PrimitiveWriter<V>> operator[](
      vector_size_t index) {
    // static_assert(std_interface, "operator [] not allowed for this map");
    VELOX_DCHECK_LT(index, length_, "out of bound access");
    return {
        PrimitiveWriter<K, false>{keysVector_, innerOffset_ + index},
        PrimitiveWriter<V>{valuesVector_, innerOffset_ + index}};
  }

A bit clarification to avoid possible misleading. This mentioned change was included in the initial draft of Ma-Jian1 (the PR author). But I found it is not required at least for this PR actually. So the latest patch doesn't change this piece of code.

As I have no permission to modify the PR description, these comments are not removed. @mbasmanova, maybe we can ignore this issue.

@mbasmanova
Copy link
Contributor

As I have no permission to modify the PR description

I wonder why this is the case? Next time, if you find this problem, would you try creating your own PR rather than working on someoneelses?

@facebook-github-bot
Copy link
Contributor

@mbasmanova merged this pull request in af7d1ef.

@PHILO-HE
Copy link
Contributor

PHILO-HE commented Dec 7, 2023

As I have no permission to modify the PR description

I wonder why this is the case? Next time, if you find this problem, would you try creating your own PR rather than working on someoneelses?

Yes, makes sense. Thanks for your review!

weixiuli pushed a commit to weixiuli/velox that referenced this pull request Jan 17, 2024
…ookincubator#5842)

This commit integrates the community-provided implementation of str_to_map from PR facebookincubator#5842. The change addresses the issue where the previous in-house version of the function occasionally returned null in certain scenarios.

Changes made:
- Replaced the existing str_to_map function with the community version.
- Conducted thorough testing to ensure compatibility and correct behavior.
- Updated relevant documentation and comments to reflect the new source.

The newly adopted implementation resolves the null-return bug and is expected to improve the reliability of the str_to_map utility. The community-driven approach also brings the additional benefit of shared maintenance and potential future enhancements.
@rui-mo rui-mo mentioned this pull request Jul 15, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. Merged
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants