feat(functions): Support for canonicalization of JSON #11284

Closed
wants to merge 1 commit into from

Contributor

@kgpai kgpai commented Oct 17, 2024

This is a preliminary PR that adds support for canonicalization of JSON strings. This initial PR only tackles canonicalization in json_parse; another diff will handle CAST(_ AS JSON). Canonicalization is required because Velox currently treats JSON as plain varchars, so equivalent JSON values with different backing varchars are treated as distinct JSON values, which is incorrect and contrary to Presto's behavior.
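To illustrate the problem described above: two JSON strings can denote the same value while differing byte for byte. The following is a minimal sketch, not the code in this PR. It assumes, for illustration only, that canonicalization means dropping whitespace outside string literals; `stripJsonWhitespace` and `sameJson` are hypothetical names, and the actual canonical form may involve more than this.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: removes whitespace that appears outside of JSON
// string literals. Assumes the input is already valid JSON.
std::string stripJsonWhitespace(const std::string& json) {
  std::string out;
  out.reserve(json.size());
  bool inString = false; // currently inside a "..." literal
  bool escaped = false;  // previous character was a backslash
  for (char c : json) {
    if (inString) {
      out.push_back(c); // everything inside a string is significant
      if (escaped) {
        escaped = false;
      } else if (c == '\\') {
        escaped = true;
      } else if (c == '"') {
        inString = false;
      }
    } else if (c == '"') {
      inString = true;
      out.push_back(c);
    } else if (c != ' ' && c != '\t' && c != '\n' && c != '\r') {
      out.push_back(c);
    }
  }
  assert(out.size() <= json.size()); // stripping never grows the text
  return out;
}

// Compares two JSON texts by their stripped form instead of raw bytes.
bool sameJson(const std::string& a, const std::string& b) {
  return stripJsonWhitespace(a) == stripJsonWhitespace(b);
}
```

With this sketch, `{"a": 1}` and `{ "a" :1 }` compare unequal as raw varchars but equal after stripping, which is the kind of mismatch the PR description refers to.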

@facebook-github-bot facebook-github-bot added the CLA Signed label Oct 17, 2024

netlify bot commented Oct 17, 2024

Deploy Preview for meta-velox canceled.

Name Link
🔨 Latest commit 95d8911
🔍 Latest deploy log https://app.netlify.com/sites/meta-velox/deploys/673e5508704150000821107e

Contributor Author

kgpai commented Oct 17, 2024

The first version of the diff isn't optimized and creates an additional copy before writing to the buffer. It also uses stringstream etc. for ease of use. I will optimize it after ascertaining correctness.

@kgpai kgpai requested review from Yuhta and gggrace14 October 17, 2024 00:03
@kgpai kgpai force-pushed the json_parse_changes branch from 683f681 to 30a901e Compare October 17, 2024 00:07
Contributor

@gggrace14 gggrace14 left a comment

Left some comments. Will leave the perf improvement suggestion to Jimmy and watch.

Thank you for making this change @kgpai Krishna! I think it will work functionally.

rows.end(),
stringViews,
std::vector<BufferPtr>{});
auto flatResult = localResult->asFlatVector<StringView>();
Contributor

Is this line redundant with line 178, localResult=? Both return a pointer to FlatVector

JsonLeafView(const StringView view) : view_(view){};

void canonicalize(std::stringstream& stream) override {
stream << view_;
Contributor

Here is where we are going to call the unicode escape func in the future, right?

Contributor Author

That is right. I will add your unicode escape when copying the string views out.

}

jsonView = std::make_shared<JsonArrayView>(arrayPtr);
Contributor

Looking at the constructor of JsonArrayView, we are copying the std::vector twice. What is the conventional way to handle this case?

Contributor Author

In almost all cases, I think the compiler will end up doing copy elision, so this should not create a copy, as these types are not marked volatile. However, to be safe, I can add an explicit std::move.
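Whether a copy happens here depends on how the constructor is written. Under the C++ rules, copy elision covers temporaries and returned locals; it does not cover a named by-value parameter that is then copied into a member. The JsonArrayView constructor is not shown in this conversation, so the sketch below uses hypothetical classes (`ArrayViewCopy`, `ArrayViewMove`) and a copy-counting element type (`Tracked`) to make the difference observable.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Counts element copies so the cost of each constructor is observable.
static int copyCount = 0;

struct Tracked {
  Tracked() = default;
  Tracked(const Tracked&) { ++copyCount; }
};

// Hypothetical: takes the vector by value, then copies it into the member.
// An lvalue argument is copied twice: into the parameter, then into items_.
class ArrayViewCopy {
 public:
  explicit ArrayViewCopy(std::vector<Tracked> items) : items_(items) {}
  std::size_t size() const { return items_.size(); }

 private:
  std::vector<Tracked> items_;
};

// Hypothetical: same signature, but the parameter is moved into the member.
// An lvalue argument is copied once; an rvalue argument is not copied at all.
class ArrayViewMove {
 public:
  explicit ArrayViewMove(std::vector<Tracked> items)
      : items_(std::move(items)) {}
  std::size_t size() const { return items_.size(); }

 private:
  std::vector<Tracked> items_;
};
```

With a three-element vector `v`, `ArrayViewCopy a(v)` performs six element copies, `ArrayViewMove b(v)` performs three, and `ArrayViewMove c(std::move(v))` performs none, because moving a std::vector transfers its buffer instead of copying elements.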

if (!doc.at_end()) {
return simdjson::TRAILING_CONTENT;
}
return simdjson::SUCCESS;
}

template <typename T>
- static simdjson::error_code validate(T value) {
+ static simdjson::error_code validate(T value, JsonViewPtr& jsonView) {
Contributor

It looks to me like we're spreading the canonicalization changes across JsonView::canonicalize() and JsonParseFunction::validate(), with JsonParseFunction::validate() doing the whitespace trimming. Personally, I would prefer to move all the canonicalization changes into JsonView::canonicalize() to be more modular, and leave JsonParseFunction::validate() to do simple validation as before.

Contributor Author

Yes, agreed. Moving the trim etc. to canonicalize().
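The split agreed on in this thread, with trimming living inside canonicalize() rather than in validation, could look roughly like the sketch below. This is an illustration under assumptions, not the code in this PR: the class names are borrowed from the diff fragments above, but the bodies, the use of std::string instead of StringView, and the `canonicalJson` helper are all hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Base node: every node knows how to write its own canonical form.
struct JsonView {
  virtual ~JsonView() = default;
  virtual void canonicalize(std::stringstream& stream) = 0;
};
using JsonViewPtr = std::shared_ptr<JsonView>;

// Leaf node holding raw scalar text; trimming happens here, not in validation.
struct JsonLeafView : JsonView {
  explicit JsonLeafView(std::string text) : text_(std::move(text)) {}

  void canonicalize(std::stringstream& stream) override {
    const char* whitespace = " \t\n\r";
    const auto begin = text_.find_first_not_of(whitespace);
    if (begin == std::string::npos) {
      return; // nothing but whitespace
    }
    const auto end = text_.find_last_not_of(whitespace);
    stream << text_.substr(begin, end - begin + 1);
  }

  std::string text_;
};

// Array node: delegates to its children and adds the separators itself.
struct JsonArrayView : JsonView {
  explicit JsonArrayView(std::vector<JsonViewPtr> elements)
      : elements_(std::move(elements)) {}

  void canonicalize(std::stringstream& stream) override {
    stream << '[';
    for (std::size_t i = 0; i < elements_.size(); ++i) {
      assert(elements_[i] != nullptr);
      if (i > 0) {
        stream << ',';
      }
      elements_[i]->canonicalize(stream);
    }
    stream << ']';
  }

  std::vector<JsonViewPtr> elements_;
};

// Hypothetical convenience wrapper around the stream-based interface.
std::string canonicalJson(JsonView& root) {
  std::stringstream stream;
  root.canonicalize(stream);
  return stream.str();
}
```

Keeping all output decisions in the view classes means a validation routine only has to build the tree and report errors, which is the modularity the review comment asks for.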

@kgpai kgpai force-pushed the json_parse_changes branch from 30a901e to 9bc80fb Compare October 28, 2024 19:43
@facebook-github-bot
Contributor

@kgpai has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

Contributor Author

kgpai commented Oct 28, 2024

@Yuhta Do you have concerns with using stringstream, and would you prefer directly memcpy'ing into a string buffer instead?

Contributor

Yuhta commented Oct 29, 2024

@kgpai Ideally we should calculate the exact output size beforehand, allocate it once, and then write into it.
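The size-then-write approach suggested here can be sketched with a small JSON string escaper: a first pass computes the exact output size, and a second pass writes into a buffer allocated once. This is a minimal sketch under assumptions, not Velox code; `escapedLength`, `escapedSize`, and `escapeJsonString` are hypothetical names, and the escaping rules are simplified to the short escapes plus `\u00XX` for other control bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: bytes needed to emit one input byte in a JSON string.
std::size_t escapedLength(unsigned char c) {
  switch (c) {
    case '"':
    case '\\':
    case '\b':
    case '\f':
    case '\n':
    case '\r':
    case '\t':
      return 2; // backslash plus one character
    default:
      return c < 0x20 ? 6 : 1; // \u00XX for other control bytes
  }
}

// Pass 1: compute the exact output size without allocating.
std::size_t escapedSize(const std::string& in) {
  std::size_t total = 0;
  for (unsigned char c : in) {
    total += escapedLength(c);
  }
  return total;
}

// Pass 2: allocate once, then write each byte at a running offset.
std::string escapeJsonString(const std::string& in) {
  static const char kHex[] = "0123456789abcdef";
  std::string out(escapedSize(in), '\0');
  std::size_t pos = 0;
  for (unsigned char c : in) {
    char shortEscape = 0;
    switch (c) {
      case '"': shortEscape = '"'; break;
      case '\\': shortEscape = '\\'; break;
      case '\b': shortEscape = 'b'; break;
      case '\f': shortEscape = 'f'; break;
      case '\n': shortEscape = 'n'; break;
      case '\r': shortEscape = 'r'; break;
      case '\t': shortEscape = 't'; break;
      default: break;
    }
    if (shortEscape != 0) {
      out[pos++] = '\\';
      out[pos++] = shortEscape;
    } else if (c < 0x20) {
      out[pos++] = '\\';
      out[pos++] = 'u';
      out[pos++] = '0';
      out[pos++] = '0';
      out[pos++] = kHex[c >> 4];
      out[pos++] = kHex[c & 0x0f];
    } else {
      out[pos++] = static_cast<char>(c);
    }
  }
  assert(pos == out.size()); // pass 1 and pass 2 must agree
  return out;
}
```

The final assertion is the useful part of the design: if the sizing pass and the writing pass ever disagree, the mismatch is caught immediately rather than showing up as a truncated or over-long result.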

@kgpai kgpai force-pushed the json_parse_changes branch 4 times, most recently from 49c10ef to 6859f21 Compare October 30, 2024 22:52
@kgpai kgpai force-pushed the json_parse_changes branch from 6859f21 to 288d79f Compare November 1, 2024 21:46
@kgpai kgpai changed the title from "[Draft] [WIP] Add support for canonicalization of JSON." to "Add support for canonicalization of JSON." Nov 1, 2024
@kgpai kgpai force-pushed the json_parse_changes branch from 288d79f to d57b02c Compare November 1, 2024 22:24
@kgpai kgpai force-pushed the json_parse_changes branch from d57b02c to 80fae98 Compare November 1, 2024 22:36
@kgpai kgpai force-pushed the json_parse_changes branch from 80fae98 to 8b17860 Compare November 4, 2024 19:39
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D65084925

kgpai added a commit to kgpai/velox-1 that referenced this pull request Nov 4, 2024
Pulled By: kgpai
velox/functions/prestosql/JsonFunctions.cpp
@@ -84,38 +162,71 @@ class JsonParseFunction : public exec::VectorFunction {
auto value = arg->as<ConstantVector<StringView>>()->valueAt(0);
paddedInput_.resize(value.size() + simdjson::SIMDJSON_PADDING);
memcpy(paddedInput_.data(), value.data(), value.size());
if (auto error = parse(value.size())) {
auto escapeSize = escapedStringSize(value.data(), value.size(), true);
Contributor

I wonder whether it is safe to pass the JSON directly into escapedStringSize like this, since it is meant for string node values in JSON, not the JSON itself. It might work, though, since everything outside string nodes should be ASCII.

Contributor Author

The escapeSize here is the worst-case size. It's hard to estimate the size for only the string nodes and then add the ASCII parts. I see no reason why this would be less than the size required, so I think this should be fine.

Contributor

@Yuhta Yuhta Nov 7, 2024

Yeah, we can keep this for now; if a weird bug shows up, this is a suspicious point. Maybe leave a comment about the situation.

kgpai added a commit to kgpai/velox-1 that referenced this pull request Nov 17, 2024
Reviewed By: Yuhta, gggrace14

Differential Revision: D65084925

Pulled By: kgpai
@kgpai kgpai force-pushed the json_parse_changes branch from 6d0d2b5 to fd1ddd0 Compare November 17, 2024 20:31
kgpai added a commit to kgpai/velox-1 that referenced this pull request Nov 17, 2024
@kgpai kgpai force-pushed the json_parse_changes branch from fd1ddd0 to 85f031c Compare November 17, 2024 20:33
kgpai added a commit to kgpai/velox-1 that referenced this pull request Nov 18, 2024
@kgpai kgpai force-pushed the json_parse_changes branch from 85f031c to f091c48 Compare November 18, 2024 02:48
@kgpai kgpai changed the title from "Add support for canonicalization of JSON." to "feat(functions): Support for canonicalization of JSON." Nov 18, 2024
kgpai added a commit to kgpai/velox-1 that referenced this pull request Nov 18, 2024
@kgpai kgpai force-pushed the json_parse_changes branch from f091c48 to a9c2e21 Compare November 18, 2024 21:47
@assignUser assignUser changed the title from "feat(functions): Support for canonicalization of JSON." to "feat(functions): Support for canonicalization of JSON" Nov 18, 2024
@kgpai kgpai changed the title from "feat(functions): Support for canonicalization of JSON" to "feat(functions): Support for canonicalization of JSON." Nov 18, 2024
@kgpai kgpai changed the title from "feat(functions): Support for canonicalization of JSON." to "feat(functions): Support for canonicalization of JSON" Nov 18, 2024
kgpai added a commit to kgpai/velox-1 that referenced this pull request Nov 20, 2024
@kgpai kgpai force-pushed the json_parse_changes branch from a9c2e21 to 2ce62a9 Compare November 20, 2024 07:26
kgpai added a commit to kgpai/velox-1 that referenced this pull request Nov 20, 2024
@kgpai kgpai force-pushed the json_parse_changes branch from 2ce62a9 to ce082db Compare November 20, 2024 17:58
@kgpai kgpai force-pushed the json_parse_changes branch from ce082db to 95d8911 Compare November 20, 2024 21:30
@facebook-github-bot
Contributor

@kgpai merged this pull request in de6a83d.

Conbench analyzed the 1 benchmark run on commit de6a83dc.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details.

Labels
CLA Signed, fb-exported, Merged

4 participants