-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(functions): Support for canonicalization of JSON #11284
Conversation
✅ Deploy Preview for meta-velox canceled.
|
The first version of the diff isnt optimized and creates an additional copy before writing to the buffer; It also uses stringstream etc for ease of use. I will be optimizing after ascertaining correctness. |
683f681
to
30a901e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left some comments. Will leave the perf improvement suggestion to Jimmy and watch.
Thank you for making this change @kgpai Krishna! I think it will work functionally.
rows.end(), | ||
stringViews, | ||
std::vector<BufferPtr>{}); | ||
auto flatResult = localResult->asFlatVector<StringView>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this line redundant with line 178, localResult=
? Both return a pointer to FlatVector
JsonLeafView(const StringView view) : view_(view){}; | ||
|
||
void canonicalize(std::stringstream& stream) override { | ||
stream << view_; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here is where we are going to call the unicode escape func in the future, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is right I will add your unicode escape when copying the string views out.
} | ||
|
||
jsonView = std::make_shared<JsonArrayView>(arrayPtr); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When looking at the constructor of JsonArrayView, we are copying the std::vector twice. What is the conventional way to do in this case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In almost all cases the compiler, I think the compiler will end up doing copy elision, so this should not create a copy as these types are not marked volatile . However to further satisfy you , i can do an explicit std::move.
if (!doc.at_end()) { | ||
return simdjson::TRAILING_CONTENT; | ||
} | ||
return simdjson::SUCCESS; | ||
} | ||
|
||
template <typename T> | ||
static simdjson::error_code validate(T value) { | ||
static simdjson::error_code validate(T value, JsonViewPtr& jsonView) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks to me we're spreading the canonicalization changes to JsonView::canonicalize()
and JsonParseFunction::validate()
, with JsonParseFunction::validate()
doing the whitespace trimming. Personally I would prefer to move all the canonicalization changes to JsonView::canonicalize()
to be more modularized, and leave JsonParseFunction::validate()
to do the simple validation as before.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, Agreed, moving the trim etc to the canonicalize().
30a901e
to
9bc80fb
Compare
@kgpai has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
@Yuhta Do you have concerns with using stringstream, and and instead directly memcpy'ing to some string buffer? |
@kgpai Ideally we should calculate the exact output size beforehand and allocate it once then writing into it. |
49c10ef
to
6859f21
Compare
@kgpai has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
6859f21
to
288d79f
Compare
@kgpai has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
288d79f
to
d57b02c
Compare
@kgpai has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
d57b02c
to
80fae98
Compare
@kgpai has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
80fae98
to
8b17860
Compare
This pull request was exported from Phabricator. Differential Revision: D65084925 |
Summary: This is preliminary PR that adds support for canonicalization of JSON strings. This initial PR only tackles canonicalization of json_parse. Another diff will handle CAST( _ as JSON) . Canonicalization is required since currently Velox just treats JSON as varchars thus equivalent JSON but having different backing varchar's are treated as separate JSON's which is incorrect and contrary to behavior shown by Presto. Differential Revision: D65084925 Pulled By: kgpai
@@ -84,38 +162,71 @@ class JsonParseFunction : public exec::VectorFunction { | |||
auto value = arg->as<ConstantVector<StringView>>()->valueAt(0); | |||
paddedInput_.resize(value.size() + simdjson::SIMDJSON_PADDING); | |||
memcpy(paddedInput_.data(), value.data(), value.size()); | |||
if (auto error = parse(value.size())) { | |||
auto escapeSize = escapedStringSize(value.data(), value.size(), true); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wonder is it safe to pass the JSON directly into escapedStringSize
like this, since it is meant for string node value in JSON, not the JSON itself. It might work though since outside string nodes everything should be ASCII.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The escapeSize here is worst case size , Its hard to estimate the size only for the string nodes and then add the ascii parts ; I see no reason why this would be less than the size required, so I think this should be fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah we can keep this for now and if weird bug showing up this is a suspicious point. Maybe leave a comment about the situation.
Summary: This is preliminary PR that adds support for canonicalization of JSON strings. This initial PR only tackles canonicalization of json_parse. Another diff will handle CAST( _ as JSON) . Canonicalization is required since currently Velox just treats JSON as varchars thus equivalent JSON but having different backing varchar's are treated as separate JSON's which is incorrect and contrary to behavior shown by Presto. Reviewed By: Yuhta, gggrace14 Differential Revision: D65084925 Pulled By: kgpai
6d0d2b5
to
fd1ddd0
Compare
This pull request was exported from Phabricator. Differential Revision: D65084925 |
Summary: This is preliminary PR that adds support for canonicalization of JSON strings. This initial PR only tackles canonicalization of json_parse. Another diff will handle CAST( _ as JSON) . Canonicalization is required since currently Velox just treats JSON as varchars thus equivalent JSON but having different backing varchar's are treated as separate JSON's which is incorrect and contrary to behavior shown by Presto. Reviewed By: Yuhta, gggrace14 Differential Revision: D65084925 Pulled By: kgpai
fd1ddd0
to
85f031c
Compare
This pull request was exported from Phabricator. Differential Revision: D65084925 |
…ator#11284) Summary: This is preliminary PR that adds support for canonicalization of JSON strings. This initial PR only tackles canonicalization of json_parse. Another diff will handle CAST( _ as JSON) . Canonicalization is required since currently Velox just treats JSON as varchars thus equivalent JSON but having different backing varchar's are treated as separate JSON's which is incorrect and contrary to behavior shown by Presto. Reviewed By: Yuhta, gggrace14 Differential Revision: D65084925 Pulled By: kgpai
85f031c
to
f091c48
Compare
This pull request was exported from Phabricator. Differential Revision: D65084925 |
…ator#11284) Summary: This is preliminary PR that adds support for canonicalization of JSON strings. This initial PR only tackles canonicalization of json_parse. Another diff will handle CAST( _ as JSON) . Canonicalization is required since currently Velox just treats JSON as varchars thus equivalent JSON but having different backing varchar's are treated as separate JSON's which is incorrect and contrary to behavior shown by Presto. Reviewed By: Yuhta, gggrace14 Differential Revision: D65084925 Pulled By: kgpai
f091c48
to
a9c2e21
Compare
This pull request was exported from Phabricator. Differential Revision: D65084925 |
…tor#11284) Summary: This is preliminary PR that adds support for canonicalization of JSON strings. This initial PR only tackles canonicalization of json_parse. Another diff will handle CAST( _ as JSON) . Canonicalization is required since currently Velox just treats JSON as varchars thus equivalent JSON but having different backing varchar's are treated as separate JSON's which is incorrect and contrary to behavior shown by Presto. Reviewed By: Yuhta, gggrace14 Differential Revision: D65084925 Pulled By: kgpai
a9c2e21
to
2ce62a9
Compare
This pull request was exported from Phabricator. Differential Revision: D65084925 |
…tor#11284) Summary: This is preliminary PR that adds support for canonicalization of JSON strings. This initial PR only tackles canonicalization of json_parse. Another diff will handle CAST( _ as JSON) . Canonicalization is required since currently Velox just treats JSON as varchars thus equivalent JSON but having different backing varchar's are treated as separate JSON's which is incorrect and contrary to behavior shown by Presto. Reviewed By: Yuhta, gggrace14 Differential Revision: D65084925 Pulled By: kgpai
2ce62a9
to
ce082db
Compare
This pull request was exported from Phabricator. Differential Revision: D65084925 |
…tor#11284) Summary: This is preliminary PR that adds support for canonicalization of JSON strings. This initial PR only tackles canonicalization of json_parse. Another diff will handle CAST( _ as JSON) . Canonicalization is required since currently Velox just treats JSON as varchars thus equivalent JSON but having different backing varchar's are treated as separate JSON's which is incorrect and contrary to behavior shown by Presto. Reviewed By: Yuhta, gggrace14 Differential Revision: D65084925 Pulled By: kgpai
ce082db
to
95d8911
Compare
This pull request was exported from Phabricator. Differential Revision: D65084925 |
Conbench analyzed the 1 benchmark run on commit There were no benchmark performance regressions. 🎉 The full Conbench report has more details. |
This is preliminary PR that adds support for canonicalization of JSON strings. This initial PR only tackles canonicalization of json_parse. Another diff will handle CAST( _ as JSON) . Canonicalization is required since currently Velox just treats JSON as varchars thus equivalent JSON but having different backing varchar's are treated as separate JSON's which is incorrect and contrary to behavior shown by Presto.