feat: Add Spark from_json function #11709
Thanks. Added some initial comments.
JSON input varies widely. How can we make sure the implementation matches Spark in every case? Maybe we should search for from_json usages and tests in Spark and verify that the results agree.
The current implementation supports only Spark's default behavior, and we should fall back to Spark's implementation when an unsupported case arises. These cases include non-empty user-provided options, schemas that contain unsupported types, and schemas that include a column with the same name as [...]. The only existing unit tests in Spark related to this function are found in [...].
Thanks.
Thanks for the update! Added some comments.
Thanks for iterating!
Looks basically good.
Are nested complex types supported? E.g., an array whose element is an array, struct, or map. It would be better to clarify this in the documentation and add tests if they are missing. Thanks!
Thanks! Added some nits.
Some trivial comments. Thanks!
Thanks!
{std::nullopt, std::nullopt, std::nullopt});
auto input =
    makeFlatVector<std::string>({R"("a": 1})", R"({a: 1})", R"({"a" 1})"});
testFromJson(input, makeRowVector({"a"}, {expected}));
Is it the case that if JSON doesn't match the type specified in the "cast", the result is null? Would you document that?
Test name says "invalidJson" suggesting that it works with malformed JSON, but the JSON in the test seems perfectly valid.
The test cases actually contain invalid JSON. Null is returned because the input string is unparsable. The first case is missing the opening {, the second case has a key without double quotes, and the third case is missing a :. The valid JSON string is {"a": 1}.
@zhli1142015 Got it. Thank you for clarifying. What happens if JSON is valid but doesn't match the type? E.g. JSON is an array, but type is a map or a struct?
In this case, null is returned; the test case structEmptyArray covers this.
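To make the shape-mismatch rule concrete, here is a minimal sketch of a root-level check. It is not the code from this PR: `RootType`, `JsonKind`, and `rootAccepts` are hypothetical names, and the handling of MAP roots and scalar input is an assumption based on the answer above. The object-for-array exception follows the PR description (an array schema accepts a JSON object only at the root and only when its element type is a ROW).

```cpp
// Hypothetical illustration only; none of these names come from Velox.
enum class RootType { kRow, kArray, kMap };
enum class JsonKind { kObject, kArray, kScalar };

// Returns true if parsing may proceed for this root type, false if the
// whole result should be NULL because the JSON shape does not match.
bool rootAccepts(RootType root, bool arrayElementIsRow, JsonKind json) {
  switch (root) {
    case RootType::kRow:
    case RootType::kMap:
      // A JSON array (or scalar) given for a struct or map schema yields
      // NULL. Treating MAP the same way as ROW is an assumption.
      return json == JsonKind::kObject;
    case RootType::kArray:
      if (json == JsonKind::kArray) {
        return true;
      }
      // A JSON object is accepted for a root array only when the array's
      // element type is a ROW.
      return json == JsonKind::kObject && arrayElementIsRow;
  }
  return false;
}
```

With this sketch, `rootAccepts(RootType::kRow, false, JsonKind::kArray)` is false, which corresponds to the structEmptyArray case returning null.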
@mbasmanova has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
@mbasmanova merged this pull request in 48e2ed7.
Why the JSON parsing logic is reimplemented instead of reusing CAST(JSON):
On failure, from_json(JSON) returns NULL. For instance, parsing {"a 1} would
result in {NULL}.
Only ROW, ARRAY, and MAP types are allowed as root types.
Only true and false are considered valid boolean values. Numeric values or
strings will result in NULL.
Only integral values are valid for integral types. Floating-point values and
strings will produce NULL.
All numeric values are valid for float/double types. However, for strings, only
specific values like "NaN" or "INF" are valid.
Spark allows a JSON object as input for an array schema only if the array is
the root type and its child type is a ROW.
Keys in a MAP can only be of VARCHAR type. For example, parsing {"3": 3}
results in {"3": 3} instead of {3: 3}.
Spark supports partial output mode. However, it does not allow an input JSON
array when parsing a ROW.