Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Add Spark from_json function #11709

Closed
wants to merge 9 commits into from

Conversation

zhli1142015
Copy link
Contributor

@zhli1142015 zhli1142015 commented Dec 2, 2024

Why I Need to Reimplement JSON Parsing Logic Instead of Using CAST(JSON):

  • Failure Handling:
    On failure, from_json(JSON) returns NULL. For instance, parsing {"a 1} would
    result in {NULL}.
  • Type Restrictions:
    Only ROW, ARRAY, and MAP types are allowed as root types.
  • Boolean Handling:
    Only true and false are considered valid boolean values. Numeric values or
    strings will result in NULL.
  • Integral Type Handling:
    Only integral values are valid for integral types. Floating-point values and
    strings will produce NULL.
  • Float/Double Handling:
    All numeric values are valid for float/double types. However, for strings, only
    specific values like "NaN" or "INF" are valid.
  • Array Handling:
    Spark allows a JSON object as input for an array schema only if the array is
    the root type and its child type is a ROW.
  • Map Handling:
    Keys in a MAP can only be of VARCHAR type. For example, parsing {"3": 3}
    results in {"3": 3} instead of {3: 3}.
  • Row Handling:
    Spark supports partial output mode. However, it does not allow an input JSON
    array when parsing a ROW.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Dec 2, 2024
Copy link

netlify bot commented Dec 2, 2024

Deploy Preview for meta-velox canceled.

Name Link
🔨 Latest commit b1ad181
🔍 Latest deploy log https://app.netlify.com/sites/meta-velox/deploys/67b99dbb321dd00008d98822

@zhli1142015
Copy link
Contributor Author

cc @rui-mo and @PHILO-HE , thanks.

Copy link
Collaborator

@rui-mo rui-mo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. Added some initial comments.

Copy link
Contributor

@jinchengchenghh jinchengchenghh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The json input is variable, how can we make sure all the implement matches to Spark, Maybe we need to search from_json in Spark and make sure the result is correct.

@zhli1142015
Copy link
Contributor Author

The json input is variable, how can we make sure all the implement matches to Spark, Maybe we need to search from_json in Spark and make sure the result is correct.

The current implementation supports only Spark's default behavior, and we should fall back to Spark's implementation when specific unsupported cases arise. These include situations where user-provided options are non-empty, schemas contain unsupported types, schemas include a column with the same name as spark.sql.columnNameOfCorruptRecord, or the configuration spark.sql.json.enablePartialResults is disabled.

The only existing unit tests in Spark related to this function are found in JsonExpressionsSuite and JsonFunctionsSuite. I have verified that these tests pass and added missing tests to ensure the current implementation aligns with Spark's behavior. For further details, please refer to the new unit tests included in this PR.

@zhli1142015 zhli1142015 force-pushed the add_from_json branch 2 times, most recently from a284e49 to 2762885 Compare December 10, 2024 07:57
Copy link
Collaborator

@rui-mo rui-mo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks.

@rui-mo rui-mo changed the title feat: Add from_json Spark function feat: Add Spark from_json function Dec 10, 2024
@zhli1142015 zhli1142015 requested a review from rui-mo December 11, 2024 03:39
Copy link
Collaborator

@rui-mo rui-mo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the update! Added some comments.

@zhli1142015 zhli1142015 requested a review from rui-mo December 12, 2024 10:56
@zhli1142015 zhli1142015 force-pushed the add_from_json branch 3 times, most recently from c3696df to d5d801b Compare December 17, 2024 04:22
Copy link
Collaborator

@rui-mo rui-mo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for iterating!

Copy link
Contributor

@PHILO-HE PHILO-HE left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks basically good.
Are nested complex types supported? E.g., array element is an array, struct or map. It would be better to clarify this in document and add some tests if lacked. Thanks!

@zhli1142015
Copy link
Contributor Author

Kindly ping for this PR, @PHILO-HE and @rui-mo , thanks.

Copy link
Collaborator

@rui-mo rui-mo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Added some nits.

Copy link
Contributor

@PHILO-HE PHILO-HE left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some trivial comments. Thanks!

@zhli1142015
Copy link
Contributor Author

Thanks for the review, @rui-mo and @PHILO-HE. The CI failure is unrelated. Is it okay to merge?

Copy link
Collaborator

@rui-mo rui-mo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@rui-mo rui-mo added the ready-to-merge PR that have been reviewed and are ready for merging. PRs with this tag notify the Velox Meta oncall label Feb 21, 2025
{std::nullopt, std::nullopt, std::nullopt});
auto input =
makeFlatVector<std::string>({R"("a": 1})", R"({a: 1})", R"({"a" 1})"});
testFromJson(input, makeRowVector({"a"}, {expected}));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it the case that if JSON doesn't match the type specified in the "cast", the result is null? Would you document that?

Test name says "invalidJson" suggesting that it works with malformed JSON, but the JSON in the test seems perfectly valid.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test cases actually contain invalid JSON. Null is returned, because the input string is unparsable. The first case is missing the opening {, the second case has a key without double quotes, and the third case is missing a :. The valid JSON string is {"a": 1}.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zhli1142015 Got it. Thank you for clarifying. What happens if JSON is valid but doesn't match the type? E.g. JSON is an array, but type is a map or a struct?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For this case, null is returned, test case structEmptyArray covers this.

@facebook-github-bot
Copy link
Contributor

@mbasmanova has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot
Copy link
Contributor

@mbasmanova merged this pull request in 48e2ed7.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. Merged ready-to-merge PR that have been reviewed and are ready for merging. PRs with this tag notify the Velox Meta oncall
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants