Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add optional column_order in JSON reader #17029

Open
wants to merge 26 commits into
base: branch-24.12
Choose a base branch
from

Conversation

karthikeyann
Copy link
Contributor

@karthikeyann karthikeyann commented Oct 9, 2024

Description

This PR adds optional column order to enforce column order in the output. This feature is required by spark from_json.

Optional column_order is added to schema_element, and it is validated during reader_option creation. The column order can be specified at root level and for any struct in any level.
• For root level, the dtypes should be schema_element with type STRUCT. (schema_element is added to variant dtypes)
• For nested level, column_order can be specified for any STRUCT type. (could be a map of schema_element , or schema_element)
If the column order is not specified, the order of columns is same as the order of columns that appear in json file.

Closes #17240 (metadata updated)
Closes #17091 (will return all nulls column if not present in input json)
Closes #17090 (fixed with new schema_element as dtype)
Closes #16799 (output columns are created from column_order if present)

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@karthikeyann karthikeyann self-assigned this Oct 9, 2024
@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Oct 9, 2024
cpp/include/cudf/io/json.hpp Outdated Show resolved Hide resolved
cpp/include/cudf/io/json.hpp Outdated Show resolved Hide resolved
cpp/include/cudf/io/json.hpp Outdated Show resolved Hide resolved
@karthikeyann
Copy link
Contributor Author

@ttnghia This PR is ready for testing. This will enforce the column order and also insert empty all-null columns if not present in the json data.

child_columns.emplace_back(std::move(all_null_column));
continue;
}
column_names.emplace_back(found_col->first);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Above we have if (prune_columns and found_col == std::end thus here if !prune_columns then we still have found_col == std::end.

Copy link
Contributor Author

@karthikeyann karthikeyann Oct 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added prune_columns condition to col_order decision. This case won't happen.

@karthikeyann karthikeyann added improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Oct 24, 2024
Comment on lines +605 to +609
bool const has_column_order =
options.is_enabled_prune_columns() and
std::holds_alternative<schema_element>(options.get_dtypes()) and
std::get<schema_element>(options.get_dtypes()).column_order.has_value() and
not std::get<schema_element>(options.get_dtypes()).column_order->empty();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we throw if options.is_enabled_prune_columns() is false but column order is given? The users may want to specify column order but they may forget to enable prune column.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is possible, the user wants to prune column, but maintain the order as in json file. that's why that throw was not added.

cpp/include/cudf/io/json.hpp Outdated Show resolved Hide resolved
cpp/src/io/json/parser_features.cpp Outdated Show resolved Hide resolved
cpp/src/io/json/parser_features.cpp Outdated Show resolved Hide resolved
cpp/src/io/json/parser_features.cpp Show resolved Hide resolved
cpp/src/io/json/nested_json.hpp Outdated Show resolved Hide resolved
cpp/src/io/json/nested_json.hpp Outdated Show resolved Hide resolved
cpp/src/io/json/parser_features.cpp Outdated Show resolved Hide resolved
Comment on lines 200 to 202
} else if (schema.type.id() == type_id::STRING) {
info.children.emplace_back("offsets");
info.children.emplace_back("chars");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this still correct? Now strings columns no longer have chars child.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Filed #17240.

@ttnghia ttnghia removed their assignment Nov 5, 2024
@karthikeyann karthikeyann changed the title add optional column_order to schema_element Add optional column_order to schema_element in JSON reader Nov 6, 2024
@karthikeyann karthikeyann changed the title Add optional column_order to schema_element in JSON reader Add optional column_order in JSON reader Nov 6, 2024
@karthikeyann karthikeyann added the 5 - Ready to Merge Testing and reviews complete, ready to merge label Nov 6, 2024
@vuule vuule self-requested a review November 6, 2024 17:13
Copy link
Contributor

@vuule vuule left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

partial review, nothing blocking

cpp/src/io/json/nested_json.hpp Show resolved Hide resolved
@@ -2827,7 +2828,7 @@ TEST_F(JsonReaderTest, JSONMixedTypeChildren)
EXPECT_EQ(result.metadata.schema_info[0].name, "Root");
ASSERT_EQ(result.metadata.schema_info[0].children.size(), 1);
EXPECT_EQ(result.metadata.schema_info[0].children[0].name, "Key");
ASSERT_EQ(result.metadata.schema_info[0].children[0].children.size(), 2);
ASSERT_EQ(result.metadata.schema_info[0].children[0].children.size(), 1);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why this change?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To address #17240
Strings column will not have chars as child anymore.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
5 - Ready to Merge Testing and reviews complete, ready to merge improvement Improvement / enhancement to an existing function libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change
Projects
Status: In Progress
Status: Burndown
5 participants