-
Notifications
You must be signed in to change notification settings - Fork 901
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add optional column_order in JSON reader #17029
base: branch-24.12
Are you sure you want to change the base?
Add optional column_order in JSON reader #17029
Conversation
@ttnghia This PR is ready for testing. This will enforce the column order and also insert empty all-null columns if not present in the json data. |
cpp/src/io/json/json_column.cu
Outdated
child_columns.emplace_back(std::move(all_null_column)); | ||
continue; | ||
} | ||
column_names.emplace_back(found_col->first); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Above we have if (prune_columns and found_col == std::end
thus here if !prune_columns
then we still have found_col == std::end
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added prune_columns
condition to col_order
decision. This case won't happen.
bool const has_column_order = | ||
options.is_enabled_prune_columns() and | ||
std::holds_alternative<schema_element>(options.get_dtypes()) and | ||
std::get<schema_element>(options.get_dtypes()).column_order.has_value() and | ||
not std::get<schema_element>(options.get_dtypes()).column_order->empty(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we throw if options.is_enabled_prune_columns()
is false
but column order is given? The users may want to specify column order but they may forget to enable prune column.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is possible, the user wants to prune column, but maintain the order as in json file. that's why that throw was not added.
cpp/src/io/json/parser_features.cpp
Outdated
} else if (schema.type.id() == type_id::STRING) { | ||
info.children.emplace_back("offsets"); | ||
info.children.emplace_back("chars"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this still correct? Now strings columns no longer have chars
child.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Filed #17240.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
partial review, nothing blocking
@@ -2827,7 +2828,7 @@ TEST_F(JsonReaderTest, JSONMixedTypeChildren) | |||
EXPECT_EQ(result.metadata.schema_info[0].name, "Root"); | |||
ASSERT_EQ(result.metadata.schema_info[0].children.size(), 1); | |||
EXPECT_EQ(result.metadata.schema_info[0].children[0].name, "Key"); | |||
ASSERT_EQ(result.metadata.schema_info[0].children[0].children.size(), 2); | |||
ASSERT_EQ(result.metadata.schema_info[0].children[0].children.size(), 1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why this change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To address #17240
Strings column will not have chars
as child anymore.
Description
This PR adds optional column order to enforce column order in the output. This feature is required by spark from_json.
Optional
column_order
is added toschema_element
, and it is validated during reader_option creation. The column order can be specified at root level and for any struct in any level.• For root level, the dtypes should be schema_element with type STRUCT. (schema_element is added to variant dtypes)
• For nested level, column_order can be specified for any STRUCT type. (could be a map of schema_element , or schema_element)
If the column order is not specified, the order of columns is same as the order of columns that appear in json file.
Closes #17240 (metadata updated)
Closes #17091 (will return all nulls column if not present in input json)
Closes #17090 (fixed with new schema_element as dtype)
Closes #16799 (output columns are created from column_order if present)
Checklist