-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-44214: [C++] JsonExtensionType equality check ignores storage type #44215
Conversation
|
72bb26c
to
4d591e5
Compare
@@ -763,7 +763,7 @@ TEST_F(TestConvertParquetSchema, ParquetSchemaArrowExtensions) { | |||
props.set_arrow_extensions_enabled(true); | |||
auto arrow_schema = ::arrow::schema( | |||
{::arrow::field("json_1", ::arrow::extension::json(), true), | |||
::arrow::field("json_2", ::arrow::extension::json(::arrow::large_utf8()), | |||
::arrow::field("json_2", ::arrow::extension::json(::arrow::utf8()), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of making the test easier, can you keep large_utf8
and ensure it passes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we reconstruct storage type if it is not stored? In this case we know data is of JSON logical type, but we don't know what was it's storage type at write time. Am I missing something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(There was a test where arrow schema was available and wasn't used at read time. I've addressed that)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we reconstruct storage type if it is not stored? In this case we know data is of JSON logical type, but we don't know what was it's storage type at write time. Am I missing something?
No, I'm asking you to write the test so that it reflects that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Something like this? e522efb
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added another assertion to justify changing arrow_schema
https://github.com/apache/arrow/pull/44215/files/94006f92085377fc4b72c2a2da844e36fb4d86d3..77c964b04972fc08f04957190866fb475b453c3c.
Let me know if something else should be changed.
0c265ab
to
1ba3ba7
Compare
a415aaf
to
6dff127
Compare
@pitrou do you think you'd have time to review this for the 18.0.0 (~mid next week)? |
cpp/src/parquet/arrow/schema.cc
Outdated
@@ -1017,7 +1016,10 @@ Result<bool> ApplyOriginalMetadata(const Field& origin_field, SchemaField* infer | |||
|
|||
// Restore extension type, if the storage type is the same as inferred | |||
// from the Parquet type | |||
if (ex_type.storage_type()->Equals(*inferred->field->type())) { | |||
if (ex_type.storage_type()->Equals(*inferred->field->type()) || |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So suddently this condition only covers arrow.json
, while it used to cover all extension types. Is this deliberate?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It was, but in light of your comment I'd say it's not a good idea. Removed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, but why are we keeping the logical ||
condition here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I only removed the IsSupportedStorageType
part. The reason being if we have an extension type with large_utf8
storage the following test bellow will fail, because (if I remember correctly) we take parquet json logical type to mean storage type will be utf8
. To fix that this proposes to override with inferred type whenever arrow.json
.
arrow/cpp/src/parquet/arrow/arrow_schema_test.cc
Lines 804 to 820 in 6d156d1
{ | |
// Parquet file contains Arrow schema. Extensions are enabled. | |
// json_1 and json_2 will be interpreted as json(utf8()) and json(large_utf8()). | |
ArrowReaderProperties props; | |
props.set_arrow_extensions_enabled(true); | |
std::shared_ptr<KeyValueMetadata> field_metadata = | |
::arrow::key_value_metadata({"foo", "bar"}, {"biz", "baz"}); | |
auto arrow_schema = ::arrow::schema( | |
{::arrow::field("json_1", ::arrow::extension::json(), true, field_metadata), | |
::arrow::field("json_2", ::arrow::extension::json(::arrow::large_utf8()), | |
true)}); | |
std::shared_ptr<KeyValueMetadata> metadata; | |
ASSERT_OK(ArrowSchemaToParquetMetadata(arrow_schema, metadata)); | |
ASSERT_OK(ConvertSchema(parquet_fields, metadata, props)); | |
CheckFlatSchema(arrow_schema, true /* check_metadata */); | |
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If that's the case you want to check for, wouldn't it be better to have a separate branch for it?
Something like:
} else if (inferred_type->id() == ::arrow::Type::EXTENSION &&
ex_type.extension_name() == std::string("arrow.json") &&
::arrow::extension::JsonExtensionType::IsSupportedStorageType(inferred_type->storage_id())) {
// Potential schema mismatch.
//
// Arrow extensions are ENABLED in Parquet.
// origin_type is arrow::extension::json(...)
// inferred_type is arrow::extension::json(arrow::utf8())
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That would make it more idiomatic indeed. Changed.
Thanks for the review @pitrou. I addressed the comments. |
Sorry for the slow turnaround @pitrou. I've pushed a change. |
cpp/src/parquet/arrow/schema.cc
Outdated
::arrow::extension::JsonExtensionType::IsSupportedStorageType( | ||
inferred_type->storage_id())) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This condition IsSupportedStorageType(inferred_type->storage_id()
should always be true, right? We cannot infer a json type with an incorrect storage type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed. Removing this check.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot @rok ! This looks good now.
After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit 64891d1. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 4 possible false positives for unstable benchmarks that are known to sometimes produce them. |
Rationale for this change
As noted in #13901 (review):
What changes are included in this PR?
This change introduces storage equality check into
JsonExtensionType
equality check.This also fixes a storage type check in
JsonExtensionType::Make
.Are these changes tested?
Yes.
Are there any user-facing changes?
No.