GH-44214: [C++] JsonExtensionType equality check ignores storage type #44215

Merged: 16 commits merged on Oct 8, 2024

Member

@rok rok commented Sep 24, 2024

Rationale for this change

As noted in #13901 (review):

bool JsonExtensionType::ExtensionEquals(const ExtensionType& other) const {
  return other.extension_name() == this->extension_name();
}

This equality check does not take into account the storage type, but only the name.
As a consequence, a JsonExtensionType with a different storage type (for example, string) is seen as equal to JsonExtensionType<large_string>.

What changes are included in this PR?

This change adds a storage type comparison to the JsonExtensionType equality check.

This also fixes a storage type check in JsonExtensionType::Make.
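
The difference between the two equality checks can be illustrated with a small standalone model. This is only a sketch: `StorageId`, `ExtensionTypeModel`, `MakeJson`, `NameOnlyEquals` and `NameAndStorageEquals` are hypothetical names invented for illustration, not Arrow APIs, and the set of supported storage types is an assumption.

```cpp
#include <optional>
#include <string>

// Hypothetical stand-ins for Arrow storage type ids (not the real arrow::Type enum).
enum class StorageId { kUtf8, kLargeUtf8, kUtf8View, kInt32 };

// Simplified model of an extension type: a name plus a storage type.
struct ExtensionTypeModel {
  std::string extension_name;
  StorageId storage;
};

// Storage types this sketch accepts for JSON (an assumed set, for illustration).
inline bool IsSupportedStorage(StorageId id) {
  return id == StorageId::kUtf8 || id == StorageId::kLargeUtf8 ||
         id == StorageId::kUtf8View;
}

// Plays the role of a Make-style factory: unsupported storage is rejected.
inline std::optional<ExtensionTypeModel> MakeJson(StorageId storage) {
  if (!IsSupportedStorage(storage)) return std::nullopt;
  return ExtensionTypeModel{"arrow.json", storage};
}

// Old behaviour: only the extension name is compared.
inline bool NameOnlyEquals(const ExtensionTypeModel& a,
                           const ExtensionTypeModel& b) {
  return a.extension_name == b.extension_name;
}

// Behaviour described in this PR: the storage type must match as well.
inline bool NameAndStorageEquals(const ExtensionTypeModel& a,
                                 const ExtensionTypeModel& b) {
  return a.extension_name == b.extension_name && a.storage == b.storage;
}
```

In this model, a JSON type over `kUtf8` and one over `kLargeUtf8` compare equal under `NameOnlyEquals` but not under `NameAndStorageEquals`.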

Are these changes tested?

Yes.

Are there any user-facing changes?

No.

@rok rok added this to the 18.0.0 milestone Sep 24, 2024
@rok rok linked an issue Sep 24, 2024 that may be closed by this pull request

⚠️ GitHub issue #44214 has been automatically assigned in GitHub to PR creator.

@@ -763,7 +763,7 @@ TEST_F(TestConvertParquetSchema, ParquetSchemaArrowExtensions) {
   props.set_arrow_extensions_enabled(true);
   auto arrow_schema = ::arrow::schema(
       {::arrow::field("json_1", ::arrow::extension::json(), true),
-       ::arrow::field("json_2", ::arrow::extension::json(::arrow::large_utf8()),
+       ::arrow::field("json_2", ::arrow::extension::json(::arrow::utf8()),
Member

Instead of making the test easier, can you keep large_utf8 and ensure it passes?

Member Author

Can we reconstruct the storage type if it is not stored? In this case we know the data is of JSON logical type, but we don't know what its storage type was at write time. Am I missing something?

Member Author

(There was a test where the Arrow schema was available but wasn't used at read time. I've addressed that.)

Member

> Can we reconstruct the storage type if it is not stored? In this case we know the data is of JSON logical type, but we don't know what its storage type was at write time. Am I missing something?

No, I'm asking you to write the test so that it reflects that.

Member Author

Something like this? e522efb

Member

Yes.

Member Author

Added another assertion to justify changing arrow_schema https://github.com/apache/arrow/pull/44215/files/94006f92085377fc4b72c2a2da844e36fb4d86d3..77c964b04972fc08f04957190866fb475b453c3c.
Let me know if something else should be changed.
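
The point discussed in this thread (the storage type cannot be recovered unless the Arrow schema was stored) can be sketched with a toy model. Everything here is hypothetical: `Storage`, `JsonColumnOnDisk`, `WriteJsonColumn` and `ReadJsonStorage` are illustrative names, and the default of utf8 follows the behaviour described in this conversation rather than a checked reading of the Parquet reader.

```cpp
#include <optional>

// Hypothetical storage choices for a JSON column.
enum class Storage { kUtf8, kLargeUtf8 };

// Toy model of a column annotated with the JSON logical type.
// The Arrow storage type is only known if the Arrow schema was stored too.
struct JsonColumnOnDisk {
  std::optional<Storage> arrow_storage_hint;
};

// Writing keeps the storage type only when the Arrow schema is stored.
inline JsonColumnOnDisk WriteJsonColumn(Storage written, bool store_arrow_schema) {
  JsonColumnOnDisk column;
  if (store_arrow_schema) column.arrow_storage_hint = written;
  return column;
}

// Reading falls back to utf8 storage when no hint is available (assumption).
inline Storage ReadJsonStorage(const JsonColumnOnDisk& column) {
  return column.arrow_storage_hint.value_or(Storage::kUtf8);
}
```

Under these assumptions, a column written as `kLargeUtf8` without the Arrow schema reads back as `kUtf8`, which is the mismatch the test is meant to reflect.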

@rok rok force-pushed the json_support_followup branch from 0c265ab to 1ba3ba7 Compare September 24, 2024 20:10
@rok rok requested a review from pitrou September 24, 2024 20:33
@rok rok force-pushed the json_support_followup branch from a415aaf to 6dff127 Compare September 26, 2024 14:14
@rok rok removed this from the 18.0.0 milestone Oct 2, 2024
Member Author

rok commented Oct 2, 2024

@pitrou do you think you'd have time to review this for 18.0.0 (around the middle of next week)?

@@ -1017,7 +1016,10 @@ Result<bool> ApplyOriginalMetadata(const Field& origin_field, SchemaField* infer

   // Restore extension type, if the storage type is the same as inferred
   // from the Parquet type
-  if (ex_type.storage_type()->Equals(*inferred->field->type())) {
+  if (ex_type.storage_type()->Equals(*inferred->field->type()) ||
Member

So suddenly this condition only covers arrow.json, while it used to cover all extension types. Is this deliberate?

Member Author

@rok rok Oct 2, 2024

It was, but in light of your comment I'd say it's not a good idea. Removed.

Member

Hmm, but why are we keeping the logical || condition here?

Member Author

Sorry, I only removed the IsSupportedStorageType part. The reason is that if we have an extension type with large_utf8 storage, the test below will fail, because (if I remember correctly) we take the Parquet JSON logical type to mean the storage type will be utf8. To fix that, this proposes to override with the inferred type whenever the extension is arrow.json.

{
  // Parquet file contains Arrow schema. Extensions are enabled.
  // json_1 and json_2 will be interpreted as json(utf8()) and json(large_utf8()).
  ArrowReaderProperties props;
  props.set_arrow_extensions_enabled(true);
  std::shared_ptr<KeyValueMetadata> field_metadata =
      ::arrow::key_value_metadata({"foo", "bar"}, {"biz", "baz"});
  auto arrow_schema = ::arrow::schema(
      {::arrow::field("json_1", ::arrow::extension::json(), true, field_metadata),
       ::arrow::field("json_2", ::arrow::extension::json(::arrow::large_utf8()),
                      true)});
  std::shared_ptr<KeyValueMetadata> metadata;
  ASSERT_OK(ArrowSchemaToParquetMetadata(arrow_schema, metadata));
  ASSERT_OK(ConvertSchema(parquet_fields, metadata, props));
  CheckFlatSchema(arrow_schema, true /* check_metadata */);
}

Member

If that's the case you want to check for, wouldn't it be better to have a separate branch for it?
Something like:

    } else if (inferred_type->id() == ::arrow::Type::EXTENSION &&
               ex_type.extension_name() == std::string("arrow.json") &&
               ::arrow::extension::JsonExtensionType::IsSupportedStorageType(
                   inferred_type->storage_id())) {
      // Potential schema mismatch.
      //
      // Arrow extensions are ENABLED in Parquet.
      // origin_type is arrow::extension::json(...)
      // inferred_type is arrow::extension::json(arrow::utf8())

Member Author

That would make it more idiomatic indeed. Changed.
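
The branch structure discussed above can be sketched as a standalone decision function. This is a simplified model and not the actual ApplyOriginalMetadata code: `TypeId`, `TypeModel` and `ResolveFieldType` are hypothetical names, and the choice to keep the origin type in the arrow.json branch is an assumption based on the test quoted earlier in this thread.

```cpp
#include <string>

// Hypothetical plain/storage type ids for this sketch.
enum class TypeId { kUtf8, kLargeUtf8, kBinary };

// A plain type, or an extension type wrapping a storage type.
struct TypeModel {
  std::string extension_name;  // empty means "plain type, not an extension"
  TypeId storage;              // the plain type, or the extension's storage type
  bool is_extension() const { return !extension_name.empty(); }
};

// Decide which type the reader exposes, given the type recorded in the
// stored Arrow schema (origin) and the type inferred from Parquet (inferred).
inline TypeModel ResolveFieldType(const TypeModel& origin, const TypeModel& inferred) {
  if (!origin.is_extension()) return inferred;

  if (!inferred.is_extension() && origin.storage == inferred.storage) {
    // General case for any extension type: the storage type matches what
    // was inferred from Parquet, so the extension type is restored.
    return origin;
  }
  if (inferred.is_extension() && origin.extension_name == "arrow.json" &&
      inferred.extension_name == "arrow.json") {
    // Separate branch: extensions are enabled and both sides are JSON.
    // The inferred storage may differ from the original one, so the
    // original type is kept (assumption made for this sketch).
    return origin;
  }
  // Anything else in this sketch keeps the inferred type.
  return inferred;
}
```

Keeping the JSON case in its own branch leaves the first condition unchanged for every other extension type, which is the concern raised in the review.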

Member Author

rok commented Oct 2, 2024

Thanks for the review @pitrou. I addressed the comments.

@rok rok requested a review from pitrou October 2, 2024 17:02
Member Author

rok commented Oct 2, 2024

Sorry for the slow turnaround @pitrou. I've pushed a change.

Comment on lines 1013 to 1014
::arrow::extension::JsonExtensionType::IsSupportedStorageType(
inferred_type->storage_id())) {
Member

This condition IsSupportedStorageType(inferred_type->storage_id()) should always be true, right? We cannot infer a json type with an incorrect storage type.

Member Author

Indeed. Removing this check.

@rok rok requested a review from pitrou October 8, 2024 09:44
Member

@pitrou pitrou left a comment

Thanks a lot @rok! This looks good now.

@pitrou pitrou merged commit 64891d1 into apache:main Oct 8, 2024
41 checks passed
@pitrou pitrou removed the awaiting changes Awaiting changes label Oct 8, 2024

After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit 64891d1.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 4 possible false positives for unstable benchmarks that are known to sometimes produce them.
