Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-41764: [Parquet][C++] Support future logical types in the Parquet reader #41765

Draft
wants to merge 14 commits into
base: main
Choose a base branch
from

Conversation

paleolimbot
Copy link
Member

@paleolimbot paleolimbot commented May 21, 2024

Rationale for this change

If new logical types are added to Parquet, how should they be handled? (Or is it easy enough to just support them as soon as they are added to the spec?)

What changes are included in this PR?

Experimental ones!

Are these changes tested?

Absolutely not!

Are there any user-facing changes?

Not yet!

@paleolimbot paleolimbot requested a review from wgtmac as a code owner May 21, 2024 20:07
@paleolimbot paleolimbot marked this pull request as draft May 21, 2024 20:08
Copy link

⚠️ GitHub issue #41764 has been automatically assigned in GitHub to PR creator.

@@ -464,7 +464,10 @@ std::shared_ptr<const LogicalType> LogicalType::FromThrift(
} else if (type.__isset.FLOAT16) {
return Float16LogicalType::Make();
} else {
throw ParquetException("Metadata contains Thrift LogicalType that is not recognized");
// Or provide some mechanism to optionally error here?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that we might need an option to make it throw or ignore for unknown logical type.

BTW, it seems that parquet-java also fails on unknown type: https://github.com/apache/parquet-java/blob/6809a18597e56709696af58313343aeea93f0b99/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java#L1142-L1179

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we regard it as physical type with unknown logical type

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

UNKNOWN logical type means all values are null, IIUC.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm...it looks like there was already a concept for this (that was used for unknown legacy converted types): the UndefinedLogicalType(). I'm not sure how to test this, though!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also prefer UndefinedLogicalType. I checked arrow-rs that it also raise error in this condition

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any ideas on how to craft a test file with an unknown type?

Copy link
Member

@wgtmac wgtmac May 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ToParquet(static_cast<parquet::schema::GroupNode*>(schema_->schema_root().get()),

You might need to hack the thrift object directly from metadata_->schema above by setting an unknown logical value. This means we have to create a parquet file any way.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we don't throw here at the very least we need to ignore statistics for the logical type (e.g. I believe sort order for float16 doesn't correspond to signed comparison)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we don't throw here at the very least we need to ignore statistics for the logical type

I did some investigation and I believe statistics are already ignored for UndefinedLogicalType:

SortOrder::type GetSortOrder(const std::shared_ptr<const LogicalType>& logical_type,
Type::type primitive) {
SortOrder::type o = SortOrder::UNKNOWN;
if (logical_type && logical_type->is_valid()) {
o = (logical_type->is_none() ? DefaultSortOrder(primitive)
: logical_type->sort_order());
}
return o;
}

inline bool is_stats_set() const {
DCHECK(writer_version_ != nullptr);
// If the column statistics don't exist or column sort order is unknown
// we cannot use the column stats
if (!column_metadata_->__isset.statistics ||
descr_->sort_order() == SortOrder::UNKNOWN) {
return false;
}
if (possible_stats_ == nullptr) {
possible_stats_ = MakeColumnStats(*column_metadata_, descr_);
}
EncodedStatistics encodedStatistics = possible_stats_->Encode();
return writer_version_->HasCorrectStatistics(type(), encodedStatistics,
descr_->sort_order());
}
inline std::shared_ptr<Statistics> statistics() const {
return is_stats_set() ? possible_stats_ : nullptr;
}

(i.e., ColumnChunkMetadata::statistics() will return nullptr if logical_type->is_valid() returns false)

I still have to figure out how to write a full-on file so that this is tested properly! (I figured out how to test the schema conversion but not how to check a full-on file yet).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still have to figure out how to write a full-on file so that this is tested properly! (I figured out how to test the schema conversion but not how to check a full-on file yet).

Probably hacking the generated thrift code with a very high field ID and writing out the file and then trying to read it back in could work?

@github-actions github-actions bot added awaiting changes Awaiting changes awaiting change review Awaiting change review and removed awaiting committer review Awaiting committer review awaiting changes Awaiting changes awaiting change review Awaiting change review labels May 22, 2024
@github-actions github-actions bot added awaiting change review Awaiting change review awaiting changes Awaiting changes and removed awaiting changes Awaiting changes awaiting change review Awaiting change review labels Jun 6, 2024
@paleolimbot paleolimbot force-pushed the parquet-unknown-logical-type-fallback branch from a2bbbb7 to 19cedac Compare June 21, 2024 14:59
@github-actions github-actions bot added Component: Python awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Jun 21, 2024
@mapleFU
Copy link
Member

mapleFU commented Jul 3, 2024

Just ping me if this is ready

@paleolimbot
Copy link
Member Author

Just ping me if this is ready

Thanks! I think the missing piece is an actual end-to-end test with a real Parquet file (and I haven't had time to circle back to the suggestion of hacking the generated thrift for an actual written-to-disk Parquet file).

@paleolimbot paleolimbot force-pushed the parquet-unknown-logical-type-fallback branch from c3ff40c to 5920261 Compare July 15, 2024 14:39
@@ -239,6 +239,10 @@ class ParquetFile:
resolution (e.g. 'ms'). Setting to None is equivalent to 'ns'
and therefore INT96 timestamps will be inferred as timestamps
in nanoseconds.
convert_unknown_logical_types : bool, default false
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

default False?

@pitrou
Copy link
Member

pitrou commented Nov 18, 2024

I'm just noticing this PR. Is there a reason to make this behavior optional instead of standard?

@paleolimbot
Copy link
Member Author

I'm just noticing this PR. Is there a reason to make this behavior optional instead of standard?

I don't have an opinion here, although I think that the arrow.opaque (vendor: Parquet, type: Unknown Logical type with ID XXX) extension type makes it easier to make the case for having this be the default. Without that, there is a possibly good argument for erroring to avoid silently discarding logical type information.

I ran out of time figuring out how to generate a Parquet file I could use to test the Python bindings here, but I'm happy to resurrect this PR and give it another shot if there is interest.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants