-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-41764: [Parquet][C++] Support future logical types in the Parquet reader #41765
base: main
Are you sure you want to change the base?
GH-41764: [Parquet][C++] Support future logical types in the Parquet reader #41765
Conversation
|
cpp/src/parquet/types.cc
Outdated
@@ -464,7 +464,10 @@ std::shared_ptr<const LogicalType> LogicalType::FromThrift( | |||
} else if (type.__isset.FLOAT16) { | |||
return Float16LogicalType::Make(); | |||
} else { | |||
throw ParquetException("Metadata contains Thrift LogicalType that is not recognized"); | |||
// Or provide some mechanism to optionally error here? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that we might need an option to make it throw or ignore for unknown logical type.
BTW, it seems that parquet-java also fails on unknown type: https://github.com/apache/parquet-java/blob/6809a18597e56709696af58313343aeea93f0b99/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java#L1142-L1179
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we regard it as physical type with unknown logical type
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
UNKNOWN logical type means all values are null, IIUC.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm...it looks like there was already a concept for this (that was used for unknown legacy converted types): the UndefinedLogicalType()
. I'm not sure how to test this, though!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also prefer UndefinedLogicalType
. I checked arrow-rs that it also raise error in this condition
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any ideas on how to craft a test file with an unknown type?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
arrow/cpp/src/parquet/metadata.cc
Line 1920 in 13630c7
ToParquet(static_cast<parquet::schema::GroupNode*>(schema_->schema_root().get()), |
You might need to hack the thrift object directly from metadata_->schema
above by setting an unknown logical value. This means we have to create a parquet file any way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we don't throw here at the very least we need to ignore statistics for the logical type (e.g. I believe sort order for float16 doesn't correspond to signed comparison)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we don't throw here at the very least we need to ignore statistics for the logical type
I did some investigation and I believe statistics are already ignored for UndefinedLogicalType
:
arrow/cpp/src/parquet/types.cc
Lines 324 to 332 in 93712bf
SortOrder::type GetSortOrder(const std::shared_ptr<const LogicalType>& logical_type, | |
Type::type primitive) { | |
SortOrder::type o = SortOrder::UNKNOWN; | |
if (logical_type && logical_type->is_valid()) { | |
o = (logical_type->is_none() ? DefaultSortOrder(primitive) | |
: logical_type->sort_order()); | |
} | |
return o; | |
} |
arrow/cpp/src/parquet/metadata.cc
Lines 254 to 272 in 93712bf
inline bool is_stats_set() const { | |
DCHECK(writer_version_ != nullptr); | |
// If the column statistics don't exist or column sort order is unknown | |
// we cannot use the column stats | |
if (!column_metadata_->__isset.statistics || | |
descr_->sort_order() == SortOrder::UNKNOWN) { | |
return false; | |
} | |
if (possible_stats_ == nullptr) { | |
possible_stats_ = MakeColumnStats(*column_metadata_, descr_); | |
} | |
EncodedStatistics encodedStatistics = possible_stats_->Encode(); | |
return writer_version_->HasCorrectStatistics(type(), encodedStatistics, | |
descr_->sort_order()); | |
} | |
inline std::shared_ptr<Statistics> statistics() const { | |
return is_stats_set() ? possible_stats_ : nullptr; | |
} |
(i.e., ColumnChunkMetadata::statistics()
will return nullptr
if logical_type->is_valid()
returns false
)
I still have to figure out how to write a full-on file so that this is tested properly! (I figured out how to test the schema conversion but not how to check a full-on file yet).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still have to figure out how to write a full-on file so that this is tested properly! (I figured out how to test the schema conversion but not how to check a full-on file yet).
Probably hacking the generated thrift code with a very high field ID and writing out the file and then trying to read it back in could work?
a2bbbb7
to
19cedac
Compare
Just ping me if this is ready |
Thanks! I think the missing piece is an actual end-to-end test with a real Parquet file (and I haven't had time to circle back to the suggestion of hacking the generated thrift for an actual written-to-disk Parquet file). |
c3ff40c
to
5920261
Compare
@@ -239,6 +239,10 @@ class ParquetFile: | |||
resolution (e.g. 'ms'). Setting to None is equivalent to 'ns' | |||
and therefore INT96 timestamps will be inferred as timestamps | |||
in nanoseconds. | |||
convert_unknown_logical_types : bool, default false |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
default False
?
I'm just noticing this PR. Is there a reason to make this behavior optional instead of standard? |
I don't have an opinion here, although I think that the I ran out of time figuring out how to generate a Parquet file I could use to test the Python bindings here, but I'm happy to resurrect this PR and give it another shot if there is interest. |
Rationale for this change
If new logical types are added to Parquet, how should they be handled? (Or is it easy enough to just support them as soon as they are added to the spec?)
What changes are included in this PR?
Experimental ones!
Are these changes tested?
Absolutely not!
Are there any user-facing changes?
Not yet!