Parquet: clear metadata and project fields of ParquetRecordBatchStream::schema #5135

10 changes: 9 additions & 1 deletion parquet/src/arrow/async_reader/mod.rs
@@ -572,7 +572,15 @@ impl<T> std::fmt::Debug for ParquetRecordBatchStream<T> {
}

impl<T> ParquetRecordBatchStream<T> {
/// Returns the [`SchemaRef`] for this parquet file
/// Returns the [`SchemaRef`] for this parquet file.
///
/// Note that unlike its synchronous counterpart [`ParquetRecordBatchReader`], the [`SchemaRef`]
/// returned here will contain the original metadata, whereas [`ParquetRecordBatchReader`]
/// strips this metadata.
Contributor

Are you sure about this? They use the same logic. The difference is that the returned RecordBatches lack the metadata.

Contributor Author

Yes, see how the schema is created for ParquetRecordBatchReader here:

schema: Arc::new(Schema::new(levels.fields.clone())),

And here:

let schema = match array_reader.get_data_type() {
    ArrowType::Struct(ref fields) => Schema::new(fields.clone()),
    _ => unreachable!("Struct array reader's data type is not struct!"),
};

They only account for the fields, ignoring the metadata.
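
To illustrate the point, here is a minimal, std-only sketch of why rebuilding a schema from its fields alone loses the metadata. The `Schema` struct, `with_metadata`, and `rebuild_from_fields` below are hypothetical stand-ins written for this example, not the arrow-rs types; the only assumption carried over from the snippets above is that a constructor taking just the fields starts with empty metadata.

```rust
use std::collections::HashMap;
use std::sync::Arc;

// Hypothetical stand-in for a schema: a list of field names plus
// key/value metadata. This is NOT the arrow-rs `Schema` type.
#[derive(Debug, Clone, PartialEq)]
struct Schema {
    fields: Vec<String>,
    metadata: HashMap<String, String>,
}

impl Schema {
    // Builds a schema from fields only; metadata starts out empty.
    fn new(fields: Vec<String>) -> Self {
        Schema {
            fields,
            metadata: HashMap::new(),
        }
    }

    // Attaches metadata to an existing schema (illustrative helper).
    fn with_metadata(mut self, metadata: HashMap<String, String>) -> Self {
        self.metadata = metadata;
        self
    }
}

// Mirrors the pattern quoted above: only the fields are cloned into the
// new schema, so whatever metadata the original carried is not copied.
fn rebuild_from_fields(original: &Schema) -> Arc<Schema> {
    Arc::new(Schema::new(original.fields.clone()))
}

fn main() {
    let mut metadata = HashMap::new();
    metadata.insert("origin".to_string(), "file footer".to_string());

    let original =
        Schema::new(vec!["a".to_string(), "b".to_string()]).with_metadata(metadata);
    let rebuilt = rebuild_from_fields(&original);

    // The fields survive the rebuild, the metadata does not.
    assert_eq!(rebuilt.fields, original.fields);
    assert!(rebuilt.metadata.is_empty());
    assert!(!original.metadata.is_empty());
}
```

If the sync reader follows this pattern, its schema would compare unequal to one that still carries the file's metadata, even when the fields match.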

///
/// As such, this schema is the same as [`ParquetRecordBatchStreamBuilder::schema`] which also
/// contains the metadata, but differs from the [`RecordBatch`]es produced which have the metadata
/// in their [`RecordBatch::schema`] stripped.
    pub fn schema(&self) -> &SchemaRef {
        &self.schema
    }