ParquetRecordBatchStream Should Return the Projected Schema #4023
Describe the bug
Let's say you're trying to async read a Parquet file on S3, and that file has metadata (like "created by"). There's an inconsistency:
- ParquetRecordBatchStream::schema will produce a Schema object that includes that metadata.
- But ParquetRecordBatchStream will yield RecordBatches that have schema objects that don't have the metadata.
The problem is that if you create an ArrowWriter using the first schema and then try to write batches from the stream to it, the schemas won't match (the writer is expecting metadata but each batch has a schema without metadata).
Expected behavior
I'd expect that either:
- ParquetRecordBatchStream::schema produces a Schema without metadata, or
- RecordBatches produced by ParquetRecordBatchStream have the exact same schema as what ::schema returns, or
- ArrowWriter should tolerate its supplied schema differing from the batch schemas provided to write() in metadata
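To make the mismatch concrete, here is a minimal sketch that models it with the standard library only. ModelSchema, strict_match, lenient_match, and example_schemas are hypothetical names invented for illustration; they are not types or functions from the arrow or parquet crates. The sketch only assumes, as described above, that the writer's check compares metadata along with the fields.

```rust
use std::collections::HashMap;

// Simplified stand-in for an Arrow schema: field names plus key/value metadata.
// This is NOT the arrow crate's Schema type, only a model of the comparison.
#[derive(Debug, Clone, PartialEq)]
struct ModelSchema {
    fields: Vec<String>,
    metadata: HashMap<String, String>,
}

impl ModelSchema {
    // Returns a copy of the schema with the metadata dropped.
    fn without_metadata(&self) -> ModelSchema {
        ModelSchema {
            fields: self.fields.clone(),
            metadata: HashMap::new(),
        }
    }
}

// Strict check: fields and metadata must both match.
fn strict_match(writer: &ModelSchema, batch: &ModelSchema) -> bool {
    writer == batch
}

// Lenient check: only the fields must match; metadata is ignored.
fn lenient_match(writer: &ModelSchema, batch: &ModelSchema) -> bool {
    writer.fields == batch.fields
}

// Builds the two schemas from the report: the stream-level schema carries
// metadata, while the schema attached to each batch does not.
fn example_schemas() -> (ModelSchema, ModelSchema) {
    let mut metadata = HashMap::new();
    metadata.insert("created_by".to_string(), "example-writer".to_string());
    let stream_schema = ModelSchema {
        fields: vec!["id".to_string(), "name".to_string()],
        metadata,
    };
    let batch_schema = stream_schema.without_metadata();
    (stream_schema, batch_schema)
}

fn main() {
    let (stream_schema, batch_schema) = example_schemas();
    println!("strict: {}", strict_match(&stream_schema, &batch_schema));
    println!("lenient: {}", lenient_match(&stream_schema, &batch_schema));
}
```

In this model, the first option corresponds to calling without_metadata() on the stream schema before handing it to the writer, and the third option corresponds to the writer using something like lenient_match instead of strict_match.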