`ParquetRecordBatchStream` Should Return the Projected Schema #4023

msalib · 2023-04-05T20:01:23Z

Describe the bug

Let's say you're trying to async read a Parquet file on S3, and that file has metadata (like "created by"). There's an inconsistency:

- `ParquetRecordBatchStream::schema` will produce a `Schema` object that includes that metadata.
- But `ParquetRecordBatchStream` will yield `RecordBatch`es whose schema objects don't have the metadata.

The problem is that if you create an ArrowWriter using the first schema and then try to write batches from the stream to it, the schemas won't match (the writer is expecting metadata but each batch has a schema without metadata).
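To make the mismatch concrete, here is a minimal sketch that models it with a simplified stand-in type. It does not use the arrow or parquet crates: the `Schema` struct, the helper functions, and the `created_by` key are all hypothetical and only mirror the shape of the problem described above.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified stand-in for an Arrow schema:
/// a list of (name, data type) fields plus key/value metadata.
#[derive(Debug, Clone, PartialEq)]
struct Schema {
    fields: Vec<(String, String)>,
    metadata: HashMap<String, String>,
}

impl Schema {
    fn new(fields: &[(&str, &str)], metadata: &[(&str, &str)]) -> Self {
        Schema {
            fields: fields
                .iter()
                .map(|(n, t)| (n.to_string(), t.to_string()))
                .collect(),
            metadata: metadata
                .iter()
                .map(|(k, v)| (k.to_string(), v.to_string()))
                .collect(),
        }
    }

    /// Same fields, metadata dropped (models the schema attached to each batch).
    fn without_metadata(&self) -> Schema {
        Schema {
            fields: self.fields.clone(),
            metadata: HashMap::new(),
        }
    }
}

/// Strict check: fields and metadata must both match.
fn strictly_equal(a: &Schema, b: &Schema) -> bool {
    a == b
}

/// Relaxed check: only the fields are compared, metadata is ignored.
fn fields_compatible(a: &Schema, b: &Schema) -> bool {
    a.fields == b.fields
}

/// Builds the two schemas from the report: the one returned by the
/// stream (with metadata) and the one carried by each batch (without).
fn example_schemas() -> (Schema, Schema) {
    let stream_schema = Schema::new(
        &[("id", "Int64"), ("name", "Utf8")],
        &[("created_by", "some writer")], // hypothetical metadata entry
    );
    let batch_schema = stream_schema.without_metadata();
    (stream_schema, batch_schema)
}
```

With these definitions, `strictly_equal` rejects the pair returned by `example_schemas()` even though the fields are identical, while `fields_compatible` accepts it. A writer that performs the strict comparison would refuse every batch from the stream.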

Expected behavior

I'd expect that either:

- `ParquetRecordBatchStream::schema` produces a `Schema` without metadata, or
- the `RecordBatch`es produced by `ParquetRecordBatchStream` have the exact same schema as what `::schema` returns, or
- `ArrowWriter` should tolerate its supplied schema differing from the batch schemas provided to `write()` in metadata


tustvold · 2023-04-06T18:56:31Z

> ArrowWriter should tolerate its supplied schema differing from the batch schemas provided to write() in metadata

#4027 relaxes the check on ArrowWriter

> ParquetRecordBatchStream::schema produces a Schema without metadata, or

I think this would be consistent with ParquetRecordBatchReader, and would ensure it matches the schema of the RecordBatches that are returned. ParquetRecordBatchStreamBuilder::schema should return the schema with metadata.

tustvold · 2023-11-28T20:22:40Z

Updated based on #5135 (comment)

I agree this is a bug

msalib added the bug label Apr 5, 2023

tustvold mentioned this issue Apr 6, 2023

Only require compatible batch schema in ArrowWriter #4027

Merged

Jefffrey mentioned this issue Nov 28, 2023

Parquet: clear metadata and project fields of ParquetRecordBatchStream::schema #5135

Merged

tustvold changed the title ~~ParquetRecordBatchStream is inconsistent about schemas~~ ParquetRecordBatchStream Should Return the Projected Schema Nov 28, 2023

tustvold closed this as completed in #5135 Dec 5, 2023
