ParquetRecordBatchStream Should Return the Projected Schema #4023

Closed
msalib opened this issue Apr 5, 2023 · 2 comments · Fixed by #5135

msalib commented Apr 5, 2023

Describe the bug

Let's say you're trying to async read a Parquet file on S3, and that file has metadata (like "created by"). There's an inconsistency:

  • ParquetRecordBatchStream::schema will produce a Schema object that includes that metadata.
  • But ParquetRecordBatchStream will yield RecordBatches whose schema objects don't have the metadata.

The problem is that if you create an ArrowWriter using the first schema and then try to write batches from the stream to it, the schemas won't match (the writer is expecting metadata but each batch has a schema without metadata).
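
To make the mismatch concrete, here is a minimal sketch using only the standard library. `ToySchema`, `stream_schema`, `batch_schema`, and `strict_write_check` are hypothetical stand-ins invented for illustration; they are not the arrow-rs types or the actual ArrowWriter check, and the column and metadata values are made up.

```rust
use std::collections::HashMap;

// Hypothetical stand-in for an Arrow schema: ordered (name, type) columns
// plus key/value metadata. Not the real arrow-rs type.
#[derive(Debug, Clone, PartialEq)]
struct ToySchema {
    fields: Vec<(String, String)>,
    metadata: HashMap<String, String>,
}

// Models what the stream's schema() is described as returning:
// the columns plus file-level metadata.
fn stream_schema() -> ToySchema {
    let mut metadata = HashMap::new();
    metadata.insert("created_by".to_string(), "example-writer".to_string());
    ToySchema {
        fields: vec![("id".to_string(), "Int64".to_string())],
        metadata,
    }
}

// Models the schema attached to each yielded batch:
// the same columns, but no metadata.
fn batch_schema() -> ToySchema {
    ToySchema {
        fields: stream_schema().fields,
        metadata: HashMap::new(),
    }
}

// A writer that insists on exact schema equality (assumed behavior).
fn strict_write_check(writer: &ToySchema, batch: &ToySchema) -> Result<(), String> {
    if writer == batch {
        Ok(())
    } else {
        Err("batch schema does not match writer schema".to_string())
    }
}
```

In this model, `strict_write_check(&stream_schema(), &batch_schema())` returns an error even though the columns are identical, because only the metadata differs.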

Expected behavior

I'd expect that either:

  • ParquetRecordBatchStream::schema produces a Schema without metadata, or
  • the RecordBatches produced by ParquetRecordBatchStream have the exact same schema as what ::schema returns, or
  • ArrowWriter should tolerate its supplied schema differing from the batch schemas provided to write() in metadata
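
The first and third options above can be sketched with a toy model. Everything here (`ToySchema`, `toy_schema`, `without_metadata`, `same_columns`) is a hypothetical helper written for illustration, assuming a schema is just a list of columns plus a metadata map; it is not how arrow-rs implements either behavior.

```rust
use std::collections::HashMap;

// Hypothetical stand-in for an Arrow schema; not the real arrow-rs type.
#[derive(Debug, Clone, PartialEq)]
struct ToySchema {
    fields: Vec<(String, String)>, // (column name, type name)
    metadata: HashMap<String, String>,
}

// Convenience constructor for the examples.
fn toy_schema(cols: &[(&str, &str)], meta: &[(&str, &str)]) -> ToySchema {
    ToySchema {
        fields: cols.iter().map(|(n, t)| (n.to_string(), t.to_string())).collect(),
        metadata: meta.iter().map(|(k, v)| (k.to_string(), v.to_string())).collect(),
    }
}

// First option: advertise a schema with the metadata dropped,
// so it equals the schema carried by each batch.
fn without_metadata(schema: &ToySchema) -> ToySchema {
    ToySchema {
        fields: schema.fields.clone(),
        metadata: HashMap::new(),
    }
}

// Third option: a writer-side check that compares columns only
// and ignores any difference in metadata.
fn same_columns(a: &ToySchema, b: &ToySchema) -> bool {
    a.fields == b.fields
}
```

With this model, a schema carrying `created_by` metadata and one without it pass `same_columns` as long as their columns agree, while a change in a column's type still fails the check.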

tustvold commented Apr 6, 2023

> ArrowWriter should tolerate its supplied schema differing from the batch schemas provided to write() in metadata

#4027 relaxes the check on ArrowWriter

> ParquetRecordBatchStream::schema produces a Schema without metadata, or

I think this would be consistent with ParquetRecordBatchReader, and would ensure it matches the RecordBatches that are returned. ParquetRecordBatchStreamBuilder::schema should return the schema with metadata.
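
The split described here, where the builder exposes the full file schema with metadata and the stream exposes the projected schema that its batches carry, might look like the following toy sketch. `ToyBuilder`, `ToyStream`, `example_builder`, and the column names are all hypothetical; this models the intended contract only and is not the arrow-rs implementation.

```rust
use std::collections::HashMap;

// Hypothetical stand-in for an Arrow schema; columns are just names here.
#[derive(Debug, Clone, PartialEq)]
struct ToySchema {
    fields: Vec<String>,
    metadata: HashMap<String, String>,
}

// Toy builder: holds the full file schema (with metadata) and a projection
// given as column indices.
struct ToyBuilder {
    file_schema: ToySchema,
    projection: Vec<usize>,
}

// Toy stream: holds only the projected, metadata-free schema.
struct ToyStream {
    schema: ToySchema,
}

impl ToyBuilder {
    // The builder reports the file schema, metadata included.
    fn schema(&self) -> &ToySchema {
        &self.file_schema
    }

    // Building applies the projection and drops the metadata.
    fn build(self) -> ToyStream {
        let fields = self
            .projection
            .iter()
            .filter_map(|&i| self.file_schema.fields.get(i).cloned())
            .collect();
        ToyStream {
            schema: ToySchema { fields, metadata: HashMap::new() },
        }
    }
}

impl ToyStream {
    // The stream reports exactly the schema its batches carry.
    fn schema(&self) -> &ToySchema {
        &self.schema
    }

    // Schema attached to a yielded batch in this model.
    fn next_batch_schema(&self) -> ToySchema {
        self.schema.clone()
    }
}

// Example input: three columns, one metadata entry, projecting column 1.
fn example_builder() -> ToyBuilder {
    let mut metadata = HashMap::new();
    metadata.insert("created_by".to_string(), "example-writer".to_string());
    ToyBuilder {
        file_schema: ToySchema {
            fields: vec!["id".to_string(), "name".to_string(), "score".to_string()],
            metadata,
        },
        projection: vec![1],
    }
}
```

In this model a writer constructed from the stream's `schema()` always agrees with every batch, and callers who need the file metadata read it from the builder before calling `build()`.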

tustvold changed the title from "ParquetRecordBatchStream is inconsistent about schemas" to "ParquetRecordBatchStream Should Return the Projected Schema" on Nov 28, 2023

tustvold commented

Updated based on #5135 (comment)

I agree this is a bug
