[Parquet] Implement Variant type support in Parquet #6736
Comments
There's an implementation in Spark (try here for starters), but when I last looked ~two months ago there was no binary test data, only some round trips via JSON.
I do wonder if a precursor to supporting this would be some way to translate / represent the variant data in arrow. Whilst there are non-arrow APIs, they'd likely struggle to accommodate this addition, and they aren't how the vast majority of people consume parquet data using this crate.
From an arrow perspective, would that be a new DataType, or rather a convention of using DataType::Struct with two Binary fields? A fully performant variant implementation should be able to leverage file-level column disaggregation (shredding), but I do think this could come as a follow-up to a "normal" Variant type implementation.
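To make the Struct-of-two-binaries idea concrete, here is a minimal sketch using arrow-rs. The field names `metadata` / `value` and the byte contents are illustrative assumptions, not an adopted convention or a real variant encoding:

```rust
use std::sync::Arc;

use arrow::array::{Array, ArrayRef, BinaryArray, StructArray};
use arrow::datatypes::{DataType, Field, Fields};

fn main() {
    // Two encoded variant rows; the bytes are placeholders, not real variant encoding.
    let metadata: ArrayRef = Arc::new(BinaryArray::from(vec![
        b"\x01\x00".as_slice(),
        b"\x01\x00".as_slice(),
    ]));
    let value: ArrayRef = Arc::new(BinaryArray::from(vec![
        b"\x0c\x2a".as_slice(),
        b"\x0d\x01".as_slice(),
    ]));

    // Hypothetical convention: a variant column is a Struct with two binary
    // children named `metadata` and `value`.
    let fields = Fields::from(vec![
        Field::new("metadata", DataType::Binary, false),
        Field::new("value", DataType::Binary, false),
    ]);
    let variant = StructArray::new(fields, vec![metadata, value], None);

    println!("{:?}", variant.data_type());
}
```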
I don't know; I've not really been following the variant proposal closely enough to weigh in here. However, my understanding is that shredding is one of the major motivators for adding this to parquet, as without it you might as well just embed any record format, e.g. Avro. I therefore suspect most use-cases will be at least partially shredded, and the reader will need to handle this case. This is especially true given that variant_value is NULL when the data is shredded, as opposed to, say, duplicating the content (which would have its own issues, TBC), so we can't just ignore the shredded data. Unfortunately I can't see an obvious way to represent this sort of semi-structured data within the arrow format without introducing a new DataType that can accommodate arrays having the same type but different child layouts... TL;DR: I suspect actioning this will require arrow defining a way to represent semi-structured data.
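As a rough illustration of why a reader cannot simply ignore the shredded column, here is a sketch of per-row reassembly. It assumes a single shredded Int64 `typed_value` column alongside the binary `value` column; the names and the merge rule are simplified for illustration and are not taken from any parquet-rs API or from the shredding proposal's exact rules:

```rust
use arrow::array::{Array, BinaryArray, Int64Array};

/// Hypothetical per-row view of a partially shredded variant column.
#[derive(Debug)]
enum RowValue<'a> {
    Shredded(i64),
    Encoded(&'a [u8]),
    Null,
}

/// For each row, consult the shredded `typed_value` column first, and fall back
/// to the encoded binary `value` only when the row was not shredded.
/// Assumes both arrays have the same length.
fn reassemble<'a>(value: &'a BinaryArray, typed_value: &Int64Array) -> Vec<RowValue<'a>> {
    (0..value.len())
        .map(|i| {
            if typed_value.is_valid(i) {
                RowValue::Shredded(typed_value.value(i))
            } else if value.is_valid(i) {
                RowValue::Encoded(value.value(i))
            } else {
                RowValue::Null
            }
        })
        .collect()
}
```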
There needs to be a way to represent a series of variant values having "no type in common" (variant integer, variant boolean, variant varchar, etc. all mixed up). For that, some blob-like representation with internal structure seems natural.
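For intuition, a sketch of what a decoder might hand back per row: each row can be a different shape even though the column stores them uniformly as encoded bytes. The type and variant names here are illustrative only, not a proposed API:

```rust
use std::collections::BTreeMap;

/// Hypothetical decoded form of a single variant value, showing that a column
/// of variants has no single Arrow type in common across rows.
#[derive(Debug)]
enum VariantValue {
    Null,
    Boolean(bool),
    Int64(i64),
    Double(f64),
    String(String),
    Binary(Vec<u8>),
    Array(Vec<VariantValue>),
    Object(BTreeMap<String, VariantValue>),
}
```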
When considering what to do in Arrow, we should also keep an eye on the ongoing effort in Iceberg: apache/iceberg#10392 (comment)
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Parquet recently adopted the Variant type from Spark: https://github.com/apache/parquet-format/blob/master/VariantEncoding.md. Details of the encoding are described in that document.
Describe the solution you'd like
I would like to implement variant support in parquet-rs
Describe alternatives you've considered
Additional context
I am not sure whether any other parquet implementations have implemented this yet, or whether there are example parquet files. I will attempt to find out.