[Parquet] Implement Variant type support in Parquet #6736

Open
alamb opened this issue Nov 15, 2024 · 6 comments
Labels
enhancement Any new improvement worthy of a entry in the changelog parquet Changes to the parquet crate

Comments

@alamb
Contributor

alamb commented Nov 15, 2024

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Parquet recently adopted the Variant type from Spark: https://github.com/apache/parquet-format/blob/master/VariantEncoding.md

Details on

Describe the solution you'd like
I would like to implement variant support in parquet-rs

Describe alternatives you've considered

Additional context
I am not sure whether any other parquet implementations have implemented this yet, or whether there are example parquet files. I will attempt to find out.

@alamb alamb added the enhancement Any new improvement worthy of a entry in the changelog label Nov 15, 2024
@alamb alamb added the parquet Changes to the parquet crate label Nov 15, 2024
@CurtHagenlocher

There's an implementation in Spark (try here for starters) but when I last looked ~two months ago there was no binary test data; only some round trips via JSON.

@tustvold
Contributor

tustvold commented Dec 4, 2024

I do wonder if a precursor to supporting this would be some way to translate / represent the variant data in arrow. Whilst there are non-arrow APIs, they'd likely struggle to accommodate this addition, and they aren't how the vast majority of people consume parquet data using this crate.

@findepi
Member

findepi commented Dec 4, 2024

From arrow perspective, would that be a new DataType, or rather a convention of using DataType::Struct with two Binary fields?

A fully performant variant implementation should be able to leverage file-level column disaggregation (shredding), but I do think this could come as a follow-up to a "normal" Variant type implementation.
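
To make the "struct with two binary fields" option concrete, here is a minimal std-only Rust sketch of that layout: one column of metadata bytes and one column of value bytes, indexed by row. It does not use arrow-rs at all; `UnshreddedVariantColumn` and its methods are hypothetical names invented for illustration, and the byte strings are treated as opaque blobs rather than decoded Variant data.

```rust
/// Hypothetical stand-in for a struct column with two binary children.
/// This is NOT an arrow-rs type; it only illustrates the layout.
#[derive(Debug, Default)]
struct UnshreddedVariantColumn {
    /// Per-row Variant metadata bytes (opaque in this sketch).
    metadata: Vec<Vec<u8>>,
    /// Per-row Variant value bytes (opaque in this sketch).
    value: Vec<Vec<u8>>,
}

impl UnshreddedVariantColumn {
    /// Append one row; both children always grow together.
    fn push(&mut self, metadata: &[u8], value: &[u8]) {
        self.metadata.push(metadata.to_vec());
        self.value.push(value.to_vec());
    }

    /// Number of rows in the column.
    fn len(&self) -> usize {
        self.metadata.len()
    }

    /// Borrow the (metadata, value) pair for a row, or None if out of range.
    fn row(&self, i: usize) -> Option<(&[u8], &[u8])> {
        Some((self.metadata.get(i)?.as_slice(), self.value.get(i)?.as_slice()))
    }
}
```

With this shape every row carries a blob regardless of what it contains, which is why it cannot by itself benefit from shredded columns.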

@tustvold
Contributor

tustvold commented Dec 4, 2024

> From arrow perspective, would that be a new DataType, or rather a convention of using DataType::Struct with two Binary fields?

I don't know; I've not really been following the variant proposal closely enough to weigh in here. However, my understanding is that shredding is one of the major motivators for this getting added to parquet, as without it you might as well just embed any record format, e.g. Avro. I therefore suspect most use-cases will be at least partially shredded, and the reader will need to handle this case. This is especially true given the variant_value is NULL when the data is shredded, as opposed to, say, duplicating the content (which would have its own issues TBC), and so we can't just ignore the shredded data.

Unfortunately I can't see an obvious way to be able to represent this sort of semi-structured data within the arrow format without introducing a new DataType that is able to accommodate arrays having the same type, but different child layouts...

TLDR I suspect actioning this will require arrow defining a way to represent semi-structured data...

@findepi
Member

findepi commented Dec 4, 2024

There needs to be a way to represent a series of variant values having "no type in common" (variant integer, variant boolean, variant varchar, etc all mixed up). For that some blob-like representation with internal structure seems natural.
Then there should be a way to carry the shredded columns along without having to put them back into that blob, so yes, one type, different child layouts.
It feels to me that the runtime representation will end up being similar to what is defined in Parquet (https://github.com/apache/parquet-format/blob/master/VariantShredding.md)... so maybe it should be the same representation to provide for an efficient read path.
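
As a rough illustration of "blob plus shredded columns", the std-only Rust sketch below models a column where each row is either found in a typed child (shredded) or falls back to an opaque blob. All names here (`ShreddedVariantColumn`, `RowValue`, `get`) are hypothetical and not part of arrow-rs or parquet-rs. The sketch makes several simplifying assumptions: only a single `i64` typed child is modelled, the metadata column is omitted, the typed child takes precedence whenever it is non-null, and a row with both children null is treated as missing.

```rust
/// What a reader would hand back for one row in this sketch.
#[derive(Debug, PartialEq)]
enum RowValue<'a> {
    /// Neither child holds a value for this row.
    Missing,
    /// The value was shredded into the typed child column.
    Shredded(i64),
    /// The value is still an opaque variant-encoded blob.
    Blob(&'a [u8]),
}

/// Hypothetical in-memory layout: one nullable blob child and one
/// nullable typed child, both with the same number of rows.
struct ShreddedVariantColumn {
    value: Vec<Option<Vec<u8>>>,
    typed_value: Vec<Option<i64>>,
}

impl ShreddedVariantColumn {
    /// Resolve a row without rebuilding the blob for shredded rows.
    /// Returns None when `row` is out of range for either child.
    fn get(&self, row: usize) -> Option<RowValue<'_>> {
        let typed = self.typed_value.get(row)?;
        let blob = self.value.get(row)?;
        Some(match (typed, blob) {
            // Assumption of this sketch: typed child wins when present.
            (Some(v), _) => RowValue::Shredded(*v),
            (None, Some(b)) => RowValue::Blob(b.as_slice()),
            (None, None) => RowValue::Missing,
        })
    }
}
```

The point of the shape is that a reader can serve shredded rows straight from the typed child and only touch the blob for rows that did not fit the shredded type.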

@findepi
Member

findepi commented Dec 4, 2024

When considering what to do in Arrow, we should also keep an eye on the ongoing effort in Iceberg: apache/iceberg#10392 (comment)
This could inform some design decisions.
cc @Xuanwo

4 participants