[Parquet] Implement Variant type support in Parquet #6736

alamb · 2024-11-15T20:50:48Z

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Parquet recently adopted the Variant type from Spark: https://github.com/apache/parquet-format/blob/master/VariantEncoding.md

Details on GH-455: Add Variant specification docs parquet-format#456

Describe the solution you'd like
I would like to implement variant support in parquet-rs

Describe alternatives you've considered

Additional context
I am not sure whether any other Parquet implementations have implemented this yet, or whether there are example Parquet files. I will attempt to find out.

CurtHagenlocher · 2024-11-15T21:02:52Z

There's an implementation in Spark (try here for starters) but when I last looked ~two months ago there was no binary test data; only some round trips via JSON.

tustvold · 2024-12-04T15:05:55Z

I do wonder if a precursor to supporting this would be some way to translate / represent the variant data in Arrow. Whilst there are non-Arrow APIs, they'd likely struggle to accommodate this addition, and they aren't how the vast majority of people consume Parquet data using this crate.

findepi · 2024-12-04T15:18:26Z

From an Arrow perspective, would that be a new DataType, or rather a convention of using DataType::Struct with two Binary fields?

A fully performant variant implementation should be able to leverage file-level column disaggregation (shredding), but I do think this could come as a follow-up to a "normal" Variant type implementation.

tustvold · 2024-12-04T16:38:26Z

> From an Arrow perspective, would that be a new DataType, or rather a convention of using DataType::Struct with two Binary fields?

I don't know, I've not really been following the variant proposal closely enough to weigh in here. However, my understanding is that shredding is one of the major motivators for this getting added to Parquet, as without it you might as well just embed any record format, e.g. Avro. I therefore suspect most use-cases will be at least partially shredded, and the reader will need to handle this case. This is especially true given that the variant_value is NULL when the data is shredded, as opposed to, say, duplicating the content (which would have its own issues TBC), and so we can't just ignore the shredded data.
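To make the reader requirement concrete, here is a minimal sketch in plain Rust using only the standard library. It is not the parquet or arrow crate API: `ShreddedRow`, `Recovered`, and both reader functions are hypothetical names, the column is assumed to be shredded as a single 64-bit integer, and the variant-encoded bytes are treated as opaque. Partially shredded objects are not covered. It only illustrates that a reader which looks at the binary value alone sees nothing for shredded rows.

```rust
/// Hypothetical in-memory view of one row of a variant column shredded as i64.
/// Field names are illustrative; this is not the parquet or arrow crate API.
#[derive(Debug, Clone, PartialEq)]
pub struct ShreddedRow {
    /// Variant-encoded bytes (opaque here); `None` when the row was shredded.
    pub value: Option<Vec<u8>>,
    /// The shredded, strongly typed value; `None` if the row did not fit the type.
    pub typed_value: Option<i64>,
}

/// What a reader can recover for a single row.
#[derive(Debug, Clone, PartialEq)]
pub enum Recovered {
    Typed(i64),
    Encoded(Vec<u8>),
    Missing,
}

/// Reader that consults the shredded column first, then the encoded bytes.
pub fn read_row(row: &ShreddedRow) -> Recovered {
    match (row.typed_value, &row.value) {
        (Some(v), _) => Recovered::Typed(v),
        (None, Some(bytes)) => Recovered::Encoded(bytes.clone()),
        (None, None) => Recovered::Missing,
    }
}

/// Naive reader that only looks at the encoded bytes and ignores shredding.
pub fn read_row_ignoring_shredding(row: &ShreddedRow) -> Recovered {
    match &row.value {
        Some(bytes) => Recovered::Encoded(bytes.clone()),
        None => Recovered::Missing,
    }
}
```

For a row with `value: None` and `typed_value: Some(42)`, `read_row` recovers the integer while `read_row_ignoring_shredding` reports the row as missing, which is the failure mode described above.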

Unfortunately I can't see an obvious way to be able to represent this sort of semi-structured data within the arrow format without introducing a new DataType that is able to accommodate arrays having the same type, but different child layouts...

TLDR I suspect actioning this will require arrow defining a way to represent semi-structured data...

findepi · 2024-12-04T20:16:31Z

There needs to be a way to represent a series of variant values having "no type in common" (variant integer, variant boolean, variant varchar, etc., all mixed up). For that, some blob-like representation with internal structure seems natural.
Then there should be a way to carry the shredded columns along without having to put them back into that blob, so yes, one type, different child layouts.
It feels to me that the runtime representation will end up being similar to what is defined in Parquet (https://github.com/apache/parquet-format/blob/master/VariantShredding.md)... so maybe it should be the same representation to provide for an efficient read path.
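As a rough illustration of "one type, different child layouts", here is a standard-library-only Rust sketch. `VariantValue` and `VariantColumn` are hypothetical names and not an Arrow or Parquet API; the sketch assumes a tiny subset of value types and a single integer-shredded layout, and it stores decoded values where a real implementation would store encoded bytes.

```rust
/// A few variant values with "no type in common". Illustrative subset only.
#[derive(Debug, Clone, PartialEq)]
pub enum VariantValue {
    Null,
    Bool(bool),
    Int(i64),
    Str(String),
}

/// Hypothetical column type: one logical type, two possible child layouts.
#[derive(Debug, Clone, PartialEq)]
pub enum VariantColumn {
    /// Every row is kept in the blob-like form.
    Unshredded { values: Vec<VariantValue> },
    /// Integers are pulled out into their own child; other rows stay in `residual`.
    ShreddedInt {
        typed: Vec<Option<i64>>,
        residual: Vec<Option<VariantValue>>,
    },
}

impl VariantColumn {
    /// Number of rows, regardless of layout.
    pub fn len(&self) -> usize {
        match self {
            VariantColumn::Unshredded { values } => values.len(),
            VariantColumn::ShreddedInt { typed, .. } => typed.len(),
        }
    }

    /// Logical value at row `i`, independent of the physical layout.
    /// Returns `None` if `i` is out of bounds or both children are null.
    pub fn get(&self, i: usize) -> Option<VariantValue> {
        match self {
            VariantColumn::Unshredded { values } => values.get(i).cloned(),
            VariantColumn::ShreddedInt { typed, residual } => match typed.get(i)? {
                Some(v) => Some(VariantValue::Int(*v)),
                None => residual.get(i).cloned().flatten(),
            },
        }
    }
}
```

Two columns holding the same logical rows, one `Unshredded` and one `ShreddedInt`, return equal results from `get` for every row, even though their children differ. That is the property a consumer would need from whatever representation Arrow ends up with.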

findepi · 2024-12-04T20:37:09Z

When considering what to do in Arrow, we should also keep an eye on the ongoing effort in Iceberg apache/iceberg#10392 (comment)
This could inform some design decisions.
cc @Xuanwo

alamb added the enhancement Any new improvement worthy of a entry in the changelog label Nov 15, 2024

This was referenced Nov 15, 2024:

[Format] Consider adding an official variant type to Arrow apache/arrow#42069 (Open)

GH-455: Add Variant specification docs apache/parquet-format#456 (Merged)

alamb added the parquet Changes to the parquet crate label Nov 15, 2024
