Finish implementing Variant::Object and Variant::List #7666

scovich · 2025-06-13T23:29:01Z

Which issue does this PR close?

Closes [Variant] Implement VariantObject::field and VariantObject::fields #7665

Rationale for this change

Continuing the ongoing variant implementation effort.

What changes are included in this PR?

As per title -- implement fairly complete support for variant objects and arrays. Also add some unit tests.

Note: This PR renames VariantArray as VariantList to align with parquet and arrow terminology, and to not conflict with the VariantArray we will eventually need to define for holding an arrow array of variant-typed data.

Are there any user-facing changes?

Those variant subtypes should now be usable.

scovich · 2025-06-13T23:29:46Z

Attn @alamb @mkarbo since I can't request reviewers.

alamb · 2025-06-16T17:49:46Z

I merged up from main to resolve a conflict with this branch

alamb

Thank you so much @scovich 🙏 -- this is looking quite close. I think we should fix the sorted dictionary thing and ensure the objects from parquet-testing can be read before merging this PR but otherwise everything else could be done as a follow on

alamb · 2025-06-16T17:06:36Z

parquet-variant/src/variant.rs

    pub metadata: &'m VariantMetadata<'m>,
    pub value: &'v [u8],


It wasn't added in this PR, but I think we should probably make these fields non-public (we can add accessors or something) so we can:

1. Hint that people should use the nicer APIs

2. Potentially change the implementation if needed

We can do this in some other PR too, I just wanted to point this out

alamb · 2025-06-16T17:06:58Z

parquet-variant/src/variant.rs

    pub metadata: &'m VariantMetadata<'m>,
    pub value: &'v [u8],
+    header: VariantObjectHeader,


I think this is a good design -- parse / validate the relevant fields from the header once and then save them to be used in subsequent passes
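
For illustration, a minimal sketch of the parse-once idea. This is not the PR's actual `VariantObjectHeader`: the names `ObjectHeaderSketch` and `parse_object_header` are hypothetical, and the bit layout (basic type in the low 2 bits, then offset size, id size, and `is_large`) is assumed from the variant encoding spec.

```rust
/// Hypothetical, simplified stand-in for the PR's `VariantObjectHeader`.
#[derive(Clone, Debug, PartialEq)]
pub struct ObjectHeaderSketch {
    /// Bytes per field offset (1..=4).
    pub field_offset_size: usize,
    /// Bytes per field id (1..=4).
    pub field_id_size: usize,
    /// When true, the element count is stored in 4 bytes instead of 1.
    pub is_large: bool,
}

/// Decodes the object header byte once so later lookups can reuse the result.
/// Assumed layout: low 2 bits = basic type, upper 6 bits = value header.
pub fn parse_object_header(header_byte: u8) -> Result<ObjectHeaderSketch, String> {
    const BASIC_TYPE_OBJECT: u8 = 2; // assumption taken from the spec
    if header_byte & 0b11 != BASIC_TYPE_OBJECT {
        return Err(format!("not an object header: {header_byte:#04x}"));
    }
    let value_header = header_byte >> 2;
    Ok(ObjectHeaderSketch {
        field_offset_size: ((value_header & 0b11) + 1) as usize,
        field_id_size: (((value_header >> 2) & 0b11) + 1) as usize,
        is_large: (value_header >> 4) & 1 == 1,
    })
}
```

Every later accessor can then read the cached sizes instead of re-extracting bits from the raw byte.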

alamb · 2025-06-16T17:25:18Z

parquet-variant/src/variant.rs

+    pub fn field(&self, name: &str) -> Result<Option<Variant<'m, 'v>>, ArrowError> {
+        // Binary search through the sorted field IDs to find the field
+        let (field_ids, field_offsets) = self.parse_field_arrays()?;
+        let search_result = try_binary_search_by(&field_ids, &name, |&field_id| {


I think it is only correct to use binary search for the field name if the metadata has the fields sorted.

Perhaps for now we can just update this PR to return a NotYetImplemented error if the dictionary is not sorted.

Suggested change

-        let search_result = try_binary_search_by(&field_ids, &name, |&field_id| {
+        if !self.metadata.is_sorted() {
+            return Err(ArrowError::NotYetImplemented(
+                "Cannot search for fields in an unsorted VariantObject".to_string(),
+            ));
+        }
+        let search_result = try_binary_search_by(&field_ids, &name, |&field_id| {

This also confused me at first... I probably should have added a code comment.

This binary search is over the field names of the variant object itself, which are indirectly referenced via the metadata dictionary. And the spec does require those to be lexically ordered:

The field ids and field offsets must be in lexicographical order of the corresponding field names in the metadata dictionary.

(basically, if the requested field name actually exists, it must match the name referenced by one of the struct's field ids... and we can binary search them because those ids are in lexical order according to their backing dictionary entries)

Aside: It would be incorrect to directly search the metadata dictionary, because that could "find" a field name that doesn't actually exist in the current object.
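
The point above can be sketched with plain slices. This is only an illustration under stated assumptions: `find_field` is a hypothetical helper, a `&[&str]` stands in for the metadata dictionary, and a `&[usize]` stands in for the object's decoded field ids. Unlike the PR's fallible `try_binary_search_by`, this sketch would panic on an out-of-range id.

```rust
/// Looks up `name` among an object's field ids.
///
/// `dictionary` stands in for the metadata dictionary and need not be sorted.
/// `field_ids` are the object's own ids, ordered by the names they refer to.
/// The returned position indexes `field_ids` (and the parallel field offsets).
pub fn find_field(dictionary: &[&str], field_ids: &[usize], name: &str) -> Option<usize> {
    field_ids
        // Compare the name each id refers to, not the id itself.
        .binary_search_by(|&id| dictionary[id].cmp(name))
        .ok()
}
```

With the unsorted dictionary `["zebra", "apple", "mango", "kiwi"]` and an object holding only `apple` and `zebra` (field ids `[1, 0]`), looking up `"mango"` returns `None` even though the dictionary contains it, which is why searching the dictionary directly would be wrong.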

alamb · 2025-06-16T17:28:12Z

parquet-variant/src/variant.rs

-    pub fn get(&self, index: usize) -> Result<Variant<'m, 'v>, ArrowError> {
+    pub fn field(&self, name: &str) -> Result<Option<Variant<'m, 'v>>, ArrowError> {
+        // Binary search through the sorted field IDs to find the field
+        let (field_ids, field_offsets) = self.parse_field_arrays()?;


We can probably optimize this code in a future PR -- I don't think we need to create a whole Vec<..> just to search

Maybe we can implement an OffsetSize::try_binary_search_by-style method that directly computes the offsets during the search.

Similarly, we could also add an OffsetSize::try_linear_search_by method that directly does the linear search when the dictionary is not sorted

I fully agree this is not an optimal API -- I just implemented the existing stub methods to give a starting point we can iterate on.

As for try_linear_search_by -- we may eventually need to define it as part of the work to support unshredding of shredded variants (because then we have to find the name of each shredded field in the possibly unordered dictionary), but I think we can defer that for now (see other comment thread).

I fully agree this is not an optimal API -- I just implemented the existing stub methods to give a starting point we can iterate on.

I think this is a good plan
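
One way the "search without building a Vec" idea could look, as a rough sketch only. `read_entry` and `search_packed` are hypothetical names, not existing `OffsetSize` methods; the sketch assumes the ids are packed as little-endian unsigned integers of 1 to 4 bytes each.

```rust
use std::cmp::Ordering;

/// Reads the `index`-th little-endian unsigned integer of `width` bytes (1..=4).
fn read_entry(bytes: &[u8], width: usize, index: usize) -> Option<usize> {
    if width == 0 || width > 4 {
        return None;
    }
    let start = index.checked_mul(width)?;
    let chunk = bytes.get(start..start.checked_add(width)?)?;
    let mut buf = [0u8; 4];
    buf[..width].copy_from_slice(chunk);
    Some(u32::from_le_bytes(buf) as usize)
}

/// Binary search over `len` packed entries, decoding each probe on demand.
/// Returns Ok(Some(i)) on a hit, Ok(None) on a miss, Err on malformed input.
pub fn search_packed<F>(
    bytes: &[u8],
    width: usize,
    len: usize,
    mut cmp: F,
) -> Result<Option<usize>, String>
where
    F: FnMut(usize) -> Result<Ordering, String>,
{
    let (mut lo, mut hi) = (0usize, len);
    while lo < hi {
        let mid = lo + (hi - lo) / 2;
        let entry = read_entry(bytes, width, mid)
            .ok_or_else(|| format!("entry {mid} is out of bounds"))?;
        match cmp(entry)? {
            Ordering::Equal => return Ok(Some(mid)),
            Ordering::Less => lo = mid + 1,
            Ordering::Greater => hi = mid,
        }
    }
    Ok(None)
}
```

Only the probed entries are decoded, so a lookup touches O(log n) entries and allocates nothing. A linear variant would walk the same `read_entry` calls in order.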

alamb · 2025-06-16T17:29:36Z

parquet-variant/src/variant.rs

+}
+
+#[derive(Clone, Debug, PartialEq)]
+pub struct VariantListHeader {


I wonder if this needs to be pub or if it is ok for it just to be pub(crate)

IMO everything is too visible right now... it should all be pub(crate) or less until we see actual reasons to make it pub. But that seems like a piece of general follow-on work?

Yup -- a follow on would be great

Meanwhile, I made the new header structs and their methods pub(crate). We can fix the others later once use cases are clearer.

alamb · 2025-06-16T17:32:25Z

parquet-variant/src/variant.rs

@@ -717,4 +980,257 @@ mod tests {
            "unexpected error: {err:?}"
        );
    }
+


Some other things to test here would be:

objects that have "is_large" set (aka have more than 256 distinct field names)

Do we actually need 256+ field names in the test? Or just verify our ability to process the wider offsets correctly?
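
A small sketch of what "just verify the wider encoding" could exercise, assuming the spec's rule that the element count after the header byte takes 1 byte normally and 4 little-endian bytes when `is_large` is set. `read_num_elements` is a hypothetical helper, not a function from the PR.

```rust
/// Reads the element count that follows the header byte at `value[0]`.
/// Assumption: 1 byte when `is_large` is false, 4 little-endian bytes otherwise.
pub fn read_num_elements(value: &[u8], is_large: bool) -> Option<usize> {
    if is_large {
        // `get` returns None instead of panicking on truncated input.
        let b = value.get(1..5)?;
        Some(u32::from_le_bytes([b[0], b[1], b[2], b[3]]) as usize)
    } else {
        value.get(1).map(|&b| b as usize)
    }
}
```

A test could then feed a hand-built buffer with `is_large` set and a small count, which checks the 4-byte path without constructing hundreds of fields. Whether real writers ever emit such a value is a separate question.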

alamb · 2025-06-16T17:33:22Z

parquet-variant/src/variant.rs

+        let last_field_offset_byte =
+            field_offsets_start_byte + (num_elements + 1) * field_offset_size as usize;
+        if last_field_offset_byte > value.len() {
+            return Err(ArrowError::InvalidArgumentError(format!(


it would be great eventually to cover these error cases with tests too (aka verify invalid inputs). I don't think it is needed for this PR

Filed as #7681

alamb · 2025-06-16T17:38:19Z

parquet-variant/src/variant.rs

+// NOTE: We differ from the variant spec and call it "list" instead of "array" in order to be
+// consistent with parquet and arrow type naming. Otherwise, the name would conflict with the
+// `VariantArray : Array` we must eventually define for variant-typed arrow arrays.


Suggested change

-// NOTE: We differ from the variant spec and call it "list" instead of "array" in order to be
-// consistent with parquet and arrow type naming. Otherwise, the name would conflict with the
-// `VariantArray : Array` we must eventually define for variant-typed arrow arrays.
+/// Represents a Variant `Array`
+///
+/// NOTE: The `List` naming differs from the variant spec, which uses "array", in order to be
+/// consistent with parquet and arrow type naming. Otherwise, the name would conflict with the
+/// `VariantArray : Array` we must eventually define for variant-typed arrow arrays.

alamb · 2025-06-16T17:51:03Z

parquet-variant/src/variant.rs

+
+    /// Returns the offset size in bytes
+    pub fn offset_size(&self) -> usize {
+        self.offset_size as _


if we ever need to optimize the size of the VariantObjectHeader we can potentially just store the header byte and re-extract the appropriate bits

The OffsetSizeBytes enum occupies 1 byte in practice, so I don't think we'd save much?
The offsets (2x usize) will anyway dwarf it.
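
The size argument can be checked with `std::mem::size_of`. The types below are hypothetical mirrors, not the crate's definitions: `OffsetSizeSketch` assumes a fieldless four-variant enum, and `HeaderSketch` assumes one such enum plus two cached `usize` offsets.

```rust
use std::mem::size_of;

/// Hypothetical mirror of `OffsetSizeBytes`: a fieldless enum with four variants.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum OffsetSizeSketch {
    One = 1,
    Two = 2,
    Three = 3,
    Four = 4,
}

/// Hypothetical header shape: the enum plus two cached offsets.
pub struct HeaderSketch {
    pub offset_size: OffsetSizeSketch,
    pub first_offset_byte: usize,
    pub first_value_byte: usize,
}

/// Returns (size of the enum, size of the header struct) in bytes.
pub fn sizes() -> (usize, usize) {
    (size_of::<OffsetSizeSketch>(), size_of::<HeaderSketch>())
}
```

Under these assumptions the two `usize` fields dominate the struct, so replacing the enum with the raw header byte would not shrink it.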

alamb · 2025-06-16T17:53:10Z

parquet-variant/tests/variant_interop.rs

@@ -99,7 +99,7 @@ fn variant_non_primitive() -> Result<(), ArrowError> {
                assert_eq!(dict_val, "int_field");
            }
            "array_primitive" => match variant {
-                Variant::Array(arr) => {
+                Variant::List(arr) => {


I think it is important to make sure this code works for the pre-existing test data from parquet-testing -- I will try to make a PR to update these tests and verify this PR's implementation

Merged it in, thanks!

alamb

I made a PR to your branch that I think shows this code works well

Add varant_interop tests for objects and lists/arrays scovich/arrow-rs#1

alamb · 2025-06-16T18:31:18Z

parquet-variant/src/variant.rs

+        self.len() == 0
+    }
+
+    pub fn values(&self) -> Result<impl Iterator<Item = Variant<'m, 'v>>, ArrowError> {


Something I noticed here was that it would be really nice if:

1. This returned an Iterator rather than a Result

2. We could implement iter() and IntoIterator for VariantList

That would make using it more ergonomic, though it would require either panicking or else validating the offsets on construction 🤔

Hmm, this is tricky. Ideally we wouldn't even materialize the result just to iterate over it, but that would require an iterator with Item = Result which is yet more annoying.

Definitely not a fan of allowing untrusted input to cause a panic.

But if we want to make this method infallible, we'd need to pay the validation cost in the constructor.

So it seems like we have two choices:

1. Pay O(n) to validate the offsets in the constructor, and unwrap here.

   PRO: Cleaner API, allows iterating without materializing the result first

2. Keep as-is, or even return an iterator of results instead of a result of an iterator.

   PRO: Provably panic-free (no unwrap to reason about)

If we went for option 1, I'd favor an internal method that returns an iterator of results. The constructor would instantiate the iterator and verify it's all-ok, at which point we know values can safely invoke map(Result::unwrap) because we already consumed the iterator once. Otherwise, I worry the checks in the constructor could diverge from the checks we trigger here, causing an unexpected panic.
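
A rough sketch of the validate-in-the-constructor option, with one shared code path for validation and iteration. `ListSketch`, `try_new`, and `try_values` are hypothetical names, and raw byte slices stand in for `Variant` values; this is not the PR's `VariantList`.

```rust
/// Hypothetical list whose elements are byte ranges given by an offsets array.
pub struct ListSketch<'v> {
    offsets: Vec<usize>, // num_elements + 1 entries
    data: &'v [u8],
}

impl<'v> ListSketch<'v> {
    /// Validates every element once, using the same code path as `values`.
    pub fn try_new(offsets: Vec<usize>, data: &'v [u8]) -> Result<Self, String> {
        let list = Self { offsets, data };
        for element in list.try_values() {
            element?; // stop at the first invalid offset pair
        }
        Ok(list)
    }

    /// Internal fallible iterator: one bounds check per element.
    fn try_values(&self) -> impl Iterator<Item = Result<&'v [u8], String>> + '_ {
        self.offsets.windows(2).map(move |w| {
            self.data
                .get(w[0]..w[1])
                .ok_or_else(|| format!("invalid offsets {}..{}", w[0], w[1]))
        })
    }

    /// Infallible after construction: `try_new` already ran `try_values` to completion.
    pub fn values(&self) -> impl Iterator<Item = &'v [u8]> + '_ {
        self.try_values().map(|r| r.expect("validated in try_new"))
    }
}
```

Because `values` reuses `try_values`, the constructor's checks cannot drift from the checks performed during iteration, which is the divergence concern raised above.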

Yeah, I think in general this is hitting the tension you mentioned in the earlier PRs of early vs late validation.

The more I think about it the more I think we should move the validation to construction because:

The only reason to create a Variant in the first place is to access its data, so I think we would end up validating it almost immediately on read

Variants are constructed once but read many times so validating up front is probably faster

We could offer an unchecked constructor that skips the validation, in case the performance overhead is high

Filed two related issues

[Variant] Improve API for iterating over values of a VariantList #7685

[Variant] Consider validating variants on creation (rather than read) #7684

alamb · 2025-06-16T21:00:47Z

@scovich -- what do you recommend for next steps?

Shall I merge this PR and we can keep iterating on main?
Would you like to work on it some more?

I would like to get it merged in sooner rather than later to minimize conflicts such as #7670 (review)

scovich · 2025-06-16T23:39:45Z

@scovich -- what do you recommend for next steps?
1. Shall I merge this PR and we can keep iterating on main?

2. Would you like to work on it some more?
I would like to get it merged in sooner rather than later to minimize conflicts such as #7670 (review)

I addressed most of your review points and pushed; the only big outstanding question is how to handle the iterators. But we can probably tackle that in a separate PR?

alamb · 2025-06-17T11:14:59Z

parquet-variant/src/variant.rs

-        // Binary search through the sorted field IDs to find the field
+        // Binary search through the field IDs of this object to find the requested field name.
+        //
+        // NOTE: This does not require a sorted metadata dictionary, because the variant spec


alamb · 2025-06-17T11:16:00Z

Let's get this one in and keep iterating. Thank you @scovich

scovich added 2 commits June 12, 2025 18:53

Implement Variant::Object

04bedf9

Implement Variant::List (renamed from Variant::Array)

49768f3

github-actions bot added the parquet Changes to the parquet crate label Jun 13, 2025

github-actions bot added the arrow Changes to the arrow crate label Jun 16, 2025

alamb force-pushed the variant-object branch from beb966c to 49768f3 Compare June 16, 2025 17:46

github-actions bot removed the arrow Changes to the arrow crate label Jun 16, 2025

Merge remote-tracking branch 'apache/main' into variant-object

1f4ab8b

Merge remote-tracking branch 'apache/main' into variant-object

8cc9d05

alamb mentioned this pull request Jun 16, 2025

Variant: Write Variant Values as JSON #7670

Open

alamb reviewed Jun 16, 2025


alamb mentioned this pull request Jun 16, 2025

Add varant_interop tests for objects and lists/arrays scovich/arrow-rs#1

Merged

alamb reviewed Jun 16, 2025


alamb approved these changes Jun 16, 2025


Add varant_interop tests for objects and lists/arrays (#1)

480ef5d

scovich mentioned this pull request Jun 16, 2025

it would be great eventually to cover these error cases with tests too (aka verify invalid inputs). I don't think it is needed for this PR #7681

Open

address review feedback

bdea68a

alamb approved these changes Jun 17, 2025


alamb merged commit f5f09ea into apache:main Jun 17, 2025
12 checks passed

This was referenced Jun 17, 2025

[Variant] Consider validating variants on creation (rather than read) #7684

Open

[Variant] Improve API for iterating over values of a VariantList #7685

Open
