Add varant_interop tests for objects and lists/arrays #1

alamb · 2025-06-16T18:59:56Z

Which issue does this PR close?

Rationale for this change

This PR adds end-to-end integration tests for @scovich's PR:

Finish implementing Variant::Object and Variant::List apache/arrow-rs#7666

When implementing variant object support, it is important to make sure we can read what Spark wrote, so I updated the variant_interop test to do so.

What changes are included in this PR?

New tests
Fix a bug that was found by the new tests

Are there any user-facing changes?

alamb · 2025-06-16T19:01:02Z

parquet-variant/src/variant.rs

-                self.header.values_start_byte + start_offset
-                    ..self.header.values_start_byte + end_offset,
-            )?;
+            let value_bytes =


The spec says that the offsets may not be monotonically increasing, so the correct slice is all of the subsequent bytes (even though fewer may be used).

https://github.com/apache/parquet-format/blob/master/VariantEncoding.md#encoding-types

This implies that the field_offset values may not be monotonically increasing. For example, for the following object:

Without this change, the new tests failed with an assertion.
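To illustrate the failure mode, here is a minimal sketch. The helper names `bounded_slice` and `open_ended_slice` are hypothetical and are not the crate's API; the offsets are made up so that the field that comes next by name sits physically earlier in the value bytes.

```rust
/// Hypothetical helper (not the crate's API): slice a value using the
/// current field's offset and the next field's offset as the bounds.
fn bounded_slice(values: &[u8], start: usize, end: usize) -> Option<&[u8]> {
    // `get` returns None when start > end or the range is out of bounds.
    values.get(start..end)
}

/// Hypothetical helper: slice from the field's offset to the end of the
/// value bytes, leaving it to the value's own header to say how much is used.
fn open_ended_slice(values: &[u8], start: usize) -> Option<&[u8]> {
    values.get(start..)
}

/// Returns the slice lengths produced by each approach for the first field.
fn demo() -> (Option<usize>, Option<usize>) {
    let values = [0u8; 8];
    // Made-up offsets in field-name order: the first field's value starts at
    // byte 4, while the next field's value starts at byte 0.
    let offsets = [4usize, 0, 8];
    (
        bounded_slice(&values, offsets[0], offsets[1]).map(|s| s.len()),
        open_ended_slice(&values, offsets[0]).map(|s| s.len()),
    )
}
```

With these offsets the bounded range is `4..0`, which cannot be sliced, while the open-ended slice still covers the value.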

Good catch. And the problem would arise if the "next" offset (by field name) corresponds to a sub-object that's physically earlier in the layout. So we can't compute the size of the sub-object using offsets. Unless we're willing to search/sort the whole offset list to find the upper bound (= 🙀).

I suppose we could at least limit the slice (in the common case) by using the "next" offset only when it's not smaller than the starting offset. In that case, it should be a safe upper bound. But I'm not sure how helpful it would actually be, given that no invariant is reliably enforced?
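A rough sketch of that idea, assuming a hypothetical helper `slice_end` (not part of the crate). One deliberate deviation from the wording above: the sketch only trusts the "next" offset when it is strictly greater than the start, because equal offsets would produce an empty slice.

```rust
/// Hypothetical helper (not the crate's API): choose the end of a value's
/// slice. The "next" offset is used only when it lies strictly after
/// `start`; otherwise the slice runs to the end of the value bytes.
fn slice_end(start: usize, next: usize, values_len: usize) -> usize {
    if next > start {
        // Common case: offsets increase, so `next` bounds this value.
        // Clamp it so a bogus offset cannot run past the buffer.
        next.min(values_len)
    } else {
        // Out-of-order (or equal) offsets: no usable bound, take the rest.
        values_len
    }
}
```

This only tightens the slice in the common case. It does not validate the buffer, so it gives no protection against malformed offsets.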

Actually... a buggy/malicious variant buffer could potentially lead to some "fun" here, where one sub-object refers to bytes shared by another sub-object. Hopefully at least one of the "overlapping" sub-objects would be obviously invalid in a way we could detect, but there's no guarantee of that.

Bonus points if somebody can craft an overlapping pair of sub-objects that are completely valid. I suspect with nested objects it should be possible -- one sub-object would seem to have the other sub-object as both a child and a sibling.

I suppose we could at least limit the slice (in the common case) by using the "next" offset only when it's not smaller than the starting offset. In that case, it should be a safe upper bound. But I'm not sure how helpful it would actually be, given that no invariant is reliably enforced?

Yeah, and I am not sure what limiting the value slice would achieve anyways -- all the code that interprets variant values looks at the value header to know how many bytes of the value to look at. So if the slice is longer than

Actually... a buggy/malicious variant buffer could potentially lead to some "fun" here, where one sub-object refers to bytes shared by another sub-object. Hopefully at least one of the "overlapping" sub-objects would be obviously invalid in a way we could detect, but there's no guarantee of that.

FWIW I don't think the spec prevents variant values from being reused (that is, the values of two sibling fields may point to the same offset within the value).

The only requirement from what I can see is that the values pointed to by the value header are valid variants.

So if the slice is longer than

Incomplete sentence?

The only requirement from what I can see is that the values pointed to by the value header are valid variants.

It does seem that way, yes.

So if the slice is longer than

Incomplete sentence?

Sorry -- what I meant was "if the slice is longer than needed, any remaining bytes will be ignored"
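As a small illustration of that point, here is a sketch of decoding a short string from a slice that is longer than needed. `decode_short_string` is a hypothetical function, not the crate's decoder. It assumes the short-string layout as I read it in the Variant encoding spec: the low 2 bits of the header byte hold basic type 1, and the upper 6 bits hold the length.

```rust
/// Hypothetical decoder (not the crate's API) for a Variant short string.
/// Assumed layout: header byte = (length << 2) | 1, followed by `length`
/// bytes of UTF-8. Bytes after that are never examined.
fn decode_short_string(value: &[u8]) -> Option<&str> {
    let header = *value.first()?;
    // Low 2 bits are the basic type; 1 is assumed to mean "short string".
    if header & 0b11 != 1 {
        return None;
    }
    // Upper 6 bits give the string length in bytes.
    let len = (header >> 2) as usize;
    // Only the bytes the header asks for are read; a too-short slice fails.
    let bytes = value.get(1..1 + len)?;
    std::str::from_utf8(bytes).ok()
}
```

For example, `decode_short_string(&[0x09, b'h', b'i', 0xFF, 0xFF])` yields `Some("hi")`: the two trailing bytes are simply never read.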

alamb · 2025-06-16T19:01:15Z

parquet-variant/tests/variant_interop.rs

+    assert!(variant_object.is_empty());
+}
+#[test]
+fn variant_object_primitive() {


The point of the PR was to add these tests

scovich

Thanks for adding the integration test!

Add varant_interop tests for objects and lists/arrays

d7f75cb

alamb commented Jun 16, 2025

alamb mentioned this pull request Jun 16, 2025

Finish implementing Variant::Object and Variant::List apache/arrow-rs#7666

Merged

scovich reviewed Jun 16, 2025

scovich merged commit 480ef5d into scovich:variant-object Jun 16, 2025
1 check failed

alamb deleted the alamb/variant-object-tests branch June 17, 2025 11:24
