Skip to content

Add varant_interop tests for objects and lists/arrays #1

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Jun 16, 2025

Conversation

alamb
Copy link

@alamb alamb commented Jun 16, 2025

Which issue does this PR close?

Rationale for this change

This PR adds end to end integration tests for @scovich 's PR

When implementing variant object support it is important to make sure we can read what spark wrote, so I updated the variant_interop test to do so

What changes are included in this PR?

  1. New tests
  2. Fix a bug that was found by the new tests

Are there any user-facing changes?

self.header.values_start_byte + start_offset
..self.header.values_start_byte + end_offset,
)?;
let value_bytes =
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The spec says that the offsets may be non monotonically increasing, so the correct slice is all the subsequent bytes (even though fewer may be used)

https://github.com/apache/parquet-format/blob/master/VariantEncoding.md#encoding-types

This implies that the field_offset values may not be monotonically increasing. For example, for the following object:

If I didn't make this change the tests asserted on me

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. And the problem would arise if the "next" offset (by field name) corresponds to a sub-object that's physically earlier in the layout. So we can't compute the size of the sub-object using offsets. Unless we're willing to search/sort the whole offset list to find the upper bound (= 🙀).

I suppose we could at least limit the slice (in the common case) by using the "next" offset only when it's not smaller than the starting offset. In that case, it should be a safe upper bound. But I'm not sure how helpful it would actually be, given that no invariant is reliably enforced?

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually... a buggy/malicious variant buffer could potentially lead to some "fun" here, where one sub-object refers to bytes shared by another sub-object. Hopefully at least one of the "overlapping" sub-objects would be obviously invalid in a way we could detect, but there's no guarantee of that.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bonus points if somebody can craft an overlapping pair of sub-objects that are completely valid. I suspect with nested objects it should be possible -- one sub-object would seem to have the other sub-object as both a child and a sibling.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose we could at least limit the slice (in the common case) by using the "next" offset only when it's not smaller than the starting offset. In that case, it should be a safe upper bound. But I'm not sure how helpful it would actually be, given that no invariant is reliably enforced?

Yeah, and I am not sure what limiting the value slice would achieve anyways -- all the code that interprets variant values looks at the value header to know how many bytes of the value to look at. So if the slice is longer than

Actually... a buggy/malicious variant buffer could potentially lead to some "fun" here, where one sub-object refers to bytes shared by another sub-object. Hopefully at least one of the "overlapping" sub-objects would be obviously invalid in a way we could detect, but there's no guarantee of that.

FWIW I don't think the spec prevents variant values from being reused (aka that the values of two sibling fields point to the same offset within the value.

The only requirement from what I can see is that the values pointed to by the value header are valid variants.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So if the slice is longer than

Incomplete sentence?

The only requirement from what I can see is that the values pointed to by the value header are valid variants.

It does seem that way, yes.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So if the slice is longer than

Incomplete sentence?

Sorry -- what I meant was "if the slice is longer than needed, any remaining byte will be ignored"

assert!(variant_object.is_empty());
}
#[test]
fn variant_object_primitive() {
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The point of the PR was to add these tests

Copy link
Owner

@scovich scovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding the integration test!

self.header.values_start_byte + start_offset
..self.header.values_start_byte + end_offset,
)?;
let value_bytes =
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. And the problem would arise if the "next" offset (by field name) corresponds to a sub-object that's physically earlier in the layout. So we can't compute the size of the sub-object using offsets. Unless we're willing to search/sort the whole offset list to find the upper bound (= 🙀).

I suppose we could at least limit the slice (in the common case) by using the "next" offset only when it's not smaller than the starting offset. In that case, it should be a safe upper bound. But I'm not sure how helpful it would actually be, given that no invariant is reliably enforced?

self.header.values_start_byte + start_offset
..self.header.values_start_byte + end_offset,
)?;
let value_bytes =
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually... a buggy/malicious variant buffer could potentially lead to some "fun" here, where one sub-object refers to bytes shared by another sub-object. Hopefully at least one of the "overlapping" sub-objects would be obviously invalid in a way we could detect, but there's no guarantee of that.

@scovich scovich merged commit 480ef5d into scovich:variant-object Jun 16, 2025
1 check failed
@alamb alamb deleted the alamb/variant-object-tests branch June 17, 2025 11:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants