Add variant_interop tests for objects and lists/arrays #1
Conversation
```rust
self.header.values_start_byte + start_offset
    ..self.header.values_start_byte + end_offset,
)?;
let value_bytes =
```
The spec says that the offsets may be non-monotonically increasing, so the correct slice is all the subsequent bytes (even though fewer may be used):
https://github.com/apache/parquet-format/blob/master/VariantEncoding.md#encoding-types
This implies that the field_offset values may not be monotonically increasing. For example, for the following object:
If I didn't make this change, the tests failed their assertions.
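To illustrate why, here is a minimal standalone sketch (a hypothetical helper, not the actual crate code) of slicing to the end of the values buffer when the offsets are not monotonic:

```rust
/// Hypothetical helper: return the bytes for field `i` of an object whose
/// `field_offset` list is NOT guaranteed to be monotonically increasing.
/// The only safe upper bound is the end of the values buffer; the value's
/// own header determines how many of those bytes are actually used.
fn field_value_slice<'a>(values: &'a [u8], offsets: &[usize], i: usize) -> &'a [u8] {
    &values[offsets[i]..]
}

fn main() {
    // Two int8 primitives (header 0x0C = type_id 3 << 2 | basic_type 0).
    // Field 0 physically lives *after* field 1, so offsets are non-monotonic.
    let values = [0x0Cu8, 42, 0x0C, 7];
    let offsets = [2usize, 0];
    assert_eq!(field_value_slice(&values, &offsets, 0), &[0x0C, 7]);
    // Field 1's slice runs to the buffer end; the trailing bytes are ignored.
    assert_eq!(field_value_slice(&values, &offsets, 1), &[0x0C, 42, 0x0C, 7]);
}
```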
Good catch. And the problem would arise if the "next" offset (by field name) corresponds to a sub-object that's physically earlier in the layout. So we can't compute the size of the sub-object using offsets. Unless we're willing to search/sort the whole offset list to find the upper bound (= 🙀).
I suppose we could at least limit the slice (in the common case) by using the "next" offset only when it's not smaller than the starting offset. In that case, it should be a safe upper bound. But I'm not sure how helpful it would actually be, given that no invariant is reliably enforced?
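That idea might look something like this sketch (hypothetical function, not the actual implementation):

```rust
/// Hypothetical sketch: use the next field's offset as the slice end only
/// when it is not smaller than this field's start; otherwise (non-monotonic
/// offsets, or the last field) fall back to the end of the buffer.
fn bounded_value_slice<'a>(values: &'a [u8], start: usize, next: Option<usize>) -> &'a [u8] {
    match next {
        Some(end) if end >= start => &values[start..end],
        _ => &values[start..],
    }
}

fn main() {
    let buf = [10u8, 20, 30, 40];
    // Monotonic case: the next offset is a safe upper bound.
    assert_eq!(bounded_value_slice(&buf, 1, Some(3)), &[20, 30]);
    // Non-monotonic case: fall back to the buffer end.
    assert_eq!(bounded_value_slice(&buf, 2, Some(1)), &[30, 40]);
}
```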
Actually... a buggy/malicious variant buffer could potentially lead to some "fun" here, where one sub-object refers to bytes shared by another sub-object. Hopefully at least one of the "overlapping" sub-objects would be obviously invalid in a way we could detect, but there's no guarantee of that.
Bonus points if somebody can craft an overlapping pair of sub-objects that are completely valid. I suspect with nested objects it should be possible -- one sub-object would seem to have the other sub-object as both a child and a sibling.
> I suppose we could at least limit the slice (in the common case) by using the "next" offset only when it's not smaller than the starting offset. In that case, it should be a safe upper bound. But I'm not sure how helpful it would actually be, given that no invariant is reliably enforced?
Yeah, and I am not sure what limiting the value slice would achieve anyways -- all the code that interprets variant values looks at the value header to know how many bytes of the value to look at. So if the slice is longer than
> Actually... a buggy/malicious variant buffer could potentially lead to some "fun" here, where one sub-object refers to bytes shared by another sub-object. Hopefully at least one of the "overlapping" sub-objects would be obviously invalid in a way we could detect, but there's no guarantee of that.
FWIW I don't think the spec prevents variant values from being reused (aka that the values of two sibling fields point to the same offset within the value). The only requirement from what I can see is that the values pointed to by the value header are valid variants.
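For example (illustrative bytes, assuming the int8 primitive header byte 0x0C), two sibling fields can point at the same offset and each still decode to a perfectly valid variant:

```rust
/// Hypothetical helper: decode every field of an object as an int8 by
/// checking each value's header byte (0x0C) and reading its payload.
fn decode_int8_fields(values: &[u8], field_offsets: &[usize]) -> Vec<i8> {
    field_offsets
        .iter()
        .map(|&off| {
            assert_eq!(values[off], 0x0C); // each field must be a valid int8
            values[off + 1] as i8
        })
        .collect()
}

fn main() {
    // One int8 value: header 0x0C (type_id 3 << 2 | basic_type 0), payload 42.
    let values = [0x0Cu8, 42];
    // Two sibling fields both point at offset 0, sharing the same bytes --
    // nothing here violates the "must be a valid variant" requirement.
    assert_eq!(decode_int8_fields(&values, &[0, 0]), vec![42, 42]);
}
```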
> So if the slice is longer than

Incomplete sentence?

> The only requirement from what I can see is that the values pointed to by the value header are valid variants.

It does seem that way, yes.
> > So if the slice is longer than
>
> Incomplete sentence?

Sorry -- what I meant was "if the slice is longer than needed, any remaining bytes will be ignored."
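A tiny sketch of that behavior (hypothetical decoder, again assuming the int8 primitive header byte 0x0C): the header tells the reader how many bytes the value occupies, so an over-long slice is harmless.

```rust
/// Hypothetical decoder: the header byte implies the value's length, so
/// any trailing bytes beyond the value are simply never read.
fn read_int8(bytes: &[u8]) -> Option<i8> {
    match bytes {
        // 0x0C = (type_id 3 << 2) | basic_type 0, a primitive int8:
        // exactly one payload byte follows; anything after is ignored.
        [0x0C, v, ..] => Some(*v as i8),
        _ => None,
    }
}

fn main() {
    assert_eq!(read_int8(&[0x0C, 7]), Some(7));
    // Same result with trailing garbage: the slice is longer than needed.
    assert_eq!(read_int8(&[0x0C, 7, 0xFF, 0xAA]), Some(7));
}
```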
```rust
    assert!(variant_object.is_empty());
}

#[test]
fn variant_object_primitive() {
```
The point of the PR was to add these tests
Thanks for adding the integration test!
Which issue does this PR close?
Rationale for this change
This PR adds end-to-end integration tests for @scovich's PR.
When implementing variant object support, it is important to make sure we can read what Spark wrote, so I updated the variant_interop test to do so.
What changes are included in this PR?
Are there any user-facing changes?