[Variant] Avoid superfluous validation checks #7906
Conversation
Force-pushed from 1f61faf to b48ae4d
Force-pushed from b48ae4d to a15e0e1
Thanks @friendlymatthew -- I have one question but otherwise looks great
let value_bytes = slice_from_slice(value_buffer, start_offset..end_offset)?;
//
// Since we use offsets to slice into the value buffer, we can also verify all offsets are in-bounds
// and monotonically increasing
I don't quite follow how this check ensures the offsets are monotonically increasing.
Is it because `slice_from_slice` requires the bounds to be increasing?
Yes, this was my thinking: `slice_from_slice` will err if the slice range is invalid. A slice range is invalid when the start offset is greater than the end offset.

So when we iterate through the offsets, we build our slice range `offsets[i]..offsets[i + 1]`. If every slice attempt is successful, we can guarantee the offsets are non-decreasing.

We still need to check if `offsets[i] == offsets[i + 1]` for any `i`. This is still a valid range, and `slice_from_slice` will return a valid slice (empty bytes). This is when `Variant::try_new_with_metadata` will err.
#[test]
fn foo() {
    let metadata = VariantMetadata::try_new(&[1, 0, 0]).unwrap();
    // the following will panic:
    // called `Result::unwrap()` on an `Err` value: InvalidArgumentError("Received empty bytes")
    let v = Variant::try_new_with_metadata(metadata, &[]).unwrap();
}
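For the monotonicity argument itself, here is a minimal standalone sketch (a hypothetical helper with the semantics described above, not the crate's actual code) of how slicing every `offsets[i]..offsets[i + 1]` range rejects decreasing offsets as a side effect:

```rust
use std::ops::Range;

// Hypothetical stand-in for the crate's slicing helper: slicing fails when
// start > end or when end is past the buffer, so a successful slice proves
// the range is well-formed and in bounds.
fn slice_from_slice(bytes: &[u8], range: Range<usize>) -> Result<&[u8], String> {
    bytes
        .get(range.clone())
        .ok_or_else(|| format!("invalid slice range {range:?}"))
}

fn check_offsets(value_buffer: &[u8], offsets: &[usize]) -> Result<(), String> {
    for pair in offsets.windows(2) {
        // If offsets[i] > offsets[i + 1], the slice errs, so every success
        // proves the offsets are non-decreasing and in bounds.
        let value_bytes = slice_from_slice(value_buffer, pair[0]..pair[1])?;
        // Equal offsets still slice successfully but yield empty bytes; that
        // case is rejected separately, as in the test above.
        if value_bytes.is_empty() {
            return Err("Received empty bytes".to_string());
        }
    }
    Ok(())
}
```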
Added the example above as a test case in a5a5e45
Thank you @friendlymatthew
let are_offsets_monotonic = offsets.is_sorted_by(|a, b| a < b);
if !are_offsets_monotonic {
not sure the extra `let` is helpful?

Suggested change:
- let are_offsets_monotonic = offsets.is_sorted_by(|a, b| a < b);
- if !are_offsets_monotonic {
+ if !offsets.is_sorted_by(|a, b| a < b) {
let are_field_names_sorted = field_ids
    .iter()
We only make a single pass now, so we no longer need to collect field ids into a vec. The only non-trivial tweak is to request the last field id specifically for the field id bounds check -- O(1) cost, so no need to materialize a whole vec just for that.

While you're at it, consider replacing the `collect` + `is_sorted` pair just below with just `Iterator::is_sorted`?
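For illustration only, a standalone sketch (with a hypothetical field id iterator, not the crate's actual code) of the difference between collecting first and using `Iterator::is_sorted` directly:

```rust
// Hypothetical stand-in for the field id sequence; names are illustrative.
fn field_ids_are_sorted(field_ids: impl Iterator<Item = u32>) -> bool {
    // Before: materialize a Vec just to check ordering.
    // let ids: Vec<u32> = field_ids.collect();
    // ids.is_sorted()

    // After: Iterator::is_sorted (Rust 1.82+) checks adjacent pairs while
    // streaming, so no intermediate Vec is allocated.
    field_ids.is_sorted()
}
```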
Hi, after we create an iterator of offsets, we immediately split into two different validation paths. I'm a bit unsure how to best handle this while avoiding the allocation.
I opened #7901 to track areas where we can avoid materialization.
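For context, a minimal standalone sketch (hypothetical names, not the crate's code) of why the offsets currently get collected: the same sequence feeds two separate validation paths, and a plain iterator can only be consumed once.

```rust
// Hypothetical sketch: `offsets_iter` and `value_buffer_len` are illustrative
// names, not the actual variables in the PR.
fn validate_offsets(
    offsets_iter: impl Iterator<Item = usize>,
    value_buffer_len: usize,
) -> Result<(), String> {
    // Materialize once so both checks below can see the offsets.
    let offsets: Vec<usize> = offsets_iter.collect();

    // Path 1: offsets must be strictly increasing.
    if !offsets.is_sorted_by(|a, b| a < b) {
        return Err("offsets are not monotonically increasing".into());
    }

    // Path 2: the final offset must stay inside the value buffer.
    if offsets.last().is_some_and(|&last| last > value_buffer_len) {
        return Err("offset out of bounds".into());
    }
    Ok(())
}
```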
let offsets =
    map_bytes_to_offsets(offset_bytes, self.header.offset_size).collect::<Vec<_>>();
Are we still tracking a TODO to eliminate this materialization?
Once the comment below is addressed, I think it's the only one left.
We are tracking #7901
Force-pushed from 2650ea0 to b7ee880
LGTM, looking forward to the upcoming materialization fixes as well.
Thanks again @friendlymatthew and @scovich
Which issue does this PR close?
Rationale for this change
We can avoid certain checks in the validation code since other checks already guarantee these invariants for us.
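As a hedged illustration of the idea (hypothetical code, not the actual change in this PR): once a successful slice already proves a range is valid and in bounds, a separate explicit check for the same invariant is superfluous.

```rust
// Hypothetical example; names and error strings are illustrative.
fn read_value(buffer: &[u8], start: usize, end: usize) -> Result<&[u8], String> {
    // Superfluous: the slice below already fails when `end > buffer.len()`
    // or `start > end`, so an explicit up-front bounds check adds nothing.
    //
    // if start > end || end > buffer.len() {
    //     return Err("value range out of bounds".into());
    // }

    buffer
        .get(start..end)
        .ok_or_else(|| "invalid value range".to_string())
}
```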