[Variant] Avoid superfluous validation checks #7906
```diff
@@ -237,22 +237,15 @@ impl<'m> VariantMetadata<'m> {
         let offsets =
             map_bytes_to_offsets(offset_bytes, self.header.offset_size).collect::<Vec<_>>();

-        // Validate offsets are in-bounds and monotonically increasing.
-        // Since shallow validation ensures the first and last offsets are in bounds, we can also verify all offsets
-        // are in-bounds by checking if offsets are monotonically increasing.
-        let are_offsets_monotonic = offsets.is_sorted_by(|a, b| a < b);
-        if !are_offsets_monotonic {
-            return Err(ArrowError::InvalidArgumentError(
-                "offsets not monotonically increasing".to_string(),
-            ));
-        }
-
         // Verify the string values in the dictionary are UTF-8 encoded strings.
         let value_buffer =
             string_from_slice(self.bytes, 0, self.first_value_byte as _..self.bytes.len())?;

         if self.header.is_sorted {
             // Validate the dictionary values are unique and lexicographically sorted
+            //
+            // Since we use the offsets to access dictionary values, this also validates
+            // offsets are in-bounds and monotonically increasing
             let are_dictionary_values_unique_and_sorted = (1..offsets.len())
                 .map(|i| {
                     let field_range = offsets[i - 1]..offsets[i];
```

Comment on lines 237 to 238:

> Are we still tracking a TODO to eliminate this materialization?

> We are tracking #7901
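For context on the materialization being discussed: the `collect::<Vec<_>>()` on line 238 buffers the lazily decoded offsets into an allocation so both validation paths can index them. A rough illustration of the trade-off (the closure below is only a stand-in for `map_bytes_to_offsets`, whose internals are not part of this diff):

```rust
fn main() {
    let offset_bytes = [0u8, 3, 5, 9];
    // Stand-in for map_bytes_to_offsets: lazily decode one offset per byte.
    let decode = || offset_bytes.iter().map(|&b| b as usize);

    // Materialized: one allocation, but the offsets can then be indexed
    // repeatedly (offsets[i - 1]..offsets[i]), as the sorted path does.
    let offsets: Vec<usize> = decode().collect();
    assert!(offsets.is_sorted_by(|a, b| a < b));

    // Streamed: no allocation, but a single pass consumes the iterator;
    // issue #7901 tracks restructuring validation so this suffices.
    assert!(decode().is_sorted_by(|a, b| a < b));
}
```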
```diff
@@ -268,6 +261,18 @@ impl<'m> VariantMetadata<'m> {
                     "dictionary values are not unique and ordered".to_string(),
                 ));
             }
-        }
+        } else {
+            // Validate offsets are in-bounds and monotonically increasing
+            //
+            // Since shallow validation ensures the first and last offsets are in bounds,
+            // we can also verify all offsets are in-bounds by checking if
+            // offsets are monotonically increasing
+            let are_offsets_monotonic = offsets.is_sorted_by(|a, b| a < b);
+            if !are_offsets_monotonic {
+                return Err(ArrowError::InvalidArgumentError(
+                    "offsets not monotonically increasing".to_string(),
+                ));
+            }
+        }

         self.validated = true;
```

Comment on lines +270 to +271:

> not sure the extra
>
> Suggested change
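Both hunks lean on the same invariant: shallow validation has already bounds-checked the first and last offsets, so strict monotonicity alone proves every intermediate offset is in bounds. A minimal standalone sketch of that argument (illustrative only, not the crate's code; treating the first offset as 0 is this sketch's assumption):

```rust
/// Sketch: given that the first and last offsets are known to be in
/// bounds, strictly increasing offsets imply all offsets are in bounds.
fn offsets_all_in_bounds(offsets: &[usize], value_region_len: usize) -> bool {
    // Shallow-validation stand-in: first and last offsets are in bounds.
    let shallow_ok = offsets
        .first()
        .zip(offsets.last())
        .is_some_and(|(&first, &last)| first == 0 && last <= value_region_len);

    // Strictly increasing offsets are each sandwiched between the valid
    // first and last offsets, so no per-offset bounds check is needed.
    shallow_ok && offsets.is_sorted_by(|a, b| a < b)
}

fn main() {
    assert!(offsets_all_in_bounds(&[0, 3, 5, 9], 9));
    assert!(!offsets_all_in_bounds(&[0, 5, 3, 9], 9)); // not monotonic
    assert!(!offsets_all_in_bounds(&[0, 3, 5, 12], 9)); // last out of bounds
}
```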
```diff
@@ -242,6 +242,8 @@ impl<'m, 'v> VariantObject<'m, 'v> {
         } else {
             // The metadata dictionary can't guarantee uniqueness or sortedness, so we have to parse out the corresponding field names
             // to check lexicographical order
+            //
+            // Since we are probing the metadata dictionary by field id, this also verifies field ids are in-bounds
             let are_field_names_sorted = field_ids
                 .iter()
                 .map(|&i| self.metadata.get(i))
```

Comment on lines 247 to 248:

> We only make a single pass now, so we no longer need to collect field ids into a vec. The only non-trivial tweak is to request the last field id specifically for the field id bounds check -- O(1) cost, so no need to materialize a whole vec just for that. While you're at it, consider replacing the

> Hi, after we create an iterator of offsets, we immediately split into two different validation paths. I'm a bit unsure how to best handle this while avoiding the allocation. I opened #7901 to track areas where we can avoid materialization.

```diff
@@ -253,19 +255,6 @@ impl<'m, 'v> VariantObject<'m, 'v> {
                     "field names not sorted".to_string(),
                 ));
             }
-
-            // Since field ids are not guaranteed to be sorted, scan over all field ids
-            // and check that field ids are less than dictionary size
-
-            let are_field_ids_in_bounds = field_ids
-                .iter()
-                .all(|&id| id < self.metadata.dictionary_size());
-
-            if !are_field_ids_in_bounds {
-                return Err(ArrowError::InvalidArgumentError(
-                    "field id is not valid".to_string(),
-                ));
-            }
         }

         // Validate whether values are valid variant objects
```
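A sketch of the single-pass shape the reviewer seems to be suggesting, where the dictionary probe doubles as the bounds check so no separate scan over field ids is needed (the `get` callback and `String` error type are simplified stand-ins, not the crate's actual API):

```rust
// Sketch: validate lexicographic order of field names while streaming
// field ids, without collecting the ids into a Vec first. Because every
// id is probed against the dictionary, out-of-bounds ids fail here too.
fn are_field_names_sorted<'a>(
    field_ids: impl IntoIterator<Item = usize>,
    get: impl Fn(usize) -> Result<&'a str, String>,
) -> Result<bool, String> {
    let mut prev: Option<&str> = None;
    for id in field_ids {
        let name = get(id)?; // probing by field id is the bounds check
        if prev.is_some_and(|p| p >= name) {
            return Ok(false); // not strictly increasing
        }
        prev = Some(name);
    }
    Ok(true)
}

fn main() {
    let dict = ["a", "b", "c"];
    let get = |id: usize| {
        dict.get(id)
            .copied()
            .ok_or_else(|| "field id is not valid".to_string())
    };
    assert_eq!(are_field_names_sorted([0, 1, 2], get), Ok(true));
    assert_eq!(are_field_names_sorted([0, 2, 1], get), Ok(false));
    assert!(are_field_names_sorted([0, 9], get).is_err()); // out of bounds
}
```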
Comment thread:

> I don't quite follow how this check ensures the offsets are monotonically increasing. Is it because `slice_from_slice` requires the bounds to be increasing?

> Yes, this was my thinking. `slice_from_slice` will err if the slice range is invalid. A slice range is invalid when the start offset is greater than the end offset. So when we iterate through the offsets, we build our slice range `offsets[i]..offsets[i + 1]`. If every slice attempt is successful, we can guarantee the offsets are non-decreasing.
>
> We still need to check if `offsets[i] == offsets[i + 1]` for any `i`. This is still a valid range, and `slice_from_slice` will return a valid slice (empty bytes). This is when `Variant::try_new_with_metadata` will err.

> Added the example above as a test case in a5a5e45
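The slicing semantics this thread relies on match standard Rust range indexing with `get`, as this small snippet shows (plain slices rather than the crate's `slice_from_slice` helper):

```rust
fn main() {
    let bytes = [1u8, 2, 3, 4];

    // start > end is an invalid range, so slicing fails. This is why a
    // pass that successfully builds every offsets[i]..offsets[i + 1]
    // slice proves the offsets are non-decreasing.
    assert!(bytes.get(3..1).is_none());

    // start == end is still a valid range and yields an empty slice, so
    // duplicate offsets pass the slicing step and must be rejected later
    // (per the thread, when Variant::try_new_with_metadata runs).
    assert_eq!(bytes.get(2..2), Some(&[][..]));

    // end > len is also rejected, which supplies the in-bounds guarantee.
    assert!(bytes.get(2..9).is_none());
}
```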