[Variant] Avoid superfluous validation checks #7906


Merged (4 commits) on Jul 12, 2025
12 changes: 12 additions & 0 deletions parquet-variant/src/variant.rs
@@ -1199,8 +1199,20 @@ impl TryFrom<(i128, u8)> for Variant<'_, '_> {

#[cfg(test)]
mod tests {

use super::*;

#[test]
fn test_empty_variant_will_fail() {
let metadata = VariantMetadata::try_new(&[1, 0, 0]).unwrap();

let err = Variant::try_new_with_metadata(metadata, &[]).unwrap_err();

assert!(matches!(
err,
ArrowError::InvalidArgumentError(ref msg) if msg == "Received empty bytes"));
}

#[test]
fn test_construct_short_string() {
let short_string = ShortString::try_new("norm").expect("should fit in short string");
27 changes: 9 additions & 18 deletions parquet-variant/src/variant/list.rs
@@ -216,28 +216,19 @@ impl<'m, 'v> VariantList<'m, 'v> {
self.header.first_offset_byte() as _..self.first_value_byte as _,
)?;

let offsets =
map_bytes_to_offsets(offset_buffer, self.header.offset_size).collect::<Vec<_>>();

// Validate offsets are in-bounds and monotonically increasing.
// Since shallow verification checks whether the first and last offsets are in-bounds,
// we can also verify all offsets are in-bounds by checking if offsets are monotonically increasing.
let are_offsets_monotonic = offsets.is_sorted_by(|a, b| a < b);
if !are_offsets_monotonic {
return Err(ArrowError::InvalidArgumentError(
"offsets are not monotonically increasing".to_string(),
));
}

let value_buffer = slice_from_slice(self.value, self.first_value_byte as _..)?;

// Validate whether values are valid variant objects
for i in 1..offsets.len() {
let start_offset = offsets[i - 1];
let end_offset = offsets[i];

let value_bytes = slice_from_slice(value_buffer, start_offset..end_offset)?;
//
// Since we use offsets to slice into the value buffer, this also verifies all offsets are in-bounds
// and monotonically increasing
Contributor commented:
I don't quite follow how this check ensures the offsets are monotonically increasing

Is it because slice_from_slice requires the bounds to be increasing?

@friendlymatthew (Contributor, Author) replied on Jul 11, 2025:
Yes, this was my thinking

slice_from_slice returns an error if the slice range is invalid, and a slice range is invalid when the start offset is greater than the end offset.

So as we iterate through the offsets, we build the slice range offsets[i]..offsets[i + 1]. If every slice attempt succeeds, we can guarantee the offsets are non-decreasing.

We still need to check whether offsets[i] == offsets[i + 1] for any i. That is still a valid range, and slice_from_slice will return a valid slice (empty bytes).

This is the case where Variant::try_new_with_metadata returns an error.

#[test]
fn foo() {
    let metadata = VariantMetadata::try_new(&[1, 0, 0]).unwrap();

    // the following will panic:
    // called `Result::unwrap()` on an `Err` value: InvalidArgumentError("Received empty bytes")
    let v = Variant::try_new_with_metadata(metadata, &[]).unwrap();
}

Contributor Author replied:
Added the example above as a test case in a5a5e45

let mut offset_iter = map_bytes_to_offsets(offset_buffer, self.header.offset_size);
let mut current_offset = offset_iter.next().unwrap_or(0);

for next_offset in offset_iter {
let value_bytes = slice_from_slice(value_buffer, current_offset..next_offset)?;
Variant::try_new_with_metadata(self.metadata.clone(), value_bytes)?;
current_offset = next_offset;
}

self.validated = true;
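The pairwise-slicing argument above can be sketched in isolation. The following is a minimal standalone sketch, not the crate's actual code: `slice_from_slice` here is a simplified stand-in for the crate-internal helper. Slicing with `start..end` fails whenever `start > end`, so successfully slicing every adjacent offset pair proves the offsets are non-decreasing, while an equal pair yields an empty slice that deeper validation must reject separately.

```rust
use std::ops::Range;

// Simplified stand-in for the crate-internal `slice_from_slice` helper:
// `<[u8]>::get` returns `None` when `start > end` or `end > len`.
fn slice_from_slice(buf: &[u8], range: Range<usize>) -> Result<&[u8], String> {
    buf.get(range.clone())
        .ok_or_else(|| format!("invalid slice range {range:?}"))
}

fn main() {
    let value_buffer = [10u8, 20, 30, 40];

    // Non-decreasing offsets: every pairwise slice succeeds.
    let offsets = [0usize, 2, 2, 4];
    let mut current = offsets[0];
    for &next in &offsets[1..] {
        let value_bytes = slice_from_slice(&value_buffer, current..next).unwrap();
        if current == next {
            // Equal adjacent offsets form a valid range but produce empty
            // bytes, which the per-value variant validation must reject.
            assert!(value_bytes.is_empty());
        }
        current = next;
    }

    // A decreasing pair is rejected by the slice attempt itself,
    // so no separate monotonicity scan is needed.
    assert!(slice_from_slice(&value_buffer, 3..1).is_err());
}
```

This mirrors why the explicit `is_sorted_by` check could be dropped from the list validation path: every offset is exercised by a slice attempt anyway.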
25 changes: 15 additions & 10 deletions parquet-variant/src/variant/metadata.rs
@@ -237,22 +237,15 @@ impl<'m> VariantMetadata<'m> {
let offsets =
map_bytes_to_offsets(offset_bytes, self.header.offset_size).collect::<Vec<_>>();
Contributor commented on lines 237 to 238:
Are we still tracking a TODO to eliminate this materialization?
Once the comment below is addressed, I think it's the only one left.

Contributor Author replied:

We are tracking #7901


// Validate offsets are in-bounds and monotonically increasing.
// Since shallow validation ensures the first and last offsets are in bounds, we can also verify all offsets
// are in-bounds by checking if offsets are monotonically increasing.
let are_offsets_monotonic = offsets.is_sorted_by(|a, b| a < b);
if !are_offsets_monotonic {
return Err(ArrowError::InvalidArgumentError(
"offsets not monotonically increasing".to_string(),
));
}

// Verify the string values in the dictionary are UTF-8 encoded strings.
let value_buffer =
string_from_slice(self.bytes, 0, self.first_value_byte as _..self.bytes.len())?;

if self.header.is_sorted {
// Validate the dictionary values are unique and lexicographically sorted
//
// Since we use the offsets to access dictionary values, this also validates
// offsets are in-bounds and monotonically increasing
let are_dictionary_values_unique_and_sorted = (1..offsets.len())
.map(|i| {
let field_range = offsets[i - 1]..offsets[i];
@@ -268,6 +261,18 @@ impl<'m> VariantMetadata<'m> {
"dictionary values are not unique and ordered".to_string(),
));
}
} else {
// Validate offsets are in-bounds and monotonically increasing
//
// Since shallow validation ensures the first and last offsets are in bounds,
// we can also verify all offsets are in-bounds by checking if
// offsets are monotonically increasing
let are_offsets_monotonic = offsets.is_sorted_by(|a, b| a < b);
if !are_offsets_monotonic {
Contributor commented on lines +270 to +271:
not sure the extra let is helpful?

Suggested change:
- let are_offsets_monotonic = offsets.is_sorted_by(|a, b| a < b);
- if !are_offsets_monotonic {
+ if !offsets.is_sorted_by(|a, b| a < b) {

return Err(ArrowError::InvalidArgumentError(
"offsets not monotonically increasing".to_string(),
));
}
}

self.validated = true;
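The sorted-dictionary branch above relies on the same principle: because every dictionary value is accessed through its offset range, a decreasing offset pair fails at the slice, and the strict `<` comparison rejects duplicates as well as misordering. A minimal standalone sketch (the function name and `Option`-based error handling here are simplifications, not the crate's API):

```rust
// Probing each dictionary value through its offset range means an
// out-of-bounds or decreasing offset pair fails the slice, so the
// bounds/monotonicity checks come for free; the strict `<` comparison
// then rejects duplicate and misordered values.
fn dictionary_values_unique_and_sorted(buf: &[u8], offsets: &[usize]) -> bool {
    let values: Option<Vec<&str>> = offsets
        .windows(2)
        .map(|pair| {
            let bytes = buf.get(pair[0]..pair[1])?; // None if invalid range
            std::str::from_utf8(bytes).ok() // values must be valid UTF-8
        })
        .collect();
    match values {
        Some(v) => v.is_sorted_by(|a, b| a < b),
        None => false,
    }
}

fn main() {
    let buf = b"applebananacherry";
    assert!(dictionary_values_unique_and_sorted(buf, &[0, 5, 11, 17]));
    // Duplicate values ("apple", "apple") fail the strict ordering check.
    assert!(!dictionary_values_unique_and_sorted(b"appleapple", &[0, 5, 10]));
    // A decreasing offset pair fails at the slice, not in a separate scan.
    assert!(!dictionary_values_unique_and_sorted(buf, &[0, 11, 5]));
}
```

This is why the PR moves the explicit monotonicity check into the `else` branch only: the sorted path already exercises every offset.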
15 changes: 2 additions & 13 deletions parquet-variant/src/variant/object.rs
@@ -242,6 +242,8 @@ impl<'m, 'v> VariantObject<'m, 'v> {
} else {
// The metadata dictionary can't guarantee uniqueness or sortedness, so we have to parse out the corresponding field names
// to check lexicographical order
//
// Since we are probing the metadata dictionary by field id, this also verifies field ids are in-bounds
let are_field_names_sorted = field_ids
.iter()
Contributor commented on lines 247 to 248:
We only make a single pass now, so we no longer need to collect field ids into a vec. The only non-trivial tweak is to request the last field id specifically for the field id bounds check -- O(1) cost, so no need to materialize a whole vec just for that.

While you're at it, consider replacing the collect + is_sorted pair just below with just Iterator::is_sorted?

Contributor Author replied:
Hi, after we create an iterator of offsets, we immediately split into two different validation paths. I'm a bit unsure how to best handle this while avoiding the allocation.

I opened #7901 to track areas where we can avoid materialization.

.map(|&i| self.metadata.get(i))
@@ -253,19 +255,6 @@ impl<'m, 'v> VariantObject<'m, 'v> {
"field names not sorted".to_string(),
));
}

// Since field ids are not guaranteed to be sorted, scan over all field ids
// and check that field ids are less than dictionary size

let are_field_ids_in_bounds = field_ids
.iter()
.all(|&id| id < self.metadata.dictionary_size());

if !are_field_ids_in_bounds {
return Err(ArrowError::InvalidArgumentError(
"field id is not valid".to_string(),
));
}
}

// Validate whether values are valid variant objects
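The deleted bounds scan in object.rs follows the same pattern: resolving every field id through a fallible dictionary lookup already bounds-checks each id, so a separate `id < dictionary_size()` pass is redundant. A standalone sketch under stated assumptions: the `Metadata` struct and its `get` method here are hypothetical simplifications of the crate's metadata dictionary, not its real API.

```rust
// Hypothetical stand-in for the crate's metadata dictionary; `get` is a
// simplified fallible lookup by field id.
struct Metadata {
    dictionary: Vec<String>,
}

impl Metadata {
    fn get(&self, field_id: usize) -> Result<&str, String> {
        self.dictionary
            .get(field_id)
            .map(String::as_str)
            .ok_or_else(|| format!("field id {field_id} out of bounds"))
    }
}

// Resolving every field id through `get` bounds-checks each id as a side
// effect, so no separate in-bounds scan over the field ids is needed.
fn field_names_sorted(metadata: &Metadata, field_ids: &[usize]) -> Result<bool, String> {
    let names = field_ids
        .iter()
        .map(|&id| metadata.get(id))
        .collect::<Result<Vec<_>, _>>()?;
    Ok(names.is_sorted_by(|a, b| a < b))
}

fn main() {
    let metadata = Metadata {
        dictionary: vec!["a".into(), "b".into(), "c".into()],
    };
    assert_eq!(field_names_sorted(&metadata, &[0, 1, 2]), Ok(true));
    assert_eq!(field_names_sorted(&metadata, &[0, 2, 1]), Ok(false));
    // An out-of-bounds field id surfaces as an error from the lookup itself.
    assert!(field_names_sorted(&metadata, &[0, 9]).is_err());
}
```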