Skip to content

[Variant] Avoid superflous validation checks #7906

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
Jul 12, 2025

Conversation

friendlymatthew
Copy link
Contributor

Which issue does this PR close?

Rational for this change

We can avoid certain checks in the validation code since other checks already guarantee these invariants for us

@friendlymatthew friendlymatthew changed the title [Variant] Avoid superflous checks [Variant] Avoid superflous validation checks Jul 11, 2025
@friendlymatthew friendlymatthew force-pushed the friendlymatthew/avoid-superflous-checks branch from 1f61faf to b48ae4d Compare July 11, 2025 15:45
@github-actions github-actions bot added the parquet Changes to the parquet crate label Jul 11, 2025
@friendlymatthew friendlymatthew force-pushed the friendlymatthew/avoid-superflous-checks branch from b48ae4d to a15e0e1 Compare July 11, 2025 15:53
@friendlymatthew
Copy link
Contributor Author

cc @scovich @alamb

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @friendlymatthew -- I have one question but otherwise looks great

let value_bytes = slice_from_slice(value_buffer, start_offset..end_offset)?;
//
// Since we use offsets to slice into the value buffer, we can also verify all offsets are in-bounds
// and monotonically increasing
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't quite follow how this check ensures the offsets are monotonically increasing

Is it because slice_from_slice requires the bounds to be increasing?

Copy link
Contributor Author

@friendlymatthew friendlymatthew Jul 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this was my thinking

slice_from_slice will err if the slice range is invalid. A slice range is invalid when the start offset is greater than the end offset.

So when we iterate through the offsets, we build our slice range offsets[i]..offsets[i + 1]. If every slice attempt is successful, we can guarantee offsets are non-decreasing.

We still need to check if offsets[i] == offsets[i + 1] for any i. This is still a valid range and slice_from_slice will return a valid slice (empty bytes).

This is when the Variant::try_new_with_metadata will err.

 #[test]
fn foo() {
    let metadata = VariantMetadata::try_new(&[1, 0, 0]).unwrap();

    // the following will panic:
    // called `Result::unwrap()` on an `Err` value: InvalidArgumentError("Received empty bytes")
    let v = Variant::try_new_with_metadata(metadata, &[]).unwrap();
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added the example above as a test case in a5a5e45

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @friendlymatthew

Comment on lines +268 to +269
let are_offsets_monotonic = offsets.is_sorted_by(|a, b| a < b);
if !are_offsets_monotonic {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure the extra let is helpful?

Suggested change
let are_offsets_monotonic = offsets.is_sorted_by(|a, b| a < b);
if !are_offsets_monotonic {
if !offsets.is_sorted_by(|a, b| a < b) {

Comment on lines 247 to 248
let are_field_names_sorted = field_ids
.iter()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We only make a single pass now, so we no longer need to collect field ids into a vec. The only non-trivial tweak is to request the last field id specifically for the field id bounds check -- O(1) cost, so no need to materialize a whole just vec for that.

While you're at it, consider replacing the collect + is_sorted pair just below with just Iterator::is_sorted?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, after we create an iterator of offsets, we immediately split into two different validation paths. I'm a bit unsure how to best handle this while avoiding the allocation.

I opened #7901 to track areas where we can avoid materialization.

Comment on lines 237 to 238
let offsets =
map_bytes_to_offsets(offset_bytes, self.header.offset_size).collect::<Vec<_>>();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we still tracking a TODO to eliminate this materialization?
Once the comment below is addressed, I think it's the only one left.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are tracking #7901

@friendlymatthew friendlymatthew force-pushed the friendlymatthew/avoid-superflous-checks branch from 2650ea0 to b7ee880 Compare July 11, 2025 21:18
Copy link
Contributor

@scovich scovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, looking forward to the upcoming materialization fixes as well.

@alamb alamb merged commit d8b5ef7 into apache:main Jul 12, 2025
12 checks passed
@alamb
Copy link
Contributor

alamb commented Jul 12, 2025

Thanks again @friendlymatthew and @scovich

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Variant] Remove superfluous check when validating monotonic offsets
3 participants