Don't recurse to children in ArrayData::try_new #3248

Merged
merged 3 commits into from
Dec 1, 2022

Conversation

tustvold
Contributor

Which issue does this PR close?

Closes #.

Rationale for this change

There is no safe way to construct an invalid ArrayData. However, there is also no way to safely construct an ArrayData without performing validation.

The result is that when building nested data structures, the same ArrayData is validated multiple times unnecessarily. I think this is a footgun that is best removed. If people really want to validate recursively, it is easy enough for them to do so explicitly.
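To make the cost concrete, here is a minimal sketch that counts validations while a nested structure is built bottom-up. `Node`, `try_new_recursive`, `try_new_shallow` and `count_validations` are hypothetical names for this illustration only; this is a toy model, not the arrow `ArrayData` implementation.

```rust
use std::cell::Cell;

thread_local! {
    // Counts how many single-node validations have run (illustration only).
    static VALIDATIONS: Cell<usize> = Cell::new(0);
}

/// Hypothetical stand-in for a nested array; not the arrow `ArrayData` type.
pub struct Node {
    len: usize,
    children: Vec<Node>,
}

impl Node {
    /// Checks only this node: every direct child must have the same length.
    fn validate_self(&self) -> Result<(), String> {
        VALIDATIONS.with(|count| count.set(count.get() + 1));
        if self.children.iter().all(|child| child.len == self.len) {
            Ok(())
        } else {
            Err(format!("child length does not match parent length {}", self.len))
        }
    }

    /// Checks this node and then every descendant.
    fn validate_recursive(&self) -> Result<(), String> {
        self.validate_self()?;
        self.children.iter().try_for_each(|child| child.validate_recursive())
    }

    /// Constructor that re-validates the whole subtree on every call.
    pub fn try_new_recursive(len: usize, children: Vec<Node>) -> Result<Node, String> {
        let node = Node { len, children };
        node.validate_recursive()?;
        Ok(node)
    }

    /// Constructor that validates only the node being created.
    pub fn try_new_shallow(len: usize, children: Vec<Node>) -> Result<Node, String> {
        let node = Node { len, children };
        node.validate_self()?;
        Ok(node)
    }
}

/// Builds a chain of `depth` nested nodes bottom-up and returns how many
/// single-node validations ran in total.
pub fn count_validations(depth: usize, recursive: bool) -> usize {
    VALIDATIONS.with(|count| count.set(0));
    let build = |children: Vec<Node>| {
        if recursive {
            Node::try_new_recursive(3, children)
        } else {
            Node::try_new_shallow(3, children)
        }
    };
    let mut children = Vec::new();
    for _ in 0..depth {
        let node = build(children).expect("lengths match by construction");
        children = vec![node];
    }
    VALIDATIONS.with(|count| count.get())
}
```

In this model a chain of depth 4 costs 1 + 2 + 3 + 4 = 10 validations with the recursive constructor and 4 with the shallow one, because every level re-checks everything beneath it.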

What changes are included in this PR?

Are there any user-facing changes?

@tustvold tustvold requested a review from alamb November 30, 2022 18:41
@github-actions github-actions bot added the arrow Changes to the arrow crate label Nov 30, 2022
Contributor

@alamb alamb left a comment

Can you elaborate on why it is a footgun? I understand the performance argument

@tustvold
Contributor Author

Perhaps footgun is the wrong word. It is a surprising performance hit that is impossible to work around without using unsafe. Recursive validation isn't necessary for soundness, and it just increases the likelihood that people reach for unsafe, with all its potential implications 😅

Contributor

@alamb alamb left a comment

I admit I don't follow all of the C++ implementation (https://github.com/apache/arrow/blob/master/cpp/src/arrow/array/validate.cc#L314), but it looks like it does recursive child validation.

I am not familiar enough with the IPC / flight implementations to know if they create ArrayData unsafely.

So basically I worry about this change as I don't fully understand the implications.

@tustvold
Contributor Author

tustvold commented Nov 30, 2022

Perhaps then an alternative might be to make ArrayData::try_new not recursively validate? Basically I want a safe API to construct an ArrayData that doesn't validate the ArrayData I have already safely constructed, and therefore validated

Edit: C++ will be performing recursive validation because it doesn't force validation of the child ArrayData before the parent can be constructed like we do
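The argument that a child must already be validated before its parent can be built can be sketched with private fields and a validating constructor. Everything here (`checked`, `Offsets`, `List`, and their methods) is a hypothetical illustration and not an arrow type or API: the parent constructor checks only its own invariant and relies on the child's invariant having been established when the child was created.

```rust
mod checked {
    /// Hypothetical validated offsets buffer (illustration only).
    pub struct Offsets {
        // Private field: code outside this module cannot bypass `try_new`
        // without going through the `unsafe` constructor.
        values: Vec<usize>,
    }

    impl Offsets {
        /// Safe constructor: succeeds only if the offsets are non-decreasing.
        pub fn try_new(values: Vec<usize>) -> Result<Self, String> {
            if values.windows(2).all(|w| w[0] <= w[1]) {
                Ok(Self { values })
            } else {
                Err("offsets must be non-decreasing".to_string())
            }
        }

        /// # Safety
        /// The caller must guarantee that `values` is non-decreasing.
        pub unsafe fn new_unchecked(values: Vec<usize>) -> Self {
            Self { values }
        }

        /// Last offset, or 0 when there are no offsets.
        pub fn last(&self) -> usize {
            self.values.last().copied().unwrap_or(0)
        }
    }

    /// Hypothetical parent that owns already-validated offsets.
    pub struct List {
        offsets: Offsets,
        child_len: usize,
    }

    impl List {
        /// Checks only the parent's own invariant (offsets stay within the
        /// child). It does not re-check that the offsets are non-decreasing:
        /// any `Offsets` value that exists was either validated by `try_new`
        /// or vouched for by an `unsafe` caller.
        pub fn try_new(offsets: Offsets, child_len: usize) -> Result<Self, String> {
            let last = offsets.last();
            if last <= child_len {
                Ok(Self { offsets, child_len })
            } else {
                Err(format!("last offset {last} exceeds child length {child_len}"))
            }
        }

        /// Number of list entries described by the offsets.
        pub fn len(&self) -> usize {
            self.offsets.values.len().saturating_sub(1)
        }

        /// Length of the child values this list points into.
        pub fn child_len(&self) -> usize {
            self.child_len
        }
    }
}
```

Under these assumptions, `checked::Offsets::try_new(vec![0, 2, 1])` is rejected before a `List` can ever see it, so `List::try_new` has nothing to gain from repeating that check.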

Edit Edit: Once #3247 is merged there is no use of unsafe in arrow-flight, arrow-csv or arrow-ipc aside from the flatbuffer generated code.

@alamb
Contributor

alamb commented Nov 30, 2022

> Perhaps then an alternative might be to make ArrayData::try_new not recursively validate? Basically I want a safe API to construct an ArrayData that doesn't validate the ArrayData I have already safely constructed, and therefore validated

I would be more comfortable with this (explicitly disabling recursive validation via ArrayData::try_new()) when we can clearly apply the argument that ArrayData is always validated (either explicitly or via unsafe)

> Edit Edit: Once #3247 is merged there is no use of unsafe in arrow-flight, arrow-csv or arrow-ipc aside from the flatbuffer generated code.

🎉

@tustvold tustvold changed the title from "Don't recurse to children in ArrayData::validate_full" to "Don't recurse to children in ArrayData::try_new" Nov 30, 2022

/// Validates that the null count is correct
pub fn validate_nulls(&self) -> Result<(), ArrowError> {
Contributor Author

This is partly split out from #3244

///
/// This is equivalent to calling [`Self::validate_data`] on this [`ArrayData`]
/// and all its children recursively
pub fn validate_full(&self) -> Result<(), ArrowError> {
Contributor Author

It is perhaps worth highlighting that nothing now calls validate_full outside of tests, however, I think providing a "break glass in case of funky validate all the things" option is a good thing to have
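As a rough illustration of when an explicit recursive check still earns its keep, the sketch below builds an invalid child through an unchecked constructor and shows that only the full check notices it. `Nested` and its methods are hypothetical stand-ins that borrow the names `validate_data` and `validate_full` from this PR; they are a toy model, not the arrow implementation.

```rust
/// Hypothetical stand-in for a nested array; not the arrow `ArrayData` type.
pub struct Nested {
    len: usize,
    null_count: usize,
    children: Vec<Nested>,
}

impl Nested {
    /// Checks only this node's own invariant: nulls cannot exceed the length.
    pub fn validate_data(&self) -> Result<(), String> {
        if self.null_count <= self.len {
            Ok(())
        } else {
            Err(format!(
                "null count {} exceeds length {}",
                self.null_count, self.len
            ))
        }
    }

    /// Explicit opt-in check of this node and all of its descendants.
    pub fn validate_full(&self) -> Result<(), String> {
        self.validate_data()?;
        self.children.iter().try_for_each(Nested::validate_full)
    }

    /// Safe constructor: validates the new node only, not its children.
    pub fn try_new(
        len: usize,
        null_count: usize,
        children: Vec<Nested>,
    ) -> Result<Nested, String> {
        let node = Nested { len, null_count, children };
        node.validate_data()?;
        Ok(node)
    }

    /// # Safety
    /// The caller must guarantee that `null_count <= len`.
    pub unsafe fn new_unchecked(
        len: usize,
        null_count: usize,
        children: Vec<Nested>,
    ) -> Nested {
        Nested { len, null_count, children }
    }
}
```

In this model, a parent created with `try_new` around a child from `new_unchecked(2, 5, vec![])` passes `validate_data` but fails `validate_full`, which is the kind of situation a test or a debugging session would want to catch.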

// We don't need to validate children as we can assume that the
// [`ArrayData`] in `child_data` have already been validated through
// a call to `ArrayData::try_new` or created using unsafe
new_self.validate_data()?;
Contributor

👍

@alamb
Contributor

alamb commented Dec 1, 2022

looks like a nice improvement

@tustvold tustvold merged commit c5c34fa into apache:master Dec 1, 2022
@ursabot

ursabot commented Dec 1, 2022

Benchmark runs are scheduled for baseline = 961e114 and contender = c5c34fa. c5c34fa is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
