Don't recurse to children in ArrayData::try_new #3248
Can you elaborate on why it is a footgun? I understand the performance argument
Perhaps footgun is the wrong word; it is a surprising performance hit that is impossible to work around without using unsafe. Recursive validation isn't necessary for soundness, and just increases the likelihood that people reach for unsafe, with all its potential implications 😅
I admit I don't follow all of the C++ implementation https://github.com/apache/arrow/blob/master/cpp/src/arrow/array/validate.cc#L314 but it looks like it does recursive child validation
I am not familiar with the IPC / flight implementations to know if they create ArrayData's unsafely
So basically I worry about this change as I don't fully understand the implications
Perhaps then an alternative might be to make
Edit: C++ will be performing recursive validation because it doesn't force validation of the child ArrayData before the parent can be constructed like we do.
Edit Edit: Once #3247 is merged there is no use of unsafe in arrow-flight, arrow-csv or arrow-ipc aside from the flatbuffer generated code.
I would be more comfortable with this (explicitly disabling recursive validation via
🎉
/// Validates that the null count is correct
pub fn validate_nulls(&self) -> Result<(), ArrowError> {
This is partly split out from #3244
///
/// This is equivalent to calling [`Self::validate_data`] on this [`ArrayData`]
/// and all its children recursively
pub fn validate_full(&self) -> Result<(), ArrowError> {
It is perhaps worth highlighting that nothing now calls validate_full outside of tests, however, I think providing a "break glass in case of funky validate all the things" option is a good thing to have
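To illustrate the "break glass" idea, here is a minimal sketch using a toy type. `ToyData`, `new_unchecked`, `declared_len` and `shallow_and_full_results` are hypothetical names invented for this example and are not part of arrow-rs; `new_unchecked` merely stands in for an `unsafe` constructor. It assumes a constructor that checks only its own level, so an invalid child created through the unchecked path goes unnoticed until a full recursive validation is requested explicitly.

```rust
/// Toy stand-in for nested array data; NOT arrow's `ArrayData`.
struct ToyData {
    declared_len: usize,
    values: Vec<i32>,
    children: Vec<ToyData>,
}

impl ToyData {
    /// Hypothetical stand-in for an `unsafe` constructor: performs no checks.
    fn new_unchecked(declared_len: usize, values: Vec<i32>, children: Vec<ToyData>) -> Self {
        ToyData { declared_len, values, children }
    }

    /// Checked constructor that validates this level only, not the children.
    fn try_new(declared_len: usize, values: Vec<i32>, children: Vec<ToyData>) -> Result<Self, String> {
        let data = Self::new_unchecked(declared_len, values, children);
        data.validate()?;
        Ok(data)
    }

    /// Checks this level's invariant: the declared length matches the values.
    fn validate(&self) -> Result<(), String> {
        if self.declared_len == self.values.len() {
            Ok(())
        } else {
            Err(format!(
                "declared length {} but found {} values",
                self.declared_len,
                self.values.len()
            ))
        }
    }

    /// Opt-in recursive check of this level and every descendant.
    fn validate_full(&self) -> Result<(), String> {
        self.validate()?;
        self.children.iter().try_for_each(|child| child.validate_full())
    }
}

/// Returns (shallow construction succeeded, full validation succeeded).
fn shallow_and_full_results() -> (bool, bool) {
    // The child is invalid but was created through the unchecked path.
    let bad_child = ToyData::new_unchecked(5, vec![1, 2], vec![]);
    match ToyData::try_new(1, vec![7], vec![bad_child]) {
        Ok(parent) => (true, parent.validate_full().is_ok()),
        Err(_) => (false, false),
    }
}
```

In this toy model the parent is accepted by the shallow constructor, and only the explicit recursive check reports the bad child.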
// We don't need to validate children as we can assume that the
// [`ArrayData`] in `child_data` have already been validated through
// a call to `ArrayData::try_new` or created using unsafe
new_self.validate_data()?;
👍
looks like a nice improvement
Benchmark runs are scheduled for baseline = 961e114 and contender = c5c34fa. c5c34fa is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Which issue does this PR close?
Closes #.
Rationale for this change
There is no safe way to construct an invalid ArrayData; additionally, there is no way to safely construct an ArrayData without performing validation.

The result is that when building nested data structures, the same ArrayData will be validated multiple times unnecessarily. I think this is a footgun that is best removed; if people really want to recursively validate, it is easy enough for them to do so explicitly.
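The repeated-validation cost can be sketched with a toy model. Everything below (`Node`, `try_new_recursive`, `try_new_shallow`, `count_validations`) is hypothetical and only mimics the shape of the problem; it is not the arrow-rs implementation. It assumes every level is built through a checked constructor and counts how many per-node validations run while building a chain of nested nodes bottom-up.

```rust
use std::cell::Cell;

/// Toy stand-in for a nested array; NOT arrow's `ArrayData`.
struct Node {
    len: usize,
    children: Vec<Node>,
}

impl Node {
    /// Validates this node only and records the call in `calls`.
    fn validate_self(&self, calls: &Cell<usize>) -> Result<(), String> {
        calls.set(calls.get() + 1);
        if self.children.iter().all(|c| c.len == self.len) {
            Ok(())
        } else {
            Err("child length does not match parent".to_string())
        }
    }

    /// Validates this node and all descendants.
    fn validate_full(&self, calls: &Cell<usize>) -> Result<(), String> {
        self.validate_self(calls)?;
        for child in &self.children {
            child.validate_full(calls)?;
        }
        Ok(())
    }

    /// Constructor that re-validates the whole subtree on every call.
    fn try_new_recursive(len: usize, children: Vec<Node>, calls: &Cell<usize>) -> Result<Node, String> {
        let node = Node { len, children };
        node.validate_full(calls)?;
        Ok(node)
    }

    /// Constructor that trusts already-constructed children.
    fn try_new_shallow(len: usize, children: Vec<Node>, calls: &Cell<usize>) -> Result<Node, String> {
        let node = Node { len, children };
        node.validate_self(calls)?;
        Ok(node)
    }
}

/// Builds a chain `depth` levels deep, innermost first, and returns
/// the total number of per-node validations that were performed.
fn count_validations(depth: usize, recursive: bool) -> usize {
    let calls = Cell::new(0);
    let mut children: Vec<Node> = Vec::new();
    for _ in 0..depth {
        let result = if recursive {
            Node::try_new_recursive(3, children, &calls)
        } else {
            Node::try_new_shallow(3, children, &calls)
        };
        children = vec![result.expect("toy data is valid")];
    }
    calls.get()
}
```

For a chain of depth 4, `count_validations(4, true)` performs 1 + 2 + 3 + 4 = 10 validations, while `count_validations(4, false)` performs 4: the recursive constructor grows quadratically with nesting depth in this model, the shallow one linearly.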
What changes are included in this PR?
Are there any user-facing changes?