Refine documentation to Array::is_null
#4838
arrow-array/src/array/mod.rs
Outdated
/// let array = NullArray::new(1);
/// assert_eq!(array.is_logical_null(0), true);
/// ```
fn is_logical_null(&self, index: usize) -> bool {
I think at the very least we should provide an efficient implementation of this, instead of computing logical_nulls which could be very expensive.
In general I am really not a fan of adding this method, it is a fairly major potential performance footgun
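To make the cost concern concrete, here is a minimal sketch on a toy dictionary-encoded type. `ToyDictionary` and all of its methods are hypothetical stand-ins written for illustration; this is not arrow-rs code and not necessarily how arrow-rs implements anything. It contrasts a per-row check that rebuilds the full logical null mask on every call with one that inspects only the requested row.

```rust
/// Toy stand-in for a dictionary-encoded array (hypothetical, not an arrow-rs type).
struct ToyDictionary {
    /// One key per row; `None` models a physically null key.
    keys: Vec<Option<usize>>,
    /// Dictionary values; an entry may itself be null.
    values: Vec<Option<i32>>,
}

impl ToyDictionary {
    /// Builds the logical null mask for every row: O(n) time plus an allocation.
    fn logical_nulls(&self) -> Vec<bool> {
        self.keys
            .iter()
            .map(|key| match key {
                None => true,
                Some(i) => self.values[*i].is_none(),
            })
            .collect()
    }

    /// Naive per-row check: recomputes the whole mask just to read one entry.
    fn is_logical_null_naive(&self, index: usize) -> bool {
        self.logical_nulls()[index]
    }

    /// Direct per-row check: looks only at the requested row, no allocation.
    /// Panics if `index` (or a stored key) is out of bounds.
    fn is_logical_null(&self, index: usize) -> bool {
        match self.keys[index] {
            None => true,
            Some(i) => self.values[i].is_none(),
        }
    }
}
```

Calling the naive variant inside a loop over all rows turns a linear scan into quadratic work, which is the kind of footgun described above.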
Removed
///
/// // NullArrays do not have a validity mask
/// let array = NullArray::new(1);
/// assert_eq!(array.is_null(0), false);
/// ```
fn is_null(&self, index: usize) -> bool {
I've said this before but I think from a user PoV, having `is_null` and `is_logical_null` is confusing as hell. Which NULL is `is_null`?! Yeah, historically this is the physical null but do most users really care about the physical repr.? I would argue that at least this method should be called `is_physical_null` to force users to think about what kind of null they want, instead of tricking them into using the wrong implicit default for their use case.
Are you suggesting that is_null should always return logical nullability? What about for RunArray where this would have `O(log(n))` complexity? What about null_count? The only consistent thing I can see is to only ever return physical nullability...
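For readers wondering where the `O(log(n))` comes from: in a run-end encoded layout, nullness lives in the values child, so answering "is logical row i null?" first requires locating the run that contains row i. The sketch below assumes a toy `ToyRunArray` type (hypothetical, not the arrow-rs `RunArray`) with cumulative, strictly increasing run ends, and finds the run with a binary search.

```rust
/// Toy stand-in for a run-end encoded array (hypothetical, not an arrow-rs type).
struct ToyRunArray {
    /// Cumulative, strictly increasing end offsets: run `k` covers
    /// logical rows `run_ends[k - 1]..run_ends[k]` (starting from 0).
    run_ends: Vec<usize>,
    /// One value per run; `None` marks a run of logical nulls.
    values: Vec<Option<i32>>,
}

impl ToyRunArray {
    /// Logical length is the last run end, or 0 when there are no runs.
    fn len(&self) -> usize {
        self.run_ends.last().copied().unwrap_or(0)
    }

    /// Logical null check: binary search over the run ends, O(log(runs)).
    fn is_logical_null(&self, index: usize) -> bool {
        assert!(index < self.len(), "index out of bounds");
        // Index of the first run whose end is strictly greater than `index`.
        let run = self.run_ends.partition_point(|&end| end <= index);
        self.values[run].is_none()
    }
}
```

With `run_ends = [2, 5, 6]` and `values = [Some(1), None, Some(3)]`, rows 2 through 4 fall in the second run and are logically null, even though the toy type stores no per-row validity mask at all.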
I suggest that `is_null` should be renamed to `is_physical_null` (potentially w/ a soft deprecation period) to avoid that users accidentally pick the wrong method.
You make a good point regarding `null_count`. My argument would be: rename that one as well, to `physical_null_count`. Then it's clear which semantic you're referring to.
Filed #4840 to track
@tustvold and I talked about this earlier today and what I suggest is:
Array::is_null
Mostly just some minor copy alterations, although one of the comments was incorrect
Co-authored-by: Raphael Taylor-Davies <[email protected]>
* Add documentation and Array::is_logical_null
* Remove code change, refine comments
* fix docs
* Apply suggestions from code review

Co-authored-by: Raphael Taylor-Davies <[email protected]>

* Fix link formatting

---------

Co-authored-by: Raphael Taylor-Davies <[email protected]>
Which issue does this PR close?
Closes #4835
Rationale for this change
The fact that `NullArray::is_null()` returns false is consistent with the technical definition of physical/logical nulls, but also deeply confusing to a casual user, as explained on #4835.
I don't think the implications of logical vs physical nullability are well understood by the arrow user community (and to be honest I am not sure they should be in most cases).
Thus helping them find the right API for what they want to do would be helpful.
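The physical/logical distinction can be sketched with a small toy model. `ToyArray` and its methods below are hypothetical and exist only for illustration; they mirror the names under discussion but are not the arrow-rs implementation. The physical check consults only a stored validity mask, so an all-null array that stores no mask reports `false`, while the logical check reports what a casual user would expect.

```rust
/// Toy model of two array layouts (hypothetical, not arrow-rs types).
enum ToyArray {
    /// Every row is null by definition, so no validity mask is stored.
    AllNull { len: usize },
    /// Values plus an optional validity mask (`true` means the row is valid).
    Primitive { values: Vec<i32>, validity: Option<Vec<bool>> },
}

impl ToyArray {
    /// Number of rows in the array.
    fn len(&self) -> usize {
        match self {
            ToyArray::AllNull { len } => *len,
            ToyArray::Primitive { values, .. } => values.len(),
        }
    }

    /// Physical null: answers only from the stored validity mask.
    fn is_null(&self, index: usize) -> bool {
        assert!(index < self.len(), "index out of bounds");
        match self {
            // No mask is stored, so there is nothing to report as null.
            ToyArray::AllNull { .. } => false,
            ToyArray::Primitive { validity, .. } => {
                validity.as_ref().map_or(false, |mask| !mask[index])
            }
        }
    }

    /// Logical null: answers whether the row's value is semantically null.
    fn is_logical_null(&self, index: usize) -> bool {
        assert!(index < self.len(), "index out of bounds");
        match self {
            ToyArray::AllNull { .. } => true,
            ToyArray::Primitive { .. } => self.is_null(index),
        }
    }
}
```

For `ToyArray::AllNull { len: 1 }`, `is_null(0)` is `false` while `is_logical_null(0)` is `true`; for the primitive layout the two checks agree.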
What changes are included in this PR?
* Documentation for the `NullArray` case (the one where I think it is the most deeply confusing, even though this does potentially apply to `DictionaryArray` and `RunArray`)
* `Array::is_logical_null`, which returns the logical nullability (mostly as a way to document the behavior and save downstream crates from having to handle `NullArray` specially)
Are there any user-facing changes?
New function and improved documentation
Questions
* `Array::is_logical_null`
* `Array::is_logical_valid` to mirror `Array::is_valid`?