Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve Array Logical Nullability #4691

Merged
merged 1 commit into from
Aug 14, 2023

Conversation

tustvold
Copy link
Contributor

@tustvold tustvold commented Aug 12, 2023

Which issue does this PR close?

Closes #4690
Closes #4689
Closes #4688
Closes #2687
Closes #4616
Relates to #4565
Relates to #4835

Rationale for this change

Various types have nullability that is not accurately reflected in their reported NullBuffer, this causes incorrect behaviour for various kernels that have special handling of the null buffers.

What changes are included in this PR?

Adds two methods to Array:

  • Array::logical_nulls - which computes the logical null mask
  • Array::is_nullable - which only returns false if an array cannot contain nulls

This logical null mask can then be used in the places where the logic relies on logical nullability

Are there any user-facing changes?

@tustvold tustvold added the api-change Changes to the arrow API label Aug 12, 2023
@github-actions github-actions bot added parquet Changes to the parquet crate arrow Changes to the arrow crate labels Aug 12, 2023
@@ -311,11 +311,7 @@ pub fn not(left: &BooleanArray) -> Result<BooleanArray, ArrowError> {
/// assert_eq!(a_is_null, BooleanArray::from(vec![false, false, true]));
/// ```
pub fn is_null(input: &dyn Array) -> Result<BooleanArray, ArrowError> {
let values = match input.nulls() {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We no longer need to special case NullArray, whilst also now correctly handling DictionaryArray with value nulls

@@ -1253,4 +1286,20 @@ mod tests {
assert_eq!(v, expected, "{idx}");
}
}

#[test]
fn test_iterator_nulls() {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Test for #4616

/// are present. For example a [`DictionaryArray`] with nullable values will still return true,
/// even if the nulls present in [`DictionaryArray::values`] are not referenced by any key,
/// and therefore would not appear in [`Array::logical_nulls`].
fn is_nullable(&self) -> bool {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I debated calling this method is_required instead, but figured it was better to stick with consistent terminology, and this matches the method on Field

@@ -107,22 +109,12 @@ impl Array for NullArray {
None
}

/// Returns whether the element at `index` is null.
/// All elements of a `NullArray` are always null.
fn is_null(&self, _index: usize) -> bool {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a breaking change, although I expect the impact to minimal. I debated making is_null and friends always return logical nullability, but this would have been inconsistent with ArrayData and generally seemed like it might just cause more confusion


#[test]
#[cfg(feature = "dyn_cmp_dict")]
fn test_dictionary_nested_nulls() {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a test for #4688

@@ -4016,4 +4016,30 @@ mod tests {
vec![None, None, None, Some(5.1), Some(5.1), Some(3.0), Some(1.2)],
);
}

#[test]
fn test_lexicographic_comparator_null_dict_values() {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Test for #2687

}

#[test]
fn test_null_dictionary_values() {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Test for #4690

@tustvold tustvold force-pushed the logical-nullability branch from ab28cfe to 721d65e Compare August 12, 2023 14:03
@tustvold tustvold force-pushed the logical-nullability branch from 721d65e to 55a420b Compare August 12, 2023 14:04
@@ -277,6 +317,10 @@ impl Array for ArrayRef {
self.as_ref().nulls()
}

fn logical_nulls(&self) -> Option<NullBuffer> {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I considered having this eagerly constructed and stored on the arrays, but I felt this would be more surprising as it would make the arrays diverge from the arrow specification. Whilst this does raise the prospect of computing this NullBuffer multiple times, the cases where this is required are fairly rare. We can always optimise in future should this become a problem

@alamb alamb changed the title Logical Nullability Improve Array Logical Nullability Aug 13, 2023
fn nulls(&self) -> Option<&NullBuffer>;

/// Returns the logical null buffer of this array if any
///
/// In most cases this will be the same as [`Array::nulls`], except for:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think Array::nulls should be deprecated and a non-deprecated version physical_nulls should be introduced. Users will misuse nulls because:

  • most people will read over the small note within the doc string
  • nulls is easier to find than logical_nulls
  • there are two options of "which null", and we offer an implicit default (because it is not prefixed with a semantic) that is clearly the wrong one for many use cases

Copy link
Contributor Author

@tustvold tustvold Aug 14, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whilst I do sort of agree, I'm not sure this wouldn't just cause more confusion, as this isn't a distinction found in the arrow spec. Would we also deprecate is_null, is_valid, etc... I think the current API hasn't been a major issue thus far, as the cases where it matters are rare.

FWIW once we have StringView I think there is basically no reason to use DictionaryArray or RunArray

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api-change Changes to the arrow API arrow Changes to the arrow crate parquet Changes to the parquet crate
Projects
None yet
2 participants