Physical null and logical null are confusing concepts #4840

Open
alamb opened this issue Sep 19, 2023 · 8 comments
Labels
enhancement Any new improvement worthy of a entry in the changelog

Comments

@alamb
Contributor

alamb commented Sep 19, 2023

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
The arrow-rs library (now) makes a distinction between physical nulls and logical nulls, as the same distinction is made in the Arrow specification (though the terms physical and logical nulls are not used, to my knowledge)

The issue is that for certain array types computing if an element is null is very fast (consult a pre-existing bitmap) but for others it can be quite slow (e.g. a dictionary where both the keys and values must be consulted for nullness)

The method named Array::is_null returns the (fast) physical nullness, but is deeply confusing for certain types -- see #4835 and #4838 (comment) from @crepererum.

We have tried to clarify the difference in #4838 but it is still confusing
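
To make the distinction concrete, here is a minimal std-only sketch of a dictionary-style array. The types `ToyPrimitive` and `ToyDictionary` and the method names are hypothetical stand-ins for illustration, not the arrow-rs API; the sketch only assumes the layout described above, where keys carry their own validity mask and dictionary entries may themselves be null.

```rust
/// Toy stand-in for a primitive array: values plus an optional validity mask.
/// `validity[i] == false` means slot `i` is null. (Hypothetical type, not arrow-rs.)
struct ToyPrimitive {
    values: Vec<i32>,
    validity: Option<Vec<bool>>,
}

impl ToyPrimitive {
    fn is_null(&self, i: usize) -> bool {
        // No mask at all means every slot is valid.
        self.validity.as_ref().map_or(false, |v| !v[i])
    }
}

/// Toy stand-in for a dictionary array: each key indexes into `values`.
struct ToyDictionary {
    keys: ToyPrimitive,   // keys carry their own validity mask
    values: ToyPrimitive, // dictionary entries may themselves be null
}

impl ToyDictionary {
    /// Physical nullness: consults only the key mask (cheap).
    fn is_physical_null(&self, i: usize) -> bool {
        self.keys.is_null(i)
    }

    /// Logical nullness: a slot is also null when its key points at a null entry.
    fn is_logical_null(&self, i: usize) -> bool {
        self.keys.is_null(i) || self.values.is_null(self.keys.values[i] as usize)
    }
}

/// Slot 0 -> entry 0 (valid), slot 1 -> entry 1 (null entry), slot 2 -> null key.
fn example_dictionary() -> ToyDictionary {
    ToyDictionary {
        keys: ToyPrimitive {
            values: vec![0, 1, 0],
            validity: Some(vec![true, true, false]),
        },
        values: ToyPrimitive {
            values: vec![10, 0],
            validity: Some(vec![true, false]),
        },
    }
}
```

In this model slot 1 is the confusing case: its key is valid, so the physical check reports "not null", while the logical check reports "null" because the referenced dictionary entry is null.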

Describe the solution you'd like
I am not sure -- @crepererum suggests in #4838 (comment)

> I would argue that at least this method should be called is_physical_null to force users to think about what kind of null they want, instead of tricking them into using the wrong implicit default for their use case.

However, there are downsides to this too

Describe alternatives you've considered
The documentation changes may be enough, but I think the issue is important enough to track here

Additional context

@alamb alamb added the enhancement Any new improvement worthy of a entry in the changelog label Sep 19, 2023
@waynexia
Member

I'd like to offer some thoughts from a user's perspective. When processing data where null is very common, it's natural to look for a way to reduce the memory those null values consume. Since we know that part of the data is missing, we can of course optimize based on that.

But that seems difficult at present. I can't use NullArray, not only because of the type problem and the "logical or physical" problem, but also because of the parquet side, which requires the array to have the same type as the schema. And in this scenario a Null type never occurs -- some other parts will have data. Here some data are "logical null", but I can't say whether they are "physical null" (or should I even consider it?).

(BTW, if I want to write this part of the data to parquet, or pass it around or compute on it under a given schema, I can only build a corresponding array and fill in None one by one. This is costly compared to how a NullArray works.)

Combining whether the type is null with whether the value is null, we get four (!!) types of null. When the type is null, a test function like is_null() gives true when the value is present and is null (a), and gives false when the value is missing (b). And when the type is something else, a null value of course is_null() (c) and a non-null value is not is_null() (d). Please correct me if this is wrong.

Listing them out raises some questions:

  • Is it really necessary to distinguish cases (a) and (b)? I have to introduce a new word, "present", to describe the difference.
  • Comparing cases (a) and (c), does it mean we have a fifth type of null, where the type is not null but the value "is not present"?
  • A null value should be a wildcard value, as it can fit into other types (case c). This is done by letting None be a valid value for an array.
  • We should have two kinds of null array: one for (a) and (b), where the type is null, and another for (c), where the array is a compound array.

Physical and logical null are truly confusing. But is there any way to make it intuitive and easy to use 🤔

@alamb
Contributor Author

alamb commented Jan 24, 2024

> Physical and logical null are truly confusing.

I agree @waynexia

I don't quite follow your examples with 4 different types of null.

If you need to quickly create an array that represents entirely null of a single type, there is new_null_array, but as you seem to imply, that certainly results in a larger array than strictly necessary

It is fast to check whether an array contains only nulls by using Array::nulls, I think like:

let num_nulls = array.nulls().map(|nulls| nulls.null_count()).unwrap_or(0);
let is_all_null = num_nulls == array.len();

Though as the docs say, this isn't correct for Dictionary or REE arrays
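
A small std-only sketch of why the physical check falls short for run-end encoded data. `ToyRunArray` and its methods are hypothetical and only model the idea that the run array itself carries no validity mask while its per-run values may be null; it is not the arrow-rs `RunArray` API.

```rust
/// Toy run-end-encoded array: run `r` covers logical slots
/// `run_ends[r - 1]..run_ends[r]`, with an implicit start of 0.
/// (Hypothetical type for illustration, not arrow-rs.)
struct ToyRunArray {
    run_ends: Vec<usize>,
    values: Vec<Option<i32>>, // one entry per run; `None` is a null run
}

impl ToyRunArray {
    /// Logical length is the end of the last run.
    fn len(&self) -> usize {
        self.run_ends.last().copied().unwrap_or(0)
    }

    /// The array itself has no validity mask in this model,
    /// so a physical null count is always zero.
    fn physical_null_count(&self) -> usize {
        0
    }

    /// Logical null count: total length of the runs whose value is null.
    fn logical_null_count(&self) -> usize {
        let mut start = 0;
        let mut nulls = 0;
        for (end, value) in self.run_ends.iter().zip(&self.values) {
            if value.is_none() {
                nulls += *end - start;
            }
            start = *end;
        }
        nulls
    }
}

/// An array of `len` logical nulls stored as a single null run.
fn all_null_run_array(len: usize) -> ToyRunArray {
    ToyRunArray {
        run_ends: vec![len],
        values: vec![None],
    }
}
```

For `all_null_run_array(1000)`, the physical count is 0 and the logical count is 1000, so an "is everything null?" test built on the physical count would answer no for an array that is logically all null.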

@waynexia
Member

> If you need to quickly create an array that represents entirely null of a single type, there is new_null_array, but as you seem to imply, that certainly results in a larger array than strictly necessary

Yes, I'm considering the "strictly necessary" consumption. The API could be worked out later. new_null_array, which then calls ArrayData::new_null, allocates memory for those null values, as it's a normal array. That allocation is unnecessary if we can rely on the precondition that "all the values are null", as with NullArray or RunArray.
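
A back-of-the-envelope sketch of the cost being discussed, assuming a fixed-width Int32 layout with one 4-byte value slot and one validity bit per element. The function names are hypothetical, and the estimate deliberately ignores buffer padding, alignment, and per-array metadata.

```rust
use std::mem::size_of;

/// Approximate bytes for a materialized all-null Int32 array:
/// a 4-byte value slot per element plus one validity bit per element.
/// (Rough estimate; ignores padding, alignment, and metadata.)
fn materialized_null_i32_bytes(len: usize) -> usize {
    len * size_of::<i32>() + (len + 7) / 8
}

/// Bytes needed if "all values are null" is known up front and
/// only the length has to be remembered.
fn length_only_bytes() -> usize {
    size_of::<usize>()
}
```

Under these assumptions a one-million-element all-null Int32 array costs about 4.1 MB when materialized, whereas a representation that only records the length stays constant in size regardless of `len`.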

> I don't quite follow your examples with 4 different types of null.

Sorry for my unclear description 🥲 Let me try again. Here is a table of those four types:

|               | type is NULL | other types |
|---------------|--------------|-------------|
| value is NULL | (a)          | (c)         |
| other values  | (b)          | (d)         |

Arrays created by new_null_array() are (c) and other normal arrays are (d). The definitions of these two types are clear IMO.

The problem is (a) and (b). NullArray is closer to (b) judging by its API behavior: NullArray::is_null() always returns false, implying the type is NULL but there are values. And NullArray::null_count() is also always 0. But since we can't get a value from it as we can from other arrays, we don't know exactly what the value is.

What confuses me is that NullArray::data_type gives Null. This (to my understanding) conflicts with the behavior above. What should we return if we implement something like NullArray::value() that meets both requirements (the type is Null and is_null() is false)? This might be the place to introduce "physical" and "logical" null.

From my perspective as a user, an array like (a) might be more straightforward and intuitive. (b), on the contrary, seems to suit fewer cases.

In many scenarios I have met, RunArray works more like the NullArray I expect -- I can give it either a Null or an Int32 data type, and retrieve values by index, even though is_null() / nulls() / is_valid() etc. behave the same as on NullArray.

@waynexia
Member

This might be a little off-topic. I'd like to propose a new array similar to RunArray that accepts only one value at construction, like SingularArray. The difference from RunArray is that logic such as value() can be simplified, and it can behave differently on null-related APIs (e.g., override the default impl of is_valid()/is_null()). And it can express both (a) & (b) (though I think it's not necessary to distinguish these two types...).

I don't know if it's still an option to not have "logical null" and "physical null". Maybe overriding is_valid() and is_null() could help a little? Adding a new array means lots of work, and I'm not sure if this is viable, so please let me know your thoughts.
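
One way to picture the proposal, as a std-only sketch. `SingularArray` does not exist in arrow-rs; the shape below, including every method name and the choice of `Option<T>` for the single value, is an assumption made purely for illustration.

```rust
/// Hypothetical constant array: one optional value repeated `len` times.
/// (Illustrative only; no such type exists in arrow-rs.)
struct SingularArray<T> {
    value: Option<T>,
    len: usize,
}

impl<T: Clone> SingularArray<T> {
    fn new(value: Option<T>, len: usize) -> Self {
        Self { value, len }
    }

    fn len(&self) -> usize {
        self.len
    }

    /// Nullness is answered directly from the single stored value,
    /// so there is no separate physical and logical answer.
    fn is_null(&self, i: usize) -> bool {
        assert!(i < self.len, "index out of bounds");
        self.value.is_none()
    }

    fn is_valid(&self, i: usize) -> bool {
        !self.is_null(i)
    }

    /// Either every slot is null or none is.
    fn null_count(&self) -> usize {
        if self.value.is_none() {
            self.len
        } else {
            0
        }
    }

    /// Every in-bounds index yields the same value.
    fn value(&self, i: usize) -> Option<T> {
        assert!(i < self.len, "index out of bounds");
        self.value.clone()
    }
}
```

With this shape, `SingularArray::<i32>::new(None, 5)` reports five nulls without allocating per-element storage, and `SingularArray::new(Some(7), 3)` reports none.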

@alamb
Contributor Author

alamb commented Feb 26, 2024

> This might be a little off-topic. I'd like to propose a new array similar to RunArray that accepts only one value at construction, like SingularArray. The difference from RunArray is that logic such as value() can be simplified, and it can behave differently on null-related APIs (e.g., override the default impl of is_valid()/is_null()). And it can express both (a) & (b) (though I think it's not necessary to distinguish these two types...).

This sounds very much like what datafusion ScalarValue and Datum are designed to do 🤔 -- I bet there would be interest on the arrow mailing list as well (there may have been prior discussion about it too)

> I don't know if it's still an option to not have "logical null" and "physical null". Maybe overriding is_valid() and is_null() could help a little? Adding a new array means lots of work, and I'm not sure if this is viable, so please let me know your thoughts.

I think the logical and physical nulls refer to how the nulls are encoded in the Arrow arrays themselves. Given that this library is designed as a low level API for Arrow arrays, I believe the rationale is that exposing the null buffers directly as arrow encodes them provides the most control

I made #5434 to try and clarify this even more

@tv42

tv42 commented Dec 6, 2024

I just lost an hour to this. The current API is miserable, and relying on users to notice the caveat about physical null in the docs under the prefix "Note: For performance reasons" will not work. You've documented correctness in a subsection about performance! Conflating what is stored with what it means is a huge logical error.

https://ozlabs.org/~rusty/index.cgi/tech/2008-03-30.html

(Also, https://docs.rs/arrow/latest/arrow/datatypes/enum.DataType.html#method.is_null is also a miserable method name using a third definition of what "null" means.)

@alamb
Contributor Author

alamb commented Dec 6, 2024

We would welcome any suggestions on how to improve the situation. I agree it is confusing (hence this ticket)

@tv42

tv42 commented Dec 6, 2024

Yes. I'm keeping notes on my experiences and hope to suggest some stability-breaking changes once I feel like I understand enough. However, I have a datafusion-based OLTP database tantalizingly close to public alpha so I'm a bit distracted by that shiny goal ;-)

3 participants