-
Notifications
You must be signed in to change notification settings - Fork 839
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Physical null and logical null are confusing concepts #4840
Comments
I'd like to provide some though from a user perspective. When processing data where null is very common, it's natural to looking for a way to reduce the comsuption of those null values. As it's known that some part of the data is missing and we of cause can make optimization based on that. But I find it seems to be difficult at present. I can't use (BTW, if I want write this part of data to parquet, or passing/compute it under a given schema, I can only build a corresponding array, and fills From whether the type is null and whether the value is null, we can give four (!!) types of null. When the type is null, test function like By listing them down, some questions come to my mind:
Physical and logical null are truly confusing. But it there any way to make it intuitive and easy to use 🤔 |
I agree @waynexia I don't quite follow your examples with 4 different types of null. If you need to quickly create an array that represents entirely null of a single type, there is It is fast to check for an array that contains only nulls by using let num_nulls = array.nulls().map(|nulls| nulls.null_count()).unwrap_or(0)
let is_all_null = num_nulls = array.len(); Though as the docs say, this isn't correct for Dictionary or REE arrays |
Yes, I'm considering the "strictly necessary" consumption. API could be something formed later.
Sorry for my unclear description 🥲 Let me make another try. Here is a table of those four types:
Arrays created by The problem is What confuses me is From my perspective as a user, an array like In many scenarios I met, |
This might be a little off-topic. I'd like to propose a new array similar to I don't know if it's still an option to not have "logical null" and "physical null". Maybe overriding |
This sounds very much like what datafusion
I think the I made #5434 to try and clarify this even more |
I just lost an hour to this. The current API is miserable, and relying on users to notice the caveat about physical null in the docs under the prefix "Note: For performance reasons" will not work. You've documented correctness in a subsection about performance! Conflating what is stored with what it means is a huge logical error. https://ozlabs.org/~rusty/index.cgi/tech/2008-03-30.html (Also, https://docs.rs/arrow/latest/arrow/datatypes/enum.DataType.html#method.is_null is also a miserable method name using a third definition of what "null" means.) |
We would welcome any suggestions on how to improve the situation. I agree it is confusing (hence this ticket) |
Yes. I'm keeping notes on my experiences and hope to suggest some stability-breaking changes once I feel like I understand enough. However, I have a datafusion-based OLTP database tantalizingly close to public alpha so I'm a bit distracted by that shiny goal ;-) |
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
The arrow-rs library (now) makes a distinction between physical nulls and logical nulls as the same distinction is made in the Arrow specification (though the terms physical and logical nulls are not used, to my knoweldge)
The issue is that for certain array types computing if an element is very fast (consult a pre-existing bitmap) but for others can be quite slow (e.g. a dictionary where both the keys and values must be consulted for nullness)
The method named
Array::is_null
returns the (fast) physical nullness, but is deeply confusing for for certain types -- see #4835 and #4838 (comment) from @crepererum.We have tried to clarify the difference in #4838 but it is still confusing
Describe the solution you'd like
I am not sure -- @crepererum suggests in #4838 (comment)
However, there are downsides to this too
Describe alternatives you've considered
The documentation changes may be enough, but I think the issue is important enough to track here
Additional context
The text was updated successfully, but these errors were encountered: