-
-
Notifications
You must be signed in to change notification settings - Fork 18k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG (string dtype): looking up missing value in the Index fails #59879
Comments
PR #60329 fixes the main bug mentioned above ( But besides fixing the most obvious bug, there is still the question about how to deal with backwards compatibility. >>> pd.options.future.infer_string = False
# this shows `get_loc`, but this then applies to `ser[None]` or `None in idx` etc
>>> pd.Index(["a", "b", None]).get_loc(None)
2 because the index is inferred to be However, with the future string dtype enabled, we no longer infer it as object dtype, but as string dtype and at that point coerce any missing values (None, pd.NA) to NaN. As a result, if we are strict about the index lookup (only the exact sentinel gives a match), the same example as above but with the future string dtype enabled will fail: >>> pd.options.future.infer_string = True
>>> pd.Index(["a", "b", None]).get_loc(None)
...
KeyError: None This is a backwards incompatible change for people that were using To preserve backwards compatibility, I think it would make sense to also coerce missing values to NaN as input for the indexer ( To make it more complex: first, how missing value sentinels are matched in indexing operations is quite inconsistent right now across dtypes (i.e. some other dtypes like datetime-like and categorical also coerce any missing value sentinel, and so are not strict), see #59765 for a detailed overview. Secondly, the above discussion is for scalar lookup ( >>> pd.options.future.infer_string = True
>>> idx = pd.Index(["a", "b", None]) # gets inferred as `str` dtype
>>> idx.get_indexer(["a", None])
array([0, 2]) Here we also infer cc @pandas-dev/pandas-core |
In the very long term, I think only I'm +1 on special casing |
This isn't how that works today though, right? >>> pd.options.future.infer_string = False
>>> pd.Index(["a", "b", np.nan]).get_loc(None)
KeyError: None I think there is something that is going to be lost in translation either way when going from object-backed storage to a real string data type, but I think it is the easiest for users to understand if we are more strict about how missing values are handled |
It depends on how you look at it. For object dtype we don't do any coercing of the input in constructors, so we also don't do that for the input for the indexer ( It is true that whatever choice we make, there is some change in behaviour ( And in addition, keeping |
Ah this is a great point. OK I am fine with continuing along that path then. Long term, it might get tricky with wanting to store None and missing values alongside one another in an object index, but that probably makes sense to solve alongside with the disambiguation of np.nan and pd.NA with floating point dtypes that has been discussed in PDEP 16 |
#59765 is the general issue about consistency in index lookups for missing values, but shorter term, we at least need to fix being able to lookup missing values for the future string dtype.
With object dtype, this works fine:
But when enabling the string dtype, neither None or np.nan work as key:
The text was updated successfully, but these errors were encountered: