Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG (string dtype): looking up missing value in the Index fails #59879

Open
Tracked by #54792
jorisvandenbossche opened this issue Sep 23, 2024 · 5 comments · May be fixed by #60454
Open
Tracked by #54792

BUG (string dtype): looking up missing value in the Index fails #59879

jorisvandenbossche opened this issue Sep 23, 2024 · 5 comments · May be fixed by #60454
Labels
Bug Index Related to the Index class or subclasses Strings String extension data type and string data
Milestone

Comments

@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Sep 23, 2024

#59765 is the general issue about consistency in index lookups for missing values, but shorter term, we at least need to fix being able to lookup missing values for the future string dtype.

With object dtype, this works fine:

>>> idx = pd.Index(["a", "b", None])
>>> idx.get_loc(None)
2

But when enabling the string dtype, neither None or np.nan work as key:

>>> pd.options.future.infer_string = True
>>> idx = pd.Index(["a", "b", None])
>>> idx.get_loc(None)
...
KeyError: None
>>> idx.get_loc(np.nan)
...
KeyError: nan
@jorisvandenbossche
Copy link
Member Author

PR #60329 fixes the main bug mentioned above (idx.get_loc(np.nan) not working for str dtype which uses NaN for missing values).

But besides fixing the most obvious bug, there is still the question about how to deal with backwards compatibility.
Currently, this works:

>>> pd.options.future.infer_string = False
# this shows `get_loc`, but this then applies to `ser[None]` or `None in idx` etc
>>> pd.Index(["a", "b", None]).get_loc(None)
2

because the index is inferred to be object dtype, and None is preserved as is, so looking up None works.

However, with the future string dtype enabled, we no longer infer it as object dtype, but as string dtype and at that point coerce any missing values (None, pd.NA) to NaN. As a result, if we are strict about the index lookup (only the exact sentinel gives a match), the same example as above but with the future string dtype enabled will fail:

>>> pd.options.future.infer_string = True
>>> pd.Index(["a", "b", None]).get_loc(None)
...
KeyError: None

This is a backwards incompatible change for people that were using None in string-like object dtype Index, and we do have some tests that are currently xfailed because of this.

To preserve backwards compatibility, I think it would make sense to also coerce missing values to NaN as input for the indexer (get_loc(indexer)), like we do it for input in the constructors. Or at least do this for None given this is often used as missing value sentinel in object dtype columns.

To make it more complex: first, how missing value sentinels are matched in indexing operations is quite inconsistent right now across dtypes (i.e. some other dtypes like datetime-like and categorical also coerce any missing value sentinel, and so are not strict), see #59765 for a detailed overview.

Secondly, the above discussion is for scalar lookup (get_loc), but we also have get_indexer. When passing a list-like to get_indexer, we do coerce the list to an Index object, and at that point None will also get coerced already:

>>> pd.options.future.infer_string = True
>>> idx = pd.Index(["a", "b", None])  # gets inferred as `str` dtype
>>> idx.get_indexer(["a", None])
array([0, 2])

Here we also infer ["a", None] as str dtype, and so looking up the None value works. While idx.get_loc(None) would not work if we don't relax the rule about how to match the missing value sentinel in the scalar case, creating an inconsistency between get_loc and get_indexer.

cc @pandas-dev/pandas-core

@rhshadrach
Copy link
Member

In the very long term, I think only pd.NA should be treated as a "dtype-idependent missing value" and None should always be Python's None, but that is not the current behavior today and I do not think we should tackle this until may of the other issues with missing values are sorted out.

I'm +1 on special casing None as proposed above, that pd.Index(["a", "b", None]).get_loc(None) should not raise.

@WillAyd
Copy link
Member

WillAyd commented Nov 26, 2024

To preserve backwards compatibility, I think it would make sense to also coerce missing values to NaN as input for the indexer (get_loc(indexer)), like we do it for input in the constructors.

This isn't how that works today though, right?

>>> pd.options.future.infer_string = False
>>> pd.Index(["a", "b", np.nan]).get_loc(None)
KeyError: None

I think there is something that is going to be lost in translation either way when going from object-backed storage to a real string data type, but I think it is the easiest for users to understand if we are more strict about how missing values are handled

@jorisvandenbossche
Copy link
Member Author

This isn't how that works today though

It depends on how you look at it. For object dtype we don't do any coercing of the input in constructors, so we also don't do that for the input for the indexer (get_loc/get_indexer). And that indeed gives the strict behaviour for object dtype.
So object dtype is also consistent between constructors and indexers in how it coerces input (i.e. consistent in not doing that). But given we do coerce in constructors for string dtype, I think it makes sense to do that as well for indexers.

It is true that whatever choice we make, there is some change in behaviour (.get_loc(None) previously raising (if NaN was used in the index) and now no longer raising, versus .get_loc(None) previously working (if None was used in the index) and now starting to raise).
But my feeling is that it is more useful to keep .get_loc(None) working when it was working before, instead of preserving the cases when it was raising.

And in addition, keeping .get_loc(None) working ensures consistency with get_indexer within the string dtype.

@WillAyd
Copy link
Member

WillAyd commented Nov 26, 2024

But given we do coerce in constructors for string dtype, I think it makes sense to do that as well for indexers.

Ah this is a great point. OK I am fine with continuing along that path then.

Long term, it might get tricky with wanting to store None and missing values alongside one another in an object index, but that probably makes sense to solve alongside with the disambiguation of np.nan and pd.NA with floating point dtypes that has been discussed in PDEP 16

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Index Related to the Index class or subclasses Strings String extension data type and string data
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants