BUG (string dtype): comparison of string column to mixed object column fails #60228 (fixed) #60392

Open: wants to merge 40 commits into base: main
Changes from 6 commits
Commits
4bc49ed
fixed comparison of string column to mixed object column (issue #60228)
TEARFEAR Nov 21, 2024
0def761
BUG (string dtype): comparison of string column to mixed object colum…
TEARFEAR Nov 21, 2024
c4da919
BUG (string dtype): comparison of string column to mixed object colum…
TEARFEAR Nov 21, 2024
900f3b1
BUG (string dtype): comparison of string column to mixed object colum…
TEARFEAR Nov 22, 2024
8db4edc
BUG (string dtype): comparison of string column to mixed object colum…
TEARFEAR Nov 22, 2024
d4ea527
Merge branch 'main' into bug-update-60228
TEARFEAR Nov 22, 2024
d4ae654
CI/BUG: Remove `trim()` function on `comment-commands.yml` (#60397)
KevsterAmp Nov 22, 2024
eaa8b47
DOC: Fixed spelling of 'behaviour' to 'behavior' (#60398)
Nov 22, 2024
ee0902a
BUG: Convert output type in Excel for MultiIndex with period levels (…
ZKaoChi Nov 22, 2024
a2ceb52
fix issue #60410 (#60412)
partev Nov 25, 2024
e78df6f
DOC: fix SA01 for pandas.errors.UnsortedIndexError (#60404)
tuhinsharma121 Nov 25, 2024
cbd90ba
Fix BUG: Cannot shift Intervals that are not closed='right' (the defa…
lfffkh Nov 25, 2024
bca4b1c
DOC: fix SA01,ES01 for pandas.errors.PossibleDataLossError (#60403)
tuhinsharma121 Nov 25, 2024
582740b
DOC: fix SA01 for pandas.errors.OutOfBoundsTimedelta (#60402)
tuhinsharma121 Nov 25, 2024
9fab4eb
DOC: fix SA01,ES01 for pandas.errors.DuplicateLabelError (#60399)
tuhinsharma121 Nov 25, 2024
00c2207
DOC: fix SA01,ES01 for pandas.errors.InvalidIndexError (#60400)
tuhinsharma121 Nov 25, 2024
39dcbb4
DOC: fix SA01 for pandas.errors.NumExprClobberingError (#60401)
tuhinsharma121 Nov 25, 2024
0b6cece
TST: Avoid hashing np.timedelta64 without unit (#60416)
mroeschke Nov 25, 2024
759874e
BUG: Fix formatting of complex numbers with exponents (#60417)
snitish Nov 26, 2024
b1c2ba7
Bump pypa/cibuildwheel from 2.21.3 to 2.22.0 (#60414)
dependabot[bot] Nov 26, 2024
ab757ff
DOC: fix docstring api.types.is_re_compilable (#60419)
sooooooing Nov 26, 2024
be41966
DOC: Clarifying pandas.melt method documentation by replacing "massag…
ohe Nov 26, 2024
fd570f4
replace twitter->X (#60426)
partev Nov 26, 2024
98f7e4d
String dtype: use ObjectEngine for indexing for now correctness over …
jorisvandenbossche Nov 26, 2024
106f33c
DOC: Add type hint for squeeze method (#60415)
jasonmokk Nov 26, 2024
1d809c3
BUG: fix NameError raised when specifying dtype with string having "[…
yuanx749 Nov 27, 2024
89e2efc
fixed comparison of string column to mixed object column (issue #60228)
TEARFEAR Nov 21, 2024
a832418
BUG (string dtype): comparison of string column to mixed object colum…
TEARFEAR Nov 21, 2024
7152b01
BUG (string dtype): comparison of string column to mixed object colum…
TEARFEAR Nov 21, 2024
61ffbc0
BUG (string dtype): comparison of string column to mixed object colum…
TEARFEAR Nov 22, 2024
104a60f
BUG (string dtype): comparison of string column to mixed object colum…
TEARFEAR Nov 22, 2024
82625a3
Merge remote-tracking branch 'origin/bug-update-60228' into bug-updat…
TEARFEAR Nov 28, 2024
658f757
BUG (string dtype): comparison of string column to mixed object colum…
TEARFEAR Nov 28, 2024
0129c68
BUG (string dtype): comparison of string column to mixed object colum…
TEARFEAR Nov 21, 2024
65ae2e2
BUG (string dtype): comparison of string column to mixed object colum…
TEARFEAR Nov 21, 2024
497e8a6
BUG (string dtype): comparison of string column to mixed object colum…
TEARFEAR Nov 28, 2024
62239ee
Merge remote-tracking branch 'origin/bug-update-60228' into bug-updat…
TEARFEAR Nov 28, 2024
56bc8b1
BUG (string dtype): comparison of string column to mixed object colum…
TEARFEAR Nov 28, 2024
b301ac0
BUG (string dtype): comparison of string column to mixed object colum…
TEARFEAR Nov 28, 2024
01887f8
BUG (string dtype): comparison of string column to mixed object colum…
TEARFEAR Nov 28, 2024
1 change: 1 addition & 0 deletions doc/source/whatsnew/v3.0.0.rst
@@ -769,6 +769,7 @@ Styler
 Other
 ^^^^^
 - Bug in :class:`DataFrame` when passing a ``dict`` with a NA scalar and ``columns`` that would always return ``np.nan`` (:issue:`57205`)
+- Bug in :func:`comparison_op` where comparing a ``string`` dtype array with an ``object`` dtype array containing mixed types would raise a ``TypeError`` when PyArrow-based strings are enabled. (:issue:`60228`)
 - Bug in :func:`eval` on :class:`ExtensionArray` on including division ``/`` failed with a ``TypeError``. (:issue:`58748`)
 - Bug in :func:`eval` where the names of the :class:`Series` were not preserved when using ``engine="numexpr"``. (:issue:`10239`)
 - Bug in :func:`eval` with ``engine="numexpr"`` returning unexpected result for float division. (:issue:`59736`)
16 changes: 15 additions & 1 deletion pandas/core/ops/array_ops.py
@@ -40,6 +40,7 @@
     is_numeric_v_string_like,
     is_object_dtype,
     is_scalar,
+    is_string_dtype,
 )
 from pandas.core.dtypes.generic import (
     ABCExtensionArray,
@@ -53,7 +54,10 @@
 
 from pandas.core import roperator
 from pandas.core.computation import expressions
-from pandas.core.construction import ensure_wrapped_if_datetimelike
+from pandas.core.construction import (
+    array as pd_array,
+    ensure_wrapped_if_datetimelike,
+)
 from pandas.core.ops import missing
 from pandas.core.ops.dispatch import should_extension_dispatch
 from pandas.core.ops.invalid import invalid_comparison
@@ -321,6 +325,16 @@ def comparison_op(left: ArrayLike, right: Any, op) -> ArrayLike:
                 "Lengths must match to compare", lvalues.shape, rvalues.shape
             )
 
+    if (is_string_dtype(lvalues) and is_object_dtype(rvalues)) or (
+        is_object_dtype(lvalues) and is_string_dtype(rvalues)
Review comment (Member):

Checking whether an array has a string dtype can be expensive when the array is object dtype, because at that point the check scans all values to see if they are strings. So we might want to avoid that at this level.
I think we could handle the issue specifically for the ArrowExtensionArray itself (see the code I referenced in #60228 (comment)).
+    ):
+        if lvalues.dtype.name == "string" and rvalues.dtype == object:
+            lvalues = lvalues.astype("string")
+            rvalues = pd_array(rvalues, dtype="string")
Review comment (Member):

We might need to do the casting the other way around. Instead of casting the object array to string and then comparing both as strings, I think we have to cast the string array to object and compare both as object dtype.

The reason is that casting to string can actually convert non-string values to strings, and then we are no longer comparing the original values.

>>> ser_string = pd.Series(["1", "b"])
>>> ser_mixed = pd.Series([1, "b"])
>>> ser_string == ser_mixed
0    False
1     True
dtype: bool

>>> ser_string == ser_mixed.astype("string")
0    True
1    True
dtype: bool

So if we did that casting under the hood, the result would change in this case.

Review comment (Member):

And we should add this case to the tests!

+        elif rvalues.dtype.name == "string" and lvalues.dtype == object:
+            rvalues = rvalues.astype("string")
+            lvalues = pd_array(lvalues, dtype="string")
+
     if should_extension_dispatch(lvalues, rvalues) or (
         (isinstance(rvalues, (Timedelta, BaseOffset, Timestamp)) or right is NaT)
         and lvalues.dtype != object
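
The cost concern raised in the first review comment can be illustrated with a minimal sketch. It assumes pandas 2.0 or later, where `is_string_dtype` looks at the values of an object-dtype array instead of only its dtype. The variable names are illustrative and not part of the PR.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_string_dtype

# Object-dtype arrays carry no element type, so the check has to
# look at the values themselves (assumption: pandas >= 2.0 behavior).
all_strings = np.array(["a", "b"], dtype=object)
mixed = np.array([1, "b"], dtype=object)

scan_all_strings = is_string_dtype(all_strings)
scan_mixed = is_string_dtype(mixed)

# An extension array with a string dtype can be recognised from its
# dtype alone, without touching the values.
string_arr = pd.array(["a", "b"], dtype="string")
cheap_check = isinstance(string_arr.dtype, pd.StringDtype)
```

Because the two object arrays share the same dtype but give different answers, the check cannot be answered from the dtype alone, which is why it is better kept off a hot path such as `comparison_op`.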
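
The casting direction suggested in the second review comment can be sketched as follows. This is only an illustration at the `Series` level, not the change that would go into `comparison_op`; the names `as_object` and `as_string` are hypothetical.

```python
import pandas as pd

ser_string = pd.Series(["1", "b"], dtype="string")
ser_mixed = pd.Series([1, "b"], dtype=object)

# Cast the string side to object: the integer 1 stays an integer,
# so it does not compare equal to the string "1".
as_object = ser_string.astype(object) == ser_mixed

# Cast the object side to string: 1 becomes "1", which changes
# the outcome of the first comparison.
as_string = ser_string == ser_mixed.astype("string")
```

Under these assumptions only the first variant compares the original values, which matches the reviewer's argument for casting the string array to object.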
14 changes: 14 additions & 0 deletions pandas/tests/series/methods/test_compare.py
@@ -138,3 +138,17 @@ def test_compare_datetime64_and_string():
     tm.assert_series_equal(result_eq1, expected_eq)
     tm.assert_series_equal(result_eq2, expected_eq)
     tm.assert_series_equal(result_neq, expected_neq)
+
+
+def test_comparison_string_mixed_object():
+    # Issue https://github.com/pandas-dev/pandas/issues/60228
+    pd.options.future.infer_string = True
Review comment (Member):

You don't need to add this for CI, because we have a separate CI build that enables this option for the full test suite.

This can still be useful for testing locally, but the way to do that is to set an environment variable (on Linux I can run PANDAS_FUTURE_INFER_STRING=1 pytest ... to run the test with the option enabled).

+
+    ser_string = pd.Series(["a", "b"], dtype="string")
+    ser_mixed = pd.Series([1, "b"])
+
+    result = ser_string == ser_mixed
+    expected = pd.Series([False, True], dtype="boolean")
+    tm.assert_series_equal(result, expected)
+
+    pd.options.future.infer_string = False