Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Breaking change of dataclasses.dataclass comparison semantics in 3.13+ #128294

Open
daskol opened this issue Dec 27, 2024 · 10 comments
Open

Breaking change of dataclasses.dataclass comparison semantics in 3.13+ #128294

daskol opened this issue Dec 27, 2024 · 10 comments
Assignees
Labels
3.13 bugs and security fixes 3.14 new features, bugs and security fixes stdlib Python modules in the Lib dir topic-dataclasses type-bug An unexpected behavior, bug, or error

Comments

@daskol
Copy link

daskol commented Dec 27, 2024

Bug report

Bug description:

Brief Description

Optimization done in #109870 changed semantic of dataclass comparison.

Description

The pseudo code below shows meaning of changes have been done in #109870. The semantic differs largely because of a shortcut in __eq__ implementation of sequence-like containers (see Objects/object.c). The shortcut essentially does self[i] is other[i]. Consequently, method __eq__ of self[i] is not evaluated for identical objects in 3.12 during dataclasses.dataclass comparison.

def __eq__312(self, other):
    return astuple(self) == astuple(other)

def __eq__313(self, other):
    for lhs, rhs in zip(astuple(self), astuple(other)):
        if not lhs == rhs:
            return False
    else:
        return True

According Python docs (citation below), v3.13 introduces breaking change since it does not consider fields as a tuples for dataclass comparison.

eq: If true (the default), an __eq__() method will be generated. This method compares the class as if it were a tuple of its fields, in order. Both instances in the comparison must be of the identical type.

Test Case

import numpy as np
from dataclasses import dataclass

@dataclass
class A:
    xs: np.ndarray

a = A(np.ones(3))
b = A(a.xs)

print(a == b)  # FAIL (3.13); OK (3.12).
# ValueError: Value The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

CPython versions tested on:

3.13 (3.12)

Operating systems tested on:

Linux

@daskol daskol added the type-bug An unexpected behavior, bug, or error label Dec 27, 2024
@Eclips4 Eclips4 added topic-dataclasses stdlib Python modules in the Lib dir labels Dec 27, 2024
@daskol daskol changed the title Breaking change of dataclass.dataclass comparison semtatics Breaking change of dataclasses.dataclass comparison semtatics Dec 27, 2024
@sobolevn
Copy link
Member

In 3.12 we used to do something like: return (self.x, self.y) == (other.x, other.y), see

cpython/Lib/dataclasses.py

Lines 1085 to 1094 in 3a726be

if eq:
# Create __eq__ method. There's no need for a __ne__ method,
# since python will call __eq__ and negate it.
flds = [f for f in field_list if f.compare]
self_tuple = _tuple_str('self', flds)
other_tuple = _tuple_str('other', flds)
_set_new_attribute(cls, '__eq__',
_cmp_fn('__eq__', '==',
self_tuple, other_tuple,
globals=globals))

Now, instead we do: self.x == other.x and self.y == other.y, which triggers explicit __eq__.

Problems:

  • This is a breaking change
  • Reverting this would be a breaking change for 3.13 users 😢

@daskol
Copy link
Author

daskol commented Dec 27, 2024

Indeed, we have the problem of two chairs here. From my perspective, the right decision would be reverting as soon as possible until major Linux distros do not adopt 3.13 widely.

In advance, I could give some context on a domain where the issue araises first. It is machine learning where data classes are used a lot for neural network representation and maintaining weight collections. Comparison is broken there because overridden __eq__ has vector rather scalar semantic.

@picnixz picnixz changed the title Breaking change of dataclasses.dataclass comparison semtatics Breaking change of dataclasses.dataclass comparison semantics Dec 27, 2024
@picnixz
Copy link
Contributor

picnixz commented Dec 27, 2024

method eq of self[i] is not evaluated for identical objects in 3.12

Considering that we largely short-circuit equality in general, I think we should revert it. In addition, we are now out-of-sync with "This method compares the class as if it were a tuple of its fields, in order" (emphasis mine). While this can be a breaking change, we haven't updated the documentation, so maybe it hasn't been observed.

Most of the time, two objects that are identical (in terms of pointers) should also compare equal, independently of whether there is a custom __eq__ or not, though this is my own opinion. If we're worried about constructing the tuple, instead of doing x[i] == y[i], we could do x[i] is y[i] or x[i] == y[i] to emulate this behaviour but I wonder how performance would be affected.

cc @Yhg1s as the 3.13 RM.

@picnixz picnixz added 3.13 bugs and security fixes 3.14 new features, bugs and security fixes labels Dec 27, 2024
@picnixz picnixz changed the title Breaking change of dataclasses.dataclass comparison semantics Breaking change of dataclasses.dataclass comparison semantics in 3.13+ Dec 27, 2024
@sobolevn sobolevn self-assigned this Dec 27, 2024
@sobolevn
Copy link
Member

I will send a fix ASAP.

@ericvsmith
Copy link
Member

See also #120645. I don't think that we could back out the change at this point.

@daskol
Copy link
Author

daskol commented Dec 28, 2024

From perspective of PEP-557, data classes should be compared as tuples since it explicitly states that Data Classes can be thought of as “mutable namedtuples with defaults”. I guess that making data classes as close as possible to named tuples was a primary design goal. For this reason, breaking change inroduced by #120645 was not the wisest decision.

from collections import namedtuple
from dataclasses import dataclass

Point = namedtuple('Point', ['x', 'y'])
p1 = Point(0.0, float('nan'))
p2 = Point(0.0, p1.y)
assert p1 == p2  # OK


@dataclass
class Point:
    x: float
    y: float

p1 = Point(0.0, float('nan'))
p2 = Point(0.0, p1.y)
assert p1 == p2  # OK (3.12); FAIL (3.13).

@serhiy-storchaka
Copy link
Member

On other hand, we rarely write (self.x is other.x or self.x == other.x) and (self.y is other.y or self.y == other.y) in hand-made classes. We usually simply write self.x == other.x and self.y == other.y. And dataclasses are intended to remove boilerplate for struct-like classes. namedtuple use identity check because they are tuples, and tuples use identity check because they are sequences and for consistency with the identity check in __contains__. But dataclasses usually are not sequences.

Perhaps in future we will add a per-field boolean option for using identity check.

@sobolevn
Copy link
Member

I think that in this case (no revert) we need to update the docs to reflect the behaviour change. The OP has several examples of older wordings like:

   - *eq*: If true (the default), an :meth:`~object.__eq__` method will be
     generated.  This method compares the class as if it were a tuple
     of its fields, in order.

@daskol
Copy link
Author

daskol commented Dec 30, 2024

But dataclasses usually are not sequences.

Indeed, dataclasses usually are not sequences but they are fancy sequences. At least, it was design goal according to PEP-557. Anyway, I'd like to hear @ericvsmith opinion on this matter since he initially drafted data classes.

Another consequence of this breaking change is compatibility with third party data class libraries (e.g. general-purpose attrs and pydantic or domain purpose chex or flax.struct).

  • General purpose attrs.

    from attrs import define
    
    @define
    class Point:
        x: float
        y: float
    
    a = Point(0.0, float('nan'))
    b = Point(0.0, a.y)
    assert a == b  # OK (3.12); FAIL (3.13).
  • General purpose pydantic.

    from pydantic.dataclasses import dataclass
    
    @dataclass
    class Point:
        x: float
        y: float
    
    a = Point(0.0, float('nan'))
    b = Point(0.0, a.y)
    assert a == b  # OK (3.12); FAIL (3.13).

From my perspective, new semantic of __eq__ is viable change in general but it requires notice period. Third parties and a lot of users appeared in six years since PEP-557 rely on the existing semantics. It will take huge community effort to bring consistency.

@ericvsmith
Copy link
Member

I think it's unfortunate that there wasn't a warning of this change. But at this point it has been released, and rolling it back is problematic. Maybe a better notice in the release notes would be enough?

I don't think the PEP comment dataclasses of can be thought of as “mutable namedtuples with defaults” implies that they should be implemented with tuple semantics where possible. The statement is a very high level overview of the functionality.

@rhettinger : what are your thoughts?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3.13 bugs and security fixes 3.14 new features, bugs and security fixes stdlib Python modules in the Lib dir topic-dataclasses type-bug An unexpected behavior, bug, or error
Projects
None yet
Development

No branches or pull requests

6 participants