-
Notifications
You must be signed in to change notification settings - Fork 28.6k
[SPARK-52226] [SQL] Fix unusual equality checks in three operators #50949
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
e5afeda
to
35329bd
Compare
Hi @cloud-fan , since this PR modifies some code you introduced in e97ab1d#diff-f82edfef27867e1285af13f3603efbc5e77d81d715d427db4b51f0c3e3a0df14R35-R38, would you mind taking a look? Thanks! |
@@ -48,8 +48,8 @@ case class BatchScanExec( | |||
// TODO: unify the equal/hashCode implementation for all data source v2 query plans. | |||
override def equals(other: Any): Boolean = other match { | |||
case other: BatchScanExec => | |||
this.batch != null && this.batch == other.batch && | |||
this.runtimeFilters == other.runtimeFilters && | |||
this.getClass == other.getClass && this.batch != null && |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Even though this is a case
class, there's no final
modifier, making it still possible to be extended. It seems to me that we should either mark this class as final, or explicitly check class equivalence here to prevent misuse (like this PR), or remove this override completely, so that Scala compiler would automatically generate an equals method that does proper class equivalence check.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Usually Scala case class should not be extended, so this class check is not needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the reply.
Usually Scala case class should not be extended
I agree, with the caveat that Scala compiler doesn't prohibit a normal class from inheriting a case class. But anyways, it would be such a corner case that we could close this PR.
What changes were proposed in this pull request?
This PR proposed to make equality checks stricter in BatchScanExec, ContinuousScanExec, and MicroBatchScanExec, by making two objects non-equal if they are of different concrete classes.
Why are the changes needed?
e97ab1d#diff-f82edfef27867e1285af13f3603efbc5e77d81d715d427db4b51f0c3e3a0df14R35-R38 introduced
equals
functions to a few v2 data source operators, while none of the other operators hasequals
override.This means equivalence checks of BatchScanExec, ContinuousScanExec, and MicroBatchScanExec are looser than all other operators. It doesn't seem to be intentional; it looks like an overlook to me - different operators should follow the same set of basic contracts if possible, if not, they shall not be too different from each other. Notably, the original author also left a TODO to "unify" them.
Now we live in a world where most operators have stricter equivalence checks, while a few operators have looser equivalence checks. What could go wrong? Well, since Spark is extensible, it is possible to inherit Spark's operators with modified runtime implementation while delivering same results. In fact, that's what https://github.com/apache/incubator-gluten project does, whereas (most) Spark operators are inherited by Gluten operators. Given the loose equivalence checks of Spark operators, we could end up declaring equivalence between a Spark operator and a Gluten operator.
If Spark starts with a clear contract that operators are "equal" as long as they deliver same results (like the abstract util classes in JDK), it would be probably fine. Now we live in a world where most operators don't do this except for the 3 operators I mentioned above. This is very easy to miss, and has caused unexpected behavior/bugs in downstream applications.
Does this PR introduce any user-facing change?
No, unless "user" means downstream applications that integrates with Spark (like Gluten).
How was this patch tested?
No. Unittests, if any, would be to test that two instances of subclasses of
BatchScanExec
(let's sayBatchScanExec
andMockBatchScanExec
) are not equivalent, but that is kinda obvious just from the code itself.Was this patch authored or co-authored using generative AI tooling?
No.