[KYUUBI #6024] Insert crc checksum observer after all project nodes #6025
base: master
Codecov Report
All modified and coverable lines are covered by tests ✅
Additional details and impacted files
@@             Coverage Diff              @@
##             master    #6025      +/-   ##
============================================
+ Coverage     61.07%   61.08%   +0.01%
  Complexity       23       23
============================================
  Files           623      623
  Lines         37090    37103      +13
  Branches       5028     5029       +1
============================================
+ Hits          22653    22666      +13
- Misses        11986    11992       +6
+ Partials       2451     2445       -6
override def apply(plan: LogicalPlan): LogicalPlan = {
  if (conf.getConf(INSERT_CHECKSUM_OBSERVER_AFTER_PROJECT_ENABLED)) {
    plan resolveOperatorsUp {
      case p: Project if p.resolved && p.getTagValue(INSERT_COLLECT_METRICS_TAG).isEmpty =>
What happens if the plan is for writing?
> What happens if the plan is for writing?
The rule just adds an observer after every Project node. I think there will also be a Project node before the write.
Why do we only add collect metrics for Project nodes?
> Why do we only add collect metrics for Project nodes?
Most data inconsistencies are caused by Project nodes, for example through UDFs/UDAFs or type conversions.
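To illustrate the point about type conversions, here is a minimal, Spark-free sketch. It assumes a checksum built by summing per-row CRC32 values, which may not match what this PR actually computes; `ChecksumSketch`, `rowCrc`, and `checksum` are hypothetical names used only for illustration.

```scala
import java.nio.charset.StandardCharsets
import java.util.zip.CRC32

// Hypothetical illustration only; not the rule implemented in this PR.
object ChecksumSketch {
  // CRC32 of one row, rendered as a delimiter-separated string.
  def rowCrc(row: Seq[Any]): Long = {
    val crc = new CRC32()
    crc.update(row.mkString("\u0001").getBytes(StandardCharsets.UTF_8))
    crc.getValue
  }

  // Summing per-row CRCs makes the checksum independent of row order.
  def checksum(rows: Seq[Seq[Any]]): Long = rows.map(rowCrc).sum

  val input: Seq[Seq[Any]] = Seq(Seq(1L, 0.1d), Seq(2L, 16777217d))

  // A projection that passes values through unchanged.
  val identityProject: Seq[Seq[Any]] = input.map(r => Seq(r(0), r(1)))

  // A projection with a lossy conversion (Double -> Float -> Double).
  val lossyProject: Seq[Seq[Any]] =
    input.map(r => Seq(r(0), r(1).asInstanceOf[Double].toFloat.toDouble))
}
```

Comparing `checksum(input)` with `checksum(identityProject)` and `checksum(lossyProject)` shows the idea: the pass-through projection keeps the checksum, while the lossy cast alters the rendered values and so is expected to change it. Observing a checksum after each Project node would let such a change be traced to the stage that introduced it.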
How about we remove AfterProject from the rule name and support more nodes in the future?
🔍 Description
Issue References 🔗
This pull request fixes #6024
Describe Your Solution 🔧
Insert a CRC checksum observer after all Project nodes to compare data at every stage of the SQL query.
Types of changes 🔖
Test Plan 🧪
Behavior Without This Pull Request ⚰️
Behavior With This Pull Request 🎉
Related Unit Tests
org.apache.spark.sql.observe.InsertChecksumObserverAfterProjectSuite
Checklist 📝
Be nice. Be informative.