-
Notifications
You must be signed in to change notification settings - Fork 136
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
intersection of rows as the datasets have no mutual key/connection #385
base: develop
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good Faisal, just one question/comment
@@ -561,6 +593,15 @@ def all_mismatch(self, ignore_matching_cols: bool = False) -> "pl.DataFrame": | |||
LOG.debug( | |||
f"Column {orig_col_name} is equal in df1 and df2. It will not be added to the result." | |||
) | |||
if len(match_list) == 0: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we ok returning mismatches based on unique rows when this condition is triggered via all match columns being equal and ignore_matching_cols
being set to True?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is fine. In this situation we are deviating from the normal workflow. This corner cases is more of when we are dealing with the join columns, so by definition it will be a "composite key" of all the join columns. So in this case I think it is fine. Would be interested in your thoughts or insights.
Changes in this PR are for Pandas and Polars only.
only_join_columns
which indicates if all the columns are join columns. This is a specific corner case which prompted this PR._intersect_compare
metrics (row_cnt
,match_cnt
)whereonly_join_columns
corner case is present.intersect_rows_match
should returnFalse
if the DataFrame is empty.sample_mismatch
needs to account for the corner case situation rather than look forcolumn + "_match"
all_mismatch
needs to account for the corner case situation rather than look forcolumn + "_match"
@rhaffar I've made a couple of more tweaks after hitting some bugs here and there from the last time you checked. Would appreciate another set of eyes here.
I'm hoping this covers all the weird corner cases where: