Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

intersection of rows as the datasets have no mutual key/connection #385

Open
wants to merge 11 commits into
base: develop
Choose a base branch
from

Conversation

fdosani
Copy link
Member

@fdosani fdosani commented Feb 26, 2025

Changes in this PR are for Pandas and Polars only.

  • addition of only_join_columns which indicates if all the columns are join columns. This is a specific corner case which prompted this PR.
  • tweaks to _intersect_compare metrics (row_cnt, match_cnt)where only_join_columns corner case is present.
  • intersect_rows_match should return False if the DataFrame is empty.
  • sample_mismatch needs to account for the corner case situation rather than look for column + "_match"
  • all_mismatch needs to account for the corner case situation rather than look for column + "_match"

@rhaffar I've made a couple of more tweaks after hitting some bugs here and there from the last time you checked. Would appreciate another set of eyes here.

I'm hoping this covers all the weird corner cases where:

  1. all columns are join columns
  2. there are no intersecting columns to compare on (aside from the join columns. eg: df1: [a, b, c], df2: [a,b, d], join: [a, b])

Copy link
Contributor

@rhaffar rhaffar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good Faisal, just one question/comment

@@ -561,6 +593,15 @@ def all_mismatch(self, ignore_matching_cols: bool = False) -> "pl.DataFrame":
LOG.debug(
f"Column {orig_col_name} is equal in df1 and df2. It will not be added to the result."
)
if len(match_list) == 0:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we ok returning mismatches based on unique rows when this condition is triggered via all match columns being equal and ignore_matching_cols being set to True?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is fine. In this situation we are deviating from the normal workflow. This corner cases is more of when we are dealing with the join columns, so by definition it will be a "composite key" of all the join columns. So in this case I think it is fine. Would be interested in your thoughts or insights.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants