bug: when there is no intersection of rows as the datasets have no mutual key/connection #380
Comments
@gbhogle1789 Are you using the new fix which we just released?

@rhaffar I think this might be another corner case.

Yes, I have upgraded the datacompy package.
Just want to jot down my thoughts here. This is a specific corner case:

```python
if self.only_join_columns:
    LOG.info(
        "Only join keys in data, returning mismatches based on unq_rows"
    )
    return pd.concat([self.df1_unq_rows, self.df2_unq_rows])
```

We could refactor the underlying logic to treat the join columns just like the rows we compare, but that might be a bit of work. Otherwise, we short-circuit when the only columns are the join columns.
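To make the corner case concrete, here is a minimal, standalone pandas sketch (not datacompy's actual internals) of what "mismatches based on unq_rows" amounts to when every column is a join column. The sample values are taken from the report further down and are only illustrative:

```python
import pandas as pd

# Two frames where every column would be a join column.
df1 = pd.DataFrame({"col1": [1, 2, 3], "col2": [5, 8, 9]})
df2 = pd.DataFrame({"col1": [1, 2, 6], "col2": [3, 2, 5]})

# Outer merge on all columns with an indicator to classify each row.
merged = df1.merge(df2, on=["col1", "col2"], how="outer", indicator=True)

# Left-only / right-only rows play the role of df1_unq_rows / df2_unq_rows.
df1_unq_rows = merged[merged["_merge"] == "left_only"].drop(columns="_merge")
df2_unq_rows = merged[merged["_merge"] == "right_only"].drop(columns="_merge")

# The short-circuit above would report these as the mismatches.
print(pd.concat([df1_unq_rows, df2_unq_rows]))
```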
My use case is that the join columns are unknown, so how can I find the difference between two dataframes? Since we are just using it for validation, we don't want to spend extra effort finding the join columns before comparing. Please let me know if there is any solution provided in DataComPy.
@rhaffar Here is just a rough idea of what I was thinking. I don't love it, but it's just to get us thinking about it; not sure if you have any suggestions. This could be refined or cleaned up a bit: develop...all-join-mismatch

@gbhogle1789 If you just need to check whether two Spark dataframes are identical (without join columns), then this might suit you better than datacompy:

```python
from pyspark.testing import assertDataFrameEqual

assertDataFrameEqual(df_actual, df_expected)
```

I'd need a bit more context about what you are comparing to fully understand the issue, to be honest.
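For reference, a minimal end-to-end sketch of that utility (it lives in pyspark.testing and, to my knowledge, ships with PySpark 3.5+). The data here is illustrative only:

```python
from pyspark.sql import SparkSession
from pyspark.testing import assertDataFrameEqual  # available in PySpark 3.5+

spark = SparkSession.builder.getOrCreate()

# Illustrative data only.
df_actual = spark.createDataFrame([(1, 5), (2, 8), (3, 9)], ["col1", "col2"])
df_expected = spark.createDataFrame([(1, 5), (2, 8), (3, 9)], ["col1", "col2"])

# Passes silently when the frames match; raises an assertion error
# describing the differing rows when they do not.
assertDataFrameEqual(df_actual, df_expected)
```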
@fdosani
Gotcha. Will take a further look tomorrow. For context, the idea behind the package is to join on something; this is a corner case, so we will need to work out a tweak here. If all of your columns are used to join, there is nothing left to actually compare (which is the case we are hitting). In theory, all the pieces are already there when joining on all columns.
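A quick sketch of that idea using the pandas Compare class (the df1_unq_rows / df2_unq_rows attribute names come from the snippet quoted earlier; the data is only illustrative, and the exact reporting behaviour for this corner case is what is under discussion):

```python
import pandas as pd
import datacompy

df1 = pd.DataFrame({"col1": [1, 2, 3], "col2": [5, 8, 9]})
df2 = pd.DataFrame({"col1": [1, 2, 6], "col2": [3, 2, 5]})

# Join on every column: there are no remaining columns to value-compare,
# so the interesting output is the rows unique to each side.
compare = datacompy.Compare(df1, df2, join_columns=["col1", "col2"])

print(compare.df1_unq_rows)  # rows only in df1
print(compare.df2_unq_rows)  # rows only in df2
```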
@fdosani here's my flavour of a solution, let me know what you think: develop...full_join

In particular I made two changes:

I didn't see any other way to update the mismatch methods, since what we consider to be a "mismatch" is quite different between the two types of compares, so I just used the solution from your branch. My main concern is that my compare solution isn't very explicit, but let me know what you think.
@rhaffar I'm aligned with what you have. Thanks for taking a look into this. There are maybe some tweaks we can do as we go, but this is a good base/start IMHO. Maybe let's regroup after the weekend and think about next steps.
Originally posted by @gbhogle1789 in #377
The unequal count is showing 0 for mismatched rows as well, and compare.all_mismatch(ignore_matching_cols=False).show() is not showing any values.
Output
+----+----+
|col1|col2|
+----+----+
+----+----+
DataComPy Comparison
DataFrame Summary
DataFrame Columns Rows
0 df1 2 3
1 df2 2 3
Column Summary
Number of columns in common: 2
Number of columns in df1 but not in df2: 0 []
Number of columns in df2 but not in df1: 0 []
Row Summary
Matched on: col1, col2
Any duplicates on match values: No
Absolute Tolerance: 0
Relative Tolerance: 0
Number of rows in common: 0
Number of rows in df1 but not in df2: 3
Number of rows in df2 but not in df1: 3
Number of rows with some compared columns unequal: 0
Number of rows with all compared columns equal: 0
Column Comparison
Number of columns compared with some values unequal: 0
Number of columns compared with all values equal: 2
Total number of values which compare unequal: 0
Sample Rows Only in df1 (First 10 Columns)
col1_df1 col2_df1 _merge_left
0 1 5 True
1 2 8 True
2 3 9 True
Sample Rows Only in df2 (First 10 Columns)
col1_df2 col2_df2 _merge_right
0 1 3 True
1 2 2 True
2 6 5 True
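For completeness, here is a hedged sketch of how a report like the one above could be produced with datacompy's Spark support. The class name and constructor order are assumed from recent datacompy releases (SparkSQLCompare taking a SparkSession first), and the values come from the sample rows shown above:

```python
from pyspark.sql import SparkSession
import datacompy

spark = SparkSession.builder.getOrCreate()

df1 = spark.createDataFrame([(1, 5), (2, 8), (3, 9)], ["col1", "col2"])
df2 = spark.createDataFrame([(1, 3), (2, 2), (6, 5)], ["col1", "col2"])

# Join on every column: no rows intersect, so the "unequal" counts come
# back as 0 even though the two frames clearly differ.
compare = datacompy.SparkSQLCompare(spark, df1, df2, join_columns=["col1", "col2"])

print(compare.report())
compare.all_mismatch(ignore_matching_cols=False).show()
```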