Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bug: when there is no intersection of rows as the datasets have no mutual key/connection #380

Open
gbhogle1789 opened this issue Feb 13, 2025 · 10 comments
Assignees
Labels
bug Something isn't working

Comments

@gbhogle1789
Copy link

gbhogle1789 commented Feb 13, 2025

Number of rows with some compared columns unequal: 3
it should show Number of rows with some compared columns unequal: 0. if no differences found in dataframe.

Originally posted by @gbhogle1789 in #377

Unequal is showing 0 for mismatched as well and compare.all_mismatch(ignore_matching_cols=False).show() not showing any value.


import pandas as pd
from datacompy import SparkSQLCompare
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = pd.DataFrame({'col1': [1, 2, 3], 'col2': [5,8,9]})
df2 = pd.DataFrame({'col1': [1, 2,6], 'col2': [3,2,5]})

sdf1 = spark.createDataFrame(df)
sdf2 = spark.createDataFrame(df2)

compare = SparkSQLCompare(spark, sdf1, sdf2, join_columns=["col1", "col2"])
compare.all_mismatch(ignore_matching_cols=False).show()

print(compare.report())

Output

+----+----+
|col1|col2|
+----+----+
+----+----+

DataComPy Comparison

DataFrame Summary

DataFrame Columns Rows
0 df1 2 3
1 df2 2 3

Column Summary

Number of columns in common: 2
Number of columns in df1 but not in df2: 0 []
Number of columns in df2 but not in df1: 0 []

Row Summary

Matched on: col1, col2
Any duplicates on match values: No
Absolute Tolerance: 0
Relative Tolerance: 0
Number of rows in common: 0
Number of rows in df1 but not in df2: 3
Number of rows in df2 but not in df1: 3

Number of rows with some compared columns unequal: 0
Number of rows with all compared columns equal: 0

Column Comparison

Number of columns compared with some values unequal: 0
Number of columns compared with all values equal: 2
Total number of values which compare unequal: 0

Sample Rows Only in df1 (First 10 Columns)

col1_df1 col2_df1 _merge_left
0 1 5 True
1 2 8 True
2 3 9 True

Sample Rows Only in df2 (First 10 Columns)

col1_df2 col2_df2 _merge_right
0 1 3 True
1 2 2 True
2 6 5 True

@gbhogle1789 gbhogle1789 changed the title SparkSQLCompare: when mismatched records then unequal rows showing 0 and compare.all_mismatch(ignore_matching_cols=False).show() not showing any records SparkSQLCompare: when mismatched records then unequal rows showing 0 and compare.all_mismatch(ignore_matching_cols=False) not showing mismatched records Feb 13, 2025
@gbhogle1789 gbhogle1789 changed the title SparkSQLCompare: when mismatched records then unequal rows showing 0 and compare.all_mismatch(ignore_matching_cols=False) not showing mismatched records SparkSQLCompare: unequal rows showing 0 when dataframes are not matching and compare.all_mismatch(ignore_matching_cols=False) not showing mismatched records Feb 13, 2025
@fdosani
Copy link
Member

fdosani commented Feb 13, 2025

@gbhogle1789 Are you using the new fix which we just released?
It would be helpful if you could articulate what you are expecting to see.

@fdosani
Copy link
Member

fdosani commented Feb 13, 2025

@rhaffar I think in this might be another corner case

@fdosani fdosani added the bug Something isn't working label Feb 13, 2025
@fdosani fdosani changed the title SparkSQLCompare: unequal rows showing 0 when dataframes are not matching and compare.all_mismatch(ignore_matching_cols=False) not showing mismatched records bug: when there is no intersection of rows as the datasets have no mutual key/connection Feb 13, 2025
@gbhogle1789
Copy link
Author

Yes, I have upgraded dataCompy packages

@fdosani
Copy link
Member

fdosani commented Feb 13, 2025

Just want to jot down my thoughts here. This is a specific corner case:

  • Where all the fields in the dataset are used as join_columns
  • under the hood we are comparing typically excluding the join_columns
  • this makes it a bit interesting to deal with since you aren't comparing anything in this corner case and just joining and then parsing out the left, right, combined results (since the join is a compare in itself)

all_mismatch is a fairly easy fix since it is a standalone function. Adding the following to short circuit seems to work fine:

        if self.only_join_columns:
            LOG.info(
                "Only join keys in data, returning mismatches based on unq_rows"
            )
            return pd.concat([self.df1_unq_rows, self.df2_unq_rows])

We could refactor the underlying logic to treat the join columns just like rows we compare but that might be a bit of work. Otherwise short circuiting when the only columns are the join_columns could an option but more hacky.

@gbhogle1789
Copy link
Author

My use case is when join columns are unknown, how can I find the difference between two dataframes. Since we are just using it for validation we don't want more efforts to find the join columns and then compare. Please let me know if any solution provided in Datacompy.

@fdosani
Copy link
Member

fdosani commented Feb 14, 2025

@rhaffar Here is just a rough idea of what I was thinking. Don't love it, but just to get thinking about it. Not sure if you have any suggestion. This could be refined or cleaned up a bit. develop...all-join-mismatch

@gbhogle1789 if you just need to check to see if 2 spark dataframes are identical (without join columns) then this might suffice better than datacompy:

from pyspark.testing import assertDataFrameEqual
assertDataFrameEqual(df_actual, df_expected)

I'd need a bit more context about what you are comparing to fully understand the issue to be honest.

@gbhogle1789
Copy link
Author

@fdosani
I am trying to compare tables in hive and databricks. and find the mismatched records between these two objects.
I want the report exactly how it shows in Datacompy not only just checking if two dataframes are equal or not.
I was trying to add common function to compare two dataframes even if user doesn't want to mention Join columns. If join_columns are not mentioned then I am joining with all columns from any of the dataframe.

@fdosani
Copy link
Member

fdosani commented Feb 14, 2025

Gotcha. Will take a further look tomorrow. for context the idea behind the package is to join on something, This is a corner case so will need to work out a tweak here. If you all your columns are used to join there is nothing to actually compare (which is the case we are hitting)

In theory all the stuff is there (when joining on all columns):

  • intersect_rows will give you all the matches
  • df1_unq_rows and df2_unq_rows would tell you which ones are not matching.

@rhaffar
Copy link
Contributor

rhaffar commented Feb 15, 2025

@fdosani here's my flavour of solution, let me know what you think: develop...full_join

In particular I made two changes:

  • Instead of checking that all columns in both dataframes are join columns, I check that the intersection of non-join columns between both dataframes is empty. This accounts for the case where either dataframe has unique columns when all common columns are join columns.
  • Wrap intersect compare of self.only_join_columns() compares into the main intersect compare method.

I didn't see any other way to update the mismatch methods, since what we consider to be a "mismatch" is quite different between both types of compares, so I just used the solution from your branch.

My main concern is that my compare solution isn't very explicit, but let me know what you think.

@fdosani
Copy link
Member

fdosani commented Feb 15, 2025

@rhaffar I'm aligned with what you have. Thanks for taking a look into this. There are maybe some tweaks we can do as we go, but this is a good base/start IMHO. Maybe lets regroup after the weekend and think about:

  • tests we can add to validate these corner cases too.
  • apply this to the other dataframe types

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

3 participants