fix: Fix smj result mismatch issue in semi, anti and full outer join #11771

Open
wants to merge 2 commits into base: main

JkSelf (Collaborator) commented Dec 6, 2024

semi and anti

While validating sort-merge join (SMJ) results on TPC-H, we found that semi join and anti join produce incorrect results when a left join key matches multiple rows on the right side.
For semi join: suppose the left input is 2 and it matches two right inputs, right input1: 2 and right input2: 2. The current code produces

2 2
2 2

that is, two output records for a single left row, which violates left semi join semantics: a left row must be emitted at most once, no matter how many right rows it matches. This PR uses JoinTracker and its firstMatch variable so that, within a group of matches for the same left key, only the first matching record is emitted and the remaining matches are skipped.
For anti join: with a filter, anti join has a similar issue. The fix is analogous: JoinTracker is used to keep only those rows of a left key match group that have no match on the right side.
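
The first-match rule for semi join and the no-match rule for anti join can be sketched in a few lines of Python. This is a minimal illustration of the intended semantics, not Velox's JoinTracker: the function names, the `(left_row_id, passed)` candidate representation, and the choice to track matches per left row id are all assumptions made for the example.

```python
def semi_join_first_match(candidates):
    """Return left row ids, emitting each id only on its first passing match.

    `candidates` is a hypothetical list of (left_row_id, passed) pairs, one
    per key-matched (left, right) pair, where `passed` is the filter result.
    """
    seen = set()
    output = []
    for left_row_id, passed in candidates:
        if passed and left_row_id not in seen:
            seen.add(left_row_id)  # plays the role of a first-match flag
            output.append(left_row_id)
    return output


def anti_join_no_match(candidates):
    """Return left row ids for which no candidate pair passed the filter."""
    matched = {left_row_id for left_row_id, passed in candidates if passed}
    # dict.fromkeys keeps first-seen order while dropping duplicate ids.
    return list(dict.fromkeys(
        left_row_id for left_row_id, _ in candidates
        if left_row_id not in matched
    ))


# One left row (id 0) whose key matches two right rows: emitted once, not twice.
print(semi_join_first_match([(0, True), (0, True)]))  # [0]
# Left row 0 never passes the filter; left row 1 passes once.
print(anti_join_no_match([(0, False), (0, False), (1, False), (1, True)]))  # [0]
```

Both helpers look at the whole match group for a left row before deciding what to emit, which is the property the fix relies on.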

full outer join fix

Assume the left table has columns a and b:

a	b
2	100
2	1
2	1

The right table has columns c and d:

c	d
2	3
2	-1
2	-1
2	3

The two tables are joined with a full outer join on the condition a == c and b < d. During the doGetOutput phase, rows are first paired on the key condition a == c in the same way as for a left join, giving 3 * 4 = 12 candidate records:

No	a	b	c	d
0	2	100	2	3
1	2	100	2	-1
2	2	100	2	-1
3	2	100	2	3
4	2	1	2	3
5	2	1	2	-1
6	2	1	2	-1
7	2	1	2	3
8	2	1	2	3
9	2	1	2	-1
10	2	1	2	-1
11	2	1	2	3

Then, in the filter method, the condition b < d is evaluated on each candidate record, giving:

No	a	b	c	d	matched
0	2	100	2	3	FALSE
1	2	100	2	-1	FALSE
2	2	100	2	-1	FALSE
3	2	100	2	3	FALSE
4	2	1	2	3	TRUE
5	2	1	2	-1	FALSE
6	2	1	2	-1	FALSE
7	2	1	2	3	TRUE
8	2	1	2	3	TRUE
9	2	1	2	-1	FALSE
10	2	1	2	-1	FALSE
11	2	1	2	3	TRUE
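
The two steps above (key pairing, then filter evaluation) can be reproduced with a short Python sketch. It only re-derives the tables from the example data; the variable names are illustrative and nothing here is taken from the Velox code.

```python
# Example data from the description: left columns (a, b), right columns (c, d).
left = [(2, 100), (2, 1), (2, 1)]
right = [(2, 3), (2, -1), (2, -1), (2, 3)]

# Step 1: pair rows on the key condition a == c (3 * 4 = 12 candidates).
candidates = [(a, b, c, d) for (a, b) in left for (c, d) in right if a == c]

# Step 2: evaluate the filter b < d on every candidate.
matched = [b < d for (a, b, c, d) in candidates]

# Row numbers (the "No" column) whose filter result is TRUE.
passing = [i for i, m in enumerate(matched) if m]

print(len(candidates))  # 12
print(passing)          # [4, 7, 8, 11]
```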

Finally, left-table records without a match get their right-side columns filled with nulls, resulting in the following final output:

No	a	b	c	d
0	2	100	null	null
1	2	1	2	3
2	2	1	2	3
3	2	1	2	3
4	2	1	2	3

The above result is incorrect because it is missing the right-table rows that have no match. Among the 12 candidate rows above, rows 0, 4, and 8 correspond to the first record (2, 3) from the right table, rows 1, 5, and 9 to the second record (2, -1), rows 2, 6, and 10 to the third record (2, -1), and rows 3, 7, and 11 to the fourth record (2, 3). In the matching results, rows 1, 5, and 9 as well as rows 2, 6, and 10 are all FALSE, meaning that the second and third records from the right table have no matching rows. A full outer join must still emit these two records, with the left-side columns set to null. The correct final result should be:

No	a	b	c	d
0	2	100	null	null
1	2	1	2	3
2	2	1	2	3
3	2	1	2	3
4	2	1	2	3
5	null	null	2	-1
6	null	null	2	-1
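
The mapping from candidate rows to right-table records can be checked with a small Python sketch. It assumes the row numbering used in the tables above, where candidate row i pairs with right record i mod 4; the variable names are illustrative only.

```python
left = [(2, 100), (2, 1), (2, 1)]
right = [(2, 3), (2, -1), (2, -1), (2, 3)]

# All keys are equal in this example, so every (left, right) pair is a
# candidate and only the filter b < d decides the match flag.
flags = [b < d for (_, b) in left for (_, d) in right]

# Candidate row i pairs left[i // len(right)] with right[i % len(right)].
right_matched = [False] * len(right)
for i, passed in enumerate(flags):
    if passed:
        right_matched[i % len(right)] = True

unmatched_right = [right[j] for j, m in enumerate(right_matched) if not m]

print(right_matched)    # [True, False, False, True]
print(unmatched_right)  # [(2, -1), (2, -1)]
```

The two unmatched right records are exactly the rows numbered 5 and 6 in the corrected result.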

This PR calls the filter function when the keys are equal in order to identify rows from the right table that have no match. For each such right row, a new record is emitted with the left-table columns set to null.
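
The expected full outer join semantics can be written as a short reference function. This is a nested-loop sketch for checking results on small inputs, assuming rows are plain tuples; it is not the merge join algorithm and not the code in this PR, and `full_outer_join`, `key_eq`, and `filter_fn` are hypothetical names.

```python
def full_outer_join(left, right, key_eq, filter_fn):
    """Reference full outer join with an extra filter (nested loop)."""
    # Column counts are taken from the first row; empty inputs get width 0.
    left_width = len(left[0]) if left else 0
    right_width = len(right[0]) if right else 0

    output = []
    right_matched = [False] * len(right)

    for l in left:
        left_matched = False
        for j, r in enumerate(right):
            if key_eq(l, r) and filter_fn(l, r):
                output.append(l + r)
                left_matched = True
                right_matched[j] = True
        if not left_matched:
            # Left row without a match: right-side columns become null.
            output.append(l + (None,) * right_width)

    for j, r in enumerate(right):
        if not right_matched[j]:
            # Right row without a match: left-side columns become null.
            output.append((None,) * left_width + r)

    return output


left = [(2, 100), (2, 1), (2, 1)]
right = [(2, 3), (2, -1), (2, -1), (2, 3)]
result = full_outer_join(
    left, right,
    key_eq=lambda l, r: l[0] == r[0],    # a == c
    filter_fn=lambda l, r: l[1] < r[1],  # b < d
)
for row in result:
    print(row)
print(len(result))  # 7
```

On the example tables this produces the seven rows of the corrected result, in the same order.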

@facebook-github-bot added the CLA Signed label Dec 6, 2024

netlify bot commented Dec 6, 2024

Deploy Preview for meta-velox canceled.

Name Link
🔨 Latest commit 8032743
🔍 Latest deploy log https://app.netlify.com/sites/meta-velox/deploys/67735b8248b58f0008286354

JkSelf (Collaborator, Author) commented Dec 6, 2024

@pedroerp @xiaoxmeng Can you help to review this PR? Thanks.

JkSelf force-pushed the semi-anti-fix branch 2 times, most recently from c598eef to 77d2d90, December 9, 2024 07:30
JkSelf changed the title from "fix: Fix semi join and anti join result mismatch issue" to "fix: Fix smj result mismatch issue in semi, anti and full outer join", Dec 31, 2024
Labels: CLA Signed, merge-join