fix: Fix smj result mismatch issue in semi, anti and full outer join #11771
semi and anti
While validating SMJ (sort-merge join) results on TPC-H, we found that semi join and anti join produce incorrect results when a left join key matches multiple right rows.
For semi join: Suppose the left input is a single row with key 2, and it matches two right rows, right input1: 2 and right input2: 2. The current code emits the left row once per matching right row, so the output contains two records. This is inconsistent with the semantics of left semi join, under which each left row should appear at most once. In this PR, we use JoinTracker's firstMatch variable to ensure that, within a group of matches for a left key, only the first matching record is output and the remaining matches are skipped.
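The first-match idea can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: a nested loop stands in for the merge step, the inputs are bare join keys, and `first_match`, `left_semi_join` and `per_match_semi_join` are hypothetical names, not the PR's code.

```python
def left_semi_join(left_keys, right_keys):
    """Emit each left key at most once, on its first right match."""
    output = []
    for left_key in left_keys:
        first_match = True  # hypothetical flag, reset for every left row
        for right_key in right_keys:
            if left_key == right_key and first_match:
                output.append(left_key)
                first_match = False  # later matches in this group are skipped
    return output


def per_match_semi_join(left_keys, right_keys):
    """Mismatching behaviour: one output record per matching pair."""
    return [l for l in left_keys for r in right_keys if l == r]


print(left_semi_join([2], [2, 2]))       # [2]
print(per_match_semi_join([2], [2, 2]))  # [2, 2]
```

With left input 2 and two right inputs 2, the tracked version outputs a single record, while the per-match version reproduces the duplicate.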
For anti join: Anti join with a filter has a similar issue. The fix is also similar: we use JoinTracker so that, within a left key match group, only the left rows that have no match on the right side are retained.
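A minimal sketch of the anti join rule, assuming rows are `(key, value)` tuples and the filter compares the two value columns. The function name and the sample rows are hypothetical. The point is that a left row is decided once per key group, after every right row with the same key has been checked against the filter.

```python
def left_anti_join(left_rows, right_rows, predicate):
    """Keep a left row only if no right row with an equal key passes the filter."""
    output = []
    for a, b in left_rows:
        has_match = False
        for c, d in right_rows:
            if a == c and predicate(b, d):
                has_match = True
                break  # one passing pair is enough to drop the left row
        if not has_match:
            output.append((a, b))
    return output


# Hypothetical sample data, filter b < d.
left = [(2, 1), (2, 5)]
right = [(2, 3), (2, -1)]
result = left_anti_join(left, right, lambda b, d: b < d)
print(result)  # [(2, 5)]
```

Here `(2, 1)` is dropped because it passes the filter against `(2, 3)`, even though it fails against `(2, -1)`. Only `(2, 5)`, which fails against every right row in the group, is retained.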
full outer join fix
Assume the left table has columns a and b:
The right table has columns c and d, with four rows: (2, 3), (2, -1), (2, -1), and (2, 3).
The two tables are joined using a full outer join on the condition a == c and b < d. During the doGetOutput phase, the result is matched using a left join, resulting in 3 * 4 = 12 records:
Then, in the filter method, the records are filtered based on the condition b < d, resulting in the following:
Finally, records from the left table that do not have a match are filled with nulls, resulting in the following final output:
The above result is incorrect because it is missing rows from the right table that do not have a match. Among the 12 rows above, rows 0, 4, and 8 correspond to the first record (2, 3) from the right table; rows 1, 5, and 9 correspond to the second record (2, -1); rows 2, 6, and 10 correspond to the third record (2, -1); and rows 3, 7, and 11 correspond to the fourth record (2, 3). In the filter results above, rows 1, 5, and 9, as well as rows 2, 6, and 10, are all false, meaning that the second and third records from the right table do not have matching rows. These two unmatched right rows are therefore missing from the final result. The correct final result should be:
This PR calls the filter function while the keys are equal to identify rows from the right table that do not have a match. For each such right row, a new record is inserted with the corresponding left-table columns set to null.
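The intended full outer join behaviour can be sketched as follows. This is a simplified nested-loop illustration, not the PR's implementation. The right rows are the four records enumerated above, but the three left rows are hypothetical values chosen only to make the example runnable, and `full_outer_join` and `right_matched` are illustrative names.

```python
def full_outer_join(left_rows, right_rows, predicate):
    """Full outer join on a == c with an extra filter on (b, d)."""
    output = []
    right_matched = [False] * len(right_rows)  # tracks right rows with a match
    for a, b in left_rows:
        left_matched = False
        for j, (c, d) in enumerate(right_rows):
            if a == c and predicate(b, d):
                output.append((a, b, c, d))
                left_matched = True
                right_matched[j] = True
        if not left_matched:
            output.append((a, b, None, None))  # unmatched left row, right side null
    for j, (c, d) in enumerate(right_rows):
        if not right_matched[j]:
            output.append((None, None, c, d))  # unmatched right row, left side null
    return output


left = [(2, 1), (2, 2), (2, 5)]             # hypothetical left rows (a, b)
right = [(2, 3), (2, -1), (2, -1), (2, 3)]  # right rows (c, d) from the description
result = full_outer_join(left, right, lambda b, d: b < d)
for row in result:
    print(row)
```

With these inputs, the second and third right rows fail the filter against every left row, so each appears once with null left columns. Dropping the final loop reproduces the missing-row symptom described above.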