Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-35785: [C++] Support null type non-key columns for join operation #38383

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

llama90
Copy link
Contributor

@llama90 llama90 commented Oct 21, 2023

Rationale for this change

Supports join on null types for non-key columns.

What changes are included in this PR?

  • Separated the check of supported types for key and non-key columns during join execution
  • Updated the logic associated with Null types for non-key columns.
  • Modified the random tests for Hash Join (TEST(HashJoin, Random)) to allow the creation of Null as non-key column values.

Are these changes tested?

Yes

Are there any user-facing changes?

For users utilizing the Join operation

@llama90 llama90 requested a review from westonpace as a code owner October 21, 2023 12:23
@llama90 llama90 changed the title GH-35785: [C++] Support null type columns for join operation GH-35785: [C++] Support null type non-key columns for join operation Oct 21, 2023
@github-actions github-actions bot added the awaiting review Awaiting review label Oct 21, 2023
@github-actions
Copy link

⚠️ GitHub issue #35785 has been automatically assigned in GitHub to PR creator.

@llama90
Copy link
Contributor Author

llama90 commented Oct 21, 2023

There were policy considerations and a personal lack of understanding regarding the code when it came to supporting Null types for key field columns.

Therefore, although there was an intent to support key columns as well, the initial improvements were made to enable support for non-key columns.

    // Constraints
    RandomDataTypeConstraints key_type_constraints;
    key_type_constraints.Default();
    key_type_constraints.withoutNullColumn();

In the code, when the // key_type_constraints.withoutNullColumn(); part is commented out, the following error can be encountered.

error message

Test 55: RIGHT_OUTER EQ_IS parallel = true bloom_filter = false
/Users/lama/workspace/arrow-2/cpp/src/arrow/acero/hash_join_node_test.cc:1231: Failure
Failed
'_error_or_value45.status()' failed with NotImplemented: Function 'equal' has no kernel matching input types (null, null)
/Users/lama/workspace/arrow-2/cpp/src/arrow/compute/expression.cc:551 call.function->DispatchBest(&types)
/Users/lama/workspace/arrow-2/cpp/src/arrow/acero/hash_join_node.cc:386 filter.Bind(filter_schema, exec_context)
/Users/lama/workspace/arrow-2/cpp/src/arrow/acero/hash_join_node.cc:737 schema_mgr->BindFilter(join_options.filter, left_schema, right_schema, plan->query_context()->exec_context())
/Users/lama/workspace/arrow-2/cpp/src/arrow/acero/exec_plan.cc:587 MakeExecNode(this->factory_name, plan, std::move(inputs), *this->options, registry)
/Users/lama/workspace/arrow-2/cpp/src/arrow/acero/exec_plan.cc:582 std::get(input).AddToPlan(plan, registry)
/Users/lama/workspace/arrow-2/cpp/src/arrow/acero/exec_plan.cc:686 with_sink.AddToPlan(exec_plan.get())
/Users/lama/workspace/arrow-2/cpp/src/arrow/acero/hash_join_node_test.cc:945 DeclarationToExecBatches(std::move(join), parallel)
Google Test trace:
/Users/lama/workspace/arrow-2/cpp/src/arrow/acero/hash_join_node_test.cc:1120: RIGHT_OUTER EQ_IS parallel = true bloom_filter = false

It mentions that an "equal" operator for null, null is needed, which seems to require further investigation (however, I'm actually not sure why a join operation on null key columns would be necessary)

@github-actions
Copy link

⚠️ GitHub issue #35785 has been automatically assigned in GitHub to PR creator.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
awaiting review Awaiting review
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Python] Support joinining tables with null columns
1 participant