Range/inequality joins are slow #8393
Comments
I just noticed that what I really want is a RIGHT join. That is, if there is no matching pricing for a timestamp, the result should be null. After changing the query to that, DataFusion is much faster. I believe it's because with a RIGHT join, pricing becomes the outer table (single partition), while timestamps becomes the inner table (unspecified partitioning), which allows for greater parallelism (see https://github.com/apache/arrow-datafusion/blob/e19c669855baa8b78ff86755803944d2ddf65536/datafusion/physical-plan/src/joins/nested_loop_join.rs#L72-L77C4). But I think the issue should still be open: the LEFT join is still slower.
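To make the build-side/probe-side reasoning in the comment above concrete, here is a minimal Python model of a nested loop join. It is not DataFusion's actual implementation. It assumes one input is collected into a single in-memory list (the build side) while the other arrives as several partitions that can be processed independently. The data, the row layout, and the helpers `join_partition` and `nested_loop_join` are all hypothetical, and the thread pool only illustrates the task structure rather than real CPU parallelism.

```python
from concurrent.futures import ThreadPoolExecutor


def join_partition(build_rows, probe_rows, predicate):
    """Join one probe partition against the fully collected build side."""
    out = []
    for probe in probe_rows:
        matches = [b for b in build_rows if predicate(b, probe)]
        if matches:
            out.extend((b, probe) for b in matches)
        else:
            # Outer-join behaviour: keep the probe row, pad the build side with None.
            out.append((None, probe))
    return out


def nested_loop_join(build_rows, probe_partitions, predicate):
    """Run one independent task per probe partition.

    In this model the available parallelism equals the number of probe partitions.
    """
    with ThreadPoolExecutor() as pool:
        parts = pool.map(
            lambda part: join_partition(build_rows, part, predicate),
            probe_partitions,
        )
        return [row for part in parts for row in part]


# Hypothetical data: pricing rows are (valid_from, valid_to, price).
pricing = [(0, 10, 1.5), (10, 20, 2.0)]
# Timestamps split into three partitions; 25 has no matching price.
timestamps = [[1, 12], [9, 25], [19]]


def in_range(price_row, ts):
    return price_row[0] <= ts < price_row[1]


result = nested_loop_join(pricing, timestamps, in_range)
```

In this model, if the partitioned input were instead the one forced into a single partition, only one task would run, which matches the slowdown described for the LEFT join.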
I think |
I started trying to collect a list of various join improvements in #8398
I am interested in this ticket. Since it is a pretty major project, I will write a proposal first.
Thank you @my-vegetable-has-exploded -- that is a great idea. cc @korowa / @viirya / @metesynnada, who have been involved in join implementations recently and who may be interested as well
Disregarding IEJoin -- so, if I'm not mistaken, this issue is mostly about covering NLJoin in join_selection.rs. UPD: in addition, to make join reordering useful, NLJoin itself also needs to be modified, since it currently chooses the build side based on the logical join type.
I think it is a good idea to improve performance in this scenario. Your PR also looks good to me. But I think it is also OK to keep the old parallelism strategy. In my opinion, the old parallelism strategy should work, but the check in I think it may be another way to write a new enforce_distribution strategy for
I don't think that's the proper way to go -- it will give some benefit in terms of runtime, but it will be suboptimal in terms of memory utilization and CPU time (as we'll need to perform BuildSideRows * NumberOfPartitions filter evaluations instead of BuildSideRows * 1, where 1 is probe side input batches)
I don't think this issue should be closed. #9676 seems to take care of ordering but I think it doesn't improve range/inequality joins much?
My intention was to fix the NLJoin parallelism issue caused by the fixed build-side choice (since a right join instead of a left join had acceptable performance, as was claimed above). At the same time, we also have #318 for a specialized operator implementation, so I supposed #9676 would be enough. I don't mind keeping it open, though.
Could anyone do me a favour here?
Describe the bug
Joins where the `ON` filter is not an equality but rather an inequality like `<`, `>` etc. seem slow, at least compared to DuckDB, which seems like a direct "competitor".
The main difference between the DuckDB and DataFusion plans seems to be that DataFusion uses a `NestedLoopJoinExec`, while DuckDB uses an `IEJoin`.
Note that the query could be written better with an ASOF join, but DataFusion does not support that (see issue #318).
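As a rough illustration of why a sort-based operator can beat a nested loop on inequality predicates, the sketch below compares the two for a single predicate `l < r`. This is not DuckDB's IEJoin algorithm, and it is not what either engine actually executes. It is a simplified, hypothetical example with made-up function names and data.

```python
from bisect import bisect_right


def nested_loop_less_than(left, right):
    """Evaluate the predicate for every pair: O(len(left) * len(right))."""
    return [(l, r) for l in left for r in right if l < r]


def sorted_less_than(left, right):
    """Sort the right side once, then binary-search the first r with r > l.

    Sorting costs O(m log m) and each lookup O(log m); the remaining work is
    proportional to the number of output pairs.
    """
    right_sorted = sorted(right)
    out = []
    for l in left:
        # Index of the first element strictly greater than l.
        start = bisect_right(right_sorted, l)
        out.extend((l, r) for r in right_sorted[start:])
    return out


left = [5, 1, 8]
right = [7, 2, 9, 5]

pairs_nl = nested_loop_less_than(left, right)
pairs_sorted = sorted_less_than(left, right)

# Both strategies produce the same set of pairs, possibly in a different order.
assert sorted(pairs_nl) == sorted(pairs_sorted)
```

The nested loop always evaluates every pair, while the sorted variant skips the non-matching prefix entirely. That difference matters most when the predicate is selective.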
To Reproduce
Create some test data with this SQL (saved as repro-dataset.sql) in DuckDB:
$ duckdb < repro-dataset.sql
We will compare the performance of the following query in DuckDB and DataFusion. The query is saved as `repro-range-query.sql`.
DuckDB performance:
DataFusion performance:
$ time datafusion-cli -f repro-range-query.sql
...
real 0m8.269s
user 0m6.358s
sys 0m1.907s
Expected behavior
It would be nice if the above query (or something equivalent) were faster in DataFusion.
If someone knows of a better way to express the query, then that could also be a workaround for me.
Additional context
Machine tested on:
CPU: Ryzen 3900X
OS: Ubuntu 22.04
Versions used: