Add dynamic filter (bounds) pushdown to HashJoinExec #16445

Open · wants to merge 11 commits into main from hash-join-pushdown

Conversation

adriangb (Contributor):

Part of #7955.

My goal here is to lay the groundwork for pushing down joins.
I am only implementing bounds pushdown because I am confident it is cheap, and it will probably be quite effective in many cases. It should also be relatively easy to push down a reference to the whole hash table in a follow-up PR.

Another follow-up would be to enable parent filter pushdown through HashJoinExec. As with FilterExec, this requires adjusting parent filters for the join's projection, but we also need to check which columns each filter refers to so it can be pushed into the correct child (or not pushed at all if it refers to columns from both children and cannot be split).
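
To make the bounds pushdown concrete, here is a minimal sketch (assuming a single join key; `bounds_predicate` is an illustrative helper, not an API from this PR) of the kind of predicate the join can publish once the build side is complete:

```rust
// Sketch only: once the build side finishes, compute min/max for the join
// key and publish `key >= min AND key <= max` through a dynamic filter that
// scans below the probe side can observe.
use std::sync::Arc;

use datafusion_common::ScalarValue;
use datafusion_expr::Operator;
use datafusion_physical_expr::expressions::{BinaryExpr, Literal};
use datafusion_physical_expr::PhysicalExpr;

/// Build `key >= min AND key <= max` for one join key column.
fn bounds_predicate(
    key: Arc<dyn PhysicalExpr>,
    min: ScalarValue,
    max: ScalarValue,
) -> Arc<dyn PhysicalExpr> {
    let ge = Arc::new(BinaryExpr::new(
        Arc::clone(&key),
        Operator::GtEq,
        Arc::new(Literal::new(min)),
    ));
    let le = Arc::new(BinaryExpr::new(
        key,
        Operator::LtEq,
        Arc::new(Literal::new(max)),
    ));
    Arc::new(BinaryExpr::new(ge, Operator::And, le))
}
```

A predicate of this shape is cheap to evaluate per batch, and because it is a plain range filter it can also drive statistics-based pruning in the scans, which comes up later in this thread.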

@github-actions github-actions bot added core Core DataFusion crate physical-plan Changes to the physical-plan crate labels Jun 18, 2025
@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Jun 18, 2025
}

// Compute min/max using ScalarValue's utilities
let mut min_val = ScalarValue::try_from_array(array, 0)?;
Contributor:

I think we should use an arrow kernel for this (this is slow).
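
For reference, a minimal sketch of what using arrow's aggregate kernels could look like, assuming a primitive Int32 join key purely for illustration:

```rust
// arrow's specialized min/max kernels scan the native values directly
// instead of materializing a ScalarValue per row.
use arrow::array::Int32Array;
use arrow::compute::{max, min};

fn int32_bounds(array: &Int32Array) -> Option<(i32, i32)> {
    // Both kernels return None for an empty or all-null array.
    Some((min(array)?, max(array)?))
}
```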

adriangb (author):

Maybe we re-use datafusion/functions-aggregate/src/min_max.rs? Seems like there's a lot of complexity there related to the types that we wouldn't want to re-implement.

adriangb (author):

Sadly min_batch is not public, and either way functions-aggregate is not a dependency of physical-plan.

adriangb (author):

Could we move these functions into functions-aggregate-common?

Contributor:

Makes sense.

Dandandan (Contributor):

I think we should also consider a heuristic for not evaluating the filter when it's not useful.

Also, I think doing only the lookup is preferable to also computing / checking the bounds; the latter might create more overhead.

@Dandandan Dandandan closed this Jun 18, 2025
@Dandandan Dandandan reopened this Jun 18, 2025
Dandandan (Contributor):

Sorry, misclicked a button.

adriangb (author) commented Jun 18, 2025:

> I think doing only the lookup is preferable to also computing / checking the bounds; the latter might create more overhead

My thought was that in some cases the bounds checks are going to be quite effective at pruning, and they should always be cheap to compute and cheap to apply. I'm surprised you say that they might create a lot of overhead?

Dandandan (Contributor) commented Jun 18, 2025:

> I think doing only the lookup is preferable to also computing / checking the bounds; the latter might create more overhead

> My thought was that in some cases the bounds checks are going to be quite effective at pruning, and they should always be cheap to compute and cheap to apply. I'm surprised you say that they might create a lot of overhead?

Maybe I should articulate it a bit more.

  • If we are only filtering based on statistics, min/max might make sense to quickly filter out large chunks of rows.
  • If we are filtering on values (e.g. filter pushdown), I think it makes sense to only filter on the shared hashmap and not bother with the min/max values: creating hashes and doing a single table lookup is quite fast, so we want to avoid also evaluating the min/max expression (at least for all rows).

I think it also makes sense to think about a heuristic for using this pushdown only when we expect it to be useful - e.g. the left side is much smaller than the right side, or we know (based on column statistics) it will filter out rows.
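
Purely to illustrate the shape of such a heuristic (the thresholds and signature here are made up, not part of the PR):

```rust
// Hypothetical heuristic: only push the dynamic filter down when the build
// side looks small relative to the probe side, i.e. when it is likely to
// prune something.
fn should_push_down_dynamic_filter(build_rows: usize, probe_rows: Option<usize>) -> bool {
    match probe_rows {
        // With a known probe-side row count, require the build side to be
        // at least an order of magnitude smaller.
        Some(probe) => build_rows.saturating_mul(10) <= probe,
        // Without statistics, fall back to a fixed build-side cap.
        None => build_rows <= 1_000_000,
    }
}
```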

adriangb (author):

> I think it makes sense to only filter on the shared hashmap and not bother with the min/max values: creating hashes and doing a single table lookup is quite fast, so we want to avoid also evaluating the min/max expression (at least for all rows)

I'm surprised that the hash table lookup, even if O(1), has such a small constant factor that it's comparable to a couple of binary comparisons. That said, a reason to still do both is stats and filter caching: simple filters like col >= 123 AND col <= 456 can be used for stats pruning and can easily be cached (for example, for filter-cache-based indexing). So even if performance is not strictly better, there is still something to be said for including a simple filter in addition to the hash table lookup.

@xudong963 xudong963 self-requested a review June 19, 2025 10:14
adriangb (author):

> I think it also makes sense to think about a heuristic for using this pushdown only when we expect it to be useful - e.g. the left side is much smaller than the right side, or we know (based on column statistics) it will filter out rows

DataFusion is generally not great at these things: we often don't have enough stats / info to make decisions like this.

Dandandan (Contributor):

> I think it makes sense to only filter on the shared hashmap and not bother with the min/max values: creating hashes and doing a single table lookup is quite fast, so we want to avoid also evaluating the min/max expression (at least for all rows)

> I'm surprised that the hash table lookup, even if O(1), has such a small constant factor that it's comparable to a couple of binary comparisons. That said, a reason to still do both is stats and filter caching: simple filters like col >= 123 AND col <= 456 can be used for stats pruning and can easily be cached (for example, for filter-cache-based indexing). So even if performance is not strictly better, there is still something to be said for including a simple filter in addition to the hash table lookup.

It's hard to say in general, but a hash table lookup on a u64 key that fits into cache can be really fast.

adriangb (author):

> It's hard to say in general, but a hash table lookup on a u64 key that fits into cache can be really fast.

I guess only benchmarks can tell. But I still think the scalar bounds are worth keeping for stats pruning reasons.

xudong963 (Member) left a comment:

Do we have any metrics to record how much data is filtered by the dynamic join filter?
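
A sketch of what such metrics might look like using DataFusion's standard metrics machinery (the metric names and struct here are illustrative, not from this PR):

```rust
use datafusion_physical_plan::metrics::{Count, ExecutionPlanMetricsSet, MetricBuilder};

// Counting rows before and after the dynamic filter is applied would let
// EXPLAIN ANALYZE report its selectivity.
struct DynamicFilterMetrics {
    input_rows: Count,
    output_rows: Count,
}

impl DynamicFilterMetrics {
    fn new(metrics: &ExecutionPlanMetricsSet, partition: usize) -> Self {
        Self {
            input_rows: MetricBuilder::new(metrics).counter("dyn_filter_input_rows", partition),
            output_rows: MetricBuilder::new(metrics).counter("dyn_filter_output_rows", partition),
        }
    }
}
```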

@@ -433,6 +433,117 @@ async fn test_topk_dynamic_filter_pushdown() {
);
}

#[tokio::test]
async fn test_hashjoin_dynamic_filter_pushdown() {
xudong963 (Member):

Can we add some tests for multiple joins? Such as

        Join (t1.a = t2.b)
        /              \
      t1        Join (t2.c = t3.d)
                    /        \
                  t3          t2

xudong963 (Member):

Such a test can check that:

  1. dynamic filters are pushed down to the right scan node
  2. dynamic filters aren't missed during pushdown

adriangb (author):

I've added a test that I think matches your suggestion

@adriangb adriangb force-pushed the hash-join-pushdown branch from 49d1636 to 04efcc1 Compare June 24, 2025 19:08
@github-actions github-actions bot removed the sqllogictest SQL Logic Tests (.slt) label Jun 24, 2025
@@ -353,6 +353,18 @@ impl FilterDescription {
}
}

pub fn with_child_pushdown(
adriangb (author):

More APIs 🤮. I really need to circle back to doing some whiteboard design for these. It's complex and won't be pretty but I'm sure it can be better than it is right now.

Dandandan (Contributor):

To share some experience: we recently added a similar pushdown for HashJoinExec (at Coralogix) by sharing an Arc<JoinLeftData> and comparing column hashes, and so far it seems very effective with predicate pushdown enabled.
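
A rough sketch of that idea (not Coralogix's actual code): share the build side's key hashes and filter probe batches by membership, assuming the same RandomState is used to hash both sides:

```rust
use std::collections::HashSet;
use std::sync::Arc;

use ahash::RandomState;
use arrow::array::{ArrayRef, BooleanArray};
use datafusion_common::hash_utils::create_hashes;
use datafusion_common::Result;

fn probe_membership_mask(
    probe_keys: &[ArrayRef],
    build_hashes: &Arc<HashSet<u64>>, // shared from the build side
    random_state: &RandomState,
) -> Result<BooleanArray> {
    let num_rows = probe_keys[0].len();
    let mut hashes = vec![0u64; num_rows];
    create_hashes(probe_keys, random_state, &mut hashes)?;
    // Keep only rows whose key hash appears on the build side; hash
    // collisions cause false positives, which the join itself resolves.
    Ok(hashes.iter().map(|h| Some(build_hashes.contains(h))).collect())
}
```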

adriangb (author) commented Jun 24, 2025:

I was originally planning on keeping this PR smaller but it's been growing so I might as well add the Arc<LeftData> :)

Dandandan (Contributor):

> I was originally planning on keeping this PR smaller but it's been growing so I might as well add the Arc<LeftData> :)

Feel free to PR it however you like ;)

adriangb (author):

@Dandandan any chance you'd be willing to contribute your implementation of sharing Arc<LeftData>, so we use something we know works and I don't have to re-invent the wheel? I think you can just push it to this branch.

adriangb (author):

@alamb I'd be interested to see what benchmarks say if you don't mind kicking them off?

xudong963 (Member):

> @alamb I'd be interested to see what benchmarks say if you don't mind kicking them off?

IIRC, this optimization should speed up the TPC-H benchmark, so we could run that directly. Or we could construct a small table probing a big table to see the effect.
