Spillable broadcast/shuffled hash join #721

richox · 2024-12-26T02:41:49Z

Is your feature request related to a problem? Please describe.
Currently, hash joins build a single monolithic in-memory hash table, which can cause OOM when off-heap memory is small.

Describe the solution you'd like
Add a row/memory limit for building the hash table. When the limit is exceeded, switch to a spill-merge method:

1. Shuffle the build-side data into N buckets (say, N = 1024).
2. Build a separate hash table for each bucket; small buckets can be coalesced.
3. Shuffle the probe side into the same N partitions.
4. Read each probe partition and join it with the corresponding hash table.
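
To make the proposal concrete, here is a rough sketch of the steps above in Python (not the project's actual code). Everything in it is an assumption for illustration: the names `spillable_hash_join`, `partition`, `coalesce`, and `row_limit` are hypothetical, in-memory lists stand in for spill files, and only a row limit (not a memory limit) and an inner join are modeled.

```python
from collections import defaultdict

NUM_BUCKETS = 8  # the issue suggests N=1024; kept small here for readability


def partition(rows, key_fn, num_buckets=NUM_BUCKETS):
    """Split rows into buckets by join-key hash (lists stand in for spill files)."""
    buckets = [[] for _ in range(num_buckets)]
    for row in rows:
        # Both sides use the same hash, so matching keys share a bucket index.
        buckets[hash(key_fn(row)) % num_buckets].append(row)
    return buckets


def coalesce(build_buckets, row_limit):
    """Group adjacent bucket indices so each group holds at most row_limit
    build rows (a single oversized bucket still forms its own group)."""
    groups, current, size = [], [], 0
    for i, bucket in enumerate(build_buckets):
        if current and size + len(bucket) > row_limit:
            groups.append(current)
            current, size = [], 0
        current.append(i)
        size += len(bucket)
    if current:
        groups.append(current)
    return groups


def build_table(rows, key_fn):
    """Build a key -> list-of-rows hash table."""
    table = defaultdict(list)
    for row in rows:
        table[key_fn(row)].append(row)
    return table


def spillable_hash_join(build_rows, probe_rows, build_key, probe_key,
                        row_limit, num_buckets=NUM_BUCKETS):
    """Inner join; switches to the bucketed path when the build side
    has more than row_limit rows."""
    build_rows = list(build_rows)
    if len(build_rows) <= row_limit:
        # Fast path: one monolithic hash table, as today.
        table = build_table(build_rows, build_key)
        return [(b, p) for p in probe_rows for b in table.get(probe_key(p), ())]

    # Steps 1 and 3: shuffle both sides into the same N buckets.
    build_buckets = partition(build_rows, build_key, num_buckets)
    probe_buckets = partition(probe_rows, probe_key, num_buckets)

    out = []
    # Step 2: one hash table per (coalesced) group of buckets.
    for group in coalesce(build_buckets, row_limit):
        table = build_table((r for i in group for r in build_buckets[i]), build_key)
        # Step 4: probe only the partitions that belong to this table.
        for i in group:
            for p in probe_buckets[i]:
                for b in table.get(probe_key(p), ()):
                    out.append((b, p))
    return out
```

Only one group's hash table is alive at a time, which is where the memory saving would come from. A bucket that is still larger than the limit on its own is not handled in this sketch.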

Describe alternatives you've considered
This solves the OOM problem in most cases. However, when the data is skewed, the shuffle does not help, because all rows sharing a hot key still land in the same bucket. In that situation we may fall back to sort-based joining.
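
A minimal sketch of what that per-bucket fallback could look like, again in Python and purely illustrative: `join_bucket`, `sort_merge_join`, and `row_limit` are hypothetical names, and the in-memory `sorted` calls stand in for an external (spillable) sort.

```python
def sort_merge_join(build_rows, probe_rows, build_key, probe_key):
    """Inner join by sorting both sides and merging runs of equal keys."""
    left = sorted(build_rows, key=build_key)
    right = sorted(probe_rows, key=probe_key)
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = build_key(left[i]), probe_key(right[j])
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the end of the run of equal keys on each side.
            i_end = i
            while i_end < len(left) and build_key(left[i_end]) == lk:
                i_end += 1
            j_end = j
            while j_end < len(right) and probe_key(right[j_end]) == lk:
                j_end += 1
            # Emit the cross product of the two runs.
            for b in left[i:i_end]:
                for p in right[j:j_end]:
                    out.append((b, p))
            i, j = i_end, j_end
    return out


def join_bucket(build_bucket, probe_bucket, build_key, probe_key, row_limit):
    """Pick a strategy for one bucket: hash join if the build bucket fits
    under row_limit, otherwise sort-merge. Returns (strategy, joined_rows)."""
    if len(build_bucket) > row_limit:
        # Skewed bucket: no hash table is built for it.
        return "sort_merge", sort_merge_join(
            build_bucket, probe_bucket, build_key, probe_key)
    table = {}
    for b in build_bucket:
        table.setdefault(build_key(b), []).append(b)
    joined = [(b, p) for p in probe_bucket for b in table.get(probe_key(p), ())]
    return "hash", joined
```

The check is made per bucket, so well-behaved buckets would keep using hash joins and only the skewed ones would pay for sorting.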

richox added the enhancement label Dec 26, 2024
