
Join on deltalake partitioned DataFrames poor performance #3600

Open
jpedrick-numeus opened this issue Dec 18, 2024 · 5 comments
Labels
data-catalogs (Related to support for Data Catalogs), p2 (Nice to have features), perf

Comments

@jpedrick-numeus

Describe the bug

I'm joining an S3 inventory report of Hive-structured data (using Ray) that I've translated into a Delta Lake table partitioned by "bucket" and "part0", where "part0" is the first feature of the Hive structure. Regardless of the specifics of the format, I'm joining it with another Delta Lake table that has identical columns and partitioning. The join is significantly slower if I run it over the whole tables than if I run it in pieces using a filter. There are around 2500 partitions. I haven't timed it precisely, but it feels like some kind of n*n operation, where each filtered join is roughly equivalent to n / 2500.

This seems off, since I would expect the Delta Lake partitioning to give the join's query planner/optimizer enough information to transform the first version (without filtering) into the second.

extremely slow:

s3_df = daft.read_deltalake(...)
metadata_df = daft.read_deltalake(...)
joined = s3_df.join(
    metadata_df,
    on=['bucket', 'part0', 'part1', 'part2'],
    how='outer',
).write_parquet(...)

acceptable performance:

s3_df = daft.read_deltalake(...)
metadata_df = daft.read_deltalake(...)
for bucket, part0 in parts:
    partition_filter = (daft.col('bucket') == bucket) & (
        daft.col('part0') == part0
    )
    joined = s3_df.filter(partition_filter).join(
        metadata_df.filter(partition_filter),
        on=['bucket', 'part0', 'part1', 'part2'],
        how='outer',
    ).write_parquet(...)
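
Note: `parts` above is the list of (bucket, part0) partition pairs. The original snippet doesn't show how it's built; one way to construct it, sketched here as an assumption, is to collect the distinct partition values first:

# Hypothetical helper to enumerate the (bucket, part0) pairs used in the loop
# above; not taken from the original report.
pairs = s3_df.select('bucket', 'part0').distinct().to_pydict()
parts = list(zip(pairs['bucket'], pairs['part0']))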

To Reproduce

Join two DataFrames with identical partitioning, where the partition columns are the join columns, without pre-filtering, and write the resulting DataFrame. In the fast case, pre-filter both sides on each partition and write the data in pieces.

Expected behavior

These two joins should perform somewhat similarly.

Component(s)

SQL, Ray Runner, Parquet, Other

Additional context

I'm running on a single-node Ray cluster with pretty slow EBS/disk I/O. I suspect part of the issue is the scope and scale of the repartitioning done for the join, which results in a large amount of disk caching (also not ideal).

@jpedrick-numeus jpedrick-numeus added bug Something isn't working needs triage labels Dec 18, 2024
@raunakab
Contributor

Hey @jpedrick-numeus, thanks for opening the issue! I'm taking a look through it and trying to understand the problem.

@raunakab
Contributor

Roping @kevinzwang into the discussion since he's worked with Deltalake in the past.

@kevinzwang
Member

Hi @jpedrick-numeus, thanks for raising this. Although it is true that your Delta Lake partitioning scheme allows for a more efficient join, Daft currently does not have the logic to take advantage of that and instead still shuffles the whole tables during the join.

This is definitely an optimization we are looking to do in the future, so I'm glad to see that there is interest in it! @jaychia I believe you have been thinking about this too, do you have anything to add?
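
In the meantime, a sketch of how to confirm this (assuming the same setup as the snippets above, and that your Daft version exposes `explain(show_all=True)`): inspecting the plan for the unfiltered join should show the whole-table repartition ahead of the join.

# Sketch: build the unfiltered join and print its logical/physical plans;
# the repartition/shuffle before the join should be visible in the output.
joined = s3_df.join(metadata_df, on=['bucket', 'part0', 'part1', 'part2'], how='outer')
joined.explain(show_all=True)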

@kevinzwang kevinzwang added p2 Nice to have features and removed bug Something isn't working labels Dec 18, 2024
@jaychia
Contributor

jaychia commented Dec 19, 2024

Yes, great catch @jpedrick-numeus! I was actually just chatting with @samster25 about this yesterday.

This is what's called a storage-partitioned join. Here is the Iceberg proposal for it: https://docs.google.com/document/d/1foTkDSM91VxKgkEcBMsuAvEjNybjja-uHk-r3vtXWFE/edit?tab=t.0#heading=h.82w8qxfl2uwl

For Daft, we should add a logical notion of partitioning that "groups" sets of physical MicroPartitions. This will let us perform a per-logical-partition join without shuffling between logical partitions!
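
As a rough illustration of the idea (purely a conceptual sketch, not Daft's actual internals): once both sides share a partitioning on the join keys, each logical partition can be joined against its counterpart independently, with no shuffle across partitions.

# Conceptual sketch of a storage-partitioned join; names and structure are
# illustrative only. Each side is a dict mapping a partition key, e.g.
# (bucket, part0), to that partition's rows.
def storage_partitioned_join(left_parts, right_parts, join_keys):
    out = []
    for key in set(left_parts) | set(right_parts):
        left_rows = left_parts.get(key, [])
        right_rows = right_parts.get(key, [])
        # Only co-located partitions are joined; no cross-partition shuffle.
        for l in left_rows:
            for r in right_rows:
                if all(l[k] == r[k] for k in join_keys):
                    out.append({**l, **r})
    return out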

@jpedrick-numeus
Author

@jaychia @kevinzwang thanks for the quick reply! Glad to hear this is on the roadmap. I suspect this kind of fundamental operation will be valuable to many Daft users, so I hope it can get prioritized.

@jaychia jaychia added the data-catalogs Related to support for Data Catalogs label Dec 20, 2024