Is your feature request related to a problem? Please describe.
Shuffle read is a heavy consumer of host memory. For an executor with 16 cores, it can take up to 16 * avg_shuffle_partition_size * 2 bytes. The factor of 2 comes from the concat step, where all the small pieces are concatenated into one big batch, and both the pieces and the concatenated batch reside in host memory at the same time.
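As a back-of-the-envelope sketch of that formula (the 256 MiB average partition size below is an assumed, illustrative value, not a measurement):

```scala
// Estimate of peak host memory held by shuffle read on one executor,
// following 16 * avg_shuffle_partition_size * 2 from above.
val cores = 16
val avgShufflePartitionBytes = 256L * 1024 * 1024 // assumed ~256 MiB per partition
val peakHostBytes = cores * avgShufflePartitionBytes * 2
println(s"estimated peak host memory: ${peakHostBytes / (1024L * 1024 * 1024)} GiB") // 8 GiB
```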
After #12169, #12158, and #12184 are all checked in, we should be able to make Kudo shuffle read retryable and spillable.
Another perspective for optimizing host memory usage is to reconsider: do we really need 16 threads doing shuffle read at the same time? After all, the number of concurrent GPU tasks is typically 2-4, which means most tasks will not be scheduled even if they have finished shuffle read and are awaiting execution. Maybe we should start with only 16/2 = 8 threads doing shuffle read; the other threads should not start shuffle reading, to avoid wasting host memory. But this is outside the scope of this issue.
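A minimal sketch of that idea, assuming a semaphore sized to executorCores / concurrentGpuTasks gates which task threads may read shuffle data; the names (ShuffleReadLimiter, withPermit, the constructor parameters) are illustrative and not actual spark-rapids APIs:

```scala
import java.util.concurrent.Semaphore

// Hypothetical limiter: only executorCores / concurrentGpuTasks task threads
// may materialize shuffle data in host memory at once; the rest block here
// instead of buffering partitions they cannot run on the GPU yet.
class ShuffleReadLimiter(executorCores: Int, concurrentGpuTasks: Int) {
  private val permits = math.max(1, executorCores / concurrentGpuTasks)
  private val semaphore = new Semaphore(permits)

  // Run `readShuffle` only while holding a permit.
  def withPermit[T](readShuffle: => T): T = {
    semaphore.acquire()
    try readShuffle
    finally semaphore.release()
  }
}

object ShuffleReadLimiterExample extends App {
  // 16 cores, 2 concurrent GPU tasks => 8 permits, i.e. at most 8 threads
  // fetching and concatenating shuffle blocks concurrently.
  val limiter = new ShuffleReadLimiter(executorCores = 16, concurrentGpuTasks = 2)
  val result = limiter.withPermit {
    // ... fetch and concatenate shuffle blocks here ...
    "shuffle read done"
  }
  println(result)
}
```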
binmahone changed the title from "[FEA] Kudo shuffle read should be retryable and spillable" to "[FEA] Kudo shuffle read should be retryable and spillable on Host Memory" on Feb 25, 2025.