Is your feature request related to a problem? Please describe.
Shuffle read is a heavy consumer of host memory. For an executor with 16 cores, it can take up to 16 * avg_shuffle_partition_size * 2 bytes. The factor of 2 comes from the concat step, where all the small pieces are concatenated into one big batch, and both the pieces and the concatenated batch reside in host memory at the same time.
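As a back-of-the-envelope sketch of that formula (the 256 MiB average partition size below is an assumed, illustrative value, not a measurement):

```scala
// Estimate of peak host memory held by shuffle read on one executor,
// following 16 * avg_shuffle_partition_size * 2 from above.
val cores = 16
val avgShufflePartitionBytes = 256L * 1024 * 1024 // assumed ~256 MiB per partition
val peakHostBytes = cores * avgShufflePartitionBytes * 2
println(s"estimated peak host memory: ${peakHostBytes / (1024L * 1024 * 1024)} GiB") // 8 GiB
```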
After #12169, #12158, and #12184 are all checked in, we should be able to make Kudo shuffle read retryable and spillable.
Another perspective for optimizing host memory usage is to reconsider: do we really need 16 threads doing shuffle read at the same time? After all, the number of concurrent GPU tasks is typically 2-4, which means most tasks will not be scheduled even if they have finished shuffle read and are awaiting execution. Maybe we should start with only 16/2 = 8 threads doing shuffle read; the other threads should not start shuffle reading, to avoid wasting host memory. But this is outside the scope of this issue.
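A minimal sketch of that idea, assuming a semaphore sized to executorCores / concurrentGpuTasks gates which task threads may read shuffle data; the names (ShuffleReadLimiter, withPermit, the constructor parameters) are illustrative and not actual spark-rapids APIs:

```scala
import java.util.concurrent.Semaphore

// Hypothetical limiter: only executorCores / concurrentGpuTasks task threads
// may materialize shuffle data in host memory at once; the rest block here
// instead of buffering partitions they cannot run on the GPU yet.
class ShuffleReadLimiter(executorCores: Int, concurrentGpuTasks: Int) {
  private val permits = math.max(1, executorCores / concurrentGpuTasks)
  private val semaphore = new Semaphore(permits)

  // Run `readShuffle` only while holding a permit.
  def withPermit[T](readShuffle: => T): T = {
    semaphore.acquire()
    try readShuffle
    finally semaphore.release()
  }
}

object ShuffleReadLimiterExample extends App {
  // 16 cores, 2 concurrent GPU tasks => 8 permits, i.e. at most 8 threads
  // fetching and concatenating shuffle blocks concurrently.
  val limiter = new ShuffleReadLimiter(executorCores = 16, concurrentGpuTasks = 2)
  val result = limiter.withPermit {
    // ... fetch and concatenate shuffle blocks here ...
    "shuffle read done"
  }
  println(result)
}
```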
binmahone changed the title from "[FEA] Kudo shuffle read should be retryable and spillable" to "[FEA] Kudo shuffle read should be retryable and spillable on Host Memory" on Feb 25, 2025.