Improve throughput for hot cache file reading on GH200 #629
Comments
So far the fastest I could get was about 50 GiB/s on GH200. If you go wide and use all 72 threads, and increase the task size and bounce buffer size to 16 MiB, you can push things a bit further. You see all the threads spin up, but it takes 9 ms before the first copy happens. Somehow the threads are getting serialized, perhaps in the OS memory management system somewhere.
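A minimal sketch of how that configuration could be set from Python, assuming KvikIO's `KVIKIO_NTHREADS`, `KVIKIO_TASK_SIZE`, and `KVIKIO_BOUNCE_BUFFER_SIZE` environment variables (they must be set before KvikIO initializes, so double-check the names against the KvikIO docs for your version):

```python
# Illustrative only: go wide on GH200 by raising KvikIO's thread count and
# buffer sizes. The environment variable names are assumed from KvikIO's
# runtime settings and must be set before KvikIO initializes (i.e. before
# importing cudf/kvikio).
import os

os.environ["KVIKIO_NTHREADS"] = "72"                             # one thread per Grace core
os.environ["KVIKIO_TASK_SIZE"] = str(16 * 1024 * 1024)           # 16 MiB tasks
os.environ["KVIKIO_BOUNCE_BUFFER_SIZE"] = str(16 * 1024 * 1024)  # 16 MiB bounce buffer

import cudf  # noqa: E402  -- import after configuring the environment
```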
Exploring a similar comparison with a small, 320 MB file gives similar results. For GH200, the 32-thread performance improves and task sizes above 8 MiB degrade. For Viking, the 8-thread performance improves and a 4 MiB task size remains optimal. Looking closer at the 32-thread, 1 MiB task case on GH200, the 320 MB file shows better utilization of the C2C, around 18%. This is still less than the bulk pageable host buffer case, which sustains 26% C2C utilization, but much better than the current default of around 4%.

Overall, for x86-H100 it seems that the default setting of 4 threads and 4 MiB is a good choice for Dask-cuDF with PCIe-connected GPUs, although 8 threads would often be better for a single worker. This would be a good study to repeat on L4/L40, and also on an ARM-PCIe-GPU system. For GH200, it seems like we need to push the thread count higher. We should do some testing with cuDF-Polars and Curator, and then update libcudf to use dynamic settings for KvikIO.
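As a purely hypothetical illustration of what such dynamic settings could look like (the `kvikio_settings` helper and the aarch64 check are stand-ins, not an existing libcudf or KvikIO API; the numbers come from the sweep above):

```python
import platform


def kvikio_settings() -> dict[str, int]:
    """Hypothetical heuristic: pick KvikIO thread count and task size based on
    the sweep above. Real code would detect the GPU interconnect (C2C vs. PCIe)
    rather than just the CPU architecture."""
    mib = 1024 * 1024
    if platform.machine() == "aarch64":
        # GH200-style C2C systems benefited from many more threads and smaller tasks.
        return {"num_threads": 32, "task_size": 1 * mib}
    # PCIe-connected x86 GPUs: the current defaults are already near optimal.
    return {"num_threads": 8, "task_size": 4 * mib}
```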
When reading hot cache files with KvikIO's threadpool, we see good utilization of the PCIe bandwidth on x86-H100 systems. However, we see poor utilization of the C2C bandwidth on GH200 systems.
Here is an example that writes a 1.2 GB Parquet file, uncompressed and plain-encoded, and reads it both as a hot cache file and as a host buffer.
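The exact script is not reproduced here; below is a rough sketch of that kind of benchmark, assuming pyarrow for the uncompressed, plain-encoded write and `cudf.read_parquet` for the reads (the column layout and timing harness are illustrative):

```python
import io
import time

import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

import cudf

path = "hot_cache_bench.parquet"

# Roughly 1.2 GB of int64 data: 3 columns x 50M rows x 8 bytes.
nrows = 50_000_000
table = pa.table({f"c{i}": np.random.randint(0, 1 << 62, nrows) for i in range(3)})

# Uncompressed and plain encoded (disabling dictionary encoding gives PLAIN).
pq.write_table(table, path, compression="NONE", use_dictionary=False)

# Read the whole file once so the page cache is hot, and keep a host buffer copy.
with open(path, "rb") as f:
    host_buf = f.read()


def bench(label, fn):
    start = time.perf_counter()
    fn()
    print(f"{label}: {(time.perf_counter() - start) * 1e3:.1f} ms")


bench("hot cache file", lambda: cudf.read_parquet(path))
bench("host buffer", lambda: cudf.read_parquet(io.BytesIO(host_buf)))
```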
On PCIe-connected x86-H100, we see that the hot cache file takes 63 ms and the host buffer takes 130 ms. This suggests that the KvikIO threadpool may be more efficient than the CUDA driver at moving pageable host data over the PCIe bus (so perhaps we should consider re-opening #456).
More importantly, on GH200 we see that the hot cache file takes 60 ms and the host buffer takes 13 ms. This suggests that the KvikIO threadpool is much less efficient than the CUDA driver at moving pageable host data over the C2C interconnect. We should develop a new default setting for file reading on GH200 that gets closer to the throughput of pageable host buffer copying.