Improve throughput for hot cache file reading on GH200 #629
Comments
So far the fastest I could get was about 50 GiB/s on GH200. If you go wide and use all 72 threads, and increase the task size and bounce buffer size to 16 MiB, you can push things a bit further. You see all the threads spin up, but it takes 9 ms before the first copy happens. Somehow the threads are getting serialized, perhaps in the OS memory management system somewhere.
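A minimal sketch of how that configuration could be set from Python, assuming KvikIO's `KVIKIO_NTHREADS`, `KVIKIO_TASK_SIZE`, and `KVIKIO_BOUNCE_BUFFER_SIZE` environment variables (they must be set before KvikIO initializes, so double-check the names against the KvikIO docs for your version):

```python
# Illustrative only: go wide on GH200 by raising KvikIO's thread count and
# buffer sizes. The environment variable names are assumed from KvikIO's
# runtime settings and must be set before KvikIO initializes (i.e. before
# importing cudf/kvikio).
import os

os.environ["KVIKIO_NTHREADS"] = "72"                             # one thread per Grace core
os.environ["KVIKIO_TASK_SIZE"] = str(16 * 1024 * 1024)           # 16 MiB tasks
os.environ["KVIKIO_BOUNCE_BUFFER_SIZE"] = str(16 * 1024 * 1024)  # 16 MiB bounce buffer

import cudf  # noqa: E402  -- import after configuring the environment
```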
Exploring a similar comparison with a small, 320 MB file gives similar results. For GH200, the 32-thread performance improves and task sizes above 8 MiB degrade. For Viking, the 8-thread performance improves and a 4 MiB task size remains optimal. Looking closer at the 32-thread, 1 MiB task case on GH200, the 320 MB file shows better utilization of the C2C, around 18%. This is still less than the bulk pageable host buffer case, which sustains 26% C2C utilization, but much better than the current default of around 4%.

Overall, for x86-H100 it seems that the default setting of 4 threads and 4 MiB is a good choice for Dask-cuDF with PCIe-connected GPUs, although 8 threads would often be better for a single worker. This would be a good study to repeat on L4/L40, and also on an ARM-PCIe-GPU system. For GH200, it seems like we need to push the thread count higher. We should do some testing with cuDF-Polars and Curator, and then update libcudf to use dynamic settings for KvikIO.
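As a purely hypothetical illustration of what such dynamic settings could look like (the `kvikio_settings` helper and the aarch64 check are stand-ins, not an existing libcudf or KvikIO API; the numbers come from the sweep above):

```python
import platform


def kvikio_settings() -> dict[str, int]:
    """Hypothetical heuristic: pick KvikIO thread count and task size based on
    the sweep above. Real code would detect the GPU interconnect (C2C vs. PCIe)
    rather than just the CPU architecture."""
    mib = 1024 * 1024
    if platform.machine() == "aarch64":
        # GH200-style C2C systems benefited from many more threads and smaller tasks.
        return {"num_threads": 32, "task_size": 1 * mib}
    # PCIe-connected x86 GPUs: the current defaults are already near optimal.
    return {"num_threads": 8, "task_size": 4 * mib}
```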
When reading hot cache files with KvikIO's threadpool, we see good utilization of the PCIe bandwidth on x86-H100 systems. However, we see poor utilization of the C2C bandwidth on GH200 systems.
Here is an example that writes a 1.2 GB Parquet file, uncompressed and plain-encoded, and reads it both as a hot cache file and as a host buffer.
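The exact script is not reproduced here; below is a rough sketch of that kind of benchmark, assuming pyarrow for the uncompressed, plain-encoded write and `cudf.read_parquet` for the reads (the column layout and timing harness are illustrative):

```python
import io
import time

import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

import cudf

path = "hot_cache_bench.parquet"

# Roughly 1.2 GB of int64 data: 3 columns x 50M rows x 8 bytes.
nrows = 50_000_000
table = pa.table({f"c{i}": np.random.randint(0, 1 << 62, nrows) for i in range(3)})

# Uncompressed and plain encoded (disabling dictionary encoding gives PLAIN).
pq.write_table(table, path, compression="NONE", use_dictionary=False)

# Read the whole file once so the page cache is hot, and keep a host buffer copy.
with open(path, "rb") as f:
    host_buf = f.read()


def bench(label, fn):
    start = time.perf_counter()
    fn()
    print(f"{label}: {(time.perf_counter() - start) * 1e3:.1f} ms")


bench("hot cache file", lambda: cudf.read_parquet(path))
bench("host buffer", lambda: cudf.read_parquet(io.BytesIO(host_buf)))
```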
On PCIe-connected x86-H100, we see that the hot cache file takes 63 ms and the host buffer takes 130 ms. This suggests that the KvikIO threadpool may be more efficient than the CUDA driver at moving pageable host data over the PCIe bus (so perhaps we should consider re-opening #456).
More importantly, on GH200 we see that the hot cache file takes 60 ms and the host buffer takes 13 ms. This suggests that the KvikIO threadpool is much less efficient than the CUDA driver at moving pageable host data over the C2C interconnect. We should develop a new default setting for file reading on GH200 that gets closer to the throughput of pageable host buffer copying.