
Performance of Compress side for AMD GPUs #144

Open
pkoosha opened this issue Sep 17, 2021 · 8 comments

Comments

@pkoosha

pkoosha commented Sep 17, 2021

I am using zfp to compress a 4 MB buffer on AMD GPUs, and compared to the numbers I get on NVIDIA there is a huge delay on the compression side for a compression rate of 8 and 1-dimensional data. I was wondering if these numbers sound OK to you and whether anything could be done to improve the poor compression performance. Thanks.

| Compression section     | Time (s)    | Decompression section        | Time (s)    |
| ----------------------- | ----------- | ---------------------------- | ----------- |
| compress set-parameters | 0.000024398 | decompress set-parameters    | 0.000015471 |
| opening zfp stream      | 0.000010577 | opening zfp stream           | 0.000002414 |
| pure compression time   | 1.539804423 | decompress bitstream setting | 0.000017298 |
| compression dev sync    | 0.000595486 | pure decompress time         | 0.188093374 |
| clean up compress       | 0.000006698 | clean up decompress          | 0.000002639 |
@lindstro
Member

That does seem awfully slow (4 MB in over a second). We are aware that our HIP implementation is quite a bit slower than the CUDA implementation, though we're still observing close to 20 GB/s compression throughput at a rate of 8 bits/value on an AMD MI60 when compressing 3D double-precision data. By comparison, an NVIDIA V100 achieves 120 GB/s.

Are you able to share the data and (preferably) the code so we can take a closer look? Do the uncompressed and compressed data both reside on the GPU?

@cjy7117 Do you have anything to add?

@lindstro
Member

@pkoosha A brief update: We've done some work to improve the performance of the HIP implementation. In particular, the high compression (but not decompression) overhead you're seeing is likely due to a GPU cold start, where the ROCm runtime performs lazy initialization during the first kernel launch. We're now explicitly warming the GPU during zfp_stream_set_execution(). We also improved the performance of memory copies between host and device by pinning memory pages. As a result, HIP and CUDA throughput should be roughly similar for comparable graphics cards.

The double-precision performance is still disappointing. We're currently investigating the reasons for this and will provide an update once we know more.

@pkoosha
Author

pkoosha commented Feb 17, 2022 via email

@lindstro
Member

lindstro commented Feb 9, 2023

@pkoosha FYI, the double precision HIP performance on the staging branch (soon to be merged into develop) has been substantially improved by reducing register spillage. Currently decode performance on an AMD MI250X consistently outperforms NVIDIA V100, even though the staging implementation uses thread blocks of size 32 that do not align well with AMD hardware. We will remove this limitation to allow larger thread blocks, which will further help AMD performance. Below is a recent performance plot for one particular 3D data set. We've seen encode performance reach 800 GB/s on NVIDIA A100, so there's still room for improvement, but the situation is far better than it was.

[Figure: zfp decode throughput, January 2023]

@pkoosha
Author

pkoosha commented Sep 6, 2023

@lindstro thanks for these numbers! They are promising. Could you please let me know which branch contains these designs and how I can ensure I follow the best set of practices when running on AMD GPUs? Thank you very much!

@lindstro
Member

lindstro commented Sep 6, 2023

This is on the staging branch. The results are based on 3D fields of doubles (I believe one of the Miranda fields from SDRBench). You should not have to do anything in particular on AMD hardware other than enable the HIP execution policy via zfp_stream_set_execution(stream, zfp_exec_hip).

@R0n12

R0n12 commented Sep 20, 2023

@lindstro Thank you for the instructions on running zfp with AMD support! We have collected some throughput numbers on MI100s and A100s with different thread block sizes.

  • We chose the Hurricane ISABEL single-precision dataset, using the run command: zfp -f -i Pf48.bin.f32 -3 500 500 100 -z Pf48.bin.f32.zfp -o Pf48.bin.f32.zfp.out -s
  • We also enabled zfp's internal profiling flag ZFP_WITH_CUDA_PROFILE for these numbers.
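For reproducibility, enabling that profiling flag presumably happens at configure time; a hedged sketch of the build configuration (ZFP_WITH_CUDA is zfp's standard CMake switch for the CUDA backend; adjust accordingly for a HIP build):

```shell
# Hedged sketch: configure and build zfp with the CUDA backend and the
# internal profiling flag mentioned above (run from a build directory)
cmake .. -DZFP_WITH_CUDA=ON -DZFP_WITH_CUDA_PROFILE=ON
cmake --build . --parallel
```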

[Figures: MI100 vs. A100 encode/decode throughput plots and a screenshot of the raw numbers]

We observed that

  1. A100 doesn't have a noticeable advantage over MI100 in either encode (~180 GB/s vs. ~170 GB/s) or decode (~130 GB/s vs. ~145 GB/s).
  2. Changing thread block size doesn't seem to impact performance a lot.

Do you have any comments? Is there any other direction that we should explore?

@lindstro
Member

@R0n12 Thanks for sharing these results. I assume you're using the latest staging branch? I just tried to repeat this experiment using the default settings (without messing with the thread block size), and I'm recording 200 GB/s compression throughput on an older V100 at a rate of 1. At a rate of 16, compression throughput is 70 GB/s. Decompression throughput varies from 140 GB/s (rate = 1) to 62 GB/s (rate = 16). That's quite a bit faster than what you're seeing, and on an older generation GPU.

I also have some more recent changes to the staging branch that have not been pushed yet, and those suggest even higher throughput at high rates, e.g., at rate = 16, compression throughput is 140 GB/s; decompression throughput is 82 GB/s. These changes will eventually be merged into develop.

Regarding MI100, I could repeat those experiments with some effort, but I believe a more appropriate comparison would be with MI250.

I don't know if it matters, but I'm not using the -z or -o options as there's no need to write the (de)compressed data to file for these experiments. I am using -s to force decompression, however. And, of course, I'm running with -x cuda, which I'm sure you are, too.

Finally, throughput should be roughly twice as high if the input data were doubles. Moreover, the 2D encoder (using -2 500 50000) on V100 is much faster than the 3D encoder; at a rate of 1, throughput is over 500 GB/s. This likely has to do with register spillage. Decode throughput, however, suffers. We need to investigate these issues further before the next release.
