Performance of Compress side for AMD GPUs #144
That does seem awfully slow (4 MB in over a second). We are aware that our HIP implementation is quite a bit slower than the CUDA implementation, though we're still observing close to 20 GB/s compression throughput at a rate of 8 bits/value on an AMD MI60 when compressing 3D double-precision data. By comparison, an NVIDIA V100 achieves 120 GB/s. Are you able to share the data and (preferably) the code so we can take a closer look? Do the uncompressed and compressed data both reside on the GPU? @cjy7117 Do you have anything to add?
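For context, the gap described above can be quantified directly from buffer size and wall time. A minimal throughput calculation using the figures quoted in this comment (the helper name is illustrative, not part of zfp):

```python
def throughput_gbps(nbytes, seconds):
    """Effective throughput in GB/s (1 GB = 1e9 bytes)."""
    return nbytes / seconds / 1e9

# 4 MB compressed in roughly a second, as reported in the issue
reported = throughput_gbps(4 * 1024**2, 1.0)

# Rough reference points quoted above
mi60_hip = 20.0    # GB/s, AMD MI60, HIP, 3D doubles at 8 bits/value
v100_cuda = 120.0  # GB/s, NVIDIA V100, CUDA

print(f"reported: {reported:.4f} GB/s")
print(f"MI60 reference is ~{mi60_hip / reported:.0f}x the reported rate")
```

At ~0.004 GB/s, the reported run is several thousand times slower than the MI60 reference, which is why a measurement artifact (such as a cold start) is a plausible suspect.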
@pkoosha A brief update: We've done some work to improve the performance of the HIP implementation. In particular, the high compression (but not decompression) overhead you're seeing is likely due to a GPU cold start, where the ROCm runtime performs lazy initialization during the first kernel launch. We're now explicitly warming the GPU during zfp_stream_set_execution(). We also improved the performance of memory copies between host and device by pinning memory pages. As a result, HIP and CUDA throughput should be roughly similar for comparable graphics cards.

The double-precision performance is still disappointing. We're currently investigating the reasons for this and will provide an update once we know more.
Sweet! Thanks for the update! Let me know once you have the numbers; I'm excited to see how it goes!
Best,
Pouya Kousha
@pkoosha FYI, the double-precision HIP performance on the
@lindstro Thanks for these numbers! They are promising. Could you please let me know which branch contains these changes and how I can make sure I'm following best practices when running on AMD GPUs? Thank you very much!
This is on the
@lindstro Thank you for the instructions on running zfp with AMD support! We have collected some throughput numbers on MI100s and A100s with different thread block sizes.

We observed that

Do you have any comments? Is there any other direction we should explore?
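The block-size experiments mentioned above follow a common pattern: sweep a set of candidate thread block sizes, time each configuration, and keep the fastest. A minimal sketch of such a sweep harness (the timed workload here is a placeholder; a real benchmark would launch the compression kernel with each block size):

```python
import time

def run_with_block_size(block_size, reps=100):
    """Stand-in for timing `reps` kernel launches at a given
    thread block size; replace the body with real launches."""
    t0 = time.perf_counter()
    for _ in range(reps):
        sum(range(block_size))       # placeholder work
    return time.perf_counter() - t0

# Power-of-two candidates typical for GPU block-size sweeps
candidates = [64, 128, 256, 512, 1024]
timings = {bs: run_with_block_size(bs) for bs in candidates}
best = min(timings, key=timings.get)
print(f"fastest block size (placeholder workload): {best}")
```

When comparing MI100 against A100 this way, it helps to report the full timing table rather than only the winner, since the best block size can differ per architecture.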
@R0n12 Thanks for sharing these results. I assume you're using the latest

I also have some more recent changes to the

Regarding MI100, I could repeat those experiments with some effort, but I believe a more appropriate comparison would be with MI250. I don't know if it matters, but I'm not using the

Finally, throughput should be roughly twice as high if the input data were doubles. Moreover, the 2D encoder (using
I am using zfp to compress a 4 MB buffer on AMD GPUs, and compared to the numbers I have for NVIDIA there is a huge delay on the compression side for a compression rate of 8 and 1-dimensional data. I was wondering if these numbers sound OK to you and if there is anything that could be done to improve the poor compression performance. Thanks.
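In zfp's fixed-rate mode, the compressed size is fixed up front by the rate, so the expected sizes and ratios for a 4 MB buffer can be checked with simple arithmetic. A sketch under the assumption that the buffer holds double-precision values, ignoring block padding and header overhead:

```python
def fixed_rate_compressed_bytes(n_values, rate_bits):
    """Compressed payload size under fixed-rate compression,
    ignoring block padding and header overhead."""
    return n_values * rate_bits // 8

buffer_bytes = 4 * 1024**2       # 4 MB input
n_doubles = buffer_bytes // 8    # 524288 double-precision values

compressed = fixed_rate_compressed_bytes(n_doubles, rate_bits=8)
ratio = buffer_bytes / compressed

print(f"{compressed} bytes compressed, {ratio:.0f}:1 ratio")
```

At 8 bits/value on 64-bit doubles this works out to an 8:1 ratio, which is useful for sanity-checking that reported timings reflect the full 4 MB of input rather than the smaller compressed output.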