
Performance of Compress side for AMD GPUs #144

Open
pkoosha opened this issue Sep 17, 2021 · 8 comments

Comments

@pkoosha

pkoosha commented Sep 17, 2021

I am using zfp to compress a 4 MB buffer on AMD GPUs, and compared to the numbers I get on NVIDIA there is a huge delay on the compression side for a compression rate of 8 and 1-dimensional data. I was wondering if these numbers sound OK to you and whether anything could be done to improve the poor compression performance. Thanks.

| Compression section     | Time (s)    | Decompression section        | Time (s)    |
| ----------------------- | ----------- | ---------------------------- | ----------- |
| compress set-parameters | 0.000024398 | decompress set-parameters    | 0.000015471 |
| opening zfp stream      | 0.000010577 | opening zfp stream           | 0.000002414 |
| pure compression time   | 1.539804423 | decompress bitstream setting | 0.000017298 |
| compression dev sync    | 0.000595486 | pure decompress time         | 0.188093374 |
| clean up compress       | 0.000006698 | clean up decompress          | 0.000002639 |
@lindstro
Member

That does seem awfully slow (4 MB in over a second). We are aware that our HIP implementation is quite a bit slower than the CUDA implementation, though we're still observing close to 20 GB/s compression throughput at a rate of 8 bits/value on an AMD MI60 when compressing 3D double-precision data. By comparison, an NVIDIA V100 achieves 120 GB/s.

Are you able to share the data and (preferably) the code so we can take a closer look? Do the uncompressed and compressed data both reside on the GPU?

@cjy7117 Do you have anything to add?

@lindstro
Member

@pkoosha A brief update: We've done some work to improve the performance of the HIP implementation. In particular, the high compression (but not decompression) overhead you're seeing is likely due to a GPU cold start, where the ROCm runtime performs lazy initialization during the first kernel launch. We're now explicitly warming the GPU during zfp_stream_set_execution(). We also improved the performance of memory copies between host and device by pinning memory pages. As a result, HIP and CUDA throughput should be roughly similar for comparable graphics cards.

The double-precision performance is still disappointing. We're currently investigating the reasons for this and will provide an update once we know more.

@pkoosha
Author

pkoosha commented Feb 17, 2022 via email

@lindstro
Member

lindstro commented Feb 9, 2023

@pkoosha FYI, the double precision HIP performance on the staging branch (soon to be merged into develop) has been substantially improved by reducing register spillage. Currently decode performance on an AMD MI250X consistently outperforms NVIDIA V100, even though the staging implementation uses thread blocks of size 32 that do not align well with AMD hardware. We will remove this limitation to allow larger thread blocks, which will further help AMD performance. Below is a recent performance plot for one particular 3D data set. We've seen encode performance reach 800 GB/s on NVIDIA A100, so there's still room for improvement, but the situation is far better than it was.

[Figure: zfp decode throughput, January 2023]

@pkoosha
Author

pkoosha commented Sep 6, 2023

@lindstro thanks for these numbers! They are promising. Could you please let me know which branch contains these designs and how I can ensure I follow the best set of practices when running on AMD GPUs? Thank you very much!

@lindstro
Member

lindstro commented Sep 6, 2023

This is on the staging branch. The results are based on 3D fields of doubles (I believe one of the Miranda fields from SDRBench). You should not have to do anything in particular on AMD hardware other than enable the HIP execution policy via zfp_stream_set_execution(stream, zfp_exec_hip).

@R0n12

R0n12 commented Sep 20, 2023

@lindstro Thank you for the instructions on running zfp with AMD support! We have collected some throughput numbers on MI100s and A100s with different thread block sizes.

  • We chose the Hurricane ISABEL single-precision dataset, using the run command: zfp -f -i Pf48.bin.f32 -3 500 500 100 -z Pf48.bin.f32.zfp -o Pf48.bin.f32.zfp.out -s
  • We also enabled zfp's internal profiling flag ZFP_WITH_CUDA_PROFILE for these numbers.
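For reproducibility, enabling that profiling flag presumably happens at configure time; a hedged sketch of the build configuration (ZFP_WITH_CUDA is zfp's standard CMake switch for the CUDA backend; adjust accordingly for a HIP build):

```shell
# Hedged sketch: configure and build zfp with the CUDA backend and the
# internal profiling flag mentioned above (run from a build directory)
cmake .. -DZFP_WITH_CUDA=ON -DZFP_WITH_CUDA_PROFILE=ON
cmake --build . --parallel
```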

[Figures: MI100 vs. A100 encode/decode throughput plots and a screenshot of the raw numbers]

We observed that

  1. A100 doesn't have a noticeable advantage over MI100 in either encode (~180 GB/s vs. ~170 GB/s) or decode (~130 GB/s vs. ~145 GB/s).
  2. Changing thread block size doesn't seem to impact performance a lot.

Do you have any comments? Is there any other direction that we should explore?

@lindstro
Member

@R0n12 Thanks for sharing these results. I assume you're using the latest staging branch? I just tried to repeat this experiment using the default settings (without messing with the thread block size), and I'm recording 200 GB/s compression throughput on an older V100 at a rate of 1. At a rate of 16, compression throughput is 70 GB/s. Decompression throughput varies from 140 GB/s (rate = 1) to 62 GB/s (rate = 16). That's quite a bit faster than what you're seeing, and on an older generation GPU.

I also have some more recent changes to the staging branch that have not been pushed yet, and those suggest even higher throughput at high rates, e.g., at rate = 16, compression throughput is 140 GB/s; decompression throughput is 82 GB/s. These changes will eventually be merged into develop.

Regarding MI100, I could repeat those experiments with some effort, but I believe a more appropriate comparison would be with MI250.

I don't know if it matters, but I'm not using the -z or -o options as there's no need to write the (de)compressed data to file for these experiments. I am using -s to force decompression, however. And, of course, I'm running with -x cuda, which I'm sure you are, too.

Finally, throughput should be roughly twice as high if the input data were doubles. Moreover, the 2D encoder (using -2 500 50000) on V100 is much faster than the 3D encoder; at a rate of 1, throughput is over 500 GB/s. This likely has to do with register spillage. Decode throughput, however, suffers. We need to investigate these issues further before the next release.
