Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HIP: add option to embed static blockDim #105

Closed
wants to merge 1 commit into from

Conversation

Maetveis
Copy link
Contributor

Using blockDim in hip kernels unfortunately incurs a large overhead, because this (dynamic) information is stored in the dispatch packet located in a host-coherent memory region. Since vkFFT always knows the work group size its going to use, just replace uses of blockDim with these values.
This means the load from non-cached memory is avoided, the dispatch pointer doesn't have to be loaded which frees up 2 SGPRs, and some indexing calculations might constant fold better.

The added option useStaticWorkGroupSize has three possible values:

  • -1: Disable embedding blockDim sizes, effectively the old behavior
  • 0: Automatically enable embedding when profitable (always except for RDNA2)
  • 1: Always enable

RDNA is disabled by default because this can actually decrease performance sometimes with the reason not fully known, details at 1

Performance of vkFFT sample 3, show 20-200% increase in bandwitdh particularly with smaller ffts. These line up with the numbers obtained from the HIP port of Gromacs where this optimization was originally applied.
Benchmarks with MI210 and V620 (equivalent to RX6900xt)
vkfft_hip_embed_blockdim.ods

Co-authored-by: [email protected]

Using blockDim in hip kernels unfortunately incurs a large overhead,
because this (dynamic) information is stored in the dispatch packet
located in a host-coherent memory region. Since vkFFT always knows the
work group size its going to use, just replace uses of blockDim with
these values.
This means the load from non-cached memory is avoided, the dispatch
pointer doesn't have to be loaded which frees up 2 SGPRs, and some
indexing calculations might constant fold better.

The added option `useStaticWorkGroupSize` has three possible values:
- -1: Disable embedding blockDim sizes, effectively the old behavior
-  0: Automatically enable embedding when profitable (always except for RDNA2)
-  1: Always enable

RDNA is disabled by default because this can actually decrease performance
sometimes with the reason not fully known, details at [1]

[1]: ROCm/hipamd#53

Co-authored-by: [email protected]
@DTolm
Copy link
Owner

DTolm commented Mar 2, 2023

Hello,

Sorry for the long reply. I am currently finishing version 1.3.0 of VkFFT with major changes to the library design aimed at better maintainability and it already does not use global indexing with blockDim. I will upload it to the develop branch of the VkFFT repo next week.

As for this pr, I can confirm that it makes code faster on MI200, however it breaks the global indexing if the axis are swapped (line 25958, for example). So I would rather leave the 1.2.33 version as it is and wait a bit for 1.3.0.

Best regards,
Dmitrii

@DTolm
Copy link
Owner

DTolm commented Oct 23, 2023

This has been implemented in v1.3.0. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants