HIP: add option to embed static blockDim #105

Maetveis · 2023-02-14T17:26:07Z

Using blockDim in hip kernels unfortunately incurs a large overhead, because this (dynamic) information is stored in the dispatch packet located in a host-coherent memory region. Since vkFFT always knows the work group size its going to use, just replace uses of blockDim with these values.
This means the load from non-cached memory is avoided, the dispatch pointer doesn't have to be loaded which frees up 2 SGPRs, and some indexing calculations might constant fold better.

The added option useStaticWorkGroupSize has three possible values:

-1: Disable embedding blockDim sizes, effectively the old behavior
0: Automatically enable embedding when profitable (always except for RDNA2)
1: Always enable

RDNA is disabled by default because this can actually decrease performance sometimes with the reason not fully known, details at 1

Performance of vkFFT sample 3, show 20-200% increase in bandwitdh particularly with smaller ffts. These line up with the numbers obtained from the HIP port of Gromacs where this optimization was originally applied.
Benchmarks with MI210 and V620 (equivalent to RX6900xt)
vkfft_hip_embed_blockdim.ods

Co-authored-by: [email protected]

Using blockDim in hip kernels unfortunately incurs a large overhead, because this (dynamic) information is stored in the dispatch packet located in a host-coherent memory region. Since vkFFT always knows the work group size its going to use, just replace uses of blockDim with these values. This means the load from non-cached memory is avoided, the dispatch pointer doesn't have to be loaded which frees up 2 SGPRs, and some indexing calculations might constant fold better. The added option `useStaticWorkGroupSize` has three possible values: - -1: Disable embedding blockDim sizes, effectively the old behavior - 0: Automatically enable embedding when profitable (always except for RDNA2) - 1: Always enable RDNA is disabled by default because this can actually decrease performance sometimes with the reason not fully known, details at [1] [1]: ROCm/hipamd#53 Co-authored-by: [email protected]

DTolm · 2023-03-02T12:39:24Z

Hello,

Sorry for the long reply. I am currently finishing version 1.3.0 of VkFFT with major changes to the library design aimed at better maintainability and it already does not use global indexing with blockDim. I will upload it to the develop branch of the VkFFT repo next week.

As for this pr, I can confirm that it makes code faster on MI200, however it breaks the global indexing if the axis are swapped (line 25958, for example). So I would rather leave the 1.2.33 version as it is and wait a bit for 1.3.0.

Best regards,
Dmitrii

DTolm · 2023-10-23T11:37:57Z

This has been implemented in v1.3.0. Thanks!

DTolm closed this Oct 23, 2023

bdenhollander mentioned this pull request Nov 11, 2023

Use VkFFT from OpenMM 8.1 amd/openmm-hip#10

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

HIP: add option to embed static blockDim #105

HIP: add option to embed static blockDim #105

Maetveis commented Feb 14, 2023

DTolm commented Mar 2, 2023

DTolm commented Oct 23, 2023

HIP: add option to embed static blockDim #105

HIP: add option to embed static blockDim #105

Conversation

Maetveis commented Feb 14, 2023

DTolm commented Mar 2, 2023

DTolm commented Oct 23, 2023