Implement float/double atomicMin/Max in terms of integer atomics #65
(not an AMD employee) The second one is maybe best left to application developers, who will know whether there will be memory address conflicts?

Maybe interesting to @arsenm from a compiler perspective, who can perhaps recommend someone to look at the HIP implementation of the atomics?
This would be a reasonable position if HIP exposed the necessary primitives for application developers to implement the functionality efficiently themselves. For example, on NVIDIA hardware, one would do something along the lines of:

```cuda
__device__ int atomicAggInc(int *ptr) {
    unsigned mask = __match_any_sync(__activemask(), (unsigned long long)ptr);
    int leader = __ffs(mask) - 1;                 // select a leader
    int res;
    if (lane_id() == leader)                      // leader does the update
        res = atomicAdd(ptr, __popc(mask));
    res = __shfl_sync(mask, res, leader);         // get leader's old value
    return res + __popc(mask & ((1 << lane_id()) - 1)); // compute old value
}
```

Here, most of the heavy lifting is done by the SM 7.0+ primitive `__match_any_sync`.
I don't know if for CUDA there are really special SASS instructions for this.

It reduces to the corresponding PTX instruction.
Tricks like turning FP atomics into integer atomics would be best done in the backend (though FP atomicrmw does not have fast-math flags, so that's a bit annoying). We do have a pass which tries to rewrite uniform atomics to do a single atomic op and a reduction in the wave. It's currently not enabled by default.
Hi @FreddieWitherden, I would be interested in what you think about the results you get with a benchmark like the following:
What I observe when running this on a gfx90c iGPU (Ryzen 5800U) is the following:
These observations seem to correspond to the ones in section 5 of https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2021/p0493r3.pdf: atomic adds are "read-modify-write", but atomic min/max are "read-and-conditional-store". So for atomic min/max it might not be better to just re-implement them with the approach in your first message: CAS might be better. But maybe combining it with the local reduction in your next message would beat CAS?

EDIT: fixed the code to reset the counters between runs so that work actually happens.
I think the issue is more subtle than that. Everything you said about a CAS loop holds. Here, there are two possibilities: (i) the HIP min/max functions for integers are implemented poorly on your platform, or (ii) they have been optimised for a different level of contention than your test program exhibits. The first case is certainly possible given that most HIP development is focused on CDNA rather than GCN and RDNA.
I had a bug in my benchmark: no work was actually happening in later iterations because the counters being changed were never reset. After fixing that, if I run the benchmark and vary the contention via the stride parameter, the CAS is only better at the highest level of contention (stride == 1).

So indeed, it might be better to implement atomic min/max that way.
I'll note that these ideas also extend to double precision, with one catch: HIP does not provide …
Given that it is available in CUDA, if you create a ticket just for this, they will probably fix it. |
@FreddieWitherden, if you are interested, here is an example implementation of …

Such an intrinsic corresponds to "WaveMatch" in HLSL Shader Model 6.5 (https://microsoft.github.io/DirectX-Specs/d3d/HLSL_ShaderModel6_5.html#wavematch-function), supported by both NVIDIA and AMD GPUs (Vega+ as far as I can tell). I think all the warp intrinsics supported on NVIDIA GPUs could be supported on AMD GPUs as well. It is quite surprising to me that they have not all been implemented yet.

Best regards,
Currently, HIP implements atomicMin/Max for single and double precision floating point values as CAS loops. However, in fast math scenarios, on architectures with hardware support for signed/unsigned integer atomicMin/Max, a better implementation is possible. As per https://stackoverflow.com/a/72461459 for single precision:
Better implementations still are possible on NVIDIA using opportunistic warp-level programming, wherein one first checks whether any other active threads in the warp have the same `addr` and, if so, first does the reduction at the warp level. This greatly cuts down on the number of RMW operations which leave the core when there is contention. I suspect a similar idea can carry over to AMD GPUs.