
[AMD][Atomics, Buffer Ops] Add support for buffer atomic RMW #5549

Merged · 25 commits · Jan 17, 2025

Conversation

@SamGinzburg (Contributor) commented Jan 7, 2025:

Overview

This PR enables the raw.ptr.buffer.atomic.* RMW ops in the AMD backend. They feature calling conventions and semantics similar to the other buffer ops in the AMD backend.

The new ops are gated behind the AMDGCN_ENABLE_BUFFER_ATOMICS environment variable, which must be used in conjunction with AMDGCN_USE_BUFFER_OPS. They are also gated behind the GPU being CDNA3 (MI300-series GPUs) for now, as the optimizations I added make assumptions regarding GFX942.
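A minimal sketch of how the gating might be enabled (the variable names come from this PR; the exact point at which the backend reads them is an assumption):

```python
import os

# Assumed workflow: set both flags before Triton compiles the kernel.
os.environ["AMDGCN_USE_BUFFER_OPS"] = "1"         # prerequisite flag
os.environ["AMDGCN_ENABLE_BUFFER_ATOMICS"] = "1"  # new flag from this PR

import triton  # the AMD backend is assumed to read the flags at compile time
```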

I originally started exploratory work on this PR to better understand the comment in LoadStoreOpToLLVM.cpp referring to buffer atomics as "more efficient". In short, I found that on their own they aren't necessarily more efficient, but using them in conjunction with more careful control over how cache-coherence ops/memory fences are emitted can improve performance by a significant fraction.

How

I've added a new buffer atomic RMW op in the AMDGPUOps dialect which has its own lowering in the backend. There are a number of checks in place to ensure that the lowering is done correctly between the ConvertToBufferOps pass and the LoadStoreOpToLLVM lowering.

The actual lowering is where most of the performance gains come from. At a high level, when non-buffer atomic RMW ops are emitted, the memory fences lower to something along the lines of:

```
buffer_wbl2 sc1
s_waitcnt lgkmcnt(0)
atomicRMWop()
s_waitcnt vmcnt(0)
buffer_inv sc1
buffer_wbl2 sc1
s_waitcnt lgkmcnt(0)
atomicRMWop()
s_waitcnt vmcnt(0)
buffer_inv sc1
```

If my understanding of the GFX942 memory model is correct, then given several assumptions regarding CDNA3, this can actually be lowered to something that resembles:

```
buffer_wbl2 sc1
s_waitcnt lgkmcnt(0)
atomicRMWop()
s_waitcnt vmcnt(0) # AMDGCN-specific cross-CU synchronization primitive
atomicRMWop()
s_waitcnt vmcnt(0)
buffer_inv sc1
```

There are comments in the code explaining the reasoning for why (I believe) this is okay.

It appears that AMD's CK library (AMD's analogue of CUTLASS) uses similar synchronization mechanisms, although I am probably missing some of the context here (https://github.com/ROCm/composable_kernel/blob/9e95d54cd2160dffc07c1197951a9ab1ca6c35f2/include/ck_tile/core/arch/amd_buffer_addressing.hpp#L619).

Results and Testing

In addition to the added lit test, I ran the existing in-tree atomic RMW tests with buffer ops + buffer atomics enabled, and they appear to pass.

Following this, I evaluated FP16 split-K GEMM with Llama shapes in tritonbench on an MI300X. Some minor modifications to the kernel were made to emit buffer ops (e.g., tl.assume calls); a sketch is shown below. For testing purposes, I disabled the non-split-K configurations. I also checked the numerical accuracy with rtol=atol=1e-4 for all shapes.
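For reference, the kind of modification involved looks roughly like the following (a hypothetical device-side epilogue sketch, not the benchmarked tritonbench kernel; the tl.assume hints let the compiler prove offsets are non-negative so the pointer arithmetic can be converted to buffer ops):

```python
import triton
import triton.language as tl

@triton.jit
def splitk_epilogue(c_ptr, acc, M, N, stride_cm, stride_cn,
                    BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr):
    # Hypothetical helper called from the GEMM kernel; `acc` is the
    # (BLOCK_M, BLOCK_N) partial-result block of one split-K slice.
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    # Hints that make non-negativity provable, enabling buffer-op conversion.
    tl.assume(pid_m >= 0)
    tl.assume(pid_n >= 0)
    tl.assume(stride_cm >= 0)
    tl.assume(stride_cn >= 0)
    offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    c_ptrs = c_ptr + offs_m[:, None] * stride_cm + offs_n[None, :] * stride_cn
    mask = (offs_m[:, None] < M) & (offs_n[None, :] < N)
    # Each split-K partial is combined into C with an atomic RMW add.
    tl.atomic_add(c_ptrs, acc, mask=mask)
```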

[figure: average TFLOPS per M-dim bucket, buffer atomics vs. regular atomics]

Each bucket in the figure above corresponds to the average TFLOPS of all shapes with the same shared M-dim.

At smaller batch sizes the performance is roughly equivalent. At BS=32, buffer atomics achieve ~50% higher TFLOPS; at BS=256, ~3.75x the TFLOPS.

Note: the purpose of this test is to evaluate the performance of buffer atomics; split-K is not always the optimal strategy for these shapes/workloads.

============================================================================================

New contributor declaration

  • I am not making a trivial change, such as fixing a typo in a comment.

  • I have written a PR description following these rules.

  • I have run pre-commit run --from-ref origin/main --to-ref HEAD.

  • Select one of the following.

    • I have added tests.
      • /test for lit tests
      • /unittest for C++ tests
      • /python/test for end-to-end tests
    • This PR does not need a test because FILL THIS IN.
  • Select one of the following.

    • I have not added any lit tests.
    • The lit tests I have added follow these best practices,
      including the "tests should be minimal" section. (Usually running Python code
      and using the instructions it generates is not minimal.)

@antiagainst (Collaborator) left a comment:


Nice! Thanks for adding support for it! I have a couple of comments. Also, could you turn AMDGCN_USE_BUFFER_OPS on for now so we can test it out? We will turn it back off before landing.

Review comments (since resolved) were left on:
- third_party/amd/python/triton_amd.cc
- third_party/amd/backend/compiler.py
- third_party/amd/lib/TritonAMDGPUToLLVM/Utility.cpp
- third_party/amd/lib/TritonAMDGPUToLLVM/Utility.h (two comments)
@giuseros (Contributor) left a comment:


This is an amazing PR! Thanks @SamGinzburg for not only extending buffer support but also coming up with a better lowering for atomic operations! I left a few comments and agree with the comments left by @antiagainst!

```
TypesMatchWith<"value and mask have the same shape", "value", "mask", "getI1SameShape($_self)",
               "($_op.getOperands().size() <= 3) || std::equal_to<>()">,
]>{
  let summary = "Load from a scalar base pointer and a tensor offset";
```
Contributor:

Is this summary correct?

@SamGinzburg (author):

Thanks for catching this; I've updated it to accurately describe atomicrmw.

```
Type bufferElementType = elementType;
if (elementType.isBF16())
  // We don't want to cast to bf16 if we are emitting buffer atomics
```
Contributor:

Why? I had a few bugs with memory operations when I was not casting bf16 to i16. Are those bugs not present for atomics?

@SamGinzburg (author) replied on Jan 10, 2025:

They are there, but in different forms. Casting to i16 causes an error in LLVM ("LLVM translation failed for operation"), and passing bf16 through causes issues with instruction selection. There is also a second issue: for loads/stores the type of the buffer is less important (you just need a correctly sized op and can bitcast later, which is what I believe the code does today). For atomic RMW, I think the type needs to be correct (e.g., fadd for fp16 vs. bf16 is different).

[screenshot: ISA documentation listing the bf16 buffer atomic instruction]

The instruction does exist (or at least according to the docs it should).

I'm going to try to reach out to the AMD/LLVM team regarding this at some point, but since buffer ops are off by default, and I had to disable Triton's bf16 atomic fadd check to trigger this, I don't think it should block the PR.
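For context, a minimal (hypothetical) kernel that would exercise this path is sketched below; as noted above, Triton's bf16 atomic fadd check had to be disabled to even reach this lowering, so this is illustrative rather than something stock Triton accepts:

```python
import triton
import triton.language as tl

@triton.jit
def bf16_atomic_fadd(out_ptr, in_ptr, n, BLOCK: tl.constexpr):
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    v = tl.load(in_ptr + offs, mask=mask)        # bf16 values
    # Lowers to an atomic RMW fadd; the element type (fp16 vs. bf16)
    # must be preserved for the correct instruction to be selected.
    tl.atomic_add(out_ptr + offs, v, mask=mask)
```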

```
@@ -164,7 +200,7 @@ void BufferEmitter::fillCommonArgs(Type type, Value rsrcDesc,
   // bit 0: GLC = 0 (atomics drop value, less coherency)
   // bits 1-2: SLC, DLC = 0 (similarly)
   // bit 3: swizzled (0 for raw)
-  Value cacheModifiers = int_val(32, 0);
+  Value cacheModifiers = int_val(32, cacheModifiersFlag);
```
Contributor:

Not mandatory, but could we sync with @yiqian1 to get her PR merged first? I would like to have the cache-modifiers story sorted properly instead of an ad hoc value only for the RMW case. But this is correct anyway, so if @yiqian1's PR takes too long to be merged, I am happy for you to proceed.

@SamGinzburg (author):

Yeah, we can do it either way; I think it's up to whichever PR is ready to land first. I don't mind rebasing.

@SamGinzburg force-pushed the PR-BufferAtomicRMW branch 2 times, most recently from 8f7ce03 to cb1a267 on January 10, 2025.
@SamGinzburg (author):

> Nice! Thanks for adding support for it! I have a couple of comments. Also, could you turn AMDGCN_USE_BUFFER_OPS on for now so we can test it out? We will turn it back off before landing.

Thanks! I've set the flag to be true for now!

@SamGinzburg force-pushed the PR-BufferAtomicRMW branch 2 times, most recently from 458a1aa to 4769d58 on January 13, 2025.
```
// CHECK: %[[scalar_ptr:.*]] = tt.addptr %arg0
%5 = tt.addptr %arg0, %1 : !tt.ptr<f32>, i32
%8 = tt.splat %5 : !tt.ptr<f32> -> tensor<1024x!tt.ptr<f32>, #blocked>
%9 = tt.addptr %8, %4 : tensor<1024x!tt.ptr<f32>, #blocked>, tensor<1024xi32, #blocked>
```
Collaborator:

We also need to CHECK that amdgpu.buffer_atomic_rmw is generated?

@antiagainst (Collaborator):

The patch LGTM now; can you resolve the conflicts so we can land, @SamGinzburg?

@scxiao (Contributor) commented Jan 15, 2025:

Hi @SamGinzburg, just wondering whether this lowering optimization is applicable to non-buffer atomics (i.e., global_atomic)? Thanks!

@SamGinzburg (author) replied Jan 15, 2025:

> Hi @SamGinzburg, just wondering whether this lowering optimization is applicable to non-buffer atomics (i.e., global_atomic)? Thanks!

Yes, I think so; buffer ops just make it easier to control the lowering. I can put up a follow-up PR that does the same for those, but we would just be emitting inline assembly, if that is okay (unless LLVM can add an optimization that does this automatically).

@scxiao commented Jan 15, 2025:

> > Hi @SamGinzburg, just wondering whether this lowering optimization is applicable to non-buffer atomics (i.e., global_atomic)? Thanks!
>
> Yes, I think so; buffer ops just make it easier to control the lowering. I can put up a follow-up PR that does the same for those, but we would just be emitting inline assembly, if that is okay (unless LLVM can add an optimization that does this automatically).

Thanks for the quick reply. One other thought: I am wondering whether the tl.atomic_add() in the split-K GEMM can use the sem input "relaxed"; see https://triton-lang.org/main/python-api/generated/triton.language.atomic_add.html#triton.language.atomic_add. The default "acq_rel" is used to create a critical section for data communication, as in this example:

```
def serialized_add(data, Lock, SEM: tl.constexpr):
```

@SamGinzburg (author):

> Thanks for the quick reply. One other thought: I am wondering whether the tl.atomic_add() in the split-K GEMM can use the sem input "relaxed". The default "acq_rel" is used to create a critical section for data communication.

Yes, that is correct: with sem="relaxed", the performance is equivalent between buffer atomics and regular atomics. With sem="acq_rel" the gap is much larger; e.g., for M=128, N=13312, K=16384, the gap is 75 vs. 177 TFLOPS with acq_rel. With sem="relaxed", both get ~228 TFLOPS.
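To make the knob concrete, the difference is just the sem argument on the atomic (a sketch; c_ptrs, acc, and mask are placeholder names from the epilogue example above):

```python
# Default: acquire-release ordering, which emits the fence sequences shown earlier.
tl.atomic_add(c_ptrs, acc, mask=mask)                 # sem="acq_rel"

# Relaxed ordering: no such fences when partials are merely accumulated.
tl.atomic_add(c_ptrs, acc, mask=mask, sem="relaxed")
```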

@SamGinzburg (author):

Tests are currently failing with "urllib.error.HTTPError: HTTP Error 524". @antiagainst, CI possibly needs to be restarted.

@antiagainst merged commit 6556ec6 into triton-lang:main on Jan 17, 2025. 7 checks passed.
antiagainst pushed a commit that referenced this pull request Jan 28, 2025
This is a minor change: when implementing PR #5549 I used `rewriter.notifyMatchFailure` in place of `return failure();`, per suggestions to leverage the MLIR infra for errors.

We should probably be consistent throughout the file and use the MLIR
infra for the other buffer ops.