
Implement conversion from FMA dot operand to linear layout #5469

Open
wants to merge 8 commits into main from fma_operand_linearlayout

Conversation

Contributor
@binarman binarman commented Dec 19, 2024

This PR

  • Introduces an FMA dot operand to linear layout converter, plus related tests
  • Fixes FMA generation; the previous version had repetitions incompatible with the blocked layout

Fixes #5423

Comment on lines 624 to 623
// Returns ["dim0", "dim1", ..., "dim<rank-1>"] in given order.
SmallVector<StringAttr> orderedOutDimNames(MLIRContext *ctx,
                                           ArrayRef<unsigned> order) {
  auto rank = order.size();
  SmallVector<StringAttr> ret;
  for (int i = 0; i < rank; i++) {
    ret.push_back(StringAttr::get(ctx, "dim" + llvm::Twine(order[i])));
  }
  return ret;
}

Contributor

This function already exists somewhere else; could you move it from there?
cc @Mogball, a candidate for that file of LL utils

Contributor Author
@binarman binarman Dec 20, 2024

I did not find exactly this function, but I found permuteDimNames; I will use it.

Contributor

It's standardOutDimNames.
Also see #5470 for a nice place to put that utility function

Contributor

Ah, you wanted the dims permuted; yeah, use permuteDimNames and the standard one, perhaps.

Contributor Author

Yes, combineCtaCgaWithShape implicitly calculates the order of repetitions from the ctaLayout argument, so I transpose all cta components into this order with the transposeOuts method.
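
For reference, a minimal sketch of what the helper above could look like when expressed through the existing utilities suggested in this thread; the exact signatures of standardOutDimNames and permuteDimNames are assumed here, not taken from the PR:

  // Hypothetical sketch, not part of this PR: produce
  // ["dim<order[0]>", "dim<order[1]>", ...] by permuting the standard
  // dimension names instead of rebuilding them by hand. Assumes
  // standardOutDimNames(ctx, rank) returns ["dim0", ..., "dim<rank-1>"]
  // and permuteDimNames(names, order) reorders them by `order`.
  SmallVector<StringAttr> orderedOutDimNames(MLIRContext *ctx,
                                             ArrayRef<unsigned> order) {
    auto names = standardOutDimNames(ctx, order.size());
    return permuteDimNames(names, SmallVector<unsigned>(order.begin(),
                                                        order.end()));
  }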

This PR introduces FMA dot operand converter and related tests.
- Fix compiler crashes in FMA.cpp
- Fix lit test
@binarman binarman force-pushed the fma_operand_linearlayout branch from 140727d to 547f7f8 on December 20, 2024 14:43
@binarman binarman changed the title from "[WIP] Implement conversion from FMA dot operand to linear layout" to "Implement conversion from FMA dot operand to linear layout" on Dec 20, 2024
@@ -97,11 +97,11 @@ module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 4 : i32, ttg.targ
#blocked = #ttg.blocked<{sizePerThread = [1, 32], threadsPerWarp = [32, 2], warpsPerCTA = [2, 2], order = [1, 0]}>
#blocked1 = #ttg.blocked<{sizePerThread = [1, 32], threadsPerWarp = [32, 2], warpsPerCTA = [4, 1], order = [1, 0]}>
module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 4 : i32, ttg.target = "hip:gfx940", "ttg.threads-per-warp" = 64 : i32} {
tt.func @neg_blocked_to_dot_op_incompatible_warp_gfx940(%arg0: tensor<32x32xf16, #blocked>) {
tt.func @neg_blocked_to_dot_op_incompatible_warp_gfx940(%arg0: tensor<128x128xf16, #blocked>) {
Contributor Author

Made this tensor larger because, with the introduction of the linear layout, the input and output tensors turned out to be compatible.

@binarman binarman marked this pull request as ready for review December 20, 2024 15:08
@binarman binarman requested a review from ptillet as a code owner December 20, 2024 15:08
Contributor
@lezcano lezcano left a comment

Nice! A few comments but overall looks good!

Do we have any test that exercises the FMA to LLVM lowering on the AMD side?

Comment on lines +219 to +220
if (!verifyCTALayout(dLayout.getCTALayout()))
return Value();
Contributor

Does this mean that this path is still preferred over LLs? Could you also make LLs the preferred path, or is there anything blocking us from doing so?

Contributor Author

I believe LL is already preferred, but I don't want to remove the legacy converter yet, just in case.

Contributor

but in which case did you hit the hard error? Can we just revert these changes?

Contributor Author

I have not hit any errors so far, but I want to compare the code generated by the legacy and the new converter to see if there are differences in performance, register usage, etc.

Contributor

Perhaps now this can be reverted before merging?

lib/Conversion/TritonGPUToLLVM/DotOpToLLVM/FMA.cpp (outdated, resolved)
lib/Conversion/TritonGPUToLLVM/MemoryOpToLLVM.cpp (outdated, resolved)
lib/Dialect/TritonGPU/IR/LinearLayoutConversions.cpp (outdated, resolved)
@@ -292,6 +293,11 @@ Value loadFMAOp(Value srcVal, Value llVal, BlockedEncodingAttr dLayout,
auto numBTiles = std::max(1u, B / shapePerCTABTile);
auto numNonKTiles = std::max(1u, NonK / shapePerCTANonKTile);

// Found discrepancy in this case,
Contributor

You meant this is a TODO?

Contributor Author

Yes, I will reword this.

// TODO: use operandLayout.getThreadOrder()
auto threadOrder = blocked.getThreadOrder();
auto warpOrder = blocked.getWarpOrder();
auto repOrder = blocked.getRepOrder();
Contributor

Should it be operandLayout.getRepOrder?

Contributor Author
@binarman binarman Dec 20, 2024

It is not implemented for the blocked parent at the moment.

@lezcano and I had a conversation about the dot operand order functions and did not come to a definite decision.

So for now I am just using the parent order functions everywhere.

auto regOrder = blocked.getOrder();
// TODO: use operandLayout.getThreadOrder()
auto threadOrder = blocked.getThreadOrder();
auto warpOrder = blocked.getWarpOrder();
Contributor

I feel like there's something wrong here. Have you tested a warp shape of [2, 2] with more than 1 warp on the k dimension?

Contributor

My point is that warps have to be broadcasted along the k dimension

Contributor

Agreed. I think you have to use warpsDotOperand here.

Contributor Author
@binarman binarman Dec 20, 2024

Have you tested a warp shape of [2, 2] with more than 1 warp on the k dimension?

Yes, this converter works with any warp shape. The trick is that I create the "thread" part of the layout using the whole K dimension, so any number of warps or threads across the k dimension will be squeezed by the combineCtaCgaWithShape call.

Let's take an example with a dot A operand of shape [m=32, k=32]:
the parent layout has perThreadShape=[1, 1], threads=[8, 4], warps=[2, 2].

  1. "per-thread" layout identityStandardND(kReg, threadSize, regOrder) will cover shape [m=1, k=32]
  2. "per-warp" layout ... * identityStandardND(kLane, threadShape, threadOrder) will cover shape [m=8, k=32*4=128]
  3. "full" layout ... * warpsDotOperand(ctx, warpShape, warpOrder, kDimIdx) will cover shape [m=16, k=3282=256]

Then I apply combineCtaCgaWithShape: it repeats the m dimension two times, but "broadcasts" the k dimension down to 32, so all threads and warps across the K dimension hold the same values.
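
To make the broadcasting concrete, here is a small standalone sketch (a toy model with the numbers from the example above, not Triton's actual implementation; folding with % K stands in for what combineCtaCgaWithShape does in this power-of-two case):

  // Toy model of the K broadcasting described above: A operand [m=32, k=32],
  // parent blocked layout with threads=[8, 4] and warps=[2, 2]. Registers of
  // one thread cover the whole K dimension, and the k offsets contributed by
  // lanes and warps fold back into K=32, so every lane/warp position along K
  // holds the same elements.
  #include <cstdio>

  int main() {
    const int K = 32;          // operand K extent
    const int kPerThread = 32; // registers of one thread span all of K
    const int lanesK = 4;      // threads along K in the parent layout
    const int warpsK = 2;      // warps along K in the parent layout
    for (int warpK = 0; warpK < warpsK; ++warpK) {
      for (int laneK = 0; laneK < lanesK; ++laneK) {
        // k index held by register 0 of this (laneK, warpK) position,
        // before and after folding into the real K extent.
        int unfolded = laneK * kPerThread + warpK * kPerThread * lanesK;
        int folded = unfolded % K;
        std::printf("warpK=%d laneK=%d: unfolded k=%3d -> folded k=%d\n",
                    warpK, laneK, unfolded, folded);
      }
    }
    return 0;
  }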

Contributor Author

Agreed. I think you have to use warpsDotOperand here.

I thought about this; the only reason I chose to go without it for now is aesthetic.

At the moment the cta tile construction looks like this:

  LinearLayout ctaLayout = identityStandardND(kReg, threadSize, regOrder)
                               .transposeOuts(repDimNames) *
                           identityStandardND(kLane, threadShape, threadOrder)
                               .transposeOuts(repDimNames) *
                           identityStandardND(kWarp, warpShape, warpOrder)
                               .transposeOuts(repDimNames);

with warpsDotOperand:

  LinearLayout ctaLayout = identityStandardND(kReg, threadSize, regOrder)
                               .transposeOuts(repDimNames) *
                           identityStandardND(kLane, threadShape, threadOrder)
                               .transposeOuts(repDimNames) *
                           warpsDotOperand(ctx, warpShape, warpOrder, kDimIdx)
                               .transposeOuts(repDimNames);

In the second variant the layout still extends beyond the K dimension due to the lane component. To make it uniform I could introduce something like laneDotOperand, but that function would be used in only one place.

Contributor

I think the per thread layout would be

[r0, r1]
[r2, r3]

Contributor Author

Oh, the [2, 2] per-thread tile is a property of the blocked parent layout.

The dot operand is slightly different: it inherits all attributes of the parent except the k dimension. The dot operand layout implicitly extends the per-thread size to [2, K] for the A operand and [K, 2] for the B operand.
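For example (assuming the [2, 2] per-thread tile above and K = 32), the A operand would behave as if its per-thread tile were [2, 32] and the B operand as if it were [32, 2]; only the non-K dimension keeps the parent's per-thread size.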

Contributor Author
@binarman binarman Dec 20, 2024

The picture you mentioned is related to the intermediate layout, before it is expanded with combineCtaCgaWithShape.

Contributor
@Jokeren Jokeren Dec 20, 2024

Got it. I think it's different from the cases where the parent is mma; we don't do implicit broadcasting on only the register dimension. That being said, I think some code cleanups have to happen later. Right now, several methods crash on this dot operand, like getElemsPerThread.

Contributor

The update looks good to me now. Thanks!

if (auto blockedLayout = mlir::dyn_cast<BlockedEncodingAttr>(parent)) {
return fmaDotToLinearLayout(*this, shape);
}
if (auto mfmaLayout = mlir::dyn_cast<AMDMfmaEncodingAttr>(parent)) {
Contributor

nit:

Suggested change
if (auto mfmaLayout = mlir::dyn_cast<AMDMfmaEncodingAttr>(parent)) {
else if (auto mfmaLayout = mlir::dyn_cast<AMDMfmaEncodingAttr>(parent)) {

Contributor Author

binarman commented Dec 20, 2024

@lezcano

Do we have any test that exercises the FMA to LLVM lowering on the AMD side?

Yes, sadly we test this only for AMD at the moment: https://github.com/triton-lang/triton/blob/main/python/test/unit/language/test_core.py#L3240

I verified manually that it works for Nvidia, but I don't think this is tested in CI at the moment.

- cleanup hash function in FMA.cpp
- add more details in TODO in SharedToDotOperandFMA.cpp
- cleanup DotOperandEncodingAttr::toLinearLayout
@binarman binarman requested review from lezcano and Jokeren December 20, 2024 21:34
@binarman
Contributor Author

@lezcano @Jokeren Could you take a look again, please?

auto regOrder = blocked.getOrder();
// TODO: use operandLayout.getThreadOrder()
auto threadOrder = blocked.getThreadOrder();
auto warpOrder = blocked.getWarpOrder();
Contributor

The update looks good to me now. Thanks!

Successfully merging this pull request may close these issues.

Support blocked dot operand layout conversion to linear layout

3 participants