[AMD][FA] Improve warp distribution for attention second dot #5892

zhanglx13 · 2025-02-12T02:01:22Z

No description provided.

antiagainst · 2025-02-12T04:20:26Z

third_party/amd/lib/TritonAMDGPUTransforms/AccelerateAMDMatmul.cpp

+// Check if the result of current tl.dot is used as the operand(0)
+// of another tl.dot
+bool isChainDotHead(tt::DotOp &dotOp) {
+  auto filter = [&dotOp](Operation *op) {


Typically we want self-documenting function and variable names for readability. It's not a big concern here given it's a one-liner, but isInSameRegion would be better than the generic name filter.

antiagainst · 2025-02-12T04:24:18Z

third_party/amd/lib/TritonAMDGPUTransforms/AccelerateAMDMatmul.cpp

+    if (isa<tt::DotOp>(op) && (op != dotOp)) {
+      auto dOp = dyn_cast<tt::DotOp>(op);


op should never be dotOp here, given that we don't set fwdOpt.inclusive to true? To be certain, you can add an assert above. Also, in MLIR we typically write:

if (auto userDotOp = dyn_cast<tt::DotOpInterface>(op)) { ... }

Then in the middle you don't need to cast again.
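
The cast-in-condition idiom can be sketched outside MLIR as follows. This is a standalone model, not the code from the PR: `Operation`, `DotLikeOp`, `OtherOp`, and `countDotUsers` are hypothetical names, and standard `dynamic_cast` stands in for LLVM's `dyn_cast`.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-ins for MLIR's Operation and a dot-like op.
struct Operation {
  virtual ~Operation() = default;
};
struct DotLikeOp : Operation {
  int id;
  explicit DotLikeOp(int i) : id(i) {}
};
struct OtherOp : Operation {};

// Counts dot-like users, casting exactly once per user. The cast result is
// bound in the `if` condition, so the body uses the typed pointer directly
// instead of doing a type check followed by a second cast.
int countDotUsers(const std::vector<Operation *> &users) {
  int count = 0;
  for (Operation *op : users) {
    if (auto *userDotOp = dynamic_cast<DotLikeOp *>(op)) {
      (void)userDotOp->id; // typed access, no further cast needed
      ++count;
    }
  }
  return count;
}
```

Binding the result in the condition also keeps the typed variable scoped to the branch where the cast is known to have succeeded.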

antiagainst · 2025-02-12T04:28:06Z

third_party/amd/lib/TritonAMDGPUTransforms/AccelerateAMDMatmul.cpp

+    // ensure output of the first dot is the operand 0 of the second dot
+    if (isa<tt::DotOp>(op) && (op != dotOp)) {
+      auto dOp = dyn_cast<tt::DotOp>(op);
+      auto op0 = dOp.getOperand(0).getDefiningOp();


Prefer to use friendly accessors like .getA().

antiagainst · 2025-02-12T04:29:44Z

third_party/amd/lib/TritonAMDGPUTransforms/AccelerateAMDMatmul.cpp

+      if (op0 && std::find(fwdSlices.begin(), fwdSlices.end(), op0) !=
+                     fwdSlices.end()) {


You can use fwdSlices.contains here?
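
A rough illustration of the difference, assuming fwdSlices is a set-like container with a membership query. `SimpleSetVector` and `foundByScan` below are hypothetical and written only for this example; the class is loosely modelled on the idea of an insertion-ordered set and is not LLVM's implementation.

```cpp
#include <algorithm>
#include <cassert>
#include <unordered_set>
#include <vector>

// Hypothetical insertion-ordered set: a vector for ordering plus a hash set
// for fast membership queries.
template <typename T> class SimpleSetVector {
public:
  // Returns true if the value was newly inserted.
  bool insert(const T &v) {
    if (!set_.insert(v).second)
      return false;
    vec_.push_back(v);
    return true;
  }
  // Membership query backed by the hash set.
  bool contains(const T &v) const { return set_.count(v) != 0; }
  typename std::vector<T>::const_iterator begin() const { return vec_.begin(); }
  typename std::vector<T>::const_iterator end() const { return vec_.end(); }

private:
  std::vector<T> vec_;
  std::unordered_set<T> set_;
};

// The longer form from the diff: a linear scan over the elements.
bool foundByScan(const SimpleSetVector<int> &s, int v) {
  return std::find(s.begin(), s.end(), v) != s.end();
}
```

Both forms give the same answer; the membership query is shorter to read and does not scan the whole sequence.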

antiagainst · 2025-02-12T04:40:54Z

third_party/amd/lib/TritonAMDGPUTransforms/AccelerateAMDMatmul.cpp

@@ -44,6 +44,50 @@ int getWmmaVersion(StringRef archGen) {
  return 0;
 }

+// Check if the result of current tl.dot is used as the operand(0)
+// of another tl.dot
+bool isChainDotHead(tt::DotOp &dotOp) {


In these two functions, use DotOpInterface instead of hardcoding tt::DotOp so that it also works for DotScaledOp. Also, for friendly named ops (that is, tt::DotOp, not Operation *), we typically don't pass by reference; we pass by value, because ops are just a wrapper around Operation *.
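
The pass-by-value point can be sketched with a minimal handle type. Everything here (`OperationStorage`, `DotHandle`, `recordUser`) is a hypothetical model of "a typed wrapper around a pointer", not MLIR's actual classes.

```cpp
#include <cassert>

// Hypothetical stand-in for the underlying operation storage.
struct OperationStorage {
  int numUsers = 0;
};

// A thin typed handle: it holds only a pointer, so copying the handle costs
// the same as copying the pointer.
class DotHandle {
public:
  explicit DotHandle(OperationStorage *op) : op_(op) {}
  void addUser() const { ++op_->numUsers; }
  int numUsers() const { return op_->numUsers; }

private:
  OperationStorage *op_;
};

// Taking the handle by value still updates the shared underlying storage,
// because the copy points at the same OperationStorage.
void recordUser(DotHandle dot) { dot.addUser(); }
```

Under this model a reference parameter adds an indirection without saving a copy of anything larger than a pointer.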

antiagainst · 2025-02-12T04:42:42Z

third_party/amd/lib/TritonAMDGPUTransforms/AccelerateAMDMatmul.cpp

+
+// Check if the operand(0) of current tl.dot is the result of
+// another tl.dot
+bool isChainDotTail(tt::DotOp &dotOp) {


Similarly for this function.

antiagainst · 2025-02-12T04:43:04Z

third_party/amd/lib/TritonAMDGPUTransforms/AccelerateAMDMatmul.cpp

+  // For FA-like pattern, i.e. result of 1st tl.dot is used as the
+  // operand(0) of the 2nd dot.
+  // We use {numWarps, 1} for both tl.dots


These three lines should be the same sentence.

antiagainst · 2025-02-12T04:48:09Z

third_party/amd/lib/TritonAMDGPUTransforms/AccelerateAMDMatmul.cpp

+  auto ttDotOp = dyn_cast<tt::DotOp>(dotOp);
+  if (isChainDotHead(ttDotOp) || isChainDotTail(ttDotOp)) {
+    if ((shape[0] == shapePerWarp.first) && isChainDotTail(ttDotOp))
+      return {1, (unsigned)numWarps};


We need to add a lit test for this.

antiagainst · 2025-02-12T04:53:39Z

third_party/amd/lib/TritonAMDGPUTransforms/AccelerateAMDMatmul.cpp

+  // {1, numWarps} for the 2nd tl.dot to save registers
+  auto ttDotOp = dyn_cast<tt::DotOp>(dotOp);
+  if (isChainDotHead(ttDotOp) || isChainDotTail(ttDotOp)) {
+    if ((shape[0] == shapePerWarp.first) && isChainDotTail(ttDotOp))


You can use a local variable to save the result of isChainDotTail to avoid computing it again. (The C++ compiler should do it, but I'm not 100% sure.) Also, this is a specific case. I suspect we want to swap here whenever it's more beneficial to distribute along the second dot's N dim? That is, when ceildiv(shape[0], shapePerWarp.first) < ceildiv(shape[1], shapePerWarp.second)?
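
The suggested comparison can be sketched as a standalone function. This models only the reviewer's proposed condition, not the logic in the PR; `ceilDiv` and `pickChainTailWarps` are hypothetical names, and the fallback of {numWarps, 1} is taken from the quoted code comment about the FA-like pattern.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>

// Integer division rounded up; assumes a >= 0 and b > 0.
inline int64_t ceilDiv(int64_t a, int64_t b) { return (a + b - 1) / b; }

// Hypothetical helper: picks {warpsM, warpsN} for the tail dot of a chain.
// If the M dimension needs fewer per-warp tiles than the N dimension, all
// warps go along N; otherwise they stay along M.
std::pair<unsigned, unsigned> pickChainTailWarps(int64_t shapeM, int64_t shapeN,
                                                 int64_t perWarpM,
                                                 int64_t perWarpN,
                                                 unsigned numWarps) {
  int64_t tilesM = ceilDiv(shapeM, perWarpM);
  int64_t tilesN = ceilDiv(shapeN, perWarpN);
  if (tilesM < tilesN)
    return {1u, numWarps};
  return {numWarps, 1u};
}
```

The special case in the diff, shape[0] == shapePerWarp.first, corresponds to a single tile along M, so it selects {1, numWarps} under this condition whenever N needs more than one tile.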

Force the secondDot to have warpsPerCTA={1, numWarps} if BLOCK_M == mDim

18be627

zhanglx13 force-pushed the improve_fa_decode branch from f855bf6 to 18be627 · February 12, 2025 02:28

antiagainst requested changes Feb 12, 2025


antiagainst changed the title ~~[AMD][FA] Force the 2nd dot to have warpsPerCTA={1, numWarps} if BLOCK_M == mDim~~ [AMD][FA] Improve warp distribution for attention second dot Feb 12, 2025
