
[optimize-dot-operands]: Fuse load and trans operations - part 3 #4537


Open. Wants to merge 20 commits into base: main.

Conversation

@etiotto (Contributor) commented Jun 18, 2025

Enhance the transformation to allow load+transpose fusion in separate for loops when the def-use chains corresponding to the two load+transpose instances originate at the same make_tensor_ptr operation.

@etiotto etiotto self-assigned this Jun 18, 2025
@etiotto (Contributor, Author) commented Jun 18, 2025

Depends on: #4468

@etiotto etiotto changed the title Etiotto.merge load with trans.3 [optimize-dot-operands]: Fuse load and trans operations - part 3 Jun 19, 2025
@etiotto etiotto requested a review from Copilot June 19, 2025 15:49
@etiotto etiotto marked this pull request as ready for review June 19, 2025 15:49
@Copilot Copilot AI left a comment

Pull Request Overview

This PR enhances the dot-operands optimization by fusing load and transpose operations in separate loops when the def-use chains originate from a make_tensor_ptr, and by refactoring the cleanup routines.

  • Added a new optimization pass (optimize_dot_operands) in multiple backend components.
  • Introduced a new eraseOperations utility and refactored fusion logic in OptimizeDotOperands.cpp.
  • Updated test cases to validate proper fusion and non‐fused behavior.

Reviewed Changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated no comments.

Summary per file:

  • third_party/intel/triton_xpu.cc: Added optimize_dot_operands pass registration.
  • third_party/intel/lib/Utils/Utility.cpp: Added a new eraseOperations function for cleanup operations.
  • third_party/intel/lib/TritonIntelGPUTransforms/OptimizeDotOperands.cpp: Refactored fusion logic and propagation routines to support optimized chaining.
  • third_party/intel/lib/Dialect/Triton/Transforms/TensorDescToBlockPointer.cpp: Removed redundant finalize() in favor of using eraseOperations.
  • third_party/intel/include/Utils/Utility.h: Declared the new eraseOperations function.
  • third_party/intel/backend/compiler.py: Registered the new optimize_dot_operands pass in the compiler backend.
  • test/TritonIntelGPU/dot-operands.mlir: Updated test cases to reflect changes in fusion behavior and new pass functionality.
Comments suppressed due to low confidence (2)

third_party/intel/lib/TritonIntelGPUTransforms/OptimizeDotOperands.cpp:161

  • [nitpick] The singleUsersInChain function is quite complex; consider refactoring the logic or adding more inline comments to improve readability and maintainability.
  // Determine whether all operations in the def-use chain from \p start to

third_party/intel/lib/TritonIntelGPUTransforms/OptimizeDotOperands.cpp:112

  • [nitpick] Consider renaming the lambda 'usedByDotOp' to a more descriptive name such as 'isChainedToDotOp' to clarify its purpose.
    auto usedByDotOp = [](tt::TransOp transOp) {

@etiotto etiotto requested a review from anmyachev June 20, 2025 20:27
@@ -68,7 +71,10 @@ def _attn_fwd_inner(acc, l_i, m_i, q, #
    for start_n in tl.range(lo, hi, BLOCK_N, warp_specialize=warp_specialize):
        start_n = tl.multiple_of(start_n, BLOCK_N)
        # -- compute qk ----
        k = desc_k.load([0, offsetk_y])
        if dtype == tl.float8e5:
@etiotto (Contributor, Author) commented:

For fp16 we undo the source code changes we made, so the code is now back to the original. For fp8 we keep the source code changes until we can issue DPAS instructions for them (after packing two fp8 elements into an fp16).

@@ -80,15 +79,6 @@ module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 32 : i32} {
    %c1_i64 = arith.constant 1 : i64
    %c1024_i64 = arith.constant 1024 : i64
    %cst = arith.constant dense<0.000000e+00> : tensor<256x256xf32, #mma>
    %0 = tt.get_program_id x : i32
@etiotto (Contributor, Author) commented:

Just making the test simpler here

@etiotto (Contributor, Author) commented Jun 23, 2025

Ping @whitneywhtsang, @chengjunlu, @LiyangLingIntel: any comments?

@@ -34,6 +36,94 @@ namespace mlir::triton::gpu::intel {

namespace {

// Represent a def-use chain rooted at 'start' and terminating at tt.trans
A Contributor commented:

Does start always have to be a make_tensor_ptr op? Should we use the MakeTensorPtrOp type instead of the generic Operation type for start?
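A minimal sketch of what that suggestion could look like, assuming the root really is always a make_tensor_ptr (the Chain layout shown here is hypothetical, not the PR's actual definition):

```cpp
#include "mlir/IR/Operation.h"
#include "triton/Dialect/Triton/IR/Dialect.h"

namespace tt = mlir::triton;

// Hypothetical sketch: storing the concrete op type encodes the
// "root is always a make_tensor_ptr" invariant in the type system,
// so use sites need neither a cast nor a runtime assert.
struct Chain {
  Chain(tt::MakeTensorPtrOp start, mlir::Operation *end)
      : start(start), end(end) {}
  tt::MakeTensorPtrOp start; // was: Operation *start
  mlir::Operation *end;      // terminates at the tt.trans user
};
```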

"operation");
}
bool operator<(const Chain &other) const {
return start < other.start || end < other.end;
A Contributor commented:

If there are two chains, chain 1 = [1, 4] and chain 2 = [2, 3], then chain1.start < chain2.start while chain1.end > chain2.end. Which chain is bigger? As written, chain1 < chain2 and chain2 < chain1 are both true, which violates the strict weak ordering that ordered containers require.
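One conventional fix is a lexicographic comparison via std::tie, which does give a strict weak ordering; a minimal sketch, assuming start and end are plain pointers or otherwise <-comparable:

```cpp
#include <tuple>

bool operator<(const Chain &other) const {
  // Compare start first; fall back to end only when the starts are equal.
  return std::tie(start, end) < std::tie(other.start, other.end);
}
```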

"operation");
}
bool operator<(const Chain &other) const {
return start < other.start || end < other.end;
A Contributor commented:

If we want to compare addresses, should it be something like the suggestion below?

Suggested change
return start < other.start || end < other.end;
return start < other.start || &end < &other.end;

Comment on lines +188 to +190
Chains &sameRootChains = rootToChains[start];
sameRootChains.insert(otherChain);
rootToChains[start] = sameRootChains;
A Contributor commented:

Does this work?

Suggested change
Chains &sameRootChains = rootToChains[start];
sameRootChains.insert(otherChain);
rootToChains[start] = sameRootChains;
rootToChains[start].insert(otherChain);
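The suggested one-liner works because the map's operator[] already default-constructs a missing entry and returns a reference to it, so the read, insert, and write-back collapse into a single statement. A standalone sketch with std::map and std::set standing in for the actual containers:

```cpp
#include <map>
#include <set>

int main() {
  std::map<int, std::set<int>> rootToChains; // stand-ins for the real types
  int start = 1, otherChain = 2;

  // operator[] creates an empty set on first access and returns a
  // reference, so the insert mutates the map entry in place:
  rootToChains[start].insert(otherChain);

  // Equivalent to the original three lines; the final write-back there
  // is a redundant self-assignment:
  //   Chains &sameRootChains = rootToChains[start];
  //   sameRootChains.insert(otherChain);
  //   rootToChains[start] = sameRootChains;
  return 0;
}
```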

});

// If the same operation is the root of multiple chains, duplicate it to
A Contributor commented:

Do you think it would be cleaner to have this duplication logic in a separate function?


// Prune candidate chains containing load/trans operations that cannot be
// safely fused.
prune(chains);
A Contributor commented:

Do you think it is worth pruning rootToChains first, keeping only the chains that contain at least one candidate? That way we won't clone a chain when there are no candidates.

A Contributor added:

Another thought is to have a flag indicating whether we want to clone for a particular root in rootToChains: if a root has no candidates, or all of its chains are candidates, then there is no need to clone.
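A minimal sketch of the pruning idea under discussion, using standard containers in place of the LLVM ones and a hypothetical isCandidate predicate (the real check would decide whether a chain's load+trans pair can be fused):

```cpp
#include <algorithm>
#include <iterator>
#include <map>
#include <set>
#include <tuple>

struct Chain { int start, end; }; // stand-in for the real Chain
bool operator<(const Chain &a, const Chain &b) {
  return std::tie(a.start, a.end) < std::tie(b.start, b.end);
}

// Hypothetical predicate; a stub here, the real one checks fusibility.
bool isCandidate(const Chain &c) { return c.start < c.end; }

// Drop roots whose chains contain no fusion candidate, so the later
// duplication step never clones a root that cannot be fused anyway.
void pruneRoots(std::map<int, std::set<Chain>> &rootToChains) {
  for (auto it = rootToChains.begin(); it != rootToChains.end();) {
    bool anyCandidate =
        std::any_of(it->second.begin(), it->second.end(), isCandidate);
    it = anyCandidate ? std::next(it) : rootToChains.erase(it);
  }
}
```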

@etiotto (Contributor, Author) commented Jun 24, 2025

Thanks @whitneywhtsang for the prompt review!


Successfully merging this pull request may close these issues.

[TransOp fusion]: Fuse tt.trans with tt.load to exploit 2D block read operations