[GPU] Use padding in IGEMM pipeline to support unaligned to intrinsic shapes #19484
This PR does two things. It keeps the `iree-codegen-llvmgpu-test-tile-and-fuse-matmul=false` flag by default and does not change the default behavior. However, the following PRs that have landed in the past month make it possible to relax the guards we originally had on this:

- [Codegen][llvmgpu] Refactor op cloning in prefetch shared memory pass #19196
- [Codegen][llvmgpu] Compute gemmC size when C promotion is done in padding matmul #19307
- [MLIR] Add allow Insert/extract slice option to pack/unpack op llvm/llvm-project#117340
- [Codegen] Allow padding of dynamic allocas #19399
- [Tensor] Simplify tensor.pad tiling length calculations. llvm/llvm-project#119039
Together, these allow us to do padded IGEMM with intrinsics for shapes unaligned to the intrinsic, which we now use by default. Here is the performance difference observed in conv cases in iree-kernel-benchmark-module that utilize this change: a median speedup of 2.26x was observed.
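The core idea behind the padded path is to round each IGEMM dimension up to the nearest multiple of the MMA intrinsic shape and zero-fill the pad region, so that an unaligned problem can still be mapped onto intrinsics. Below is a minimal standalone sketch of that rounding; the `GemmShape`/`padToIntrinsic` names and the 16x16x16 intrinsic shape are illustrative assumptions, not IREE's actual implementation:

```cpp
// Sketch: round GEMM dims up to the intrinsic tile so an unaligned problem
// can be padded instead of falling back to a no-intrinsic lowering.
#include <cstdint>
#include <iostream>

struct GemmShape {
  int64_t m, n, k;
};

constexpr int64_t roundUpTo(int64_t value, int64_t multiple) {
  return ((value + multiple - 1) / multiple) * multiple;
}

// Pad each GEMM dimension up to the intrinsic tile it must evenly divide.
GemmShape padToIntrinsic(GemmShape problem, GemmShape intrinsic) {
  return {roundUpTo(problem.m, intrinsic.m),
          roundUpTo(problem.n, intrinsic.n),
          roundUpTo(problem.k, intrinsic.k)};
}

int main() {
  GemmShape intrinsic{16, 16, 16};    // e.g. a 16x16x16 MFMA tile (assumed)
  GemmShape unaligned{130, 62, 100};  // IGEMM dims from a conv, unaligned
  GemmShape padded = padToIntrinsic(unaligned, intrinsic);
  std::cout << padded.m << "x" << padded.n << "x" << padded.k << "\n";
  // Prints 144x64x112. The pad region is filled with zeros so it does not
  // affect the accumulated result.
  return 0;
}
```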
With this path enabled, the numerics were identical for any aligned shape when comparing intrinsic vs. no-intrinsic use. Some differences do show up for narrow types like f16, but they are within a relative error of 0.001; since our tests use absolute error thresholds, we may have to update some test values to account for this change.
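To illustrate the absolute- vs. relative-error point, here is a hedged sketch (the names and thresholds are hypothetical, not taken from the test suite): a difference that passes a relative tolerance of 0.001 can still fail a fixed absolute tolerance once the reference values grow large.

```cpp
// Sketch: a small relative error can exceed a fixed absolute tolerance.
#include <cmath>
#include <cstdio>

bool withinAbsolute(double ref, double got, double atol) {
  return std::fabs(ref - got) <= atol;
}

bool withinRelative(double ref, double got, double rtol) {
  return std::fabs(ref - got) <= rtol * std::fabs(ref);
}

int main() {
  double ref = 4096.0;  // large accumulated value
  double got = 4099.0;  // differs by 3.0 (relative error ~0.00073)
  double atol = 1.0, rtol = 1e-3;  // hypothetical thresholds
  std::printf("absolute: %s, relative: %s\n",
              withinAbsolute(ref, got, atol) ? "pass" : "FAIL",
              withinRelative(ref, got, rtol) ? "pass" : "FAIL");
  // Prints "absolute: FAIL, relative: pass" — hence some absolute-error
  // test thresholds may need updating after this change.
  return 0;
}
```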
The perf differences in CI seem to be within the noise margin compared to main: https://github.com/iree-org/iree/actions/runs/12323399269/attempts/1#summary-34399247902