[GPU] Use padding in IGEMM pipeline to support unaligned to intrinsic shapes #19484
This PR does two things. It keeps the `iree-codegen-llvmgpu-test-tile-and-fuse-matmul=false` flag by default and does not change the default behavior. However, the following PRs that have landed in the past month make it possible to relax the guards we originally had on this:

- [Codegen][llvmgpu] Refactor op cloning in prefetch shared memory pass #19196
- [Codegen][llvmgpu] Compute gemmC size when C promotion is done in padding matmul #19307
- [MLIR] Add allow Insert/extract slice option to pack/unpack op llvm/llvm-project#117340
- [Codegen] Allow padding of dynamic allocas #19399
- [Tensor] Simplify tensor.pad tiling length calculations. llvm/llvm-project#119039
Together, these allow us to do padded IGEMM with intrinsics for shapes unaligned to the intrinsic, which we now use by default. Here is the performance difference observed in conv cases in iree-kernel-benchmark-module that utilize this change: a median speedup of 2.26x was observed.
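The core idea behind the padded path is to round each IGEMM dimension up to the nearest multiple of the MMA intrinsic shape and zero-fill the pad region, so that an unaligned problem can still be mapped onto intrinsics. Below is a minimal standalone sketch of that rounding; the `GemmShape`/`padToIntrinsic` names and the 16x16x16 intrinsic shape are illustrative assumptions, not IREE's actual implementation:

```cpp
// Sketch: round GEMM dims up to the intrinsic tile so an unaligned problem
// can be padded instead of falling back to a no-intrinsic lowering.
#include <cstdint>
#include <iostream>

struct GemmShape {
  int64_t m, n, k;
};

constexpr int64_t roundUpTo(int64_t value, int64_t multiple) {
  return ((value + multiple - 1) / multiple) * multiple;
}

// Pad each GEMM dimension up to the intrinsic tile it must evenly divide.
GemmShape padToIntrinsic(GemmShape problem, GemmShape intrinsic) {
  return {roundUpTo(problem.m, intrinsic.m),
          roundUpTo(problem.n, intrinsic.n),
          roundUpTo(problem.k, intrinsic.k)};
}

int main() {
  GemmShape intrinsic{16, 16, 16};    // e.g. a 16x16x16 MFMA tile (assumed)
  GemmShape unaligned{130, 62, 100};  // IGEMM dims from a conv, unaligned
  GemmShape padded = padToIntrinsic(unaligned, intrinsic);
  std::cout << padded.m << "x" << padded.n << "x" << padded.k << "\n";
  // Prints 144x64x112. The pad region is filled with zeros so it does not
  // affect the accumulated result.
  return 0;
}
```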
With this path enabled, the numerics were identical for any aligned shape when comparing intrinsic vs. no-intrinsic use. Some differences do show up for narrow types like f16, but they are within a relative error of 0.001; since our tests use absolute error thresholds, we may have to update some test values to account for this change.
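To illustrate the absolute- vs. relative-error point, here is a hedged sketch (the names and thresholds are hypothetical, not taken from the test suite): a difference that passes a relative tolerance of 0.001 can still fail a fixed absolute tolerance once the reference values grow large.

```cpp
// Sketch: a small relative error can exceed a fixed absolute tolerance.
#include <cmath>
#include <cstdio>

bool withinAbsolute(double ref, double got, double atol) {
  return std::fabs(ref - got) <= atol;
}

bool withinRelative(double ref, double got, double rtol) {
  return std::fabs(ref - got) <= rtol * std::fabs(ref);
}

int main() {
  double ref = 4096.0;  // large accumulated value
  double got = 4099.0;  // differs by 3.0 (relative error ~0.00073)
  double atol = 1.0, rtol = 1e-3;  // hypothetical thresholds
  std::printf("absolute: %s, relative: %s\n",
              withinAbsolute(ref, got, atol) ? "pass" : "FAIL",
              withinRelative(ref, got, rtol) ? "pass" : "FAIL");
  // Prints "absolute: FAIL, relative: pass" — hence some absolute-error
  // test thresholds may need updating after this change.
  return 0;
}
```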
The perf differences in CI seem to be within the noise margin compared to main: https://github.com/iree-org/iree/actions/runs/12323399269/attempts/1#summary-34399247902