[GPU] Use padding in IGEMM pipeline to support unaligned to intrinsic shapes #19484

Merged: 2 commits into iree-org:main on Dec 18, 2024

Conversation

@nirvedhmeshram (Contributor) commented on Dec 13, 2024

This PR does two things:

  1. Allow all GEMM shapes to use the padded TileAndFuse Matmul configuration. This is still gated behind the iree-codegen-llvmgpu-test-tile-and-fuse-matmul flag (false by default) and does not change the default behavior. However, the following PRs that landed in the past month make it possible to relax the guards we originally had on this (see the invocation sketch after this list):
    [Codegen][llvmgpu] Refactor op cloning in prefetch shared memory pass #19196
    [Codegen][llvmgpu] Compute gemmC size when C promotion is done in padding matmul #19307
    [MLIR] Add allow Insert/extract slice option to pack/unpack op llvm/llvm-project#117340
  2. Allow fused producers to use the padded TileAndFuse Matmul configuration. The following PRs make this possible now:
    [Codegen] Allow padding of dynamic allocas #19399
    [Tensor] Simplify tenor.pad tiling length calculations. llvm/llvm-project#119039
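
For reference, here is a minimal sketch of how this path could be enabled today. The flag name comes from this PR; the rest of the invocation is an assumption and will vary by setup.

```shell
# Sketch only: opt into the padded TileAndFuse matmul path, which is off by default.
iree-compile input.mlir --iree-codegen-llvmgpu-test-tile-and-fuse-matmul=true ...
```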

Together this allows us to do padded IGEMM with intrinsics for shapes unaligned to the intrinsic, which is what we use by default. In the conv cases from iree-kernel-benchmark-module that exercise this change, a median speedup of 2.26x was observed.

The numeric changes I observed with this path enabled were the same as what any aligned shape already shows when comparing intrinsic vs. non-intrinsic codegen. Some differences do show up for narrow types like f16, but they are within a relative error of 0.001; since our tests use absolute errors, we may have to change some test values to account for this change.
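
To make the tolerance distinction concrete, here is a small illustrative check (hypothetical helpers, not the actual test harness): a result can stay within a 0.001 relative-error bound and still exceed a fixed absolute tolerance, which is why some expected values may need updating.

```cpp
#include <cmath>

// Illustrative only: narrow-type (e.g. f16) differences stay within a 1e-3
// relative error, but a test using a fixed absolute tolerance may still flag them.
bool withinRelativeError(double actual, double expected, double relTol = 1e-3) {
  return std::abs(actual - expected) <= relTol * std::abs(expected);
}

bool withinAbsoluteError(double actual, double expected, double absTol) {
  return std::abs(actual - expected) <= absTol;
}

// Example: expected = 100.0, actual = 100.05
//   relative error = 5e-4  -> passes the 1e-3 relative bound
//   absolute error = 0.05  -> fails an absolute tolerance of, say, 0.01
```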

The perf differences in CI seem to be within the noise margin compared to main: https://github.com/iree-org/iree/actions/runs/12323399269/attempts/1#summary-34399247902

@nirvedhmeshram merged commit 8ae1b54 into iree-org:main on Dec 18, 2024. 40 checks passed.
jerryyin added a commit that referenced this pull request on Jan 8, 2025:
The motivation for this PR is convolution performance for the resnet50 configs. With this PR (and a few pending ones), conv performance with the igemm pipeline gets a decent speedup in situations where a standalone dimension size is smaller than the intrinsic size. (Take dispatch 69 as an example: the selected tile m:7, n:512, k:4608 would be rejected from mfma because the m tile is smaller than the intrinsic size of 16.) This happened because we were previously too defensive about when to use an intrinsic: even in situations where alignment is not required, we still allowed mfma to be picked only when the m/n/k tiles were all larger than the intrinsic size.

With @nirvedhmeshram's #19271 and #19484, padding is allowed in the tile-and-fuse matmul and igemm tile-and-fuse pipelines, so it is no longer necessary to be as conservative as before. I am therefore getting rid of the conditional check that blocks mfma from being picked up.
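
As a rough sketch of the guard being relaxed (hypothetical names and shape of the check; the real logic lives around `canTargetIntrinsic()` and differs in detail):

```cpp
#include <cstdint>

// Sketch of the former guard, not the exact IREE code. For resnet50 dispatch 69
// (m = 7, n = 512, k = 4608), m = 7 is smaller than the intrinsic size of 16,
// so mfma used to be rejected outright even though alignment was not required.
bool canUseIntrinsic(int64_t m, int64_t n, int64_t k,
                     int64_t intrinsicM, int64_t intrinsicN, int64_t intrinsicK,
                     bool mustBeAligned) {
  if (mustBeAligned)
    return m % intrinsicM == 0 && n % intrinsicN == 0 && k % intrinsicK == 0;
  // Old behavior: reject any dimension smaller than the intrinsic size.
  //   return m >= intrinsicM && n >= intrinsicN && k >= intrinsicK;
  // New behavior: accept, and let the TileAndFuse/IGEMM pipelines pad the
  // unaligned dimension up to the intrinsic size.
  return true;
}
```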

This will impact a few pipelines that use `canTargetIntrinsic()`:
- `LLVMGPUPadAndVectorDistribute` will allow narrow m/n/k dimension
sizes for batch matmul
- In `iree-codegen-rocdl-configuration-pipeline`, narrow m/n/k dimension sizes will be allowed for matmul (instead of falling back to warp reduction)

---------

Signed-off-by: jerryyin <[email protected]>