
[GPU][Codegen] Allowing mfma for narrow problem config sizes #19615

Merged (2 commits) on Jan 8, 2025

Conversation

@jerryyin (Member) commented Jan 6, 2025

The motivation for this PR is convolution performance for resnet50 configs. With this PR (and a few pending ones), conv performance with the igemm pipeline gets a decent speedup in situations where a standalone dimension size is smaller than the intrinsic size. (Take dispatch 69 as an example: the selected tile m:7, n:512, k:4608 would be rejected from mfma because the m tile is smaller than the intrinsic size of 16.) This happens because we were previously too defensive about when to use the intrinsic: even in situations where alignment is not required, we still enforced that mfma be picked only when the m/n/k tiles are all at least as large as the intrinsic size.

With @nirvedhmeshram's #19271 and #19484, padding is allowed in the tile-and-fuse matmul and igemm tile-and-fuse pipelines, so it is no longer necessary to be as conservative as before. I am therefore removing the conditional check that blocks mfma from being picked up.

This will impact a few pipelines that use canTargetIntrinsic():

  • LLVMGPUPadAndVectorDistribute will allow narrow m/n/k dimension sizes for batch matmul
  • In iree-codegen-rocdl-configuration-pipeline, narrow m/n/k dimension sizes will be allowed for matmul (instead of falling back to warp reduction)
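To illustrate the kind of guard being relaxed, here is a minimal sketch. The function name, struct, and `canPad` flag are hypothetical stand-ins, not IREE's actual `canTargetIntrinsic()` signature; the point is that once a pipeline can pad, dimensions smaller than the intrinsic shape no longer need to disqualify mfma:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical mfma intrinsic shape (e.g. a 16x16x16 MFMA variant).
struct IntrinsicShape {
  int64_t m, n, k;
};

// Sketch of the relaxed check: previously every dimension had to be at
// least the intrinsic size; with padding-capable pipelines, narrow
// dimensions are allowed through because they can be padded up.
bool canUseMfma(int64_t m, int64_t n, int64_t k,
                const IntrinsicShape &intrinsic, bool canPad) {
  if (canPad)
    return true;  // padding makes any problem size alignable
  return m >= intrinsic.m && n >= intrinsic.n && k >= intrinsic.k;
}
```

With the dispatch-69 tile sizes from the description (m:7, n:512, k:4608) and a 16x16x16 intrinsic, the old-style check rejects mfma because m = 7 < 16, while the padding-aware path accepts it.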

@jerryyin (Member, Author) commented Jan 7, 2025

I experimented with resnet50 again and can confirm that there is no difference in full model performance between:

  • completely dropping the size check
  • a size check of 4

@jerryyin jerryyin force-pushed the users/zyin/remove-intrinsic-size-limitation branch from 22102d7 to f2d5d51 Compare January 7, 2025 21:37
This makes sure tiny gemms can still be lowered through the warp reduction
pipeline.

Signed-off-by: jerryyin <[email protected]>
@qedawkins (Contributor) left a comment


Cool, LGTM! Also, I would pay close attention to the regression tests for SDXL and llama on the CI. There is some wiggle room in those tests, so they might pass despite small regressions.

@jerryyin jerryyin merged commit c75b686 into main Jan 8, 2025
36 checks passed
@jerryyin jerryyin deleted the users/zyin/remove-intrinsic-size-limitation branch January 8, 2025 18:37