Enable TileAndFuse pipeline for non-intrinsic sized GEMM shapes #18858

Open · 4 of 6 tasks
nirvedhmeshram (Contributor) opened this issue on Oct 21, 2024 · 0 comments

This is a tracking issue for the pieces needed to switch to the TileAndFuse pipeline for non-intrinsic sized GEMM shapes. A prototype branch is available at https://github.com/nirvedhmeshram/iree/tree/bmm_tileandfuse_2

  • Bail-out logic in the GPUReduceBankConflict pass for collapse-shape users:
    This pass can currently crash because of an upstream issue, [mlir][memref] Collapse on strided memref is conservative (llvm/llvm-project#112994). We will bail out for collapse users to avoid this; a sketch of the pattern follows this item.

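    A minimal sketch of the pattern the bailout guards against. The shapes and the workgroup address space are hypothetical, not taken from the actual reproducer:

    ```mlir
    func.func @collapse_user() -> memref<256xf32, #gpu.address_space<workgroup>> {
      // Shared-memory alloc that the pass would pad (e.g. 4x64 -> 4x68 elements)
      // to avoid bank conflicts, turning views of it into strided memrefs.
      %alloc = memref.alloc() : memref<4x64xf32, #gpu.address_space<workgroup>>
      // collapse_shape user: once the alloc is padded, the source would be a
      // strided memref that collapse_shape cannot handle
      // (llvm/llvm-project#112994), so the pass bails out instead of padding.
      %flat = memref.collapse_shape %alloc [[0, 1]]
          : memref<4x64xf32, #gpu.address_space<workgroup>>
          into memref<256xf32, #gpu.address_space<workgroup>>
      return %flat : memref<256xf32, #gpu.address_space<workgroup>>
    }
    ```
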
  • Land pad support when promoting.
    We have a prototype commit for padding (nirvedhmeshram@3fc1628), but it needs to be refactored/improved in the following ways:
    Make the padding part of the TileAndFuse config rather than generating it on the fly. We also need to handle accumulate-type matmuls, as the padding currently generated for them does not get tiled (see dump here).
    Simplify the promotion logic for the C matrix and support both accumulate and non-accumulate GEMMs. A sketch of the padding itself follows this item.

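    For reference, a minimal sketch of the kind of operand padding involved, with hypothetical shapes and an assumed 16x16 intrinsic tile (the prototype generates this on the fly; the goal above is to drive it from the TileAndFuse config instead):

    ```mlir
    func.func @pad_lhs(%lhs: tensor<33x63xf32>) -> tensor<48x64xf32> {
      %cst = arith.constant 0.0 : f32
      // Pad 33x63 up to 48x64 so both dims are multiples of the 16x16 intrinsic.
      %padded = tensor.pad %lhs low[0, 0] high[15, 1] {
      ^bb0(%i: index, %j: index):
        tensor.yield %cst : f32
      } : tensor<33x63xf32> to tensor<48x64xf32>
      return %padded : tensor<48x64xf32>
    }
    ```
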
  • Go back to the pass order used before [LLVMGPU] Use forall workgroup distribution in TileAndFuse pipeline #18565, where we would distribute to workgroups and then do promotion. With the current order we end up with copies that don't get distributed. The reordering was done as a workaround for another bug related to DPS conversion; undoing it will require additional logic in the convert-to-DPS pass, which should solve both issues. cc @Max191

  • Fix barrier placement when there is a result writeback with a different thread distribution.
    For cases with padding, we currently generate IR like this after GPUGreedilyDistributeToThreadsPass:
    https://gist.github.com/nirvedhmeshram/e3b8260fe3d81e2ae6fd928fd4297b28
    The problem is that the iree_gpu.multi_mma is not in a barrier region, and the thread-distributed writeback loop that follows it must only run after all MMA ops in the workgroup have finished. The current thought is that we can insert a barrier here; see the sketch after this item.
    Edit: We realized that inserting a barrier after the mfma was not necessary; the race we were seeing was most likely the backend compiler not satisfying the latency constraints of the mfma. The backend has since fixed this, so no new logic was needed on our side.

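    For context, a structural sketch of the barrier placement that was under consideration before the Edit above made it unnecessary. The thread count, mapping, and writeback structure are assumptions, and the multi_mma is elided to a comment:

    ```mlir
    func.func @writeback_after_mma() {
      // ... %acc = iree_gpu.multi_mma ...  (elided; not inside a barrier region)

      gpu.barrier  // proposed: wait for every MMA in the workgroup to finish

      // Writeback loop, distributed to threads differently than the MMA above.
      scf.forall (%tid) in (64) {
        // copy this thread's slice of the (padded) result to global memory
      } {mapping = [#gpu.thread<linear_dim_0>]}
      return
    }
    ```
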
  • Make sure we have functional parity with the SIMT pipeline and performance parity/improvements relative to the VectorDistribute/PadVectorDistribute pipelines.
    We have some nice-to-have feature requests that would make testing such changes easier:
    Add a batch matmul suite (nod-ai/iree-kernel-benchmark#25)
    Adapt scripts to save lowering configs as part of the result CSV (nod-ai/iree-kernel-benchmark#26)

  • Turn on the pipeline by default in IREE.
