This is a tracking issue for all the pieces needed to switch to using TileAndFuse for non-intrinsic-sized GEMM shapes. A prototype branch is provided here: https://github.com/nirvedhmeshram/iree/tree/bmm_tileandfuse_2
Bailout logic in the GPUReduceBankConflict pass for collapse_shape users:
Currently we can crash in this pass because of the upstream issue [mlir][memref] Collapse on strided memref is conservative (llvm/llvm-project#112994). We will bail out for collapse_shape users to avoid this.
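A minimal sketch of what such a bailout could look like, assuming the pass walks shared-memory allocations before padding them; the helper name and the direct-user check are illustrative, not the actual IREE implementation:

```cpp
#include "llvm/ADT/STLExtras.h"
#include "mlir/Dialect/MemRef/IR/MemRef.h"

using namespace mlir;

// Skip padding an allocation if any of its users is a memref.collapse_shape,
// since collapsing the resulting strided memref hits the conservative
// upstream handling tracked in llvm/llvm-project#112994.
static bool hasCollapseShapeUser(memref::AllocOp allocOp) {
  return llvm::any_of(allocOp->getUsers(), [](Operation *user) {
    return isa<memref::CollapseShapeOp>(user);
  });
}

// Inside the pass, the padding rewrite would then be guarded with something
// like:
//   if (hasCollapseShapeUser(allocOp)) return failure();
```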
Land Pad support when promoting
We have a prototype commit for padding here (nirvedhmeshram@3fc1628), but it needs to be refactored/improved in the following ways:
Make the padding part of the TileAndFuse lowering config rather than generating it on the fly (see the sketch after this list). We also need to handle accumulating (acc-type) matmuls, as the padding currently generated for them is not getting tiled; see the dump here.
Simplify the promotion logic for the C matrix and support both accumulating and non-accumulating GEMMs.
Go back to the pass order used before [LLVMGPU] Use forall workgroup distribution in TileAndFuse pipeline #18565, where we would distribute to workgroups and then do promotion. With the current order we end up with copies that don't get distributed. The current order was adopted as a workaround for another bug related to DPS conversion, so going back will require additional logic in the convert-to-DPS pass, which should solve both issues. cc @Max191
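As a rough illustration of the first point above, padding amounts could be recorded in the lowering config at configuration-selection time instead of materializing pad ops on the fly. The "padding" key and the dictionary-based config handling below are assumptions for the sketch, not the existing IREE attribute layout:

```cpp
#include "mlir/IR/Builders.h"
#include "mlir/IR/BuiltinAttributes.h"

using namespace mlir;

// Append a hypothetical "padding" entry (padded tile sizes per dimension) to
// an existing lowering-config dictionary so that later tiling/promotion
// passes can read it instead of creating pad ops themselves.
static DictionaryAttr addPaddingToConfig(Builder &b, DictionaryAttr config,
                                         ArrayRef<int64_t> paddedSizes) {
  SmallVector<NamedAttribute> entries(config.begin(), config.end());
  entries.emplace_back(b.getStringAttr("padding"),
                       b.getI64ArrayAttr(paddedSizes));
  return b.getDictionaryAttr(entries);
}
```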
Fix barrier placement when there is a result writeback with a different thread distribution.
Currently, for cases with padding, we generate IR like this after GPUGreedilyDistributeToThreadsPass: https://gist.github.com/nirvedhmeshram/e3b8260fe3d81e2ae6fd928fd4297b28
The problem is that the iree_gpu.multi_mma is not in a barrier region, and there is a thread-distributed writeback loop following it that needs to happen after all MMA ops in the workgroup have finished. The current thought is that we can insert a barrier here.
Edit: We realized that barrier insertion after the mfma was not necessary; the race we were seeing was most likely due to the backend compiler not satisfying the latency constraints of the mfma. This has been fixed in the backend, and we did not need to write any new logic for it.
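For reference, a minimal sketch of the barrier insertion that was being considered before the edit above; the helper and the way the multi_mma is assumed to be already located are illustrative, and per the edit this logic was ultimately not needed:

```cpp
#include "mlir/Dialect/GPU/IR/GPUDialect.h"
#include "mlir/IR/Builders.h"

using namespace mlir;

// Place a workgroup barrier right after the (already located) multi_mma op so
// that the thread-distributed writeback loop only starts once every thread in
// the workgroup has finished its MMA work.
static void insertBarrierAfterMma(Operation *mmaOp) {
  OpBuilder builder(mmaOp->getContext());
  builder.setInsertionPointAfter(mmaOp);
  builder.create<gpu::BarrierOp>(mmaOp->getLoc());
}
```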
Make sure we have functionality parity with the SIMT pipeline and performance parity/improvements relative to the VectorDistribute/PadVectorDistribute pipelines.
We have some nice-to-have feature requests for easily testing such changes:
Add a batch matmul suite nod-ai/iree-kernel-benchmark#25
Adapt scripts to save lowering configs as part of the result csv nod-ai/iree-kernel-benchmark#26
Turn on the pipeline by default in IREE.