LHS Registers Part 2 - Pipelining #19

Closed
wants to merge 18 commits into from

@ggengnv ggengnv commented Sep 23, 2024

(Part 1: #18)
Part 2 of "WGMMA with LHS operand in registers" feature.
Commits from 0f4faac "Initial changes for pipelining" onward are relevant to this part. The previous commits were for part 1.

This PR enables SMEM pipelining for WGMMA operand A when it is in RF. The part 1 changes need this change on top of them to actually improve performance.

ggengnv commented Sep 23, 2024

The one comment on the original PR that is relevant to part 2 is addressed in this PR instead.

@Moerafaat force-pushed the llvm-head branch 2 times, most recently from da8895b to c8f89a6 on October 30, 2024 09:27
ThomasRaoux pushed a commit to triton-lang/triton that referenced this pull request Nov 15, 2024
…for SMEM-to-MMAv3 DotOp Copy (#5003)

Hopper has two kinds of WGMMAs, "SS" (both operands in shared memory) and "RS"
(LHS operand A in registers).
In cases where we apply elementwise operations on A before WGMMA, Triton
previously copied A from global memory (GMEM) into registers (RF),
performed the elementwise ops, and then copied A to shared memory (SMEM) to
perform SS WGMMA.

This PR adds an optimization for the case above to use RS GEMM. This
requires the following changes:

- In TritonGPU OptimizeDotOperands pass, add optimizations to change SS
GEMM into RS GEMM.
- Add TritonGPU -> LLVM lowering for copying from SMEM to RF in MMA v3
dotOperand layout.
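
As a rough illustration of the difference between the two paths, here is a toy Python model. It is only a sketch: the stage names paraphrase the description above, `stages_after_elementwise` is a hypothetical helper, none of this is Triton API, and the ordering of the RS path is an assumption inferred from the two bullets rather than something the PR states.

```python
# Toy model only: stage names paraphrase the PR description.
# Nothing here is Triton API.

# Previous behaviour, as described above.
SS_PATH = [
    "copy A: GMEM -> RF",
    "elementwise ops on A in RF",
    "copy A: RF -> SMEM",
    "SS WGMMA (A and B read from SMEM)",
]

# Assumption: this ordering is inferred from the bullets, not stated in the PR.
RS_PATH = [
    "copy A: SMEM -> RF (MMA v3 dotOperand layout)",
    "elementwise ops on A in RF",
    "RS WGMMA (A read from RF, B read from SMEM)",
]


def stages_after_elementwise(path):
    """Return the stages that sit between the elementwise ops and the WGMMA."""
    idx = next(i for i, stage in enumerate(path) if stage.startswith("elementwise"))
    return path[idx + 1:-1]


if __name__ == "__main__":
    print("SS extra stages:", stages_after_elementwise(SS_PATH))
    print("RS extra stages:", stages_after_elementwise(RS_PATH))
```

In this model the SS path has one extra stage (the RF to SMEM copy of A) between the elementwise ops and the WGMMA, while the RS path has none. Whether that translates into a speedup depends on the shape, as the note below says.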

NOTE: This change may yield no perf gain, and may even cause a perf loss,
for certain shapes (e.g. small-K). Additional optimizations are in a
separate [PR](openxla#19), and still more optimizations are WIP. Please
advise on the merging strategy.
hmalgewatta pushed a commit to hmalgewatta/triton that referenced this pull request Nov 15, 2024
…for SMEM-to-MMAv3 DotOp Copy (triton-lang#5003)

ggengnv commented Nov 18, 2024

Merged into upstream Triton.

@ggengnv ggengnv closed this Nov 18, 2024