LHS Registers Part 2 - Pipelining #19

ggengnv · 2024-09-23T23:19:36Z

(Part 1: #18)
Part 2 of the "WGMMA with LHS operand in registers" feature.
Commits from 0f4faac "Initial changes for pipelining" onward are relevant to this part. The previous commits were for part 1.

This PR enables SMEM pipelining for WGMMA operand A when it is in RF. This change must be combined with the part 1 changes to actually achieve better performance.
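As a rough illustration of what pipelining the SMEM loads means, here is a host-side toy model in plain Python. It is only a sketch under stated assumptions: the function name `pipelined_k_loop`, the use of scalars in place of tiles, and the deque standing in for a ring of SMEM buffers are all hypothetical and are not taken from Triton's pipeliner. It shows loads for future K-iterations being issued ahead of the compute that consumes them, with operand A passing through "registers" for an elementwise op before the multiply-accumulate.

```python
from collections import deque


def pipelined_k_loop(a_tiles, b_tiles, elementwise, num_stages=3):
    """Toy model of a K-loop whose tile loads are multi-buffered.

    a_tiles / b_tiles: per-iteration scalars standing in for tiles.
    All names here are hypothetical; this is not Triton's implementation.
    Returns (accumulator, max number of buffers in flight).
    """
    assert len(a_tiles) == len(b_tiles)
    assert num_stages >= 1
    n = len(a_tiles)
    smem = deque()  # stands in for the ring of SMEM buffers
    max_in_flight = 0
    acc = 0

    # Prologue: issue the first num_stages - 1 loads before any compute.
    issued = 0
    while issued < min(num_stages - 1, n):
        smem.append((a_tiles[issued], b_tiles[issued]))
        issued += 1

    for _ in range(n):
        # Steady state: issue the load for a future iteration first.
        if issued < n:
            smem.append((a_tiles[issued], b_tiles[issued]))
            issued += 1
        max_in_flight = max(max_in_flight, len(smem))

        # Consume the oldest buffer: A moves from "SMEM" to "registers",
        # the elementwise op runs there, then the multiply-accumulate.
        a_smem, b_smem = smem.popleft()
        a_reg = elementwise(a_smem)
        acc += a_reg * b_smem

    return acc, max_in_flight
```

With `num_stages=1` the model degenerates to load-then-compute with a single buffer; larger values keep more loads in flight, which is the overlap the pipelining change is meant to expose.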

ggengnv · 2024-09-23T23:37:39Z

Addressed all comments in the original PR that are relevant to part 2 (one comment) in this PR instead.

…for SMEM-to-MMAv3 DotOp Copy (#5003)

Hopper has two kinds of WGMMAs: "SS" (both operands in shared memory) and "RS" (LHS operand A in registers). When elementwise operations are applied to A before the WGMMA, Triton previously copied A from global memory (GMEM) into registers (RF), performed the elementwise ops, and then copied the result to shared memory (SMEM) to perform an SS WGMMA. This PR adds an optimization that uses RS GEMM in that case.

This requires the following changes:

- In the TritonGPU OptimizeDotOperands pass, add optimizations to change SS GEMM into RS GEMM.
- Add TritonGPU -> LLVM lowering for copying from SMEM to RF in the MMA v3 dotOperand layout.

NOTE: This may not yield a perf gain, and may even cause a perf loss, for certain shapes (e.g. small-K). Additional optimizations are in a separate [PR](openxla#19), and still more optimizations are WIP. Please advise on the merging strategy.
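The difference between the two data paths can be sketched with a small plain-Python model. This is an illustration only: the function names `ss_path` and `rs_path`, the list-of-lists "tiles", and the trace strings are hypothetical, and the exact ordering of steps in the RS path (A staged in SMEM, copied to RF, then the elementwise op) is an assumption based on the description above rather than a transcription of the compiler's output. Both paths compute the same result; only where A lives when the matrix multiply is issued differs.

```python
def matmul(a, b):
    """Plain list-of-lists matrix multiply."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def ss_path(a_gmem, b_smem, elementwise):
    """Old path: elementwise in RF, then A goes back to SMEM for SS WGMMA."""
    trace = ["A: GMEM -> RF"]
    a_rf = [[elementwise(x) for x in row] for row in a_gmem]
    trace.append("A: elementwise in RF")
    a_smem = [row[:] for row in a_rf]  # extra round trip through SMEM
    trace.append("A: RF -> SMEM")
    trace.append("WGMMA SS")
    return matmul(a_smem, b_smem), trace


def rs_path(a_gmem, b_smem, elementwise):
    """Assumed new path: A is read from SMEM into RF and stays there."""
    trace = ["A: GMEM -> SMEM"]
    a_smem = [row[:] for row in a_gmem]
    trace.append("A: SMEM -> RF")
    a_rf = [[elementwise(x) for x in row] for row in a_smem]
    trace.append("A: elementwise in RF")
    trace.append("WGMMA RS")
    return matmul(a_rf, b_smem), trace
```

In this model the SS path ends with an RF-to-SMEM copy before the multiply, while the RS path feeds the multiply directly from registers.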


ggengnv · 2024-11-18T23:41:39Z

Merged into upstream Triton.

This was referenced Sep 23, 2024

Optimization to put LHS operand in registers for WGMMA before elementwise ops #17

Closed

LHS Registers Part 1 - DotOp Hoisting and SMEM-RF Copy Lowering #18

Closed

ggengnv force-pushed the lhs-reg-pipeline branch from 3adec01 to 995c0b8 Compare September 23, 2024 23:36

ggengnv force-pushed the lhs-reg-pipeline branch 2 times, most recently from b4d61b8 to deefac7 Compare September 24, 2024 22:52

Moerafaat force-pushed the llvm-head branch from 3596dc5 to 10d3305 Compare September 25, 2024 14:11

ggengnv force-pushed the lhs-reg-pipeline branch from deefac7 to 1cb92c3 Compare September 25, 2024 22:13

vwbaker added 2 commits September 30, 2024 09:47

[BACKEND] Update LLVM version to llvm/llvm-project@29b92d0 (triton-la…

6fa4f50

…ng#4828)

OpenXLA-specific changes

1ef30e9

vwbaker force-pushed the llvm-head branch from 10d3305 to 1ef30e9 Compare September 30, 2024 12:15

ggengnv added 16 commits October 9, 2024 17:51

Add preliminary logic to hoist elt-wise ops for MMAv3

83cd631

Lower shared > v3 dotOp & improve hoisting logic

30a670c

Fix test regressions

f4fe44b

Rewrite OptimizeDotOperands logic and add tests

da07d16

Improve comments

ab36a0f

Improve documentation and refactor

3b4ffc2

Rename SharedToDotOperandMMAv2 -> ...v2OrV3

cdf2ae0

Remove debug flags in test_core.py

7a8ac2e

Fix bad rename

d898568

Initial changes for pipelining

3bf5ddc

Add pipeline test

be4e1e3

Refactor MatmulLoopPipeline

34b46c6

Improve coalescing for global to local copy

579f8f2

fix typo

ba48bf1

Remove old comment

693a719

Fix check in getMMALoadType

56eefde

ggengnv force-pushed the lhs-reg-pipeline branch from 1cb92c3 to 56eefde Compare October 9, 2024 22:19

Moerafaat mentioned this pull request Oct 14, 2024

Requirements to pass WGMMA LHS operand in registers triton-lang/triton#4785

Open

ggengnv mentioned this pull request Oct 28, 2024

[BACKEND][NVIDIA] Add DotOp Hoisting Pass for WGMMA and Add Lowering for SMEM-to-MMAv3 DotOp Copy triton-lang/triton#5003

Merged

Moerafaat force-pushed the llvm-head branch 2 times, most recently from da8895b to c8f89a6 Compare October 30, 2024 09:27

chsigg force-pushed the llvm-head branch from c8f89a6 to 7c407a3 Compare November 6, 2024 16:33

ggengnv closed this Nov 18, 2024
