Matmul Optimizations Short-Term Roadmap #128

Open · 8 of 11 tasks
louisfd opened this issue Sep 18, 2024 · 0 comments

Labels: enhancement (New feature or request)

louisfd commented Sep 18, 2024

Mostly notes to myself, as this is very precise and short-term. It concerns the CMMA version of matmul.
Each step should come with a config flag so it doesn't become mandatory, which lets us merge faster.
Steps must be done in this order:

Compute loop

  • Invert the k and n for-loops: at the moment the outer loop is on n and the inner on k, which forbids the next step.
  • Keep the lhs fragment in registers instead of reloading the same data on each iteration (see the sketch after this list).
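
For illustration, a CUDA `wmma` sketch (not the actual CubeCL kernel) of what the inverted loop order buys; `lhs_smem`, `rhs_smem`, the contiguous 16x16 tile layout, and the function signature are assumptions:

```cuda
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// Sketch only: tile counts, pointer offsets, and buffer names are illustrative.
// With k as the outer loop, the lhs fragment for a given k is loaded once and
// reused across every n iteration instead of being reloaded on each of them.
__device__ void compute_loop(const half* lhs_smem, const half* rhs_smem,
                             wmma::fragment<wmma::accumulator, 16, 16, 16, float>* acc,
                             int num_buffers, int num_accumulators) {
    for (int k = 0; k < num_buffers; ++k) {              // outer loop on k
        wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> lhs_frag;
        wmma::load_matrix_sync(lhs_frag, lhs_smem + k * 16 * 16, 16);   // one lhs load per k

        for (int n = 0; n < num_accumulators; ++n) {     // inner loop on n
            wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> rhs_frag;
            wmma::load_matrix_sync(rhs_frag, rhs_smem + (k * num_accumulators + n) * 16 * 16, 16);
            wmma::mma_sync(acc[n], lhs_frag, rhs_frag, acc[n]);
        }
    }
}
```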

Loading to SMEM

  • At the moment each warp fills its own tile, but there is no real reason to split responsibilities this way. We should instead change the algorithm so that a warp loads as much data as it can in one coalesced write (with vectorization taken into account), then offsets itself by this amount times the number of warps, following the respective layouts of the GMEM and SMEM.
  • Allow different layouts for the SMEM. As of now, tiles from lhs are row-major and tiles from rhs are col-major, but the contrary is probably more suitable for double buffering.
  • Instead of assuming which warp loads what based on its cube position, write the loading as a function of an id and a number of warps for better flexibility (see the sketch after this list).
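
A sketch of that loading scheme in CUDA terms; the `float4` vectorization and the identity index mapping are assumptions, the real mapping must follow the respective GMEM/SMEM layouts:

```cuda
// Each warp performs one coalesced, vectorized write per iteration, then jumps
// ahead by (32 lanes x number of warps). Which warp copies what depends only on
// warp_id and num_warps, not on the warp's position in the cube.
__device__ void load_to_smem(const float4* gmem, float4* smem,
                             int num_vec_elems,  // total float4 elements to copy
                             int warp_id, int num_warps) {
    const int lane = threadIdx.x % 32;
    for (int i = warp_id * 32 + lane; i < num_vec_elems; i += num_warps * 32) {
        smem[i] = gmem[i];  // identity mapping kept for brevity
    }
}
```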

Config

  • Relax the config constraint b_m = b_n.
  • Fix num_compute_warps = b_m / 16, num_buffers = b_k / 16, num_accumulators = b_n / 16. Have one warp per row to maximize reutilization of the lhs fragment (see the sketch after this list).
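
Under those formulas the derived quantities look like this (a sketch; the struct and its names are illustrative, not the actual CubeCL config):

```cuda
// Illustrative config with 16x16x16 CMMA tiles; field and method names are made up.
struct CmmaConfig {
    int b_m, b_k, b_n;                                  // b_m == b_n no longer required
    int num_compute_warps() const { return b_m / 16; }  // one compute warp per 16-row slice
    int num_buffers()       const { return b_k / 16; }  // SMEM buffers along k
    int num_accumulators()  const { return b_n / 16; }  // accumulators held by each compute warp
};
// Example: b_m = 64, b_k = 32, b_n = 128
// -> 4 compute warps, 2 buffers, 8 accumulators per warp.
```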

Double buffering with warp specialization

  • Specialize some warps into compute warps. Their number should equal the number of tensor cores (typically 4 or 8) and should be num_compute_warps. Allow some other warps to serve as loading warps. Adjust sync_units accordingly.
  • Define a specialization strategy to determine which warps should do what (for instance: 0..7 compute, the rest load).
  • Use double buffering by alternating between the first and second half of the SMEM.
  • It's probably better to have the compute warps load the first tiles to save a sync (see the sketch after this list).
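
A skeletal CUDA-style outline of the flow; the helper names, the SMEM split into two halves, and the use of `__syncthreads()` instead of finer-grained sync_units are all assumptions:

```cuda
#include <cuda_fp16.h>

// Placeholder stand-ins for the real routines; bodies intentionally elided.
__device__ void load_tiles(half* /*smem_half*/, int /*k_step*/) { /* fill one SMEM half */ }
__device__ void compute_tiles(const half* /*smem_half*/) { /* CMMA over one SMEM half */ }

// Skeleton of warp specialization + double buffering: compute warps work on one
// SMEM half while loading warps prefetch the next k-step into the other half.
__device__ void matmul_double_buffered(int warp_id, int num_compute_warps,
                                       int num_k_steps, half* smem_halves[2]) {
    const bool is_compute_warp = warp_id < num_compute_warps;  // e.g. 0..7 compute, rest load

    // Compute warps load the very first tiles themselves to save one sync.
    if (is_compute_warp) load_tiles(smem_halves[0], /*k_step=*/0);
    __syncthreads();

    for (int k = 0; k < num_k_steps; ++k) {
        const int cur = k % 2;
        const int nxt = (k + 1) % 2;
        if (is_compute_warp) {
            compute_tiles(smem_halves[cur]);                   // consume the current half
        } else if (k + 1 < num_k_steps) {
            load_tiles(smem_halves[nxt], /*k_step=*/k + 1);    // prefetch into the other half
        }
        __syncthreads();  // in practice, sync only the relevant sync_units
    }
}
```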