[LLVMGPUVectorDistribute] Re-arrange nested layouts for better conversions #19437

Draft · manupak wants to merge 1 commit into main
Conversation

manupak (Contributor) commented on Dec 10, 2024

Currently, in layout conversions with shared_memory_conversion, we read directly in the destination layout. However, if the maximum vector length is not present in the element tile, this generates sub-optimal shared memory reads.

This commit introduces a new layout for the read so that shared memory reads are better vectorized (a sketch of the idea is below).
[Maybe I can break this into two PRs:] it also adds the ability to distribute the thread tile when it differs between the source and destination layouts.

After this change, I don't see a performance difference between transposed-V attention and normal attention.
Moreover, it improves performance in both cases; the improvement is, as expected, much larger for the latter.
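
For illustration, here is a minimal C++ sketch of the rearrangement idea (a hypothetical helper, not the PR's code; the 128-bit maximum load width is an assumption matching the constant in the diff excerpt below):

#include <cstdint>

// Minimal sketch: fold factors of 2 from the innermost outer tile, then the
// innermost batch tile, into the element tile until the element tile covers
// a full 128-bit load. `batchTileLen`, `outerTileLen`, and `elementTileLen`
// are assumed to be the innermost-dimension sizes of a nested layout.
static void growElementTile(int64_t elemBitwidth, int64_t &batchTileLen,
                            int64_t &outerTileLen, int64_t &elementTileLen) {
  constexpr int64_t maxVecLenBits = 128;
  const int64_t targetElems = maxVecLenBits / elemBitwidth; // e.g. 8 for f16
  while (elementTileLen < targetElems && outerTileLen % 2 == 0) {
    outerTileLen /= 2;
    elementTileLen *= 2;
  }
  while (elementTileLen < targetElems && batchTileLen % 2 == 0) {
    batchTileLen /= 2;
    elementTileLen *= 2;
  }
}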

@manupak manupak marked this pull request as draft December 10, 2024 15:57
int64_t &batchTileLen = batchTile.back();
int64_t &outerTileLen = outerTile.back();
// TODO: maybe we should obtain this from somewhere?
constexpr int64_t maxVecLenBits = 128;
manupak (Contributor, Author) commented on the diff:
Is there a way to obtain arch-specific information in a pass?
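
One possibility might be querying a target attribute, along these lines (a sketch only; it assumes the iree_gpu target attribute is reachable from the pass and exposes a maximum load width, and the accessor names are assumptions, not verified against this revision):

// Sketch: fall back to 128 bits when no target info is available.
// getGPUTargetAttr / getWgp / getMaxLoadInstructionBits are assumed names.
int64_t maxVecLenBits = 128;
if (IREE::GPU::TargetAttr target = getGPUTargetAttr(funcOp)) {
  maxVecLenBits = target.getWgp().getMaxLoadInstructionBits();
}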

manupak (Contributor, Author) commented on Dec 10, 2024

@qedawkins @Groverkss @raikonenfnu I found a simpler solution to the bad LDS read problem.
I don't see bad reads anymore with these changes, and performance is much better. I'll post those results here.

I actually did not need to transpose, because the reduce-bank-conflicts pass, in conjunction with the technique here, was able to produce vectorized, non-bank-conflicting shared memory loads (sketched after this comment).

There are two main changes here, which I can potentially break up into two PRs if necessary.
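
For context, the padding trick that kind of pass relies on can be sketched like this (the general technique, not the pass's actual implementation; the 32-bank, 4-byte-per-bank geometry is an assumption typical of current GPUs):

#include <cstdint>

// With 32 banks of 4 bytes, any row whose byte size is a multiple of
// 128 starts every row in the same bank; padding the row by one 128-bit
// vector breaks that alignment so column accesses hit distinct banks.
static int64_t paddedRowElems(int64_t rowElems, int64_t elemBytes) {
  constexpr int64_t numBanks = 32;
  constexpr int64_t bankBytes = 4;
  if ((rowElems * elemBytes) % (numBanks * bankBytes) == 0)
    rowElems += 16 / elemBytes; // pad by one 128-bit vector
  return rowElems;
}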

Groverkss (Contributor):

Very cool! I will have a look later today.

manupak (Contributor, Author) commented on Dec 10, 2024

I need to add tests; holding that off until I get a thumbs-up on the approach here.

@@ -906,17 +903,19 @@ struct DistributeBatchOuterToLayoutConversions final
   SmallVector<int64_t> shapeB = layoutB.getDistributedShape();
   int64_t rank = layoutA.getRank();
 
-  // Interleave batch and outer dims by transposing.
+  // Interleave in-thread elements by transposing.
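
A sketch of what "interleave by transposing" means here (assumed semantics inferred from the comment and the pattern name, not the actual implementation): with the per-thread dims laid out as batch dims followed by outer dims, the transpose permutation pairs them up per dimension, e.g. [b0, b1, o0, o1] -> [b0, o0, b1, o1].

#include <cstdint>
#include <vector>

// Build a transpose permutation that interleaves `rank` batch dims
// (indices 0..rank-1) with `rank` outer dims (indices rank..2*rank-1).
static std::vector<int64_t> interleavePerm(int64_t rank) {
  std::vector<int64_t> perm;
  perm.reserve(2 * rank);
  for (int64_t i = 0; i < rank; ++i) {
    perm.push_back(i);        // batch dim i
    perm.push_back(rank + i); // outer dim i
  }
  return perm;
}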
manupak (Contributor, Author) commented:
OK, this is generally not right; thanks @Groverkss for pointing it out.

manupak (Contributor, Author) added:

This works currently because the producer of this is a transfer_read.

manupak (Contributor, Author) commented on Dec 10, 2024

I'm working on v2 of this along the lines of making the distribution of transfer_read smarter.
