[LLVMGPUVectorDistribute] Re-arrange nested layouts for better conversions #19437

Draft · manupak wants to merge 1 commit into main
Conversation

manupak (Contributor) commented on Dec 10, 2024

Currently, in layout conversions with shared_memory_conversion, we read directly in the destination layout. However, if the maximum vector length is not present in the element tile, this generates sub-optimal shared memory reads.

This commit introduces a new layout for the read so that shared memory reads are better vectorized (a sketch of the idea is below).
[Maybe I can break this into two PRs:] it also adds the ability to distribute the thread tile when it differs between the source and destination layouts.

After this change, I don't see a performance difference between transposed-V attention and normal attention.
Moreover, it improves performance in both cases; the improvement is, as expected, much larger for the latter.
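
For illustration, here is a minimal C++ sketch of the rearrangement idea (a hypothetical helper, not the PR's code; the 128-bit maximum load width is an assumption matching the constant in the diff excerpt below):

#include <cstdint>

// Minimal sketch: fold factors of 2 from the innermost outer tile, then the
// innermost batch tile, into the element tile until the element tile covers
// a full 128-bit load. `batchTileLen`, `outerTileLen`, and `elementTileLen`
// are assumed to be the innermost-dimension sizes of a nested layout.
static void growElementTile(int64_t elemBitwidth, int64_t &batchTileLen,
                            int64_t &outerTileLen, int64_t &elementTileLen) {
  constexpr int64_t maxVecLenBits = 128;
  const int64_t targetElems = maxVecLenBits / elemBitwidth; // e.g. 8 for f16
  while (elementTileLen < targetElems && outerTileLen % 2 == 0) {
    outerTileLen /= 2;
    elementTileLen *= 2;
  }
  while (elementTileLen < targetElems && batchTileLen % 2 == 0) {
    batchTileLen /= 2;
    elementTileLen *= 2;
  }
}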

@manupak manupak marked this pull request as draft December 10, 2024 15:57
int64_t &batchTileLen = batchTile.back();
int64_t &outerTileLen = outerTile.back();
// TODO: maybe we should obtain this from somewhere?
constexpr int64_t maxVecLenBits = 128;
manupak (Contributor, Author) commented on the diff:
Is there a way to obtain arch-specific information in a pass?
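
One possibility might be querying a target attribute, along these lines (a sketch only; it assumes the iree_gpu target attribute is reachable from the pass and exposes a maximum load width, and the accessor names are assumptions, not verified against this revision):

// Sketch: fall back to 128 bits when no target info is available.
// getGPUTargetAttr / getWgp / getMaxLoadInstructionBits are assumed names.
int64_t maxVecLenBits = 128;
if (IREE::GPU::TargetAttr target = getGPUTargetAttr(funcOp)) {
  maxVecLenBits = target.getWgp().getMaxLoadInstructionBits();
}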

manupak (Contributor, Author) commented on Dec 10, 2024

@qedawkins @Groverkss @raikonenfnu I found a simpler solution to the bad LDS read problem.
I don't see bad reads anymore with these changes, and performance is much better. I'll post those results here.

I actually did not need to transpose, because the reduce-bank-conflicts pass, in conjunction with the technique here, was able to produce vectorized, non-bank-conflicting shared memory loads (sketched after this comment).

There are two main changes here, which I can potentially break up into two PRs if necessary.
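
For context, the padding trick that kind of pass relies on can be sketched like this (the general technique, not the pass's actual implementation; the 32-bank, 4-byte-per-bank geometry is an assumption typical of current GPUs):

#include <cstdint>

// With 32 banks of 4 bytes, any row whose byte size is a multiple of
// 128 starts every row in the same bank; padding the row by one 128-bit
// vector breaks that alignment so column accesses hit distinct banks.
static int64_t paddedRowElems(int64_t rowElems, int64_t elemBytes) {
  constexpr int64_t numBanks = 32;
  constexpr int64_t bankBytes = 4;
  if ((rowElems * elemBytes) % (numBanks * bankBytes) == 0)
    rowElems += 16 / elemBytes; // pad by one 128-bit vector
  return rowElems;
}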

Groverkss (Contributor):

Very cool! I will have a look later today.

manupak (Contributor, Author) commented on Dec 10, 2024

I need to add tests; holding that off until I get a thumbs-up on the approach here.

@@ -906,17 +903,19 @@ struct DistributeBatchOuterToLayoutConversions final
   SmallVector<int64_t> shapeB = layoutB.getDistributedShape();
   int64_t rank = layoutA.getRank();
 
-  // Interleave batch and outer dims by transposing.
+  // Interleave in-thread elements by transposing.
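
A sketch of what "interleave by transposing" means here (assumed semantics inferred from the comment and the pattern name, not the actual implementation): with the per-thread dims laid out as batch dims followed by outer dims, the transpose permutation pairs them up per dimension, e.g. [b0, b1, o0, o1] -> [b0, o0, b1, o1].

#include <cstdint>
#include <vector>

// Build a transpose permutation that interleaves `rank` batch dims
// (indices 0..rank-1) with `rank` outer dims (indices rank..2*rank-1).
static std::vector<int64_t> interleavePerm(int64_t rank) {
  std::vector<int64_t> perm;
  perm.reserve(2 * rank);
  for (int64_t i = 0; i < rank; ++i) {
    perm.push_back(i);        // batch dim i
    perm.push_back(rank + i); // outer dim i
  }
  return perm;
}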
manupak (Contributor, Author) commented:
OK, this is generally not right; thanks @Groverkss for pointing it out.

manupak (Contributor, Author) added:

This works currently because the producer of this is a transfer_read.

manupak (Contributor, Author) commented on Dec 10, 2024

I'm working on v2 of this along the lines of making the distribution of transfer_read smarter.
