Improvements to: Use BlockLoadToShared in DeviceMerge #6077 #6460
Force-pushed from cf6dfa4 to 25d09e9.
Force-pushed from f88e45e to a9fd762.
Edit: I later found a regression in #6077 that caused higher register consumption, so occupancy dropped. We would have to reevaluate the content below (not important for now, though):
I had a commit that inlined the [...]
pairs on H200, translating indices while gathering vs. separate translation during bulk copy, 2% threshold: [benchmark results not preserved]
So it seems for [...]
Force-pushed from 91f84ee to 500a706.
pairs on H200, #6077 vs this PR, 2% threshold (NOT the final perf after merging to [...]): [benchmark results not preserved]
pairs on H200, [...]
Force-pushed from 500a706 to 07c70de.
Keys WITHOUT tuning (this is NOT proposed by this PR):
keys on H200, #6077 vs this PR, 2% threshold: (same perf)
keys on H200, [...]
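For readers skimming the comparisons: a "2% threshold" presumably means that runtime differences within ±2% of the baseline are treated as noise and reported as "same perf". A minimal sketch of such a classification follows. The names `Verdict`, `relative_change`, and `classify` are hypothetical and are not taken from the benchmark tooling used in this PR.

```cpp
// Hypothetical sketch: classify a benchmark result against a noise threshold.
// Not the actual comparison script; names and semantics are assumptions.
enum class Verdict { improvement, same, regression };

// Relative runtime change of the candidate versus the baseline
// (positive means the candidate is slower).
constexpr double relative_change(double baseline_time, double candidate_time)
{
  return (candidate_time - baseline_time) / baseline_time;
}

// Differences within +/- threshold are treated as noise ("same perf").
constexpr Verdict classify(double baseline_time, double candidate_time, double threshold = 0.02)
{
  return relative_change(baseline_time, candidate_time) > threshold
           ? Verdict::regression
           : (relative_change(baseline_time, candidate_time) < -threshold ? Verdict::improvement
                                                                          : Verdict::same);
}
```

With these assumptions, a candidate that is 1% slower than the baseline counts as "same perf", while one that is 5% slower counts as a regression.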
Keys WITH tuning:
keys on H200, #6077 vs this PR, 2% threshold (NOT the final perf after merging to [...]): [benchmark results not preserved]
keys on H200, [...]
We can see that disabling BlockLoadToShared for [...]
On B200, [...]
On B200, [...]
I think we can unconditionally enable BlockLoadToShared for Blackwell. Great!
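The architecture-dependent decision discussed in these comments (always on for Blackwell, decided by tuning elsewhere) could be expressed as a compile-time predicate. This is only a sketch under that assumption: `Arch`, `use_block_load_to_shared`, and the `tuned_enable` flag are hypothetical names for illustration, not CUB's actual policy or tuning API.

```cpp
// Hypothetical sketch: none of these names belong to CUB's real policy API.
enum class Arch { sm80 = 80, sm90 = 90, sm100 = 100 };

// Decides at compile time whether the BlockLoadToShared path would be used.
//   arch         : target architecture (sm100 stands in for Blackwell here)
//   tuned_enable : per-workload result of a tuning search (assumed input)
constexpr bool use_block_load_to_shared(Arch arch, bool tuned_enable)
{
  // Blackwell: enable unconditionally. Other architectures: defer to tuning.
  return arch == Arch::sm100 || tuned_enable;
}

// Example: a Hopper workload whose tuning disabled the path stays disabled.
static_assert(!use_block_load_to_shared(Arch::sm90, false), "tuning decides on Hopper");
```

Keeping the per-architecture override separate from the tuned flag makes it easy to turn the path on everywhere for one architecture without touching individual tunings.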
Added some manual tunings for Ampere.
Pure improvement.
Good enough
Force-pushed from 25d09e9 to c2c52f0.
This reverts commit a9fd762.
Force-pushed from 4025063 to 23ebab4.
/ok to test 23ebab4
🥳 CI Workflow Results
🟩 Finished in 16h 15m: Pass: 100%/81 | Total: 4d 12h | Max: 4h 25m | Hits: 67%/72769
If BlockLoadToShared is disabled, this PR has no SASS differences to #6077 on sm90.
Perf on Blackwell is amazing!
Perf and tunings on Ampere look OK. I am willing to accept the few regressions in exchange for improvements elsewhere.
Perf and tunings on Hopper look okay-ish. There are some regressions mixed in with the improvements, and I am not entirely convinced we should take them. Waiting for reviewer feedback.
Summary of all regressions: