[CUB] Use BlockLoadToShared in DeviceMerge
#6077
Conversation
Force-pushed from e05ad2b to 582da7c
Benchmark results (keys):
- B200 (UBLKCPY): ['/home/pgrossebley/SM_100_merge_keys_final_old.json', '/home/pgrossebley/SM_100_merge_keys_final_newest.json'], base[0], NVIDIA B200
- H200 (UBLKCPY): ['/home/pgrossebley/SM_90_merge_keys_final_old.json', '/home/pgrossebley/SM_90_merge_keys_final_newest.json'], base[0], NVIDIA H200
- A100 (LDGSTS): ['/home/pgrossebley/SM_80_merge_keys_final_old.json', '/home/pgrossebley/SM_80_merge_keys_final_newest.json'], base[0], NVIDIA A100 80GB PCIe
- RTX 5090 (UBLKCPY): ['/home/pgrossebley/SM_120_merge_keys_old_final_CTK13.json', '/home/pgrossebley/SM_120_merge_keys_newest_final_CTK13.json'], base[0], NVIDIA GeForce RTX 5090
Benchmark results (pairs):
- B200 (UBLKCPY): ['/home/pgrossebley/SM_100_merge_pairs_final_old.json', '/home/pgrossebley/SM_100_merge_pairs_final_newest.json'], base[0], NVIDIA B200
- H200 (UBLKCPY): ['/home/pgrossebley/SM_90_merge_pairs_final_old.json', '/home/pgrossebley/SM_90_merge_pairs_final_newest.json'], base[0], NVIDIA H200
- A100 (LDGSTS): Did not fit due to GH comment character limit, see newer comment below.
- RTX 5090 (UBLKCPY): ['/home/pgrossebley/SM_120_merge_pairs_old_final_CTK13.json', '/home/pgrossebley/SM_120_merge_pairs_newest_final_CTK13.json'], base[0], NVIDIA GeForce RTX 5090
The benchmarks look very promising! There are a few runs, though, that regressed a lot, like some 2^16 workloads with more than 20% slowdown. I think it would be worthwhile to understand what's going on there.
Force-pushed from 582da7c to 14a356a
Benchmark results (pairs, continued):
- A100 (LDGSTS): ['/home/pgrossebley/SM_80_merge_pairs_final_old.json', '/home/pgrossebley/SM_80_merge_pairs_final_newest.json'], base[0], NVIDIA A100 80GB PCIe
@bernhardmgruber It seems that benchmarking small problem sizes (2^16) is not very reproducible on the RTX 5090. I replaced the old results with a newer run that has far fewer and smaller slowdowns. The few bigger slowdowns at 2^16 elements appear for different benchmarks than the slowdowns in the previous results. I can also see differences in the 2^16 results between running just those benchmarks and running the full default set including bigger sizes. The new run also uses a slightly different environment, hopefully closer to the cluster (CTK 12.9 → 13.0, graphical → multi-user target, and some other probably less important things like kernel updates and enabling persistence mode), which might have somewhat increased reproducibility (runtimes for the old implementation have profited more than runtimes for the new implementation).
This avoids needing to specify the cache modifier in two places and eases integration of #6077
Force-pushed from 14a356a to cf6dfa4
/ok to test cf6dfa4
I rebased the branch on main.
cub/cub/agent/agent_merge.cuh (outdated)

```cpp
(void) translate_indices;

item_type* items1_shared;
if constexpr (keys_use_block_load_to_shared)
```
Important: I think this should be items_use_block_load_to_shared
Yes, this is exposed by new tests: #6455. But only after some refactoring that I have not yet pushed to this branch.
Force-pushed from cf6dfa4 to 0d89c9f
Force-pushed from fe08eaf to c2c52f0
/ok to test c2c52f0
@pauleonix and @elstehle #6460 offers a refactoring and additional tuning policies on top of this PR. We should merge it into this PR and then merge this PR.
This reverts commit a9fd762.
🥳 CI Workflow Results: 🟩 Finished in 5h 21m. Pass: 100%/81 | Total: 4d 21h | Max: 5h 20m | Hits: 50%/72769
Great work @pauleonix! The first CUB algo is now using `BlockLoadToShared`!
That is awesome @pauleonix, great job 🎉
Description
Closes #6005 (the previous PR #5926 was wrongly closed; go there for slightly outdated performance results until I generate new ones).

Since `DeviceMerge` needs its data in shared memory either way, it is a good candidate to demonstrate the new `BlockLoadToShared` interface in the real world. It also comes with some complications: while we know the size of the output tile at compile time, how many elements to load from each of the ranges to be merged is only calculated at runtime using merge-path. This can be solved by introducing additional padding (of compile-time-known size) in between the two dynamically sized input buffers.

Checklist
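The padding scheme from the description can be sketched in plain C++ (names, the tile size, and the alignment value are illustrative assumptions, not the actual CUB implementation): the total shared-memory footprint is fixed at compile time by reserving worst-case alignment slack between the two buffers, while the second buffer's actual start is rounded up from the runtime merge-path split.

```cpp
#include <cstddef>

// Hypothetical sketch of the shared-memory layout described above.
// Bulk-copy style loads want an aligned destination, but buffer 1's
// length (num_items1, from merge-path) is only known at runtime, so
// buffer 2 cannot start directly behind it. We reserve compile-time
// worst-case padding between the buffers instead.
constexpr std::size_t alignment      = 16;  // assumed alignment requirement
constexpr std::size_t items_per_tile = 512; // compile-time output tile size

// Total bytes needed for one tile: full output tile plus worst-case
// slack of (alignment - 1) bytes between the two input buffers.
template <typename T>
constexpr std::size_t tile_storage_bytes()
{
  return items_per_tile * sizeof(T) + (alignment - 1);
}

// Runtime start of buffer 2: end of buffer 1 rounded up to alignment.
template <typename T>
std::size_t buffer2_offset(std::size_t num_items1)
{
  const std::size_t end1 = num_items1 * sizeof(T);
  return (end1 + alignment - 1) / alignment * alignment;
}
```

Since `num_items1 + num_items2 == items_per_tile` for a full tile, buffer 2 always fits: its rounded-up start costs at most `alignment - 1` extra bytes, exactly the compile-time padding reserved in `tile_storage_bytes`.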