
WIP: Add PARTED kernels #382

Open
MrBurmark wants to merge 39 commits into develop from feature/burmark1/parted

Conversation

@MrBurmark (Member) commented Oct 25, 2023

WIP: Add PARTED kernels

These parted, or partitioned, kernels do the same computation over the same data as the Stream TRIAD kernel, but instead of a single loop over all of the data they use multiple kernels over parts of the data. The same work is ultimately done on the same data; it is just broken up into multiple partitions.
The idea is to look at ways of improving performance when running the same kernel over different pieces of data, such as when running over subdomains or tiles in a block-structured code or with AMR.
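
A minimal sketch of the idea in plain C++ (illustrative only; the function names and the partition representation are assumptions, not the suite's actual RAJA/CUDA variants): the baseline Stream TRIAD performs a[i] = b[i] + alpha * c[i] in one loop over the whole range, while the parted version performs the same update as one loop per partition.

```cpp
#include <cstddef>
#include <vector>

// Single loop over all of the data, as in the Stream TRIAD kernel.
void triad(double* a, const double* b, const double* c,
           double alpha, std::size_t N)
{
  for (std::size_t i = 0; i < N; ++i) {
    a[i] = b[i] + alpha * c[i];
  }
}

// Same total work on the same data, broken into one loop per partition.
// bounds holds partition boundaries: partition p covers [bounds[p], bounds[p+1]).
void triad_parted(double* a, const double* b, const double* c, double alpha,
                  const std::vector<std::size_t>& bounds)
{
  for (std::size_t p = 0; p + 1 < bounds.size(); ++p) {
    for (std::size_t i = bounds[p]; i < bounds[p + 1]; ++i) {
      a[i] = b[i] + alpha * c[i];
    }
  }
}
```

On a GPU backend each partition becomes its own kernel launch (or a single fused launch, as in the later "fused" commits), which is where the launch and scheduling overhead being studied comes from.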

Ideas:

@MrBurmark changed the title from Add TRIAD_PARTED kernel to Add PARTED kernels on Oct 25, 2023
@MrBurmark force-pushed the feature/burmark1/parted branch 2 times, most recently from 1536b7a to fd328c2 on October 25, 2023 at 16:35
@MrBurmark changed the title from Add PARTED kernels to WIP: Add PARTED kernels on Nov 28, 2023
Commits:

- This does the same thing as TRIAD but breaks it into multiple for loops over the data instead of a single for loop over the data.
- Leave in comments of other dispatch options.
- This makes each partition a multiple of the size of the previous partition.
- This tuning provides a best-case scenario where the overhead of capturing the state and synchronizing per rep is removed.
- The new GPU tuning is an AOS version using triad_holder. This is now in addition to the SOA tuning.
- This copies the basic mempool from RAJA and adds a capability to synchronize as necessary to avoid host-device race conditions when memory is needed on the host but all the memory has been used on the device.
- Default is on, so the sizes of partitions are not always in non-decreasing order.
- This uses a scan and binary search to schedule work to blocks instead of a 2D grid, so it avoids blocks with no work. This is faster for CUDA but slower for HIP (see the sketch after this list).
- with triad parted fused
- This has a large effect and makes a block size of 256 as good as or better than 1024.
- Always use binary search code.
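
As a rough illustration of the scan-and-binary-search scheduling mentioned above (a hypothetical sketch, not the code in this PR; the kernel name, arguments, and data layout are assumed), the host builds an exclusive scan of the number of blocks each partition needs, the grid is launched with exactly the total block count, and each block binary-searches the scan to find its partition, so no block is launched without work.

```cuda
// Assumed layout:
//   block_offsets[p]         = exclusive scan of per-partition block counts
//                              (ceil(partition length / blockDim.x))
//   block_offsets[num_parts] = total number of blocks = gridDim.x

__device__ int find_partition(const int* block_offsets, int num_parts, int block_id)
{
  // Largest p such that block_offsets[p] <= block_id.
  int lo = 0, hi = num_parts;                 // answer lies in [lo, hi)
  while (hi - lo > 1) {
    int mid = lo + (hi - lo) / 2;
    if (block_offsets[mid] <= block_id) { lo = mid; } else { hi = mid; }
  }
  return lo;
}

__global__ void triad_parted_fused(double* a, const double* b, const double* c,
                                   double alpha,
                                   const int* part_begin, const int* part_end,
                                   const int* block_offsets, int num_parts)
{
  int p = find_partition(block_offsets, num_parts, blockIdx.x);
  int block_in_part = blockIdx.x - block_offsets[p];  // block index within partition p
  int i = part_begin[p] + block_in_part * blockDim.x + threadIdx.x;
  if (i < part_end[p]) {
    a[i] = b[i] + alpha * c[i];                       // same TRIAD update per element
  }
}
```

Compared to a 2D grid sized by the largest partition, every launched block maps to real work at the cost of a short binary search per block; the commit messages note this trades off differently on CUDA and HIP.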