WIP: Add PARTED kernels #382
Open
MrBurmark wants to merge 39 commits into develop from feature/burmark1/parted
Conversation
MrBurmark force-pushed the feature/burmark1/parted branch 2 times, most recently from 1536b7a to fd328c2 on October 25, 2023 16:35
rhornung67 approved these changes Oct 25, 2023
MrBurmark force-pushed the feature/burmark1/parted branch from 1d5b8e4 to 9906d4c on November 21, 2023 17:44
This does the same thing as TRIAD but breaks it into multiple for loops over parts of the data instead of a single for loop over all of the data.
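For illustration, a minimal sketch of the loop structure this describes, assuming a simple boundary array. The function names and the even split are illustrative, not the suite's actual code.

```cpp
#include <vector>

// Baseline TRIAD: a single for loop over all of the data.
void triad(double* a, const double* b, const double* c, double alpha, int N) {
  for (int i = 0; i < N; ++i) {
    a[i] = b[i] + alpha * c[i];
  }
}

// Parted variant: the same work, split into one for loop per part.
// 'parts' holds num_parts + 1 boundaries, e.g. {0, N/4, N/2, 3*N/4, N}.
void triad_parted(double* a, const double* b, const double* c, double alpha,
                  const std::vector<int>& parts) {
  for (size_t p = 0; p + 1 < parts.size(); ++p) {
    for (int i = parts[p]; i < parts[p + 1]; ++i) {
      a[i] = b[i] + alpha * c[i];
    }
  }
}
```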
Leave other dispatch options in as comments.
This makes each partition a multiple of the size of the previous partition.
This tuning provides a best-case scenario where the overhead of capturing the state and synchronizing per rep is removed.
The new gpu tuning is an AOS version using triad_holder. This is now in addition to the SOA tuning.
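A rough sketch of the AOS vs. SOA distinction for the per-part kernel arguments; the struct contents below are assumptions based on this commit message, not the suite's actual triad_holder layout.

```cpp
// AOS packing: one struct per part carrying everything that part's
// loop needs (the name triad_holder comes from the commit message).
struct triad_holder {
  double* a;
  const double* b;
  const double* c;
  double alpha;
  int ibegin;
  int len;
};

// SOA packing: parallel arrays, one per field, indexed by part.
struct triad_soa {
  double** a;
  const double** b;
  const double** c;
  double* alpha;
  int* ibegin;
  int* len;
};
```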
This copies the basic mempool from RAJA and adds a capability to synchronize as necessary to avoid host-device race conditions when memory is needed on the host but all the memory has been used on the device.
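A minimal sketch of that synchronize-on-reuse idea, assuming pinned allocations and a simple free list; the class and method names are illustrative, not RAJA's actual mempool.

```cpp
#include <cuda_runtime.h>
#include <cstddef>
#include <vector>

// Chunks are recycled, and a synchronize is inserted only when a chunk
// the device may still be using is handed back to the host.
class SyncingPool {
  struct Chunk { void* ptr; std::size_t size; bool device_may_use; };
  std::vector<Chunk> free_chunks_;

public:
  void* allocate_for_host(std::size_t size) {
    for (auto it = free_chunks_.begin(); it != free_chunks_.end(); ++it) {
      if (it->size >= size) {
        if (it->device_may_use) {
          // Avoid a host-device race: wait for outstanding device work
          // before the host touches this memory.
          cudaDeviceSynchronize();
          for (auto& c : free_chunks_) c.device_may_use = false;
        }
        void* p = it->ptr;
        free_chunks_.erase(it);
        return p;
      }
    }
    void* p = nullptr;
    cudaHostAlloc(&p, size, cudaHostAllocDefault);  // pinned allocation
    return p;
  }

  // Callers tag a chunk as possibly still in use by the device when freeing.
  void deallocate(void* ptr, std::size_t size, bool device_may_use) {
    free_chunks_.push_back(Chunk{ptr, size, device_may_use});
  }
};
```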
The default is on, so the sizes of partitions are not always in non-decreasing order.
This uses a scan and binary search to schedule work to blocks instead of a 2D grid, thus avoiding blocks with no work.
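A sketch of that scheduling scheme (illustrative, not the suite's exact code): block_offsets holds the exclusive scan of the number of blocks each part needs, with the total appended, and the kernel is launched with exactly that total number of blocks so none are empty.

```cpp
#include <cuda_runtime.h>

// Find the largest p with block_offsets[p] <= block via binary search.
__device__ int find_part(const int* block_offsets, int num_parts, int block)
{
  int lo = 0, hi = num_parts;
  while (lo + 1 < hi) {
    int mid = (lo + hi) / 2;
    if (block_offsets[mid] <= block) lo = mid; else hi = mid;
  }
  return lo;
}

__global__ void triad_parted_fused(double* a, const double* b, const double* c,
                                   double alpha, const int* part_begin,
                                   const int* part_end, const int* block_offsets,
                                   int num_parts)
{
  int p = find_part(block_offsets, num_parts, blockIdx.x);
  // Position of this block within its part, then the global element index.
  int block_in_part = blockIdx.x - block_offsets[p];
  int i = part_begin[p] + block_in_part * blockDim.x + threadIdx.x;
  if (i < part_end[p]) {
    a[i] = b[i] + alpha * c[i];
  }
}
```

Launched as `triad_parted_fused<<<total_blocks, block_size>>>(...)`, where total_blocks is the scan total, so unlike a 2D grid sized for the largest part there are no blocks with no work.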
This is faster for CUDA but slower for HIP.
This has a minimal effect. With triad parted fused this has a large effect and makes a block size of 256 as good as or better than 1024.
always use binary search code
reorder TRIAD_PARTED gpu tuning declarations
These tunings use events to "fork-join" the streams as would be required in more realistic code, though it would not always have to be done as frequently.
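A sketch of the event-based fork-join pattern those tunings describe, using the standard CUDA runtime calls; the stream setup and function name are illustrative, and the suite's event management may differ.

```cpp
#include <cuda_runtime.h>
#include <vector>

void fork_join_launch(cudaStream_t main_stream,
                      const std::vector<cudaStream_t>& part_streams)
{
  cudaEvent_t fork_event;
  cudaEventCreateWithFlags(&fork_event, cudaEventDisableTiming);

  // Fork: every part stream waits until prior work on the main stream is done.
  cudaEventRecord(fork_event, main_stream);
  for (cudaStream_t s : part_streams) {
    cudaStreamWaitEvent(s, fork_event, 0);
    // ... launch this part's kernel on stream s ...
  }

  // Join: the main stream waits for each part stream to finish its work.
  for (cudaStream_t s : part_streams) {
    cudaEvent_t join_event;
    cudaEventCreateWithFlags(&join_event, cudaEventDisableTiming);
    cudaEventRecord(join_event, s);
    cudaStreamWaitEvent(main_stream, join_event, 0);
    cudaEventDestroy(join_event);  // safe: the wait captures the recorded state
  }
  cudaEventDestroy(fork_event);
}
```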
MrBurmark force-pushed the feature/burmark1/parted branch from 97c41ec to f1d0120 on January 31, 2024 19:55
WIP: Add PARTED kernels
These parted or partitioned kernels do the same computation over the same data as the Stream TRIAD kernel, but instead of a single loop over all of the data they use multiple kernels over parts of the data. So the same work is ultimately done to the same data, just broken up into multiple partitions.
The idea is to look at ways of improving performance when running the same kernel over different data, such as when running over subdomains or tiles in a block-structured code or with AMR.
Ideas:
add kernels listed in Add kernels broken up over multiple loops #381
add a folder to keep this performance study in a single place
split up data unevenly (currently each part gets size/num_parts data; a sketch of the current even split follows this list)
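For reference, a sketch of the current even split mentioned in the last item, assuming a simple boundary array; the helper name and remainder handling are illustrative, not the suite's actual code.

```cpp
#include <vector>

// Each part gets roughly size / num_parts elements, with the remainder
// spread over the first parts. An uneven split would replace this helper.
std::vector<int> make_even_parts(int size, int num_parts)
{
  std::vector<int> bounds(num_parts + 1, 0);
  int base = size / num_parts;
  int rem  = size % num_parts;
  for (int p = 0; p < num_parts; ++p) {
    bounds[p + 1] = bounds[p] + base + (p < rem ? 1 : 0);
  }
  return bounds;  // bounds[p]..bounds[p+1] is part p's index range
}
```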
This PR is a feature
It does the following: