TL/UCP: transition to barrier for sync for onesided a2a #1096


Open · wfaderhold21 wants to merge 1 commit into master from topic/a2a-barrier

Conversation

@wfaderhold21 (Collaborator) commented Mar 17, 2025

What

Switch from using a pSync array with atomic increments to the TL/UCP barrier for synchronization.

Why?

There are multiple reasons for this switch: the knomial barrier scales better and outperforms the atomic-increment approach (see below), and, once PR #1070 is merged, it allows this algorithm to be used with memory handles.
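To make the scaling argument concrete, here is a minimal, self-contained analogy in C, with threads and C11 atomics standing in for PEs and network atomics. This is not the TL/UCP code; all names are illustrative. It contrasts a flat atomic-counter sync (the pSync-style scheme) with a k-nomial tree barrier:

```c
/* Sketch (not the actual TL/UCP implementation): a flat atomic-counter
 * barrier vs. a k-nomial tree barrier. The tree needs O(log_k N) signal
 * hops per participant instead of N fetch-adds contending on one counter.
 * Both barriers are single-use here, for brevity. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define N 8          /* number of "processes" (threads here) */
#define RADIX 4      /* knomial radix */

static atomic_int flat_count;                 /* flat barrier: one hot counter */
static atomic_int arrived[N], released[N];    /* tree barrier: per-node flags */

static void flat_barrier(void) {
    /* every participant hammers the same location */
    atomic_fetch_add(&flat_count, 1);
    while (atomic_load(&flat_count) < N)
        ;  /* spin */
}

static void knomial_barrier(int me) {
    int parent = (me == 0) ? -1 : (me - 1) / RADIX;
    /* fan-in: wait for my children, then signal my parent */
    for (int c = me * RADIX + 1; c <= me * RADIX + RADIX && c < N; c++)
        while (!atomic_load(&arrived[c]))
            ;
    if (parent >= 0) {
        atomic_store(&arrived[me], 1);
        while (!atomic_load(&released[me]))
            ;  /* wait for release from my parent */
    }
    /* fan-out: release my children */
    for (int c = me * RADIX + 1; c <= me * RADIX + RADIX && c < N; c++)
        atomic_store(&released[c], 1);
}

static void *worker(void *arg) {
    int me = (int)(long)arg;
    flat_barrier();          /* old style: single shared counter */
    knomial_barrier(me);     /* new style: tree of depth log_RADIX(N) */
    printf("rank %d passed both barriers\n", me);
    return NULL;
}

int main(void) {
    pthread_t t[N];
    for (long i = 0; i < N; i++) pthread_create(&t[i], NULL, worker, (void *)i);
    for (int i = 0; i < N; i++) pthread_join(t[i], NULL);
    return 0;
}
```

With `RADIX = 4` and `N = 8` the tree is two levels deep, while the flat counter serializes all N updates on one location; that contention cost is most visible in the small-message numbers below.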

Node Bandwidth
Tested on Thor with 32 nodes, 1 PPN

| Size | This PR | Current Algorithm |
|---------:|---------:|---------:|
| 8 | 15.62 | 9.3 |
| 16 | 30.99 | 18.27 |
| 32 | 61.81 | 35.65 |
| 64 | 122.12 | 72.89 |
| 128 | 235.66 | 145.33 |
| 256 | 478.43 | 279.93 |
| 512 | 918.63 | 549.35 |
| 1024 | 1663.85 | 1022.45 |
| 2048 | 2919.6 | 1852.31 |
| 4096 | 4571.7 | 3151.6 |
| 8192 | 6527.59 | 4753.05 |
| 16384 | 8583.94 | 7749.33 |
| 32768 | 10381.33 | 9651.23 |
| 65536 | 11574.64 | 10882.64 |
| 131072 | 12039.43 | 11585.79 |
| 262144 | 12456.95 | 12065.97 |
| 524288 | 12151.77 | 12350.47 |
| 1048576 | 12785.3 | 12427.42 |

Tested on Thor with 32 nodes, 32 PPN

| Size | This PR | Current Algorithm |
|---------:|---------:|---------:|
| 8 | 93.92 | 37.46 |
| 16 | 188.45 | 72.11 |
| 32 | 380.5 | 110.44 |
| 64 | 754.7 | 301.35 |
| 128 | 1500.11 | 460.3 |
| 256 | 2190.99 | 917.62 |
| 512 | 4178.51 | 1982.53 |
| 1024 | 7749.3 | 2867.44 |
| 2048 | 9093.23 | 4287.41 |
| 4096 | 9529.68 | 7078.85 |
| 8192 | 9858.2 | 8398.87 |
| 16384 | 9615.04 | 9826.92 |
| 32768 | 10004.92 | 10975.51 |
| 65536 | 10021.39 | 11901.37 |
| 131072 | 11287.29 | 11982.72 |
| 262144 | 11635.63 | 11803.15 |
| 524288 | 11623.31 | 11894.45 |
| 1048576 | 11621.41 | 11991.38 |

@swx-jenkins3 commented
Can one of the admins verify this patch?

@janjust force-pushed the topic/a2a-barrier branch from 96449db to eaa8091 on March 26, 2025 17:06
@janjust (Collaborator) commented Mar 26, 2025

@wfaderhold21 didn't we say we were also going to change the test to reflect oshmem behavior?
edit: nvm, I just realized it's the other PR

@janjust (Collaborator) commented Apr 9, 2025

@wfaderhold21

> (2) there can be instances where processes leave the alltoall collective before remote writes have been completed.

We had a discussion during our code review, and if I recall we concluded this is not the case in a 2-sided model, correct? We still need the user to issue a flush on the symmetric heap. Please correct me if I misunderstood.

@wfaderhold21 (Collaborator, Author) commented
> @wfaderhold21 (2) there can be instances where processes leave the alltoall collective before remote writes have been completed. We had a discussion during our code review, and if I recall we concluded this is not the case in a 2-sided model, correct? We still need the user to issue a flush on the symmetric heap. Please correct me if I misunderstood.

@janjust This is correct. In order to ensure completion of writes to the remote processes, we need to issue a flush.
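For context, this is the standard put-then-quiet pattern in OpenSHMEM terms. A minimal sketch using the standard OpenSHMEM 1.4 API (buffer names are illustrative, and this is not the test code from this PR):

```c
/* Sketch of the flush requirement: shmem_long_put() may return before the
 * data is visible at the target, so remote completion must be forced with
 * shmem_quiet() before signaling or leaving the collective. */
#include <shmem.h>
#include <stdio.h>

int main(void) {
    shmem_init();
    int me   = shmem_my_pe();
    int npes = shmem_n_pes();

    long *dst = shmem_malloc(sizeof(long));   /* symmetric heap buffer */
    long  src = me;

    /* one-sided write to the next PE; returns once 'src' is reusable,
     * NOT once the data has landed remotely */
    shmem_long_put(dst, &src, 1, (me + 1) % npes);

    shmem_quiet();        /* force remote completion of outstanding puts */
    shmem_barrier_all();  /* now it is safe for everyone to read 'dst' */

    printf("PE %d sees %ld\n", me, *dst);
    shmem_free(dst);
    shmem_finalize();
    return 0;
}
```

In this sketch `shmem_barrier_all()` alone would also suffice, since it implies a quiet; the explicit `shmem_quiet()` just makes the remote-completion step visible.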

@nsarka (Collaborator) commented Apr 10, 2025

> In order to ensure completion of writes to the remote processes, we need to issue a flush.

Does the flush become a no-op (or just unnecessary) if RC is used? I'm just wondering how the transport changes this requirement (if at all).

@wfaderhold21
Copy link
Collaborator Author

> In order to ensure completion of writes to the remote processes, we need to issue a flush.
>
> Does the flush become a no-op (or just unnecessary) if RC is used? I'm just wondering how the transport changes this requirement (if at all).

I believe ordering should be maintained when using RC, and a flush is not strictly required for ordering, since future PUTs, sends, and AMOs will complete after the PUT. However, UCP returns success on a PUT as soon as the source buffer is ready for reuse; there is no guarantee that the PUT has completed at the remote target (e.g., with a buffered copy).
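At the UCP level this is the distinction between local and remote completion. A sketch of the pattern, assuming an already-connected `ucp_ep_h` and an exchanged `ucp_rkey_h` (endpoint setup and error handling are elided; not code from this PR):

```c
/* Sketch: ucp_put_nbx() completion only means the source buffer is
 * reusable; remote completion requires an explicit flush, here via
 * ucp_ep_flush_nbx(). */
#include <ucp/api/ucp.h>

static void wait_req(ucp_worker_h worker, ucs_status_ptr_t req) {
    if (!UCS_PTR_IS_PTR(req))
        return;                        /* completed immediately (or error) */
    while (ucp_request_check_status(req) == UCS_INPROGRESS)
        ucp_worker_progress(worker);   /* drive communication */
    ucp_request_free(req);
}

void put_with_remote_completion(ucp_worker_h worker, ucp_ep_h ep,
                                const void *buf, size_t len,
                                uint64_t raddr, ucp_rkey_h rkey) {
    ucp_request_param_t param = { .op_attr_mask = 0 };

    /* local completion only: source buffer may be reused after this */
    wait_req(worker, ucp_put_nbx(ep, buf, len, raddr, rkey, &param));

    /* remote completion: data is guaranteed visible at the target */
    wait_req(worker, ucp_ep_flush_nbx(ep, &param));
}
```

Flushing the endpoint scopes the guarantee to that one peer; `ucp_worker_flush_nbx()` would instead cover all endpoints on the worker.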
