Nonblocking Collectives #456
base: main
Changes from 26 commits
4b9f48d
bb13dd6
df261dd
48a012c
d3a7ac9
a3c8b15
1becb00
1cbd79a
f98cdf7
cc72331
a972738
db8528c
558e0d3
ef3ecf1
db79fcb
7d74314
2355bf8
b15c727
030a019
82e5d54
fb0b482
930b39c
96f0b97
ff96cf5
ef49e8f
e40b263
02a539a
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,29 @@ | ||
An \openshmem nonblocking collective operation, like a blocking collective | ||
operation, is a group communication operation among the | ||
participants of the team. All \acp{PE} in the team are required to call the | ||
collective operation and each collective operation must be initiated in the same | ||
order across all \acp{PE} while the execution may be performed in any order. | ||
|
||
\begin{enumerate} | ||
|
||
\item Invocation semantics: Upon invocation of a nonblocking collective routine, | ||
the operation is initiated and the routine returns without ensuring completion. All \acp{PE} in the team | ||
must call this routine with identical arguments. | ||
|
||
\item Collective Types: The nonblocking variants supported include the alltoall | ||
and broadcast collectives. All other collective operations such as | ||
reductions, collect, fcollect, barrier, barrier all, alltoalls, sync, and sync all will not have nonblocking variants. | ||
|
||
\item Completion semantics: \openshmem programs can learn the status of the collective operations | ||
using the \FUNC{shmem\_req\_test} routine. The operation is completed after | ||
a call to \FUNC{shmem\_req\_test} or a call to \FUNC{shmem\_req\_wait}. | ||
|
||
Similar to what @kwaters4 mentioned. Add something like "Completion of the operation can be observed through one or more calls to \FUNC{shmem_req_test} or a single call to \FUNC{shmem_req_wait}." |
||
\item Threads: While using SHMEM\_THREAD\_MULTIPLE, \openshmem | ||
programs are not allowed to call multiple collective operations on the same | ||
team from different threads. | ||
|
||
\end{enumerate} | ||
|
||
Note: Like other nonblocking \openshmem operations, implementations are | ||
expected to asynchronously progress the collective operations. The guidance on | ||
asynchronous progress is provided in Section \ref{subsec:progress}. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,123 @@ | ||
\apisummary{ | ||
Exchanges a fixed amount of contiguous data blocks between all pairs | ||
of \acp{PE} participating in the collective routine. | ||
Since we mention later that "All PEs in the provided team must participate in the collective.", can we just say that here, instead of keeping it in the abstract? |
||
} | ||
|
||
\begin{apidefinition} | ||
|
||
%% C11 | ||
\begin{C11synopsis} | ||
int @\FuncDecl{shmem\_alltoall\_nb}@(shmem_team_t team, TYPE *dest, const TYPE | ||
*source, size_t nelems, shmem_req_h *request); | ||
\end{C11synopsis} | ||
where \TYPE{} is one of the standard \ac{RMA} types specified by Table \ref{stdrmatypes}. | ||
|
||
\begin{Csynopsis} | ||
\end{Csynopsis} | ||
\begin{CsynopsisCol} | ||
int @\FuncDecl{shmem\_\FuncParam{TYPENAME}\_alltoall\_nb}@(shmem_team_t team, | ||
TYPE *dest, const TYPE *source, size_t nelems, shmem_req_h *request); | ||
\end{CsynopsisCol} | ||
where \TYPE{} is one of the standard \ac{RMA} types and has a corresponding \TYPENAME{} specified by Table \ref{stdrmatypes}. | ||
|
||
\begin{CsynopsisCol} | ||
int @\FuncDecl{shmem\_alltoallmem\_nb}@(shmem_team_t team, void *dest, const | ||
void *source, size_t nelems, shmem_req_h *request); | ||
\end{CsynopsisCol} | ||
|
||
\begin{apiarguments} | ||
|
||
\apiargument{IN}{team}{A valid \openshmem team handle to a team.}% | ||
|
||
\apiargument{OUT}{dest}{Symmetric address of a data object large enough to receive | ||
the combined total of \VAR{nelems} elements from each \ac{PE} in the | ||
team. | ||
The type of \dest{} should match that implied in the SYNOPSIS section.} | ||
\apiargument{IN}{source}{Symmetric address of a data object that contains \VAR{nelems} | ||
elements of data for each \ac{PE} in the team, ordered according to | ||
destination \ac{PE}. | ||
The type of \source{} should match that implied in the SYNOPSIS section.} | ||
\apiargument{IN}{nelems}{ | ||
The number of elements to exchange for each \ac{PE}. | ||
For \FUNC{shmem\_alltoallmem\_nb} it represents bytes. | ||
} | ||
\apiargument{OUT}{request}{An opaque request handle identifying the collective | ||
operation.} | ||
|
||
\end{apiarguments} | ||
|
||
\apidescription{ | ||
The \FUNC{shmem\_alltoall\_nb} routines are collective routines. All | ||
\acp{PE} in the provided team must participate in the collective. If | ||
\VAR{team} compares equal to \LibConstRef{SHMEM\_TEAM\_INVALID} or is | ||
otherwise invalid, the behavior is undefined. | ||
|
||
{\bf Invocation and completion}: A call to the nonblocking alltoall routine initiates the operation and returns | ||
immediately without necessarily completing the operation. On success, | ||
an opaque request handle is created and returned. The | ||
operation is completed after a call to \FUNC{shmem\_req\_test} or | ||
a call to \FUNC{shmem\_req\_wait}. When the operation is complete, the request handle | ||
is deallocated and cannot be reused. | ||
To my understanding, this is currently worded in a way that both I believe the correct language exists in the intro on lines 17 and 18.
@kwaters4 line 17 and 18 of the intro (completion semantics) states:
This is the same as here other than the ability to observe the status of the collective. I believe the text you are quoting is older.
Maybe I misunderstand. I thought
I feel like this wording may be the confusing part:
Maybe this wording is more clear: |
||
|
||
Though nonblocking alltoall varies in invocation and completion semantics | ||
when compared to blocking alltoall, the data exchange semantics are similar. | ||
|
||
{\bf Data exchange semantics}: | ||
In this routine, each \ac{PE} | ||
participating in the operation exchanges \VAR{nelems} data elements | ||
with all other \acp{PE} participating in the operation. | ||
The size of a data element is: | ||
\begin{itemize} | ||
\item 8 bits for \FUNC{shmem\_alltoallmem\_nb} | ||
\item \FUNC{sizeof}(\TYPE{}) for alltoall routines taking typed \VAR{source} and \VAR{dest} | ||
\end{itemize} | ||
|
||
The data being sent and received are | ||
stored in a contiguous symmetric data object. The total size of each \ac{PE}'s | ||
\VAR{source} object and \VAR{dest} object is \VAR{nelems} times the size of | ||
an element | ||
times \VAR{N}, where \VAR{N} equals the number of \acp{PE} participating | ||
in the operation. | ||
The \VAR{source} object contains \VAR{N} blocks of data | ||
(where the size of each block is defined by \VAR{nelems}) and each block of data | ||
is sent to a different \ac{PE}. | ||
|
||
The same \dest{} and \source{} | ||
arrays, and the same value for \VAR{nelems}, | ||
must be passed by all \acp{PE} that participate in the collective. | ||
|
||
Given a \ac{PE} \VAR{i} that is the \kth \ac{PE} | ||
participating in the operation and a \ac{PE} | ||
\VAR{j} that is the \lth \ac{PE} | ||
participating in the operation, | ||
|
||
\ac{PE} \VAR{i} sends the \lth block of its \VAR{source} object to | ||
the \kth block of | ||
the \VAR{dest} object of \ac{PE} \VAR{j}. | ||
|
||
|
||
Like data exchange semantics, the entry and completion | ||
criteria of blocking and nonblocking alltoall are similar. | ||
|
||
{\bf Entry criteria}: Before any \ac{PE} calls a \FUNC{shmem\_alltoall\_nb} routine, | ||
This entry criteria is confusing - if we have the same entry criteria as blocking operations - then the users are supposed to make sure the src/dst buffers are available at the point of entry - meaning the users are expected to sync with themselves before launching the nb collective operation. AFAIU - the src/dst needs to be available when everyone in the team reaches the state - which can be determined during runtime - without the need for explicit sync before launching the nb collective operation. Please clarify the intended semantics.
Naveen, my intent is to keep it consistent with the semantics of blocking collectives. I wanted to improve the description. I don’t see how the current wording is violating that. If the wording doesn’t reflect that, then we can fix it. |
||
the following condition must be ensured: | ||
\begin{itemize} | ||
\item The \VAR{dest} data object on all \acp{PE} in the team is | ||
ready to accept the \FUNC{shmem\_alltoall\_nb} data. | ||
\end{itemize} | ||
Otherwise, the behavior is undefined. | ||
|
||
{\bf Completion criteria}: Upon completion, the following is true for | ||
the local \ac{PE}: | ||
\begin{itemize} | ||
\item Its \VAR{dest} symmetric data object is completely updated and | ||
the data has been copied out of the \VAR{source} data object. | ||
\end{itemize} | ||
} | ||
|
||
\apireturnvalues{ | ||
Zero on successful local completion. Nonzero otherwise. | ||
} | ||
|
||
\end{apidefinition} | ||
|
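For readers following the data exchange semantics in this file, here is a minimal single-process sketch of the block mapping. It is not an OpenSHMEM call and does not use the proposed `shmem_alltoall_nb` API: the function name `model_alltoall` and the flat-array layout (PE `i`'s buffer starts at offset `i * npes * nelems`) are assumptions made only for illustration, and the team ordinal of each PE is taken to equal its index.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Single-process model of the alltoall block mapping (hypothetical helper,
 * not part of OpenSHMEM). "source" and "dest" each hold npes per-PE buffers
 * back to back; every per-PE buffer has npes blocks of nelems elements.
 */
void model_alltoall(size_t npes, size_t nelems, long *dest, const long *source)
{
    assert(dest != NULL && source != NULL);
    size_t per_pe = npes * nelems;              /* elements in one PE's buffer */

    for (size_t i = 0; i < npes; i++) {         /* sending PE (k-th in the team) */
        for (size_t j = 0; j < npes; j++) {     /* receiving PE (l-th in the team) */
            /* PE i sends the l-th block of its source ... */
            const long *src_block = source + i * per_pe + j * nelems;
            /* ... to the k-th block of the dest object of PE j. */
            long *dst_block = dest + j * per_pe + i * nelems;
            memcpy(dst_block, src_block, nelems * sizeof(long));
        }
    }
}
```

With two modeled PEs and `nelems` of 1, a source layout of `{10, 11, 20, 21}` ends up as `{10, 20, 11, 21}` in the destination: each PE's `dest` is ordered by sending PE.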
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,103 @@ | ||
\apisummary{ | ||
Broadcasts a block of data from one \ac{PE} to one or more destination | ||
\acp{PE}. | ||
} | ||
|
||
\begin{apidefinition} | ||
|
||
%% C11 | ||
\begin{C11synopsis} | ||
int @\FuncDecl{shmem\_broadcast\_nb}@(shmem_team_t team, TYPE *dest, const TYPE | ||
*source, size_t nelems, int PE_root, shmem_req_h *request); | ||
\end{C11synopsis} | ||
where \TYPE{} is one of the standard \ac{RMA} types specified by Table \ref{stdrmatypes}. | ||
|
||
%% C/C++ | ||
\begin{Csynopsis} | ||
\end{Csynopsis} | ||
\begin{CsynopsisCol} | ||
int @\FuncDecl{shmem\_\FuncParam{TYPENAME}\_broadcast\_nb}@(shmem_team_t team, TYPE | ||
*dest, const TYPE *source, size_t nelems, int PE_root, shmem_req_h *request); | ||
\end{CsynopsisCol} | ||
where \TYPE{} is one of the standard \ac{RMA} types and has a corresponding \TYPENAME{} specified by Table \ref{stdrmatypes}. | ||
|
||
\begin{CsynopsisCol} | ||
int @\FuncDecl{shmem\_broadcastmem\_nb}@(shmem_team_t team, void *dest, const void | ||
*source, size_t nelems, int PE_root, shmem_req_h *request); | ||
\end{CsynopsisCol} | ||
|
||
\begin{apiarguments} | ||
|
||
\apiargument{IN}{team}{The team over which to perform the operation.}% | ||
|
||
\apiargument{OUT}{dest}{Symmetric address of destination data object. | ||
The type of \dest{} should match that implied in the SYNOPSIS section.} | ||
\apiargument{IN}{source}{Symmetric address of the source data object. | ||
The type of \source{} should match that implied in the SYNOPSIS section.} | ||
\apiargument{IN}{nelems}{ | ||
The number of elements in \source{} and \dest{} arrays. | ||
For \FUNC{shmem\_broadcastmem\_nb}, elements are bytes. | ||
} | ||
\apiargument{IN}{PE\_root}{Zero-based ordinal of the \ac{PE}, with respect to | ||
the team, from which the data is copied.} | ||
\apiargument{OUT}{request}{An opaque request handle identifying the collective | ||
operation.} | ||
|
||
|
||
\end{apiarguments} | ||
|
||
\apidescription{ | ||
\openshmem nonblocking broadcast routines are collective routines over a | ||
valid \openshmem team. | ||
They copy the \source{} data object on the \ac{PE} specified by | ||
\VAR{PE\_root} to the \dest{} data object on the \acp{PE} | ||
participating in the collective operation. | ||
The same \dest{} and \source{} data objects and the same value of | ||
\VAR{PE\_root} must be passed by all \acp{PE} participating in the | ||
collective operation. | ||
|
||
A call to the nonblocking broadcast routine initiates the operation and returns | ||
immediately without necessarily completing the operation. On success, | ||
an opaque request handle is created and returned. The | ||
operation is completed after a call to \FUNC{shmem\_req\_test} or a | ||
call to \FUNC{shmem\_req\_wait}. When the operation is complete, the request handle | ||
is deallocated and cannot be reused. | ||
To my understanding, this is currently worded in a way that both I believe the correct language exists in the intro on lines 17 and 18.
|
||
|
||
Like blocking broadcast, before any \ac{PE} calls a broadcast routine, the following | ||
conditions must be ensured: | ||
\begin{itemize} | ||
\item The \dest{} array on all \acp{PE} participating in the broadcast | ||
is ready to accept the broadcast data. | ||
\item All \acp{PE} in the \VAR{team} argument must participate in | ||
the operation. | ||
\item If the \VAR{team} compares equal to \LibConstRef{SHMEM\_TEAM\_INVALID} or is | ||
otherwise invalid, the behavior is undefined. | ||
\item \ac{PE} numbering is relative to the team. The specified | ||
root \ac{PE} must be a valid \ac{PE} number for the team, | ||
between \CONST{0} and \VAR{N$-$1}, where \VAR{N} is the size of | ||
the team. | ||
\end{itemize} | ||
Otherwise, the behavior is undefined. | ||
|
||
Upon completion of a nonblocking broadcast routine, the following are true for the local | ||
\ac{PE}: | ||
\begin{itemize} | ||
\item The \dest{} data object is updated. | ||
\item If the local \ac{PE} is \VAR{PE\_root}, the data has been copied | ||
out of the \source{} data object. | ||
Move this change to a separate PR and ask section committee to add it. @davidozog |
||
\end{itemize} | ||
} | ||
|
||
|
||
\apireturnvalues{ | ||
Zero on success and nonzero otherwise. | ||
} | ||
|
||
\apinotes{ | ||
Team handle error checking and integer return codes are currently undefined. | ||
Implementations may define these behaviors as needed, but programs should | ||
ensure portability by doing their own checks for invalid team handles and for | ||
\LibConstRef{SHMEM\_TEAM\_INVALID}. | ||
} | ||
|
||
\end{apidefinition} |
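As an illustration of the broadcast description in this file, the following is a rough single-process model of the data movement and the root range check. It is not the proposed `shmem_broadcast_nb` routine: `model_broadcast` is a hypothetical name, the per-PE buffers are modeled as slices of flat arrays, and the model assumes that `dest` is updated on every PE including the root. The `-1` return for an out-of-range root is a choice made for this sketch only; the text above leaves that case undefined.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Single-process model of a team broadcast (hypothetical helper, not part of
 * OpenSHMEM). PE p's source and dest buffers are the nelems-long slices that
 * start at offset p * nelems. Returns 0 on success, -1 for an invalid root.
 */
int model_broadcast(size_t npes, size_t nelems, int pe_root,
                    long *dest, const long *source)
{
    assert(dest != NULL && source != NULL);

    /* The root must be a team-relative PE number between 0 and N-1. */
    if (pe_root < 0 || (size_t)pe_root >= npes)
        return -1;

    const long *root_src = source + (size_t)pe_root * nelems;

    /* Assumption for this model: every PE's dest, the root's included,
     * receives a copy of the root's source. */
    for (size_t pe = 0; pe < npes; pe++)
        memcpy(dest + pe * nelems, root_src, nelems * sizeof(long));

    return 0;
}
```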
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,35 @@ | ||
\apisummary{ | ||
The routine outputs the status of the operation identified by the request. | ||
} | ||
|
||
\begin{apidefinition} | ||
|
||
\begin{Csynopsis} | ||
int @\FuncDecl{shmem\_req\_test}@(shmem_req_h *request); | ||
\end{Csynopsis} | ||
|
||
\begin{apiarguments} | ||
|
||
\apiargument{IN}{request}{Request handle} | ||
|
||
\end{apiarguments} | ||
|
||
\apidescription{ | ||
A call to \FUNC{shmem\_req\_test} returns immediately. If the | ||
operation identified by the request is completed, it returns | ||
zero, and the request object is deallocated and set to \LibConstRef{SHMEM\_REQ\_INVALID}. | ||
If the operation is not completed, it returns a positive integer. | ||
If the request object is not valid (i.e., it is set to \LibConstRef{SHMEM\_REQ\_INVALID}), | ||
no operation is performed and a negative value is returned. | ||
|
||
In a multithreaded environment, \FUNC{shmem\_req\_test} can be called by | ||
different threads but on different request objects. It is the responsibility | ||
of the \openshmem user to ensure that proper synchronization is used to | ||
prevent race conditions or deadlock. | ||
} | ||
|
||
\apireturnvalues{ | ||
On success returns zero, otherwise returns a nonzero integer. | ||
} | ||
|
||
\end{apidefinition} |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,38 @@ | ||
\apisummary{ | ||
The routine waits until a operation identified by a request | ||
nit: "an operation" instead of "a operation" |
||
object completes. | ||
} | ||
|
||
\begin{apidefinition} | ||
|
||
\begin{Csynopsis} | ||
int @\FuncDecl{shmem\_req\_wait}@(shmem_req_h *request); | ||
\end{Csynopsis} | ||
|
||
\begin{apiarguments} | ||
|
||
\apiargument{IN}{request}{Request handle} | ||
|
||
\end{apiarguments} | ||
|
||
\apidescription{ | ||
|
||
The \FUNC{shmem\_req\_wait} function is a blocking operation that returns | ||
only after the operation identified by the request object has | ||
completed. When the operation is completed, \FUNC{shmem\_req\_wait} returns | ||
zero, and the request object is deallocated and set to \LibConstRef{SHMEM\_REQ\_INVALID}. | ||
If the request object is not valid (i.e., it is set to | ||
\LibConstRef{SHMEM\_REQ\_INVALID}), no operation is performed and a negative | ||
value is returned. | ||
|
||
In a multithreaded environment, \FUNC{shmem\_req\_wait} can be called by different | ||
threads but on different request objects. It is the responsibility of the | ||
\openshmem user to ensure that proper synchronization is used to prevent race | ||
conditions or deadlock. | ||
} | ||
|
||
\apireturnvalues{ | ||
On success returns zero, otherwise returns a negative integer. | ||
} | ||
|
||
\end{apidefinition} |
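To make the return-code conventions described for \FUNC{shmem\_req\_test} and \FUNC{shmem\_req\_wait} easier to discuss, here is a small mock of the request life cycle. Everything in it is hypothetical: `mock_op`, `mock_req_h`, `MOCK_REQ_INVALID`, `mock_req_test`, and `mock_req_wait` are stand-ins, since the real `shmem_req_h` is opaque. Progress is modeled as a countdown driven by the test call, and the pending case returns 1 as one possible value.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for an opaque request handle. */
typedef struct mock_op { int steps_left; } mock_op;
typedef mock_op *mock_req_h;
#define MOCK_REQ_INVALID ((mock_req_h)NULL)

/* Returns immediately: 0 if complete, 1 if pending, -1 if the handle is invalid. */
int mock_req_test(mock_req_h *request)
{
    assert(request != NULL);
    if (*request == MOCK_REQ_INVALID)
        return -1;                  /* invalid handle: no operation performed */
    if ((*request)->steps_left > 0) {
        (*request)->steps_left--;   /* model one unit of asynchronous progress */
        return 1;                   /* operation still pending */
    }
    *request = MOCK_REQ_INVALID;    /* completed: the handle is released */
    return 0;
}

/* Blocks until the operation completes: 0 on completion, negative if invalid. */
int mock_req_wait(mock_req_h *request)
{
    int rc;
    while ((rc = mock_req_test(request)) > 0)
        ;                           /* keep polling until no longer pending */
    return rc;
}
```

In this model, one or more test calls or a single wait call observe completion, and any later call on the same handle returns a negative value because the handle has been set to the invalid constant.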
Should sync_all be supported? My hunch is it seems possibly useful...