Adding shmem_malloc_with_hints interface #259

Merged
merged 18 commits into from
Nov 14, 2019
Changes from 3 commits
97 changes: 97 additions & 0 deletions content/shmem_malloc_hints.tex
@@ -0,0 +1,97 @@

\apisummary{
Collective memory allocation routine with support for providing hints.
}

\begin{apidefinition}

\begin{Csynopsis}
void *@\FuncDecl{shmem\_malloc\_with\_hints}@(size_t size, long hints);
\end{Csynopsis}

\begin{apiarguments}
\apiargument{IN}{size}{The size, in bytes, of a block to be
allocated from the symmetric heap. This argument is of type \CTYPE{size\_t}.}
\apiargument{IN}{hints}{A bit array of hints provided by the user to the implementation.}
\end{apiarguments}


\apidescription{

The \FUNC{shmem\_malloc\_with\_hints} routine, like \FUNC{shmem\_malloc}, returns a pointer to a block of at least
\VAR{size} bytes, which shall be suitably aligned so that it may be
assigned to a pointer to any type of object. This space is allocated from
the symmetric heap. When \VAR{size} is zero,
the \FUNC{shmem\_malloc\_with\_hints} routine performs no action and returns a null pointer.

In addition to the \VAR{size} argument, the user provides the \VAR{hints} argument.
The \VAR{hints} argument describes the expected manner in which the \openshmem program may use the allocated memory.
The valid usage hints are described in Table~\ref{usagehints}. Multiple hints are expressed as the bitwise \CONST{OR} of the individual hints.

The implementation may use the information provided by \VAR{hints} to optimize for performance.
If the implementation cannot optimize, the behavior is the same as that of \FUNC{shmem\_malloc}.
If more than one hint is provided, the implementation will make a best effort to use one or more of the hints
to optimize performance.

Contributor

Must all PEs allocate from the same special memory? It's not clear if asymmetry can exist. Does this impose additional implicit synchronization for each subset configuration of hints if it cannot satisfy the entire hint list?

Also, what happens if you OR SHMEM_HINT_NONE with other hint behaviors?

Collaborator Author

Good question.

All use cases that I have thought of require the memory to be symmetric (the same kind of memory).

Regarding extra synchronization, it depends on the implementation. If an implementation maintains asymmetric memory sizes on the PEs (say, each PE starts with a different amount of special memory), you might need extra synchronization for agreement. Otherwise, I do not see a need. In a way, it is similar to current DRAM allocations. Also, for the implementation we explored, we did not need extra synchronization.

I’m reluctant to add such a constraint. Without it, implementations are free to explore either approach (symmetric or asymmetric).

Do you see value in specifying one way or the other?

Contributor

To clarify, what would using long hint = SHMEM_HINT_NONE | SHMEM_HINT_LOW_LAT_MEM | SHMEM_HINT_HIGH_BW_MEM do?

Dropping SHMEM_HINT_NONE, what if the platform could provide SHMEM_HINT_LOW_LAT_MEM or SHMEM_HINT_HIGH_BW_MEM but not both simultaneously? Will we see application code marked up like this because who doesn't want to use low latency and high bandwidth memory for their application? Does the "best effort" default to a platform-specific precedence? The only feedback is that an allocation succeeded or it did not.

Collaborator Author

> To clarify, what would using long hint = SHMEM_HINT_NONE | SHMEM_HINT_LOW_LAT_MEM | SHMEM_HINT_HIGH_BW_MEM do?

Though this is a legal usage, it does not make sense. Implementations are allowed to default to shmem_malloc in this case.

> Dropping SHMEM_HINT_NONE, what if the platform could provide SHMEM_HINT_LOW_LAT_MEM or SHMEM_HINT_HIGH_BW_MEM but not both simultaneously? Will we see application code marked up like this because who doesn't want to use low latency and high bandwidth memory for their application? Does the "best effort" default to a platform-specific precedence? The only feedback is that an allocation succeeded or it did not.

“If more than one hint is provided, the implementation will make the best effort to use one or more hints to optimize performance.”

My intention with this statement was to provide flexibility for implementations to optimize as they wish when the user provides multiple hints. Obviously, some combinations of hints might not make sense. In such cases, if an implementation wants to give one hint precedence over others, the proposal allows it. Assigning priorities to hints is one way to implement it, but not the only way.
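
To make the "priorities" option concrete, here is a minimal C sketch of how an implementation might resolve several requested hints against what the platform can honor. Everything in it is an assumption made for illustration: the `HINT_*` names and their bit values, the `supported` mask, the priority order, and the function `pick_hint` itself. None of it is taken from the proposal or from any existing implementation.

```c
#include <stddef.h>

/* Hypothetical bit values; the proposal does not fix any of them. */
#define HINT_NONE        0L
#define HINT_LOW_LAT_MEM (1L << 0)
#define HINT_HIGH_BW_MEM (1L << 1)
#define HINT_ATOMICS     (1L << 2)

/*
 * Pick the single hint to honor, given the hints the caller requested and
 * a mask of hints the platform supports. Entries earlier in the priority
 * table win. Returns HINT_NONE when no requested hint is supported, which
 * models falling back to plain shmem_malloc behavior.
 */
static long pick_hint(long requested, long supported)
{
    /* Illustrative precedence only; a real platform could order these differently. */
    static const long priority[] = {
        HINT_ATOMICS, HINT_LOW_LAT_MEM, HINT_HIGH_BW_MEM
    };
    long usable = requested & supported;

    for (size_t i = 0; i < sizeof priority / sizeof priority[0]; i++) {
        if (usable & priority[i])
            return priority[i];
    }
    return HINT_NONE;
}
```

With `HINT_NONE` assumed to be zero, OR-ing it with other hints changes nothing, so `pick_hint(HINT_NONE | HINT_LOW_LAT_MEM | HINT_HIGH_BW_MEM, HINT_HIGH_BW_MEM)` selects the high-bandwidth hint on a platform that supports only that one. As noted in the thread, the caller gets no feedback about which hint was honored.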

The \FUNC{shmem\_malloc\_with\_hints} routine is provided so that multiple \acp{PE} in a program can allocate symmetric,
remotely accessible memory blocks. When no action is performed, the
routine returns without performing a barrier. Otherwise, the routine will call a barrier on exit.
Contributor

> When no action is performed, these routines return without performing a barrier.

What does this mean? That the function returns NULL for all PEs, no memory has been allocated, and no implicit barrier has occurred? Some applications may use shmem_malloc and friends as implicit barriers. This proposal is for an optimization which seems to be a drop in replacement for vanilla shmem_malloc, but dropping the implicit barrier on function exit would change the behavior of the application.

Collaborator Author

When no allocation is done, “dropping the implicit barrier” is the behavior we have for shmem_malloc in OpenSHMEM 1.4; please refer to page 26, lines 40-41. The proposal aims to maintain the same behavior for that case.

Contributor

Oops. In OpenSHMEM 1.4, that was not the case. This behavior was changed between 1.4 and now in #201.

Collaborator Author

Ah… Thanks for correcting that. It has been so long since we debated this that I did not realize it is relatively new. I was looking at my git copy. :)

Can we consider this issue resolved?
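
The behavior discussed in this thread (a zero-size call returns a null pointer without a barrier, any other call ends with a barrier) can be modeled in a single process. The C sketch below is only a stand-in: `mock_malloc_with_hints`, `mock_barrier_all`, and the `barrier_calls` counter are hypothetical names invented for this illustration, the hints are ignored, and ordinary `malloc` replaces the symmetric heap.

```c
#include <stdlib.h>

/* Counts how many times the stand-in barrier has run. */
static int barrier_calls = 0;

/* Stand-in for the implicit barrier; a real one would synchronize all PEs. */
static void mock_barrier_all(void)
{
    barrier_calls++;
}

/*
 * Stand-in for the proposed routine. The hints argument is accepted but
 * ignored, which models an implementation that cannot optimize and so
 * behaves like plain shmem_malloc.
 */
static void *mock_malloc_with_hints(size_t size, long hints)
{
    (void)hints;

    if (size == 0)
        return NULL;            /* no action performed, no barrier */

    void *block = malloc(size); /* local heap stands in for the symmetric heap */
    mock_barrier_all();         /* barrier on exit */
    return block;
}
```

In this model a zero-size call leaves `barrier_calls` unchanged, which is the case the reviewer is concerned about: code that relies on every allocation call as a synchronization point would not be synchronized by such a call.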

This ensures that all \acp{PE} participate in the memory allocation, and that the memory on other
\acp{PE} can be used as soon as the local \ac{PE} returns. The implicit barrier performed by this routine will quiet the
default context. It is the user's responsibility to ensure that no communication operations involving the given memory block are pending on
other contexts prior to calling the \FUNC{shmem\_free} and \FUNC{shmem\_realloc} routines.
The user is also responsible for calling this routine with identical arguments on all
\acp{PE}; if differing \VAR{size} or \VAR{hints} arguments are used, the behavior of the call
and any subsequent \openshmem calls is undefined.
}

\apireturnvalues{
The \FUNC{shmem\_malloc\_with\_hints} routine returns a pointer to the allocated space;
otherwise, it returns a null pointer.
}

\apinotes{
}

\apiimpnotes{
}
\begin{longtable}{|p{0.45\textwidth}|p{0.5\textwidth}|}
\hline
\textbf{Hints} & \textbf{Usage hint}
\tabularnewline \hline
\endhead
%%
\LibConstDecl{SHMEM\_HINT\_NONE} &
Behavior same as \FUNC{shmem\_malloc}
\tabularnewline \hline

\LibConstDecl{SHMEM\_HINT\_LOW\_LAT\_MEM} &
Allocate memory on low-latency storage
Contributor

Low latency relative to what? Low latency for local access? Remote access? Both? Only one of those? Why would I ever want to allocate memory with low latency and (related to the next flag) low bandwidth?

Collaborator Author (@manjugv, Sep 20, 2019)

> Low latency relative to what? Low latency for local access? Remote access? Both?

This is symmetric heap memory. So, it is for memory accessed via shmem routines (which I would guess will be dominated by remote access).

If some architectures and use cases require both local access and remote access, it is easy to expand the hints. If the memory is needed only for local access, I guess, the programs should not allocate the memory from the symmetric heap.

Collaborator Author

> Why would I ever want to allocate memory with low latency and (related to the next flag) low bandwidth?

Already discussed here. Please refer to this discussion (#259 (comment))

Contributor

The memory is accessed locally using load and store semantics without shmem. This is how you initialize the memory with some data in the first place. It is also accessed remotely using load and store semantics without going through the SHMEM API. My point is that we cannot claim that something is low latency (the same is true for high BW and most of the other hints), since it really depends on who is accessing the data, and this is runtime information. For example, for the NIC it can be some local SRAM region on the die, and for the socket it is L1/L2/L3. If you want to allocate memory on the NIC for low-latency access, just call it NIC memory, GPU memory, L1 memory, etc. The low-latency and high-BW definitions are misleading.

\tabularnewline \hline

\LibConstDecl{SHMEM\_HINT\_HIGH\_BW\_MEM} &
Allocate memory on high-bandwidth storage
\tabularnewline \hline

\LibConstDecl{SHMEM\_HINT\_PSYNC} &
Contributor

Why are we introducing tuning for something that we are deprecating?

Collaborator Author

This is always a challenge with tickets that are developed in parallel and have an impact on each other.

Already discussed here. Please refer to this discussion (#259 (comment))

Memory used as \CONST{PSYNC} array
\tabularnewline \hline
Contributor

Why are we introducing tuning for something that we are deprecating?

Collaborator Author

Already discussed here. Please refer to this discussion (#259 (comment))


\LibConstDecl{SHMEM\_HINT\_PWORK} &
Memory used as \CONST{PWORK} array
\tabularnewline \hline

\LibConstDecl{SHMEM\_HINT\_ATOMICS} &
Contributor

What happens if the ATOMICS hint was not passed but an atomic operation was used?

Collaborator Author

I don't explicitly cover the bad behavior semantics. But, based on the use case experience we have currently, I'm inclined to say that it should not impact correctness. However, the programs might not be able to get the best performance characteristics for that environment.

Do you prefer that we should explicitly say this?

Contributor

My initial assumption is that if this is a hint, the implementation can completely ignore it. But we still have an issue with these semantics. For the HCA, the optimal location of the memory for atomics will be on the HCA. For the core, it will be optimal to have it in L1/L2, which is suboptimal for atomics coming from the network, and vice versa. Essentially, such a hint may actually do more harm to applications. My suggestion is to use explicit names: HCA1, HCA2, GPU, L2, LLC, etc. This will give the user a very clear description of what actually happens.

Contributor

Since in this case you optimize for remote PEs only, I think it should be named SHMEM_HINT_REMOTE_ATOMICS to be clear that you don't optimize for local PEs. Without explicit specification local/remote the hint is not very useful.

Memory used for atomic operations
\tabularnewline \hline

\LibConstDecl{SHMEM\_HINT\_SIGNAL} &
Memory used for signal operations
Contributor

What happens when the flag was not used, but a signal operation was used?

Collaborator Author

Hmm .. Which flag are you referring to, can you please elaborate?

Contributor

Above SHMEM_HINT_SIGNAL

\tabularnewline \hline

\TableCaptionRef{Memory usage hints}
\label{usagehints}
\end{longtable}
Contributor

Also, we should state that neither alignment requirements nor memory properties, such as cache line size, are impacted by the hint. We shall also state that the memory semantics for local access (load and store) are not impacted by the hint.

Collaborator Author

AFAIU, for the shmem_malloc routine the implementation is free to allocate memory that is either cache-aligned or not. One of the constraints is that it should be word-aligned. Similarly, the memory access model (which is yet to be defined or clarified in #229) will provide certain access guarantees for the memory allocated by shmem_malloc. In both cases, the proposal intends to follow the semantics of shmem_malloc.

Contributor

The memory eventually gets mapped into the core. The core architecture defines multiple ways in which it can be mapped. Each of the mappings has its own semantics and constraints. For example, the semantics of normal cacheable (write-back) and non-cacheable (WC) mappings are very different. Since the user has direct access through the pointer, code that worked on one machine will break on another with an exception or even data corruption.

Contributor

In order to support something like this, you have to remove shmem_ptr and prohibit any direct access to the memory through load and store semantics. Next, you would have to introduce a shmem_memcpy function to copy data in and out of a shmem_malloced region.


\end{apidefinition}
\newpage
4 changes: 4 additions & 0 deletions main_spec.tex
@@ -103,6 +103,10 @@ \subsection{Memory Management Routines}
\subsubsection{\textbf{SHMEM\_MALLOC, SHMEM\_FREE, SHMEM\_REALLOC, SHMEM\_ALIGN}}\label{subsec:shfree}
\input{content/shmem_malloc.tex}

\newpage
\subsubsection{\textbf{SHMEM\_MALLOC\_WITH\_HINTS}}\label{subsec:shmmallochint}
\input{content/shmem_malloc_hints.tex}

\subsubsection{\textbf{SHMEM\_CALLOC}}\label{subsec:shmem_calloc}
\input{content/shmem_calloc.tex}
