-
Notifications
You must be signed in to change notification settings - Fork 41
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adding shmem_malloc_with_hints interface #259
Adding shmem_malloc_with_hints interface #259
Conversation
If I understand correctly in PR #256 we allocate the whole SHEAP using hints and then use |
@naveen-rn Correct. Please refer #258 . |
content/shmem_malloc_hints.tex
Outdated
The \VAR{hint} could describe the trait of memory storage where memory is allocated or | ||
it could describe the expected manner in which the \openshmem program may use the allocated memory. | ||
The valid storage trait hints are described in Table~\ref{traithints} and the valid usage hints | ||
are described in Table~\ref{usagehints}. Multiple hints are expressed as \CONST{OR} of \VAR{hints}. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@manjugv
question regarding Multiple hints are expressed as OR of hints.
: how oshmem should react on case if supported less flags than requested? or supported all honts, but not at same buffer (and I believe this case - supported all but not at same buffer - is the most actual)?
it seems OR
combination of traithints
hints should be denied
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@hoopoepg That is correct. In the last sentence of this paragraph I say that. Some combination of hints does not make sense and some combination of hints might not be supported by a particular implementation. In either case, it can behave like shmem_malloc.
On the core functionality of this proposal - do we need the hints to have both the storage traits and possible usage pattern? Can we just stick to the possible usage pattern, say SHMEM_HINT_AMO, SHMEM_HINT_SIGNAL? Something to discuss further. |
|
||
The \FUNC{shmem\_malloc\_with\_hint} routine is provided so that multiple \acp{PE} in a program can allocate symmetric, | ||
remotely accessible memory blocks. When no action is performed, these | ||
routines return without performing a barrier. Otherwise, the routine will call a barrier on exit. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When no action is performed, these routines return without performing a barrier.
What does this mean? That the function returns NULL for all PEs, no memory has been allocated, and no implicit barrier has occurred? Some applications may use shmem_malloc
and friends as implicit barriers. This proposal is for an optimization which seems to be a drop in replacement for vanilla shmem_malloc
, but dropping the implicit barrier on function exit would change the behavior of the application.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When no allocation is done, “dropping the implicit barrier” is the behavior we have for shmem_malloc in OpenSHMEM 1.4 - please refer page 26 line 40-41. The proposal is aiming to maintain the same behavior for that case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops. In OpenSHMEM 1.4, that was not the case. This behavior was changed between 1.4 and now in #201.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah… Thanks for correcting that. It has been so long that we debated about this I did not realize it is relatively new. I was looking at my git copy. :)
Can we consider this issue resolved?
If the implementation cannot optimize, the behavior is same as \FUNC{shmem\_malloc}. | ||
If more than one hint is provided, the implementation will make the best effort to use one or more hints | ||
to optimize performance. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Must all PEs allocate from the same special memory? It's not clear if asymmetry can exist. Does this impose additional implicit synchronization for each subset configuration of hints if it cannot satisfy the entire hint list?
Also, what happens if you OR SHMEM_HINT_NONE
with other hint behaviors?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good question.
All use cases that I have thought of requires the memory to be symmetric (same kind of memory).
Regarding extra synchronization, it depends on the implementation. If the implementations maintain asymmetric memory sizes (say each PE starts with different amount of special memory) on the PEs, you might need the extra synchronization for agreement. Otherwise, I do not see a need. In a way, it is similar to current DRAM allocations. Also, for the implementation we explored, we did not need extra synchronization.
I’m reluctant to add such a constraint. Without such constraint, the implementations are free to explore either approach (symmetric and asymmetric).
Do see value in specifying one way or other?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To clarify, what would using long hint = SHMEM_HINT_NONE | SHMEM_HINT_LOW_LAT_MEM | SHMEM_HINT_HIGH_BW_MEM
do?
Dropping SHMEM_HINT_NONE
, what if the platform could provide SHMEM_HINT_LOW_LAT_MEM
or SHMEM_HINT_HIGH_BW_MEM
but not both simultaneously? Will we see application code marked up like this because who doesn't want to use low latency and high bandwidth memory for their application? Does the "best effort" default to a platform-specific precedence? The only feedback is that an allocation succeeded or it did not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To clarify, what would using long hint = SHMEM_HINT_NONE | SHMEM_HINT_LOW_LAT_MEM | SHMEM_HINT_HIGH_BW_MEM do?
Though this is a legal usage, it does not make sense to use. The implementations are allowed to default to shmem_malloc in this case.
Dropping SHMEM_HINT_NONE, what if the platform could provide SHMEM_HINT_LOW_LAT_MEM or SHMEM_HINT_HIGH_BW_MEM but not both simultaneously? Will we see application code marked up like this because who doesn't want to use low latency and high bandwidth memory for their application? Does the "best effort" default to a platform-specific precedence? The only feedback is that an allocation succeeded or it did not.
“If more than one hint is provided, the implementation will make the best effort to use one or more hints to optimize performance.
“
My intention with this statement was to provide flexibility for the implementations to optimize as they wish when the user provides multiple hints. Obviously, some combinations of hints might not make sense. In such cases, If the implementations want to give precedence of one hint over others, the proposal allows it. That (assigning priorities to hints) is one way to implement it, but not the only way.
Assuming that the #280 Teams Proposal makes it through for 1.5, won't all references to active set synchronization memory and reduction working memory be deprecated? If so, would SHMEM_HINT_PSYNC, SHMEM_HINT_PWORK also be deprecated before they become part of the specification? There seems to be a consistency problem with these two proposals. Would SHMEM_HINT_SIGNAL (from #275) and SHMEM_HINT_ATOMIC have any different/distinct behavior in an implementation? It seems most code (and I believe all of the examples in the specification) use global or static variables for atomics and signals rather than shmem_malloc'ed memory. Should it be recommended to use shmem_malloc_with hints over static/global? Often the lowest latency memory pools have the highest bandwidth (with some exceptions). Are we relying on users to know whether their application needs lower latency or higher bandwidth and to understand which pools the flags implicitly allocate from? What about very large memory pools of slow memory or mapped storage when latency/bandwidth aren't issues (i.e. SHMEM_HINT_BIGMEM)? |
Yes, depends on the implementation - there might be a difference. |
Yes, I agree with your observation. One way to approach such inconsistencies is via chapter committee. In this particular instance, I’m ok not to introduce it or resolve it via chapter committee. |
Since the signal has a separate interface to fetch the signal value and it is atomic only with itself, it provides more\different opportunities for optimization. I can envision implementing them in different ways based on the capabilities of a particular network. On some systems, it might be the same. |
In the usage scenarios that I have come across, we have not mapped symmetric memory to “slow” memory. So, I have not thought about such a hint. That said, it is a valid usage scenario. So, let's discuss this during the reading and see what rest of the committee thinks about adding it. |
In general, managing global/static memory for global access via OpenSHMEM interfaces presents more constraints and thus fewer optimization opportunities. However, I’m still hesitant to provide a blanket recommendation because the performance depends on the system and implementation. We could add some examples where we can show atomics allocated via shmem_malloc interfaces, if it is useful. |
That is correct. |
content/shmem_malloc_hints.tex
Outdated
\tabularnewline \hline | ||
|
||
\LibConstDecl{SHMEM\_HINT\_LOW\_LAT\_MEM} & | ||
Allocate memory on low-latency storage |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Low-latency relatively to what ? Low latency for local access ? remote access ? both ? only one of those ? Why I would ever what to allocate memory with low-latency and (related to next flag) slow bandwidth ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Low-latency relatively to what ? Low latency for local access ? remote access ? both ?
This is symmetric heap’s memory. So, it is for the memory accessed via shmem routines (which I would guess will be dominated by remote access).
If some architectures and use cases require both local access and remote access, it is easy to expand the hints. If the memory is needed only for local access, I guess, the programs should not allocate the memory from the symmetric heap.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why I would ever what to allocate memory with low-latency and (related to next flag) slow bandwidth ?
Already discussed here. Please refer to this discussion (#259 (comment))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The memory is access locally using load and store semantics without shmem. This is how you initialize the memory with some data in a first place. It is also accessed remotely using load and store semantics without going through SHMEM API. My point the we cannot claim that something is low latency (the same true for high BW and most of other hints) since it really depends on who is accessing the data and this runtime information. For example, for Nic it can be some local sram region on the die and for the socket it L1/L2/L3. If you want to allocate memory on the nic for low latency access, just call it NIC memory, GPU memory, L1 memory ,etc. The low latency and high bw definitions are misleading.
content/shmem_malloc_hints.tex
Outdated
Allocate memory on high-bandwidth storage | ||
\tabularnewline \hline | ||
|
||
\LibConstDecl{SHMEM\_HINT\_PSYNC} & |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why we introducing tuning for something that we are deprecating ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is always a challenge with tickets (that are developed in parallel) that has impact on each other.
Already discussed here. Please refer to this discussion (#259 (comment))
content/shmem_malloc_hints.tex
Outdated
|
||
\LibConstDecl{SHMEM\_HINT\_PSYNC} & | ||
Memory used as \CONST{PSYNC} array | ||
\tabularnewline \hline |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why we introducing tuning for something that we are deprecating ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Already discussed here. Please refer to this discussion (#259 (comment))
content/shmem_malloc_hints.tex
Outdated
Memory used as \CONST{PWORK} array | ||
\tabularnewline \hline | ||
|
||
\LibConstDecl{SHMEM\_HINT\_ATOMICS} & |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens with ATOMIC hint was not passed but atomic operation was used ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't explicitly cover the bad behavior semantics. But, based on the use case experience we have currently, I'm inclined to say that It should not impact correctness. However, the programs might not be able to get the best performance characteristics for that environment.
Do you prefer that we should explicitly say this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My initial assumption is that if this is hint, implementation can completely ignore it. But we still have issue with this semantics. For HCA optimal location of the memory for Atomics will be on the HCA. For core it will be optimal to have in L1/L2 which is suboptimal for ATOMICs coming from the network and vice-versa. Essentially such hint may actually do more harm to applications. My suggestion is to use explicit names - HCA1, HCA2, GPU, L2, LLC, etc. This will provide very clear description to user what actually happens.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since in this case you optimize for remote PEs only, I think it should be named SHMEM_HINT_REMOTE_ATOMICS to be clear that you don't optimize for local PEs. Without explicit specification local/remote the hint is not very useful.
\tabularnewline \hline | ||
|
||
\LibConstDecl{SHMEM\_HINT\_SIGNAL} & | ||
Memory used for \VAR{signal} operations |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens when the flag was not used, but signal operation was used ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm .. Which flag are you referring to, can you please elaborate?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Above SHMEM_HINT_SIGNAL
|
||
\TableCaptionRef{Memory usage hints} | ||
\label{usagehints} | ||
\end{longtable} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also we should to state that neither alignment requirements or memory properties, such as cache line size are not get impacted by the hint. We also shell state the memory semantics for local assess (load and store) are not impacted by the hint.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AFAIU, for the shmem_malloc routine the implementation is free to allocate the memory, which is either cache aligned or not. One of the constraints is that it should be word-aligned. Similarly, the memory access model (which is yet be defined or clarified here #229) will provide certain access guarantees to the memory allocated by shmem_malloc. In both cases, the proposal intends to follow the semantics of shmem_malloc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The memory gets eventually mapped into the core. The core architecture defined multiple way how it can be mapped. Each on of the mapping has own semantics and constrains. For example semantics between normal cacheable (write back) and non-cacheable (WC) is very different. Since user has direct access through the pointer, code that worked on one machine will break on another with exception or even data corruption.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In order to support something like this, you have to remove shmem_ptr and prohibit any direct asses to the memory through load and store semantics. Next you would have to introduce shmem_memcpy function to copy-in-out shmem_malloced region.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Editorial suggestions
Change hint to hints Co-Authored-By: James Dinan <[email protected]>
Co-Authored-By: James Dinan <[email protected]>
Co-Authored-By: Nick Park <[email protected]>
Co-Authored-By: Nick Park <[email protected]>
Co-Authored-By: Nick Park <[email protected]>
Co-Authored-By: Nick Park <[email protected]>
Co-Authored-By: Nick Park <[email protected]>
Co-Authored-By: Nick Park <[email protected]>
Special ballot changes:
Editorial changes: |
Comments from Intel review of PR:
|
@nspark Could you please update (or dismiss) your review so that the PR is no longer flagged as "changes requested"? Thanks! |
operations performed on this memory are atomic operations, and origin | ||
and target \ac{PE} of the atomic operations do not share a memory domain | ||
.i.e., symmetric objects on the target \ac{PE} is not accessible using | ||
load/store operations from the origin \ac{PE} or vice versa. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor: I would consider to move the text under SHMEM_HINT_ATOMICS_REMOTE
since this is critical information. BTW, probably the same true for remove SHMEM_HINT_SIGNAL
vs local SHMEM_HINT_SIGNAL
? @naveen-rn
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I understand the definition of remote vs local wrt this change, I'm not sure whether there is need to differentiate the local side to the signal. All signals from the source PE are passed-by-value to the remote signal address buffer. A buffer created with SHMEM_HINT_SIGNAL will mostly be updated by some remote PE.
void shmem_put_signal(shmem_ctx_t ctx, TYPE *dest, const TYPE *source, size_t nelems,
uint64_t *sig_addr, uint64_t signal, int sig_op, int pe);
I did not see them until this morning and I wanted to answer them before the vote :). So, the answers are a bit brief but hopefully, it makes the point. Also, if need be, we can discuss it during the vote. |
Suggest that we rename |
@manjugv Please let me know when the PR is ready to merge. We may need to use section committee review to approve edits prior to the merge. |
@jdinan The PR is ready to merge. The two changes (1 and 2) we discussed today, I will bring it as separate PRs. So, it is easy to process. Both (1) and (2) we can do it as a chapter committee change. If the committee prefers to vote on (2) that is fine too. (1) Change SHMEM_HINT_SIGNAL to SHMEM_HINT_SIGNAL_REMOTE |
@manjugv Thanks -- will proceed with the merge. Please bring into the discussion of (1) renaming these constants I reviewed the existing text for (2) and I think I misremembered having text indicating how realloc deals with alignment. We may need to deal with the alignment part as a separate ticket. Something to keep in mind as you draft the text for how realloc deals with memory allocated by |
Proposal to address the issue #258 .