Adding shmem_malloc_with_hints interface #259
Must all PEs allocate from the same special memory? It's not clear if asymmetry can exist. Does this impose additional implicit synchronization for each subset configuration of hints if it cannot satisfy the entire hint list?
Also, what happens if you OR SHMEM_HINT_NONE with other hint behaviors?
Good question.
All the use cases that I have thought of require the memory to be symmetric (the same kind of memory).
Regarding extra synchronization, it depends on the implementation. If an implementation maintains asymmetric memory sizes (say, each PE starts with a different amount of special memory) on the PEs, you might need the extra synchronization for agreement. Otherwise, I do not see a need. In a way, it is similar to current DRAM allocations. Also, for the implementation we explored, we did not need extra synchronization.
I'm reluctant to add such a constraint. Without it, implementations are free to explore either approach (symmetric or asymmetric).
Do you see value in specifying it one way or the other?
To clarify, what would using long hint = SHMEM_HINT_NONE | SHMEM_HINT_LOW_LAT_MEM | SHMEM_HINT_HIGH_BW_MEM do? Dropping SHMEM_HINT_NONE, what if the platform could provide SHMEM_HINT_LOW_LAT_MEM or SHMEM_HINT_HIGH_BW_MEM but not both simultaneously? Will we see application code marked up like this, because who doesn't want to use low-latency and high-bandwidth memory for their application? Does the "best effort" default to a platform-specific precedence? The only feedback is that an allocation succeeded or it did not.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Though this is legal usage, it does not make sense to use. Implementations are allowed to default to shmem_malloc in this case.
My intention with this statement was to give implementations the flexibility to optimize as they wish when the user provides multiple hints. Obviously, some combinations of hints might not make sense. In such cases, if an implementation wants to give one hint precedence over the others, the proposal allows it. Assigning priorities to hints is one way to implement it, but not the only way.
What does this mean? That the function returns NULL on all PEs, no memory has been allocated, and no implicit barrier has occurred? Some applications may use shmem_malloc and friends as implicit barriers. This proposal is for an optimization which seems to be a drop-in replacement for vanilla shmem_malloc, but dropping the implicit barrier on function exit would change the behavior of the application.
When no allocation is done, "dropping the implicit barrier" is the behavior we already have for shmem_malloc in OpenSHMEM 1.4; please refer to page 26, lines 40-41. The proposal aims to maintain the same behavior for that case.
Oops. In OpenSHMEM 1.4, that was not the case. This behavior was changed between 1.4 and now in #201.
Ah… thanks for correcting that. It has been so long since we debated this that I did not realize it is relatively new; I was looking at my git copy. :)
Can we consider this issue resolved?
Low latency relative to what? Low latency for local access? Remote access? Both? Only one of those? And why would I ever want to allocate memory with low latency but (related to the next flag) low bandwidth?
This is the symmetric heap's memory, so it is for memory accessed via shmem routines (which I would guess will be dominated by remote access).
If some architectures and use cases require both local and remote access, it is easy to expand the hints. If the memory is needed only for local access, I would guess the programs should not allocate it from the symmetric heap.
Already discussed here. Please refer to this discussion (#259 (comment))
The memory is accessed locally using load and store semantics without shmem; this is how you initialize the memory with some data in the first place. It is also accessed remotely using load and store semantics without going through the SHMEM API. My point is that we cannot claim that something is low latency (the same is true for high BW and most of the other hints), since it really depends on who is accessing the data, and that is runtime information. For example, for a NIC it can be some local SRAM region on the die, and for the socket it is L1/L2/L3. If you want to allocate memory on the NIC for low-latency access, just call it NIC memory, GPU memory, L1 memory, etc. The low-latency and high-BW definitions are misleading.
Why are we introducing tuning for something that we are deprecating?
This is always a challenge with tickets (developed in parallel) that have an impact on each other.
Already discussed here. Please refer to this discussion (#259 (comment))
Why are we introducing tuning for something that we are deprecating?
Already discussed here. Please refer to this discussion (#259 (comment))
What happens when the ATOMIC hint was not passed but an atomic operation was used?
I don't explicitly cover the bad-behavior semantics. But based on the use-case experience we currently have, I'm inclined to say that it should not impact correctness. However, the programs might not get the best performance characteristics for that environment.
Would you prefer that we say this explicitly?
My initial assumption is that if this is a hint, the implementation can completely ignore it. But we still have an issue with these semantics. For an HCA, the optimal location of memory for atomics will be on the HCA; for a core, it will be optimal to have it in L1/L2, which is suboptimal for atomics coming from the network, and vice versa. Essentially, such a hint may actually do more harm to applications. My suggestion is to use explicit names: HCA1, HCA2, GPU, L2, LLC, etc. This will give the user a very clear description of what actually happens.
Since in this case you optimize for remote PEs only, I think it should be named SHMEM_HINT_REMOTE_ATOMICS to make clear that you don't optimize for local PEs. Without an explicit local/remote specification, the hint is not very useful.
What happens when the flag was not used, but a signal operation was used?
Hmm… which flag are you referring to? Can you please elaborate?
Above: SHMEM_HINT_SIGNAL
Also, we should state that neither the alignment requirements nor memory properties, such as cache-line size, are impacted by the hint. We should also state that the memory semantics for local access (loads and stores) are not impacted by the hint.
AFAIU, for the shmem_malloc routine the implementation is free to allocate memory that is either cache-aligned or not; one of the constraints is that it should be word-aligned. Similarly, the memory access model (which is yet to be defined or clarified here: #229) will provide certain access guarantees for the memory allocated by shmem_malloc. In both cases, the proposal intends to follow the semantics of shmem_malloc.
The memory eventually gets mapped into the core. The core architecture defines multiple ways it can be mapped, and each of the mappings has its own semantics and constraints. For example, the semantics of normal cacheable (write-back) and non-cacheable (WC) memory are very different. Since the user has direct access through the pointer, code that worked on one machine will break on another with an exception or even data corruption.
In order to support something like this, you have to remove shmem_ptr and prohibit any direct access to the memory through load and store semantics. Next, you would have to introduce a shmem_memcpy function to copy in and out of a shmem_malloc'ed region.