forked from aws/aws-ofi-nccl
-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Test #3
Open
rajachan
wants to merge
43
commits into
master
Choose a base branch
from
ci-clang-ver
base: master
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Open
Test #3
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Signed-off-by: Seth Zegelstein <[email protected]>
With switching to this plugin implementation, we want to take advantage of the RDMA protocol on trn1/trn1n platforms without setting environment variables. This commit sets RDMA protocol as default for trn1/trn1n platforms. Signed-off-by: Michael Axtmann <[email protected]>
The NIC grouping code for RDMA protocol groups NICs around accelerators. To find the accelerators, the grouping code uses hard-coded accelerator identifies. Supported hard-coded accelerators are for NVIDIA GPUs and trainium devices. In case of an unknown accelerator, the grouping code would fail and error out. This patch modifies the code such that each libfabric NIC is exposed as one device to the user of the plugin in case of absense of known accelerators around which a libfabric NIC can be grouped. Signed-off-by: Michael Axtmann <[email protected]>
Since a TRN accelerator is composed of multiple cores, the number of trainium accelerators does not necessarily reflect the number of NIC devices that the RDMA protocol should expose to the user. Instead, each core should have a NIC accessible for communication if that many NICs are available. The best approach, for now, is to remove trainium accelerators from the list of accelerators around which NICs are grouped. Consequently, each libfabric NIC is exposed as on NIC device to the user. This provides trainium maximal freedom in routing data over NICs. In the long run, a better solution might be to expose the number of actual cores to the plugin and take that number into account while NIC grouping. Signed-off-by: Michael Axtmann <[email protected]>
Unfortunately, this couldn't be tested without introducing the action to the repo so that the action appeared in the UI. It was implemented with a script issue. This fixes that script issue, and can be tested before merging now that the initial action was introduced to the repository.
A later revision of PR aws#438 introduced a NULL ptr bug in the `endpoint_per_comm = 1` code. Since this option isn't enabled by default, we weren't testing it. Signed-off-by: Eric Raut <[email protected]>
Enable tuner initialization message to be displayed on all NCCL_DEBUG=INFO and display the logs as part of the default subsystem (INIT) Signed-off-by: Arun Karthik <[email protected]>
Signed-off-by: Seth Zegelstein <[email protected]>
the canonical token is __cplusplus not _cplusplus. When this is incorrect, linkage fails as cxx will perform name mangling when including these headers, resulting in a function with that mangle not existing. stack-info: PR: aws#554, branch: aws-nslick/stack/1 Signed-off-by: Nicholas Sielicki <[email protected]>
The control messages are not going to be scheduled over multiple rails, so remove the unnecessary logic around it. Signed-off-by: Brian Barrett <[email protected]> Signed-off-by: Raghu Raja <[email protected]>
For the long message RDMA protocol, we want to make sure that we never starve the sender for data to move, which means prioritizing control messages from the receiver to the sender. This patch moves both the communicator setup and recv control messages to a new endpoint, which is always on device rail 0. Future patches will optimize polling of the control message cq in the send path and setting priority bits on the control cq. Signed-off-by: Brian Barrett <[email protected]> Signed-off-by: Raghu Raja <[email protected]>
If a match isn't found for the current send, poll the control cq to see if the match can be found. While this extends the current send() call, it potentially lowers the time until data transfer starts. Signed-off-by: Brian Barrett <[email protected]> Signed-off-by: Raghu Raja <[email protected]>
c++ requires literals not sit directly next to other literals and must be separated by at least one whitespace token. Signed-off-by: Nicholas Sielicki <[email protected]>
static_assert is a c++ keyword, so when not using c++, create a macro under the same name that aliases _Static_assert to static_assert. This means static_assert can be used in both cases. Signed-off-by: Nicholas Sielicki <[email protected]>
When using designated initializer lists, under C++ only, a warning will be emitted if the declaration order does not match the initialization order if any fields are missing in the initializer list. Reorder these structs to initialize all fields in their declaration order. Signed-off-by: Nicholas Sielicki <[email protected]>
void pointer arithmetic is illegal in c++; cast to uintptr, do the offset, then cast to void* to pass to libfabric. Signed-off-by: Nicholas Sielicki <[email protected]>
When nRanks==nNodes (SPLIT_MASK=0x7 in NCCL-tests) NCCL is telling the tuner that NVLS is not supported, because it is only doing inter node communication and NVLS doesn't works across nodes. So the region that chooses NVLSTree+Simple is ignored by the tuner, and it falls back to NCCL's tuner. The best choice there is Ring+Simple, so this is updating that region. Signed-off-by: Amedeo Sapio <[email protected]>
Tests added for new functionality as well. Signed-off-by: Eric Raut <[email protected]>
For upcoming deferred comm close work Signed-off-by: Eric Raut <[email protected]>
Changes to the plugin will require changing the endpoint ownership from the API to the protocol-specific send/recv communicators. As a necessary step toward this, make the send/recv communicators responsible for freeing their associated endpoints. Signed-off-by: Eric Raut <[email protected]>
Fail send/recv if called on inactive communicator (after close) Signed-off-by: Eric Raut <[email protected]>
Communicators will be added to the cleanup list when NCCL calls `closeSend`/`closeRecv`. Closing the last communicator in the process will be blocking. Future commits will add a message to be sent on communicator close, which will be waited on here before destroying communicator resources. Signed-off-by: Eric Raut <[email protected]>
The close message will be sent on communicator close in an upcoming commit. Signed-off-by: Eric Raut <[email protected]>
The receive communicator, on closing, will send a message indicating it is safe for the send communicator to close resources. The sender will not destroy communicator resources until it receives this message. This will be required for the eager protocol when the sender does not wait for a control message before marking the send complete (upcoming commit). It prevents sender from closing the communicator before the receiver sends a control message, leading to an error on the receiver. Signed-off-by: Eric Raut <[email protected]>
Updates the send_close message to include a count of the number of control messages sent by receiver. Sender will wait to receive all control messages before closing resources. This handles the unlikely case where completions are reordered such that the close message arrives before a control message. Signed-off-by: Eric Raut <[email protected]>
Sender will not wait to receive a control message from receiver for eager messages, and receiver will not send a control message if it has already received eager data. Both communicators will await any outstanding operations on communicator close. Signed-off-by: Eric Raut <[email protected]>
Do not attempt to link nvtx (as was previously required in older versions of NVTX). When using the `nvtx3/' prefix, this dependency is header-only. Signed-off-by: Nicholas Sielicki <[email protected]>
…ws#582) pull_request triggers remain unchanged/unfiltered. Signed-off-by: Nicholas Sielicki <[email protected]>
prefer dwarf debug symbols for usage with nsight, don't use -O0 as it prevents the proper usage of fortify-source, don't omit frame pointers. Signed-off-by: Nicholas Sielicki <[email protected]>
Signed-off-by: Nicholas Sielicki <[email protected]>
This commit fixes a bug introduced by commit 64d8baf "rdma: Move control messages to own endpoint". The domain member of the plugin RDMA device struct is not available for free disposal. It only stores an actual domain in case only one domain is created per device, i.e., when the domain-per-thread feature is disabled. In this case, the domain is copied over from the device object into the plugin RDMA endpoint object. Afterwards, only the domain member of the endpoint should be accessed. Reason is that the device object does not provide an domain in case domain-per-thread feature is disabled. However, said commit above accesses domain member of device object in any case. This commit fixes this bug by accessing the domain member of the endpoint instead. Signed-off-by: Michael Axtmann <[email protected]>
In preparation to adding v5 and v6 API for neuron, separate generic code in neuron's net.h file from v4 API specific code. Move v4 neuron API into separate file. This commit does not introduce functional changes. Signed-off-by: Michael Axtmann <[email protected]>
Add plugin neuron API version v5. The API provides two major changes. 1. The API provides accept and connect functions that must not block for the connection establishment, and instead should return successfully with communicator == NULL with the expectation that is will be called again until communicator != NULL. This is compliant with connection semantics introduced to NVIDIA APIs starting with NVIDIA API v5. 2. The API extends API v4 by rkey extraction function `get_mr_key` and RMA operations `iread()`, `iwrite()`, and `iwrite_inline()` to directly write into remote memory regions via one-sided communication. The RMA operations are only supported in case the `rma_supported` flag in the device property is set to `1`. Data of `iwrite_inline()` may or may not be inlined. Note, this commit does not introduce implementations of the four new API functions and sets `rma_supported` property to `0`. Signed-off-by: Michael Axtmann <[email protected]>
So far, the neuron V5 API interface does not provide implementations for the RMA operations. Instead, it set `rma_supported=0` when user queries device properties. This commit provides those implementations and reports support to user. Signed-off-by: Michael Axtmann <[email protected]>
In a previous commit, RMA iwrite, iwriteInline, and iread API functions have been implemented. This commit adds nvtx and lttng tracepoints to these functions. Signed-off-by: Michael Axtmann <[email protected]>
…ms" (aws#603) This reverts commit 756214e. Commit 756214e was previously merged so that RDMA protocol became the default protocol used on TRN1 instances. This commit was made as RDMA compatible firmware was deployed on all TRN1 instances. However, in order to be able to utilize the new RDMA compatible firmware, users are required to initiate a reboot on their instances. This comes with the side-effect that the plugin will fail to initialize on TRN1's that have no been rebooted. So, this commit will set the default protocol to SEND RECEIVE so that users can use this plugin out of the box without having to reboot their instance. Signed-off-by: Hunter North <[email protected]> Signed-off-by: Nicholas Sielicki <[email protected]>
Newer versions of libfabric are no longer defining this in "rdma/fabric.h", but we were relying on it transiently. Bring it in-tree to unblock compilation against libfabric master. Signed-off-by: Nicholas Sielicki <[email protected]>
AC_LANG_PROGRAM creates main for you, so putting `int main(void)` fails the check due to warnings, even when the header is available. Signed-off-by: Nicholas Sielicki <[email protected]>
be a bit more frugal and don't require pr-ci to run when validating changes to eg: jenkins ci. Signed-off-by: Nicholas Sielicki <[email protected]>
codebuild runners seems to get stuck for us regularly for whatever reason, just use docker and the al2023/al2 images that exist there. Other fixes: use the amzn2023 cuda repo, bump cuda versions that these builds operate against. Signed-off-by: Nicholas Sielicki <[email protected]>
LLVM is in the middle of a release migration and that's breaking package installs. Reverting to clang-18, which is the current "stable" release version. Signed-off-by: Raghu Raja <[email protected]>
One less third-party action dependency. Signed-off-by: Raghu Raja <[email protected]>
rajachan
force-pushed
the
ci-clang-ver
branch
from
September 19, 2024 19:46
420a42e
to
093376b
Compare
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.