Test #3

rajachan · 2024-09-19T19:42:29Z

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

Signed-off-by: Seth Zegelstein <[email protected]>

With switching to this plugin implementation, we want to take advantage of the RDMA protocol on trn1/trn1n platforms without setting environment variables. This commit sets RDMA protocol as default for trn1/trn1n platforms. Signed-off-by: Michael Axtmann <[email protected]>

The NIC grouping code for RDMA protocol groups NICs around accelerators. To find the accelerators, the grouping code uses hard-coded accelerator identifies. Supported hard-coded accelerators are for NVIDIA GPUs and trainium devices. In case of an unknown accelerator, the grouping code would fail and error out. This patch modifies the code such that each libfabric NIC is exposed as one device to the user of the plugin in case of absense of known accelerators around which a libfabric NIC can be grouped. Signed-off-by: Michael Axtmann <[email protected]>

Since a TRN accelerator is composed of multiple cores, the number of trainium accelerators does not necessarily reflect the number of NIC devices that the RDMA protocol should expose to the user. Instead, each core should have a NIC accessible for communication if that many NICs are available. The best approach, for now, is to remove trainium accelerators from the list of accelerators around which NICs are grouped. Consequently, each libfabric NIC is exposed as on NIC device to the user. This provides trainium maximal freedom in routing data over NICs. In the long run, a better solution might be to expose the number of actual cores to the plugin and take that number into account while NIC grouping. Signed-off-by: Michael Axtmann <[email protected]>

Unfortunately, this couldn't be tested without introducing the action to the repo so that the action appeared in the UI. It was implemented with a script issue. This fixes that script issue, and can be tested before merging now that the initial action was introduced to the repository.

A later revision of PR aws#438 introduced a NULL ptr bug in the `endpoint_per_comm = 1` code. Since this option isn't enabled by default, we weren't testing it. Signed-off-by: Eric Raut <[email protected]>

Enable tuner initialization message to be displayed on all NCCL_DEBUG=INFO and display the logs as part of the default subsystem (INIT) Signed-off-by: Arun Karthik <[email protected]>

Signed-off-by: Seth Zegelstein <[email protected]>

the canonical token is __cplusplus not _cplusplus. When this is incorrect, linkage fails as cxx will perform name mangling when including these headers, resulting in a function with that mangle not existing. stack-info: PR: aws#554, branch: aws-nslick/stack/1 Signed-off-by: Nicholas Sielicki <[email protected]>

The control messages are not going to be scheduled over multiple rails, so remove the unnecessary logic around it. Signed-off-by: Brian Barrett <[email protected]> Signed-off-by: Raghu Raja <[email protected]>

For the long message RDMA protocol, we want to make sure that we never starve the sender for data to move, which means prioritizing control messages from the receiver to the sender. This patch moves both the communicator setup and recv control messages to a new endpoint, which is always on device rail 0. Future patches will optimize polling of the control message cq in the send path and setting priority bits on the control cq. Signed-off-by: Brian Barrett <[email protected]> Signed-off-by: Raghu Raja <[email protected]>

If a match isn't found for the current send, poll the control cq to see if the match can be found. While this extends the current send() call, it potentially lowers the time until data transfer starts. Signed-off-by: Brian Barrett <[email protected]> Signed-off-by: Raghu Raja <[email protected]>

c++ requires literals not sit directly next to other literals and must be separated by at least one whitespace token. Signed-off-by: Nicholas Sielicki <[email protected]>

static_assert is a c++ keyword, so when not using c++, create a macro under the same name that aliases _Static_assert to static_assert. This means static_assert can be used in both cases. Signed-off-by: Nicholas Sielicki <[email protected]>

When using designated initializer lists, under C++ only, a warning will be emitted if the declaration order does not match the initialization order if any fields are missing in the initializer list. Reorder these structs to initialize all fields in their declaration order. Signed-off-by: Nicholas Sielicki <[email protected]>

void pointer arithmetic is illegal in c++; cast to uintptr, do the offset, then cast to void* to pass to libfabric. Signed-off-by: Nicholas Sielicki <[email protected]>

When nRanks==nNodes (SPLIT_MASK=0x7 in NCCL-tests) NCCL is telling the tuner that NVLS is not supported, because it is only doing inter node communication and NVLS doesn't works across nodes. So the region that chooses NVLSTree+Simple is ignored by the tuner, and it falls back to NCCL's tuner. The best choice there is Ring+Simple, so this is updating that region. Signed-off-by: Amedeo Sapio <[email protected]>

Tests added for new functionality as well. Signed-off-by: Eric Raut <[email protected]>

For upcoming deferred comm close work Signed-off-by: Eric Raut <[email protected]>

Changes to the plugin will require changing the endpoint ownership from the API to the protocol-specific send/recv communicators. As a necessary step toward this, make the send/recv communicators responsible for freeing their associated endpoints. Signed-off-by: Eric Raut <[email protected]>

Fail send/recv if called on inactive communicator (after close) Signed-off-by: Eric Raut <[email protected]>

Communicators will be added to the cleanup list when NCCL calls `closeSend`/`closeRecv`. Closing the last communicator in the process will be blocking. Future commits will add a message to be sent on communicator close, which will be waited on here before destroying communicator resources. Signed-off-by: Eric Raut <[email protected]>

The close message will be sent on communicator close in an upcoming commit. Signed-off-by: Eric Raut <[email protected]>

The receive communicator, on closing, will send a message indicating it is safe for the send communicator to close resources. The sender will not destroy communicator resources until it receives this message. This will be required for the eager protocol when the sender does not wait for a control message before marking the send complete (upcoming commit). It prevents sender from closing the communicator before the receiver sends a control message, leading to an error on the receiver. Signed-off-by: Eric Raut <[email protected]>

Updates the send_close message to include a count of the number of control messages sent by receiver. Sender will wait to receive all control messages before closing resources. This handles the unlikely case where completions are reordered such that the close message arrives before a control message. Signed-off-by: Eric Raut <[email protected]>

Sender will not wait to receive a control message from receiver for eager messages, and receiver will not send a control message if it has already received eager data. Both communicators will await any outstanding operations on communicator close. Signed-off-by: Eric Raut <[email protected]>

Do not attempt to link nvtx (as was previously required in older versions of NVTX). When using the `nvtx3/' prefix, this dependency is header-only. Signed-off-by: Nicholas Sielicki <[email protected]>

…ws#582) pull_request triggers remain unchanged/unfiltered. Signed-off-by: Nicholas Sielicki <[email protected]>

prefer dwarf debug symbols for usage with nsight, don't use -O0 as it prevents the proper usage of fortify-source, don't omit frame pointers. Signed-off-by: Nicholas Sielicki <[email protected]>

Signed-off-by: Nicholas Sielicki <[email protected]>

This commit fixes a bug introduced by commit 64d8baf "rdma: Move control messages to own endpoint". The domain member of the plugin RDMA device struct is not available for free disposal. It only stores an actual domain in case only one domain is created per device, i.e., when the domain-per-thread feature is disabled. In this case, the domain is copied over from the device object into the plugin RDMA endpoint object. Afterwards, only the domain member of the endpoint should be accessed. Reason is that the device object does not provide an domain in case domain-per-thread feature is disabled. However, said commit above accesses domain member of device object in any case. This commit fixes this bug by accessing the domain member of the endpoint instead. Signed-off-by: Michael Axtmann <[email protected]>

In preparation to adding v5 and v6 API for neuron, separate generic code in neuron's net.h file from v4 API specific code. Move v4 neuron API into separate file. This commit does not introduce functional changes. Signed-off-by: Michael Axtmann <[email protected]>

Add plugin neuron API version v5. The API provides two major changes. 1. The API provides accept and connect functions that must not block for the connection establishment, and instead should return successfully with communicator == NULL with the expectation that is will be called again until communicator != NULL. This is compliant with connection semantics introduced to NVIDIA APIs starting with NVIDIA API v5. 2. The API extends API v4 by rkey extraction function `get_mr_key` and RMA operations `iread()`, `iwrite()`, and `iwrite_inline()` to directly write into remote memory regions via one-sided communication. The RMA operations are only supported in case the `rma_supported` flag in the device property is set to `1`. Data of `iwrite_inline()` may or may not be inlined. Note, this commit does not introduce implementations of the four new API functions and sets `rma_supported` property to `0`. Signed-off-by: Michael Axtmann <[email protected]>

So far, the neuron V5 API interface does not provide implementations for the RMA operations. Instead, it set `rma_supported=0` when user queries device properties. This commit provides those implementations and reports support to user. Signed-off-by: Michael Axtmann <[email protected]>

In a previous commit, RMA iwrite, iwriteInline, and iread API functions have been implemented. This commit adds nvtx and lttng tracepoints to these functions. Signed-off-by: Michael Axtmann <[email protected]>

…ms" (aws#603) This reverts commit 756214e. Commit 756214e was previously merged so that RDMA protocol became the default protocol used on TRN1 instances. This commit was made as RDMA compatible firmware was deployed on all TRN1 instances. However, in order to be able to utilize the new RDMA compatible firmware, users are required to initiate a reboot on their instances. This comes with the side-effect that the plugin will fail to initialize on TRN1's that have no been rebooted. So, this commit will set the default protocol to SEND RECEIVE so that users can use this plugin out of the box without having to reboot their instance. Signed-off-by: Hunter North <[email protected]> Signed-off-by: Nicholas Sielicki <[email protected]>

Newer versions of libfabric are no longer defining this in "rdma/fabric.h", but we were relying on it transiently. Bring it in-tree to unblock compilation against libfabric master. Signed-off-by: Nicholas Sielicki <[email protected]>

AC_LANG_PROGRAM creates main for you, so putting `int main(void)` fails the check due to warnings, even when the header is available. Signed-off-by: Nicholas Sielicki <[email protected]>

be a bit more frugal and don't require pr-ci to run when validating changes to eg: jenkins ci. Signed-off-by: Nicholas Sielicki <[email protected]>

codebuild runners seems to get stuck for us regularly for whatever reason, just use docker and the al2023/al2 images that exist there. Other fixes: use the amzn2023 cuda repo, bump cuda versions that these builds operate against. Signed-off-by: Nicholas Sielicki <[email protected]>

LLVM is in the middle of a release migration and that's breaking package installs. Reverting to clang-18, which is the current "stable" release version. Signed-off-by: Raghu Raja <[email protected]>

One less third-party action dependency. Signed-off-by: Raghu Raja <[email protected]>

a-szegel and others added 30 commits August 23, 2024 12:18

.ci/aws: Make failures happen in correct stage

5fe5643

Signed-off-by: Seth Zegelstein <[email protected]>

ci: daily cache efa installer

825c598

fix(rdma): endpont_per_comm: NULL ptr bug

fb4ca3f

A later revision of PR aws#438 introduced a NULL ptr bug in the `endpoint_per_comm = 1` code. Since this option isn't enabled by default, we weren't testing it. Signed-off-by: Eric Raut <[email protected]>

tuner: Enable tuner init msg on INFO logs

7c03145

Enable tuner initialization message to be displayed on all NCCL_DEBUG=INFO and display the logs as part of the default subsystem (INIT) Signed-off-by: Arun Karthik <[email protected]>

.ci/aws: Decrease NCCL_TEST iterations to 5

cd94c57

Signed-off-by: Seth Zegelstein <[email protected]>

rdma: Remove scheduler for control messages

0e5db02

The control messages are not going to be scheduled over multiple rails, so remove the unnecessary logic around it. Signed-off-by: Brian Barrett <[email protected]> Signed-off-by: Raghu Raja <[email protected]>

fix(tree): add spaces around PRIu64 (aws#555)

b9bb318

c++ requires literals not sit directly next to other literals and must be separated by at least one whitespace token. Signed-off-by: Nicholas Sielicki <[email protected]>

fix(rdma): fi_{send,write}data: do arithmetic on uintptr (aws#558)

0b6460c

void pointer arithmetic is illegal in c++; cast to uintptr, do the offset, then cast to void* to pass to libfabric. Signed-off-by: Nicholas Sielicki <[email protected]>

deque: add remove, get_front, get_next, foreach

94297ce

Tests added for new functionality as well. Signed-off-by: Eric Raut <[email protected]>

rdma: split send_close and recv_close into separate destroy functions

ad210c5

For upcoming deferred comm close work Signed-off-by: Eric Raut <[email protected]>

rdma: add new communicator active state

163a6a4

Fail send/recv if called on inactive communicator (after close) Signed-off-by: Eric Raut <[email protected]>

rdma: add close_msg message and request types

1ae4528

The close message will be sent on communicator close in an upcoming commit. Signed-off-by: Eric Raut <[email protected]>

fix(tracing): use header-only nvtx3 (aws#590)

7504626

Do not attempt to link nvtx (as was previously required in older versions of NVTX). When using the `nvtx3/' prefix, this dependency is header-only. Signed-off-by: Nicholas Sielicki <[email protected]>

chore(.github/workflows): constrain push triggers to known branches (a…

eed251e

…ws#582) pull_request triggers remain unchanged/unfiltered. Signed-off-by: Nicholas Sielicki <[email protected]>

feat(build): better --enable-debug defaults (aws#596)

6bb7ada

prefer dwarf debug symbols for usage with nsight, don't use -O0 as it prevents the proper usage of fortify-source, don't omit frame pointers. Signed-off-by: Nicholas Sielicki <[email protected]>

aws-nslick and others added 13 commits September 14, 2024 15:35

fix(freelist): use uintptr_t for pointer arithmetic (aws#560)

4e64913

Signed-off-by: Nicholas Sielicki <[email protected]>

trace: Add tracepoints to RMA read and write operations

f1b91b7

In a previous commit, RMA iwrite, iwriteInline, and iread API functions have been implemented. This commit adds nvtx and lttng tracepoints to these functions. Signed-off-by: Michael Axtmann <[email protected]>

fix(valgrind): fix autotools mistake (aws#607)

653a4d7

AC_LANG_PROGRAM creates main for you, so putting `int main(void)` fails the check due to warnings, even when the header is available. Signed-off-by: Nicholas Sielicki <[email protected]>

feat(ci): skip builds on PRs w/o code changes

af36d7a

be a bit more frugal and don't require pr-ci to run when validating changes to eg: jenkins ci. Signed-off-by: Nicholas Sielicki <[email protected]>

ci: Downgrade to clang-18

0f70c1b

LLVM is in the middle of a release migration and that's breaking package installs. Reverting to clang-18, which is the current "stable" release version. Signed-off-by: Raghu Raja <[email protected]>

ci: Drop the use of cache-apt-pkgs-action

093376b

One less third-party action dependency. Signed-off-by: Raghu Raja <[email protected]>

rajachan force-pushed the ci-clang-ver branch from 420a42e to 093376b Compare September 19, 2024 19:46

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Test #3

Test #3

rajachan commented Sep 19, 2024

Test #3

Are you sure you want to change the base?

Test #3

Conversation

rajachan commented Sep 19, 2024