Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Test #3

Open
wants to merge 43 commits into
base: master
Choose a base branch
from
Open

Test #3

wants to merge 43 commits into from

Conversation

rajachan
Copy link
Owner

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

a-szegel and others added 30 commits August 23, 2024 12:18
With switching to this plugin implementation, we want to take
advantage of the RDMA protocol on trn1/trn1n platforms without setting
environment variables.

This commit sets RDMA protocol as default for trn1/trn1n platforms.

Signed-off-by: Michael Axtmann <[email protected]>
The NIC grouping code for RDMA protocol groups NICs around
accelerators. To find the accelerators, the grouping code uses
hard-coded accelerator identifies. Supported hard-coded accelerators
are for NVIDIA GPUs and trainium devices.
In case of an unknown accelerator, the grouping code would fail and
error out.
This patch modifies the code such that each libfabric NIC is exposed
as one device to the user of the plugin in case of absense of known
accelerators around which a libfabric NIC can be grouped.

Signed-off-by: Michael Axtmann <[email protected]>
Since a TRN accelerator is composed of multiple cores, the number of
trainium accelerators does not necessarily reflect the number of NIC
devices that the RDMA protocol should expose to the user. Instead,
each core should have a NIC accessible for communication if that many
NICs are available.
The best approach, for now, is to remove trainium accelerators from
the list of accelerators around which NICs are grouped. Consequently,
each libfabric NIC is exposed as on NIC device to the user. This
provides trainium maximal freedom in routing data over NICs.

In the long run, a better solution might be to expose the number of
actual cores to the plugin and take that number into account while NIC
grouping.

Signed-off-by: Michael Axtmann <[email protected]>
Unfortunately, this couldn't be tested without introducing the action to the
repo so that the action appeared in the UI.  It was implemented with a script
issue. This fixes that script issue, and can be tested before merging now that
the initial action was introduced to the repository.
A later revision of PR aws#438 introduced a NULL ptr bug in the
`endpoint_per_comm = 1` code. Since this option isn't enabled by
default, we weren't testing it.

Signed-off-by: Eric Raut <[email protected]>
Enable tuner initialization message to be displayed on
all NCCL_DEBUG=INFO and display the logs as part of the default
subsystem (INIT)

Signed-off-by: Arun Karthik <[email protected]>
the canonical token is __cplusplus not _cplusplus. When this is
incorrect, linkage fails as cxx will perform name mangling when
including these headers, resulting in a function with that mangle not
existing.

stack-info: PR: aws#554, branch: aws-nslick/stack/1
Signed-off-by: Nicholas Sielicki <[email protected]>
The control messages are not going to be scheduled over multiple rails,
so remove the unnecessary logic around it.

Signed-off-by: Brian Barrett <[email protected]>
Signed-off-by: Raghu Raja <[email protected]>
For the long message RDMA protocol, we want to make sure that we
never starve the sender for data to move, which means prioritizing
control messages from the receiver to the sender.  This patch moves
both the communicator setup and recv control messages to a new
endpoint, which is always on device rail 0.  Future patches will
optimize polling of the control message cq in the send path
and setting priority bits on the control cq.

Signed-off-by: Brian Barrett <[email protected]>
Signed-off-by: Raghu Raja <[email protected]>
If a match isn't found for the current send, poll the control cq
to see if the match can be found.  While this extends the current
send() call, it potentially lowers the time until data transfer
starts.

Signed-off-by: Brian Barrett <[email protected]>
Signed-off-by: Raghu Raja <[email protected]>
c++ requires literals not sit directly next to other literals and must
be separated by at least one whitespace token.
Signed-off-by: Nicholas Sielicki <[email protected]>
static_assert is a c++ keyword, so when not using c++, create a macro
under the same name that aliases _Static_assert to static_assert. This
means static_assert can be used in both cases.
Signed-off-by: Nicholas Sielicki <[email protected]>
When using designated initializer lists, under C++ only, a warning will
be emitted if the declaration order does not match the initialization
order if any fields are missing in the initializer list. Reorder these
structs to initialize all fields in their declaration order.
Signed-off-by: Nicholas Sielicki <[email protected]>
void pointer arithmetic is illegal in c++; cast to uintptr, do the
offset, then cast to void* to pass to libfabric.

Signed-off-by: Nicholas Sielicki <[email protected]>
When nRanks==nNodes (SPLIT_MASK=0x7 in NCCL-tests) NCCL is telling the tuner
that NVLS is not supported, because it is only doing inter node communication
and NVLS doesn't works across nodes.
So the region that chooses NVLSTree+Simple is ignored by the tuner, and it
falls back to NCCL's tuner. The best choice there is Ring+Simple, so this is
updating that region.

Signed-off-by: Amedeo Sapio <[email protected]>
Tests added for new functionality as well.

Signed-off-by: Eric Raut <[email protected]>
For upcoming deferred comm close work

Signed-off-by: Eric Raut <[email protected]>
Changes to the plugin will require changing the endpoint ownership from
the API to the protocol-specific send/recv communicators. As a necessary
step toward this, make the send/recv communicators responsible for
freeing their associated endpoints.

Signed-off-by: Eric Raut <[email protected]>
Fail send/recv if called on inactive communicator (after close)

Signed-off-by: Eric Raut <[email protected]>
Communicators will be added to the cleanup list when NCCL calls
`closeSend`/`closeRecv`. Closing the last communicator in the process
will be blocking.

Future commits will add a message to be sent on communicator close,
which will be waited on here before destroying communicator resources.

Signed-off-by: Eric Raut <[email protected]>
The close message will be sent on communicator close in an upcoming
commit.

Signed-off-by: Eric Raut <[email protected]>
The receive communicator, on closing, will send a message indicating it
is safe for the send communicator to close resources. The sender will
not destroy communicator resources until it receives this message.

This will be required for the eager protocol when the sender does not
wait for a control message before marking the send complete (upcoming
commit). It prevents sender from closing the communicator before the
receiver sends a control message, leading to an error on the receiver.

Signed-off-by: Eric Raut <[email protected]>
Updates the send_close message to include a count of the number of
control messages sent by receiver. Sender will wait to receive all
control messages before closing resources.

This handles the unlikely case where completions are reordered such that
the close message arrives before a control message.

Signed-off-by: Eric Raut <[email protected]>
Sender will not wait to receive a control message from receiver for
eager messages, and receiver will not send a control message if it has
already received eager data.

Both communicators will await any outstanding operations on communicator
close.

Signed-off-by: Eric Raut <[email protected]>
Do not attempt to link nvtx (as was previously required
in older versions of NVTX). When using the `nvtx3/' prefix,
this dependency is header-only.

Signed-off-by: Nicholas Sielicki <[email protected]>
…ws#582)

pull_request triggers remain unchanged/unfiltered.

Signed-off-by: Nicholas Sielicki <[email protected]>
prefer dwarf debug symbols for usage with nsight, don't use -O0 as it
prevents the proper usage of fortify-source, don't omit frame pointers.

Signed-off-by: Nicholas Sielicki <[email protected]>
aws-nslick and others added 13 commits September 14, 2024 15:35
This commit fixes a bug introduced by commit 64d8baf "rdma: Move
control messages to own endpoint".

The domain member of the plugin RDMA device struct is not available
for free disposal. It only stores an actual domain in case only one
domain is created per device, i.e., when the domain-per-thread feature
is disabled. In this case, the domain is copied over from the device
object into the plugin RDMA endpoint object. Afterwards, only the
domain member of the endpoint should be accessed. Reason is that the
device object does not provide an domain in case domain-per-thread
feature is disabled.
However, said commit above accesses domain member of device object in
any case. This commit fixes this bug by accessing the domain member of
the endpoint instead.

Signed-off-by: Michael Axtmann <[email protected]>
In preparation to adding v5 and v6 API for neuron, separate generic
code in neuron's net.h file from v4 API specific code. Move v4 neuron
API into separate file.

This commit does not introduce functional changes.

Signed-off-by: Michael Axtmann <[email protected]>
Add plugin neuron API version v5. The API provides two major changes.

1. The API provides accept and connect functions that must not block
for the connection establishment, and instead should return
successfully with communicator == NULL with the expectation that is
will be called again until communicator != NULL.  This is compliant
with connection semantics introduced to NVIDIA APIs starting with
NVIDIA API v5.

2. The API extends API v4 by rkey extraction function `get_mr_key` and
RMA operations `iread()`, `iwrite()`, and `iwrite_inline()` to directly
write into remote memory regions via one-sided communication.
The RMA operations are only supported in case the `rma_supported` flag
in the device property is set to `1`. Data of `iwrite_inline()` may or
may not be inlined.
Note, this commit does not introduce implementations of the four new
API functions and sets `rma_supported` property to `0`.

Signed-off-by: Michael Axtmann <[email protected]>
So far, the neuron V5 API interface does not provide implementations
for the RMA operations. Instead, it set `rma_supported=0` when user
queries device properties.
This commit provides those implementations and reports support to
user.

Signed-off-by: Michael Axtmann <[email protected]>
In a previous commit, RMA iwrite, iwriteInline, and iread API
functions have been implemented. This commit adds nvtx and lttng
tracepoints to these functions.

Signed-off-by: Michael Axtmann <[email protected]>
…ms" (aws#603)

This reverts commit 756214e.

Commit 756214e was previously merged so that RDMA protocol became the
default protocol used on TRN1 instances. This commit was made as RDMA
compatible firmware was deployed on all TRN1 instances. However, in
order to be able to utilize the new RDMA compatible firmware, users are
required to initiate a reboot on their instances. This comes with the
side-effect that the plugin will fail to initialize on TRN1's that have
no been rebooted. So, this commit will set the default protocol to SEND
RECEIVE so that users can use this plugin out of the box without having
to reboot their instance.

Signed-off-by: Hunter North <[email protected]>
Signed-off-by: Nicholas Sielicki <[email protected]>
Newer versions of libfabric are no longer defining this in
"rdma/fabric.h", but we were relying on it transiently. Bring it in-tree
to unblock compilation against libfabric master.

Signed-off-by: Nicholas Sielicki <[email protected]>
AC_LANG_PROGRAM creates main for you, so putting `int main(void)` fails
the check due to warnings, even when the header is available.

Signed-off-by: Nicholas Sielicki <[email protected]>
be a bit more frugal and don't require pr-ci to run when validating
changes to eg: jenkins ci.

Signed-off-by: Nicholas Sielicki <[email protected]>
codebuild runners seems to get stuck for us regularly for whatever
reason, just use docker and the al2023/al2 images that exist there.
Other fixes: use the amzn2023 cuda repo, bump cuda versions that these
builds operate against.

Signed-off-by: Nicholas Sielicki <[email protected]>
LLVM is in the middle of a release migration and that's breaking package
installs. Reverting to clang-18, which is the current "stable" release
version.

Signed-off-by: Raghu Raja <[email protected]>
One less third-party action dependency.

Signed-off-by: Raghu Raja <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants