
Add persistent cluster tests for g4dn and p4d #10

Status: Open — wants to merge 33 commits into base: master
Conversation

sunkuamzn (Collaborator)

Issue #, if available:

Description of changes:

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

rauteric and others added 14 commits October 14, 2024 12:23
When a request completes with an error, a more detailed error message
has already been logged elsewhere. Remove this redundant additional error
message, which (for reasons we don't yet understand) can end up being
printed infinitely in case of an error.

fixes aws#611

Signed-off-by: Eric Raut <[email protected]>
This changes the flush() implementation so that a local
RDMA read is requested on all NIC rails in multi-rail cases.
This makes sure all data traffic paths are safely flushed
from all NICs to the GPU.

Signed-off-by: Taeil Um <[email protected]>
When the mr_cache is not initialized, do not use the MR cache in the MrReg path.

Signed-off-by: Se Wang Oh <[email protected]>
The mr_cache granularity is the system page size, but Neuron can use
buffers smaller than the system page size. Disable the feature for Neuron.

Signed-off-by: Se Wang Oh <[email protected]>
cr: https://code.amazon.com/reviews/CR-153483548
Currently the tuner is calibrated to work with p5.48xlarge.
This commit ensures that on other platforms we default to the
NCCL internal tuner.
Try to eliminate one more global variable that really doesn't need to
be global by moving it into the plugin structure.

Signed-off-by: Brian Barrett <[email protected]>
Change get_device_from_endpoint() to rdma_endpoint_get_device(),
following a more standard <object>_<verb> function naming
convention.

Signed-off-by: Brian Barrett <[email protected]>
Add accessor functions to retrieve the endpoint and device from a
request object, and update the code to use them.

Signed-off-by: Brian Barrett <[email protected]>
Similar to a previous commit on rdma, add an accessor function for
getting the device pointer from an endpoint.

Signed-off-by: Brian Barrett <[email protected]>
Rename get_comm() (and friends) to rdma_device_get_comm() (and friends)
and set_comm() to rdma_device_set_comm() to make it clearer which
objects the accessors operate on.

Signed-off-by: Brian Barrett <[email protected]>
Clean up the naming of the various rail accessor functions to follow an
object_verb convention.

Signed-off-by: Brian Barrett <[email protected]>
Follow a standard object_verb naming scheme and clarify that the
function is going to return an ofi domain, in preparation for
adding a plugin domain structure.

Signed-off-by: Brian Barrett <[email protected]>
Add a recv comm -> endpoint accessor.  It is only used in one place, but
this continues the effort to remove all the places where we assume
endpoints and devices have a 1:1 coupling.

Signed-off-by: Brian Barrett <[email protected]>
Eliminating the waits for eager messages can cause the sender and
receiver to go out-of-sync during a long run of eager messages. For
example, at an extreme, the sender could send 1024 eager messages, which
are queued by Libfabric and marked complete. Then if the sender received
a control message for message 0 from the receiver, it couldn't be
unambiguously associated with the first message or the 1024th message,
which both have the same sequence number, since the sequence number is
only 10 bits.

A patch intended to keep them in sync (2479540) also did not resolve the
issue. Without `FI_ORDER_SAS` semantics (which the plugin does not
request), the completions can be reordered, and the receipt of a sync
ctrl message does not guarantee that previous ctrl messages have
completed, which can lead to the same problem.

A proper design fix is still in progress; this PR re-enables the eager
waits in the interim.

The supporting refactoring commits of
aws#553 are retained so we
can re-enable this feature later. Only the last commit is reverted. The
sync patch is also reverted.

This reverts commits 2479540 and
7a0fa83.

Signed-off-by: Eric Raut <[email protected]>
@sunkuamzn sunkuamzn requested a review from a-szegel as a code owner October 17, 2024 20:55
@sunkuamzn sunkuamzn force-pushed the ci-testing branch 3 times, most recently from cabf9f3 to 8aef9a3 Compare October 18, 2024 01:55
@sunkuamzn (Collaborator, Author)

bot:aws:retest

bwbarrett and others added 7 commits October 21, 2024 10:29
Move the MR rkey ID pool from the rdma and sendrecv device structures
to the base device structure.  This code is effectively duplicated
between the two transports, so there's no real reason not to move
it.

Signed-off-by: Brian Barrett <[email protected]>
Rather than having code poke at the internal id field to figure out
if an idpool is active, add an active() accessor function to add
at least a tiny bit of abstraction.

Signed-off-by: Brian Barrett <[email protected]>
(aws#669)

* param: Change default min_stripe_size to be 128K instead of 64K

With a 64K min_stripe_size, experimental results with 16 nodes showed that
the rails are not saturated enough before we start striping the message.

This commit changes the default min_stripe_size, after which we start to
multiplex the message, from 64K to 128K, as we observe better saturation
of the rails at 128K.

Signed-off-by: Arun Karthik <[email protected]>

* scheduler: Change the way the num_of_stripes is calculated

Currently num_stripes is calculated with the formula below:
num_stripes = MIN(max_nr_rails, CEIL(payload_size / min_stripe_size))
This can result in the number of stripes being 3 on a 4-rail system.

This commit adjusts the num_stripes to be the largest factor of
the num_rails that is less than or equal to the calculated number of stripes.

Signed-off-by: Arun Karthik <[email protected]>

---------

Signed-off-by: Arun Karthik <[email protected]>
Prefix all symbols in the sendrecv interface with sendrecv_ to prevent
naming conflicts with similar functions in the rdma protocol.  We should
update the rdma protocol as well, but this at least gets us out of
violating C++'s ODR rules.

Signed-off-by: Brian Barrett <[email protected]>
The param implementation stamped out accessor functions in each
compilation unit, which created obvious problems with C++'s naming
rules.  We know we want to rewrite this in the future, so for now
just stamp out one implementation in nccl_ofi_param.c that everyone
can call, and do some ugly macro hacking to keep the required changes
minimal.

Signed-off-by: Brian Barrett <[email protected]>
@sunkuamzn (Collaborator, Author)

bot:aws:retest

@sunkuamzn (Collaborator, Author)

bot:aws:retest

The combination of rail reordering when creating a device that uses
multiple rails and topology generation created a bug that could lead
to performance issues or crashes in NCCL.  The autogenerated topology
file only included the "leader NICs", which were the first provider
in the info list.  This reordering logic happened *after* topology
generation, but before getProperties(), which returns the BDF of the
first info (i.e., the leader NIC).  If NICs got reordered, the leader NIC
could be inconsistent between those two reporting points, which
irritates NCCL.  This has not been a problem historically, because
we either used a static topology file that included all NICs (plugin
1.10 or earlier) *or* we didn't reorder.  It's only recently that EFA
started providing enough information to reorder and trip the above
bug.

The fix is to move the reordering logic into topology generation, where
it arguably should have been from the start.  Now there's no possibility
of inconsistent views, because the single source of truth is always
used.

This patch leaves the group count unhandled by the AWS implementation
of the sorter.  This is not a problem on existing platforms, and will
be fixed in a follow-on commit.

Signed-off-by: Brian Barrett <[email protected]>
bwbarrett and others added 6 commits October 25, 2024 17:49
The AWS-specific code to sort devices based on resource sharing on
P5/P5e assumed that there were only 2 devices that shared the
resources.  That's not necessarily true, so rewrite the code
to be a little more generic, as well as have better abort
mechanisms.

Signed-off-by: Brian Barrett <[email protected]>
Use sscanf() instead of manual parsing to get the lower 8 bits
of the node_guid, which determine the vf id of the EFA device.  In
addition to being simpler, this also fixes a small bug where we
were parsing the value in decimal instead of hex (although since the
only two current values are "00" and "01", it didn't actually matter).

Signed-off-by: Brian Barrett <[email protected]>
PR aws#543 has moved the control message to its own dedicated
endpoint on a single rail and with its own CQ. To avoid the cost of polling an
additional CQ, this commit makes the control endpoint share the CQ with
rail 0.

Signed-off-by: Amedeo Sapio <[email protected]>
PR aws#543 has moved the control message to its own dedicated
endpoint on a single rail. As a result, the control message is not sent on
all rails in a round-robin fashion anymore. This has impacted performance
in some cases, so we want the option to still distribute the control messages
across all rails.
This commit provides an environment variable to choose between a) using a
single dedicated endpoint for control messages, sharing the CQ with the data
endpoint on rail 0; or b) using as many additional endpoints for the control
messages as there are rails, sharing the CQs with the data endpoints, and
using the scheduler to round-robin the control messages across all rails.

Signed-off-by: Amedeo Sapio <[email protected]>
@sunkuamzn (Collaborator, Author)

bot:aws:retest

(the same retest comment was posted 7 times)
8 participants