
Add persistent cluster tests for g4dn and p4d #10

Status: Open — wants to merge 33 commits into base: master
Conversation

sunkuamzn (Collaborator)

Issue #, if available:

Description of changes:

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

rauteric and others added 14 commits October 14, 2024 12:23
When a request completes with an error, a more detailed error message
has already been logged elsewhere. Remove this redundant additional error
message, which (for reasons we don't yet understand) can end up being
printed infinitely in case of an error.

fixes aws#611

Signed-off-by: Eric Raut <[email protected]>
This changes the flush() implementation so that a local
RDMA read is requested on all NIC rails in multi-rail cases.
This makes sure all data traffic paths are safely flushed
from all NICs to the GPU.

Signed-off-by: Taeil Um <[email protected]>
When the mr_cache is not initialized, do not use the MR cache in the MrReg path.

Signed-off-by: Se Wang Oh <[email protected]>
The mr_cache granularity is the system page size, but Neuron can use
buffers smaller than the system page size. Disable the feature for Neuron.

Signed-off-by: Se Wang Oh <[email protected]>
cr: https://code.amazon.com/reviews/CR-153483548
Currently the tuner is calibrated to work with p5.48xlarge.
This commit ensures that on other platforms we default to the
NCCL internal tuner.
Try to eliminate one more global variable that really doesn't need to
be global by moving it into the plugin structure.

Signed-off-by: Brian Barrett <[email protected]>
Change get_device_from_endpoint() to rdma_endpoint_get_device(),
following a more standard <object>_<verb> function naming
convention.

Signed-off-by: Brian Barrett <[email protected]>
Add accessor functions to retrieve the endpoint and device from a
request object, and update the code to use them.

Signed-off-by: Brian Barrett <[email protected]>
Similar to a previous commit on rdma, add an accessor function for
getting the device pointer from an endpoint.

Signed-off-by: Brian Barrett <[email protected]>
Rename get_comm() (and friends) to rdma_device_get_comm() (and friends)
and set_comm() to rdma_device_set_comm() to make it clearer which
objects the accessors operate on.

Signed-off-by: Brian Barrett <[email protected]>
Clean up the naming of the various rail accessor functions to follow an
object_verb convention.

Signed-off-by: Brian Barrett <[email protected]>
Follow a standard object_verb naming scheme and clarify that the
function is going to return an ofi domain, in preparation for
adding a plugin domain structure.

Signed-off-by: Brian Barrett <[email protected]>
Add a recv comm -> endpoint accessor.  It is only used in one place, but
this continues the effort to remove all the places where we assume
endpoints and devices have a 1:1 coupling.

Signed-off-by: Brian Barrett <[email protected]>
Eliminating the waits for eager messages can cause the sender and
receiver to go out-of-sync during a long run of eager messages. For
example, at an extreme, the sender could send 1024 eager messages, which
are queued by Libfabric and marked complete. Then if the sender received
a control message for message 0 from the receiver, it couldn't be
unambiguously associated with the first message or the 1024th message,
which both have the same sequence number, since the sequence number is
only 10 bits.

A patch intended to keep them in sync (2479540) also did not resolve the
issue. Without `FI_ORDER_SAS` semantics (which the plugin does not
request), the completions can be reordered, and the receipt of a sync
ctrl message does not guarantee that previous ctrl messages have
completed, which can lead to the same problem.

A proper design fix is still in progress; this PR re-enables the eager
waits in the interim.

The supporting refactoring commits of
aws#553 are retained so we
can re-enable this feature later. Only the last commit is reverted. The
sync patch is also reverted.

This reverts commits 2479540 and
7a0fa83.

Signed-off-by: Eric Raut <[email protected]>
@sunkuamzn sunkuamzn requested a review from a-szegel as a code owner October 17, 2024 20:55
@sunkuamzn sunkuamzn force-pushed the ci-testing branch 3 times, most recently from cabf9f3 to 8aef9a3 Compare October 18, 2024 01:55
@sunkuamzn (Collaborator, Author)

bot:aws:retest

bwbarrett and others added 7 commits October 21, 2024 10:29
Move the MR rkey ID pool from the rdma and sendrecv device structures
to the base device structure.  This code is effectively duplicated
between the two transports, so there's no real reason not to move
it.

Signed-off-by: Brian Barrett <[email protected]>
Rather than having code poke at the internal id field to figure out
if an idpool is active, add an active() accessor function to add
at least a tiny bit of abstraction.

Signed-off-by: Brian Barrett <[email protected]>
(aws#669)

* param: Change default min_stripe_size to be 128K instead of 64K

With a 64K min_stripe_size, experimental results with 16 nodes showed that
the rails are not saturated enough before we start striping the message.

This commit changes the default min_stripe_size, after which we start to
multiplex the message, from 64K to 128K, as we observe better saturation
of the rails at 128K.

Signed-off-by: Arun Karthik <[email protected]>

* scheduler: Change the way the num_of_stripes is calculated

Currently num_stripes is calculated with the formula below:
num_stripes = MIN(max_nr_rails, CEIL(payload_size / min_stripe_size))
This can result in the number of stripes being 3 on a 4-rail system.

This commit adjusts the num_stripes to be the largest factor of
the num_rails that is less than or equal to the calculated number of stripes.

Signed-off-by: Arun Karthik <[email protected]>

---------

Signed-off-by: Arun Karthik <[email protected]>
Prefix all symbols in the sendrecv interface with sendrecv_ to prevent
naming conflicts with similar functions in the rdma protocol.  We should
update the rdma protocol as well, but this at least gets us out of
violating C++'s ODR rules.

Signed-off-by: Brian Barrett <[email protected]>
The param implementation stamped out accessor functions in each
compilation unit, which created obvious problems with C++'s naming
rules.  We know we want to rewrite this in the future, so for now
just stamp out one implementation in nccl_ofi_param.c that everyone
can call, and do some ugly macro hacking to keep the required changes
minimal.

Signed-off-by: Brian Barrett <[email protected]>
@sunkuamzn (Collaborator, Author)

bot:aws:retest

@sunkuamzn (Collaborator, Author)

bot:aws:retest

The combination of rail reordering when creating a device that uses
multiple rails and topology generation created a bug that could lead
to performance issues or crashes in NCCL.  The autogenerated topology
file only included the "leader NICs", which were the first provider
in the info list.  This reordering logic happened *after* topology
generation, but before getProperties(), which returns the BDF of the
first info (i.e., the leader NIC).  If NICs got reordered, the leader NIC
could be inconsistent between those two reporting points, which
irritates NCCL.  This has not been a problem historically, because
we either used a static topology file that included all NICs (plugin
1.10 or earlier) *or* we didn't reorder.  It's only recently that EFA
started providing enough information to reorder and trip the above
bug.

The fix is to move the reordering logic into topology generation, where
it arguably should have been from the start.  Now there's no possibility
of inconsistent views, because the single source of truth is always
used.

This patch leaves the group count unhandled by the AWS implementation
of the sorter.  This is not a problem on existing platforms, and will
be fixed in a follow-on commit.

Signed-off-by: Brian Barrett <[email protected]>
bwbarrett and others added 6 commits October 25, 2024 17:49
The AWS-specific code to sort devices based on resource sharing on
P5/P5e assumed that there were only 2 devices that shared the
resources.  That's not necessarily true, so rewrite the code
to be a little more generic, as well as have better abort
mechanisms.

Signed-off-by: Brian Barrett <[email protected]>
Use sscanf() instead of manual parsing to get the lower 8 bits
of the node_guid, which determine the vf id of the EFA device.  In
addition to being simpler, this also fixes a small bug where we
were parsing the value in decimal instead of hex (although since the
only two current values are "00" and "01", it didn't actually matter).

Signed-off-by: Brian Barrett <[email protected]>
PR aws#543 has moved the control message to its own dedicated
endpoint on a single rail and with its own CQ. To avoid the cost of polling an
additional CQ, this commit makes the control endpoint share the CQ with
rail 0.

Signed-off-by: Amedeo Sapio <[email protected]>
PR aws#543 has moved the control message to its own dedicated
endpoint on a single rail. As a result, the control message is not sent on
all rails in a round-robin fashion anymore. This has impacted performance
in some cases, so we want the option to still distribute the control messages
across all rails.
This commit provides an environment variable to choose between a) using a
single dedicated endpoint for control messages, sharing the CQ with the data
endpoint on rail 0; or b) using as many additional endpoints for the control
messages as there are rails, sharing the CQs with the data endpoints, and
using the scheduler to round-robin the control messages across all rails.

Signed-off-by: Amedeo Sapio <[email protected]>
@sunkuamzn (Collaborator, Author)

bot:aws:retest

(the same retest comment was posted 7 times)
8 participants