Add persistent cluster tests for g4dn and p4d #10
Open
sunkuamzn wants to merge 33 commits into aws-ofiwg-bot:master from sunkuamzn:ci-testing
Conversation
In case a request completes with an error, a more detailed error message has already been logged elsewhere. Remove this useless additional error message, which (for reasons we don't understand) can end up being printed infinitely in case of an error. Fixes aws#611. Signed-off-by: Eric Raut <[email protected]>
This changes the flush() implementation so that a local RDMA read is requested on all NIC rails in multi-rail cases. This makes sure all data traffic paths are safely flushed from every NIC to the GPU. Signed-off-by: Taeil Um <[email protected]>
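A minimal sketch of the idea, assuming hypothetical rail and buffer field names rather than the plugin's actual structures: issue a small local fi_read() on every rail so that in-flight writes on each rail are ordered before the read completion is observed.

```c
#include <rdma/fabric.h>
#include <rdma/fi_rma.h>

/* Hypothetical per-rail state; the real plugin keeps this elsewhere. */
struct flush_rail {
	struct fid_ep *ep;      /* data endpoint for this rail */
	fi_addr_t local_addr;   /* address-vector entry pointing at ourselves */
	void *host_buf;         /* small host bounce buffer (read target) */
	void *host_desc;        /* local MR descriptor for host_buf */
	uint64_t gpu_addr;      /* registered GPU buffer address (read source) */
	uint64_t gpu_key;       /* rkey of the GPU buffer registration */
};

/* Post one local RDMA read per rail; the caller waits for all completions. */
static int flush_all_rails(struct flush_rail *rails, int num_rails, void *ctx)
{
	for (int i = 0; i < num_rails; i++) {
		ssize_t rc = fi_read(rails[i].ep, rails[i].host_buf, 1,
				     rails[i].host_desc, rails[i].local_addr,
				     rails[i].gpu_addr, rails[i].gpu_key, ctx);
		if (rc != 0)
			return (int)rc; /* e.g. -FI_EAGAIN: caller retries */
	}
	return 0;
}
```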
When mr_cache is not initialized, do not use the MR cache in the MrReg path. Signed-off-by: Se Wang Oh <[email protected]>
The mr_cache granularity is the system page size, but Neuron can use buffers smaller than the system page size. Disable the feature for Neuron. Signed-off-by: Se Wang Oh <[email protected]> cr: https://code.amazon.com/reviews/CR-153483548
Currently the tuner is calibrated to work with p5.48xlarge. This commit ensures that on other platforms we default to the NCCL internal tuner.
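A sketch of the platform gate, with a hypothetical helper name; the plugin's real check may differ:

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper: the tuner is calibrated for p5.48xlarge only, so any
 * other platform string means NCCL's internal tuner should be used instead. */
static bool tuner_is_supported_platform(const char *platform_type)
{
	return platform_type != NULL &&
	       strcmp(platform_type, "p5.48xlarge") == 0;
}
```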
Try to eliminate one more global variable that really doesn't need to be global by moving it into the plugin structure. Signed-off-by: Brian Barrett <[email protected]>
Change get_device_from_endpoint() to rdma_endpoint_get_device(), following a more standard <object>_<verb> function naming convention. Signed-off-by: Brian Barrett <[email protected]>
Add accessor functions to retrieve the endpoint and device from a request object, and update the code to use them. Signed-off-by: Brian Barrett <[email protected]>
Similar to a previous commit on rdma, add an accessor function for getting the device pointer from an endpoint. Signed-off-by: Brian Barrett <[email protected]>
Rename get_comm() (and friends) to rdma_device_get_comm() (and friends) and set_comm() to rdma_device_set_comm() to try and make it a bit more clear on which objects the accessors are operating. Signed-off-by: Brian Barrett <[email protected]>
Clean up the naming of the various rail accessor functions to follow an object_verb convention. Signed-off-by: Brian Barrett <[email protected]>
Follow a standard object_verb naming scheme and clarify that the function is going to return an ofi domain, in preparation for adding a plugin domain structure. Signed-off-by: Brian Barrett <[email protected]>
Add a recv comm -> endpoint accessor. Only used in one place, but trying to remove all the places that we assume endpoints / devices have a 1:1 coupling. Signed-off-by: Brian Barrett <[email protected]>
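An illustrative example of the <object>_<verb> accessor style this series moves toward; the struct layouts below are hypothetical, only the accessor name comes from the commits above:

```c
/* Hypothetical struct layouts for illustration only. */
struct nccl_net_ofi_rdma_device;

struct nccl_net_ofi_rdma_ep {
	struct nccl_net_ofi_rdma_device *device;
};

/* old: get_device_from_endpoint(ep)  ->  new: rdma_endpoint_get_device(ep) */
static inline struct nccl_net_ofi_rdma_device *
rdma_endpoint_get_device(struct nccl_net_ofi_rdma_ep *ep)
{
	return ep->device;
}
```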
Eliminating the waits for eager messages can cause the sender and receiver to go out of sync during a long run of eager messages. For example, at an extreme, the sender could send 1024 eager messages, which are queued by Libfabric and marked complete. Then if the sender received a control message for message 0 from the receiver, it couldn't be unambiguously associated with the first message or the 1024th message, which both have the same sequence number, since the sequence number is only 10 bits. A patch intended to keep them in sync (2479540) also did not resolve the issue. Without `FI_ORDER_SAS` semantics (which the plugin does not request), the completions can be reordered, and the receipt of a sync ctrl message does not guarantee that previous ctrl messages have completed, which can lead to the same problem. A proper design fix is still in progress; this PR re-enables the eager waits in the interim. The supporting refactoring commits of aws#553 are retained so we can re-enable this feature later. Only the last commit is reverted. The sync patch is also reverted. This reverts commits 2479540 and 7a0fa83. Signed-off-by: Eric Raut <[email protected]>
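A toy illustration (not plugin code) of the 10-bit sequence ambiguity described above:

```c
#include <stdint.h>
#include <stdio.h>

int main(void)
{
	const uint16_t SEQ_MASK = (1u << 10) - 1;       /* 10-bit sequence space */
	uint16_t seq_first   = (uint16_t)(0 & SEQ_MASK);
	uint16_t seq_wrapped = (uint16_t)(1024 & SEQ_MASK);

	/* Both print 0: a control message carrying sequence number 0 cannot
	 * be matched unambiguously to one of the two sends. */
	printf("msg 0 -> seq %u, msg 1024 -> seq %u\n", seq_first, seq_wrapped);
	return 0;
}
```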
sunkuamzn force-pushed the ci-testing branch 3 times, most recently from cabf9f3 to 8aef9a3 on October 18, 2024 01:55
bot:aws:retest
Move the MR rkey ID pool from the rdma and sendrecv device structures to the base device structure. This code is effectively duplicated between the two transports, so there's no real reason not to move it. Signed-off-by: Brian Barrett <[email protected]>
Rather than having code poke at the internal id field to figure out if an idpool is active, add an active() accessor function to add at least a tiny bit of abstraction. Signed-off-by: Brian Barrett <[email protected]>
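A minimal sketch of the accessor, with a hypothetical idpool layout:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout for illustration only. */
struct nccl_ofi_idpool {
	size_t size;     /* number of IDs managed; 0 when the pool is unused */
	uint64_t *ids;   /* bitmap of free IDs; NULL when the pool is unused */
};

/* Callers test this instead of poking at the internal field directly. */
static inline bool nccl_ofi_idpool_active(const struct nccl_ofi_idpool *pool)
{
	return pool->ids != NULL;
}
```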
…ws#669)
* param: Change default min_stripe_size to be 128K instead of 64K
With 64K min_stripe_size, experimental results with 16 nodes showed that the rails are not saturated enough before we start striping the message. This commit ensures that the default min_stripe_size, after which we start to multiplex the message, is set to 128K instead of 64K, as we observe better saturation of the rails at 128K.
Signed-off-by: Arun Karthik <[email protected]>
* scheduler: Change the way num_stripes is calculated
Currently num_stripes is calculated with the formula: num_stripes = MIN(max_nr_rails, CEIL(payload_size / min_stripe_size)). This can result in the number of stripes being 3 on a 4-rail system. This commit adjusts num_stripes to be the largest factor of num_rails that is less than or equal to the calculated number of stripes.
Signed-off-by: Arun Karthik <[email protected]>
---------
Signed-off-by: Arun Karthik <[email protected]>
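A sketch of the adjusted calculation, using the formula quoted in the commit message; the function name and signature are illustrative, not the scheduler's actual API:

```c
#include <stddef.h>

static size_t calc_num_stripes(size_t payload_size, size_t min_stripe_size,
			       size_t num_rails)
{
	/* num_stripes = MIN(num_rails, CEIL(payload_size / min_stripe_size)) */
	size_t stripes = (payload_size + min_stripe_size - 1) / min_stripe_size;
	if (stripes > num_rails)
		stripes = num_rails;
	if (stripes == 0)
		stripes = 1;

	/* Round down to the largest factor of num_rails that is <= stripes,
	 * e.g. 3 stripes on a 4-rail system becomes 2. */
	while (stripes > 1 && num_rails % stripes != 0)
		stripes--;

	return stripes;
}
```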
Signed-off-by: Raghu Raja <[email protected]>
Signed-off-by: Raghu Raja <[email protected]>
Prefix all symbols in the sendrecv interface with sendrecv_ to prevent naming conflicts with similar functions in the rdma protocol. We should update the rdma protocol as well, but this at least gets us out of violating C++'s ODR rules. Signed-off-by: Brian Barrett <[email protected]>
The param implementation stamped out accessor functions in each compilation unit, which created obvious problems with C++'s naming rules. We know we want to rewrite this in the future, so for now just stamp out one implementation in nccl_ofi_param.c that everyone can call, and do some ugly macro hacking to make that a minimum required change. Signed-off-by: Brian Barrett <[email protected]>
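A hypothetical sketch of the single-implementation approach (the macro and environment-variable names below are made up, not the plugin's real ones): the declaration macro goes in a header, and the definition macro is expanded once in nccl_ofi_param.c.

```c
#include <stdlib.h>

#define OFI_PARAM_INT_DECL(name) long ofi_param_##name(void)

#define OFI_PARAM_INT_DEFN(name, env, default_value)                      \
	long ofi_param_##name(void)                                        \
	{                                                                  \
		static long value;                                         \
		static int initialized;                                    \
		if (!initialized) {                                        \
			const char *s = getenv(env);                       \
			value = s ? strtol(s, NULL, 10) : (default_value); \
			initialized = 1;                                   \
		}                                                          \
		return value;                                              \
	}

/* In a header:         OFI_PARAM_INT_DECL(example_knob);
 * In nccl_ofi_param.c: OFI_PARAM_INT_DEFN(example_knob, "OFI_EXAMPLE_KNOB", 0); */
```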
bot:aws:retest
Signed-off-by: Sai Sunku <[email protected]>
bot:aws:retest
The combination of rail reordering when creating a device that uses multiple rails and topology generation created a bug that could lead to performance issues or crashes in NCCL. The autogenerated topology file only included the "leader NICs", which were the first provider in the info list. This reordering logic happened *after* topology generation, but before getProperties(), which returns the BDF of the first info (ie leader NIC). If NICs got reordered, the leader NIC could be inconsistent between those two reporting points, which irritates NCCL. This has not been a problem historically, because we either used a static topology file that included all NICs (plugin 1.10 or earlier) *or* we didn't reorder. It's only recently that EFA started providing enough information to reorder and trip the above bug. The fix is to move the reordering logic into topology generation, where it arguably should have been from the start. Now there's no possibility of inconsistent views, because the single source of truth is always used. This patch leaves the group count unhandled by the AWS implementation of the sorter. This is not a problem on existing platforms, and will be fixed in a follow-on commit. Signed-off-by: Brian Barrett <[email protected]>
Signed-off-by: Brian Barrett <[email protected]>
The AWS-specific code to sort devices based on resource sharing on P5/P5e assumed that there were only 2 devices that shared the resources. That's not necessarily true, so rewrite the code to be a little more generic, as well as have better abort mechanisms. Signed-off-by: Brian Barrett <[email protected]>
Use sscanf() instead of manual parsing for getting the lower 8 bits of the node_guid when determining the vf id of the efa device. In addition to being simpler, this also fixes a small bug, where we were parsing the value in decimal instead of hex (although since the only two current values are "00" or "01", it didn't actually matter). Signed-off-by: Brian Barrett <[email protected]>
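A toy example (not the plugin's parsing code) of why the base matters when reading two hex digits:

```c
#include <stdio.h>

int main(void)
{
	const char *low_byte = "0a";  /* pretend last two hex digits of a node_guid */
	unsigned int as_hex = 0, as_dec = 0;

	sscanf(low_byte, "%2x", &as_hex); /* correct: hexadecimal parse -> 10 */
	sscanf(low_byte, "%2u", &as_dec); /* old behavior: decimal parse stops at 'a' -> 0 */

	printf("hex: %u, decimal: %u\n", as_hex, as_dec);
	return 0;
}
```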
Signed-off-by: Amedeo Sapio <[email protected]>
PR aws#543 moved the control message to its own dedicated endpoint on a single rail, with its own CQ. To avoid the cost of polling an additional CQ, this commit makes the control endpoint share the CQ with rail 0. Signed-off-by: Amedeo Sapio <[email protected]>
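A sketch of the CQ sharing, assuming the endpoints and CQ were already opened elsewhere; binding both endpoints to one CQ means progress only has to poll a single queue:

```c
#include <rdma/fi_domain.h>
#include <rdma/fi_endpoint.h>

static int bind_shared_cq(struct fid_ep *rail0_data_ep, struct fid_ep *ctrl_ep,
			  struct fid_cq *shared_cq)
{
	int rc = fi_ep_bind(rail0_data_ep, &shared_cq->fid, FI_TRANSMIT | FI_RECV);
	if (rc != 0)
		return rc;
	return fi_ep_bind(ctrl_ep, &shared_cq->fid, FI_TRANSMIT | FI_RECV);
}
```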
PR aws#543 moved the control message to its own dedicated endpoint on a single rail. As a result, the control message is no longer sent on all rails in a round-robin fashion. This has impacted performance in some cases, so we want the option to still distribute the control messages across all rails. This commit provides an environment variable to choose between a) using a single dedicated endpoint for control messages, sharing the CQ with the data endpoint on rail 0; or b) using as many additional endpoints for the control messages as the number of rails, sharing the CQs with the data endpoints, and using the scheduler to round-robin the control messages across all rails. Signed-off-by: Amedeo Sapio <[email protected]>
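A sketch of how such a knob could be read; the environment variable name and values here are placeholders, not the plugin's actual parameter:

```c
#include <stdlib.h>
#include <string.h>

enum ctrl_sched {
	CTRL_SCHED_SINGLE_EP,    /* default: one control endpoint, CQ shared with rail 0 */
	CTRL_SCHED_ROUND_ROBIN   /* per-rail control endpoints, scheduler round-robins */
};

static enum ctrl_sched get_ctrl_sched(void)
{
	const char *v = getenv("EXAMPLE_CTRL_ROUND_ROBIN"); /* hypothetical name */
	if (v != NULL && strcmp(v, "1") == 0)
		return CTRL_SCHED_ROUND_ROBIN;
	return CTRL_SCHED_SINGLE_EP;
}
```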
bot:aws:retest
bot:aws:retest
bot:aws:retest
Signed-off-by: Sai Sunku <[email protected]>
Signed-off-by: Sai Sunku <[email protected]>
Signed-off-by: Sai Sunku <[email protected]>
bot:aws:retest
bot:aws:retest
bot:aws:retest
bot:aws:retest
Signed-off-by: Sai Sunku <[email protected]>
Issue #, if available:
Description of changes:
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.