register_mr_buffers:544 NCCL WARN NET/OFI Unable to register memory (type = 2) for device 0. RC: -22, Error: Invalid argument #584
Comments
@bwbarrett I noticed you had helped with some related issues.
@AmedeoSapio I was actually able to get it working with the native PyTorch version in the AMI, i.e. torch 2.1.0.
I will try with 4 NICs, but presumably that will just increase bandwidth. This hints that there is some incompatibility between aws-ofi-nccl and the latest torch + torch deps (I have updated the original issue to note that I was installing the latest fresh, i.e. torch 2.4.1 & deps).
Hello. There is a known incompatibility between NCCL 2.19+ and Libfabric from EFA installers before 1.29. I'm guessing using the latest PyTorch will upgrade the NCCL version. Workarounds are any of the following:

1. Set the environment variable `FI_EFA_SET_CUDA_SYNC_MEMOPS=0`.
2. Update Libfabric (efa.ko and the userspace library) to the version shipped with EFA installer 1.29 or later.
3. Downgrade NCCL to a version earlier than 2.19.
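For context, workaround (1) only requires exporting an environment variable before launching the job. A minimal sketch, assuming an MPI launch of the standard nccl-tests all_reduce benchmark (the host file, process counts, and binary path are illustrative assumptions, not the exact command from this issue):

```bash
# Disable the sync_memops attribute that Libfabric < 1.29 sets on CUDA
# buffers; per the discussion below, NCCL does not need this attribute.
export FI_EFA_SET_CUDA_SYNC_MEMOPS=0

# Propagate the variable to every rank when launching, e.g. with Open MPI:
mpirun -np 16 -N 8 --hostfile hosts \
    -x FI_EFA_SET_CUDA_SYNC_MEMOPS=0 \
    -x LD_LIBRARY_PATH -x PATH \
    ./build/all_reduce_perf -b 8 -e 1G -f 2 -g 1
```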
Hi @rauteric, good to know! Is there any significant performance downside to (1), as that would be the least invasive for our stack atm?
No, this setting merely prevents Libfabric from setting a property (sync_memops) on CUDA buffers that is not needed for NCCL. It shouldn't have any performance impact.
Gotcha, confirmed that setting `FI_EFA_SET_CUDA_SYNC_MEMOPS=0` fixes the issue. Might be nice for future new users to "pin" this in some fashion under "Known problems/limitations" in an easy-to-find place, or to have an up-to-date compatibility chart. But for now, I guess it's indexed in this ticket :) Thanks again for the help!
For future searchers: if it's at all possible, please do prefer to update efa.ko and Libfabric instead of relying on this environment variable. This specific workaround doesn't come with a perf hit, but you are missing out on other performance improvements and bug fixes by staying on older versions, so you should update whenever you can.
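Updating is typically done by rerunning the AWS EFA installer, which refreshes both the kernel module and Libfabric. A sketch of the standard procedure (pin a specific installer version rather than "latest" if you need reproducibility):

```bash
# Download and run the AWS EFA installer; this updates efa.ko and Libfabric.
curl -O https://efa-installer.amazonaws.com/aws-efa-installer-latest.tar.gz
tar -xf aws-efa-installer-latest.tar.gz
cd aws-efa-installer
sudo ./efa_installer.sh -y

# Verify the installed Libfabric version afterwards (should be >= 1.29).
fi_info --version
```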
@visatish we've documented a bunch of these EFA/NCCL-related failure modes in the awsome-distributed-training repo; see aws-samples/awsome-distributed-training#203.
Original issue description:

Hi,

I'm trying to run an NCCL allreduce benchmark on AWS EC2 and am running into the error quoted in the title.
Setup:
2x p4d.24xlarge
"Deep Learning AMI GPU PyTorch 2.1.0 (Ubuntu 20.04)" AMI
Relevant libs (note that I have installed the latest torch 2.4.1 & deps fresh):
torch-2.4.1-cp310-cp310-manylinux1_x86_64.whl
nvidia-nccl-cu12==2.20.5
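For reference, a fresh install like the one described pulls in NCCL 2.20.5 as a dependency, i.e. past the 2.19 threshold from the discussion above. A quick way to confirm which NCCL version torch is using:

```bash
# Installing torch 2.4.1 pulls in nvidia-nccl-cu12==2.20.5 as a dependency.
pip install torch==2.4.1

# Print the NCCL version torch was built against (expected: (2, 20, 5)).
python -c "import torch; print(torch.cuda.nccl.version())"
```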
Single EFA-enabled NIC (note that I know this instance type can support up to 4x, but I'm starting with 1):
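A standard way to verify which EFA interfaces are visible on the instance is Libfabric's fi_info utility (output details vary by instance type and the number of attached NICs):

```bash
# List Libfabric endpoints for the EFA provider; with a single EFA-enabled
# NIC attached, one device (e.g. rdmap*) should be reported.
fi_info -p efa
```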
Cmd:
From https://github.com/stas00/ml-engineering.git:
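A representative invocation of the all-reduce benchmark from that repo, run on each of the two nodes (the script path and rendezvous settings are assumptions on my part, not necessarily the exact command used here):

```bash
# Clone the repo and launch the benchmark with torchrun on both nodes;
# MASTER_ADDR is the private IP of node 0, and --node_rank is 0 or 1.
git clone https://github.com/stas00/ml-engineering.git
torchrun --nproc_per_node=8 --nnodes=2 --node_rank=0 \
    --master_addr=$MASTER_ADDR --master_port=6000 \
    ml-engineering/network/benchmarks/all_reduce_bench.py
```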
Output:
nccl_out.txt
Note this particular portion:

register_mr_buffers:544 NCCL WARN NET/OFI Unable to register memory (type = 2) for device 0. RC: -22, Error: Invalid argument

I'm not quite sure what `Error: Invalid argument` (RC -22, i.e. EINVAL) could be; any help is appreciated. Thnx!