Does tp_overlap require the tensor-parallel size to equal the world size? #966
The tensor parallel group can be a subset of the world group. We frequently split the world group into orthogonal tensor-parallel, data-parallel, and pipeline-parallel groups. Based on the error message, it looks like there's an error when NCCL is initializing IPC communicators:

TransformerEngine/transformer_engine/pytorch/csrc/userbuffers/userbuffers-host.cpp, line 501 at commit 4a4f05d
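To illustrate the group-splitting point above, here is a minimal sketch (not from this issue; `tp_size` and the rank layout are assumptions) of carving orthogonal tensor-parallel and data-parallel subgroups out of the world group with `torch.distributed`:

```python
import torch
import torch.distributed as dist

# Assumes a torchrun launch, which sets RANK/WORLD_SIZE/LOCAL_RANK.
dist.init_process_group(backend="nccl")
rank = dist.get_rank()
world_size = dist.get_world_size()

tp_size = 2  # hypothetical tensor-parallel width
assert world_size % tp_size == 0
dp_size = world_size // tp_size

# new_group() is collective: every rank must call it for every subgroup,
# in the same order, and keeps only the handle for its own group.
tp_group = None
for i in range(dp_size):
    ranks = list(range(i * tp_size, (i + 1) * tp_size))
    group = dist.new_group(ranks=ranks)
    if rank in ranks:
        tp_group = group

dp_group = None
for i in range(tp_size):
    ranks = list(range(i, world_size, tp_size))
    group = dist.new_group(ranks=ranks)
    if rank in ranks:
        dp_group = group
```

The resulting `tp_group` is what gets handed to TE modules, while `dp_group` would be used for data-parallel gradient reduction.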
To get more information, can you set NCCL_DEBUG=WARN in the environment?
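For reference, `NCCL_DEBUG=WARN` can simply be prefixed to the `torchrun` command line, or, as a sketch, set from inside the script, as long as it happens before the first NCCL communicator is created:

```python
import os

# NCCL reads this when a communicator is created, so it must be set
# before torch.distributed.init_process_group(backend="nccl").
os.environ["NCCL_DEBUG"] = "WARN"
```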
@kuangdao TE in general supports tensor-parallel groups that are a subset of the world group. As a disclaimer, comm+GEMM overlap is currently an experimental and somewhat fragile feature that is not yet fully supported in TE under all circumstances (and intentionally undocumented). That will change in the near future, as we improve the underlying device-to-device comms code and test it more rigorously on different platforms.
Thanks, I know. I think comm+GEMM overlap is an outstanding piece of work, and I hope more documentation, such as design and implementation notes, will be provided.
@kuangdao -- we merged some changes to comm+GEMM overlap in the last month specifically to address multi-node mixed DP/TP use-cases. This feature is still restricted to …
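For anyone hitting the same question, here is a hedged sketch of how comm+GEMM overlap is typically enabled through TE's PyTorch API; `initialize_ub` and the `ub_tp_comm_overlap` flag exist in TE, but their exact signatures vary between versions, and all sizes below are placeholders:

```python
import torch
import torch.distributed as dist
import transformer_engine.pytorch as te

dist.init_process_group(backend="nccl")
rank = dist.get_rank()
world_size = dist.get_world_size()
torch.cuda.set_device(rank % torch.cuda.device_count())

tp_size = 2
assert world_size % tp_size == 0

# Build the tensor-parallel subgroup (see the earlier sketch).
tp_group = None
for start in range(0, world_size, tp_size):
    ranks = list(range(start, start + tp_size))
    group = dist.new_group(ranks=ranks)
    if rank in ranks:
        tp_group = group

seq_len, batch_size, hidden_size = 2048, 2, 1024  # placeholder sizes

# Userbuffers must be initialized once, before constructing any
# overlap-enabled modules; `shape` is the communication buffer shape
# (tokens x hidden) for the full activation.
te.initialize_ub(
    shape=[seq_len * batch_size, hidden_size],
    tp_size=tp_size,
    use_fp8=False,
    dtype=torch.bfloat16,
)

layer = te.TransformerLayer(
    hidden_size=hidden_size,
    ffn_hidden_size=4 * hidden_size,
    num_attention_heads=16,
    set_parallel_mode=True,
    tp_group=tp_group,
    sequence_parallel=True,  # overlap is built around sequence parallelism
    ub_tp_comm_overlap=True,
)
```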
I want to set the TP size to 2 and the global world size to 2.
The code is:
I run it with:

```
torchrun --standalone --nnodes=1 --nproc-per-node=$(nvidia-smi -L | wc -l) te_sub_group.py
```
The error is:
The commit id of TransformerEngine is 4a4f05d, and I am using the Docker image nvcr.io/nvidia/nemo:24.05.