[BUG] training crash when set --tp-comm-overlap #1274

Open
ltm920716 opened this issue Nov 5, 2024 · 5 comments

ltm920716 commented Nov 5, 2024

Describe the bug
Training crashes with a segmentation fault on all 8 ranks when --tp-comm-overlap is set.

To Reproduce
MODEL_PARALLEL_ARGS=(
--tensor-model-parallel-size 8
--pipeline-model-parallel-size 1
--use-flash-attn
--sequence-parallel
--tp-comm-overlap
)

docker run --rm --gpus=all --shm-size=10g --ulimit memlock=-1 --ulimit stack=67108864 --ipc=host -v /mnt/data01/fake_data:/home nvcr.io/nvidia/pytorch:24.04-py3 bash -c "cd /home/Megatron-LM && bash examples/gpt3/single.sh"
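
For anyone trying to reproduce this: the command above does not show how examples/gpt3/single.sh starts the eight ranks, and the warning in the log below says this Transformer Engine version supports only the MPI bootstrap backend. Here is a minimal sketch, assuming it is useful to record which launcher started each rank before training begins. detect_launcher is a hypothetical helper, not part of Megatron-LM or Transformer Engine. It relies only on environment variables that Open MPI's mpirun (OMPI_COMM_WORLD_RANK, OMPI_COMM_WORLD_SIZE) and torchrun (TORCHELASTIC_RUN_ID, RANK, WORLD_SIZE) export. Whether the launcher has anything to do with the crash is an assumption, not something this report confirms.

```python
import os
from typing import Mapping, Optional


def detect_launcher(env: Optional[Mapping[str, str]] = None) -> str:
    """Guess which launcher started this process from its environment.

    Hypothetical diagnostic helper for illustration only; it is not part of
    Megatron-LM, Transformer Engine, or PyTorch.
    """
    env = os.environ if env is None else env

    # Open MPI's mpirun exports OMPI_COMM_WORLD_RANK and OMPI_COMM_WORLD_SIZE
    # to every process it starts.
    if "OMPI_COMM_WORLD_RANK" in env and "OMPI_COMM_WORLD_SIZE" in env:
        return "openmpi"

    # torchrun exports TORCHELASTIC_RUN_ID together with RANK and WORLD_SIZE.
    if "TORCHELASTIC_RUN_ID" in env and "RANK" in env and "WORLD_SIZE" in env:
        return "torchrun"

    return "unknown"


if __name__ == "__main__":
    # Print once per rank so the launcher shows up next to the crash log.
    print(f"launcher={detect_launcher()}")
```

Running this on every rank at the top of the training script and posting the output alongside the log would at least rule the launcher in or out as a factor.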

Expected behavior
Training runs successfully with --tp-comm-overlap enabled.

Stack trace/logs

/home/Megatron-LM/megatron/training/initialize.py:227: UserWarning: Transformer Engine v1.5.0+6a9edc3 supports only MPI bootstrap backend.
  warnings.warn(
/home/Megatron-LM/megatron/training/initialize.py:227: UserWarning: Transformer Engine v1.5.0+6a9edc3 supports only MPI bootstrap backend.
  warnings.warn(
/home/Megatron-LM/megatron/training/initialize.py:227: UserWarning: Transformer Engine v1.5.0+6a9edc3 supports only MPI bootstrap backend.
  warnings.warn(
/home/Megatron-LM/megatron/training/initialize.py:227: UserWarning: Transformer Engine v1.5.0+6a9edc3 supports only MPI bootstrap backend.
  warnings.warn(
/home/Megatron-LM/megatron/training/initialize.py:227: UserWarning: Transformer Engine v1.5.0+6a9edc3 supports only MPI bootstrap backend.
  warnings.warn(
/home/Megatron-LM/megatron/training/initialize.py:227: UserWarning: Transformer Engine v1.5.0+6a9edc3 supports only MPI bootstrap backend.
  warnings.warn(
/home/Megatron-LM/megatron/training/initialize.py:227: UserWarning: Transformer Engine v1.5.0+6a9edc3 supports only MPI bootstrap backend.
  warnings.warn(
/home/Megatron-LM/megatron/training/initialize.py:227: UserWarning: Transformer Engine v1.5.0+6a9edc3 supports only MPI bootstrap backend.
  warnings.warn(
[fb2a7d718a49:10272:0:10272] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x7f01bad4bea8)
[fb2a7d718a49:10277:0:10277] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x28)
[fb2a7d718a49:10275:0:10275] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x28)
[fb2a7d718a49:10276:0:10276] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x28)
[fb2a7d718a49:10271:0:10271] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x28)
[fb2a7d718a49:10274:0:10274] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x28)
[fb2a7d718a49:10273:0:10273] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x28)
[fb2a7d718a49:10278:0:10278] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x28)
==== backtrace (tid:  10272) ====
 0 0x0000000000042520 __sigaction()  ???:0
 1 0x000000000004d128 ompi_group_increment_proc_count()  /build-result/src/hpcx-v2.18-gcc-inbox-ubuntu22.04-cuda12-x86_64/ompi-efbeca7056b93dd17c67b66d1d514d39712e28d6/ompi/group/group_init.c:229
 2 0x000000000004d128 opal_atomic_add_fetch_32()  /build-result/src/hpcx-v2.18-gcc-inbox-ubuntu22.04-cuda12-x86_64/ompi-efbeca7056b93dd17c67b66d1d514d39712e28d6/ompi/../opal/include/opal/sys/atomic_impl.h:384
 3 0x000000000004d128 opal_thread_add_fetch_32()  /build-result/src/hpcx-v2.18-gcc-inbox-ubuntu22.04-cuda12-x86_64/ompi-efbeca7056b93dd17c67b66d1d514d39712e28d6/ompi/../opal/threads/thread_usage.h:152
 4 0x000000000004d128 opal_obj_update()  /build-result/src/hpcx-v2.18-gcc-inbox-ubuntu22.04-cuda12-x86_64/ompi-efbeca7056b93dd17c67b66d1d514d39712e28d6/ompi/../opal/class/opal_object.h:534
 5 0x000000000004d128 ompi_group_increment_proc_count()  /build-result/src/hpcx-v2.18-gcc-inbox-ubuntu22.04-cuda12-x86_64/ompi-efbeca7056b93dd17c67b66d1d514d39712e28d6/ompi/group/group_init.c:226
 6 0x000000000004d9e9 ompi_group_incl_plist()  /build-result/src/hpcx-v2.18-gcc-inbox-ubuntu22.04-cuda12-x86_64/ompi-efbeca7056b93dd17c67b66d1d514d39712e28d6/ompi/group/group_plist.c:128
 7 0x000000000007421b PMPI_Group_incl()  /build-result/src/hpcx-v2.18-gcc-inbox-ubuntu22.04-cuda12-x86_64/ompi-efbeca7056b93dd17c67b66d1d514d39712e28d6/ompi/mpi/c/profile/pgroup_incl.c:87
 8 0x0000000004f1ea5d c10d::ProcessGroupMPI::createProcessGroupMPI()  ???:0
 9 0x0000000000c35470 pybind11::cpp_function::initialize<torch::distributed::c10d::(anonymous namespace)::c10d_init(_object*, _object*)::{lambda(std::vector<int, std::allocator<int> >)#74}, c10::intrusive_ptr<c10d::ProcessGroupMPI, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroupMPI> >, std::vector<int, std::allocator<int> >, pybind11::name, pybind11::scope, pybind11::sibling, pybind11::call_guard<pybind11::gil_scoped_release> >(torch::distributed::c10d::(anonymous namespace)::c10d_init(_object*, _object*)::{lambda(std::vector<int, std::allocator<int> >)#74}&&, c10::intrusive_ptr<c10d::ProcessGroupMPI, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroupMPI> > (*)(std::vector<int, std::allocator<int> >), pybind11::name const&, pybind11::scope const&, pybind11::sibling const&, pybind11::call_guard<pybind11::gil_scoped_release> const&)::{lambda(pybind11::detail::function_call&)#3}::_FUN()  init.cpp:0
10 0x000000000042efb7 pybind11::cpp_function::dispatcher()  :0
11 0x000000000015a10e PyObject_CallFunctionObjArgs()  ???:0
12 0x0000000000150a7b _PyObject_MakeTpCall()  ???:0
13 0x0000000000149629 _PyEval_EvalFrameDefault()  ???:0
14 0x000000000015a9fc _PyFunction_Vectorcall()  ???:0
15 0x000000000014453c _PyEval_EvalFrameDefault()  ???:0
16 0x000000000015a9fc _PyFunction_Vectorcall()  ???:0
17 0x000000000014453c _PyEval_EvalFrameDefault()  ???:0
18 0x000000000015a9fc _PyFunction_Vectorcall()  ???:0
19 0x0000000000169492 PyObject_Call()  ???:0
20 0x00000000001455d7 _PyEval_EvalFrameDefault()  ???:0
21 0x000000000015a9fc _PyFunction_Vectorcall()  ???:0
22 0x000000000014453c _PyEval_EvalFrameDefault()  ???:0
23 0x000000000015a9fc _PyFunction_Vectorcall()  ???:0
24 0x000000000014326d _PyEval_EvalFrameDefault()  ???:0
25 0x000000000015a9fc _PyFunction_Vectorcall()  ???:0
26 0x000000000014453c _PyEval_EvalFrameDefault()  ???:0
27 0x000000000015a9fc _PyFunction_Vectorcall()  ???:0
28 0x000000000014453c _PyEval_EvalFrameDefault()  ???:0
29 0x000000000013f9c6 _PyArg_ParseTuple_SizeT()  ???:0
30 0x0000000000235256 PyEval_EvalCode()  ???:0
31 0x0000000000260108 PyUnicode_Tailmatch()  ???:0
32 0x00000000002599cb PyInit__collections()  ???:0
33 0x000000000025fe55 PyUnicode_Tailmatch()  ???:0
34 0x000000000025f338 _PyRun_SimpleFileObject()  ???:0
35 0x000000000025ef83 _PyRun_AnyFileObject()  ???:0
36 0x0000000000251a5e Py_RunMain()  ???:0
37 0x000000000022802d Py_BytesMain()  ???:0
38 0x0000000000029d90 __libc_init_first()  ???:0
39 0x0000000000029e40 __libc_start_main()  ???:0
40 0x0000000000227f25 _start()  ???:0
=================================
==== backtrace (tid:  10277) ====
 0 0x0000000000042520 __sigaction()  ???:0
 1 0x0000000000042b60 ompi_dpm_group_is_dyn()  /build-result/src/hpcx-v2.18-gcc-inbox-ubuntu22.04-cuda12-x86_64/ompi-efbeca7056b93dd17c67b66d1d514d39712e28d6/ompi/dpm/dpm.c:1267
 2 0x0000000000042b60 ompi_dpm_group_is_dyn()  /build-result/src/hpcx-v2.18-gcc-inbox-ubuntu22.04-cuda12-x86_64/ompi-efbeca7056b93dd17c67b66d1d514d39712e28d6/ompi/dpm/dpm.c:1268
 3 0x0000000000042b60 ompi_dpm_mark_dyncomm()  /build-result/src/hpcx-v2.18-gcc-inbox-ubuntu22.04-cuda12-x86_64/ompi-efbeca7056b93dd17c67b66d1d514d39712e28d6/ompi/dpm/dpm.c:1299
 4 0x0000000000034388 ompi_comm_set_nb()  /build-result/src/hpcx-v2.18-gcc-inbox-ubuntu22.04-cuda12-x86_64/ompi-efbeca7056b93dd17c67b66d1d514d39712e28d6/ompi/communicator/comm.c:215
 5 0x00000000000346ba ompi_comm_set()  /build-result/src/hpcx-v2.18-gcc-inbox-ubuntu22.04-cuda12-x86_64/ompi-efbeca7056b93dd17c67b66d1d514d39712e28d6/ompi/communicator/comm.c:116
 6 0x0000000000034ef3 ompi_comm_create()  /build-result/src/hpcx-v2.18-gcc-inbox-ubuntu22.04-cuda12-x86_64/ompi-efbeca7056b93dd17c67b66d1d514d39712e28d6/ompi/communicator/comm.c:344
 7 0x000000000006b24a PMPI_Comm_create()  /build-result/src/hpcx-v2.18-gcc-inbox-ubuntu22.04-cuda12-x86_64/ompi-efbeca7056b93dd17c67b66d1d514d39712e28d6/ompi/mpi/c/profile/pcomm_create.c:66
 8 0x0000000004f1ea87 c10d::ProcessGroupMPI::createProcessGroupMPI()  ???:0
 9 0x0000000000c35470 pybind11::cpp_function::initialize<torch::distributed::c10d::(anonymous namespace)::c10d_init(_object*, _object*)::{lambda(std::vector<int, std::allocator<int> >)#74}, c10::intrusive_ptr<c10d::ProcessGroupMPI, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroupMPI> >, std::vector<int, std::allocator<int> >, pybind11::name, pybind11::scope, pybind11::sibling, pybind11::call_guard<pybind11::gil_scoped_release> >(torch::distributed::c10d::(anonymous namespace)::c10d_init(_object*, _object*)::{lambda(std::vector<int, std::allocator<int> >)#74}&&, c10::intrusive_ptr<c10d::ProcessGroupMPI, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroupMPI> > (*)(std::vector<int, std::allocator<int> >), pybind11::name const&, pybind11::scope const&, pybind11::sibling const&, pybind11::call_guard<pybind11::gil_scoped_release> const&)::{lambda(pybind11::detail::function_call&)#3}::_FUN()  init.cpp:0
10 0x000000000042efb7 pybind11::cpp_function::dispatcher()  :0
11 0x000000000015a10e PyObject_CallFunctionObjArgs()  ???:0
12 0x0000000000150a7b _PyObject_MakeTpCall()  ???:0
13 0x0000000000149629 _PyEval_EvalFrameDefault()  ???:0
14 0x000000000015a9fc _PyFunction_Vectorcall()  ???:0
15 0x000000000014453c _PyEval_EvalFrameDefault()  ???:0
16 0x000000000015a9fc _PyFunction_Vectorcall()  ???:0
17 0x000000000014453c _PyEval_EvalFrameDefault()  ???:0
18 0x000000000015a9fc _PyFunction_Vectorcall()  ???:0
19 0x0000000000169492 PyObject_Call()  ???:0
20 0x00000000001455d7 _PyEval_EvalFrameDefault()  ???:0
21 0x000000000015a9fc _PyFunction_Vectorcall()  ???:0
22 0x000000000014453c _PyEval_EvalFrameDefault()  ???:0
23 0x000000000015a9fc _PyFunction_Vectorcall()  ???:0
24 0x000000000014326d _PyEval_EvalFrameDefault()  ???:0
25 0x000000000015a9fc _PyFunction_Vectorcall()  ???:0
26 0x000000000014453c _PyEval_EvalFrameDefault()  ???:0
27 0x000000000015a9fc _PyFunction_Vectorcall()  ???:0
28 0x000000000014453c _PyEval_EvalFrameDefault()  ???:0
29 0x000000000013f9c6 _PyArg_ParseTuple_SizeT()  ???:0
30 0x0000000000235256 PyEval_EvalCode()  ???:0
31 0x0000000000260108 PyUnicode_Tailmatch()  ???:0
32 0x00000000002599cb PyInit__collections()  ???:0
33 0x000000000025fe55 PyUnicode_Tailmatch()  ???:0
34 0x000000000025f338 _PyRun_SimpleFileObject()  ???:0
35 0x000000000025ef83 _PyRun_AnyFileObject()  ???:0
36 0x0000000000251a5e Py_RunMain()  ???:0
37 0x000000000022802d Py_BytesMain()  ???:0
38 0x0000000000029d90 __libc_init_first()  ???:0
39 0x0000000000029e40 __libc_start_main()  ???:0
40 0x0000000000227f25 _start()  ???:0
=================================
==== backtrace (tid:  10275) ====
 0 0x0000000000042520 __sigaction()  ???:0
 1 0x0000000000042b60 ompi_dpm_group_is_dyn()  /build-result/src/hpcx-v2.18-gcc-inbox-ubuntu22.04-cuda12-x86_64/ompi-efbeca7056b93dd17c67b66d1d514d39712e28d6/ompi/dpm/dpm.c:1267
 2 0x0000000000042b60 ompi_dpm_group_is_dyn()  /build-result/src/hpcx-v2.18-gcc-inbox-ubuntu22.04-cuda12-x86_64/ompi-efbeca7056b93dd17c67b66d1d514d39712e28d6/ompi/dpm/dpm.c:1268
 3 0x0000000000042b60 ompi_dpm_mark_dyncomm()  /build-result/src/hpcx-v2.18-gcc-inbox-ubuntu22.04-cuda12-x86_64/ompi-efbeca7056b93dd17c67b66d1d514d39712e28d6/ompi/dpm/dpm.c:1299
 4 0x0000000000034388 ompi_comm_set_nb()  /build-result/src/hpcx-v2.18-gcc-inbox-ubuntu22.04-cuda12-x86_64/ompi-efbeca7056b93dd17c67b66d1d514d39712e28d6/ompi/communicator/comm.c:215
 5 0x00000000000346ba ompi_comm_set()  /build-result/src/hpcx-v2.18-gcc-inbox-ubuntu22.04-cuda12-x86_64/ompi-efbeca7056b93dd17c67b66d1d514d39712e28d6/ompi/communicator/comm.c:116
 6 0x0000000000034ef3 ompi_comm_create()  /build-result/src/hpcx-v2.18-gcc-inbox-ubuntu22.04-cuda12-x86_64/ompi-efbeca7056b93dd17c67b66d1d514d39712e28d6/ompi/communicator/comm.c:344
 7 0x000000000006b24a PMPI_Comm_create()  /build-result/src/hpcx-v2.18-gcc-inbox-ubuntu22.04-cuda12-x86_64/ompi-efbeca7056b93dd17c67b66d1d514d39712e28d6/ompi/mpi/c/profile/pcomm_create.c:66
 8 0x0000000004f1ea87 c10d::ProcessGroupMPI::createProcessGroupMPI()  ???:0
 9 0x0000000000c35470 pybind11::cpp_function::initialize<torch::distributed::c10d::(anonymous namespace)::c10d_init(_object*, _object*)::{lambda(std::vector<int, std::allocator<int> >)#74}, c10::intrusive_ptr<c10d::ProcessGroupMPI, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroupMPI> >, std::vector<int, std::allocator<int> >, pybind11::name, pybind11::scope, pybind11::sibling, pybind11::call_guard<pybind11::gil_scoped_release> >(torch::distributed::c10d::(anonymous namespace)::c10d_init(_object*, _object*)::{lambda(std::vector<int, std::allocator<int> >)#74}&&, c10::intrusive_ptr<c10d::ProcessGroupMPI, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroupMPI> > (*)(std::vector<int, std::allocator<int> >), pybind11::name const&, pybind11::scope const&, pybind11::sibling const&, pybind11::call_guard<pybind11::gil_scoped_release> const&)::{lambda(pybind11::detail::function_call&)#3}::_FUN()  init.cpp:0
10 0x000000000042efb7 pybind11::cpp_function::dispatcher()  :0
11 0x000000000015a10e PyObject_CallFunctionObjArgs()  ???:0
12 0x0000000000150a7b _PyObject_MakeTpCall()  ???:0
13 0x0000000000149629 _PyEval_EvalFrameDefault()  ???:0
14 0x000000000015a9fc _PyFunction_Vectorcall()  ???:0
15 0x000000000014453c _PyEval_EvalFrameDefault()  ???:0
16 0x000000000015a9fc _PyFunction_Vectorcall()  ???:0
17 0x000000000014453c _PyEval_EvalFrameDefault()  ???:0
18 0x000000000015a9fc _PyFunction_Vectorcall()  ???:0
19 0x0000000000169492 PyObject_Call()  ???:0
20 0x00000000001455d7 _PyEval_EvalFrameDefault()  ???:0
21 0x000000000015a9fc _PyFunction_Vectorcall()  ???:0
22 0x000000000014453c _PyEval_EvalFrameDefault()  ???:0
23 0x000000000015a9fc _PyFunction_Vectorcall()  ???:0
24 0x000000000014326d _PyEval_EvalFrameDefault()  ???:0
25 0x000000000015a9fc _PyFunction_Vectorcall()  ???:0
26 0x000000000014453c _PyEval_EvalFrameDefault()  ???:0
27 0x000000000015a9fc _PyFunction_Vectorcall()  ???:0
28 0x000000000014453c _PyEval_EvalFrameDefault()  ???:0
29 0x000000000013f9c6 _PyArg_ParseTuple_SizeT()  ???:0
30 0x0000000000235256 PyEval_EvalCode()  ???:0
31 0x0000000000260108 PyUnicode_Tailmatch()  ???:0
32 0x00000000002599cb PyInit__collections()  ???:0
33 0x000000000025fe55 PyUnicode_Tailmatch()  ???:0
34 0x000000000025f338 _PyRun_SimpleFileObject()  ???:0
35 0x000000000025ef83 _PyRun_AnyFileObject()  ???:0
36 0x0000000000251a5e Py_RunMain()  ???:0
37 0x000000000022802d Py_BytesMain()  ???:0
38 0x0000000000029d90 __libc_init_first()  ???:0
39 0x0000000000029e40 __libc_start_main()  ???:0
40 0x0000000000227f25 _start()  ???:0
=================================
==== backtrace (tid:  10276) ====
 0 0x0000000000042520 __sigaction()  ???:0
 1 0x0000000000042b60 ompi_dpm_group_is_dyn()  /build-result/src/hpcx-v2.18-gcc-inbox-ubuntu22.04-cuda12-x86_64/ompi-efbeca7056b93dd17c67b66d1d514d39712e28d6/ompi/dpm/dpm.c:1267
 2 0x0000000000042b60 ompi_dpm_group_is_dyn()  /build-result/src/hpcx-v2.18-gcc-inbox-ubuntu22.04-cuda12-x86_64/ompi-efbeca7056b93dd17c67b66d1d514d39712e28d6/ompi/dpm/dpm.c:1268
 3 0x0000000000042b60 ompi_dpm_mark_dyncomm()  /build-result/src/hpcx-v2.18-gcc-inbox-ubuntu22.04-cuda12-x86_64/ompi-efbeca7056b93dd17c67b66d1d514d39712e28d6/ompi/dpm/dpm.c:1299
 4 0x0000000000034388 ompi_comm_set_nb()  /build-result/src/hpcx-v2.18-gcc-inbox-ubuntu22.04-cuda12-x86_64/ompi-efbeca7056b93dd17c67b66d1d514d39712e28d6/ompi/communicator/comm.c:215
 5 0x00000000000346ba ompi_comm_set()  /build-result/src/hpcx-v2.18-gcc-inbox-ubuntu22.04-cuda12-x86_64/ompi-efbeca7056b93dd17c67b66d1d514d39712e28d6/ompi/communicator/comm.c:116
 6 0x0000000000034ef3 ompi_comm_create()  /build-result/src/hpcx-v2.18-gcc-inbox-ubuntu22.04-cuda12-x86_64/ompi-efbeca7056b93dd17c67b66d1d514d39712e28d6/ompi/communicator/comm.c:344
 7 0x000000000006b24a PMPI_Comm_create()  /build-result/src/hpcx-v2.18-gcc-inbox-ubuntu22.04-cuda12-x86_64/ompi-efbeca7056b93dd17c67b66d1d514d39712e28d6/ompi/mpi/c/profile/pcomm_create.c:66
 8 0x0000000004f1ea87 c10d::ProcessGroupMPI::createProcessGroupMPI()  ???:0
 9 0x0000000000c35470 pybind11::cpp_function::initialize<torch::distributed::c10d::(anonymous namespace)::c10d_init(_object*, _object*)::{lambda(std::vector<int, std::allocator<int> >)#74}, c10::intrusive_ptr<c10d::ProcessGroupMPI, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroupMPI> >, std::vector<int, std::allocator<int> >, pybind11::name, pybind11::scope, pybind11::sibling, pybind11::call_guard<pybind11::gil_scoped_release> >(torch::distributed::c10d::(anonymous namespace)::c10d_init(_object*, _object*)::{lambda(std::vector<int, std::allocator<int> >)#74}&&, c10::intrusive_ptr<c10d::ProcessGroupMPI, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroupMPI> > (*)(std::vector<int, std::allocator<int> >), pybind11::name const&, pybind11::scope const&, pybind11::sibling const&, pybind11::call_guard<pybind11::gil_scoped_release> const&)::{lambda(pybind11::detail::function_call&)#3}::_FUN()  init.cpp:0
10 0x000000000042efb7 pybind11::cpp_function::dispatcher()  :0
11 0x000000000015a10e PyObject_CallFunctionObjArgs()  ???:0
12 0x0000000000150a7b _PyObject_MakeTpCall()  ???:0
13 0x0000000000149629 _PyEval_EvalFrameDefault()  ???:0
14 0x000000000015a9fc _PyFunction_Vectorcall()  ???:0
15 0x000000000014453c _PyEval_EvalFrameDefault()  ???:0
16 0x000000000015a9fc _PyFunction_Vectorcall()  ???:0
17 0x000000000014453c _PyEval_EvalFrameDefault()  ???:0
18 0x000000000015a9fc _PyFunction_Vectorcall()  ???:0
19 0x0000000000169492 PyObject_Call()  ???:0
20 0x00000000001455d7 _PyEval_EvalFrameDefault()  ???:0
21 0x000000000015a9fc _PyFunction_Vectorcall()  ???:0
22 0x000000000014453c _PyEval_EvalFrameDefault()  ???:0
23 0x000000000015a9fc _PyFunction_Vectorcall()  ???:0
24 0x000000000014326d _PyEval_EvalFrameDefault()  ???:0
25 0x000000000015a9fc _PyFunction_Vectorcall()  ???:0
26 0x000000000014453c _PyEval_EvalFrameDefault()  ???:0
27 0x000000000015a9fc _PyFunction_Vectorcall()  ???:0
28 0x000000000014453c _PyEval_EvalFrameDefault()  ???:0
29 0x000000000013f9c6 _PyArg_ParseTuple_SizeT()  ???:0
30 0x0000000000235256 PyEval_EvalCode()  ???:0
31 0x0000000000260108 PyUnicode_Tailmatch()  ???:0
32 0x00000000002599cb PyInit__collections()  ???:0
33 0x000000000025fe55 PyUnicode_Tailmatch()  ???:0
34 0x000000000025f338 _PyRun_SimpleFileObject()  ???:0
35 0x000000000025ef83 _PyRun_AnyFileObject()  ???:0
36 0x0000000000251a5e Py_RunMain()  ???:0
37 0x000000000022802d Py_BytesMain()  ???:0
38 0x0000000000029d90 __libc_init_first()  ???:0
39 0x0000000000029e40 __libc_start_main()  ???:0
40 0x0000000000227f25 _start()  ???:0
=================================
==== backtrace (tid:  10271) ====
 0 0x0000000000042520 __sigaction()  ???:0
 1 0x0000000000042b60 ompi_dpm_group_is_dyn()  /build-result/src/hpcx-v2.18-gcc-inbox-ubuntu22.04-cuda12-x86_64/ompi-efbeca7056b93dd17c67b66d1d514d39712e28d6/ompi/dpm/dpm.c:1267
 2 0x0000000000042b60 ompi_dpm_group_is_dyn()  /build-result/src/hpcx-v2.18-gcc-inbox-ubuntu22.04-cuda12-x86_64/ompi-efbeca7056b93dd17c67b66d1d514d39712e28d6/ompi/dpm/dpm.c:1268
 3 0x0000000000042b60 ompi_dpm_mark_dyncomm()  /build-result/src/hpcx-v2.18-gcc-inbox-ubuntu22.04-cuda12-x86_64/ompi-efbeca7056b93dd17c67b66d1d514d39712e28d6/ompi/dpm/dpm.c:1299
 4 0x0000000000034388 ompi_comm_set_nb()  /build-result/src/hpcx-v2.18-gcc-inbox-ubuntu22.04-cuda12-x86_64/ompi-efbeca7056b93dd17c67b66d1d514d39712e28d6/ompi/communicator/comm.c:215
 5 0x00000000000346ba ompi_comm_set()  /build-result/src/hpcx-v2.18-gcc-inbox-ubuntu22.04-cuda12-x86_64/ompi-efbeca7056b93dd17c67b66d1d514d39712e28d6/ompi/communicator/comm.c:116
 6 0x0000000000034ef3 ompi_comm_create()  /build-result/src/hpcx-v2.18-gcc-inbox-ubuntu22.04-cuda12-x86_64/ompi-efbeca7056b93dd17c67b66d1d514d39712e28d6/ompi/communicator/comm.c:344
 7 0x000000000006b24a PMPI_Comm_create()  /build-result/src/hpcx-v2.18-gcc-inbox-ubuntu22.04-cuda12-x86_64/ompi-efbeca7056b93dd17c67b66d1d514d39712e28d6/ompi/mpi/c/profile/pcomm_create.c:66
 8 0x0000000004f1ea87 c10d::ProcessGroupMPI::createProcessGroupMPI()  ???:0
 9 0x0000000000c35470 pybind11::cpp_function::initialize<torch::distributed::c10d::(anonymous namespace)::c10d_init(_object*, _object*)::{lambda(std::vector<int, std::allocator<int> >)#74}, c10::intrusive_ptr<c10d::ProcessGroupMPI, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroupMPI> >, std::vector<int, std::allocator<int> >, pybind11::name, pybind11::scope, pybind11::sibling, pybind11::call_guard<pybind11::gil_scoped_release> >(torch::distributed::c10d::(anonymous namespace)::c10d_init(_object*, _object*)::{lambda(std::vector<int, std::allocator<int> >)#74}&&, c10::intrusive_ptr<c10d::ProcessGroupMPI, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroupMPI> > (*)(std::vector<int, std::allocator<int> >), pybind11::name const&, pybind11::scope const&, pybind11::sibling const&, pybind11::call_guard<pybind11::gil_scoped_release> const&)::{lambda(pybind11::detail::function_call&)#3}::_FUN()  init.cpp:0
10 0x000000000042efb7 pybind11::cpp_function::dispatcher()  :0
11 0x000000000015a10e PyObject_CallFunctionObjArgs()  ???:0
12 0x0000000000150a7b _PyObject_MakeTpCall()  ???:0
13 0x0000000000149629 _PyEval_EvalFrameDefault()  ???:0
14 0x000000000015a9fc _PyFunction_Vectorcall()  ???:0
15 0x000000000014453c _PyEval_EvalFrameDefault()  ???:0
16 0x000000000015a9fc _PyFunction_Vectorcall()  ???:0
17 0x000000000014453c _PyEval_EvalFrameDefault()  ???:0
18 0x000000000015a9fc _PyFunction_Vectorcall()  ???:0
19 0x0000000000169492 PyObject_Call()  ???:0
20 0x00000000001455d7 _PyEval_EvalFrameDefault()  ???:0
21 0x000000000015a9fc _PyFunction_Vectorcall()  ???:0
22 0x000000000014453c _PyEval_EvalFrameDefault()  ???:0
23 0x000000000015a9fc _PyFunction_Vectorcall()  ???:0
24 0x000000000014326d _PyEval_EvalFrameDefault()  ???:0
25 0x000000000015a9fc _PyFunction_Vectorcall()  ???:0
26 0x000000000014453c _PyEval_EvalFrameDefault()  ???:0
27 0x000000000015a9fc _PyFunction_Vectorcall()  ???:0
28 0x000000000014453c _PyEval_EvalFrameDefault()  ???:0
29 0x000000000013f9c6 _PyArg_ParseTuple_SizeT()  ???:0
30 0x0000000000235256 PyEval_EvalCode()  ???:0
31 0x0000000000260108 PyUnicode_Tailmatch()  ???:0
32 0x00000000002599cb PyInit__collections()  ???:0
33 0x000000000025fe55 PyUnicode_Tailmatch()  ???:0
34 0x000000000025f338 _PyRun_SimpleFileObject()  ???:0
35 0x000000000025ef83 _PyRun_AnyFileObject()  ???:0
36 0x0000000000251a5e Py_RunMain()  ???:0
37 0x000000000022802d Py_BytesMain()  ???:0
38 0x0000000000029d90 __libc_init_first()  ???:0
39 0x0000000000029e40 __libc_start_main()  ???:0
40 0x0000000000227f25 _start()  ???:0
=================================
==== backtrace (tid:  10273) ====
 0 0x0000000000042520 __sigaction()  ???:0
 1 0x0000000000042b60 ompi_dpm_group_is_dyn()  /build-result/src/hpcx-v2.18-gcc-inbox-ubuntu22.04-cuda12-x86_64/ompi-efbeca7056b93dd17c67b66d1d514d39712e28d6/ompi/dpm/dpm.c:1267
 2 0x0000000000042b60 ompi_dpm_group_is_dyn()  /build-result/src/hpcx-v2.18-gcc-inbox-ubuntu22.04-cuda12-x86_64/ompi-efbeca7056b93dd17c67b66d1d514d39712e28d6/ompi/dpm/dpm.c:1268
 3 0x0000000000042b60 ompi_dpm_mark_dyncomm()  /build-result/src/hpcx-v2.18-gcc-inbox-ubuntu22.04-cuda12-x86_64/ompi-efbeca7056b93dd17c67b66d1d514d39712e28d6/ompi/dpm/dpm.c:1299
 4 0x0000000000034388 ompi_comm_set_nb()  /build-result/src/hpcx-v2.18-gcc-inbox-ubuntu22.04-cuda12-x86_64/ompi-efbeca7056b93dd17c67b66d1d514d39712e28d6/ompi/communicator/comm.c:215
 5 0x00000000000346ba ompi_comm_set()  /build-result/src/hpcx-v2.18-gcc-inbox-ubuntu22.04-cuda12-x86_64/ompi-efbeca7056b93dd17c67b66d1d514d39712e28d6/ompi/communicator/comm.c:116
 6 0x0000000000034ef3 ompi_comm_create()  /build-result/src/hpcx-v2.18-gcc-inbox-ubuntu22.04-cuda12-x86_64/ompi-efbeca7056b93dd17c67b66d1d514d39712e28d6/ompi/communicator/comm.c:344
 7 0x000000000006b24a PMPI_Comm_create()  /build-result/src/hpcx-v2.18-gcc-inbox-ubuntu22.04-cuda12-x86_64/ompi-efbeca7056b93dd17c67b66d1d514d39712e28d6/ompi/mpi/c/profile/pcomm_create.c:66
 8 0x0000000004f1ea87 c10d::ProcessGroupMPI::createProcessGroupMPI()  ???:0
 9 0x0000000000c35470 pybind11::cpp_function::initialize<torch::distributed::c10d::(anonymous namespace)::c10d_init(_object*, _object*)::{lambda(std::vector<int, std::allocator<int> >)#74}, c10::intrusive_ptr<c10d::ProcessGroupMPI, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroupMPI> >, std::vector<int, std::allocator<int> >, pybind11::name, pybind11::scope, pybind11::sibling, pybind11::call_guard<pybind11::gil_scoped_release> >(torch::distributed::c10d::(anonymous namespace)::c10d_init(_object*, _object*)::{lambda(std::vector<int, std::allocator<int> >)#74}&&, c10::intrusive_ptr<c10d::ProcessGroupMPI, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroupMPI> > (*)(std::vector<int, std::allocator<int> >), pybind11::name const&, pybind11::scope const&, pybind11::sibling const&, pybind11::call_guard<pybind11::gil_scoped_release> const&)::{lambda(pybind11::detail::function_call&)#3}::_FUN()  init.cpp:0
10 0x000000000042efb7 pybind11::cpp_function::dispatcher()  :0
11 0x000000000015a10e PyObject_CallFunctionObjArgs()  ???:0
12 0x0000000000150a7b _PyObject_MakeTpCall()  ???:0
13 0x0000000000149629 _PyEval_EvalFrameDefault()  ???:0
14 0x000000000015a9fc _PyFunction_Vectorcall()  ???:0
15 0x000000000014453c _PyEval_EvalFrameDefault()  ???:0
16 0x000000000015a9fc _PyFunction_Vectorcall()  ???:0
17 0x000000000014453c _PyEval_EvalFrameDefault()  ???:0
18 0x000000000015a9fc _PyFunction_Vectorcall()  ???:0
19 0x0000000000169492 PyObject_Call()  ???:0
20 0x00000000001455d7 _PyEval_EvalFrameDefault()  ???:0
21 0x000000000015a9fc _PyFunction_Vectorcall()  ???:0
22 0x000000000014453c _PyEval_EvalFrameDefault()  ???:0
23 0x000000000015a9fc _PyFunction_Vectorcall()  ???:0
24 0x000000000014326d _PyEval_EvalFrameDefault()  ???:0
25 0x000000000015a9fc _PyFunction_Vectorcall()  ???:0
26 0x000000000014453c _PyEval_EvalFrameDefault()  ???:0
27 0x000000000015a9fc _PyFunction_Vectorcall()  ???:0
28 0x000000000014453c _PyEval_EvalFrameDefault()  ???:0
29 0x000000000013f9c6 _PyArg_ParseTuple_SizeT()  ???:0
30 0x0000000000235256 PyEval_EvalCode()  ???:0
31 0x0000000000260108 PyUnicode_Tailmatch()  ???:0
32 0x00000000002599cb PyInit__collections()  ???:0
33 0x000000000025fe55 PyUnicode_Tailmatch()  ???:0
34 0x000000000025f338 _PyRun_SimpleFileObject()  ???:0
35 0x000000000025ef83 _PyRun_AnyFileObject()  ???:0
36 0x0000000000251a5e Py_RunMain()  ???:0
37 0x000000000022802d Py_BytesMain()  ???:0
38 0x0000000000029d90 __libc_init_first()  ???:0
39 0x0000000000029e40 __libc_start_main()  ???:0
40 0x0000000000227f25 _start()  ???:0
=================================
==== backtrace (tid:  10274) ====
 0 0x0000000000042520 __sigaction()  ???:0
 1 0x0000000000042b60 ompi_dpm_group_is_dyn()  /build-result/src/hpcx-v2.18-gcc-inbox-ubuntu22.04-cuda12-x86_64/ompi-efbeca7056b93dd17c67b66d1d514d39712e28d6/ompi/dpm/dpm.c:1267
 2 0x0000000000042b60 ompi_dpm_group_is_dyn()  /build-result/src/hpcx-v2.18-gcc-inbox-ubuntu22.04-cuda12-x86_64/ompi-efbeca7056b93dd17c67b66d1d514d39712e28d6/ompi/dpm/dpm.c:1268
 3 0x0000000000042b60 ompi_dpm_mark_dyncomm()  /build-result/src/hpcx-v2.18-gcc-inbox-ubuntu22.04-cuda12-x86_64/ompi-efbeca7056b93dd17c67b66d1d514d39712e28d6/ompi/dpm/dpm.c:1299
 4 0x0000000000034388 ompi_comm_set_nb()  /build-result/src/hpcx-v2.18-gcc-inbox-ubuntu22.04-cuda12-x86_64/ompi-efbeca7056b93dd17c67b66d1d514d39712e28d6/ompi/communicator/comm.c:215
 5 0x00000000000346ba ompi_comm_set()  /build-result/src/hpcx-v2.18-gcc-inbox-ubuntu22.04-cuda12-x86_64/ompi-efbeca7056b93dd17c67b66d1d514d39712e28d6/ompi/communicator/comm.c:116
 6 0x0000000000034ef3 ompi_comm_create()  /build-result/src/hpcx-v2.18-gcc-inbox-ubuntu22.04-cuda12-x86_64/ompi-efbeca7056b93dd17c67b66d1d514d39712e28d6/ompi/communicator/comm.c:344
 7 0x000000000006b24a PMPI_Comm_create()  /build-result/src/hpcx-v2.18-gcc-inbox-ubuntu22.04-cuda12-x86_64/ompi-efbeca7056b93dd17c67b66d1d514d39712e28d6/ompi/mpi/c/profile/pcomm_create.c:66
 8 0x0000000004f1ea87 c10d::ProcessGroupMPI::createProcessGroupMPI()  ???:0
 9 0x0000000000c35470 pybind11::cpp_function::initialize<torch::distributed::c10d::(anonymous namespace)::c10d_init(_object*, _object*)::{lambda(std::vector<int, std::allocator<int> >)#74}, c10::intrusive_ptr<c10d::ProcessGroupMPI, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroupMPI> >, std::vector<int, std::allocator<int> >, pybind11::name, pybind11::scope, pybind11::sibling, pybind11::call_guard<pybind11::gil_scoped_release> >(torch::distributed::c10d::(anonymous namespace)::c10d_init(_object*, _object*)::{lambda(std::vector<int, std::allocator<int> >)#74}&&, c10::intrusive_ptr<c10d::ProcessGroupMPI, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroupMPI> > (*)(std::vector<int, std::allocator<int> >), pybind11::name const&, pybind11::scope const&, pybind11::sibling const&, pybind11::call_guard<pybind11::gil_scoped_release> const&)::{lambda(pybind11::detail::function_call&)#3}::_FUN()  init.cpp:0
10 0x000000000042efb7 pybind11::cpp_function::dispatcher()  :0
11 0x000000000015a10e PyObject_CallFunctionObjArgs()  ???:0
12 0x0000000000150a7b _PyObject_MakeTpCall()  ???:0
13 0x0000000000149629 _PyEval_EvalFrameDefault()  ???:0
14 0x000000000015a9fc _PyFunction_Vectorcall()  ???:0
15 0x000000000014453c _PyEval_EvalFrameDefault()  ???:0
16 0x000000000015a9fc _PyFunction_Vectorcall()  ???:0
17 0x000000000014453c _PyEval_EvalFrameDefault()  ???:0
18 0x000000000015a9fc _PyFunction_Vectorcall()  ???:0
19 0x0000000000169492 PyObject_Call()  ???:0
20 0x00000000001455d7 _PyEval_EvalFrameDefault()  ???:0
21 0x000000000015a9fc _PyFunction_Vectorcall()  ???:0
22 0x000000000014453c _PyEval_EvalFrameDefault()  ???:0
23 0x000000000015a9fc _PyFunction_Vectorcall()  ???:0
24 0x000000000014326d _PyEval_EvalFrameDefault()  ???:0
25 0x000000000015a9fc _PyFunction_Vectorcall()  ???:0
26 0x000000000014453c _PyEval_EvalFrameDefault()  ???:0
27 0x000000000015a9fc _PyFunction_Vectorcall()  ???:0
28 0x000000000014453c _PyEval_EvalFrameDefault()  ???:0
29 0x000000000013f9c6 _PyArg_ParseTuple_SizeT()  ???:0
30 0x0000000000235256 PyEval_EvalCode()  ???:0
31 0x0000000000260108 PyUnicode_Tailmatch()  ???:0
32 0x00000000002599cb PyInit__collections()  ???:0
33 0x000000000025fe55 PyUnicode_Tailmatch()  ???:0
34 0x000000000025f338 _PyRun_SimpleFileObject()  ???:0
35 0x000000000025ef83 _PyRun_AnyFileObject()  ???:0
36 0x0000000000251a5e Py_RunMain()  ???:0
37 0x000000000022802d Py_BytesMain()  ???:0
38 0x0000000000029d90 __libc_init_first()  ???:0
39 0x0000000000029e40 __libc_start_main()  ???:0
40 0x0000000000227f25 _start()  ???:0
=================================
==== backtrace (tid:  10278) ====
 0 0x0000000000042520 __sigaction()  ???:0
 1 0x0000000000042b60 ompi_dpm_group_is_dyn()  /build-result/src/hpcx-v2.18-gcc-inbox-ubuntu22.04-cuda12-x86_64/ompi-efbeca7056b93dd17c67b66d1d514d39712e28d6/ompi/dpm/dpm.c:1267
 2 0x0000000000042b60 ompi_dpm_group_is_dyn()  /build-result/src/hpcx-v2.18-gcc-inbox-ubuntu22.04-cuda12-x86_64/ompi-efbeca7056b93dd17c67b66d1d514d39712e28d6/ompi/dpm/dpm.c:1268
 3 0x0000000000042b60 ompi_dpm_mark_dyncomm()  /build-result/src/hpcx-v2.18-gcc-inbox-ubuntu22.04-cuda12-x86_64/ompi-efbeca7056b93dd17c67b66d1d514d39712e28d6/ompi/dpm/dpm.c:1299
 4 0x0000000000034388 ompi_comm_set_nb()  /build-result/src/hpcx-v2.18-gcc-inbox-ubuntu22.04-cuda12-x86_64/ompi-efbeca7056b93dd17c67b66d1d514d39712e28d6/ompi/communicator/comm.c:215
 5 0x00000000000346ba ompi_comm_set()  /build-result/src/hpcx-v2.18-gcc-inbox-ubuntu22.04-cuda12-x86_64/ompi-efbeca7056b93dd17c67b66d1d514d39712e28d6/ompi/communicator/comm.c:116
 6 0x0000000000034ef3 ompi_comm_create()  /build-result/src/hpcx-v2.18-gcc-inbox-ubuntu22.04-cuda12-x86_64/ompi-efbeca7056b93dd17c67b66d1d514d39712e28d6/ompi/communicator/comm.c:344
 7 0x000000000006b24a PMPI_Comm_create()  /build-result/src/hpcx-v2.18-gcc-inbox-ubuntu22.04-cuda12-x86_64/ompi-efbeca7056b93dd17c67b66d1d514d39712e28d6/ompi/mpi/c/profile/pcomm_create.c:66
 8 0x0000000004f1ea87 c10d::ProcessGroupMPI::createProcessGroupMPI()  ???:0
 9 0x0000000000c35470 pybind11::cpp_function::initialize<torch::distributed::c10d::(anonymous namespace)::c10d_init(_object*, _object*)::{lambda(std::vector<int, std::allocator<int> >)#74}, c10::intrusive_ptr<c10d::ProcessGroupMPI, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroupMPI> >, std::vector<int, std::allocator<int> >, pybind11::name, pybind11::scope, pybind11::sibling, pybind11::call_guard<pybind11::gil_scoped_release> >(torch::distributed::c10d::(anonymous namespace)::c10d_init(_object*, _object*)::{lambda(std::vector<int, std::allocator<int> >)#74}&&, c10::intrusive_ptr<c10d::ProcessGroupMPI, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroupMPI> > (*)(std::vector<int, std::allocator<int> >), pybind11::name const&, pybind11::scope const&, pybind11::sibling const&, pybind11::call_guard<pybind11::gil_scoped_release> const&)::{lambda(pybind11::detail::function_call&)#3}::_FUN()  init.cpp:0
10 0x000000000042efb7 pybind11::cpp_function::dispatcher()  :0
11 0x000000000015a10e PyObject_CallFunctionObjArgs()  ???:0
12 0x0000000000150a7b _PyObject_MakeTpCall()  ???:0
13 0x0000000000149629 _PyEval_EvalFrameDefault()  ???:0
14 0x000000000015a9fc _PyFunction_Vectorcall()  ???:0
15 0x000000000014453c _PyEval_EvalFrameDefault()  ???:0
16 0x000000000015a9fc _PyFunction_Vectorcall()  ???:0
17 0x000000000014453c _PyEval_EvalFrameDefault()  ???:0
18 0x000000000015a9fc _PyFunction_Vectorcall()  ???:0
19 0x0000000000169492 PyObject_Call()  ???:0
20 0x00000000001455d7 _PyEval_EvalFrameDefault()  ???:0
21 0x000000000015a9fc _PyFunction_Vectorcall()  ???:0
22 0x000000000014453c _PyEval_EvalFrameDefault()  ???:0
23 0x000000000015a9fc _PyFunction_Vectorcall()  ???:0
24 0x000000000014326d _PyEval_EvalFrameDefault()  ???:0
25 0x000000000015a9fc _PyFunction_Vectorcall()  ???:0
26 0x000000000014453c _PyEval_EvalFrameDefault()  ???:0
27 0x000000000015a9fc _PyFunction_Vectorcall()  ???:0
28 0x000000000014453c _PyEval_EvalFrameDefault()  ???:0
29 0x000000000013f9c6 _PyArg_ParseTuple_SizeT()  ???:0
30 0x0000000000235256 PyEval_EvalCode()  ???:0
31 0x0000000000260108 PyUnicode_Tailmatch()  ???:0
32 0x00000000002599cb PyInit__collections()  ???:0
33 0x000000000025fe55 PyUnicode_Tailmatch()  ???:0
34 0x000000000025f338 _PyRun_SimpleFileObject()  ???:0
35 0x000000000025ef83 _PyRun_AnyFileObject()  ???:0
36 0x0000000000251a5e Py_RunMain()  ???:0
37 0x000000000022802d Py_BytesMain()  ???:0
38 0x0000000000029d90 __libc_init_first()  ???:0
39 0x0000000000029e40 __libc_start_main()  ???:0
40 0x0000000000227f25 _start()  ???:0
=================================
E1105 06:18:11.067000 140457179309888 torch/distributed/elastic/multiprocessing/api.py:881] failed (exitcode: -11) local_rank: 0 (pid: 10271) of binary: /usr/bin/python
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 879, in main
    run(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
=======================================================
pretrain_gpt.py FAILED
-------------------------------------------------------
Failures:
[1]:
  time      : 2024-11-05_06:18:10
  host      : fb2a7d718a49
  rank      : 1 (local_rank: 1)
  exitcode  : -11 (pid: 10272)
  error_file: <N/A>
  traceback : Signal 11 (SIGSEGV) received by PID 10272
[2]:
  time      : 2024-11-05_06:18:10
  host      : fb2a7d718a49
  rank      : 2 (local_rank: 2)
  exitcode  : -11 (pid: 10273)
  error_file: <N/A>
  traceback : Signal 11 (SIGSEGV) received by PID 10273
[3]:
  time      : 2024-11-05_06:18:10
  host      : fb2a7d718a49
  rank      : 3 (local_rank: 3)
  exitcode  : -11 (pid: 10274)
  error_file: <N/A>
  traceback : Signal 11 (SIGSEGV) received by PID 10274
[4]:
  time      : 2024-11-05_06:18:10
  host      : fb2a7d718a49
  rank      : 4 (local_rank: 4)
  exitcode  : -11 (pid: 10275)
  error_file: <N/A>
  traceback : Signal 11 (SIGSEGV) received by PID 10275
[5]:
  time      : 2024-11-05_06:18:10
  host      : fb2a7d718a49
  rank      : 5 (local_rank: 5)
  exitcode  : -11 (pid: 10276)
  error_file: <N/A>
  traceback : Signal 11 (SIGSEGV) received by PID 10276
[6]:
  time      : 2024-11-05_06:18:10
  host      : fb2a7d718a49
  rank      : 6 (local_rank: 6)
  exitcode  : -11 (pid: 10277)
  error_file: <N/A>
  traceback : Signal 11 (SIGSEGV) received by PID 10277
[7]:
  time      : 2024-11-05_06:18:10
  host      : fb2a7d718a49
  rank      : 7 (local_rank: 7)
  exitcode  : -11 (pid: 10278)
  error_file: <N/A>
  traceback : Signal 11 (SIGSEGV) received by PID 10278
-------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-11-05_06:18:10
  host      : fb2a7d718a49
  rank      : 0 (local_rank: 0)
  exitcode  : -11 (pid: 10271)
  error_file: <N/A>
  traceback : Signal 11 (SIGSEGV) received by PID 10271
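
The exit code -11 reported for every rank follows the common convention that a negative exit code is the negated number of the signal that terminated the process. A minimal Python sketch of that mapping is below; the helper name `describe_exitcode` is hypothetical and not part of torchrun.

```python
import signal


def describe_exitcode(exitcode: int) -> str:
    """Return the signal name for a negative exit code, else a plain description."""
    if exitcode < 0:
        try:
            # Negative exit codes encode the terminating signal number.
            return signal.Signals(-exitcode).name
        except ValueError:
            return f"unknown signal {-exitcode}"
    return f"exited with status {exitcode}"


# Exit code taken from the failure summary in the log.
print(describe_exitcode(-11))
```

On Linux, signal 11 is SIGSEGV, which matches the "Signal 11 (SIGSEGV)" lines in the summary.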

Environment (please complete the following information):

  • megatron: 3d27a9d
  • image: nvcr.io/nvidia/pytorch:24.04-py3

@ltm920716 ltm920716 changed the title [BUG] --tp [BUG] training crash when set --tp-comm-overlap Nov 5, 2024
@wplf

wplf commented Nov 6, 2024

Hello, you can pull the latest code and use --tp-comm-bootstrap-backend nccl to specify the TP backend.
This might help you.

@ltm920716
Author

Hello, you can pull the latest code and use --tp-comm-bootstrap-backend nccl to specify the TP backend. This might help you.

Hi,
I set the following:

MODEL_PARALLEL_ARGS=(
        --tensor-model-parallel-size 2
        --pipeline-model-parallel-size 2
        --use-flash-attn
        --sequence-parallel
        --overlap-grad-reduce
        --recompute-activations
        --recompute-granularity selective
        --tp-comm-bootstrap-backend nccl
        --tp-comm-overlap
)

I also pulled the latest repo, but I get the same error.

@wplf

wplf commented Nov 6, 2024

How about trying the newest TE?
Your log shows: Transformer Engine v1.5.0+6a9edc3 supports only MPI bootstrap backend.
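
The warning quoted above, together with the backtrace ending in c10d::ProcessGroupMPI::createProcessGroupMPI, suggests the MPI bootstrap path was still taken despite --tp-comm-bootstrap-backend nccl. A minimal sketch for checking a Transformer Engine version string against a minimum before relying on that flag is below. The helper names and the 1.9.0 threshold are assumptions for illustration; verify the actual threshold against the Megatron-LM and Transformer Engine versions in use.

```python
import re


def release_tuple(version: str) -> tuple:
    """Extract the leading numeric release, e.g. '1.5.0+6a9edc3' -> (1, 5, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def supports_non_mpi_bootstrap(installed: str, minimum: str = "1.9.0") -> bool:
    """True if `installed` is at least `minimum` (assumed threshold, verify it)."""
    a, b = release_tuple(installed), release_tuple(minimum)
    width = max(len(a), len(b))
    # Pad with zeros so that '1.9' compares equal to '1.9.0'.
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


# Version string taken from the warning in the log.
print(supports_non_mpi_bootstrap("1.5.0+6a9edc3"))
```

Under the assumed threshold, the version from the log is reported as too old, which would be consistent with the warning.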

@ltm920716
Author

How about trying the newest TE? Your log shows: Transformer Engine v1.5.0+6a9edc3 supports only MPI bootstrap backend.

Sorry, I have set --transformer-impl local, and the error still occurs.

@wplf

wplf commented Nov 6, 2024

Maybe the local implementation does not support --tp-comm-overlap.
I strongly suggest using TE.
