[BUG] training crash when set --tp-comm-overlap #1274

ltm920716 · 2024-11-05T06:27:28Z

Describe the bug
Training crashes with a segmentation fault on every rank when --tp-comm-overlap is set.

To Reproduce
MODEL_PARALLEL_ARGS=(
--tensor-model-parallel-size 8
--pipeline-model-parallel-size 1
--use-flash-attn
--sequence-parallel
--tp-comm-overlap
)

docker run --rm --gpus=all --shm-size=10g --ulimit memlock=-1 --ulimit stack=67108864 --ipc=host -v /mnt/data01/fake_data:/home nvcr.io/nvidia/pytorch:24.04-py3 bash -c "cd /home/Megatron-LM && bash examples/gpt3/single.sh"
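
One thing that may be worth checking with this repro is which launcher started the ranks. The logs in this report warn that Transformer Engine v1.5.0 supports only the MPI bootstrap backend, and the backtraces shown fail inside c10d::ProcessGroupMPI::createProcessGroupMPI(). The sketch below is only a diagnostic, not a confirmed root cause. It makes two assumptions: detect_launcher is a hypothetical helper name (it does not exist in Megatron-LM, Transformer Engine, or PyTorch), and the launcher can be guessed from the environment variables that Open MPI's mpirun (OMPI_COMM_WORLD_RANK) and torchrun (RANK, LOCAL_RANK, WORLD_SIZE) export to each rank.

```python
import os
from typing import Mapping, Optional


def detect_launcher(env: Optional[Mapping[str, str]] = None) -> str:
    """Guess which launcher started this process from its environment.

    Hypothetical helper written for this report; it is not part of
    Megatron-LM, Transformer Engine, or PyTorch.
    """
    env = os.environ if env is None else env
    # Open MPI's mpirun exports OMPI_COMM_WORLD_* to every rank it starts.
    if "OMPI_COMM_WORLD_RANK" in env:
        return "mpirun"
    # torchrun exports RANK, LOCAL_RANK and WORLD_SIZE to every worker.
    if "RANK" in env and "LOCAL_RANK" in env and "WORLD_SIZE" in env:
        return "torchrun"
    # Anything else (plain `python`, a custom launcher) is left undetermined.
    return "unknown"


if __name__ == "__main__":
    print(f"pid={os.getpid()} launcher={detect_launcher()}")
```

Running this at the top of the training entry point on each rank would show whether the processes were started under an MPI launcher at all. If they report torchrun or unknown while the MPI process group is being created, that would be a reasonable lead to follow, though it is only a guess from the logs.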

Expected behavior
Training runs successfully with --tp-comm-overlap enabled.

Stack trace/logs

/home/Megatron-LM/megatron/training/initialize.py:227: UserWarning: Transformer Engine v1.5.0+6a9edc3 supports only MPI bootstrap backend.
  warnings.warn(
/home/Megatron-LM/megatron/training/initialize.py:227: UserWarning: Transformer Engine v1.5.0+6a9edc3 supports only MPI bootstrap backend.
  warnings.warn(
/home/Megatron-LM/megatron/training/initialize.py:227: UserWarning: Transformer Engine v1.5.0+6a9edc3 supports only MPI bootstrap backend.
  warnings.warn(
/home/Megatron-LM/megatron/training/initialize.py:227: UserWarning: Transformer Engine v1.5.0+6a9edc3 supports only MPI bootstrap backend.
  warnings.warn(
/home/Megatron-LM/megatron/training/initialize.py:227: UserWarning: Transformer Engine v1.5.0+6a9edc3 supports only MPI bootstrap backend.
  warnings.warn(
/home/Megatron-LM/megatron/training/initialize.py:227: UserWarning: Transformer Engine v1.5.0+6a9edc3 supports only MPI bootstrap backend.
  warnings.warn(
/home/Megatron-LM/megatron/training/initialize.py:227: UserWarning: Transformer Engine v1.5.0+6a9edc3 supports only MPI bootstrap backend.
  warnings.warn(
/home/Megatron-LM/megatron/training/initialize.py:227: UserWarning: Transformer Engine v1.5.0+6a9edc3 supports only MPI bootstrap backend.
  warnings.warn(
[fb2a7d718a49:10272:0:10272] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x7f01bad4bea8)
[fb2a7d718a49:10277:0:10277] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x28)
[fb2a7d718a49:10275:0:10275] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x28)
[fb2a7d718a49:10276:0:10276] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x28)
[fb2a7d718a49:10271:0:10271] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x28)
[fb2a7d718a49:10274:0:10274] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x28)
[fb2a7d718a49:10273:0:10273] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x28)
[fb2a7d718a49:10278:0:10278] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x28)
==== backtrace (tid:  10272) ====
 0 0x0000000000042520 __sigaction()  ???:0
 1 0x000000000004d128 ompi_group_increment_proc_count()  /build-result/src/hpcx-v2.18-gcc-inbox-ubuntu22.04-cuda12-x86_64/ompi-efbeca7056b93dd17c67b66d1d514d39712e28d6/ompi/group/group_init.c:229
 2 0x000000000004d128 opal_atomic_add_fetch_32()  /build-result/src/hpcx-v2.18-gcc-inbox-ubuntu22.04-cuda12-x86_64/ompi-efbeca7056b93dd17c67b66d1d514d39712e28d6/ompi/../opal/include/opal/sys/atomic_impl.h:384
 3 0x000000000004d128 opal_thread_add_fetch_32()  /build-result/src/hpcx-v2.18-gcc-inbox-ubuntu22.04-cuda12-x86_64/ompi-efbeca7056b93dd17c67b66d1d514d39712e28d6/ompi/../opal/threads/thread_usage.h:152
 4 0x000000000004d128 opal_obj_update()  /build-result/src/hpcx-v2.18-gcc-inbox-ubuntu22.04-cuda12-x86_64/ompi-efbeca7056b93dd17c67b66d1d514d39712e28d6/ompi/../opal/class/opal_object.h:534
 5 0x000000000004d128 ompi_group_increment_proc_count()  /build-result/src/hpcx-v2.18-gcc-inbox-ubuntu22.04-cuda12-x86_64/ompi-efbeca7056b93dd17c67b66d1d514d39712e28d6/ompi/group/group_init.c:226
 6 0x000000000004d9e9 ompi_group_incl_plist()  /build-result/src/hpcx-v2.18-gcc-inbox-ubuntu22.04-cuda12-x86_64/ompi-efbeca7056b93dd17c67b66d1d514d39712e28d6/ompi/group/group_plist.c:128
 7 0x000000000007421b PMPI_Group_incl()  /build-result/src/hpcx-v2.18-gcc-inbox-ubuntu22.04-cuda12-x86_64/ompi-efbeca7056b93dd17c67b66d1d514d39712e28d6/ompi/mpi/c/profile/pgroup_incl.c:87
 8 0x0000000004f1ea5d c10d::ProcessGroupMPI::createProcessGroupMPI()  ???:0
 9 0x0000000000c35470 pybind11::cpp_function::initialize<torch::distributed::c10d::(anonymous namespace)::c10d_init(_object*, _object*)::{lambda(std::vector<int, std::allocator<int> >)#74}, c10::intrusive_ptr<c10d::ProcessGroupMPI, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroupMPI> >, std::vector<int, std::allocator<int> >, pybind11::name, pybind11::scope, pybind11::sibling, pybind11::call_guard<pybind11::gil_scoped_release> >(torch::distributed::c10d::(anonymous namespace)::c10d_init(_object*, _object*)::{lambda(std::vector<int, std::allocator<int> >)#74}&&, c10::intrusive_ptr<c10d::ProcessGroupMPI, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroupMPI> > (*)(std::vector<int, std::allocator<int> >), pybind11::name const&, pybind11::scope const&, pybind11::sibling const&, pybind11::call_guard<pybind11::gil_scoped_release> const&)::{lambda(pybind11::detail::function_call&)#3}::_FUN()  init.cpp:0
10 0x000000000042efb7 pybind11::cpp_function::dispatcher()  :0
11 0x000000000015a10e PyObject_CallFunctionObjArgs()  ???:0
12 0x0000000000150a7b _PyObject_MakeTpCall()  ???:0
13 0x0000000000149629 _PyEval_EvalFrameDefault()  ???:0
14 0x000000000015a9fc _PyFunction_Vectorcall()  ???:0
15 0x000000000014453c _PyEval_EvalFrameDefault()  ???:0
16 0x000000000015a9fc _PyFunction_Vectorcall()  ???:0
17 0x000000000014453c _PyEval_EvalFrameDefault()  ???:0
18 0x000000000015a9fc _PyFunction_Vectorcall()  ???:0
19 0x0000000000169492 PyObject_Call()  ???:0
20 0x00000000001455d7 _PyEval_EvalFrameDefault()  ???:0
21 0x000000000015a9fc _PyFunction_Vectorcall()  ???:0
22 0x000000000014453c _PyEval_EvalFrameDefault()  ???:0
23 0x000000000015a9fc _PyFunction_Vectorcall()  ???:0
24 0x000000000014326d _PyEval_EvalFrameDefault()  ???:0
25 0x000000000015a9fc _PyFunction_Vectorcall()  ???:0
26 0x000000000014453c _PyEval_EvalFrameDefault()  ???:0
27 0x000000000015a9fc _PyFunction_Vectorcall()  ???:0
28 0x000000000014453c _PyEval_EvalFrameDefault()  ???:0
29 0x000000000013f9c6 _PyArg_ParseTuple_SizeT()  ???:0
30 0x0000000000235256 PyEval_EvalCode()  ???:0
31 0x0000000000260108 PyUnicode_Tailmatch()  ???:0
32 0x00000000002599cb PyInit__collections()  ???:0
33 0x000000000025fe55 PyUnicode_Tailmatch()  ???:0
34 0x000000000025f338 _PyRun_SimpleFileObject()  ???:0
35 0x000000000025ef83 _PyRun_AnyFileObject()  ???:0
36 0x0000000000251a5e Py_RunMain()  ???:0
37 0x000000000022802d Py_BytesMain()  ???:0
38 0x0000000000029d90 __libc_init_first()  ???:0
39 0x0000000000029e40 __libc_start_main()  ???:0
40 0x0000000000227f25 _start()  ???:0
=================================
==== backtrace (tid:  10277) ====
 0 0x0000000000042520 __sigaction()  ???:0
 1 0x0000000000042b60 ompi_dpm_group_is_dyn()  /build-result/src/hpcx-v2.18-gcc-inbox-ubuntu22.04-cuda12-x86_64/ompi-efbeca7056b93dd17c67b66d1d514d39712e28d6/ompi/dpm/dpm.c:1267
 2 0x0000000000042b60 ompi_dpm_group_is_dyn()  /build-result/src/hpcx-v2.18-gcc-inbox-ubuntu22.04-cuda12-x86_64/ompi-efbeca7056b93dd17c67b66d1d514d39712e28d6/ompi/dpm/dpm.c:1268
 3 0x0000000000042b60 ompi_dpm_mark_dyncomm()  /build-result/src/hpcx-v2.18-gcc-inbox-ubuntu22.04-cuda12-x86_64/ompi-efbeca7056b93dd17c67b66d1d514d39712e28d6/ompi/dpm/dpm.c:1299
 4 0x0000000000034388 ompi_comm_set_nb()  /build-result/src/hpcx-v2.18-gcc-inbox-ubuntu22.04-cuda12-x86_64/ompi-efbeca7056b93dd17c67b66d1d514d39712e28d6/ompi/communicator/comm.c:215
 5 0x00000000000346ba ompi_comm_set()  /build-result/src/hpcx-v2.18-gcc-inbox-ubuntu22.04-cuda12-x86_64/ompi-efbeca7056b93dd17c67b66d1d514d39712e28d6/ompi/communicator/comm.c:116
 6 0x0000000000034ef3 ompi_comm_create()  /build-result/src/hpcx-v2.18-gcc-inbox-ubuntu22.04-cuda12-x86_64/ompi-efbeca7056b93dd17c67b66d1d514d39712e28d6/ompi/communicator/comm.c:344
 7 0x000000000006b24a PMPI_Comm_create()  /build-result/src/hpcx-v2.18-gcc-inbox-ubuntu22.04-cuda12-x86_64/ompi-efbeca7056b93dd17c67b66d1d514d39712e28d6/ompi/mpi/c/profile/pcomm_create.c:66
 8 0x0000000004f1ea87 c10d::ProcessGroupMPI::createProcessGroupMPI()  ???:0
 9 0x0000000000c35470 pybind11::cpp_function::initialize<torch::distributed::c10d::(anonymous namespace)::c10d_init(_object*, _object*)::{lambda(std::vector<int, std::allocator<int> >)#74}, c10::intrusive_ptr<c10d::ProcessGroupMPI, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroupMPI> >, std::vector<int, std::allocator<int> >, pybind11::name, pybind11::scope, pybind11::sibling, pybind11::call_guard<pybind11::gil_scoped_release> >(torch::distributed::c10d::(anonymous namespace)::c10d_init(_object*, _object*)::{lambda(std::vector<int, std::allocator<int> >)#74}&&, c10::intrusive_ptr<c10d::ProcessGroupMPI, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroupMPI> > (*)(std::vector<int, std::allocator<int> >), pybind11::name const&, pybind11::scope const&, pybind11::sibling const&, pybind11::call_guard<pybind11::gil_scoped_release> const&)::{lambda(pybind11::detail::function_call&)#3}::_FUN()  init.cpp:0
10 0x000000000042efb7 pybind11::cpp_function::dispatcher()  :0
11 0x000000000015a10e PyObject_CallFunctionObjArgs()  ???:0
12 0x0000000000150a7b _PyObject_MakeTpCall()  ???:0
13 0x0000000000149629 _PyEval_EvalFrameDefault()  ???:0
14 0x000000000015a9fc _PyFunction_Vectorcall()  ???:0
15 0x000000000014453c _PyEval_EvalFrameDefault()  ???:0
16 0x000000000015a9fc _PyFunction_Vectorcall()  ???:0
17 0x000000000014453c _PyEval_EvalFrameDefault()  ???:0
18 0x000000000015a9fc _PyFunction_Vectorcall()  ???:0
19 0x0000000000169492 PyObject_Call()  ???:0
20 0x00000000001455d7 _PyEval_EvalFrameDefault()  ???:0
21 0x000000000015a9fc _PyFunction_Vectorcall()  ???:0
22 0x000000000014453c _PyEval_EvalFrameDefault()  ???:0
23 0x000000000015a9fc _PyFunction_Vectorcall()  ???:0
24 0x000000000014326d _PyEval_EvalFrameDefault()  ???:0
25 0x000000000015a9fc _PyFunction_Vectorcall()  ???:0
26 0x000000000014453c _PyEval_EvalFrameDefault()  ???:0
27 0x000000000015a9fc _PyFunction_Vectorcall()  ???:0
28 0x000000000014453c _PyEval_EvalFrameDefault()  ???:0
29 0x000000000013f9c6 _PyArg_ParseTuple_SizeT()  ???:0
30 0x0000000000235256 PyEval_EvalCode()  ???:0
31 0x0000000000260108 PyUnicode_Tailmatch()  ???:0
32 0x00000000002599cb PyInit__collections()  ???:0
33 0x000000000025fe55 PyUnicode_Tailmatch()  ???:0
34 0x000000000025f338 _PyRun_SimpleFileObject()  ???:0
35 0x000000000025ef83 _PyRun_AnyFileObject()  ???:0
36 0x0000000000251a5e Py_RunMain()  ???:0
37 0x000000000022802d Py_BytesMain()  ???:0
38 0x0000000000029d90 __libc_init_first()  ???:0
39 0x0000000000029e40 __libc_start_main()  ???:0
40 0x0000000000227f25 _start()  ???:0
=================================
==== backtrace (tid:  10275) ====
 0 0x0000000000042520 __sigaction()  ???:0
 1 0x0000000000042b60 ompi_dpm_group_is_dyn()  /build-result/src/hpcx-v2.18-gcc-inbox-ubuntu22.04-cuda12-x86_64/ompi-efbeca7056b93dd17c67b66d1d514d39712e28d6/ompi/dpm/dpm.c:1267
 2 0x0000000000042b60 ompi_dpm_group_is_dyn()  /build-result/src/hpcx-v2.18-gcc-inbox-ubuntu22.04-cuda12-x86_64/ompi-efbeca7056b93dd17c67b66d1d514d39712e28d6/ompi/dpm/dpm.c:1268
 3 0x0000000000042b60 ompi_dpm_mark_dyncomm()  /build-result/src/hpcx-v2.18-gcc-inbox-ubuntu22.04-cuda12-x86_64/ompi-efbeca7056b93dd17c67b66d1d514d39712e28d6/ompi/dpm/dpm.c:1299
 4 0x0000000000034388 ompi_comm_set_nb()  /build-result/src/hpcx-v2.18-gcc-inbox-ubuntu22.04-cuda12-x86_64/ompi-efbeca7056b93dd17c67b66d1d514d39712e28d6/ompi/communicator/comm.c:215
 5 0x00000000000346ba ompi_comm_set()  /build-result/src/hpcx-v2.18-gcc-inbox-ubuntu22.04-cuda12-x86_64/ompi-efbeca7056b93dd17c67b66d1d514d39712e28d6/ompi/communicator/comm.c:116
 6 0x0000000000034ef3 ompi_comm_create()  /build-result/src/hpcx-v2.18-gcc-inbox-ubuntu22.04-cuda12-x86_64/ompi-efbeca7056b93dd17c67b66d1d514d39712e28d6/ompi/communicator/comm.c:344
 7 0x000000000006b24a PMPI_Comm_create()  /build-result/src/hpcx-v2.18-gcc-inbox-ubuntu22.04-cuda12-x86_64/ompi-efbeca7056b93dd17c67b66d1d514d39712e28d6/ompi/mpi/c/profile/pcomm_create.c:66
 8 0x0000000004f1ea87 c10d::ProcessGroupMPI::createProcessGroupMPI()  ???:0
 9 0x0000000000c35470 pybind11::cpp_function::initialize<torch::distributed::c10d::(anonymous namespace)::c10d_init(_object*, _object*)::{lambda(std::vector<int, std::allocator<int> >)#74}, c10::intrusive_ptr<c10d::ProcessGroupMPI, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroupMPI> >, std::vector<int, std::allocator<int> >, pybind11::name, pybind11::scope, pybind11::sibling, pybind11::call_guard<pybind11::gil_scoped_release> >(torch::distributed::c10d::(anonymous namespace)::c10d_init(_object*, _object*)::{lambda(std::vector<int, std::allocator<int> >)#74}&&, c10::intrusive_ptr<c10d::ProcessGroupMPI, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroupMPI> > (*)(std::vector<int, std::allocator<int> >), pybind11::name const&, pybind11::scope const&, pybind11::sibling const&, pybind11::call_guard<pybind11::gil_scoped_release> const&)::{lambda(pybind11::detail::function_call&)#3}::_FUN()  init.cpp:0
10 0x000000000042efb7 pybind11::cpp_function::dispatcher()  :0
11 0x000000000015a10e PyObject_CallFunctionObjArgs()  ???:0
12 0x0000000000150a7b _PyObject_MakeTpCall()  ???:0
13 0x0000000000149629 _PyEval_EvalFrameDefault()  ???:0
14 0x000000000015a9fc _PyFunction_Vectorcall()  ???:0
15 0x000000000014453c _PyEval_EvalFrameDefault()  ???:0
16 0x000000000015a9fc _PyFunction_Vectorcall()  ???:0
17 0x000000000014453c _PyEval_EvalFrameDefault()  ???:0
18 0x000000000015a9fc _PyFunction_Vectorcall()  ???:0
19 0x0000000000169492 PyObject_Call()  ???:0
20 0x00000000001455d7 _PyEval_EvalFrameDefault()  ???:0
21 0x000000000015a9fc _PyFunction_Vectorcall()  ???:0
22 0x000000000014453c _PyEval_EvalFrameDefault()  ???:0
23 0x000000000015a9fc _PyFunction_Vectorcall()  ???:0
24 0x000000000014326d _PyEval_EvalFrameDefault()  ???:0
25 0x000000000015a9fc _PyFunction_Vectorcall()  ???:0
26 0x000000000014453c _PyEval_EvalFrameDefault()  ???:0
27 0x000000000015a9fc _PyFunction_Vectorcall()  ???:0
28 0x000000000014453c _PyEval_EvalFrameDefault()  ???:0
29 0x000000000013f9c6 _PyArg_ParseTuple_SizeT()  ???:0
30 0x0000000000235256 PyEval_EvalCode()  ???:0
31 0x0000000000260108 PyUnicode_Tailmatch()  ???:0
32 0x00000000002599cb PyInit__collections()  ???:0
33 0x000000000025fe55 PyUnicode_Tailmatch()  ???:0
34 0x000000000025f338 _PyRun_SimpleFileObject()  ???:0
35 0x000000000025ef83 _PyRun_AnyFileObject()  ???:0
36 0x0000000000251a5e Py_RunMain()  ???:0
37 0x000000000022802d Py_BytesMain()  ???:0
38 0x0000000000029d90 __libc_init_first()  ???:0
39 0x0000000000029e40 __libc_start_main()  ???:0
40 0x0000000000227f25 _start()  ???:0
=================================
==== backtrace (tid:  10276) ====
 0 0x0000000000042520 __sigaction()  ???:0
 1 0x0000000000042b60 ompi_dpm_group_is_dyn()  /build-result/src/hpcx-v2.18-gcc-inbox-ubuntu22.04-cuda12-x86_64/ompi-efbeca7056b93dd17c67b66d1d514d39712e28d6/ompi/dpm/dpm.c:1267
 2 0x0000000000042b60 ompi_dpm_group_is_dyn()  /build-result/src/hpcx-v2.18-gcc-inbox-ubuntu22.04-cuda12-x86_64/ompi-efbeca7056b93dd17c67b66d1d514d39712e28d6/ompi/dpm/dpm.c:1268
 3 0x0000000000042b60 ompi_dpm_mark_dyncomm()  /build-result/src/hpcx-v2.18-gcc-inbox-ubuntu22.04-cuda12-x86_64/ompi-efbeca7056b93dd17c67b66d1d514d39712e28d6/ompi/dpm/dpm.c:1299
 4 0x0000000000034388 ompi_comm_set_nb()  /build-result/src/hpcx-v2.18-gcc-inbox-ubuntu22.04-cuda12-x86_64/ompi-efbeca7056b93dd17c67b66d1d514d39712e28d6/ompi/communicator/comm.c:215
 5 0x00000000000346ba ompi_comm_set()  /build-result/src/hpcx-v2.18-gcc-inbox-ubuntu22.04-cuda12-x86_64/ompi-efbeca7056b93dd17c67b66d1d514d39712e28d6/ompi/communicator/comm.c:116
 6 0x0000000000034ef3 ompi_comm_create()  /build-result/src/hpcx-v2.18-gcc-inbox-ubuntu22.04-cuda12-x86_64/ompi-efbeca7056b93dd17c67b66d1d514d39712e28d6/ompi/communicator/comm.c:344
 7 0x000000000006b24a PMPI_Comm_create()  /build-result/src/hpcx-v2.18-gcc-inbox-ubuntu22.04-cuda12-x86_64/ompi-efbeca7056b93dd17c67b66d1d514d39712e28d6/ompi/mpi/c/profile/pcomm_create.c:66
 8 0x0000000004f1ea87 c10d::ProcessGroupMPI::createProcessGroupMPI()  ???:0
 9 0x0000000000c35470 pybind11::cpp_function::initialize<torch::distributed::c10d::(anonymous namespace)::c10d_init(_object*, _object*)::{lambda(std::vector<int, std::allocator<int> >)#74}, c10::intrusive_ptr<c10d::ProcessGroupMPI, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroupMPI> >, std::vector<int, std::allocator<int> >, pybind11::name, pybind11::scope, pybind11::sibling, pybind11::call_guard<pybind11::gil_scoped_release> >(torch::distributed::c10d::(anonymous namespace)::c10d_init(_object*, _object*)::{lambda(std::vector<int, std::allocator<int> >)#74}&&, c10::intrusive_ptr<c10d::ProcessGroupMPI, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroupMPI> > (*)(std::vector<int, std::allocator<int> >), pybind11::name const&, pybind11::scope const&, pybind11::sibling const&, pybind11::call_guard<pybind11::gil_scoped_release> const&)::{lambda(pybind11::detail::function_call&)#3}::_FUN()  init.cpp:0
10 0x000000000042efb7 pybind11::cpp_function::dispatcher()  :0
11 0x000000000015a10e PyObject_CallFunctionObjArgs()  ???:0
12 0x0000000000150a7b _PyObject_MakeTpCall()  ???:0
13 0x0000000000149629 _PyEval_EvalFrameDefault()  ???:0
14 0x000000000015a9fc _PyFunction_Vectorcall()  ???:0
15 0x000000000014453c _PyEval_EvalFrameDefault()  ???:0
16 0x000000000015a9fc _PyFunction_Vectorcall()  ???:0
17 0x000000000014453c _PyEval_EvalFrameDefault()  ???:0
18 0x000000000015a9fc _PyFunction_Vectorcall()  ???:0
19 0x0000000000169492 PyObject_Call()  ???:0
20 0x00000000001455d7 _PyEval_EvalFrameDefault()  ???:0
21 0x000000000015a9fc _PyFunction_Vectorcall()  ???:0
22 0x000000000014453c _PyEval_EvalFrameDefault()  ???:0
23 0x000000000015a9fc _PyFunction_Vectorcall()  ???:0
24 0x000000000014326d _PyEval_EvalFrameDefault()  ???:0
25 0x000000000015a9fc _PyFunction_Vectorcall()  ???:0
26 0x000000000014453c _PyEval_EvalFrameDefault()  ???:0
27 0x000000000015a9fc _PyFunction_Vectorcall()  ???:0
28 0x000000000014453c _PyEval_EvalFrameDefault()  ???:0
29 0x000000000013f9c6 _PyArg_ParseTuple_SizeT()  ???:0
30 0x0000000000235256 PyEval_EvalCode()  ???:0
31 0x0000000000260108 PyUnicode_Tailmatch()  ???:0
32 0x00000000002599cb PyInit__collections()  ???:0
33 0x000000000025fe55 PyUnicode_Tailmatch()  ???:0
34 0x000000000025f338 _PyRun_SimpleFileObject()  ???:0
35 0x000000000025ef83 _PyRun_AnyFileObject()  ???:0
36 0x0000000000251a5e Py_RunMain()  ???:0
37 0x000000000022802d Py_BytesMain()  ???:0
38 0x0000000000029d90 __libc_init_first()  ???:0
39 0x0000000000029e40 __libc_start_main()  ???:0
40 0x0000000000227f25 _start()  ???:0
=================================
==== backtrace (tid:  10271) ====
 0 0x0000000000042520 __sigaction()  ???:0
 1 0x0000000000042b60 ompi_dpm_group_is_dyn()  /build-result/src/hpcx-v2.18-gcc-inbox-ubuntu22.04-cuda12-x86_64/ompi-efbeca7056b93dd17c67b66d1d514d39712e28d6/ompi/dpm/dpm.c:1267
 2 0x0000000000042b60 ompi_dpm_group_is_dyn()  /build-result/src/hpcx-v2.18-gcc-inbox-ubuntu22.04-cuda12-x86_64/ompi-efbeca7056b93dd17c67b66d1d514d39712e28d6/ompi/dpm/dpm.c:1268
 3 0x0000000000042b60 ompi_dpm_mark_dyncomm()  /build-result/src/hpcx-v2.18-gcc-inbox-ubuntu22.04-cuda12-x86_64/ompi-efbeca7056b93dd17c67b66d1d514d39712e28d6/ompi/dpm/dpm.c:1299
 4 0x0000000000034388 ompi_comm_set_nb()  /build-result/src/hpcx-v2.18-gcc-inbox-ubuntu22.04-cuda12-x86_64/ompi-efbeca7056b93dd17c67b66d1d514d39712e28d6/ompi/communicator/comm.c:215
 5 0x00000000000346ba ompi_comm_set()  /build-result/src/hpcx-v2.18-gcc-inbox-ubuntu22.04-cuda12-x86_64/ompi-efbeca7056b93dd17c67b66d1d514d39712e28d6/ompi/communicator/comm.c:116
 6 0x0000000000034ef3 ompi_comm_create()  /build-result/src/hpcx-v2.18-gcc-inbox-ubuntu22.04-cuda12-x86_64/ompi-efbeca7056b93dd17c67b66d1d514d39712e28d6/ompi/communicator/comm.c:344
 7 0x000000000006b24a PMPI_Comm_create()  /build-result/src/hpcx-v2.18-gcc-inbox-ubuntu22.04-cuda12-x86_64/ompi-efbeca7056b93dd17c67b66d1d514d39712e28d6/ompi/mpi/c/profile/pcomm_create.c:66
 8 0x0000000004f1ea87 c10d::ProcessGroupMPI::createProcessGroupMPI()  ???:0
 9 0x0000000000c35470 pybind11::cpp_function::initialize<torch::distributed::c10d::(anonymous namespace)::c10d_init(_object*, _object*)::{lambda(std::vector<int, std::allocator<int> >)#74}, c10::intrusive_ptr<c10d::ProcessGroupMPI, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroupMPI> >, std::vector<int, std::allocator<int> >, pybind11::name, pybind11::scope, pybind11::sibling, pybind11::call_guard<pybind11::gil_scoped_release> >(torch::distributed::c10d::(anonymous namespace)::c10d_init(_object*, _object*)::{lambda(std::vector<int, std::allocator<int> >)#74}&&, c10::intrusive_ptr<c10d::ProcessGroupMPI, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroupMPI> > (*)(std::vector<int, std::allocator<int> >), pybind11::name const&, pybind11::scope const&, pybind11::sibling const&, pybind11::call_guard<pybind11::gil_scoped_release> const&)::{lambda(pybind11::detail::function_call&)#3}::_FUN()  init.cpp:0
10 0x000000000042efb7 pybind11::cpp_function::dispatcher()  :0
11 0x000000000015a10e PyObject_CallFunctionObjArgs()  ???:0
12 0x0000000000150a7b _PyObject_MakeTpCall()  ???:0
13 0x0000000000149629 _PyEval_EvalFrameDefault()  ???:0
14 0x000000000015a9fc _PyFunction_Vectorcall()  ???:0
15 0x000000000014453c _PyEval_EvalFrameDefault()  ???:0
16 0x000000000015a9fc _PyFunction_Vectorcall()  ???:0
17 0x000000000014453c _PyEval_EvalFrameDefault()  ???:0
18 0x000000000015a9fc _PyFunction_Vectorcall()  ???:0
19 0x0000000000169492 PyObject_Call()  ???:0
20 0x00000000001455d7 _PyEval_EvalFrameDefault()  ???:0
21 0x000000000015a9fc _PyFunction_Vectorcall()  ???:0
22 0x000000000014453c _PyEval_EvalFrameDefault()  ???:0
23 0x000000000015a9fc _PyFunction_Vectorcall()  ???:0
24 0x000000000014326d _PyEval_EvalFrameDefault()  ???:0
25 0x000000000015a9fc _PyFunction_Vectorcall()  ???:0
26 0x000000000014453c _PyEval_EvalFrameDefault()  ???:0
27 0x000000000015a9fc _PyFunction_Vectorcall()  ???:0
28 0x000000000014453c _PyEval_EvalFrameDefault()  ???:0
29 0x000000000013f9c6 _PyArg_ParseTuple_SizeT()  ???:0
30 0x0000000000235256 PyEval_EvalCode()  ???:0
31 0x0000000000260108 PyUnicode_Tailmatch()  ???:0
32 0x00000000002599cb PyInit__collections()  ???:0
33 0x000000000025fe55 PyUnicode_Tailmatch()  ???:0
34 0x000000000025f338 _PyRun_SimpleFileObject()  ???:0
35 0x000000000025ef83 _PyRun_AnyFileObject()  ???:0
36 0x0000000000251a5e Py_RunMain()  ???:0
37 0x000000000022802d Py_BytesMain()  ???:0
38 0x0000000000029d90 __libc_init_first()  ???:0
39 0x0000000000029e40 __libc_start_main()  ???:0
40 0x0000000000227f25 _start()  ???:0
=================================
==== backtrace (tid:  10273) ====
 0 0x0000000000042520 __sigaction()  ???:0
 1 0x0000000000042b60 ompi_dpm_group_is_dyn()  /build-result/src/hpcx-v2.18-gcc-inbox-ubuntu22.04-cuda12-x86_64/ompi-efbeca7056b93dd17c67b66d1d514d39712e28d6/ompi/dpm/dpm.c:1267
 2 0x0000000000042b60 ompi_dpm_group_is_dyn()  /build-result/src/hpcx-v2.18-gcc-inbox-ubuntu22.04-cuda12-x86_64/ompi-efbeca7056b93dd17c67b66d1d514d39712e28d6/ompi/dpm/dpm.c:1268
 3 0x0000000000042b60 ompi_dpm_mark_dyncomm()  /build-result/src/hpcx-v2.18-gcc-inbox-ubuntu22.04-cuda12-x86_64/ompi-efbeca7056b93dd17c67b66d1d514d39712e28d6/ompi/dpm/dpm.c:1299
 4 0x0000000000034388 ompi_comm_set_nb()  /build-result/src/hpcx-v2.18-gcc-inbox-ubuntu22.04-cuda12-x86_64/ompi-efbeca7056b93dd17c67b66d1d514d39712e28d6/ompi/communicator/comm.c:215
 5 0x00000000000346ba ompi_comm_set()  /build-result/src/hpcx-v2.18-gcc-inbox-ubuntu22.04-cuda12-x86_64/ompi-efbeca7056b93dd17c67b66d1d514d39712e28d6/ompi/communicator/comm.c:116
 6 0x0000000000034ef3 ompi_comm_create()  /build-result/src/hpcx-v2.18-gcc-inbox-ubuntu22.04-cuda12-x86_64/ompi-efbeca7056b93dd17c67b66d1d514d39712e28d6/ompi/communicator/comm.c:344
 7 0x000000000006b24a PMPI_Comm_create()  /build-result/src/hpcx-v2.18-gcc-inbox-ubuntu22.04-cuda12-x86_64/ompi-efbeca7056b93dd17c67b66d1d514d39712e28d6/ompi/mpi/c/profile/pcomm_create.c:66
 8 0x0000000004f1ea87 c10d::ProcessGroupMPI::createProcessGroupMPI()  ???:0
 9 0x0000000000c35470 pybind11::cpp_function::initialize<torch::distributed::c10d::(anonymous namespace)::c10d_init(_object*, _object*)::{lambda(std::vector<int, std::allocator<int> >)#74}, c10::intrusive_ptr<c10d::ProcessGroupMPI, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroupMPI> >, std::vector<int, std::allocator<int> >, pybind11::name, pybind11::scope, pybind11::sibling, pybind11::call_guard<pybind11::gil_scoped_release> >(torch::distributed::c10d::(anonymous namespace)::c10d_init(_object*, _object*)::{lambda(std::vector<int, std::allocator<int> >)#74}&&, c10::intrusive_ptr<c10d::ProcessGroupMPI, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroupMPI> > (*)(std::vector<int, std::allocator<int> >), pybind11::name const&, pybind11::scope const&, pybind11::sibling const&, pybind11::call_guard<pybind11::gil_scoped_release> const&)::{lambda(pybind11::detail::function_call&)#3}::_FUN()  init.cpp:0
10 0x000000000042efb7 pybind11::cpp_function::dispatcher()  :0
11 0x000000000015a10e PyObject_CallFunctionObjArgs()  ???:0
12 0x0000000000150a7b _PyObject_MakeTpCall()  ???:0
13 0x0000000000149629 _PyEval_EvalFrameDefault()  ???:0
14 0x000000000015a9fc _PyFunction_Vectorcall()  ???:0
15 0x000000000014453c _PyEval_EvalFrameDefault()  ???:0
16 0x000000000015a9fc _PyFunction_Vectorcall()  ???:0
17 0x000000000014453c _PyEval_EvalFrameDefault()  ???:0
18 0x000000000015a9fc _PyFunction_Vectorcall()  ???:0
19 0x0000000000169492 PyObject_Call()  ???:0
20 0x00000000001455d7 _PyEval_EvalFrameDefault()  ???:0
21 0x000000000015a9fc _PyFunction_Vectorcall()  ???:0
22 0x000000000014453c _PyEval_EvalFrameDefault()  ???:0
23 0x000000000015a9fc _PyFunction_Vectorcall()  ???:0
24 0x000000000014326d _PyEval_EvalFrameDefault()  ???:0
25 0x000000000015a9fc _PyFunction_Vectorcall()  ???:0
26 0x000000000014453c _PyEval_EvalFrameDefault()  ???:0
27 0x000000000015a9fc _PyFunction_Vectorcall()  ???:0
28 0x000000000014453c _PyEval_EvalFrameDefault()  ???:0
29 0x000000000013f9c6 _PyArg_ParseTuple_SizeT()  ???:0
30 0x0000000000235256 PyEval_EvalCode()  ???:0
31 0x0000000000260108 PyUnicode_Tailmatch()  ???:0
32 0x00000000002599cb PyInit__collections()  ???:0
33 0x000000000025fe55 PyUnicode_Tailmatch()  ???:0
34 0x000000000025f338 _PyRun_SimpleFileObject()  ???:0
35 0x000000000025ef83 _PyRun_AnyFileObject()  ???:0
36 0x0000000000251a5e Py_RunMain()  ???:0
37 0x000000000022802d Py_BytesMain()  ???:0
38 0x0000000000029d90 __libc_init_first()  ???:0
39 0x0000000000029e40 __libc_start_main()  ???:0
40 0x0000000000227f25 _start()  ???:0
=================================
==== backtrace (tid:  10274) ====
 0 0x0000000000042520 __sigaction()  ???:0
 1 0x0000000000042b60 ompi_dpm_group_is_dyn()  /build-result/src/hpcx-v2.18-gcc-inbox-ubuntu22.04-cuda12-x86_64/ompi-efbeca7056b93dd17c67b66d1d514d39712e28d6/ompi/dpm/dpm.c:1267
 2 0x0000000000042b60 ompi_dpm_group_is_dyn()  /build-result/src/hpcx-v2.18-gcc-inbox-ubuntu22.04-cuda12-x86_64/ompi-efbeca7056b93dd17c67b66d1d514d39712e28d6/ompi/dpm/dpm.c:1268
 3 0x0000000000042b60 ompi_dpm_mark_dyncomm()  /build-result/src/hpcx-v2.18-gcc-inbox-ubuntu22.04-cuda12-x86_64/ompi-efbeca7056b93dd17c67b66d1d514d39712e28d6/ompi/dpm/dpm.c:1299
 4 0x0000000000034388 ompi_comm_set_nb()  /build-result/src/hpcx-v2.18-gcc-inbox-ubuntu22.04-cuda12-x86_64/ompi-efbeca7056b93dd17c67b66d1d514d39712e28d6/ompi/communicator/comm.c:215
 5 0x00000000000346ba ompi_comm_set()  /build-result/src/hpcx-v2.18-gcc-inbox-ubuntu22.04-cuda12-x86_64/ompi-efbeca7056b93dd17c67b66d1d514d39712e28d6/ompi/communicator/comm.c:116
 6 0x0000000000034ef3 ompi_comm_create()  /build-result/src/hpcx-v2.18-gcc-inbox-ubuntu22.04-cuda12-x86_64/ompi-efbeca7056b93dd17c67b66d1d514d39712e28d6/ompi/communicator/comm.c:344
 7 0x000000000006b24a PMPI_Comm_create()  /build-result/src/hpcx-v2.18-gcc-inbox-ubuntu22.04-cuda12-x86_64/ompi-efbeca7056b93dd17c67b66d1d514d39712e28d6/ompi/mpi/c/profile/pcomm_create.c:66
 8 0x0000000004f1ea87 c10d::ProcessGroupMPI::createProcessGroupMPI()  ???:0
 9 0x0000000000c35470 pybind11::cpp_function::initialize<torch::distributed::c10d::(anonymous namespace)::c10d_init(_object*, _object*)::{lambda(std::vector<int, std::allocator<int> >)#74}, c10::intrusive_ptr<c10d::ProcessGroupMPI, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroupMPI> >, std::vector<int, std::allocator<int> >, pybind11::name, pybind11::scope, pybind11::sibling, pybind11::call_guard<pybind11::gil_scoped_release> >(torch::distributed::c10d::(anonymous namespace)::c10d_init(_object*, _object*)::{lambda(std::vector<int, std::allocator<int> >)#74}&&, c10::intrusive_ptr<c10d::ProcessGroupMPI, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroupMPI> > (*)(std::vector<int, std::allocator<int> >), pybind11::name const&, pybind11::scope const&, pybind11::sibling const&, pybind11::call_guard<pybind11::gil_scoped_release> const&)::{lambda(pybind11::detail::function_call&)#3}::_FUN()  init.cpp:0
10 0x000000000042efb7 pybind11::cpp_function::dispatcher()  :0
11 0x000000000015a10e PyObject_CallFunctionObjArgs()  ???:0
12 0x0000000000150a7b _PyObject_MakeTpCall()  ???:0
13 0x0000000000149629 _PyEval_EvalFrameDefault()  ???:0
14 0x000000000015a9fc _PyFunction_Vectorcall()  ???:0
15 0x000000000014453c _PyEval_EvalFrameDefault()  ???:0
16 0x000000000015a9fc _PyFunction_Vectorcall()  ???:0
17 0x000000000014453c _PyEval_EvalFrameDefault()  ???:0
18 0x000000000015a9fc _PyFunction_Vectorcall()  ???:0
19 0x0000000000169492 PyObject_Call()  ???:0
20 0x00000000001455d7 _PyEval_EvalFrameDefault()  ???:0
21 0x000000000015a9fc _PyFunction_Vectorcall()  ???:0
22 0x000000000014453c _PyEval_EvalFrameDefault()  ???:0
23 0x000000000015a9fc _PyFunction_Vectorcall()  ???:0
24 0x000000000014326d _PyEval_EvalFrameDefault()  ???:0
25 0x000000000015a9fc _PyFunction_Vectorcall()  ???:0
26 0x000000000014453c _PyEval_EvalFrameDefault()  ???:0
27 0x000000000015a9fc _PyFunction_Vectorcall()  ???:0
28 0x000000000014453c _PyEval_EvalFrameDefault()  ???:0
29 0x000000000013f9c6 _PyArg_ParseTuple_SizeT()  ???:0
30 0x0000000000235256 PyEval_EvalCode()  ???:0
31 0x0000000000260108 PyUnicode_Tailmatch()  ???:0
32 0x00000000002599cb PyInit__collections()  ???:0
33 0x000000000025fe55 PyUnicode_Tailmatch()  ???:0
34 0x000000000025f338 _PyRun_SimpleFileObject()  ???:0
35 0x000000000025ef83 _PyRun_AnyFileObject()  ???:0
36 0x0000000000251a5e Py_RunMain()  ???:0
37 0x000000000022802d Py_BytesMain()  ???:0
38 0x0000000000029d90 __libc_init_first()  ???:0
39 0x0000000000029e40 __libc_start_main()  ???:0
40 0x0000000000227f25 _start()  ???:0
=================================
==== backtrace (tid:  10278) ====
 0 0x0000000000042520 __sigaction()  ???:0
 1 0x0000000000042b60 ompi_dpm_group_is_dyn()  /build-result/src/hpcx-v2.18-gcc-inbox-ubuntu22.04-cuda12-x86_64/ompi-efbeca7056b93dd17c67b66d1d514d39712e28d6/ompi/dpm/dpm.c:1267
 2 0x0000000000042b60 ompi_dpm_group_is_dyn()  /build-result/src/hpcx-v2.18-gcc-inbox-ubuntu22.04-cuda12-x86_64/ompi-efbeca7056b93dd17c67b66d1d514d39712e28d6/ompi/dpm/dpm.c:1268
 3 0x0000000000042b60 ompi_dpm_mark_dyncomm()  /build-result/src/hpcx-v2.18-gcc-inbox-ubuntu22.04-cuda12-x86_64/ompi-efbeca7056b93dd17c67b66d1d514d39712e28d6/ompi/dpm/dpm.c:1299
 4 0x0000000000034388 ompi_comm_set_nb()  /build-result/src/hpcx-v2.18-gcc-inbox-ubuntu22.04-cuda12-x86_64/ompi-efbeca7056b93dd17c67b66d1d514d39712e28d6/ompi/communicator/comm.c:215
 5 0x00000000000346ba ompi_comm_set()  /build-result/src/hpcx-v2.18-gcc-inbox-ubuntu22.04-cuda12-x86_64/ompi-efbeca7056b93dd17c67b66d1d514d39712e28d6/ompi/communicator/comm.c:116
 6 0x0000000000034ef3 ompi_comm_create()  /build-result/src/hpcx-v2.18-gcc-inbox-ubuntu22.04-cuda12-x86_64/ompi-efbeca7056b93dd17c67b66d1d514d39712e28d6/ompi/communicator/comm.c:344
 7 0x000000000006b24a PMPI_Comm_create()  /build-result/src/hpcx-v2.18-gcc-inbox-ubuntu22.04-cuda12-x86_64/ompi-efbeca7056b93dd17c67b66d1d514d39712e28d6/ompi/mpi/c/profile/pcomm_create.c:66
 8 0x0000000004f1ea87 c10d::ProcessGroupMPI::createProcessGroupMPI()  ???:0
 9 0x0000000000c35470 pybind11::cpp_function::initialize<torch::distributed::c10d::(anonymous namespace)::c10d_init(_object*, _object*)::{lambda(std::vector<int, std::allocator<int> >)#74}, c10::intrusive_ptr<c10d::ProcessGroupMPI, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroupMPI> >, std::vector<int, std::allocator<int> >, pybind11::name, pybind11::scope, pybind11::sibling, pybind11::call_guard<pybind11::gil_scoped_release> >(torch::distributed::c10d::(anonymous namespace)::c10d_init(_object*, _object*)::{lambda(std::vector<int, std::allocator<int> >)#74}&&, c10::intrusive_ptr<c10d::ProcessGroupMPI, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroupMPI> > (*)(std::vector<int, std::allocator<int> >), pybind11::name const&, pybind11::scope const&, pybind11::sibling const&, pybind11::call_guard<pybind11::gil_scoped_release> const&)::{lambda(pybind11::detail::function_call&)#3}::_FUN()  init.cpp:0
10 0x000000000042efb7 pybind11::cpp_function::dispatcher()  :0
11 0x000000000015a10e PyObject_CallFunctionObjArgs()  ???:0
12 0x0000000000150a7b _PyObject_MakeTpCall()  ???:0
13 0x0000000000149629 _PyEval_EvalFrameDefault()  ???:0
14 0x000000000015a9fc _PyFunction_Vectorcall()  ???:0
15 0x000000000014453c _PyEval_EvalFrameDefault()  ???:0
16 0x000000000015a9fc _PyFunction_Vectorcall()  ???:0
17 0x000000000014453c _PyEval_EvalFrameDefault()  ???:0
18 0x000000000015a9fc _PyFunction_Vectorcall()  ???:0
19 0x0000000000169492 PyObject_Call()  ???:0
20 0x00000000001455d7 _PyEval_EvalFrameDefault()  ???:0
21 0x000000000015a9fc _PyFunction_Vectorcall()  ???:0
22 0x000000000014453c _PyEval_EvalFrameDefault()  ???:0
23 0x000000000015a9fc _PyFunction_Vectorcall()  ???:0
24 0x000000000014326d _PyEval_EvalFrameDefault()  ???:0
25 0x000000000015a9fc _PyFunction_Vectorcall()  ???:0
26 0x000000000014453c _PyEval_EvalFrameDefault()  ???:0
27 0x000000000015a9fc _PyFunction_Vectorcall()  ???:0
28 0x000000000014453c _PyEval_EvalFrameDefault()  ???:0
29 0x000000000013f9c6 _PyArg_ParseTuple_SizeT()  ???:0
30 0x0000000000235256 PyEval_EvalCode()  ???:0
31 0x0000000000260108 PyUnicode_Tailmatch()  ???:0
32 0x00000000002599cb PyInit__collections()  ???:0
33 0x000000000025fe55 PyUnicode_Tailmatch()  ???:0
34 0x000000000025f338 _PyRun_SimpleFileObject()  ???:0
35 0x000000000025ef83 _PyRun_AnyFileObject()  ???:0
36 0x0000000000251a5e Py_RunMain()  ???:0
37 0x000000000022802d Py_BytesMain()  ???:0
38 0x0000000000029d90 __libc_init_first()  ???:0
39 0x0000000000029e40 __libc_start_main()  ???:0
40 0x0000000000227f25 _start()  ???:0
=================================
E1105 06:18:11.067000 140457179309888 torch/distributed/elastic/multiprocessing/api.py:881] failed (exitcode: -11) local_rank: 0 (pid: 10271) of binary: /usr/bin/python
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 879, in main
    run(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
=======================================================
pretrain_gpt.py FAILED
-------------------------------------------------------
Failures:
[1]:
  time      : 2024-11-05_06:18:10
  host      : fb2a7d718a49
  rank      : 1 (local_rank: 1)
  exitcode  : -11 (pid: 10272)
  error_file: <N/A>
  traceback : Signal 11 (SIGSEGV) received by PID 10272
[2]:
  time      : 2024-11-05_06:18:10
  host      : fb2a7d718a49
  rank      : 2 (local_rank: 2)
  exitcode  : -11 (pid: 10273)
  error_file: <N/A>
  traceback : Signal 11 (SIGSEGV) received by PID 10273
[3]:
  time      : 2024-11-05_06:18:10
  host      : fb2a7d718a49
  rank      : 3 (local_rank: 3)
  exitcode  : -11 (pid: 10274)
  error_file: <N/A>
  traceback : Signal 11 (SIGSEGV) received by PID 10274
[4]:
  time      : 2024-11-05_06:18:10
  host      : fb2a7d718a49
  rank      : 4 (local_rank: 4)
  exitcode  : -11 (pid: 10275)
  error_file: <N/A>
  traceback : Signal 11 (SIGSEGV) received by PID 10275
[5]:
  time      : 2024-11-05_06:18:10
  host      : fb2a7d718a49
  rank      : 5 (local_rank: 5)
  exitcode  : -11 (pid: 10276)
  error_file: <N/A>
  traceback : Signal 11 (SIGSEGV) received by PID 10276
[6]:
  time      : 2024-11-05_06:18:10
  host      : fb2a7d718a49
  rank      : 6 (local_rank: 6)
  exitcode  : -11 (pid: 10277)
  error_file: <N/A>
  traceback : Signal 11 (SIGSEGV) received by PID 10277
[7]:
  time      : 2024-11-05_06:18:10
  host      : fb2a7d718a49
  rank      : 7 (local_rank: 7)
  exitcode  : -11 (pid: 10278)
  error_file: <N/A>
  traceback : Signal 11 (SIGSEGV) received by PID 10278
-------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-11-05_06:18:10
  host      : fb2a7d718a49
  rank      : 0 (local_rank: 0)
  exitcode  : -11 (pid: 10271)
  error_file: <N/A>
  traceback : Signal 11 (SIGSEGV) received by PID 10271
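
Every rank in the summary above reports exitcode -11. A negative exit code of -N conventionally means the child process was terminated by signal N, which matches the "Signal 11 (SIGSEGV)" traceback lines. The following is a minimal sketch of how such a summary could be reduced to a per-rank table. The `SUMMARY` excerpt and the helper name `failed_ranks` are illustrative and not part of torchrun.

```python
import re
import signal

# Shortened excerpt of the torchrun failure summary shown above.
SUMMARY = """\
[1]:
  rank      : 1 (local_rank: 1)
  exitcode  : -11 (pid: 10272)
[7]:
  rank      : 7 (local_rank: 7)
  exitcode  : -11 (pid: 10278)
"""


def failed_ranks(summary: str) -> dict:
    """Map each rank to the signal name (or raw exit code) it ended with."""
    # Anchor at line start so "local_rank: 1" is not counted as a second rank.
    ranks = re.findall(r"^[ \t]*rank\s*:\s*(\d+)", summary, flags=re.MULTILINE)
    codes = re.findall(r"^[ \t]*exitcode\s*:\s*(-?\d+)", summary, flags=re.MULTILINE)
    result = {}
    for rank, code in zip(ranks, codes):
        code = int(code)
        # A negative exit code -N means the process was killed by signal N.
        result[int(rank)] = signal.Signals(-code).name if code < 0 else code
    return result


print(failed_ranks(SUMMARY))  # {1: 'SIGSEGV', 7: 'SIGSEGV'}
```

Since all eight ranks die with the same signal within the same second, the summary's "Root Cause" entry for rank 0 is only the first failure observed, not necessarily the origin; the backtraces above are the more informative evidence.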

Environment (please complete the following information):

megatron: 3d27a9d
image: nvcr.io/nvidia/pytorch:24.04-py3

wplf · 2024-11-06T07:52:10Z

Hello, you can pull the latest code and use --tp-comm-bootstrap-backend nccl to specify the TP backend.
This might help.

ltm920716 · 2024-11-06T08:26:32Z

Hello, you can pull the latest code and use --tp-comm-bootstrap-backend nccl to specify the TP backend. This might help.

Hi,
I set the following:

MODEL_PARALLEL_ARGS=(
        --tensor-model-parallel-size 2
        --pipeline-model-parallel-size 2
        --use-flash-attn
        --sequence-parallel
        --overlap-grad-reduce
        --recompute-activations
        --recompute-granularity selective
        --tp-comm-bootstrap-backend nccl
        --tp-comm-overlap
)

I also pulled the latest repo, but I get the same error.

wplf · 2024-11-06T08:35:28Z

How about trying the newest TE?
Your log shows: Transformer Engine v1.5.0+6a9edc3 supports only MPI bootstrap backend.
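
To check which Transformer Engine version a run is actually using, the version can be read straight out of the warning line in the log. The sketch below does only that and compares the result against a minimum. The helper names and the `ASSUMED_MINIMUM` value are assumptions made for illustration; the real threshold for non-MPI bootstrap support should be taken from the Megatron-LM source or the Transformer Engine release notes.

```python
import re

# Warning line copied from the log in this issue.
WARNING = (
    "UserWarning: Transformer Engine v1.5.0+6a9edc3 "
    "supports only MPI bootstrap backend."
)

# Assumption for illustration only; verify the real minimum before relying on it.
ASSUMED_MINIMUM = (1, 9, 0)


def parse_te_version(line: str):
    """Return (major, minor, patch) from a log line, or None if absent."""
    match = re.search(r"Transformer Engine v(\d+)\.(\d+)\.(\d+)", line)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


def is_at_least(version, minimum) -> bool:
    """Compare version tuples; a missing version never satisfies the minimum."""
    return version is not None and version >= minimum


installed = parse_te_version(WARNING)
print(installed)  # (1, 5, 0)
print(is_at_least(installed, ASSUMED_MINIMUM))  # False
```

The local build suffix (`+6a9edc3`) is ignored on purpose, since only the release numbers matter for this comparison.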

ltm920716 · 2024-11-06T09:10:18Z

How about trying the newest TE? Your log shows: Transformer Engine v1.5.0+6a9edc3 supports only MPI bootstrap backend.

Sorry, I have set --transformer-impl local, and the error still occurs.

wplf · 2024-11-06T09:17:31Z

Maybe the local implementation does not support --tp-comm-overlap.
I strongly suggest using TE.

ltm920716 changed the title ~~[BUG] --tp~~ [BUG] training crash when set --tp-comm-overlap Nov 5, 2024
