support for configuration issues #5

Open
raghavendrachari08 opened this issue Sep 22, 2023 · 3 comments
Comments

@raghavendrachari08

Hi,
I am trying to bring up a multi-node GPU HugeCTR training benchmark using the code at https://github.com/mlcommons/training_results_v3.0/tree/main/NVIDIA/benchmarks/dlrm_dcnv2/implementations/hugectr

On a single node I can run the benchmark, but when I run it on multiple nodes (for example, 2 nodes) I hit the error shown below. Could you please help me resolve it?

[HCTR][17:28:15.456][WARNING][RK0][main]: The model name is not specified when creating the solver.
[1695144496.484294] [hpci5201:103648:0] ib_device.c:1250 UCX ERROR ibv_create_ah(dlid=49152 sl=0 port=1 src_path_bits=0 dgid=::ffff:192.160.0.55 sgid_index=3 traffic_class=106) for UD verbs connect on bnxt_re0 failed: Connection timed out
[hpci5201:103648] pml_ucx.c:419 Error: ucp_ep_create(proc=1) failed: Endpoint timeout
[hpci5201:103648] pml_ucx.c:472 Error: Failed to resolve UCX endpoint for rank 1
Traceback (most recent call last):
  File "/dev/shm/data/hugectl/train.py", line 344, in <module>
    model = hugectr.Model(solver, reader, optimizer)
RuntimeError: Runtime error: MPI_ERR_OTHER: known error not in list
MPI_Bcast(&seed, 1, (static_cast<MPI_Datatype> (static_cast<void *> (&(ompi_mpi_unsigned_long_long)))), 0, (static_cast<MPI_Comm> (static_cast<void *> (&(ompi_mpi_comm_world))))) at create (/workspace/dlrm/hugectr/HugeCTR/src/resource_managers/resource_manager_ext.cpp:39)
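
For anyone triaging a similar log, the first UCX error line carries most of the useful detail: the device (`bnxt_re0`), the GID index, and the destination GID. The sketch below pulls those fields out of such a line. It assumes the `dgid` is an IPv4-mapped address (as it appears to be here), which would give a peer IP to test reachability against; the function name `parse_ucx_ah_error` is hypothetical and not part of UCX or HugeCTR.

```python
import ipaddress
import re


def parse_ucx_ah_error(line):
    """Extract key fields from a UCX 'ibv_create_ah(...) failed' log line.

    Returns a dict of the key=value pairs plus 'device', 'reason' and
    'peer_ipv4', or None if the line does not look like that error.
    """
    m = re.search(r"ibv_create_ah\(([^)]*)\).* on (\S+) failed: (.+)$", line)
    if m is None:
        return None
    # The parenthesised part is a space-separated list of key=value items.
    fields = dict(item.split("=", 1) for item in m.group(1).split() if "=" in item)
    fields["device"] = m.group(2)
    fields["reason"] = m.group(3).strip()
    # Assumption: dgid may be an IPv4-mapped IPv6 address (::ffff:a.b.c.d).
    try:
        mapped = ipaddress.IPv6Address(fields.get("dgid", "")).ipv4_mapped
    except ValueError:
        mapped = None
    fields["peer_ipv4"] = str(mapped) if mapped else None
    return fields
```

Applied to the error line above, this should report the device, `sgid_index`, and the peer address embedded in `dgid`, which can then be checked from the failing host with ordinary network tools.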

@EmmaQiaoCh

Hi @raghavendrachari08, could you check whether mpirun works across 2 nodes in your environment?
mpirun -np 2 hostname
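
One caveat with this check: without a host list or hostfile, `mpirun` may start both processes on the local machine, so seeing two hostnames printed does not by itself prove that cross-node launch works. A minimal sketch of a stricter check follows, assuming Open MPI's `-H` (`--host`) option and two hypothetical host names `node1` and `node2`; the helper names are illustrative only. Note that `hostname` does not exercise the UCX/InfiniBand transport, so this only validates the launch path.

```python
import subprocess


def build_mpirun_cmd(hosts):
    """Build an Open MPI command that runs `hostname` once per listed host."""
    # -np is the total process count; -H takes a comma-separated host list.
    return ["mpirun", "-np", str(len(hosts)), "-H", ",".join(hosts), "hostname"]


def missing_hosts(output, hosts):
    """Return the hosts that never appear in mpirun's output.

    Assumes the host list uses the same short names that `hostname` prints;
    domain suffixes are ignored on both sides.
    """
    seen = {ln.strip().split(".")[0] for ln in output.splitlines() if ln.strip()}
    return [h for h in hosts if h.split(".")[0] not in seen]


def run_check(hosts, timeout=60):
    """Run the command and return the list of hosts that did not respond."""
    result = subprocess.run(
        build_mpirun_cmd(hosts), capture_output=True, text=True, timeout=timeout
    )
    return missing_hosts(result.stdout, hosts)
```

Calling `run_check(["node1", "node2"])` with the real host names should return an empty list when each node launched a process.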

@raghavendrachari08
Author

Hi @EmmaQiaoCh,
Thanks for the reply. Yes, mpirun works on both hosts: I can run the command and see the hostnames in the output.

@EmmaQiaoCh

Hi @raghavendrachari08,
Sorry for the late response. Since this is related to HugeCTR, I opened an issue in the HugeCTR repo. Let's continue the discussion in that thread. Thanks!
