ncclCommInitRank failed #29

Open
vlimant opened this issue Oct 7, 2019 · 0 comments

vlimant commented Oct 7, 2019

While running:

mpirun -x TERM=linux --map-by node --hostfile hostf --prefix /opt/openmpi-3.1.0 -np 3 --tag-output singularity exec --nv -B /imdata/ -B /storage/ /storage/group/gpu/soware/singularity/ibanks/edge.simg python3 TrainingDriver.py --model cifar10_arch.json --train train_cifar10.list --val test_cifar10.list --loss categorical_crossentropy --epochs 1 --n-process 2 --cache /imdata/ --timeline --batch 1000 --trial-name cifar_3_2_

I get the following error:

[1,2]:Traceback (most recent call last):
[1,2]:  File "TrainingDriver.py", line 308, in <module>
[1,2]:    checkpoint=args.checkpoint, checkpoint_interval=args.checkpoint_interval)
[1,2]:  File "/nfshome/vlimant/NNLO/nnlo/mpi/manager.py", line 165, in __init__
[1,2]:    self.make_comms(comm)
[1,2]:  File "/nfshome/vlimant/NNLO/nnlo/mpi/manager.py", line 266, in make_comms
[1,2]:    checkpoint=self.checkpoint, checkpoint_interval=self.checkpoint_interval
[1,2]:  File "/nfshome/vlimant/NNLO/nnlo/mpi/process.py", line 493, in __init__
[1,2]:    checkpoint=checkpoint, checkpoint_interval=checkpoint_interval )
[1,2]:  File "/nfshome/vlimant/NNLO/nnlo/mpi/process.py", line 121, in __init__
[1,2]:    self.train()
[1,2]:  File "/nfshome/vlimant/NNLO/nnlo/mpi/process.py", line 557, in train
[1,2]:    train_metrics = self.model.train_on_batch( x=batch[0], y=batch[1] )
[1,2]:  File "/nfshome/vlimant/NNLO/nnlo/train/model.py", line 19, in wrapper
[1,2]:    return f(*args, **kwargs)
[1,2]:  File "/nfshome/vlimant/NNLO/nnlo/train/model.py", line 143, in train_on_batch
[1,2]:    return np.asarray(self.model.train_on_batch( **args ))
[1,2]:  File "/usr/local/lib/python3.6/site-packages/keras/engine/training.py", line 1217, in train_on_batch
[1,2]:    outputs = self.train_function(ins)
[1,2]:  File "/usr/local/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 2715, in __call__
[1,2]:    return self._call(inputs)
[1,2]:  File "/usr/local/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 2675, in _call
[1,2]:    fetched = self._callable_fn(*array_vals)
[1,2]:  File "/usr/local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1458, in __call__
[1,2]:    run_metadata_ptr)
[1,2]:tensorflow.python.framework.errors_impl.UnknownError: ncclCommInitRank failed: unhandled system error
[1,2]:   [[{{node training/Adam/DistributedAdam_Allreduce/HorovodAllreduce_training_Adam_gradients_dense_3_BiasAdd_grad_BiasAddGrad_0}}]]
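
A possible next diagnostic step, which is not part of the original report: "unhandled system error" is a generic NCCL failure, and NCCL normally logs more detail when the `NCCL_DEBUG` environment variable is set to `INFO`. The sketch below assumes an Open MPI launch like the one above, where `-x` exports a variable to every rank. The helper name `with_nccl_debug` is hypothetical, and the shortened command in the example is only a stand-in for the full one.

```python
import shlex


def with_nccl_debug(command, level="INFO"):
    """Return the mpirun command line with NCCL_DEBUG exported to all ranks.

    Hypothetical helper for illustration; it only rewrites the command string
    and does not launch anything.
    """
    args = shlex.split(command)
    if not args or args[0] != "mpirun":
        raise ValueError("expected a command starting with 'mpirun'")
    # Open MPI's -x flag exports an environment variable to each launched process.
    return shlex.join([args[0], "-x", f"NCCL_DEBUG={level}"] + args[1:])


if __name__ == "__main__":
    # Shortened stand-in for the command in this report.
    original = "mpirun -x TERM=linux --map-by node --hostfile hostf -np 3 python3 TrainingDriver.py"
    print(with_nccl_debug(original))
```

Rerunning with the rewritten command should make each rank print NCCL's initialization log, which may show which underlying system call failed inside the container.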
