While running:
mpirun -x TERM=linux --map-by node --hostfile hostf --prefix /opt/openmpi-3.1.0 -np 3 --tag-output singularity exec --nv -B /imdata/ -B /storage/ /storage/group/gpu/soware/singularity/ibanks/edge.simg python3 TrainingDriver.py --model cifar10_arch.json --train train_cifar10.list --val test_cifar10.list --loss categorical_crossentropy --epochs 1 --n-process 2 --cache /imdata/ --timeline --batch 1000 --trial-name cifar_3_2_
I get the following error:
[1,2]:Traceback (most recent call last):
[1,2]:  File "TrainingDriver.py", line 308, in <module>
[1,2]:    checkpoint=args.checkpoint, checkpoint_interval=args.checkpoint_interval)
[1,2]:  File "/nfshome/vlimant/NNLO/nnlo/mpi/manager.py", line 165, in __init__
[1,2]:    self.make_comms(comm)
[1,2]:  File "/nfshome/vlimant/NNLO/nnlo/mpi/manager.py", line 266, in make_comms
[1,2]:    checkpoint=self.checkpoint, checkpoint_interval=self.checkpoint_interval
[1,2]:  File "/nfshome/vlimant/NNLO/nnlo/mpi/process.py", line 493, in __init__
[1,2]:    checkpoint=checkpoint, checkpoint_interval=checkpoint_interval )
[1,2]:  File "/nfshome/vlimant/NNLO/nnlo/mpi/process.py", line 121, in __init__
[1,2]:    self.train()
[1,2]:  File "/nfshome/vlimant/NNLO/nnlo/mpi/process.py", line 557, in train
[1,2]:    train_metrics = self.model.train_on_batch( x=batch[0], y=batch[1] )
[1,2]:  File "/nfshome/vlimant/NNLO/nnlo/train/model.py", line 19, in wrapper
[1,2]:    return f(*args, **kwargs)
[1,2]:  File "/nfshome/vlimant/NNLO/nnlo/train/model.py", line 143, in train_on_batch
[1,2]:    return np.asarray(self.model.train_on_batch( **args ))
[1,2]:  File "/usr/local/lib/python3.6/site-packages/keras/engine/training.py", line 1217, in train_on_batch
[1,2]:    outputs = self.train_function(ins)
[1,2]:  File "/usr/local/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 2715, in __call__
[1,2]:    return self._call(inputs)
[1,2]:  File "/usr/local/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 2675, in _call
[1,2]:    fetched = self._callable_fn(*array_vals)
[1,2]:  File "/usr/local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1458, in __call__
[1,2]:    run_metadata_ptr)
[1,2]:tensorflow.python.framework.errors_impl.UnknownError: ncclCommInitRank failed: unhandled system error
[1,2]:  [[{{node training/Adam/DistributedAdam_Allreduce/HorovodAllreduce_training_Adam_gradients_dense_3_BiasAdd_grad_BiasAddGrad_0}}]]