[1,9]<stdout>:
[1,13]<stdout>:Traceback (most recent call last):
[1,13]<stdout>: File "/opt/conda/lib/python3.6/shutil.py", line 550, in move
[1,13]<stdout>: os.rename(src, real_dst)
[1,13]<stdout>:FileNotFoundError: [Errno 2] No such file or directory: '/opt/ml/output/tensors/events/000000000000/000000000000_worker_0.tfevents.tmp' -> '/opt/ml/output/tensors/events/000000000000/000000000000_worker_0.tfevents'
[1,13]<stdout>:
[1,13]<stdout>:During handling of the above exception, another exception occurred:
[1,13]<stdout>:
[1,13]<stdout>:Traceback (most recent call last):
[1,13]<stdout>: File "/opt/conda/lib/python3.6/runpy.py", line 193, in _run_module_as_main
[1,13]<stdout>: "__main__", mod_spec)
[1,13]<stdout>: File "/opt/conda/lib/python3.6/runpy.py", line 85, in _run_code
[1,13]<stdout>: exec(code, run_globals)
[1,13]<stdout>: File "/opt/conda/lib/python3.6/site-packages/mpi4py/__main__.py", line 7, in <module>
[1,13]<stdout>: main()
[1,13]<stdout>: File "/opt/conda/lib/python3.6/site-packages/mpi4py/run.py", line 196, in main
[1,13]<stdout>: run_command_line(args)
[1,13]<stdout>: File "/opt/conda/lib/python3.6/site-packages/mpi4py/run.py", line 47, in run_command_line
[1,13]<stdout>: run_path(sys.argv[0], run_name='__main__')
[1,13]<stdout>: File "/opt/conda/lib/python3.6/runpy.py", line 263, in run_path
[1,13]<stdout>: pkg_name=pkg_name, script_name=fname)
[1,13]<stdout>: File "/opt/conda/lib/python3.6/runpy.py", line 96, in _run_module_code
[1,13]<stdout>: mod_name, mod_spec, pkg_name, script_name)
[1,13]<stdout>: File "/opt/conda/lib/python3.6/runpy.py", line 85, in _run_code
[1,13]<stdout>: exec(code, run_globals)
[1,13]<stdout>: File "train_net.py", line 306, in <module>
[1,13]<stdout>: main()
[1,13]<stdout>: File "train_net.py", line 298, in main
[1,13]<stdout>: model = train(cfg, args)
[1,13]<stdout>: File "train_net.py", line 165, in train
[1,13]<stdout>: per_iter_end_callback_fn=per_iter_callback_fn,
[1,13]<stdout>: File "/root/DeepLearningExamples/PyTorch/Segmentation/MaskRCNN/pytorch/maskrcnn_benchmark/engine/trainer.py", line 78, in do_train
[1,13]<stdout>: loss_dict = model(images, targets)
[1,13]<stdout>: File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 724, in _call_impl
[1,13]<stdout>: result = hook(self, input)
[1,13]<stdout>: File "/opt/conda/lib/python3.6/site-packages/smdebug/pytorch/hook.py", line 123, in forward_pre_hook
[1,13]<stdout>: self._close_writers()
[1,13]<stdout>: File "/opt/conda/lib/python3.6/site-packages/smdebug/core/hook.py", line 433, in _close_writers
[1,13]<stdout>: self.writer.close()
[1,13]<stdout>: File "/opt/conda/lib/python3.6/site-packages/smdebug/core/writer.py", line 201, in close
[1,13]<stdout>: self._writer.close()
[1,13]<stdout>: File "/opt/conda/lib/python3.6/site-packages/smdebug/core/tfevent/event_file_writer.py", line 125, in close
[1,13]<stdout>: self._ev_writer.close()
[1,13]<stdout>: File "/opt/conda/lib/python3.6/site-packages/smdebug/core/tfevent/events_writer.py", line 63, in close
[1,13]<stdout>: self.tfrecord_writer.close()
[1,13]<stdout>: File "/opt/conda/lib/python3.6/site-packages/smdebug/core/tfrecord/record_writer.py", line 81, in close
[1,13]<stdout>: self._writer.close()
[1,13]<stdout>: File "/opt/conda/lib/python3.6/site-packages/smdebug/core/access_layer/file.py", line 53, in close
[1,13]<stdout>: shutil.move(self.temp_path, self.path)
[1,13]<stdout>: File "/opt/conda/lib/python3.6/shutil.py", line 564, in move
[1,13]<stdout>: copy_function(src, real_dst)
[1,13]<stdout>: File "/opt/conda/lib/python3.6/shutil.py", line 263, in copy2
[1,13]<stdout>: copyfile(src, dst, follow_symlinks=follow_symlinks)
[1,13]<stdout>: File "/opt/conda/lib/python3.6/shutil.py", line 120, in copyfile
[1,13]<stdout>: with open(src, 'rb') as fsrc:
[1,13]<stdout>:FileNotFoundError: [Errno 2] No such file or directory: '/opt/ml/output/tensors/events/000000000000/000000000000_worker_0.tfevents.tmp'
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 13 in communicator MPI COMMUNICATOR 5 DUP FROM 0
with errorcode 1.
I am using a custom Docker image to run distributed training with PyTorch on SageMaker. The training script is taken from https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/Segmentation/MaskRCNN. The Docker image uses the DLC image pytorch-training:1.6.0-gpu-py3 as its base image. The error traceback is shown above.
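For reference, the chained FileNotFoundError in the traceback can be reproduced in isolation with only the standard library. This is a minimal sketch, not smdebug's code: `close_like_writer` and `reproduce` are hypothetical names, and deleting the temp file by hand is only an assumption standing in for whatever made the `.tmp` file disappear in the real run. It shows that `shutil.move` first tries `os.rename`, then falls back to copying when that fails, so a missing source file produces two stacked FileNotFoundError exceptions like the ones in the log.

```python
import os
import shutil
import tempfile


def close_like_writer(temp_path, final_path):
    """Hypothetical stand-in for a writer that moves a temp file into place on close."""
    shutil.move(temp_path, final_path)


def reproduce():
    """Return the exception raised when the temp file vanishes before the move."""
    with tempfile.TemporaryDirectory() as workdir:
        temp_path = os.path.join(workdir, "events.tfevents.tmp")
        final_path = os.path.join(workdir, "events.tfevents")

        # Write the temp file, then remove it before the move.
        # Assumption: this mimics the temp file going missing in the real job.
        with open(temp_path, "wb") as f:
            f.write(b"data")
        os.remove(temp_path)

        try:
            close_like_writer(temp_path, final_path)
        except FileNotFoundError as err:
            return err
    return None


if __name__ == "__main__":
    err = reproduce()
    # The outer error comes from the copy fallback; its __context__ is the
    # earlier os.rename failure that shutil.move was handling.
    print("raised:", repr(err))
    print("while handling:", repr(err.__context__))
```

The sketch only demonstrates the failure pattern. It does not say why the temp file was missing on rank 13, which is the part that still needs to be tracked down.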
@Vikas-kum