FileNotFoundError when using SageMaker Debugger with PyTorch Distributed Training on SageMaker #392

piyushghai · 2020-10-29T16:47:50Z

I am using a custom Docker image to run distributed training with PyTorch on SageMaker. The training script is taken from https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/Segmentation/MaskRCNN. The image uses the DLC image pytorch-training:1.6.0-gpu-py3 as its base.

The error traceback follows:

[1,9]<stdout>:
[1,13]<stdout>:Traceback (most recent call last):
[1,13]<stdout>:  File "/opt/conda/lib/python3.6/shutil.py", line 550, in move
[1,13]<stdout>:    os.rename(src, real_dst)
[1,13]<stdout>:FileNotFoundError: [Errno 2] No such file or directory: '/opt/ml/output/tensors/events/000000000000/000000000000_worker_0.tfevents.tmp' -> '/opt/ml/output/tensors/events/000000000000/000000000000_worker_0.tfevents'
[1,13]<stdout>:
[1,13]<stdout>:During handling of the above exception, another exception occurred:
[1,13]<stdout>:
[1,13]<stdout>:Traceback (most recent call last):
[1,13]<stdout>:  File "/opt/conda/lib/python3.6/runpy.py", line 193, in _run_module_as_main
[1,13]<stdout>:    "__main__", mod_spec)
[1,13]<stdout>:  File "/opt/conda/lib/python3.6/runpy.py", line 85, in _run_code
[1,13]<stdout>:    exec(code, run_globals)
[1,13]<stdout>:  File "/opt/conda/lib/python3.6/site-packages/mpi4py/__main__.py", line 7, in <module>
[1,13]<stdout>:    main()
[1,13]<stdout>:  File "/opt/conda/lib/python3.6/site-packages/mpi4py/run.py", line 196, in main
[1,13]<stdout>:    run_command_line(args)
[1,13]<stdout>:  File "/opt/conda/lib/python3.6/site-packages/mpi4py/run.py", line 47, in run_command_line
[1,13]<stdout>:    run_path(sys.argv[0], run_name='__main__')
[1,13]<stdout>:  File "/opt/conda/lib/python3.6/runpy.py", line 263, in run_path
[1,13]<stdout>:    pkg_name=pkg_name, script_name=fname)
[1,13]<stdout>:  File "/opt/conda/lib/python3.6/runpy.py", line 96, in _run_module_code
[1,13]<stdout>:    mod_name, mod_spec, pkg_name, script_name)
[1,13]<stdout>:  File "/opt/conda/lib/python3.6/runpy.py", line 85, in _run_code
[1,13]<stdout>:    exec(code, run_globals)
[1,13]<stdout>:  File "train_net.py", line 306, in <module>
[1,13]<stdout>:    main()
[1,13]<stdout>:  File "train_net.py", line 298, in main
[1,13]<stdout>:    model = train(cfg, args)
[1,13]<stdout>:  File "train_net.py", line 165, in train
[1,13]<stdout>:    per_iter_end_callback_fn=per_iter_callback_fn,
[1,13]<stdout>:  File "/root/DeepLearningExamples/PyTorch/Segmentation/MaskRCNN/pytorch/maskrcnn_benchmark/engine/trainer.py", line 78, in do_train
[1,13]<stdout>:    loss_dict = model(images, targets)
[1,13]<stdout>:  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 724, in _call_impl
[1,13]<stdout>:    result = hook(self, input)
[1,13]<stdout>:  File "/opt/conda/lib/python3.6/site-packages/smdebug/pytorch/hook.py", line 123, in forward_pre_hook
[1,13]<stdout>:    self._close_writers()
[1,13]<stdout>:  File "/opt/conda/lib/python3.6/site-packages/smdebug/core/hook.py", line 433, in _close_writers
[1,13]<stdout>:    self.writer.close()
[1,13]<stdout>:  File "/opt/conda/lib/python3.6/site-packages/smdebug/core/writer.py", line 201, in close
[1,13]<stdout>:    self._writer.close()
[1,13]<stdout>:  File "/opt/conda/lib/python3.6/site-packages/smdebug/core/tfevent/event_file_writer.py", line 125, in close
[1,13]<stdout>:    self._ev_writer.close()
[1,13]<stdout>:  File "/opt/conda/lib/python3.6/site-packages/smdebug/core/tfevent/events_writer.py", line 63, in close
[1,13]<stdout>:    self.tfrecord_writer.close()
[1,13]<stdout>:  File "/opt/conda/lib/python3.6/site-packages/smdebug/core/tfrecord/record_writer.py", line 81, in close
[1,13]<stdout>:    self._writer.close()
[1,13]<stdout>:  File "/opt/conda/lib/python3.6/site-packages/smdebug/core/access_layer/file.py", line 53, in close
[1,13]<stdout>:    shutil.move(self.temp_path, self.path)
[1,13]<stdout>:  File "/opt/conda/lib/python3.6/shutil.py", line 564, in move
[1,13]<stdout>:    copy_function(src, real_dst)
[1,13]<stdout>:  File "/opt/conda/lib/python3.6/shutil.py", line 263, in copy2
[1,13]<stdout>:    copyfile(src, dst, follow_symlinks=follow_symlinks)
[1,13]<stdout>:  File "/opt/conda/lib/python3.6/shutil.py", line 120, in copyfile
[1,13]<stdout>:    with open(src, 'rb') as fsrc:
[1,13]<stdout>:FileNotFoundError: [Errno 2] No such file or directory: '/opt/ml/output/tensors/events/000000000000/000000000000_worker_0.tfevents.tmp'
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 13 in communicator MPI COMMUNICATOR 5 DUP FROM 0
with errorcode 1.
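
The second traceback ends with `shutil.move` failing because the `.tmp` file is already gone, and rank 13 is closing a file named `..._worker_0.tfevents`. One plausible reading, which this thread does not confirm, is that more than one process wrote to the same temporary path and another process renamed it first. The sketch below reproduces that failure mode using only the standard library. `TempFileWriter` and `reproduce_collision` are hypothetical stand-ins, not smdebug's actual classes.

```python
import os
import shutil
import tempfile


class TempFileWriter:
    """Hypothetical writer: writes to <path>.tmp, then moves it to <path> on close."""

    def __init__(self, path):
        self.path = path
        self.temp_path = path + ".tmp"
        # Append mode, so a second writer on the same path does not truncate the first.
        self._fh = open(self.temp_path, "ab")

    def write(self, data):
        self._fh.write(data)

    def close(self):
        self._fh.close()
        # Fails if another writer already moved the temp file away.
        shutil.move(self.temp_path, self.path)


def reproduce_collision(directory):
    """Return the exception raised when two writers share one target path."""
    target = os.path.join(directory, "000000000000_worker_0.tfevents")
    first = TempFileWriter(target)
    second = TempFileWriter(target)
    first.write(b"a")
    second.write(b"b")
    first.close()  # moves the shared .tmp file to its final name
    try:
        second.close()  # the .tmp file no longer exists
    except FileNotFoundError as exc:
        return exc
    return None


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(type(reproduce_collision(tmp)).__name__)
```

If the assumption holds, the first process to call `close()` succeeds and every later one fails, which would match an error that appears on only some ranks.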

@Vikas-kum


leleamol · 2020-11-02T18:38:39Z

The fix, which avoids registering the hook for non-training activities, is checked in for 1.6. It is currently under review.
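
The fix PR is not linked in this thread, so its details are unknown. As a generic illustration of how distributed workers can avoid colliding on one file, the sketch below derives a worker name from `OMPI_COMM_WORLD_RANK`, the rank variable that Open MPI exports to each launched process. `worker_name` and `event_file_path` are hypothetical helpers, not smdebug functions, and this is not the fix referred to above. The file naming pattern is copied from the traceback.

```python
import os


def worker_name(env=None):
    """Return 'worker_<rank>' from Open MPI's rank variable, defaulting to rank 0."""
    env = os.environ if env is None else env
    rank = int(env.get("OMPI_COMM_WORLD_RANK", "0"))
    return f"worker_{rank}"


def event_file_path(out_dir, step, env=None):
    """Build a per-rank event file path so that ranks never share a file name."""
    step_str = f"{step:012d}"
    file_name = f"{step_str}_{worker_name(env)}.tfevents"
    return os.path.join(out_dir, "events", step_str, file_name)
```

With a distinct name per rank, each process renames only its own temporary file, so one rank closing its writer cannot remove a file another rank still expects.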

Vikas-kum · 2020-12-08T19:38:24Z

@leleamol Can you point to the fix PR?
