FileNotFoundError when using SageMaker Debugger with PyTorch Distributed Training on SageMaker #392

Open
piyushghai opened this issue Oct 29, 2020 · 2 comments

piyushghai commented Oct 29, 2020

I am using a custom Docker image to run distributed training with PyTorch on SageMaker. The training script is taken from https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/Segmentation/MaskRCNN. The custom image uses the DLC image pytorch-training:1.6.0-gpu-py3 as its base image.

The error traceback follows:

[1,9]<stdout>:
[1,13]<stdout>:Traceback (most recent call last):
[1,13]<stdout>:  File "/opt/conda/lib/python3.6/shutil.py", line 550, in move
[1,13]<stdout>:    os.rename(src, real_dst)
[1,13]<stdout>:FileNotFoundError: [Errno 2] No such file or directory: '/opt/ml/output/tensors/events/000000000000/000000000000_worker_0.tfevents.tmp' -> '/opt/ml/output/tensors/events/000000000000/000000000000_worker_0.tfevents'
[1,13]<stdout>:
[1,13]<stdout>:During handling of the above exception, another exception occurred:
[1,13]<stdout>:
[1,13]<stdout>:Traceback (most recent call last):
[1,13]<stdout>:  File "/opt/conda/lib/python3.6/runpy.py", line 193, in _run_module_as_main
[1,13]<stdout>:    "__main__", mod_spec)
[1,13]<stdout>:  File "/opt/conda/lib/python3.6/runpy.py", line 85, in _run_code
[1,13]<stdout>:    exec(code, run_globals)
[1,13]<stdout>:  File "/opt/conda/lib/python3.6/site-packages/mpi4py/__main__.py", line 7, in <module>
[1,13]<stdout>:    main()
[1,13]<stdout>:  File "/opt/conda/lib/python3.6/site-packages/mpi4py/run.py", line 196, in main
[1,13]<stdout>:    run_command_line(args)
[1,13]<stdout>:  File "/opt/conda/lib/python3.6/site-packages/mpi4py/run.py", line 47, in run_command_line
[1,13]<stdout>:    run_path(sys.argv[0], run_name='__main__')
[1,13]<stdout>:  File "/opt/conda/lib/python3.6/runpy.py", line 263, in run_path
[1,13]<stdout>:    pkg_name=pkg_name, script_name=fname)
[1,13]<stdout>:  File "/opt/conda/lib/python3.6/runpy.py", line 96, in _run_module_code
[1,13]<stdout>:    mod_name, mod_spec, pkg_name, script_name)
[1,13]<stdout>:  File "/opt/conda/lib/python3.6/runpy.py", line 85, in _run_code
[1,13]<stdout>:    exec(code, run_globals)
[1,13]<stdout>:  File "train_net.py", line 306, in <module>
[1,13]<stdout>:    main()
[1,13]<stdout>:  File "train_net.py", line 298, in main
[1,13]<stdout>:    model = train(cfg, args)
[1,13]<stdout>:  File "train_net.py", line 165, in train
[1,13]<stdout>:    per_iter_end_callback_fn=per_iter_callback_fn,
[1,13]<stdout>:  File "/root/DeepLearningExamples/PyTorch/Segmentation/MaskRCNN/pytorch/maskrcnn_benchmark/engine/trainer.py", line 78, in do_train
[1,13]<stdout>:    loss_dict = model(images, targets)
[1,13]<stdout>:  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 724, in _call_impl
[1,13]<stdout>:    result = hook(self, input)
[1,13]<stdout>:  File "/opt/conda/lib/python3.6/site-packages/smdebug/pytorch/hook.py", line 123, in forward_pre_hook
[1,13]<stdout>:    self._close_writers()
[1,13]<stdout>:  File "/opt/conda/lib/python3.6/site-packages/smdebug/core/hook.py", line 433, in _close_writers
[1,13]<stdout>:    self.writer.close()
[1,13]<stdout>:  File "/opt/conda/lib/python3.6/site-packages/smdebug/core/writer.py", line 201, in close
[1,13]<stdout>:    self._writer.close()
[1,13]<stdout>:  File "/opt/conda/lib/python3.6/site-packages/smdebug/core/tfevent/event_file_writer.py", line 125, in close
[1,13]<stdout>:    self._ev_writer.close()
[1,13]<stdout>:  File "/opt/conda/lib/python3.6/site-packages/smdebug/core/tfevent/events_writer.py", line 63, in close
[1,13]<stdout>:    self.tfrecord_writer.close()
[1,13]<stdout>:  File "/opt/conda/lib/python3.6/site-packages/smdebug/core/tfrecord/record_writer.py", line 81, in close
[1,13]<stdout>:    self._writer.close()
[1,13]<stdout>:  File "/opt/conda/lib/python3.6/site-packages/smdebug/core/access_layer/file.py", line 53, in close
[1,13]<stdout>:    shutil.move(self.temp_path, self.path)
[1,13]<stdout>:  File "/opt/conda/lib/python3.6/shutil.py", line 564, in move
[1,13]<stdout>:    copy_function(src, real_dst)
[1,13]<stdout>:  File "/opt/conda/lib/python3.6/shutil.py", line 263, in copy2
[1,13]<stdout>:    copyfile(src, dst, follow_symlinks=follow_symlinks)
[1,13]<stdout>:  File "/opt/conda/lib/python3.6/shutil.py", line 120, in copyfile
[1,13]<stdout>:    with open(src, 'rb') as fsrc:
[1,13]<stdout>:FileNotFoundError: [Errno 2] No such file or directory: '/opt/ml/output/tensors/events/000000000000/000000000000_worker_0.tfevents.tmp'
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 13 in communicator MPI COMMUNICATOR 5 DUP FROM 0
with errorcode 1.
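
For readers following the traceback: the two chained `FileNotFoundError`s come from how `shutil.move` works. It first tries `os.rename`, and if that raises `OSError` it falls back to copying the source, which fails again when the source file is missing. The sketch below reproduces only that standard-library behavior. The helper names `close_writer` and `demo` are hypothetical, and the "file already moved" scenario is an assumption used to make the temp file disappear; the thread does not establish why the `.tmp` file was missing in smdebug.

```python
import os
import shutil
import tempfile


def close_writer(temp_path, final_path):
    """Hypothetical stand-in for the last step in the traceback:
    publish the temp file under its final name."""
    shutil.move(temp_path, final_path)


def demo():
    """Move the same temp file twice and return the error from the second move."""
    with tempfile.TemporaryDirectory() as d:
        temp_path = os.path.join(d, "000000000000_worker_0.tfevents.tmp")
        final_path = os.path.join(d, "000000000000_worker_0.tfevents")

        with open(temp_path, "wb") as f:
            f.write(b"event data")

        # First close succeeds and removes the .tmp file.
        close_writer(temp_path, final_path)

        # Second close (assumed scenario): the .tmp file is already gone, so
        # os.rename fails, then the copy fallback fails on open(src, 'rb').
        try:
            close_writer(temp_path, final_path)
        except FileNotFoundError as err:
            return err
    return None


if __name__ == "__main__":
    error = demo()
    print(type(error).__name__, "->", error.filename)
    print("chained from:", type(error.__context__).__name__)
```

If the same pattern applies here, anything that closes a writer whose temp file was never created, or was already moved or removed, would produce this pair of exceptions.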

@Vikas-kum

Contributor

leleamol commented Nov 2, 2020

A fix has been checked in for 1.6 that avoids registering the hook for non-training activities. It is currently under review.

@Vikas-kum
Contributor

@leleamol Can you point to the fix PR?
