
The training hangs after reloading one of master/worker pods #359

Open
dmitsf opened this issue Oct 28, 2021 · 5 comments

dmitsf commented Oct 28, 2021

Hello!
I'm setting up training with PyTorchJobs, and I've run into a problem: if one of the pods (master or worker, it doesn't matter) restarts, the whole training process hangs. The reason for the restart varies; usually it's Google Cloud Engine node rescheduling. I also tried killing pods myself, and the behavior was the same.
Can I avoid this behavior and make training tolerant to pod restarts?

gaocegege (Member) commented

Can you tell us the PyTorch version?


dmitsf commented Oct 29, 2021

I use PyTorch 1.9.0.

gaocegege (Member) commented

Are you using torch.distributed.run?


dmitsf commented Oct 29, 2021

I don't use it at the moment. I followed the mnist example to adjust my training script.
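
For reference, the distributed setup in the Kubeflow mnist example boils down to something like the minimal sketch below. It assumes the MASTER_ADDR, MASTER_PORT, RANK, and WORLD_SIZE environment variables that the PyTorch operator injects into each pod; the model and training loop are placeholders, not the actual script:

```python
import torch
import torch.distributed as dist
import torch.nn as nn


def main():
    # The operator injects MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE
    # into every pod, so the default env:// rendezvous is enough here.
    dist.init_process_group(backend="gloo")  # "nccl" on GPU nodes

    model = nn.parallel.DistributedDataParallel(nn.Linear(10, 1))  # placeholder model
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.MSELoss()

    for step in range(100):  # placeholder training loop
        optimizer.zero_grad()
        loss = loss_fn(model(torch.randn(32, 10)), torch.randn(32, 1))
        loss.backward()
        optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

With this static env:// setup, the world size and ranks are fixed when the job starts: if a pod is rescheduled, the surviving workers stay blocked inside their collective calls and the restarted process cannot rejoin the existing group, which would match the hang described above.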

gaocegege (Member) commented

Can you please show us the script and the YAML file? PyTorch 1.9 introduced elastic training, and that may be related to the hang.
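
For anyone reading along: torch.distributed.run is the elastic launcher that came with PyTorch 1.9, and running the same kind of script under it is the route PyTorch provides for tolerating worker restarts. A minimal sketch, assuming the master pod's address is reachable via MASTER_ADDR/MASTER_PORT; the node range, rendezvous id, entrypoint name (train.py), and restart count are illustrative:

```python
# Each pod would launch the script with something like:
#   python -m torch.distributed.run \
#       --nnodes=1:3 --nproc_per_node=1 \
#       --rdzv_backend=c10d --rdzv_endpoint=$MASTER_ADDR:$MASTER_PORT \
#       --rdzv_id=my-pytorchjob --max_restarts=3 \
#       train.py
# The launcher sets RANK, WORLD_SIZE, LOCAL_RANK, MASTER_ADDR and MASTER_PORT
# for every worker process, so the script itself stays simple.
import os

import torch.distributed as dist


def main():
    dist.init_process_group(backend="gloo")  # "nccl" on GPU nodes
    local_rank = int(os.environ["LOCAL_RANK"])

    # ... build the model, wrap it in DistributedDataParallel and train,
    #     as in the mnist example ...
    print(f"rank {dist.get_rank()} (local rank {local_rank}) joined, "
          f"world size {dist.get_world_size()}")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

The 1:3 node range and --max_restarts are what let the rendezvous re-form when a rescheduled pod comes back, instead of the remaining workers hanging forever.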
