Logging issue on TPU VM Pod #7912
Comments
Hey there! As you are trying to run on a TPU Pod, you would need to run `python -m torch_xla.distributed.xla_dist --tpu=$TPU_NAME -- python script.py`
@kaushikb11 I've been running the code in distributed mode. This doesn't help.
@tgisaturday Could you provide more details? Lightning version? A minimal example to reproduce the issue?
And also, where does it seem to be failing?
@kaushikb11 Here are the test codes that I'm using: testcode.zip. I'm using pytorch-lightning 1.3.5.
`python3 -m torch_xla.distributed.xla_dist --tpu=tpu-name -- python3 gan_test_pod.py`
`python3 -m torch_xla.distributed.xla_dist --tpu=tpu-name -- python3 boring.py`
I'm not sure where boring.py fails, but my personal GAN code seems to fail when the Trainer automatically tries to save checkpoints (`trainer.save_checkpoint`).
The Boring script should be working.
@kaushikb11 Have you ever tried with a TPU VM v3-32? The Boring script keeps throwing this:
2021-06-11 00:34:32 10.164.0.7 [0] /usr/local/lib/python3.8/dist-packages/pytorch_lightning/utilities/distributed.py:69: UserWarning: The dataloader, train dataloader, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument ...
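As an aside, the quoted UserWarning is about DataLoader workers rather than the crash itself. A minimal sketch of the usual fix, with a placeholder dataset and an arbitrary worker count (neither is from the thread):

```python
# Sketch only: give the training DataLoader more worker processes
# (tune num_workers to the VM's CPU count) to address the warning above.
import torch
from torch.utils.data import DataLoader, TensorDataset

train_dataset = TensorDataset(torch.randn(64, 32))
train_loader = DataLoader(train_dataset, batch_size=8, num_workers=4)
```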
@kaushikb11 There was a typo in the running command for the Boring script.
@tgisaturday Could you try the Lightning master? |
@kaushikb11 I'll try right away. I found out that the Boring script successfully runs with the `checkpoint_callback=False` flag.
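For reference, a minimal sketch of that workaround, assuming the PL 1.3.x Trainer API where `checkpoint_callback` is a Trainer argument (the other arguments here are illustrative):

```python
# Sketch: disable automatic checkpointing so Trainer.fit never calls save_checkpoint.
import pytorch_lightning as pl

trainer = pl.Trainer(
    tpu_cores=8,
    max_epochs=1,
    checkpoint_callback=False,  # workaround: skip the ModelCheckpoint callback entirely
)
```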
@tgisaturday Awesome! It should be resolved. Also, if you face any more issues, feel free to ping me on Lightning Slack!
@kaushikb11 Using Lightning master (1.4.0dev), saving checkpoints still keeps throwing errors.
I guess there is some problem with the DDP accelerator when combined with the TPU VM. I'm not sure if this is an internal TPU VM problem or a pytorch-lightning problem.
@tgisaturday I recently trained minGPT on a TPU VM Pod, and it worked as expected.
Could you try deleting the logs and training again? I recently fixed a logging issue for GCS buckets. Also, I'll take a stab at this issue shortly and will update you. We will resolve this! :)
@kaushikb11 Thank you for spending your time on this issue. I'll also try training minGPT.
@kaushikb11 I removed the logging calls (`self.log(...)`) from the Boring script and `save_checkpoint` works! It seems that logging is causing the problem.
@tgisaturday What were you logging?
I commented out every `self.log` call in the Boring script.
@kaushikb11 I've been refactoring taming-transformers to run the code on the TPU VM. Here's my code. For easier debugging, I've also added a fake_data feature; to start training with fake data, run the code with that option enabled.
The code works properly on a single TPU Node or on GPUs, but it seems to fall into a deadlock at the initial stage of training on the TPU VM. The last output is the model summary ("76.7 M Trainable params"); nothing goes further from there. Got any comments or suggestions? I'm not sure if this is a TPU VM internal problem or Lightning's.
@tgisaturday What do you mean by single TPU Node here? Single TPU core or 8 TPU cores? Also, have you tried debugging at what point it goes into a deadlock?
Regarding the single TPU Node, I meant the older way of using an 8-core TPU (assigning a CPU VM and pairing it with a TPU), not the newly released TPU VM. Is there any way I can debug my code deeper than `Trainer.fit`? When I press Ctrl+C, my code gets interrupted somewhere around the TPU spawn.
@tgisaturday Got it! Let me give it a try today.
I'm also closely interacting with GCP-side engineers. Please let me know if this is out of Lightning's scope.
Going through your script. Also, note the effective batch size is ... I'm getting this error on the first script run:
File "/home/kaushikbokka/.local/lib/python3.8/site-packages/pytorch_lightning/loggers/tensorboard.py", line 223, in log_metrics
    raise ValueError(m) from ex
ValueError:
    you tried to log -1 which is not currently supported. Try a dict or a scalar/tensor.
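The error surfaces in the TensorBoard logger's `log_metrics`; the workaround mentioned earlier in the thread was removing `self.log` calls. Purely as a hypothetical illustration (not the failing script), the kind of logging call involved looks like this:

```python
# Hypothetical example of the scalar logging calls discussed in this thread;
# these are the kind of self.log calls that were commented out as a workaround.
import torch
import pytorch_lightning as pl


class LoggingExample(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        loss = self.layer(batch).sum()
        self.log("train_loss", loss)   # 0-dim tensor
        self.log("some_scalar", 0.5)   # plain float
        return loss

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)
```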
@kaushikb11 I've resolved the single TPU VM issue with the workaround suggested in #8183. While everything is okay with a single TPU VM, I'm still trying to solve the logging issue on the TPU VM Pod. With my revised taming-transformers-tpu code, the progress bar doesn't appear at all. Since the trainer itself works, this seems to be a progress bar logging issue with distributed training on a TPU VM Pod. Any suggestions on where to start looking in the pytorch-lightning repo? Ping me on Lightning Slack if you need to.
@tgisaturday Yup, I took a look into it. The issue is that the progress bar only appears after it has finished. It's not exactly a Lightning issue but a tqdm-specific one, but we definitely need to figure it out. My guess is that it doesn't play well with tqdm. Sample script to reproduce the issue and to fix it: https://github.com/kaushikb11/minGPT/blob/master/tqdm_test.py. Would appreciate it if you could give it a look as well on how we could fix it.
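Not the linked tqdm_test.py itself, but a rough sketch of the kind of check being described: when stdout is piped through a launcher such as `xla_dist`, tqdm's carriage-return updates can be buffered so the bar only shows up once the loop finishes. The iteration count and sleep duration below are arbitrary:

```python
# Sketch: a bare tqdm loop; run it directly and then through the pod launcher
# to compare when the progress bar output actually becomes visible.
import sys
import time

from tqdm import tqdm

for _ in tqdm(range(20), file=sys.stdout):
    time.sleep(0.1)
```

Running the same script with `python -u` is one way to rule out plain stdout buffering.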
Closing this issue, as it has been resolved by #8258 :)
🐛 Bug
Please reproduce using the BoringModel
Modified BoringModel.ipynb to a .py script and added `tpu_cores=8` to the Trainer.
While the code runs successfully on a Google Cloud TPU VM v3-8, the process crashes on a Google Cloud TPU VM Pod v3-32 (not a Pod Node).
To Reproduce
Modified BoringModel.ipynb to a .py script and added `tpu_cores=8` to the Trainer (for TPU support); a minimal sketch follows.
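This sketch assumes the standard BoringModel from the Lightning bug-report template and the PL 1.3.x Trainer API; it is an approximation of the reproduction, not the attached testcode.zip:

```python
# Approximate reproduction: BoringModel-style script with tpu_cores=8.
import torch
import pytorch_lightning as pl
from torch.utils.data import DataLoader, Dataset


class RandomDataset(Dataset):
    def __init__(self, size, length):
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return len(self.data)


class BoringModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.SGD(self.layer.parameters(), lr=0.1)


if __name__ == "__main__":
    model = BoringModel()
    trainer = pl.Trainer(tpu_cores=8, max_epochs=1)  # tpu_cores=8, as described
    trainer.fit(model, DataLoader(RandomDataset(32, 64), batch_size=2))
```

On a pod, the script is then launched with `torch_xla.distributed.xla_dist`, as in the commands quoted earlier in the thread.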
Expected behavior
The script runs without crashing on v3-32.
Environment
Note: Bugs with code are solved faster! Colab Notebook should be made public!
- IDE: Please, use our python bug_report_model.py template.
- Colab Notebook: Please copy and paste the output from our environment collection script (or fill out the checklist below manually).

You can get the script and run it with:
- TPU VM Pod Software: v2-alpha
- How you installed PyTorch (`conda`, `pip`, source): built-in image in v2-alpha

Additional context
I've also been testing a simple MNIST GAN code, and the same problem appears. My custom code crashes when `Trainer.fit()` automatically tries to save checkpoints with `trainer.save_checkpoint`.
Here are test codes that I've used.
testcode.zip