
Logging issue on TPU VM Pod #7912

Closed

tgisaturday opened this issue Jun 10, 2021 · 26 comments
Assignees: kaushikb11
Labels: bug (Something isn't working), help wanted (Open to be worked on)
Milestone: v1.3.x

Comments

@tgisaturday

🐛 Bug

Please reproduce using the BoringModel

Converted BoringModel.ipynb to a .py script and added tpu_cores=8 to the Trainer.
The code runs successfully on a Google Cloud TPU VM v3-8, but the process crashes on a Google Cloud TPU VM Pod v3-32 (not a Pod Node).

To Reproduce

Convert BoringModel.ipynb to a .py script and add tpu_cores=8 to the Trainer (for TPU support); a minimal sketch is shown below.
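
A minimal sketch of that reproduction, assuming the stock BoringModel from the Lightning bug-report template (the module and dataset below are illustrative, not the exact notebook code):

import torch
import pytorch_lightning as pl
from torch.utils.data import DataLoader, Dataset

class RandomDataset(Dataset):
    def __init__(self, size=32, length=64):
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return len(self.data)

class BoringModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        loss = self.layer(batch).sum()
        self.log("train_loss", loss)
        return loss

    def validation_step(self, batch, batch_idx):
        loss = self.layer(batch).sum()
        self.log("valid_loss", loss)

    def configure_optimizers(self):
        return torch.optim.SGD(self.layer.parameters(), lr=0.1)

if __name__ == "__main__":
    model = BoringModel()
    trainer = pl.Trainer(max_epochs=1, tpu_cores=8)  # tpu_cores=8 added for TPU support
    trainer.fit(model,
                DataLoader(RandomDataset(), batch_size=8),
                DataLoader(RandomDataset(), batch_size=8))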

Expected behavior

The script runs without crashing on a v3-32.

Environment

Note: Bugs with code are solved faster! Colab Notebooks should be made public!

You can get the script and run it with:

wget https://raw.githubusercontent.com/PyTorchLightning/pytorch-lightning/master/tests/collect_env_details.py
# For security purposes, please check the contents of collect_env_details.py before running it.
python collect_env_details.py

TPU VM Pod Software: v2-alpha

  • PyTorch Version (e.g., 1.0): 1.8.1
  • OS (e.g., Linux): Ubuntu
  • How you installed PyTorch (conda, pip, source): built-in image in v2-alpha

Additional context

I've also been testing a simple MNIST GAN and the same problem appears. My custom code crashes when Trainer.fit() automatically tries to save checkpoints via trainer.save_checkpoint.
Here is the test code I've used:
testcode.zip

@tgisaturday tgisaturday added bug Something isn't working help wanted Open to be worked on labels Jun 10, 2021
@tgisaturday tgisaturday changed the title trainer.save_checkpoint fails in TPU VM Pod Trainer automatic checkpoint saving fails on TPU VM Pod Jun 10, 2021
@tgisaturday tgisaturday changed the title Trainer automatic checkpoint saving fails on TPU VM Pod Trainer.fit() fails on TPU VM Pod Jun 10, 2021
@kaushikb11
Contributor

kaushikb11 commented Jun 10, 2021

Hey there! As you are trying to run on a TPU Pod, you would need to run

python -m torch_xla.distributed.xla_dist --tpu=$TPU_NAME -- python script.py

@tgisaturday
Author

tgisaturday commented Jun 10, 2021

Hey there! As you are trying to run on a TPU Pod, you would need to run

python -m torch_xla.distributed.xla_dist --tpu=$TPU_NAME -- python script.py

@kaushikb11 I've been running the code in distributed mode. This doesn't help.

@kaushikb11
Contributor

@tgisaturday Could you provide more details? Lightning Version? Minimal example to reproduce the issue?

@kaushikb11 kaushikb11 self-assigned this Jun 10, 2021
@kaushikb11
Contributor

Also, where does it seem to be failing?

@tgisaturday
Author

tgisaturday commented Jun 10, 2021

@kaushikb11 Here is the test code I'm using: testcode.zip

I'm using pytorch-lightning 1.3.5.

python3 -m torch_xla.distributed.xla_dist --tpu=tpu-name  -- python3 gan_test_pod.py 

python3 -m torch_xla.distributed.xla_dist --tpu=tpu-name -- python3 boring.py

I'm not sure where boring.py fails, but my personal GAN code seems to fail when the Trainer automatically tries to save checkpoints (trainer.save_checkpoint).

@kaushikb11
Contributor

@tgisaturday

It should be python3 -m torch_xla.distributed.xla_dist --tpu=tpu-name -- python boring.py

The boring script should work.

@tgisaturday
Author

tgisaturday commented Jun 11, 2021

@kaushikb11 Have you ever tried it on a TPU VM v3-32? The boring script keeps throwing this error:

2021-06-11 00:34:32 10.164.0.7 [0] /usr/local/lib/python3.8/dist-packages/pytorch_lightning/utilities/distributed.py:69: UserWarning: The dataloader, train dataloader, does not have many workers which may be a bottleneck. Consider increasing the value of the num_workers argument (try 48 which is the number of cpus on this machine) in the DataLoader init to improve performance.
2021-06-11 00:34:32 10.164.0.7 [0] warnings.warn(*args, **kwargs)
2021-06-11 00:34:32 10.164.0.7 [0] /usr/local/lib/python3.8/dist-packages/pytorch_lightning/utilities/distributed.py:69: UserWarning: The dataloader, val dataloader 0, does not have many workers which may be a bottleneck. Consider increasing the value of the num_workers argument (try 48 which is the number of cpus on this machine) in the DataLoader init to improve performance.
2021-06-11 00:34:32 10.164.0.7 [0] warnings.warn(*args, **kwargs)
Epoch 0: 100%|██████████| 2/2 [00:00<00:00, 15.80it/s, loss=1.79, v_num=0]
2021-06-11 00:34:32 10.164.0.13 [3] Exception in device=TPU:24: [Errno 2] No such file or directory: '/home/taehoon.kim/lightning_logs/version_0/checkpoints/epoch=0-step=0.ckpt'
2021-06-11 00:34:32 10.164.0.22 [1] Exception in device=TPU:8: [Errno 2] No such file or directory: '/home/taehoon.kim/lightning_logs/version_0/checkpoints/epoch=0-step=0.ckpt'
2021-06-11 00:34:32 10.164.0.8 [2] Exception in device=TPU:16: [Errno 2] No such file or directory: '/home/taehoon.kim/lightning_logs/version_0/checkpoints/epoch=0-step=0.ckpt'
2021-06-11 00:34:32 10.164.0.13 [3] Traceback (most recent call last):
2021-06-11 00:34:32 10.164.0.22 [1] Traceback (most recent call last):
2021-06-11 00:34:32 10.164.0.8 [2] Traceback (most recent call last):
2021-06-11 00:34:32 10.164.0.13 [3] File "/usr/local/lib/python3.8/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 329, in _mp_start_fn
2021-06-11 00:34:32 10.164.0.22 [1] File "/usr/local/lib/python3.8/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 329, in _mp_start_fn
2021-06-11 00:34:32 10.164.0.8 [2] File "/usr/local/lib/python3.8/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 329, in _mp_start_fn
2021-06-11 00:34:32 10.164.0.13 [3] _start_fn(index, pf_cfg, fn, args)
2021-06-11 00:34:32 10.164.0.22 [1] _start_fn(index, pf_cfg, fn, args)
2021-06-11 00:34:32 10.164.0.8 [2] _start_fn(index, pf_cfg, fn, args)
2021-06-11 00:34:32 10.164.0.13 [3] File "/usr/local/lib/python3.8/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 323, in _start_fn
2021-06-11 00:34:32 10.164.0.22 [1] File "/usr/local/lib/python3.8/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 323, in _start_fn
2021-06-11 00:34:32 10.164.0.8 [2] File "/usr/local/lib/python3.8/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 323, in _start_fn
2021-06-11 00:34:32 10.164.0.13 [3] fn(gindex, *args)
2021-06-11 00:34:32 10.164.0.22 [1] fn(gindex, *args)
2021-06-11 00:34:32 10.164.0.8 [2] fn(gindex, *args)
2021-06-11 00:34:32 10.164.0.13 [3] File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/plugins/training_type/tpu_spawn.py", line 164, in new_process
2021-06-11 00:34:32 10.164.0.22 [1] File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/plugins/training_type/tpu_spawn.py", line 164, in new_process
2021-06-11 00:34:32 10.164.0.8 [2] File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/plugins/training_type/tpu_spawn.py", line 164, in new_process
2021-06-11 00:34:32 10.164.0.13 [3] results = trainer.run_stage()
2021-06-11 00:34:32 10.164.0.22 [1] results = trainer.run_stage()
2021-06-11 00:34:32 10.164.0.8 [2] results = trainer.run_stage()
2021-06-11 00:34:32 10.164.0.13 [3] File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 807, in run_stage
2021-06-11 00:34:32 10.164.0.22 [1] File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 807, in run_stage
2021-06-11 00:34:32 10.164.0.8 [2] File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 807, in run_stage
2021-06-11 00:34:32 10.164.0.13 [3] return self.run_train()
2021-06-11 00:34:32 10.164.0.22 [1] return self.run_train()
2021-06-11 00:34:32 10.164.0.8 [2] return self.run_train()
2021-06-11 00:34:32 10.164.0.13 [3] File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 869, in run_train
2021-06-11 00:34:32 10.164.0.22 [1] File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 869, in run_train
2021-06-11 00:34:32 10.164.0.8 [2] File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 869, in run_train
2021-06-11 00:34:32 10.164.0.13 [3] self.train_loop.run_training_epoch()
2021-06-11 00:34:32 10.164.0.22 [1] self.train_loop.run_training_epoch()
2021-06-11 00:34:32 10.164.0.8 [2] self.train_loop.run_training_epoch()
2021-06-11 00:34:32 10.164.0.13 [3] File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/training_loop.py", line 584, in run_training_epoch
2021-06-11 00:34:32 10.164.0.22 [1] File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/training_loop.py", line 584, in run_training_epoch
2021-06-11 00:34:32 10.164.0.8 [2] File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/training_loop.py", line 584, in run_training_epoch
2021-06-11 00:34:32 10.164.0.13 [3] self.trainer.run_evaluation(on_epoch=True)
2021-06-11 00:34:32 10.164.0.22 [1] self.trainer.run_evaluation(on_epoch=True)
2021-06-11 00:34:32 10.164.0.8 [2] self.trainer.run_evaluation(on_epoch=True)
2021-06-11 00:34:32 10.164.0.13 [3] File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 1006, in run_evaluation
2021-06-11 00:34:32 10.164.0.22 [1] File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 1006, in run_evaluation
2021-06-11 00:34:32 10.164.0.8 [2] File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 1006, in run_evaluation
2021-06-11 00:34:32 10.164.0.13 [3] self.evaluation_loop.on_evaluation_end()
2021-06-11 00:34:32 10.164.0.22 [1] self.evaluation_loop.on_evaluation_end()
2021-06-11 00:34:32 10.164.0.8 [2] self.evaluation_loop.on_evaluation_end()
2021-06-11 00:34:32 10.164.0.13 [3] File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/evaluation_loop.py", line 102, in on_evaluation_end
2021-06-11 00:34:32 10.164.0.22 [1] File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/evaluation_loop.py", line 102, in on_evaluation_end
2021-06-11 00:34:32 10.164.0.8 [2] File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/evaluation_loop.py", line 102, in on_evaluation_end
2021-06-11 00:34:32 10.164.0.13 [3] self.trainer.call_hook('on_validation_end', *args, **kwargs)
2021-06-11 00:34:32 10.164.0.22 [1] self.trainer.call_hook('on_validation_end', *args, **kwargs)
2021-06-11 00:34:32 10.164.0.8 [2] self.trainer.call_hook('on_validation_end', *args, **kwargs)
2021-06-11 00:34:32 10.164.0.13 [3] File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 1223, in call_hook
2021-06-11 00:34:32 10.164.0.22 [1] File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 1223, in call_hook
2021-06-11 00:34:32 10.164.0.8 [2] File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 1223, in call_hook
2021-06-11 00:34:32 10.164.0.13 [3] trainer_hook(*args, **kwargs)
2021-06-11 00:34:32 10.164.0.22 [1] trainer_hook(*args, **kwargs)
2021-06-11 00:34:32 10.164.0.8 [2] trainer_hook(*args, **kwargs)
2021-06-11 00:34:32 10.164.0.13 [3] File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/callback_hook.py", line 227, in on_validation_end
2021-06-11 00:34:32 10.164.0.22 [1] File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/callback_hook.py", line 227, in on_validation_end
2021-06-11 00:34:32 10.164.0.8 [2] File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/callback_hook.py", line 227, in on_validation_end
2021-06-11 00:34:32 10.164.0.13 [3] callback.on_validation_end(self, self.lightning_module)
2021-06-11 00:34:32 10.164.0.22 [1] callback.on_validation_end(self, self.lightning_module)
2021-06-11 00:34:32 10.164.0.8 [2] callback.on_validation_end(self, self.lightning_module)
2021-06-11 00:34:32 10.164.0.13 [3] File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 249, in on_validation_end
2021-06-11 00:34:32 10.164.0.22 [1] File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 249, in on_validation_end
2021-06-11 00:34:32 10.164.0.8 [2] File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 249, in on_validation_end
2021-06-11 00:34:32 10.164.0.13 [3] self.save_checkpoint(trainer)
2021-06-11 00:34:32 10.164.0.22 [1] self.save_checkpoint(trainer)
2021-06-11 00:34:32 10.164.0.8 [2] self.save_checkpoint(trainer)
2021-06-11 00:34:32 10.164.0.13 [3] File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 300, in save_checkpoint
2021-06-11 00:34:32 10.164.0.22 [1] File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 300, in save_checkpoint
2021-06-11 00:34:32 10.164.0.8 [2] File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 300, in save_checkpoint
2021-06-11 00:34:32 10.164.0.13 [3] self._save_none_monitor_checkpoint(trainer, monitor_candidates)
2021-06-11 00:34:32 10.164.0.22 [1] self._save_none_monitor_checkpoint(trainer, monitor_candidates)
2021-06-11 00:34:32 10.164.0.8 [2] self._save_none_monitor_checkpoint(trainer, monitor_candidates)
2021-06-11 00:34:32 10.164.0.13 [3] File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 680, in _save_none_monitor_checkpoint
2021-06-11 00:34:32 10.164.0.22 [1] File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 680, in _save_none_monitor_checkpoint
2021-06-11 00:34:32 10.164.0.8 [2] File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 680, in _save_none_monitor_checkpoint
2021-06-11 00:34:32 10.164.0.13 [3] self._save_model(trainer, filepath)
2021-06-11 00:34:32 10.164.0.22 [1] self._save_model(trainer, filepath)
2021-06-11 00:34:32 10.164.0.8 [2] self._save_model(trainer, filepath)
2021-06-11 00:34:32 10.164.0.13 [3] File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 449, in _save_model
2021-06-11 00:34:32 10.164.0.22 [1] File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 449, in _save_model
2021-06-11 00:34:32 10.164.0.8 [2] File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 449, in _save_model
2021-06-11 00:34:32 10.164.0.13 [3] self._do_save(trainer, filepath)
2021-06-11 00:34:32 10.164.0.22 [1] self._do_save(trainer, filepath)
2021-06-11 00:34:32 10.164.0.8 [2] self._do_save(trainer, filepath)
2021-06-11 00:34:32 10.164.0.13 [3] File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 460, in _do_save
2021-06-11 00:34:32 10.164.0.22 [1] File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 460, in _do_save
2021-06-11 00:34:32 10.164.0.8 [2] File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 460, in _do_save
2021-06-11 00:34:32 10.164.0.13 [3] trainer.save_checkpoint(filepath, self.save_weights_only)
2021-06-11 00:34:32 10.164.0.22 [1] trainer.save_checkpoint(filepath, self.save_weights_only)
2021-06-11 00:34:32 10.164.0.8 [2] trainer.save_checkpoint(filepath, self.save_weights_only)
2021-06-11 00:34:32 10.164.0.13 [3] File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/properties.py", line 330, in save_checkpoint
2021-06-11 00:34:32 10.164.0.22 [1] File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/properties.py", line 330, in save_checkpoint
2021-06-11 00:34:32 10.164.0.8 [2] File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/properties.py", line 330, in save_checkpoint
2021-06-11 00:34:32 10.164.0.13 [3] self.checkpoint_connector.save_checkpoint(filepath, weights_only)
2021-06-11 00:34:32 10.164.0.22 [1] self.checkpoint_connector.save_checkpoint(filepath, weights_only)
2021-06-11 00:34:32 10.164.0.8 [2] self.checkpoint_connector.save_checkpoint(filepath, weights_only)
2021-06-11 00:34:32 10.164.0.13 [3] FileNotFoundError: [Errno 2] No such file or directory: '/home/taehoon.kim/lightning_logs/version_0/checkpoints/epoch=0-step=0.ckpt'
2021-06-11 00:34:32 10.164.0.22 [1] FileNotFoundError: [Errno 2] No such file or directory: '/home/taehoon.kim/lightning_logs/version_0/checkpoints/epoch=0-step=0.ckpt'
2021-06-11 00:34:32 10.164.0.8 [2] FileNotFoundError: [Errno 2] No such file or directory: '/home/taehoon.kim/lightning_logs/version_0/checkpoints/epoch=0-step=0.ckpt'
2021-06-11 00:34:32 10.164.0.13 [3] https://symbolize.stripped_domain/r/?trace=7f358ea885ce,7f358e9ac20f,7f33a145ee81,7f3396fa9692,7f3396f984ea,7f3396f38b4b,7f35362c7e4a,515bd6f,2&map=c5ea6dcea9ec73900e238cf37efee14d75fd7749:7f339433b000-7f33a3ca9e28
2021-06-11 00:34:32 10.164.0.13 [3] *** SIGTERM received by PID 10397 (TID 10397) on cpu 15 from PID 9809; stack trace: ...

@tgisaturday
Author

tgisaturday commented Jun 11, 2021

@tgisaturday

It should be python3 -m torch_xla.distributed.xla_dist --tpu=tpu-name -- python boring.py

The boring script should work.

@kaushikb11 There was a typo in my run command for the boring script.

@kaushikb11
Contributor

@tgisaturday Could you try the Lightning master?

@tgisaturday
Author

tgisaturday commented Jun 11, 2021

@kaushikb11 I'll try right away. I found that the boring script runs successfully with the checkpoint_callback=False flag (sketched below).
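
A minimal sketch of that workaround, assuming the Lightning 1.3.x Trainer arguments (checkpoint_callback was still a Trainer flag in that release):

import pytorch_lightning as pl

# Disabling automatic checkpointing lets the boring script finish on the v3-32 pod.
trainer = pl.Trainer(max_epochs=1, tpu_cores=8, checkpoint_callback=False)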

@kaushikb11
Contributor

@tgisaturday Awesome! It should be resolved. Also, if you face any more issues, feel free to ping me on Lightning Slack!

@tgisaturday
Author

tgisaturday commented Jun 11, 2021

@kaushikb11 With Lightning master (1.4.0dev), saving checkpoints keeps throwing errors...

  • I first tried saving checkpoints in the current working directory. It throws a 'file exists' error.
  • Next, I tried saving checkpoints in a GCS bucket. It also throws a 'file exists' error.
  • I assigned a different root dir to each TPU worker using os.environ["CLOUD_TPU_TASK_ID"] (sketched below); the 'file exists' error was resolved, but the process still crashes with 'socket closed (14)'.

I guess there are some problems with the DDP accelerator when combined with TPU VM. I'm not sure whether this is an internal TPU VM problem or a pytorch-lightning problem.
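
A sketch of the per-worker directory workaround from the last bullet, assuming CLOUD_TPU_TASK_ID is set on each worker VM (the directory layout below is illustrative):

import os
import pytorch_lightning as pl

# Give each worker VM its own default_root_dir so checkpoint paths don't collide across hosts.
task_id = os.environ.get("CLOUD_TPU_TASK_ID", "0")
trainer = pl.Trainer(
    tpu_cores=8,
    default_root_dir=f"lightning_logs/worker_{task_id}",
)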

@kaushikb11
Contributor

kaushikb11 commented Jun 11, 2021

@tgisaturday I recently trained minGPT on a TPU VM Pod, and it worked as expected.

It throws a 'file exists' error

Could you try deleting the logs and training again? I recently fixed a logging issue for GCS buckets.

Also, I'll take a stab at this issue soon and will keep you updated. We will resolve this!! :)

@tgisaturday
Author

tgisaturday commented Jun 11, 2021

@tgisaturday I recently trained minGPT on a TPU VM Pod, and it worked as expected.

It throws a 'file exists' error

Could you try deleting the logs and training again? I recently fixed a logging issue for GCS buckets.

Also, I'll take a stab at this issue soon and will keep you updated. We will resolve this!! :)

@kaushikb11 Thank you for spending time on this issue. I'll also try training minGPT.

@tgisaturday
Author

@kaushikb11 I removed logging (self.log(...)) from the boring script and save_checkpoint works!

It seems that logging is causing the problem.
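
For reference, a hedged sketch of that change (an illustrative module in the style of the boring script, not the exact code):

import torch
import pytorch_lightning as pl

class NoLogBoringModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        loss = self.layer(batch).sum()
        # self.log("train_loss", loss)  # removed: this logging call triggered the crash on the pod
        return loss

    def configure_optimizers(self):
        return torch.optim.SGD(self.layer.parameters(), lr=0.1)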

@kaushikb11
Contributor

@tgisaturday What were you logging?

@tgisaturday
Author

@tgisaturday What were you logging?

I commented out every self.log call in the boring script.

@kaushikb11 kaushikb11 changed the title Trainer.fit() fails on TPU VM Pod Logging issue on TPU VM Pod Jun 11, 2021
@tgisaturday
Author

tgisaturday commented Jun 21, 2021

@kaushikb11 I've been refactoring taming-transformers to run on a TPU VM.

Here's my code.
taming-transformers-tpu

For easier debugging, I've also added a fake_data option. To start training with fake data, run:

pip install -r requirements.txt
python main.py --use_tpus --fake_data

The code works properly on a single TPU Node or on GPUs, but it seems to deadlock at the initial stage of training on the TPU VM.

76.7 M Trainable params
14.7 M Non-trainable params
91.5 M Total params
182.917 Total estimated model params size (MB)
Epoch 0: 0%| | 0/456 [00:00<?, ?it/s]

Nothing gets past this point. Any comments or suggestions? I'm not sure whether this is an internal TPU VM problem or a Lightning one.

@kaushikb11
Contributor

@tgisaturday What do you mean by single TPU Node here? Single TPU core or 8 TPU cores? Also, have you tried debugging at what point it goes into a deadlock?

@tgisaturday
Author

@tgisaturday What do you mean by single TPU Node here? Single TPU core or 8 TPU cores? Also, have you tried debugging at what point it goes into a deadlock?

By single TPU Node, I meant the older way of using an 8-core TPU (provisioning a separate CPU VM and pairing it with the TPU), not the newly released TPU VM.

Is there any way I can debug my code deeper than Trainer.fit()? When I press Ctrl+C, my code gets interrupted somewhere around the TPU spawn.

@kaushikb11
Contributor

@tgisaturday Got it! Let me give it a try today.

@tgisaturday
Author

@tgisaturday Got it! Let me give it a try today.

I'm also in close contact with GCP-side engineers. Please let me know if this is outside Lightning's scope.

@kaushikb11
Contributor

I'm going through your script. Also, note that the effective batch size is the per-core batch size × 8 when running on 8 cores (e.g. batch_size=32 per core gives an effective batch size of 256).

I got this error on the first script run:

  File "/home/kaushikbokka/.local/lib/python3.8/site-packages/pytorch_lightning/loggers/tensorboard.py", line 223, in log_metrics
    raise ValueError(m) from ex
ValueError:
 you tried to log -1 which is not currently supported. Try a dict or a scalar/tensor.

@tgisaturday
Author

tgisaturday commented Jul 1, 2021

@kaushikb11 I've resolved the single TPU VM issue with the workaround suggested in #8183. While everything is okay on a single TPU VM, I'm still trying to solve the logging issue on the TPU VM Pod. With my revised taming-transformers-tpu code, the progress bar doesn't appear at all. Since the trainer itself works, this seems to be a progress-bar logging issue with distributed training on a TPU VM Pod. Any suggestions on where to start looking in the pytorch-lightning repo? Ping me on the Lightning Slack if you need to.

@kaushikb11
Contributor

kaushikb11 commented Jul 1, 2021

@tgisaturday Yup, I took a look into it. The issue is that the progress bar only appears after training has finished. It's not exactly a Lightning problem but a tqdm-specific one, but we definitely need to figure it out.

Here you can see how pytorch_xla.distributed streams logs from the different VMs to the master worker:
https://github.com/pytorch/xla/blob/master/torch_xla/distributed/xla_dist.py#L140

My guess is that it doesn't play well with tqdm.

Sample script to reproduce the issue & to fix it: https://github.com/kaushikb11/minGPT/blob/master/tqdm_test.py

Would appreciate it if you could take a look as well at how we could fix it.
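
A minimal sketch of my guess at the mechanism (an illustrative reproduction, not the linked tqdm_test.py): tqdm redraws the bar in place with carriage returns and only writes a newline when the bar closes, so a forwarder that streams complete lines from each worker shows nothing until the loop finishes.

import sys
import time
from tqdm import tqdm

# Under a line-oriented log forwarder (e.g. xla_dist streaming worker output),
# these in-place updates only surface once the final newline is emitted.
for _ in tqdm(range(100), file=sys.stderr):
    time.sleep(0.05)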

@edenlightning edenlightning added this to the v1.3.x milestone Jul 1, 2021
@kaushikb11
Contributor

Closing this issue, as it has been resolved by #8258 :)
