
Logging issue on TPU VM Pod #7912

Closed

tgisaturday opened this issue Jun 10, 2021 · 26 comments
Assignees: kaushikb11
Labels: bug (Something isn't working), help wanted (Open to be worked on)
Milestone: v1.3.x

Comments

@tgisaturday

🐛 Bug

Please reproduce using the BoringModel

Converted BoringModel.ipynb to a .py script and added tpu_cores=8 to the Trainer.
The code runs successfully on a Google Cloud TPU VM v3-8, but the process crashes on a Google Cloud TPU VM Pod v3-32 (not a Pod Node).

To Reproduce

Convert BoringModel.ipynb to a .py script and add tpu_cores=8 to the Trainer (for TPU support); a minimal sketch is shown below.
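
A minimal sketch of that reproduction, assuming the stock BoringModel from the Lightning bug-report template (the module and dataset below are illustrative, not the exact notebook code):

import torch
import pytorch_lightning as pl
from torch.utils.data import DataLoader, Dataset

class RandomDataset(Dataset):
    def __init__(self, size=32, length=64):
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return len(self.data)

class BoringModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        loss = self.layer(batch).sum()
        self.log("train_loss", loss)
        return loss

    def validation_step(self, batch, batch_idx):
        loss = self.layer(batch).sum()
        self.log("valid_loss", loss)

    def configure_optimizers(self):
        return torch.optim.SGD(self.layer.parameters(), lr=0.1)

if __name__ == "__main__":
    model = BoringModel()
    trainer = pl.Trainer(max_epochs=1, tpu_cores=8)  # tpu_cores=8 added for TPU support
    trainer.fit(model,
                DataLoader(RandomDataset(), batch_size=8),
                DataLoader(RandomDataset(), batch_size=8))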

Expected behavior

The script runs without crashing on a v3-32.

Environment

Note: Bugs with code are solved faster! Colab Notebooks should be made public!

You can get the script and run it with:

wget https://raw.githubusercontent.com/PyTorchLightning/pytorch-lightning/master/tests/collect_env_details.py
# For security purposes, please check the contents of collect_env_details.py before running it.
python collect_env_details.py

TPU VM Pod Software: v2-alpha

  • PyTorch Version (e.g., 1.0): 1.8.1
  • OS (e.g., Linux): Ubuntu
  • How you installed PyTorch (conda, pip, source): built-in image in v2-alpha

Additional context

I've also been testing a simple MNIST GAN and the same problem appears. My custom code crashes when Trainer.fit() automatically tries to save checkpoints via trainer.save_checkpoint.
Here is the test code I've used:
testcode.zip

@tgisaturday tgisaturday added bug Something isn't working help wanted Open to be worked on labels Jun 10, 2021
@tgisaturday tgisaturday changed the title trainer.save_checkpoint fails in TPU VM Pod Trainer automatic checkpoint saving fails on TPU VM Pod Jun 10, 2021
@tgisaturday tgisaturday changed the title Trainer automatic checkpoint saving fails on TPU VM Pod Trainer.fit() fails on TPU VM Pod Jun 10, 2021
@kaushikb11
Contributor

kaushikb11 commented Jun 10, 2021

Hey there! As you are trying to run on a TPU Pod, you would need to run

python -m torch_xla.distributed.xla_dist --tpu=$TPU_NAME -- python script.py

@tgisaturday
Author

tgisaturday commented Jun 10, 2021

Hey there! As you are trying to run on a TPU Pod, you would need to run

python -m torch_xla.distributed.xla_dist --tpu=$TPU_NAME -- python script.py

@kaushikb11 I've been running the code in distributed mode. This doesn't help.

@kaushikb11
Contributor

@tgisaturday Could you provide more details? Lightning Version? Minimal example to reproduce the issue?

@kaushikb11 kaushikb11 self-assigned this Jun 10, 2021
@kaushikb11
Contributor

Also, where does it seem to be failing?

@tgisaturday
Author

tgisaturday commented Jun 10, 2021

@kaushikb11 Here is the test code I'm using: testcode.zip

I'm using pytorch-lightning 1.3.5.

python3 -m torch_xla.distributed.xla_dist --tpu=tpu-name  -- python3 gan_test_pod.py 

python3 -m torch_xla.distributed.xla_dist --tpu=tpu-name -- python3 boring.py

I'm not sure where boring.py fails, but my personal GAN code seems to fail when the Trainer automatically tries to save checkpoints (trainer.save_checkpoint).

@kaushikb11
Contributor

@tgisaturday

It should be python3 -m torch_xla.distributed.xla_dist --tpu=tpu-name -- python boring.py

The boring script should work.

@tgisaturday
Author

tgisaturday commented Jun 11, 2021

@kaushikb11 Have you ever tried it on a TPU VM v3-32? The boring script keeps throwing this error:

2021-06-11 00:34:32 10.164.0.7 [0] /usr/local/lib/python3.8/dist-packages/pytorch_lightning/utilities/distributed.py:69: UserWarning: The dataloader, train dataloader, does not have many workers which may be a bottleneck. Consider increasing the value of the num_workers argument (try 48 which is the number of cpus on this machine) in the DataLoader init to improve performance.
2021-06-11 00:34:32 10.164.0.7 [0] warnings.warn(*args, **kwargs)
2021-06-11 00:34:32 10.164.0.7 [0] /usr/local/lib/python3.8/dist-packages/pytorch_lightning/utilities/distributed.py:69: UserWarning: The dataloader, val dataloader 0, does not have many workers which may be a bottleneck. Consider increasing the value of the num_workers argument (try 48 which is the number of cpus on this machine) in the DataLoader init to improve performance.
2021-06-11 00:34:32 10.164.0.7 [0] warnings.warn(*args, **kwargs)
Epoch 0: 100%|██████████| 2/2 [00:00<00:00, 15.80it/s, loss=1.79, v_num=0]
2021-06-11 00:34:32 10.164.0.13 [3] Exception in device=TPU:24: [Errno 2] No such file or directory: '/home/taehoon.kim/lightning_logs/version_0/checkpoints/epoch=0-step=0.ckpt'
2021-06-11 00:34:32 10.164.0.22 [1] Exception in device=TPU:8: [Errno 2] No such file or directory: '/home/taehoon.kim/lightning_logs/version_0/checkpoints/epoch=0-step=0.ckpt'
2021-06-11 00:34:32 10.164.0.8 [2] Exception in device=TPU:16: [Errno 2] No such file or directory: '/home/taehoon.kim/lightning_logs/version_0/checkpoints/epoch=0-step=0.ckpt'
2021-06-11 00:34:32 10.164.0.13 [3] Traceback (most recent call last):
2021-06-11 00:34:32 10.164.0.22 [1] Traceback (most recent call last):
2021-06-11 00:34:32 10.164.0.8 [2] Traceback (most recent call last):
2021-06-11 00:34:32 10.164.0.13 [3] File "/usr/local/lib/python3.8/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 329, in _mp_start_fn
2021-06-11 00:34:32 10.164.0.22 [1] File "/usr/local/lib/python3.8/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 329, in _mp_start_fn
2021-06-11 00:34:32 10.164.0.8 [2] File "/usr/local/lib/python3.8/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 329, in _mp_start_fn
2021-06-11 00:34:32 10.164.0.13 [3] _start_fn(index, pf_cfg, fn, args)
2021-06-11 00:34:32 10.164.0.22 [1] _start_fn(index, pf_cfg, fn, args)
2021-06-11 00:34:32 10.164.0.8 [2] _start_fn(index, pf_cfg, fn, args)
2021-06-11 00:34:32 10.164.0.13 [3] File "/usr/local/lib/python3.8/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 323, in _start_fn
2021-06-11 00:34:32 10.164.0.22 [1] File "/usr/local/lib/python3.8/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 323, in _start_fn
2021-06-11 00:34:32 10.164.0.8 [2] File "/usr/local/lib/python3.8/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 323, in _start_fn
2021-06-11 00:34:32 10.164.0.13 [3] fn(gindex, *args)
2021-06-11 00:34:32 10.164.0.22 [1] fn(gindex, *args)
2021-06-11 00:34:32 10.164.0.8 [2] fn(gindex, *args)
2021-06-11 00:34:32 10.164.0.13 [3] File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/plugins/training_type/tpu_spawn.py", line 164, in new_process
2021-06-11 00:34:32 10.164.0.22 [1] File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/plugins/training_type/tpu_spawn.py", line 164, in new_process
2021-06-11 00:34:32 10.164.0.8 [2] File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/plugins/training_type/tpu_spawn.py", line 164, in new_process
2021-06-11 00:34:32 10.164.0.13 [3] results = trainer.run_stage()
2021-06-11 00:34:32 10.164.0.22 [1] results = trainer.run_stage()
2021-06-11 00:34:32 10.164.0.8 [2] results = trainer.run_stage()
2021-06-11 00:34:32 10.164.0.13 [3] File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 807, in run_stage
2021-06-11 00:34:32 10.164.0.22 [1] File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 807, in run_stage
2021-06-11 00:34:32 10.164.0.8 [2] File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 807, in run_stage
2021-06-11 00:34:32 10.164.0.13 [3] return self.run_train()
2021-06-11 00:34:32 10.164.0.22 [1] return self.run_train()
2021-06-11 00:34:32 10.164.0.8 [2] return self.run_train()
2021-06-11 00:34:32 10.164.0.13 [3] File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 869, in run_train
2021-06-11 00:34:32 10.164.0.22 [1] File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 869, in run_train
2021-06-11 00:34:32 10.164.0.8 [2] File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 869, in run_train
2021-06-11 00:34:32 10.164.0.13 [3] self.train_loop.run_training_epoch()
2021-06-11 00:34:32 10.164.0.22 [1] self.train_loop.run_training_epoch()
2021-06-11 00:34:32 10.164.0.8 [2] self.train_loop.run_training_epoch()
2021-06-11 00:34:32 10.164.0.13 [3] File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/training_loop.py", line 584, in run_training_epoch
2021-06-11 00:34:32 10.164.0.22 [1] File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/training_loop.py", line 584, in run_training_epoch
2021-06-11 00:34:32 10.164.0.8 [2] File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/training_loop.py", line 584, in run_training_epoch
2021-06-11 00:34:32 10.164.0.13 [3] self.trainer.run_evaluation(on_epoch=True)
2021-06-11 00:34:32 10.164.0.22 [1] self.trainer.run_evaluation(on_epoch=True)
2021-06-11 00:34:32 10.164.0.8 [2] self.trainer.run_evaluation(on_epoch=True)
2021-06-11 00:34:32 10.164.0.13 [3] File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 1006, in run_evaluation
2021-06-11 00:34:32 10.164.0.22 [1] File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 1006, in run_evaluation
2021-06-11 00:34:32 10.164.0.8 [2] File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 1006, in run_evaluation
2021-06-11 00:34:32 10.164.0.13 [3] self.evaluation_loop.on_evaluation_end()
2021-06-11 00:34:32 10.164.0.22 [1] self.evaluation_loop.on_evaluation_end()
2021-06-11 00:34:32 10.164.0.8 [2] self.evaluation_loop.on_evaluation_end()
2021-06-11 00:34:32 10.164.0.13 [3] File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/evaluation_loop.py", line 102, in on_evaluation_end
2021-06-11 00:34:32 10.164.0.22 [1] File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/evaluation_loop.py", line 102, in on_evaluation_end
2021-06-11 00:34:32 10.164.0.8 [2] File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/evaluation_loop.py", line 102, in on_evaluation_end
2021-06-11 00:34:32 10.164.0.13 [3] self.trainer.call_hook('on_validation_end', *args, **kwargs)
2021-06-11 00:34:32 10.164.0.22 [1] self.trainer.call_hook('on_validation_end', *args, **kwargs)
2021-06-11 00:34:32 10.164.0.8 [2] self.trainer.call_hook('on_validation_end', *args, **kwargs)
2021-06-11 00:34:32 10.164.0.13 [3] File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 1223, in call_hook
2021-06-11 00:34:32 10.164.0.22 [1] File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 1223, in call_hook
2021-06-11 00:34:32 10.164.0.8 [2] File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 1223, in call_hook
2021-06-11 00:34:32 10.164.0.13 [3] trainer_hook(*args, **kwargs)
2021-06-11 00:34:32 10.164.0.22 [1] trainer_hook(*args, **kwargs)
2021-06-11 00:34:32 10.164.0.8 [2] trainer_hook(*args, **kwargs)
2021-06-11 00:34:32 10.164.0.13 [3] File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/callback_hook.py", line 227, in on_validation_end
2021-06-11 00:34:32 10.164.0.22 [1] File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/callback_hook.py", line 227, in on_validation_end
2021-06-11 00:34:32 10.164.0.8 [2] File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/callback_hook.py", line 227, in on_validation_end
2021-06-11 00:34:32 10.164.0.13 [3] callback.on_validation_end(self, self.lightning_module)
2021-06-11 00:34:32 10.164.0.22 [1] callback.on_validation_end(self, self.lightning_module)
2021-06-11 00:34:32 10.164.0.8 [2] callback.on_validation_end(self, self.lightning_module)
2021-06-11 00:34:32 10.164.0.13 [3] File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 249, in on_validation_end
2021-06-11 00:34:32 10.164.0.22 [1] File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 249, in on_validation_end
2021-06-11 00:34:32 10.164.0.8 [2] File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 249, in on_validation_end
2021-06-11 00:34:32 10.164.0.13 [3] self.save_checkpoint(trainer)
2021-06-11 00:34:32 10.164.0.22 [1] self.save_checkpoint(trainer)
2021-06-11 00:34:32 10.164.0.8 [2] self.save_checkpoint(trainer)
2021-06-11 00:34:32 10.164.0.13 [3] File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 300, in save_checkpoint
2021-06-11 00:34:32 10.164.0.22 [1] File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 300, in save_checkpoint
2021-06-11 00:34:32 10.164.0.8 [2] File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 300, in save_checkpoint
2021-06-11 00:34:32 10.164.0.13 [3] self._save_none_monitor_checkpoint(trainer, monitor_candidates)
2021-06-11 00:34:32 10.164.0.22 [1] self._save_none_monitor_checkpoint(trainer, monitor_candidates)
2021-06-11 00:34:32 10.164.0.8 [2] self._save_none_monitor_checkpoint(trainer, monitor_candidates)
2021-06-11 00:34:32 10.164.0.13 [3] File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 680, in _save_none_monitor_checkpoint
2021-06-11 00:34:32 10.164.0.22 [1] File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 680, in _save_none_monitor_checkpoint
2021-06-11 00:34:32 10.164.0.8 [2] File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 680, in _save_none_monitor_checkpoint
2021-06-11 00:34:32 10.164.0.13 [3] self._save_model(trainer, filepath)
2021-06-11 00:34:32 10.164.0.22 [1] self._save_model(trainer, filepath)
2021-06-11 00:34:32 10.164.0.8 [2] self._save_model(trainer, filepath)
2021-06-11 00:34:32 10.164.0.13 [3] File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 449, in _save_model
2021-06-11 00:34:32 10.164.0.22 [1] File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 449, in _save_model
2021-06-11 00:34:32 10.164.0.8 [2] File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 449, in _save_model
2021-06-11 00:34:32 10.164.0.13 [3] self._do_save(trainer, filepath)
2021-06-11 00:34:32 10.164.0.22 [1] self._do_save(trainer, filepath)
2021-06-11 00:34:32 10.164.0.8 [2] self._do_save(trainer, filepath)
2021-06-11 00:34:32 10.164.0.13 [3] File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 460, in _do_save
2021-06-11 00:34:32 10.164.0.22 [1] File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 460, in _do_save
2021-06-11 00:34:32 10.164.0.8 [2] File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 460, in _do_save
2021-06-11 00:34:32 10.164.0.13 [3] trainer.save_checkpoint(filepath, self.save_weights_only)
2021-06-11 00:34:32 10.164.0.22 [1] trainer.save_checkpoint(filepath, self.save_weights_only)
2021-06-11 00:34:32 10.164.0.8 [2] trainer.save_checkpoint(filepath, self.save_weights_only)
2021-06-11 00:34:32 10.164.0.13 [3] File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/properties.py", line 330, in save_checkpoint
2021-06-11 00:34:32 10.164.0.22 [1] File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/properties.py", line 330, in save_checkpoint
2021-06-11 00:34:32 10.164.0.8 [2] File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/properties.py", line 330, in save_checkpoint
2021-06-11 00:34:32 10.164.0.13 [3] self.checkpoint_connector.save_checkpoint(filepath, weights_only)
2021-06-11 00:34:32 10.164.0.22 [1] self.checkpoint_connector.save_checkpoint(filepath, weights_only)
2021-06-11 00:34:32 10.164.0.8 [2] self.checkpoint_connector.save_checkpoint(filepath, weights_only)
2021-06-11 00:34:32 10.164.0.13 [3] FileNotFoundError: [Errno 2] No such file or directory: '/home/taehoon.kim/lightning_logs/version_0/checkpoints/epoch=0-step=0.ckpt'
2021-06-11 00:34:32 10.164.0.22 [1] FileNotFoundError: [Errno 2] No such file or directory: '/home/taehoon.kim/lightning_logs/version_0/checkpoints/epoch=0-step=0.ckpt'
2021-06-11 00:34:32 10.164.0.8 [2] FileNotFoundError: [Errno 2] No such file or directory: '/home/taehoon.kim/lightning_logs/version_0/checkpoints/epoch=0-step=0.ckpt'
2021-06-11 00:34:32 10.164.0.13 [3] https://symbolize.stripped_domain/r/?trace=7f358ea885ce,7f358e9ac20f,7f33a145ee81,7f3396fa9692,7f3396f984ea,7f3396f38b4b,7f35362c7e4a,515bd6f,2&map=c5ea6dcea9ec73900e238cf37efee14d75fd7749:7f339433b000-7f33a3ca9e28
2021-06-11 00:34:32 10.164.0.13 [3] *** SIGTERM received by PID 10397 (TID 10397) on cpu 15 from PID 9809; stack trace: ...

@tgisaturday
Author

tgisaturday commented Jun 11, 2021

@tgisaturday

It should be python3 -m torch_xla.distributed.xla_dist --tpu=tpu-name -- python boring.py

The boring script should work.

@kaushikb11 There was a typo in my run command for the boring script.

@kaushikb11
Contributor

@tgisaturday Could you try the Lightning master?

@tgisaturday
Author

tgisaturday commented Jun 11, 2021

@kaushikb11 I'll try right away. I found that the boring script runs successfully with the checkpoint_callback=False flag (sketched below).
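
A minimal sketch of that workaround, assuming the Lightning 1.3.x Trainer arguments (checkpoint_callback was still a Trainer flag in that release):

import pytorch_lightning as pl

# Disabling automatic checkpointing lets the boring script finish on the v3-32 pod.
trainer = pl.Trainer(max_epochs=1, tpu_cores=8, checkpoint_callback=False)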

@kaushikb11
Contributor

@tgisaturday Awesome! It should be resolved. Also, if you face any more issues, feel free to ping me on Lightning Slack!

@tgisaturday
Author

tgisaturday commented Jun 11, 2021

@kaushikb11 With Lightning master (1.4.0dev), saving checkpoints keeps throwing errors...

  • I first tried saving checkpoints in the current working directory. It throws a 'file exists' error.
  • Next, I tried saving checkpoints in a GCS bucket. It also throws a 'file exists' error.
  • I assigned a different root dir to each TPU worker using os.environ["CLOUD_TPU_TASK_ID"] (sketched below); the 'file exists' error was resolved, but the process still crashes with 'socket closed (14)'.

I guess there are some problems with the DDP accelerator when combined with TPU VM. I'm not sure whether this is an internal TPU VM problem or a pytorch-lightning problem.
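
A sketch of the per-worker directory workaround from the last bullet, assuming CLOUD_TPU_TASK_ID is set on each worker VM (the directory layout below is illustrative):

import os
import pytorch_lightning as pl

# Give each worker VM its own default_root_dir so checkpoint paths don't collide across hosts.
task_id = os.environ.get("CLOUD_TPU_TASK_ID", "0")
trainer = pl.Trainer(
    tpu_cores=8,
    default_root_dir=f"lightning_logs/worker_{task_id}",
)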

@kaushikb11
Contributor

kaushikb11 commented Jun 11, 2021

@tgisaturday I recently trained minGPT on a TPU VM Pod, and it worked as expected.

It throws a 'file exists' error

Could you try deleting the logs and training again? I recently fixed a logging issue for GCS buckets.

Also, I'll take a stab at this issue soon and will keep you updated. We will resolve this!! :)

@tgisaturday
Author

tgisaturday commented Jun 11, 2021

@tgisaturday I recently trained minGPT on a TPU VM Pod, and it worked as expected.

It throws a 'file exists' error

Could you try deleting the logs and training again? I recently fixed a logging issue for GCS buckets.

Also, I'll take a stab at this issue soon and will keep you updated. We will resolve this!! :)

@kaushikb11 Thank you for spending time on this issue. I'll also try training minGPT.

@tgisaturday
Author

@kaushikb11 I removed logging (self.log(...)) from the boring script and save_checkpoint works!

It seems that logging is causing the problem.
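
For reference, a hedged sketch of that change (an illustrative module in the style of the boring script, not the exact code):

import torch
import pytorch_lightning as pl

class NoLogBoringModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        loss = self.layer(batch).sum()
        # self.log("train_loss", loss)  # removed: this logging call triggered the crash on the pod
        return loss

    def configure_optimizers(self):
        return torch.optim.SGD(self.layer.parameters(), lr=0.1)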

@kaushikb11
Contributor

@tgisaturday What were you logging?

@tgisaturday
Author

@tgisaturday What were you logging?

I commented out every self.log call in the boring script.

@kaushikb11 kaushikb11 changed the title Trainer.fit() fails on TPU VM Pod Logging issue on TPU VM Pod Jun 11, 2021
@tgisaturday
Author

tgisaturday commented Jun 21, 2021

@kaushikb11 I've been refactoring taming-transformers to run on a TPU VM.

Here's my code.
taming-transformers-tpu

For easier debugging, I've also added a fake_data option. To start training with fake data, run:

pip install -r requirements.txt
python main.py --use_tpus --fake_data

The code works properly on a single TPU Node or on GPUs, but it seems to deadlock at the initial stage of training on the TPU VM.

76.7 M Trainable params
14.7 M Non-trainable params
91.5 M Total params
182.917 Total estimated model params size (MB)
Epoch 0: 0%| | 0/456 [00:00<?, ?it/s]

Nothing gets past this point. Any comments or suggestions? I'm not sure whether this is an internal TPU VM problem or a Lightning one.

@kaushikb11
Contributor

@tgisaturday What do you mean by single TPU Node here? Single TPU core or 8 TPU cores? Also, have you tried debugging at what point it goes into a deadlock?

@tgisaturday
Author

@tgisaturday What do you mean by single TPU Node here? Single TPU core or 8 TPU cores? Also, have you tried debugging at what point it goes into a deadlock?

By single TPU Node, I meant the older way of using an 8-core TPU (provisioning a separate CPU VM and pairing it with the TPU), not the newly released TPU VM.

Is there any way I can debug my code deeper than Trainer.fit()? When I press Ctrl+C, my code gets interrupted somewhere around the TPU spawn.

@kaushikb11
Contributor

@tgisaturday Got it! Let me give it a try today.

@tgisaturday
Author

@tgisaturday Got it! Let me give it a try today.

I'm also in close contact with GCP-side engineers. Please let me know if this is outside Lightning's scope.

@kaushikb11
Contributor

I'm going through your script. Also, note that the effective batch size is the per-core batch size × 8 when running on 8 cores (e.g. batch_size=32 per core gives an effective batch size of 256).

I got this error on the first script run:

  File "/home/kaushikbokka/.local/lib/python3.8/site-packages/pytorch_lightning/loggers/tensorboard.py", line 223, in log_metrics
    raise ValueError(m) from ex
ValueError:
 you tried to log -1 which is not currently supported. Try a dict or a scalar/tensor.

@tgisaturday
Author

tgisaturday commented Jul 1, 2021

@kaushikb11 I've resolved the single TPU VM issue with the workaround suggested in #8183. While everything is okay on a single TPU VM, I'm still trying to solve the logging issue on the TPU VM Pod. With my revised taming-transformers-tpu code, the progress bar doesn't appear at all. Since the trainer itself works, this seems to be a progress-bar logging issue with distributed training on a TPU VM Pod. Any suggestions on where to start looking in the pytorch-lightning repo? Ping me on the Lightning Slack if you need to.

@kaushikb11
Contributor

kaushikb11 commented Jul 1, 2021

@tgisaturday Yup, I took a look into it. The issue is that the progress bar only appears after training has finished. It's not exactly a Lightning problem but a tqdm-specific one, but we definitely need to figure it out.

Here you can see how pytorch_xla.distributed streams logs from the different VMs to the master worker:
https://github.com/pytorch/xla/blob/master/torch_xla/distributed/xla_dist.py#L140

My guess is that it doesn't play well with tqdm.

Sample script to reproduce the issue & to fix it: https://github.com/kaushikb11/minGPT/blob/master/tqdm_test.py

Would appreciate it if you could take a look as well at how we could fix it.
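
A minimal sketch of my guess at the mechanism (an illustrative reproduction, not the linked tqdm_test.py): tqdm redraws the bar in place with carriage returns and only writes a newline when the bar closes, so a forwarder that streams complete lines from each worker shows nothing until the loop finishes.

import sys
import time
from tqdm import tqdm

# Under a line-oriented log forwarder (e.g. xla_dist streaming worker output),
# these in-place updates only surface once the final newline is emitted.
for _ in tqdm(range(100), file=sys.stderr):
    time.sleep(0.05)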

@edenlightning edenlightning added this to the v1.3.x milestone Jul 1, 2021
@kaushikb11
Contributor

Closing this issue, as it has been resolved by #8258 :)
