
Multi GPU Training Problem #38

Closed

yufeng9819 opened this issue Apr 5, 2023 · 2 comments


yufeng9819 commented Apr 5, 2023

Hey! Thanks for your wonderful work again, @ZENGXH.

But now I have run into another problem: the training process seems unstable, and I would like to understand why.

I am training the VAE model on all categories (bash ./scripts/train_vae_all.sh) with a batch size of 12 on 8 V100 16GB GPUs.

At the start of training, the loss decreases (starting from around 167):
```
2023-04-05 14:08:18.381 | INFO | trainers.base_trainer:train_epochs:219 - [R0] | E0 iter[ 53/372] | [Loss] 167.43 | [exp] ../exp/0405/all/7d8f96h_hvae_lion_B12 | [step] 53 | [url] none

2023-04-05 14:09:18.824 | INFO | trainers.base_trainer:train_epochs:219 - [R0] | E0 iter[129/372] | [Loss] 88.55 | [exp] ../exp/0405/all/7d8f96h_hvae_lion_B12 | [step] 129 | [url] none

2023-04-05 14:10:19.573 | INFO | trainers.base_trainer:train_epochs:219 - [R0] | E0 iter[205/372] | [Loss] 62.51 | [exp] ../exp/0405/all/7d8f96h_hvae_lion_B12 | [step] 205 | [url] none

2023-04-05 14:11:19.673 | INFO | trainers.base_trainer:train_epochs:219 - [R0] | E0 iter[280/372] | [Loss] 49.84 | [exp] ../exp/0405/all/7d8f96h_hvae_lion_B12 | [step] 280 | [url] none

2023-04-05 14:12:20.123 | INFO | trainers.base_trainer:train_epochs:219 - [R0] | E0 iter[355/372] | [Loss] 42.31 | [exp] ../exp/0405/all/7d8f96h_hvae_lion_B12 | [step] 355 | [url] none
```

However, the loss starts to increase after it decreases to around 14.

```
2023-04-05 14:13:33.545 | INFO | trainers.base_trainer:train_epochs:219 - [R0] | E1 iter[ 71/372] | [Loss] 14.14 | [exp] ../exp/0405/all/7d8f96h_hvae_lion_B12 | [step] 443 | [url] none

2023-04-05 14:14:34.110 | INFO | trainers.base_trainer:train_epochs:219 - [R0] | E1 iter[147/372] | [Loss] 14.47 | [exp] ../exp/0405/all/7d8f96h_hvae_lion_B12 | [step] 519 | [url] none

2023-04-05 14:15:34.777 | INFO | trainers.base_trainer:train_epochs:219 - [R0] | E1 iter[223/372] | [Loss] 14.91 | [exp] ../exp/0405/all/7d8f96h_hvae_lion_B12 | [step] 595 | [url] none

2023-04-05 14:16:35.255 | INFO | trainers.base_trainer:train_epochs:219 - [R0] | E1 iter[299/372] | [Loss] 15.37 | [exp] ../exp/0405/all/7d8f96h_hvae_lion_B12 | [step] 671 | [url] none

2023-04-05 14:17:32.903 | INFO | trainers.base_trainer:train_epochs:256 - [R0] | E1 iter[371/372] | [Loss] 15.86 | [exp] ../exp/0405/all/7d8f96h_hvae_lion_B12 | [step] 743 | [url] none | [time] 5.0m (~665h) |[best] 0 -100.000x1e-2

2023-04-05 14:18:32.966 | INFO | trainers.base_trainer:train_epochs:219 - [R0] | E2 iter[ 70/372] | [Loss] 19.02 | [exp] ../exp/0405/all/7d8f96h_hvae_lion_B12 | [step] 814 | [url] none

2023-04-05 14:19:33.599 | INFO | trainers.base_trainer:train_epochs:219 - [R0] | E2 iter[144/372] | [Loss] 19.69 | [exp] ../exp/0405/all/7d8f96h_hvae_lion_B12 | [step] 888 | [url] none

2023-04-05 14:20:34.311 | INFO | trainers.base_trainer:train_epochs:219 - [R0] | E2 iter[217/372] | [Loss] 20.36 | [exp] ../exp/0405/all/7d8f96h_hvae_lion_B12 | [step] 961 | [url] none

2023-04-05 14:21:34.365 | INFO | trainers.base_trainer:train_epochs:219 - [R0] | E2 iter[290/372] | [Loss] 21.03 | [exp] ../exp/0405/all/7d8f96h_hvae_lion_B12 | [step] 1034 | [url] none

2023-04-05 14:22:35.093 | INFO | trainers.base_trainer:train_epochs:219 - [R0] | E2 iter[364/372] | [Loss] 21.72 | [exp] ../exp/0405/all/7d8f96h_hvae_lion_B12 | [step] 1108 | [url] none

2023-04-05 14:22:41.203 | INFO | trainers.base_trainer:train_epochs:256 - [R0] | E2 iter[371/372] | [Loss] 21.78 | [exp] ../exp/0405/all/7d8f96h_hvae_lion_B12 | [step] 1115 | [url] none | [time] 5.1m (~684h) |[best] 0 -100.000x1e-2

2023-04-05 14:23:41.649 | INFO | trainers.base_trainer:train_epochs:219 - [R0] | E3 iter[ 72/372] | [Loss] 25.93 | [exp] ../exp/0405/all/7d8f96h_hvae_lion_B12 | [step] 1188 | [url] none

```

I would like to know why the training process is so unstable and how to fix this problem.

Looking forward to your reply!

ZENGXH (Collaborator) commented Apr 5, 2023

This is the log of my previous experiment on 55 classes:
[screenshot: training loss curve from the 55-class experiment]

It also shows similar behavior, so I think this is expected. As for the reason why the loss increases, you can check here.

I think in the very early iterations the KL weight is relatively small, so the model focuses mostly on optimizing the reconstruction loss and the overall loss decreases. After some steps, the KL term starts to dominate the overall loss, so the overall loss tends to increase.
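For intuition, here is a minimal, self-contained sketch of that effect. This is not the schedule used in the repo; the function name `kl_weight`, the annealing length, and the toy loss curves are all made up for illustration. It only shows how a ramping KL weight can make the reported total loss rise even while the reconstruction term keeps improving:

```python
import numpy as np

def kl_weight(step, total_anneal_steps=10_000, max_weight=1.0):
    """Hypothetical linear KL annealing: weight ramps from 0 to max_weight."""
    return max_weight * min(step / total_anneal_steps, 1.0)

# Toy curves: reconstruction loss improves quickly, KL shrinks much more slowly.
steps = np.arange(0, 2_000, 100)
recon = 150.0 * np.exp(-steps / 300.0) + 10.0   # falls fast, then plateaus
kl    = 500.0 * np.exp(-steps / 5_000.0)        # stays large for a long time

for s, r, k in zip(steps, recon, kl):
    total = r + kl_weight(s) * k                # total loss actually logged
    print(f"step {s:5d} | recon {r:7.2f} | kl_w {kl_weight(s):.3f} | total {total:7.2f}")
```

With these toy numbers the total loss bottoms out and then climbs as the KL weight ramps up, which matches the pattern in the log above even though the model itself is still improving.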

yufeng9819 (Author) commented:

I got it!

Thanks for your reply!
