
Multi GPU Training Problem #38

Closed

yufeng9819 opened this issue Apr 5, 2023 · 2 comments


yufeng9819 commented Apr 5, 2023

Hey! Thanks for your wonderful work again, @ZENGXH.

But now I have run into another problem: the training process seems unstable, and I would like to understand why.

I am training the VAE model on all categories (bash ./scripts/train_vae_all.sh) with a batch size of 12 on 8 V100 16GB GPUs.

At the start of training, the loss decreases (starting from around 167):
```
2023-04-05 14:08:18.381 | INFO | trainers.base_trainer:train_epochs:219 - [R0] | E0 iter[ 53/372] | [Loss] 167.43 | [exp] ../exp/0405/all/7d8f96h_hvae_lion_B12 | [step] 53 | [url] none

2023-04-05 14:09:18.824 | INFO | trainers.base_trainer:train_epochs:219 - [R0] | E0 iter[129/372] | [Loss] 88.55 | [exp] ../exp/0405/all/7d8f96h_hvae_lion_B12 | [step] 129 | [url] none

2023-04-05 14:10:19.573 | INFO | trainers.base_trainer:train_epochs:219 - [R0] | E0 iter[205/372] | [Loss] 62.51 | [exp] ../exp/0405/all/7d8f96h_hvae_lion_B12 | [step] 205 | [url] none

2023-04-05 14:11:19.673 | INFO | trainers.base_trainer:train_epochs:219 - [R0] | E0 iter[280/372] | [Loss] 49.84 | [exp] ../exp/0405/all/7d8f96h_hvae_lion_B12 | [step] 280 | [url] none

2023-04-05 14:12:20.123 | INFO | trainers.base_trainer:train_epochs:219 - [R0] | E0 iter[355/372] | [Loss] 42.31 | [exp] ../exp/0405/all/7d8f96h_hvae_lion_B12 | [step] 355 | [url] none
```

However, the loss starts to increase after it decreases to around 14.

```
2023-04-05 14:13:33.545 | INFO | trainers.base_trainer:train_epochs:219 - [R0] | E1 iter[ 71/372] | [Loss] 14.14 | [exp] ../exp/0405/all/7d8f96h_hvae_lion_B12 | [step] 443 | [url] none

2023-04-05 14:14:34.110 | INFO | trainers.base_trainer:train_epochs:219 - [R0] | E1 iter[147/372] | [Loss] 14.47 | [exp] ../exp/0405/all/7d8f96h_hvae_lion_B12 | [step] 519 | [url] none

2023-04-05 14:15:34.777 | INFO | trainers.base_trainer:train_epochs:219 - [R0] | E1 iter[223/372] | [Loss] 14.91 | [exp] ../exp/0405/all/7d8f96h_hvae_lion_B12 | [step] 595 | [url] none

2023-04-05 14:16:35.255 | INFO | trainers.base_trainer:train_epochs:219 - [R0] | E1 iter[299/372] | [Loss] 15.37 | [exp] ../exp/0405/all/7d8f96h_hvae_lion_B12 | [step] 671 | [url] none

2023-04-05 14:17:32.903 | INFO | trainers.base_trainer:train_epochs:256 - [R0] | E1 iter[371/372] | [Loss] 15.86 | [exp] ../exp/0405/all/7d8f96h_hvae_lion_B12 | [step] 743 | [url] none | [time] 5.0m (~665h) |[best] 0 -100.000x1e-2

2023-04-05 14:18:32.966 | INFO | trainers.base_trainer:train_epochs:219 - [R0] | E2 iter[ 70/372] | [Loss] 19.02 | [exp] ../exp/0405/all/7d8f96h_hvae_lion_B12 | [step] 814 | [url] none

2023-04-05 14:19:33.599 | INFO | trainers.base_trainer:train_epochs:219 - [R0] | E2 iter[144/372] | [Loss] 19.69 | [exp] ../exp/0405/all/7d8f96h_hvae_lion_B12 | [step] 888 | [url] none

2023-04-05 14:20:34.311 | INFO | trainers.base_trainer:train_epochs:219 - [R0] | E2 iter[217/372] | [Loss] 20.36 | [exp] ../exp/0405/all/7d8f96h_hvae_lion_B12 | [step] 961 | [url] none

2023-04-05 14:21:34.365 | INFO | trainers.base_trainer:train_epochs:219 - [R0] | E2 iter[290/372] | [Loss] 21.03 | [exp] ../exp/0405/all/7d8f96h_hvae_lion_B12 | [step] 1034 | [url] none

2023-04-05 14:22:35.093 | INFO | trainers.base_trainer:train_epochs:219 - [R0] | E2 iter[364/372] | [Loss] 21.72 | [exp] ../exp/0405/all/7d8f96h_hvae_lion_B12 | [step] 1108 | [url] none

2023-04-05 14:22:41.203 | INFO | trainers.base_trainer:train_epochs:256 - [R0] | E2 iter[371/372] | [Loss] 21.78 | [exp] ../exp/0405/all/7d8f96h_hvae_lion_B12 | [step] 1115 | [url] none | [time] 5.1m (~684h) |[best] 0 -100.000x1e-2

2023-04-05 14:23:41.649 | INFO | trainers.base_trainer:train_epochs:219 - [R0] | E3 iter[ 72/372] | [Loss] 25.93 | [exp] ../exp/0405/all/7d8f96h_hvae_lion_B12 | [step] 1188 | [url] none

```

I would like to know why the training process is so unstable and how to fix this problem.

Looking forward to your reply!

ZENGXH (Collaborator) commented Apr 5, 2023

This is the log of my previous experiment on 55 classes:
[screenshot: training loss curve from the 55-class experiment]

It also shows similar behavior, so I think this is expected. As for the reason why the loss increases, you can check here.

I think in the very early iterations the KL weight is relatively small, so the model focuses mostly on optimizing the reconstruction loss and the overall loss decreases. After some steps, the KL term starts to dominate the overall loss, so the overall loss tends to increase.
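For intuition, here is a minimal, self-contained sketch of that effect. This is not the schedule used in the repo; the function name `kl_weight`, the annealing length, and the toy loss curves are all made up for illustration. It only shows how a ramping KL weight can make the reported total loss rise even while the reconstruction term keeps improving:

```python
import numpy as np

def kl_weight(step, total_anneal_steps=10_000, max_weight=1.0):
    """Hypothetical linear KL annealing: weight ramps from 0 to max_weight."""
    return max_weight * min(step / total_anneal_steps, 1.0)

# Toy curves: reconstruction loss improves quickly, KL shrinks much more slowly.
steps = np.arange(0, 2_000, 100)
recon = 150.0 * np.exp(-steps / 300.0) + 10.0   # falls fast, then plateaus
kl    = 500.0 * np.exp(-steps / 5_000.0)        # stays large for a long time

for s, r, k in zip(steps, recon, kl):
    total = r + kl_weight(s) * k                # total loss actually logged
    print(f"step {s:5d} | recon {r:7.2f} | kl_w {kl_weight(s):.3f} | total {total:7.2f}")
```

With these toy numbers the total loss bottoms out and then climbs as the KL weight ramps up, which matches the pattern in the log above even though the model itself is still improving.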

yufeng9819 (Author) commented:

I got it!

Thanks for your reply!
