Here is the log from my previous experiments on 55 classes: it shows similar behavior, so I think this is expected. As for the reason why the loss increases, you can check here.
At the very early iterations the KL weight is relatively small, so the model focuses more on optimizing the reconstruction loss and the overall loss decreases. After some steps, the KL loss starts to dominate the overall loss, so the overall loss tends to increase.
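For intuition, here is a minimal sketch of that effect (this is not the LION training code; the linear warm-up schedule and all numbers below are invented for illustration): when the KL weight is annealed up from near zero, the total loss can rise even while the reconstruction term keeps improving.

```python
# Toy illustration of a KL-weight warm-up (beta annealing) in a VAE loss.
# The schedule and the numbers are made up for illustration; they are not
# the values used by LION's training scripts.

def kl_weight(step, warmup_steps=50_000, max_beta=0.5):
    """Linearly anneal the KL weight from 0 up to max_beta."""
    return max_beta * min(step / warmup_steps, 1.0)

for step in (100, 1_000, 10_000, 50_000):
    recon = 200.0 / (1.0 + step / 500.0)  # reconstruction loss keeps shrinking
    kl = 40.0                             # KL term stays roughly constant here
    total = recon + kl_weight(step) * kl
    print(f"step {step:>6d}: recon={recon:7.2f}  beta={kl_weight(step):.3f}  total={total:7.2f}")
```

With this toy schedule the total loss first drops (driven by the reconstruction term) and later creeps back up as the weighted KL term takes over, which is the same pattern as in the logs below.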
Hey! Thanks for your wonderful work again, @ZENGXH.
But now I have met another problem: I want to know why the training process is unstable.
I trained the VAE model on all categories (bash ./scripts/train_vae_all.sh) with a batch size of 12 on 8 V100 16GB GPUs.
At the start of training, the loss decreases (starting from around 167):
```
2023-04-05 14:08:18.381 | INFO | trainers.base_trainer:train_epochs:219 - [R0] | E0 iter[ 53/372] | [Loss] 167.43 | [exp] ../exp/0405/all/7d8f96h_hvae_lion_B12 | [step] 53 | [url] none
2023-04-05 14:09:18.824 | INFO | trainers.base_trainer:train_epochs:219 - [R0] | E0 iter[129/372] | [Loss] 88.55 | [exp] ../exp/0405/all/7d8f96h_hvae_lion_B12 | [step] 129 | [url] none
2023-04-05 14:10:19.573 | INFO | trainers.base_trainer:train_epochs:219 - [R0] | E0 iter[205/372] | [Loss] 62.51 | [exp] ../exp/0405/all/7d8f96h_hvae_lion_B12 | [step] 205 | [url] none
2023-04-05 14:11:19.673 | INFO | trainers.base_trainer:train_epochs:219 - [R0] | E0 iter[280/372] | [Loss] 49.84 | [exp] ../exp/0405/all/7d8f96h_hvae_lion_B12 | [step] 280 | [url] none
2023-04-05 14:12:20.123 | INFO | trainers.base_trainer:train_epochs:219 - [R0] | E0 iter[355/372] | [Loss] 42.31 | [exp] ../exp/0405/all/7d8f96h_hvae_lion_B12 | [step] 355 | [url] none
```
However, the loss starts to increase after it has decreased to around 14:
```
2023-04-05 14:13:33.545 | INFO | trainers.base_trainer:train_epochs:219 - [R0] | E1 iter[ 71/372] | [Loss] 14.14 | [exp] ../exp/0405/all/7d8f96h_hvae_lion_B12 | [step] 443 | [url] none
2023-04-05 14:14:34.110 | INFO | trainers.base_trainer:train_epochs:219 - [R0] | E1 iter[147/372] | [Loss] 14.47 | [exp] ../exp/0405/all/7d8f96h_hvae_lion_B12 | [step] 519 | [url] none
2023-04-05 14:15:34.777 | INFO | trainers.base_trainer:train_epochs:219 - [R0] | E1 iter[223/372] | [Loss] 14.91 | [exp] ../exp/0405/all/7d8f96h_hvae_lion_B12 | [step] 595 | [url] none
2023-04-05 14:16:35.255 | INFO | trainers.base_trainer:train_epochs:219 - [R0] | E1 iter[299/372] | [Loss] 15.37 | [exp] ../exp/0405/all/7d8f96h_hvae_lion_B12 | [step] 671 | [url] none
2023-04-05 14:17:32.903 | INFO | trainers.base_trainer:train_epochs:256 - [R0] | E1 iter[371/372] | [Loss] 15.86 | [exp] ../exp/0405/all/7d8f96h_hvae_lion_B12 | [step] 743 | [url] none | [time] 5.0m (~665h) |[best] 0 -100.000x1e-2
2023-04-05 14:18:32.966 | INFO | trainers.base_trainer:train_epochs:219 - [R0] | E2 iter[ 70/372] | [Loss] 19.02 | [exp] ../exp/0405/all/7d8f96h_hvae_lion_B12 | [step] 814 | [url] none
2023-04-05 14:19:33.599 | INFO | trainers.base_trainer:train_epochs:219 - [R0] | E2 iter[144/372] | [Loss] 19.69 | [exp] ../exp/0405/all/7d8f96h_hvae_lion_B12 | [step] 888 | [url] none
2023-04-05 14:20:34.311 | INFO | trainers.base_trainer:train_epochs:219 - [R0] | E2 iter[217/372] | [Loss] 20.36 | [exp] ../exp/0405/all/7d8f96h_hvae_lion_B12 | [step] 961 | [url] none
2023-04-05 14:21:34.365 | INFO | trainers.base_trainer:train_epochs:219 - [R0] | E2 iter[290/372] | [Loss] 21.03 | [exp] ../exp/0405/all/7d8f96h_hvae_lion_B12 | [step] 1034 | [url] none
2023-04-05 14:22:35.093 | INFO | trainers.base_trainer:train_epochs:219 - [R0] | E2 iter[364/372] | [Loss] 21.72 | [exp] ../exp/0405/all/7d8f96h_hvae_lion_B12 | [step] 1108 | [url] none
2023-04-05 14:22:41.203 | INFO | trainers.base_trainer:train_epochs:256 - [R0] | E2 iter[371/372] | [Loss] 21.78 | [exp] ../exp/0405/all/7d8f96h_hvae_lion_B12 | [step] 1115 | [url] none | [time] 5.1m (~684h) |[best] 0 -100.000x1e-2
2023-04-05 14:23:41.649 | INFO | trainers.base_trainer:train_epochs:219 - [R0] | E3 iter[ 72/372] | [Loss] 25.93 | [exp] ../exp/0405/all/7d8f96h_hvae_lion_B12 | [step] 1188 | [url] none
```
I want to know why the training process is so unstable and how to fix this problem.
Looking forward to your reply!