Error occurs when training diffusion #18

Open
VLadImirluren opened this issue Jul 19, 2024 · 6 comments

VLadImirluren commented Jul 19, 2024

First, I tried to train the coarse VAE using the given command:
python train.py ./configs/shapenet/chair/train_vae_16x16x16_dense.yaml --wname 16x16x16-kld-0.03_dim-16 --max_epochs 100 --cut_ratio 16 --gpus 1 --batch_size 16

Because my GPU setup differs from the paper's (I have a single A800, while the paper uses 8 × V100), I changed the batch size to 16 and set gradient accumulation to 2.
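For reference, the fully adjusted command was presumably the following (assuming gradient accumulation is passed with the same --accumulate_grad_batches flag used in the diffusion command below):

python train.py ./configs/shapenet/chair/train_vae_16x16x16_dense.yaml --wname 16x16x16-kld-0.03_dim-16 --max_epochs 100 --cut_ratio 16 --gpus 1 --batch_size 16 --accumulate_grad_batches 2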

After successfully training the coarse VAE, I tried to train the coarse diffusion model using the given command (again, only the batch size and gradient accumulation were changed):
python train.py ./configs/shapenet/chair/train_diffusion_16x16x16_dense.yaml --wname 16x16x16_kld-0.03 --eval_interval 5 --gpus 1 --batch_size 8 --accumulate_grad_batches 32

But an error occurred:
2024-07-19 15:47:45.053 | INFO | main::171 - This is train_auto.py! Please note that you should use 300 instead of 300.0 for resuming.
git root error: Cmd('git') failed due to: exit code(128)
cmdline: git rev-parse --show-toplevel
stderr: 'fatal: detected dubious ownership in repository at '/mnt/pfs/users/dengken/code/XCube'
To add an exception for this directory, call:

    git config --global --add safe.directory /mnt/pfs/users/dengken/code/XCube'

wandb: Currently logged in as: 13532152291 (13532152291-sun-yat-sen-university). Use wandb login --relogin to force relogin
wandb: Tracking run with wandb version 0.17.3
wandb: Run data is saved locally in ../wandb/wandb/run-20240719_154747-rk4p0a77
wandb: Run wandb offline to turn off syncing.
wandb: Syncing run chair_diffusion_dense/16x16x16_kld-0.03
wandb: ⭐️ View project at https://wandb.ai/13532152291-sun-yat-sen-university/xcube-shapenet
wandb: 🚀 View run at https://wandb.ai/13532152291-sun-yat-sen-university/xcube-shapenet/runs/rk4p0a77
[rank: 0] Global seed set to 0
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
/root/miniconda3/lib/python3.9/site-packages/pytorch_lightning/utilities/distributed.py:258: LightningDeprecationWarning: pytorch_lightning.utilities.distributed.rank_zero_only has been deprecated in v1.8.1 and will be removed in v2.0.0. You can import it from pytorch_lightning.utilities instead.
rank_zero_deprecation(
2024-07-19 15:48:01.165 | INFO | xcube.modules.autoencoding.sunet:init:240 - latent dim: 16
Traceback (most recent call last):
File "/mnt/pfs/users/dengken/code/XCube/train.py", line 380, in
net_model = net_module(model_args)
File "/mnt/pfs/users/dengken/code/XCube/xcube/models/diffusion.py", line 84, in init
self.vae = self.load_first_stage_from_pretrained().eval()
File "/root/miniconda3/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/mnt/pfs/users/dengken/code/XCube/xcube/models/diffusion.py", line 264, in load_first_stage_from_pretrained
return net_module.load_from_checkpoint(args_ckpt, hparams=model_args)
File "/root/miniconda3/lib/python3.9/site-packages/pytorch_lightning/core/saving.py", line 139, in load_from_checkpoint
return _load_from_checkpoint(
File "/root/miniconda3/lib/python3.9/site-packages/pytorch_lightning/core/saving.py", line 188, in _load_from_checkpoint
return _load_state(cls, checkpoint, strict=strict, **kwargs)
File "/root/miniconda3/lib/python3.9/site-packages/pytorch_lightning/core/saving.py", line 247, in _load_state
keys = obj.load_state_dict(checkpoint["state_dict"], strict=strict)
File "/root/miniconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 2153, in load_state_dict
raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for Model:
size mismatch for unet.pre_kl_bottleneck.pre_kl_bottleneck_1.SingleConv1.Conv.weight: copying a param with shape torch.Size([64, 512, 3, 3, 3]) from checkpoint, the shape in current model is torch.Size([32, 512, 3, 3, 3]).
size mismatch for unet.pre_kl_bottleneck.pre_kl_bottleneck_1.SingleConv2.GroupNorm.weight: copying a param with shape torch.Size([64]) from checkpoint, the shape in current model is torch.Size([32]).
size mismatch for unet.pre_kl_bottleneck.pre_kl_bottleneck_1.SingleConv2.GroupNorm.bias: copying a param with shape torch.Size([64]) from checkpoint, the shape in current model is torch.Size([32]).
size mismatch for unet.pre_kl_bottleneck.pre_kl_bottleneck_1.SingleConv2.Conv.weight: copying a param with shape torch.Size([64, 64, 3, 3, 3]) from checkpoint, the shape in current model is torch.Size([32, 32, 3, 3, 3]).
size mismatch for unet.post_kl_bottleneck.post_kl_bottleneck_0.SingleConv1.GroupNorm.weight: copying a param with shape torch.Size([32]) from checkpoint, the shape in current model is torch.Size([16]).
size mismatch for unet.post_kl_bottleneck.post_kl_bottleneck_0.SingleConv1.GroupNorm.bias: copying a param with shape torch.Size([32]) from checkpoint, the shape in current model is torch.Size([16]).
size mismatch for unet.post_kl_bottleneck.post_kl_bottleneck_0.SingleConv1.Conv.weight: copying a param with shape torch.Size([512, 32, 3, 3, 3]) from checkpoint, the shape in current model is torch.Size([512, 16, 3, 3, 3]).
wandb: 🚀 View run chair_diffusion_dense/16x16x16_kld-0.03 at: https://wandb.ai/13532152291-sun-yat-sen-university/xcube-shapenet/runs/rk4p0a77
wandb: ⭐️ View project at: https://wandb.ai/13532152291-sun-yat-sen-university/xcube-shapenet
wandb: Synced 6 W&B file(s), 0 media file(s), 0 artifact file(s) and 0 other file(s)
wandb: Find logs at: ../wandb/wandb/run-20240719_154747-rk4p0a77/logs
wandb: WARNING The new W&B backend becomes opt-out in version 0.18.0; try it out with wandb.require("core")! See https://wandb.me/wandb-core for more information.

There is no error when I use the pretrained VAE checkpoint you provide.

Could you please help me? Thanks!

VLadImirluren (Author) commented Jul 19, 2024

To be clear, the git error is not the problem; I solved it, but the error that follows it still exists.
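For reference, the git error is fixed with the command suggested in the log itself:

git config --global --add safe.directory /mnt/pfs/users/dengken/code/XCube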

VLadImirluren (Author)

In other words, after training the VAE with the command you gave, running the diffusion training command you gave results in this error.

tanghaotommy commented Jul 22, 2024

You might want to change the cut_ratio when training the diffusion model. The command you gave used cut_ratio=16 for training the VAE. The default value for training the diffusion model is 32; you might want to change that to 16 as well.
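For what it's worth, the shapes in the error are consistent with this: if, as they suggest, the latent width is the backbone width divided by cut_ratio, the checkpoint trained with cut_ratio=16 has 512/16 = 32 latent channels, while the diffusion config with the default cut_ratio=32 expects 512/32 = 16; the pre-KL convolution outputs twice the latent width (mean and log-variance), hence the 64-vs-32 mismatch. Assuming train.py accepts --cut_ratio for the diffusion config the same way it does for the VAE, the adjusted command would be:

python train.py ./configs/shapenet/chair/train_diffusion_16x16x16_dense.yaml --wname 16x16x16_kld-0.03 --eval_interval 5 --gpus 1 --batch_size 8 --accumulate_grad_batches 32 --cut_ratio 16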

VLadImirluren (Author)

> You might want to change the cut_ratio when training the diffusion model. The command you gave used cut_ratio=16 for training the VAE. The default value for training the diffusion model is 32; you might want to change that to 16 as well.

But the latent dimension given in the paper is 16. Will performance be significantly affected by a different latent dimension?

xrenaa (Collaborator) commented Jul 23, 2024

> You might want to change the cut_ratio when training the diffusion model. The command you gave used cut_ratio=16 for training the VAE. The default value for training the diffusion model is 32; you might want to change that to 16 as well.
>
> But the latent dimension given in the paper is 16. Will performance be significantly affected by a different latent dimension?

Hi, thanks for trying! A latent dimension of 16 or 8 does not make a big difference. I will fix some of the instructions.

VLadImirluren (Author)

> Hi, thanks for trying! A latent dimension of 16 or 8 does not make a big difference. I will fix some of the instructions.

Thanks
