Error occurs when training diffusion #18

Open
VLadImirluren opened this issue Jul 19, 2024 · 6 comments

VLadImirluren commented Jul 19, 2024

First, I tried to train the coarse VAE using the given command:
python train.py ./configs/shapenet/chair/train_vae_16x16x16_dense.yaml --wname 16x16x16-kld-0.03_dim-16 --max_epochs 100 --cut_ratio 16 --gpus 1 --batch_size 16

Because my GPU setup differs from the paper's (I have a single A800, while the paper uses 8 × V100), I changed the batch size to 16 and set gradient accumulation to 2.
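For reference, the fully adjusted command was presumably the following (assuming gradient accumulation is passed with the same --accumulate_grad_batches flag used in the diffusion command below):

python train.py ./configs/shapenet/chair/train_vae_16x16x16_dense.yaml --wname 16x16x16-kld-0.03_dim-16 --max_epochs 100 --cut_ratio 16 --gpus 1 --batch_size 16 --accumulate_grad_batches 2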

After successfully training the coarse VAE, I tried to train the coarse diffusion model using the given command (again, only the batch size and gradient accumulation were changed):
python train.py ./configs/shapenet/chair/train_diffusion_16x16x16_dense.yaml --wname 16x16x16_kld-0.03 --eval_interval 5 --gpus 1 --batch_size 8 --accumulate_grad_batches 32

But an error occurred:
2024-07-19 15:47:45.053 | INFO | main::171 - This is train_auto.py! Please note that you should use 300 instead of 300.0 for resuming.
git root error: Cmd('git') failed due to: exit code(128)
cmdline: git rev-parse --show-toplevel
stderr: 'fatal: detected dubious ownership in repository at '/mnt/pfs/users/dengken/code/XCube'
To add an exception for this directory, call:

    git config --global --add safe.directory /mnt/pfs/users/dengken/code/XCube'

wandb: Currently logged in as: 13532152291 (13532152291-sun-yat-sen-university). Use wandb login --relogin to force relogin
wandb: Tracking run with wandb version 0.17.3
wandb: Run data is saved locally in ../wandb/wandb/run-20240719_154747-rk4p0a77
wandb: Run wandb offline to turn off syncing.
wandb: Syncing run chair_diffusion_dense/16x16x16_kld-0.03
wandb: ⭐️ View project at https://wandb.ai/13532152291-sun-yat-sen-university/xcube-shapenet
wandb: 🚀 View run at https://wandb.ai/13532152291-sun-yat-sen-university/xcube-shapenet/runs/rk4p0a77
[rank: 0] Global seed set to 0
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
/root/miniconda3/lib/python3.9/site-packages/pytorch_lightning/utilities/distributed.py:258: LightningDeprecationWarning: pytorch_lightning.utilities.distributed.rank_zero_only has been deprecated in v1.8.1 and will be removed in v2.0.0. You can import it from pytorch_lightning.utilities instead.
rank_zero_deprecation(
2024-07-19 15:48:01.165 | INFO | xcube.modules.autoencoding.sunet:init:240 - latent dim: 16
Traceback (most recent call last):
File "/mnt/pfs/users/dengken/code/XCube/train.py", line 380, in
net_model = net_module(model_args)
File "/mnt/pfs/users/dengken/code/XCube/xcube/models/diffusion.py", line 84, in init
self.vae = self.load_first_stage_from_pretrained().eval()
File "/root/miniconda3/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/mnt/pfs/users/dengken/code/XCube/xcube/models/diffusion.py", line 264, in load_first_stage_from_pretrained
return net_module.load_from_checkpoint(args_ckpt, hparams=model_args)
File "/root/miniconda3/lib/python3.9/site-packages/pytorch_lightning/core/saving.py", line 139, in load_from_checkpoint
return _load_from_checkpoint(
File "/root/miniconda3/lib/python3.9/site-packages/pytorch_lightning/core/saving.py", line 188, in _load_from_checkpoint
return _load_state(cls, checkpoint, strict=strict, **kwargs)
File "/root/miniconda3/lib/python3.9/site-packages/pytorch_lightning/core/saving.py", line 247, in _load_state
keys = obj.load_state_dict(checkpoint["state_dict"], strict=strict)
File "/root/miniconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 2153, in load_state_dict
raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for Model:
size mismatch for unet.pre_kl_bottleneck.pre_kl_bottleneck_1.SingleConv1.Conv.weight: copying a param with shape torch.Size([64, 512, 3, 3, 3]) from checkpoint, the shape in current model is torch.Size([32, 512, 3, 3, 3]).
size mismatch for unet.pre_kl_bottleneck.pre_kl_bottleneck_1.SingleConv2.GroupNorm.weight: copying a param with shape torch.Size([64]) from checkpoint, the shape in current model is torch.Size([32]).
size mismatch for unet.pre_kl_bottleneck.pre_kl_bottleneck_1.SingleConv2.GroupNorm.bias: copying a param with shape torch.Size([64]) from checkpoint, the shape in current model is torch.Size([32]).
size mismatch for unet.pre_kl_bottleneck.pre_kl_bottleneck_1.SingleConv2.Conv.weight: copying a param with shape torch.Size([64, 64, 3, 3, 3]) from checkpoint, the shape in current model is torch.Size([32, 32, 3, 3, 3]).
size mismatch for unet.post_kl_bottleneck.post_kl_bottleneck_0.SingleConv1.GroupNorm.weight: copying a param with shape torch.Size([32]) from checkpoint, the shape in current model is torch.Size([16]).
size mismatch for unet.post_kl_bottleneck.post_kl_bottleneck_0.SingleConv1.GroupNorm.bias: copying a param with shape torch.Size([32]) from checkpoint, the shape in current model is torch.Size([16]).
size mismatch for unet.post_kl_bottleneck.post_kl_bottleneck_0.SingleConv1.Conv.weight: copying a param with shape torch.Size([512, 32, 3, 3, 3]) from checkpoint, the shape in current model is torch.Size([512, 16, 3, 3, 3]).
wandb: 🚀 View run chair_diffusion_dense/16x16x16_kld-0.03 at: https://wandb.ai/13532152291-sun-yat-sen-university/xcube-shapenet/runs/rk4p0a77
wandb: ⭐️ View project at: https://wandb.ai/13532152291-sun-yat-sen-university/xcube-shapenet
wandb: Synced 6 W&B file(s), 0 media file(s), 0 artifact file(s) and 0 other file(s)
wandb: Find logs at: ../wandb/wandb/run-20240719_154747-rk4p0a77/logs
wandb: WARNING The new W&B backend becomes opt-out in version 0.18.0; try it out with wandb.require("core")! See https://wandb.me/wandb-core for more information.

There is no error when I use the pretrained VAE checkpoint you provide.

Could you please help me? Thanks!

VLadImirluren (Author) commented Jul 19, 2024

To be clear, the git error is not the problem; I solved it, but the error that follows it still exists.
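For reference, the git error is fixed with the command suggested in the log itself:

git config --global --add safe.directory /mnt/pfs/users/dengken/code/XCube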

VLadImirluren (Author)

In other words, after training the VAE with the command you gave, running the diffusion training command you gave results in this error.

tanghaotommy commented Jul 22, 2024

You might want to change the cut_ratio when training the diffusion model. The command you gave used cut_ratio=16 for training the VAE. The default value for training the diffusion model is 32; you might want to change that to 16 as well.
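For what it's worth, the shapes in the error are consistent with this: if, as they suggest, the latent width is the backbone width divided by cut_ratio, the checkpoint trained with cut_ratio=16 has 512/16 = 32 latent channels, while the diffusion config with the default cut_ratio=32 expects 512/32 = 16; the pre-KL convolution outputs twice the latent width (mean and log-variance), hence the 64-vs-32 mismatch. Assuming train.py accepts --cut_ratio for the diffusion config the same way it does for the VAE, the adjusted command would be:

python train.py ./configs/shapenet/chair/train_diffusion_16x16x16_dense.yaml --wname 16x16x16_kld-0.03 --eval_interval 5 --gpus 1 --batch_size 8 --accumulate_grad_batches 32 --cut_ratio 16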

VLadImirluren (Author)

> You might want to change the cut_ratio when training the diffusion model. The command you gave used cut_ratio=16 for training the VAE. The default value for training the diffusion model is 32; you might want to change that to 16 as well.

But the latent dimension given in the paper is 16. Will performance be significantly affected by a different latent dimension?

xrenaa (Collaborator) commented Jul 23, 2024

> You might want to change the cut_ratio when training the diffusion model. The command you gave used cut_ratio=16 for training the VAE. The default value for training the diffusion model is 32; you might want to change that to 16 as well.
>
> But the latent dimension given in the paper is 16. Will performance be significantly affected by a different latent dimension?

Hi, thanks for trying! A latent dimension of 16 or 8 does not make a big difference. I will fix some of the instructions.

VLadImirluren (Author)

> Hi, thanks for trying! A latent dimension of 16 or 8 does not make a big difference. I will fix some of the instructions.

Thanks
