@jaketalyor32325 I was able to comment out dist_util.sync_params(self.model.parameters()) in train_util.py and get the training to run. I assume the parameters really only need to be synced across multiple GPUs.
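Rather than deleting the call outright, a gentler workaround is to skip the broadcast when there is only one process. This is a hedged sketch, not the repository's actual sync_params implementation; it assumes the helper just broadcasts each parameter tensor from rank 0 with dist.broadcast:

```python
import torch
import torch.distributed as dist


def sync_params(params):
    """Broadcast parameters from rank 0 to all ranks.

    Hypothetical single-GPU-safe variant: it becomes a no-op when the
    process group is not initialized or when there is only one process,
    so nothing is broadcast (and the in-place write never happens) in
    single-GPU runs.
    """
    if not dist.is_available() or not dist.is_initialized():
        return
    if dist.get_world_size() == 1:
        return
    for p in params:
        # dist.broadcast writes into p in place; suppress autograd
        # tracking so leaf tensors that require grad are not rejected.
        with torch.no_grad():
            dist.broadcast(p, 0)
```

With this guard in place, the same code path should work both for single-GPU runs and for genuine multi-process training.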
Settings:
Win10 Pro
python 3.7.9
PyTorch 1.8.1+cu111
1 GPU
GLOO backend
jupyter notebook
I can run the other scripts fine (classifier_sample.py, super_res_sample.py), but when I tried to run classifier_train.py I got a runtime error.
...\torch\distributed\distributed_c10d.py in broadcast(tensor, src, group, async_op)
1027 return work
1028 else:
-> 1029 work.wait()
1030
1031
RuntimeError: a leaf Variable that requires grad is being used in an in-place operation.
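For context, this error class comes from autograd: PyTorch forbids in-place writes to leaf tensors that have requires_grad=True, and dist.broadcast writes into the tensor in place. A minimal standalone reproduction (not the repo's code) that triggers the same message and shows the usual fix:

```python
import torch

# A freshly created tensor with requires_grad=True is a "leaf Variable".
p = torch.zeros(3, requires_grad=True)

try:
    p.add_(1.0)  # in-place op on a leaf that requires grad -> RuntimeError
except RuntimeError as e:
    print(e)

# Wrapping the in-place write in no_grad() makes it legal, since the
# mutation is then hidden from autograd:
with torch.no_grad():
    p.add_(1.0)
print(p)
```

This suggests why guarding or no_grad-wrapping the broadcast in sync_params avoids the crash on a single GPU.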
These are the arguments and training commands I used:
TRAIN_FLAGS="--iterations 300000 --anneal_lr True --batch_size 256 --lr 3e-4 --save_interval 10000 --weight_decay 0.05"
CLASSIFIER_FLAGS="--image_size 128 --classifier_attention_resolutions 32,16,8 --classifier_depth 2 --classifier_width 128 --classifier_pool attention --classifier_resblock_updown True --classifier_use_scale_shift_norm True"
%run scripts/classifier_train.py --data_dir r"G:\data_set\imagenette2-160\train" $TRAIN_FLAGS $CLASSIFIER_FLAGS
Thanks for any comments and assistance in advance.