swin_transformer_v2.py RuntimeError Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument max in method wrapper_CUDA_clamp_Tensor) #376

Open
woongjoonchoi wants to merge 1 commit into main

Conversation

@woongjoonchoi commented on Jan 15, 2025

When training with swin-transformer-v2, the following error occurred: RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument max in method wrapper_CUDA_clamp_Tensor).

I fixed the code in models/swin_transformer_v2.py, line 159.

Before:
logit_scale = torch.clamp(self.logit_scale, max=torch.log(torch.tensor(1. / 0.01))).exp()

After:
logit_scale = torch.clamp(self.logit_scale, max=torch.log(torch.tensor(1. / 0.01).cuda())).exp()
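For context, self.logit_scale lives on the GPU once the model has been moved there, while torch.tensor(1. / 0.01) is created on the CPU, and torch.clamp requires a tensor max argument to be on the same device as its input. The sketch below is not the exact patch in this commit; it shows two device-agnostic variants (using an illustrative stand-in for self.logit_scale) that avoid hard-coding .cuda(), so the module would also keep working in CPU-only runs:

import math
import torch

# Illustrative stand-in for self.logit_scale from WindowAttention in
# swin_transformer_v2.py; in the real module this is an nn.Parameter.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
logit_scale_param = torch.log(10 * torch.ones((3, 1, 1), device=device))

# Variant A: pass a plain Python float. torch.clamp accepts a scalar max,
# so no second tensor (and therefore no device mismatch) is involved.
logit_scale = torch.clamp(logit_scale_param, max=math.log(1. / 0.01)).exp()

# Variant B: if a tensor max is preferred, create it on the parameter's
# own device instead of calling .cuda() unconditionally.
max_val = torch.log(torch.tensor(1. / 0.01, device=logit_scale_param.device))
logit_scale = torch.clamp(logit_scale_param, max=max_val).exp()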

This is how to reproduce the error:

python -m torch.distributed.launch --nproc_per_node 1 --master_port 12345 main.py --eval --cfg ./configs/swinv2/swinv2_tiny_patch4_window8_256.yaml --resume ././Swin-model-1k/swinv2/swinv2_tiny_patch4_window8_256.pth --data-path imagenet
WARNING: CPU IP/backtrace sampling not supported, disabling.
Try the 'nsys status --environment' command to learn more.

WARNING: CPU context switch tracing not supported, disabling.
Try the 'nsys status --environment' command to learn more.

WARNING: CUDA backtraces will not be collected because CPU sampling is disabled.
/home/oongjoon/Desktop/Github/flashattn/lib/python3.10/site-packages/torch/distributed/launch.py:208: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use-env is set by default in torchrun.
If your script expects --local-rank argument to be set, please
change it to read from os.environ['LOCAL_RANK'] instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions

main()
Tutel has not been installed. To use Swin-MoE, please install Tutel; otherwise, just ignore this.
To use FusedLAMB or FusedAdam, please install apex.
=> merge config from ./configs/swinv2/swinv2_tiny_patch4_window8_256.yaml
RANK and WORLD_SIZE in environ: 0/1
[rank0]:[W115 16:39:15.575750744 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 0] using GPU 0 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
[2025-01-15 16:39:15 swinv2_tiny_patch4_window8_256](main.py 434): INFO Full config saved to output/swinv2_tiny_patch4_window8_256/default/config.json
[2025-01-15 16:39:15 swinv2_tiny_patch4_window8_256](main.py 437): INFO AMP_ENABLE: true
AMP_OPT_LEVEL: ''
AUG:
  AUTO_AUGMENT: rand-m9-mstd0.5-inc1
  COLOR_JITTER: 0.4
  CUTMIX: 1.0
  CUTMIX_MINMAX: null
  MIXUP: 0.8
  MIXUP_MODE: batch
  MIXUP_PROB: 1.0
  MIXUP_SWITCH_PROB: 0.5
  RECOUNT: 1
  REMODE: pixel
  REPROB: 0.25
BASE:
- ''
DATA:
  BATCH_SIZE: 128
  CACHE_MODE: part
  DATASET: imagenet
  DATA_PATH: imagenet
  IMG_SIZE: 256
  INTERPOLATION: bicubic
  MASK_PATCH_SIZE: 32
  MASK_RATIO: 0.6
  NUM_WORKERS: 8
  PIN_MEMORY: true
  ZIP_MODE: false
ENABLE_AMP: false
EVAL_MODE: true
FUSED_LAYERNORM: false
FUSED_WINDOW_PROCESS: false
LOCAL_RANK: 0
MODEL:
  DROP_PATH_RATE: 0.2
  DROP_RATE: 0.0
  LABEL_SMOOTHING: 0.1
  NAME: swinv2_tiny_patch4_window8_256
  NUM_CLASSES: 1000
  PRETRAINED: ''
  RESUME: ././Swin-model-1k/swinv2/swinv2_tiny_patch4_window8_256.pth
  SIMMIM:
    NORM_TARGET:
      ENABLE: false
      PATCH_SIZE: 47
  SWIN:
    APE: false
    DEPTHS:
    - 2
    - 2
    - 6
    - 2
    EMBED_DIM: 96
    IN_CHANS: 3
    MLP_RATIO: 4.0
    NUM_HEADS:
    - 3
    - 6
    - 12
    - 24
    PATCH_NORM: true
    PATCH_SIZE: 4
    QKV_BIAS: true
    QK_SCALE: null
    WINDOW_SIZE: 7
  SWINV2:
    APE: false
    DEPTHS:
    - 2
    - 2
    - 6
    - 2
    EMBED_DIM: 96
    IN_CHANS: 3
    MLP_RATIO: 4.0
    NUM_HEADS:
    - 3
    - 6
    - 12
    - 24
    PATCH_NORM: true
    PATCH_SIZE: 4
    PRETRAINED_WINDOW_SIZES:
    - 0
    - 0
    - 0
    - 0
    QKV_BIAS: true
    WINDOW_SIZE: 8
  SWIN_MLP:
    APE: false
    DEPTHS:
    - 2
    - 2
    - 6
    - 2
    EMBED_DIM: 96
    IN_CHANS: 3
    MLP_RATIO: 4.0
    NUM_HEADS:
    - 3
    - 6
    - 12
    - 24
    PATCH_NORM: true
    PATCH_SIZE: 4
    WINDOW_SIZE: 7
  SWIN_MOE:
    APE: false
    AUX_LOSS_WEIGHT: 0.01
    CAPACITY_FACTOR: 1.25
    COSINE_ROUTER: false
    COSINE_ROUTER_DIM: 256
    COSINE_ROUTER_INIT_T: 0.5
    DEPTHS:
    - 2
    - 2
    - 6
    - 2
    EMBED_DIM: 96
    GATE_NOISE: 1.0
    INIT_STD: 0.02
    IN_CHANS: 3
    IS_GSHARD_LOSS: false
    MLP_FC2_BIAS: true
    MLP_RATIO: 4.0
    MOE_BLOCKS:
    - - -1
    - - -1
    - - -1
    - - -1
    MOE_DROP: 0.0
    NORMALIZE_GATE: false
    NUM_HEADS:
    - 3
    - 6
    - 12
    - 24
    NUM_LOCAL_EXPERTS: 1
    PATCH_NORM: true
    PATCH_SIZE: 4
    PRETRAINED_WINDOW_SIZES:
    - 0
    - 0
    - 0
    - 0
    QKV_BIAS: true
    QK_SCALE: null
    TOP_VALUE: 1
    USE_BPR: true
    WINDOW_SIZE: 7
  TYPE: swinv2
OUTPUT: output/swinv2_tiny_patch4_window8_256/default
PRINT_FREQ: 10
SAVE_FREQ: 1
SEED: 0
TAG: default
TEST:
  CROP: true
  SEQUENTIAL: false
  SHUFFLE: false
  THROUGHPUT_MODE: false
TRAIN:
  ACCUMULATION_STEPS: 1
  AUTO_RESUME: true
  BASE_LR: 0.000125
  CLIP_GRAD: 5.0
  EPOCHS: 300
  LAYER_DECAY: 1.0
  LR_SCHEDULER:
    DECAY_EPOCHS: 30
    DECAY_RATE: 0.1
    GAMMA: 0.1
    MULTISTEPS: []
    NAME: cosine
    WARMUP_PREFIX: true
  MIN_LR: 1.25e-06
  MOE:
    SAVE_MASTER: false
  OPTIMIZER:
    BETAS:
    - 0.9
    - 0.999
    EPS: 1.0e-08
    MOMENTUM: 0.9
    NAME: adamw
  START_EPOCH: 0
  USE_CHECKPOINT: false
  WARMUP_EPOCHS: 20
  WARMUP_LR: 1.25e-07
  WEIGHT_DECAY: 0.05

[2025-01-15 16:39:15 swinv2_tiny_patch4_window8_256](main.py 438): INFO {"cfg": "./configs/swinv2/swinv2_tiny_patch4_window8_256.yaml", "opts": null, "batch_size": null, "data_path": "imagenet", "zip": false, "cache_mode": "part", "pretrained": null, "resume": "././Swin-model-1k/swinv2/swinv2_tiny_patch4_window8_256.pth", "accumulation_steps": null, "use_checkpoint": false, "disable_amp": false, "amp_opt_level": null, "output": "output", "tag": null, "eval": true, "throughput": false, "fused_window_process": false, "fused_layernorm": false, "optim": null}
local rank 0 / global rank 0 successfully build train dataset
local rank 0 / global rank 0 successfully build val dataset
[2025-01-15 16:39:17 swinv2_tiny_patch4_window8_256](main.py 93): INFO Creating model:swinv2/swinv2_tiny_patch4_window8_256
/home/oongjoon/Desktop/Github/flashattn/lib/python3.10/site-packages/torch/functional.py:534: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at ../aten/src/ATen/native/TensorShape.cpp:3595.)
return _VF.meshgrid(tensors, **kwargs) # type: ignore[attr-defined]
[2025-01-15 16:39:17 swinv2_tiny_patch4_window8_256](main.py 95): INFO SwinTransformerV2(
(patch_embed): PatchEmbed(
(proj): Conv2d(3, 96, kernel_size=(4, 4), stride=(4, 4))
(norm): LayerNorm((96,), eps=1e-05, elementwise_affine=True)
)
(pos_drop): Dropout(p=0.0, inplace=False)
(layers): ModuleList(
(0): BasicLayer(
dim=96, input_resolution=(64, 64), depth=2
(blocks): ModuleList(
(0): SwinTransformerBlock(
dim=96, input_resolution=(64, 64), num_heads=3, window_size=8, shift_size=0, mlp_ratio=4.0
(norm1): LayerNorm((96,), eps=1e-05, elementwise_affine=True)
(attn): WindowAttention(
dim=96, window_size=(8, 8), pretrained_window_size=(0, 0), num_heads=3
(cpb_mlp): Sequential(
(0): Linear(in_features=2, out_features=512, bias=True)
(1): ReLU(inplace=True)
(2): Linear(in_features=512, out_features=3, bias=False)
)
(qkv): Linear(in_features=96, out_features=288, bias=False)
(attn_drop): Dropout(p=0.0, inplace=False)
(proj): Linear(in_features=96, out_features=96, bias=True)
(proj_drop): Dropout(p=0.0, inplace=False)
(softmax): Softmax(dim=-1)
)
(drop_path): Identity()
(norm2): LayerNorm((96,), eps=1e-05, elementwise_affine=True)
(mlp): Mlp(
(fc1): Linear(in_features=96, out_features=384, bias=True)
(act): GELU(approximate='none')
(fc2): Linear(in_features=384, out_features=96, bias=True)
(drop): Dropout(p=0.0, inplace=False)
)
)
(1): SwinTransformerBlock(
dim=96, input_resolution=(64, 64), num_heads=3, window_size=8, shift_size=4, mlp_ratio=4.0
(norm1): LayerNorm((96,), eps=1e-05, elementwise_affine=True)
(attn): WindowAttention(
dim=96, window_size=(8, 8), pretrained_window_size=(0, 0), num_heads=3
(cpb_mlp): Sequential(
(0): Linear(in_features=2, out_features=512, bias=True)
(1): ReLU(inplace=True)
(2): Linear(in_features=512, out_features=3, bias=False)
)
(qkv): Linear(in_features=96, out_features=288, bias=False)
(attn_drop): Dropout(p=0.0, inplace=False)
(proj): Linear(in_features=96, out_features=96, bias=True)
(proj_drop): Dropout(p=0.0, inplace=False)
(softmax): Softmax(dim=-1)
)
(drop_path): DropPath()
(norm2): LayerNorm((96,), eps=1e-05, elementwise_affine=True)
(mlp): Mlp(
(fc1): Linear(in_features=96, out_features=384, bias=True)
(act): GELU(approximate='none')
(fc2): Linear(in_features=384, out_features=96, bias=True)
(drop): Dropout(p=0.0, inplace=False)
)
)
)
(downsample): PatchMerging(
input_resolution=(64, 64), dim=96
(reduction): Linear(in_features=384, out_features=192, bias=False)
(norm): LayerNorm((192,), eps=1e-05, elementwise_affine=True)
)
)
(1): BasicLayer(
dim=192, input_resolution=(32, 32), depth=2
(blocks): ModuleList(
(0): SwinTransformerBlock(
dim=192, input_resolution=(32, 32), num_heads=6, window_size=8, shift_size=0, mlp_ratio=4.0
(norm1): LayerNorm((192,), eps=1e-05, elementwise_affine=True)
(attn): WindowAttention(
dim=192, window_size=(8, 8), pretrained_window_size=(0, 0), num_heads=6
(cpb_mlp): Sequential(
(0): Linear(in_features=2, out_features=512, bias=True)
(1): ReLU(inplace=True)
(2): Linear(in_features=512, out_features=6, bias=False)
)
(qkv): Linear(in_features=192, out_features=576, bias=False)
(attn_drop): Dropout(p=0.0, inplace=False)
(proj): Linear(in_features=192, out_features=192, bias=True)
(proj_drop): Dropout(p=0.0, inplace=False)
(softmax): Softmax(dim=-1)
)
(drop_path): DropPath()
(norm2): LayerNorm((192,), eps=1e-05, elementwise_affine=True)
(mlp): Mlp(
(fc1): Linear(in_features=192, out_features=768, bias=True)
(act): GELU(approximate='none')
(fc2): Linear(in_features=768, out_features=192, bias=True)
(drop): Dropout(p=0.0, inplace=False)
)
)
(1): SwinTransformerBlock(
dim=192, input_resolution=(32, 32), num_heads=6, window_size=8, shift_size=4, mlp_ratio=4.0
(norm1): LayerNorm((192,), eps=1e-05, elementwise_affine=True)
(attn): WindowAttention(
dim=192, window_size=(8, 8), pretrained_window_size=(0, 0), num_heads=6
(cpb_mlp): Sequential(
(0): Linear(in_features=2, out_features=512, bias=True)
(1): ReLU(inplace=True)
(2): Linear(in_features=512, out_features=6, bias=False)
)
(qkv): Linear(in_features=192, out_features=576, bias=False)
(attn_drop): Dropout(p=0.0, inplace=False)
(proj): Linear(in_features=192, out_features=192, bias=True)
(proj_drop): Dropout(p=0.0, inplace=False)
(softmax): Softmax(dim=-1)
)
(drop_path): DropPath()
(norm2): LayerNorm((192,), eps=1e-05, elementwise_affine=True)
(mlp): Mlp(
(fc1): Linear(in_features=192, out_features=768, bias=True)
(act): GELU(approximate='none')
(fc2): Linear(in_features=768, out_features=192, bias=True)
(drop): Dropout(p=0.0, inplace=False)
)
)
)
(downsample): PatchMerging(
input_resolution=(32, 32), dim=192
(reduction): Linear(in_features=768, out_features=384, bias=False)
(norm): LayerNorm((384,), eps=1e-05, elementwise_affine=True)
)
)
(2): BasicLayer(
dim=384, input_resolution=(16, 16), depth=6
(blocks): ModuleList(
(0): SwinTransformerBlock(
dim=384, input_resolution=(16, 16), num_heads=12, window_size=8, shift_size=0, mlp_ratio=4.0
(norm1): LayerNorm((384,), eps=1e-05, elementwise_affine=True)
(attn): WindowAttention(
dim=384, window_size=(8, 8), pretrained_window_size=(0, 0), num_heads=12
(cpb_mlp): Sequential(
(0): Linear(in_features=2, out_features=512, bias=True)
(1): ReLU(inplace=True)
(2): Linear(in_features=512, out_features=12, bias=False)
)
(qkv): Linear(in_features=384, out_features=1152, bias=False)
(attn_drop): Dropout(p=0.0, inplace=False)
(proj): Linear(in_features=384, out_features=384, bias=True)
(proj_drop): Dropout(p=0.0, inplace=False)
(softmax): Softmax(dim=-1)
)
(drop_path): DropPath()
(norm2): LayerNorm((384,), eps=1e-05, elementwise_affine=True)
(mlp): Mlp(
(fc1): Linear(in_features=384, out_features=1536, bias=True)
(act): GELU(approximate='none')
(fc2): Linear(in_features=1536, out_features=384, bias=True)
(drop): Dropout(p=0.0, inplace=False)
)
)
(1): SwinTransformerBlock(
dim=384, input_resolution=(16, 16), num_heads=12, window_size=8, shift_size=4, mlp_ratio=4.0
(norm1): LayerNorm((384,), eps=1e-05, elementwise_affine=True)
(attn): WindowAttention(
dim=384, window_size=(8, 8), pretrained_window_size=(0, 0), num_heads=12
(cpb_mlp): Sequential(
(0): Linear(in_features=2, out_features=512, bias=True)
(1): ReLU(inplace=True)
(2): Linear(in_features=512, out_features=12, bias=False)
)
(qkv): Linear(in_features=384, out_features=1152, bias=False)
(attn_drop): Dropout(p=0.0, inplace=False)
(proj): Linear(in_features=384, out_features=384, bias=True)
(proj_drop): Dropout(p=0.0, inplace=False)
(softmax): Softmax(dim=-1)
)
(drop_path): DropPath()
(norm2): LayerNorm((384,), eps=1e-05, elementwise_affine=True)
(mlp): Mlp(
(fc1): Linear(in_features=384, out_features=1536, bias=True)
(act): GELU(approximate='none')
(fc2): Linear(in_features=1536, out_features=384, bias=True)
(drop): Dropout(p=0.0, inplace=False)
)
)
(2): SwinTransformerBlock(
dim=384, input_resolution=(16, 16), num_heads=12, window_size=8, shift_size=0, mlp_ratio=4.0
(norm1): LayerNorm((384,), eps=1e-05, elementwise_affine=True)
(attn): WindowAttention(
dim=384, window_size=(8, 8), pretrained_window_size=(0, 0), num_heads=12
(cpb_mlp): Sequential(
(0): Linear(in_features=2, out_features=512, bias=True)
(1): ReLU(inplace=True)
(2): Linear(in_features=512, out_features=12, bias=False)
)
(qkv): Linear(in_features=384, out_features=1152, bias=False)
(attn_drop): Dropout(p=0.0, inplace=False)
(proj): Linear(in_features=384, out_features=384, bias=True)
(proj_drop): Dropout(p=0.0, inplace=False)
(softmax): Softmax(dim=-1)
)
(drop_path): DropPath()
(norm2): LayerNorm((384,), eps=1e-05, elementwise_affine=True)
(mlp): Mlp(
(fc1): Linear(in_features=384, out_features=1536, bias=True)
(act): GELU(approximate='none')
(fc2): Linear(in_features=1536, out_features=384, bias=True)
(drop): Dropout(p=0.0, inplace=False)
)
)
(3): SwinTransformerBlock(
dim=384, input_resolution=(16, 16), num_heads=12, window_size=8, shift_size=4, mlp_ratio=4.0
(norm1): LayerNorm((384,), eps=1e-05, elementwise_affine=True)
(attn): WindowAttention(
dim=384, window_size=(8, 8), pretrained_window_size=(0, 0), num_heads=12
(cpb_mlp): Sequential(
(0): Linear(in_features=2, out_features=512, bias=True)
(1): ReLU(inplace=True)
(2): Linear(in_features=512, out_features=12, bias=False)
)
(qkv): Linear(in_features=384, out_features=1152, bias=False)
(attn_drop): Dropout(p=0.0, inplace=False)
(proj): Linear(in_features=384, out_features=384, bias=True)
(proj_drop): Dropout(p=0.0, inplace=False)
(softmax): Softmax(dim=-1)
)
(drop_path): DropPath()
(norm2): LayerNorm((384,), eps=1e-05, elementwise_affine=True)
(mlp): Mlp(
(fc1): Linear(in_features=384, out_features=1536, bias=True)
(act): GELU(approximate='none')
(fc2): Linear(in_features=1536, out_features=384, bias=True)
(drop): Dropout(p=0.0, inplace=False)
)
)
(4): SwinTransformerBlock(
dim=384, input_resolution=(16, 16), num_heads=12, window_size=8, shift_size=0, mlp_ratio=4.0
(norm1): LayerNorm((384,), eps=1e-05, elementwise_affine=True)
(attn): WindowAttention(
dim=384, window_size=(8, 8), pretrained_window_size=(0, 0), num_heads=12
(cpb_mlp): Sequential(
(0): Linear(in_features=2, out_features=512, bias=True)
(1): ReLU(inplace=True)
(2): Linear(in_features=512, out_features=12, bias=False)
)
(qkv): Linear(in_features=384, out_features=1152, bias=False)
(attn_drop): Dropout(p=0.0, inplace=False)
(proj): Linear(in_features=384, out_features=384, bias=True)
(proj_drop): Dropout(p=0.0, inplace=False)
(softmax): Softmax(dim=-1)
)
(drop_path): DropPath()
(norm2): LayerNorm((384,), eps=1e-05, elementwise_affine=True)
(mlp): Mlp(
(fc1): Linear(in_features=384, out_features=1536, bias=True)
(act): GELU(approximate='none')
(fc2): Linear(in_features=1536, out_features=384, bias=True)
(drop): Dropout(p=0.0, inplace=False)
)
)
(5): SwinTransformerBlock(
dim=384, input_resolution=(16, 16), num_heads=12, window_size=8, shift_size=4, mlp_ratio=4.0
(norm1): LayerNorm((384,), eps=1e-05, elementwise_affine=True)
(attn): WindowAttention(
dim=384, window_size=(8, 8), pretrained_window_size=(0, 0), num_heads=12
(cpb_mlp): Sequential(
(0): Linear(in_features=2, out_features=512, bias=True)
(1): ReLU(inplace=True)
(2): Linear(in_features=512, out_features=12, bias=False)
)
(qkv): Linear(in_features=384, out_features=1152, bias=False)
(attn_drop): Dropout(p=0.0, inplace=False)
(proj): Linear(in_features=384, out_features=384, bias=True)
(proj_drop): Dropout(p=0.0, inplace=False)
(softmax): Softmax(dim=-1)
)
(drop_path): DropPath()
(norm2): LayerNorm((384,), eps=1e-05, elementwise_affine=True)
(mlp): Mlp(
(fc1): Linear(in_features=384, out_features=1536, bias=True)
(act): GELU(approximate='none')
(fc2): Linear(in_features=1536, out_features=384, bias=True)
(drop): Dropout(p=0.0, inplace=False)
)
)
)
(downsample): PatchMerging(
input_resolution=(16, 16), dim=384
(reduction): Linear(in_features=1536, out_features=768, bias=False)
(norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
)
)
(3): BasicLayer(
dim=768, input_resolution=(8, 8), depth=2
(blocks): ModuleList(
(0-1): 2 x SwinTransformerBlock(
dim=768, input_resolution=(8, 8), num_heads=24, window_size=8, shift_size=0, mlp_ratio=4.0
(norm1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(attn): WindowAttention(
dim=768, window_size=(8, 8), pretrained_window_size=(0, 0), num_heads=24
(cpb_mlp): Sequential(
(0): Linear(in_features=2, out_features=512, bias=True)
(1): ReLU(inplace=True)
(2): Linear(in_features=512, out_features=24, bias=False)
)
(qkv): Linear(in_features=768, out_features=2304, bias=False)
(attn_drop): Dropout(p=0.0, inplace=False)
(proj): Linear(in_features=768, out_features=768, bias=True)
(proj_drop): Dropout(p=0.0, inplace=False)
(softmax): Softmax(dim=-1)
)
(drop_path): DropPath()
(norm2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(mlp): Mlp(
(fc1): Linear(in_features=768, out_features=3072, bias=True)
(act): GELU(approximate='none')
(fc2): Linear(in_features=3072, out_features=768, bias=True)
(drop): Dropout(p=0.0, inplace=False)
)
)
)
)
)
(norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(avgpool): AdaptiveAvgPool1d(output_size=1)
(head): Linear(in_features=768, out_features=1000, bias=True)
)
[2025-01-15 16:39:17 swinv2_tiny_patch4_window8_256](main.py 98): INFO number of params: 28347154
[2025-01-15 16:39:17 swinv2_tiny_patch4_window8_256](main.py 101): INFO number of GFLOPs: 5.925697536
/home/oongjoon/Desktop/Github/flashattn_test/Swin-Transformer/utils.py:203: FutureWarning: torch.cuda.amp.GradScaler(args...) is deprecated. Please use torch.amp.GradScaler('cuda', args...) instead.
self._scaler = torch.cuda.amp.GradScaler()
All checkpoints founded in output/swinv2_tiny_patch4_window8_256/default: []
[2025-01-15 16:39:17 swinv2_tiny_patch4_window8_256](main.py 151): INFO no checkpoint found in output/swinv2_tiny_patch4_window8_256/default, ignoring auto resume
[2025-01-15 16:39:17 swinv2_tiny_patch4_window8_256](utils.py 19): INFO ==============> Resuming form ././Swin-model-1k/swinv2/swinv2_tiny_patch4_window8_256.pth....................
/home/oongjoon/Desktop/Github/flashattn_test/Swin-Transformer/utils.py:24: FutureWarning: You are using torch.load with weights_only=False (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for weights_only will be flipped to True. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via torch.serialization.add_safe_globals. We recommend you start setting weights_only=True for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
checkpoint = torch.load(config.MODEL.RESUME, map_location='cpu')
[2025-01-15 16:39:17 swinv2_tiny_patch4_window8_256](utils.py 26): INFO
/home/oongjoon/Desktop/Github/flashattn_test/Swin-Transformer/main.py:308: FutureWarning: torch.cuda.amp.autocast(args...) is deprecated. Please use torch.amp.autocast('cuda', args...) instead.
with torch.cuda.amp.autocast(enabled=config.AMP_ENABLE):
[rank0]: Traceback (most recent call last):
[rank0]: File "/home/oongjoon/Desktop/Github/flashattn_test/Swin-Transformer/main.py", line 440, in
[rank0]: main(config)
[rank0]: File "/home/oongjoon/Desktop/Github/flashattn_test/Swin-Transformer/main.py", line 155, in main
[rank0]: acc1, acc5, loss = validate(config, data_loader_val, model)
[rank0]: File "/home/oongjoon/Desktop/Github/flashattn/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]: return func(*args, **kwargs)
[rank0]: File "/home/oongjoon/Desktop/Github/flashattn_test/Swin-Transformer/main.py", line 314, in validate
[rank0]: output = model(images)
[rank0]: File "/home/oongjoon/Desktop/Github/flashattn/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File "/home/oongjoon/Desktop/Github/flashattn/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: File "/home/oongjoon/Desktop/Github/flashattn/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1643, in forward
[rank0]: else self._run_ddp_forward(*inputs, **kwargs)
[rank0]: File "/home/oongjoon/Desktop/Github/flashattn/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1459, in _run_ddp_forward
[rank0]: return self.module(*inputs, **kwargs) # type: ignore[index]
[rank0]: File "/home/oongjoon/Desktop/Github/flashattn/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File "/home/oongjoon/Desktop/Github/flashattn/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1844, in _call_impl
[rank0]: return inner()
[rank0]: File "/home/oongjoon/Desktop/Github/flashattn/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1790, in inner
[rank0]: result = forward_call(*args, **kwargs)
[rank0]: File "/home/oongjoon/Desktop/Github/flashattn_test/Swin-Transformer/models/swin_transformer_v2.py", line 627, in forward
[rank0]: x = self.forward_features(x)
[rank0]: File "/home/oongjoon/Desktop/Github/flashattn_test/Swin-Transformer/models/swin_transformer_v2.py", line 619, in forward_features
[rank0]: x = layer(x)
[rank0]: File "/home/oongjoon/Desktop/Github/flashattn/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File "/home/oongjoon/Desktop/Github/flashattn/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1844, in _call_impl
[rank0]: return inner()
[rank0]: File "/home/oongjoon/Desktop/Github/flashattn/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1790, in inner
[rank0]: result = forward_call(*args, **kwargs)
[rank0]: File "/home/oongjoon/Desktop/Github/flashattn_test/Swin-Transformer/models/swin_transformer_v2.py", line 434, in forward
[rank0]: x = blk(x)
[rank0]: File "/home/oongjoon/Desktop/Github/flashattn/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File "/home/oongjoon/Desktop/Github/flashattn/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1844, in _call_impl
[rank0]: return inner()
[rank0]: File "/home/oongjoon/Desktop/Github/flashattn/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1790, in inner
[rank0]: result = forward_call(*args, **kwargs)
[rank0]: File "/home/oongjoon/Desktop/Github/flashattn_test/Swin-Transformer/models/swin_transformer_v2.py", line 292, in forward
[rank0]: attn_windows = self.attn(x_windows, mask=self.attn_mask)  # nW*B, window_size*window_size, C
[rank0]: File "/home/oongjoon/Desktop/Github/flashattn/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File "/home/oongjoon/Desktop/Github/flashattn/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1844, in _call_impl
[rank0]: return inner()
[rank0]: File "/home/oongjoon/Desktop/Github/flashattn/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1790, in inner
[rank0]: result = forward_call(*args, **kwargs)
[rank0]: File "/home/oongjoon/Desktop/Github/flashattn_test/Swin-Transformer/models/swin_transformer_v2.py", line 159, in forward
[rank0]: logit_scale = torch.clamp(self.logit_scale, max=torch.log( torch.tensor(1. / 0.01) ) ).exp()
[rank0]: RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument max in method wrapper_CUDA_clamp_Tensor)
[rank0]:[W115 16:39:20.040987096 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator())
E0115 16:39:21.135000 14576 torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 0 (pid: 14603) of binary: /home/oongjoon/Desktop/Github/flashattn/bin/python
Traceback (most recent call last):
File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/home/oongjoon/Desktop/Github/flashattn/lib/python3.10/site-packages/torch/distributed/launch.py", line 208, in
main()
File "/home/oongjoon/Desktop/Github/flashattn/lib/python3.10/site-packages/typing_extensions.py", line 2853, in wrapper
return arg(*args, **kwargs)
File "/home/oongjoon/Desktop/Github/flashattn/lib/python3.10/site-packages/torch/distributed/launch.py", line 204, in main
launch(args)
File "/home/oongjoon/Desktop/Github/flashattn/lib/python3.10/site-packages/torch/distributed/launch.py", line 189, in launch
run(args)
File "/home/oongjoon/Desktop/Github/flashattn/lib/python3.10/site-packages/torch/distributed/run.py", line 910, in run
elastic_launch(
File "/home/oongjoon/Desktop/Github/flashattn/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 138, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/oongjoon/Desktop/Github/flashattn/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

main.py FAILED

Failures:
<NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
time : 2025-01-15_16:39:21
host : oongjoon-System-Product-Name
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 14603)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
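
The failure can also be reproduced in isolation, outside the training script: clamping a CUDA tensor against a CPU-resident max tensor raises the same RuntimeError. A minimal sketch, assuming a CUDA device is available:

import torch

# Minimal standalone reproduction of the device mismatch seen at
# swin_transformer_v2.py line 159 (assumes a CUDA device is available).
logit_scale = torch.zeros(3, 1, 1, device="cuda")
try:
    torch.clamp(logit_scale, max=torch.log(torch.tensor(1. / 0.01)))  # max tensor is on the CPU
except RuntimeError as e:
    print(e)  # Expected all tensors to be on the same device ...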

@woongjoonchoi changed the title from "swin_transformer_v2.py error RuntimeError fixed" to "swin_transformer_v2.py RuntimeError Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument max in method wrapper_CUDA_clamp_Tensor)" on Jan 15, 2025