How to train a model on a single GPU without distributed setting? #585
-
Hi authors, thanks for sharing a great codebase! I want to train a model without the distributed setting so I can set breakpoints and step through the code, but I ran into a distributed-related error when running:

python tools/train.py configs/selfsup/mocov3/mocov3_vit-small-p16_32xb128-fp16-coslr-300e_in1k-224.py
/homes/ota/anaconda3/envs/tactile_insertion/lib/python3.9/site-packages/mmcv/__init__.py:20: UserWarning: On January 1, 2023, MMCV will release v2.0.0, in which it will remove components related to the training process and add a data transformation module. In addition, it will rename the package names mmcv to mmcv-lite and mmcv-full to mmcv. See https://github.com/open-mmlab/mmcv/blob/master/docs/en/compatibility.md for more details.
warnings.warn(
/homes/ota/workspace/robotics/mmselfsup/mmselfsup/utils/setup_env.py:32: UserWarning: Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
warnings.warn(
/homes/ota/workspace/robotics/mmselfsup/mmselfsup/utils/setup_env.py:42: UserWarning: Setting MKL_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
warnings.warn(
2022-11-19 16:27:48,993 - mmselfsup - INFO - Environment info:
------------------------------------------------------------
sys.platform: linux
Python: 3.9.13 | packaged by conda-forge | (main, May 27 2022, 16:58:50) [GCC 10.3.0]
CUDA available: True
GPU 0: NVIDIA GeForce RTX 3090 Ti
CUDA_HOME: /homes/ota/anaconda3/envs/tactile_insertion
NVCC: Cuda compilation tools, release 11.7, V11.7.99
GCC: gcc (Ubuntu 7.5.0-6ubuntu2) 7.5.0
PyTorch: 1.12.1.post201
PyTorch compiling details: PyTorch built with:
- GCC 10.4
- C++ Version: 201402
- Intel(R) oneAPI Math Kernel Library Version 2022.1-Product Build 20220311 for Intel(R) 64 architecture applications
- OpenMP 201511 (a.k.a. OpenMP 4.5)
- LAPACK is enabled (usually provided by MKL)
- NNPACK is enabled
- CPU capability usage: AVX2
- CUDA Runtime 11.7
- Built with CUDA Runtime 11.2
- NVCC architecture flags: -gencode;arch=compute_35,code=sm_35;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_86,code=compute_86
- CuDNN 8.4.1 (built against CUDA 11.6)
- Magma 2.5.4
- Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.2, CUDNN_VERSION=8.4.1, CXX_COMPILER=/home/conda/feedstock_root/build_artifacts/pytorch-recipe_1664405705473/_build_env/bin/x86_64-conda-linux-gnu-c++, CXX_FLAGS=-std=c++17 -fmessage-length=0 -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -ffunction-sections -pipe -isystem /home/conda/feedstock_root/build_artifacts/pytorch-recipe_1664405705473/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placeh/include -fdebug-prefix-map=/home/conda/feedstock_root/build_artifacts/pytorch-recipe_1664405705473/work=/usr/local/src/conda/pytorch-1.12.1 -fdebug-prefix-map=/home/conda/feedstock_root/build_artifacts/pytorch-recipe_1664405705473/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placeh=/usr/local/src/conda-prefix -isystem /usr/local/cuda/include -Wno-deprecated-declarations -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.12.1, USE_CUDA=1, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=OFF, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF,
TorchVision: 0.13.0a0+8069656
OpenCV: 4.6.0
MMCV: 1.7.0
MMCV Compiler: GCC 7.5
MMCV CUDA Compiler: 11.7
MMSelfSup: 0.10.1+f5a38fc
------------------------------------------------------------
2022-11-19 16:27:49,615 - mmselfsup - INFO - Distributed training: False
2022-11-19 16:27:50,216 - mmselfsup - INFO - Config:
model = dict(
type='MoCoV3',
base_momentum=0.99,
backbone=dict(
type='VisionTransformer',
arch='mocov3-small',
img_size=224,
patch_size=16,
stop_grad_conv1=True),
neck=dict(
type='NonLinearNeck',
in_channels=384,
hid_channels=4096,
out_channels=256,
num_layers=3,
with_bias=False,
with_last_bn=True,
with_last_bn_affine=False,
with_last_bias=False,
with_avg_pool=False,
vit_backbone=True),
head=dict(
type='MoCoV3Head',
predictor=dict(
type='NonLinearNeck',
in_channels=256,
hid_channels=4096,
out_channels=256,
num_layers=2,
with_bias=False,
with_last_bn=True,
with_last_bn_affine=False,
with_last_bias=False,
with_avg_pool=False),
temperature=0.2))
data_source = 'ImageNet'
dataset_type = 'MultiViewDataset'
img_norm_cfg = dict(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
train_pipeline1 = [
dict(type='RandomResizedCrop', size=224, scale=(0.08, 1.0)),
dict(
type='RandomAppliedTrans',
transforms=[
dict(
type='ColorJitter',
brightness=0.4,
contrast=0.4,
saturation=0.2,
hue=0.1)
],
p=0.8),
dict(type='RandomGrayscale', p=0.2),
dict(type='GaussianBlur', sigma_min=0.1, sigma_max=2.0, p=1.0),
dict(type='Solarization', p=0.0),
dict(type='RandomHorizontalFlip'),
dict(type='ToTensor'),
dict(
type='Normalize',
mean=[0.485, 0.456, 0.406],
std=[0.229, 0.224, 0.225])
]
train_pipeline2 = [
dict(type='RandomResizedCrop', size=224, scale=(0.08, 1.0)),
dict(
type='RandomAppliedTrans',
transforms=[
dict(
type='ColorJitter',
brightness=0.4,
contrast=0.4,
saturation=0.2,
hue=0.1)
],
p=0.8),
dict(type='RandomGrayscale', p=0.2),
dict(type='GaussianBlur', sigma_min=0.1, sigma_max=2.0, p=0.1),
dict(type='Solarization', p=0.2),
dict(type='RandomHorizontalFlip'),
dict(type='ToTensor'),
dict(
type='Normalize',
mean=[0.485, 0.456, 0.406],
std=[0.229, 0.224, 0.225])
]
prefetch = False
data = dict(
samples_per_gpu=128,
workers_per_gpu=4,
train=dict(
type='MultiViewDataset',
data_source=dict(
type='ImageNet',
data_prefix=
'/projects/DA/ContactRichRoboticManipulation/ImageNet/train',
ann_file=
'/projects/DA/ContactRichRoboticManipulation/ImageNet/meta/train.txt'
),
num_views=[1, 1],
pipelines=[[{
'type': 'RandomResizedCrop',
'size': 224,
'scale': (0.08, 1.0)
}, {
'type':
'RandomAppliedTrans',
'transforms': [{
'type': 'ColorJitter',
'brightness': 0.4,
'contrast': 0.4,
'saturation': 0.2,
'hue': 0.1
}],
'p':
0.8
}, {
'type': 'RandomGrayscale',
'p': 0.2
}, {
'type': 'GaussianBlur',
'sigma_min': 0.1,
'sigma_max': 2.0,
'p': 1.0
}, {
'type': 'Solarization',
'p': 0.0
}, {
'type': 'RandomHorizontalFlip'
}, {
'type': 'ToTensor'
}, {
'type': 'Normalize',
'mean': [0.485, 0.456, 0.406],
'std': [0.229, 0.224, 0.225]
}],
[{
'type': 'RandomResizedCrop',
'size': 224,
'scale': (0.08, 1.0)
}, {
'type':
'RandomAppliedTrans',
'transforms': [{
'type': 'ColorJitter',
'brightness': 0.4,
'contrast': 0.4,
'saturation': 0.2,
'hue': 0.1
}],
'p':
0.8
}, {
'type': 'RandomGrayscale',
'p': 0.2
}, {
'type': 'GaussianBlur',
'sigma_min': 0.1,
'sigma_max': 2.0,
'p': 0.1
}, {
'type': 'Solarization',
'p': 0.2
}, {
'type': 'RandomHorizontalFlip'
}, {
'type': 'ToTensor'
}, {
'type': 'Normalize',
'mean': [0.485, 0.456, 0.406],
'std': [0.229, 0.224, 0.225]
}]],
prefetch=False))
optimizer = dict(type='AdamW', lr=0.0024, weight_decay=0.1)
optimizer_config = dict()
lr_config = dict(
policy='CosineAnnealing',
by_epoch=False,
min_lr=0.0,
warmup='linear',
warmup_iters=40,
warmup_ratio=0.0001,
warmup_by_epoch=True)
runner = dict(type='EpochBasedRunner', max_epochs=300)
checkpoint_config = dict(interval=10, max_keep_ckpts=3)
log_config = dict(interval=50, hooks=[dict(type='TextLoggerHook')])
dist_params = dict(backend='nccl')
cudnn_benchmark = True
log_level = 'INFO'
load_from = None
resume_from = None
workflow = [('train', 1)]
persistent_workers = True
opencv_num_threads = 0
mp_start_method = 'fork'
custom_hooks = [dict(type='MomentumUpdateHook')]
fp16 = dict(loss_scale='dynamic')
work_dir = './work_dirs/selfsup/mocov3_vit-small-p16_32xb128-fp16-coslr-300e_in1k-224'
auto_resume = False
gpu_ids = [0]
2022-11-19 16:27:50,216 - mmselfsup - INFO - Set random seed to 1971259339, deterministic: False
/homes/ota/anaconda3/envs/tactile_insertion/lib/python3.9/site-packages/torch/functional.py:478: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at /home/conda/feedstock_root/build_artifacts/pytorch-recipe_1664405705473/work/aten/src/ATen/native/TensorShape.cpp:2894.)
return _VF.meshgrid(tensors, **kwargs) # type: ignore[attr-defined]
2022-11-19 16:27:50,758 - mmselfsup - INFO - initialize NonLinearNeck with init_cfg [{'type': 'Constant', 'val': 1, 'layer': ['_BatchNorm', 'GroupNorm']}]
2022-11-19 16:27:50,763 - mmselfsup - INFO - initialize NonLinearNeck with init_cfg [{'type': 'Constant', 'val': 1, 'layer': ['_BatchNorm', 'GroupNorm']}]
2022-11-19 16:27:55,216 - mmselfsup - INFO - Start running, host: ota@intern252dt, work_dir: /homes/ota/workspace/robotics/mmselfsup/work_dirs/selfsup/mocov3_vit-small-p16_32xb128-fp16-coslr-300e_in1k-224
2022-11-19 16:27:55,217 - mmselfsup - INFO - Hooks will be executed in the following order:
before_run:
(VERY_HIGH ) CosineAnnealingLrUpdaterHook
(ABOVE_NORMAL) GradAccumFp16OptimizerHook
(NORMAL ) CheckpointHook
(VERY_LOW ) TextLoggerHook
--------------------
before_train_epoch:
(VERY_HIGH ) CosineAnnealingLrUpdaterHook
(LOW ) IterTimerHook
(VERY_LOW ) TextLoggerHook
--------------------
before_train_iter:
(VERY_HIGH ) CosineAnnealingLrUpdaterHook
(NORMAL ) MomentumUpdateHook
(LOW ) IterTimerHook
--------------------
after_train_iter:
(ABOVE_NORMAL) GradAccumFp16OptimizerHook
(NORMAL ) CheckpointHook
(NORMAL ) MomentumUpdateHook
(LOW ) IterTimerHook
(VERY_LOW ) TextLoggerHook
--------------------
after_train_epoch:
(NORMAL ) CheckpointHook
(VERY_LOW ) TextLoggerHook
--------------------
before_val_epoch:
(LOW ) IterTimerHook
(VERY_LOW ) TextLoggerHook
--------------------
before_val_iter:
(LOW ) IterTimerHook
--------------------
after_val_iter:
(LOW ) IterTimerHook
--------------------
after_val_epoch:
(VERY_LOW ) TextLoggerHook
--------------------
after_run:
(VERY_LOW ) TextLoggerHook
--------------------
2022-11-19 16:27:55,217 - mmselfsup - INFO - workflow: [('train', 1)], max: 300 epochs
2022-11-19 16:27:55,218 - mmselfsup - INFO - Checkpoints will be saved to /homes/ota/workspace/robotics/mmselfsup/work_dirs/selfsup/mocov3_vit-small-p16_32xb128-fp16-coslr-300e_in1k-224 by HardDiskBackend.
Traceback (most recent call last):
File "/homes/ota/workspace/robotics/mmselfsup/tools/train.py", line 200, in <module>
main()
File "/homes/ota/workspace/robotics/mmselfsup/tools/train.py", line 190, in main
train_model(
File "/homes/ota/workspace/robotics/mmselfsup/mmselfsup/apis/train.py", line 216, in train_model
runner.run(data_loaders, cfg.workflow)
File "/homes/ota/anaconda3/envs/tactile_insertion/lib/python3.9/site-packages/mmcv/runner/epoch_based_runner.py", line 136, in run
epoch_runner(data_loaders[i], **kwargs)
File "/homes/ota/anaconda3/envs/tactile_insertion/lib/python3.9/site-packages/mmcv/runner/epoch_based_runner.py", line 53, in train
self.run_iter(data_batch, train_mode=True, **kwargs)
File "/homes/ota/anaconda3/envs/tactile_insertion/lib/python3.9/site-packages/mmcv/runner/epoch_based_runner.py", line 31, in run_iter
outputs = self.model.train_step(data_batch, self.optimizer,
File "/homes/ota/anaconda3/envs/tactile_insertion/lib/python3.9/site-packages/mmcv/parallel/data_parallel.py", line 77, in train_step
return self.module.train_step(*inputs[0], **kwargs[0])
File "/homes/ota/workspace/robotics/mmselfsup/mmselfsup/models/algorithms/base.py", line 132, in train_step
losses = self(**data)
File "/homes/ota/anaconda3/envs/tactile_insertion/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/homes/ota/anaconda3/envs/tactile_insertion/lib/python3.9/site-packages/mmcv/runner/fp16_utils.py", line 149, in new_func
output = old_func(*new_args, **new_kwargs)
File "/homes/ota/workspace/robotics/mmselfsup/mmselfsup/models/algorithms/base.py", line 62, in forward
return self.forward_train(img, **kwargs)
File "/homes/ota/workspace/robotics/mmselfsup/mmselfsup/models/algorithms/mocov3.py", line 94, in forward_train
q1 = self.base_encoder(view_1)[0]
File "/homes/ota/anaconda3/envs/tactile_insertion/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/homes/ota/anaconda3/envs/tactile_insertion/lib/python3.9/site-packages/torch/nn/modules/container.py", line 139, in forward
input = module(input)
File "/homes/ota/anaconda3/envs/tactile_insertion/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/homes/ota/workspace/robotics/mmselfsup/mmselfsup/models/necks/nonlinear_neck.py", line 101, in forward
x = self.bn0(x)
File "/homes/ota/anaconda3/envs/tactile_insertion/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/homes/ota/anaconda3/envs/tactile_insertion/lib/python3.9/site-packages/torch/nn/modules/batchnorm.py", line 731, in forward
world_size = torch.distributed.get_world_size(process_group)
File "/homes/ota/anaconda3/envs/tactile_insertion/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 867, in get_world_size
return _get_group_size(group)
File "/homes/ota/anaconda3/envs/tactile_insertion/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 325, in _get_group_size
default_pg = _get_default_group()
File "/homes/ota/anaconda3/envs/tactile_insertion/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 429, in _get_default_group
raise RuntimeError(
RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.
-
Hi, for single-GPU training we recommend using our 1.x branch, which fully supports single-GPU training. On the master branch you need to modify the config to use BN instead of SyncBN, and check the algorithm code to adjust distributed-only operations such as all_gather (see the sketches below).