How to train a model on a single GPU without distributed setting? #585
-
Hi authors, thanks for sharing a great codebase! I want to train a model without the distributed setting so I can set breakpoints and step through the code, but I ran into a distributed-related error when running:

python tools/train.py configs/selfsup/mocov3/mocov3_vit-small-p16_32xb128-fp16-coslr-300e_in1k-224.py
/homes/ota/anaconda3/envs/tactile_insertion/lib/python3.9/site-packages/mmcv/__init__.py:20: UserWarning: On January 1, 2023, MMCV will release v2.0.0, in which it will remove components related to the training process and add a data transformation module. In addition, it will rename the package names mmcv to mmcv-lite and mmcv-full to mmcv. See https://github.com/open-mmlab/mmcv/blob/master/docs/en/compatibility.md for more details.
warnings.warn(
/homes/ota/workspace/robotics/mmselfsup/mmselfsup/utils/setup_env.py:32: UserWarning: Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
warnings.warn(
/homes/ota/workspace/robotics/mmselfsup/mmselfsup/utils/setup_env.py:42: UserWarning: Setting MKL_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
warnings.warn(
2022-11-19 16:27:48,993 - mmselfsup - INFO - Environment info:
------------------------------------------------------------
sys.platform: linux
Python: 3.9.13 | packaged by conda-forge | (main, May 27 2022, 16:58:50) [GCC 10.3.0]
CUDA available: True
GPU 0: NVIDIA GeForce RTX 3090 Ti
CUDA_HOME: /homes/ota/anaconda3/envs/tactile_insertion
NVCC: Cuda compilation tools, release 11.7, V11.7.99
GCC: gcc (Ubuntu 7.5.0-6ubuntu2) 7.5.0
PyTorch: 1.12.1.post201
PyTorch compiling details: PyTorch built with:
- GCC 10.4
- C++ Version: 201402
- Intel(R) oneAPI Math Kernel Library Version 2022.1-Product Build 20220311 for Intel(R) 64 architecture applications
- OpenMP 201511 (a.k.a. OpenMP 4.5)
- LAPACK is enabled (usually provided by MKL)
- NNPACK is enabled
- CPU capability usage: AVX2
- CUDA Runtime 11.7
- Built with CUDA Runtime 11.2
- NVCC architecture flags: -gencode;arch=compute_35,code=sm_35;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_86,code=compute_86
- CuDNN 8.4.1 (built against CUDA 11.6)
- Magma 2.5.4
- Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.2, CUDNN_VERSION=8.4.1, CXX_COMPILER=/home/conda/feedstock_root/build_artifacts/pytorch-recipe_1664405705473/_build_env/bin/x86_64-conda-linux-gnu-c++, CXX_FLAGS=-std=c++17 -fmessage-length=0 -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -ffunction-sections -pipe -isystem /home/conda/feedstock_root/build_artifacts/pytorch-recipe_1664405705473/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placeh/include -fdebug-prefix-map=/home/conda/feedstock_root/build_artifacts/pytorch-recipe_1664405705473/work=/usr/local/src/conda/pytorch-1.12.1 -fdebug-prefix-map=/home/conda/feedstock_root/build_artifacts/pytorch-recipe_1664405705473/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placeh=/usr/local/src/conda-prefix -isystem /usr/local/cuda/include -Wno-deprecated-declarations -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.12.1, USE_CUDA=1, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=OFF, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF,
TorchVision: 0.13.0a0+8069656
OpenCV: 4.6.0
MMCV: 1.7.0
MMCV Compiler: GCC 7.5
MMCV CUDA Compiler: 11.7
MMSelfSup: 0.10.1+f5a38fc
------------------------------------------------------------
2022-11-19 16:27:49,615 - mmselfsup - INFO - Distributed training: False
2022-11-19 16:27:50,216 - mmselfsup - INFO - Config:
model = dict(
type='MoCoV3',
base_momentum=0.99,
backbone=dict(
type='VisionTransformer',
arch='mocov3-small',
img_size=224,
patch_size=16,
stop_grad_conv1=True),
neck=dict(
type='NonLinearNeck',
in_channels=384,
hid_channels=4096,
out_channels=256,
num_layers=3,
with_bias=False,
with_last_bn=True,
with_last_bn_affine=False,
with_last_bias=False,
with_avg_pool=False,
vit_backbone=True),
head=dict(
type='MoCoV3Head',
predictor=dict(
type='NonLinearNeck',
in_channels=256,
hid_channels=4096,
out_channels=256,
num_layers=2,
with_bias=False,
with_last_bn=True,
with_last_bn_affine=False,
with_last_bias=False,
with_avg_pool=False),
temperature=0.2))
data_source = 'ImageNet'
dataset_type = 'MultiViewDataset'
img_norm_cfg = dict(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
train_pipeline1 = [
dict(type='RandomResizedCrop', size=224, scale=(0.08, 1.0)),
dict(
type='RandomAppliedTrans',
transforms=[
dict(
type='ColorJitter',
brightness=0.4,
contrast=0.4,
saturation=0.2,
hue=0.1)
],
p=0.8),
dict(type='RandomGrayscale', p=0.2),
dict(type='GaussianBlur', sigma_min=0.1, sigma_max=2.0, p=1.0),
dict(type='Solarization', p=0.0),
dict(type='RandomHorizontalFlip'),
dict(type='ToTensor'),
dict(
type='Normalize',
mean=[0.485, 0.456, 0.406],
std=[0.229, 0.224, 0.225])
]
train_pipeline2 = [
dict(type='RandomResizedCrop', size=224, scale=(0.08, 1.0)),
dict(
type='RandomAppliedTrans',
transforms=[
dict(
type='ColorJitter',
brightness=0.4,
contrast=0.4,
saturation=0.2,
hue=0.1)
],
p=0.8),
dict(type='RandomGrayscale', p=0.2),
dict(type='GaussianBlur', sigma_min=0.1, sigma_max=2.0, p=0.1),
dict(type='Solarization', p=0.2),
dict(type='RandomHorizontalFlip'),
dict(type='ToTensor'),
dict(
type='Normalize',
mean=[0.485, 0.456, 0.406],
std=[0.229, 0.224, 0.225])
]
prefetch = False
data = dict(
samples_per_gpu=128,
workers_per_gpu=4,
train=dict(
type='MultiViewDataset',
data_source=dict(
type='ImageNet',
data_prefix=
'/projects/DA/ContactRichRoboticManipulation/ImageNet/train',
ann_file=
'/projects/DA/ContactRichRoboticManipulation/ImageNet/meta/train.txt'
),
num_views=[1, 1],
pipelines=[[{
'type': 'RandomResizedCrop',
'size': 224,
'scale': (0.08, 1.0)
}, {
'type':
'RandomAppliedTrans',
'transforms': [{
'type': 'ColorJitter',
'brightness': 0.4,
'contrast': 0.4,
'saturation': 0.2,
'hue': 0.1
}],
'p':
0.8
}, {
'type': 'RandomGrayscale',
'p': 0.2
}, {
'type': 'GaussianBlur',
'sigma_min': 0.1,
'sigma_max': 2.0,
'p': 1.0
}, {
'type': 'Solarization',
'p': 0.0
}, {
'type': 'RandomHorizontalFlip'
}, {
'type': 'ToTensor'
}, {
'type': 'Normalize',
'mean': [0.485, 0.456, 0.406],
'std': [0.229, 0.224, 0.225]
}],
[{
'type': 'RandomResizedCrop',
'size': 224,
'scale': (0.08, 1.0)
}, {
'type':
'RandomAppliedTrans',
'transforms': [{
'type': 'ColorJitter',
'brightness': 0.4,
'contrast': 0.4,
'saturation': 0.2,
'hue': 0.1
}],
'p':
0.8
}, {
'type': 'RandomGrayscale',
'p': 0.2
}, {
'type': 'GaussianBlur',
'sigma_min': 0.1,
'sigma_max': 2.0,
'p': 0.1
}, {
'type': 'Solarization',
'p': 0.2
}, {
'type': 'RandomHorizontalFlip'
}, {
'type': 'ToTensor'
}, {
'type': 'Normalize',
'mean': [0.485, 0.456, 0.406],
'std': [0.229, 0.224, 0.225]
}]],
prefetch=False))
optimizer = dict(type='AdamW', lr=0.0024, weight_decay=0.1)
optimizer_config = dict()
lr_config = dict(
policy='CosineAnnealing',
by_epoch=False,
min_lr=0.0,
warmup='linear',
warmup_iters=40,
warmup_ratio=0.0001,
warmup_by_epoch=True)
runner = dict(type='EpochBasedRunner', max_epochs=300)
checkpoint_config = dict(interval=10, max_keep_ckpts=3)
log_config = dict(interval=50, hooks=[dict(type='TextLoggerHook')])
dist_params = dict(backend='nccl')
cudnn_benchmark = True
log_level = 'INFO'
load_from = None
resume_from = None
workflow = [('train', 1)]
persistent_workers = True
opencv_num_threads = 0
mp_start_method = 'fork'
custom_hooks = [dict(type='MomentumUpdateHook')]
fp16 = dict(loss_scale='dynamic')
work_dir = './work_dirs/selfsup/mocov3_vit-small-p16_32xb128-fp16-coslr-300e_in1k-224'
auto_resume = False
gpu_ids = [0]
2022-11-19 16:27:50,216 - mmselfsup - INFO - Set random seed to 1971259339, deterministic: False
/homes/ota/anaconda3/envs/tactile_insertion/lib/python3.9/site-packages/torch/functional.py:478: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at /home/conda/feedstock_root/build_artifacts/pytorch-recipe_1664405705473/work/aten/src/ATen/native/TensorShape.cpp:2894.)
return _VF.meshgrid(tensors, **kwargs) # type: ignore[attr-defined]
2022-11-19 16:27:50,758 - mmselfsup - INFO - initialize NonLinearNeck with init_cfg [{'type': 'Constant', 'val': 1, 'layer': ['_BatchNorm', 'GroupNorm']}]
2022-11-19 16:27:50,763 - mmselfsup - INFO - initialize NonLinearNeck with init_cfg [{'type': 'Constant', 'val': 1, 'layer': ['_BatchNorm', 'GroupNorm']}]
2022-11-19 16:27:55,216 - mmselfsup - INFO - Start running, host: ota@intern252dt, work_dir: /homes/ota/workspace/robotics/mmselfsup/work_dirs/selfsup/mocov3_vit-small-p16_32xb128-fp16-coslr-300e_in1k-224
2022-11-19 16:27:55,217 - mmselfsup - INFO - Hooks will be executed in the following order:
before_run:
(VERY_HIGH ) CosineAnnealingLrUpdaterHook
(ABOVE_NORMAL) GradAccumFp16OptimizerHook
(NORMAL ) CheckpointHook
(VERY_LOW ) TextLoggerHook
--------------------
before_train_epoch:
(VERY_HIGH ) CosineAnnealingLrUpdaterHook
(LOW ) IterTimerHook
(VERY_LOW ) TextLoggerHook
--------------------
before_train_iter:
(VERY_HIGH ) CosineAnnealingLrUpdaterHook
(NORMAL ) MomentumUpdateHook
(LOW ) IterTimerHook
--------------------
after_train_iter:
(ABOVE_NORMAL) GradAccumFp16OptimizerHook
(NORMAL ) CheckpointHook
(NORMAL ) MomentumUpdateHook
(LOW ) IterTimerHook
(VERY_LOW ) TextLoggerHook
--------------------
after_train_epoch:
(NORMAL ) CheckpointHook
(VERY_LOW ) TextLoggerHook
--------------------
before_val_epoch:
(LOW ) IterTimerHook
(VERY_LOW ) TextLoggerHook
--------------------
before_val_iter:
(LOW ) IterTimerHook
--------------------
after_val_iter:
(LOW ) IterTimerHook
--------------------
after_val_epoch:
(VERY_LOW ) TextLoggerHook
--------------------
after_run:
(VERY_LOW ) TextLoggerHook
--------------------
2022-11-19 16:27:55,217 - mmselfsup - INFO - workflow: [('train', 1)], max: 300 epochs
2022-11-19 16:27:55,218 - mmselfsup - INFO - Checkpoints will be saved to /homes/ota/workspace/robotics/mmselfsup/work_dirs/selfsup/mocov3_vit-small-p16_32xb128-fp16-coslr-300e_in1k-224 by HardDiskBackend.
Traceback (most recent call last):
File "/homes/ota/workspace/robotics/mmselfsup/tools/train.py", line 200, in <module>
main()
File "/homes/ota/workspace/robotics/mmselfsup/tools/train.py", line 190, in main
train_model(
File "/homes/ota/workspace/robotics/mmselfsup/mmselfsup/apis/train.py", line 216, in train_model
runner.run(data_loaders, cfg.workflow)
File "/homes/ota/anaconda3/envs/tactile_insertion/lib/python3.9/site-packages/mmcv/runner/epoch_based_runner.py", line 136, in run
epoch_runner(data_loaders[i], **kwargs)
File "/homes/ota/anaconda3/envs/tactile_insertion/lib/python3.9/site-packages/mmcv/runner/epoch_based_runner.py", line 53, in train
self.run_iter(data_batch, train_mode=True, **kwargs)
File "/homes/ota/anaconda3/envs/tactile_insertion/lib/python3.9/site-packages/mmcv/runner/epoch_based_runner.py", line 31, in run_iter
outputs = self.model.train_step(data_batch, self.optimizer,
File "/homes/ota/anaconda3/envs/tactile_insertion/lib/python3.9/site-packages/mmcv/parallel/data_parallel.py", line 77, in train_step
return self.module.train_step(*inputs[0], **kwargs[0])
File "/homes/ota/workspace/robotics/mmselfsup/mmselfsup/models/algorithms/base.py", line 132, in train_step
losses = self(**data)
File "/homes/ota/anaconda3/envs/tactile_insertion/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/homes/ota/anaconda3/envs/tactile_insertion/lib/python3.9/site-packages/mmcv/runner/fp16_utils.py", line 149, in new_func
output = old_func(*new_args, **new_kwargs)
File "/homes/ota/workspace/robotics/mmselfsup/mmselfsup/models/algorithms/base.py", line 62, in forward
return self.forward_train(img, **kwargs)
File "/homes/ota/workspace/robotics/mmselfsup/mmselfsup/models/algorithms/mocov3.py", line 94, in forward_train
q1 = self.base_encoder(view_1)[0]
File "/homes/ota/anaconda3/envs/tactile_insertion/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/homes/ota/anaconda3/envs/tactile_insertion/lib/python3.9/site-packages/torch/nn/modules/container.py", line 139, in forward
input = module(input)
File "/homes/ota/anaconda3/envs/tactile_insertion/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/homes/ota/workspace/robotics/mmselfsup/mmselfsup/models/necks/nonlinear_neck.py", line 101, in forward
x = self.bn0(x)
File "/homes/ota/anaconda3/envs/tactile_insertion/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/homes/ota/anaconda3/envs/tactile_insertion/lib/python3.9/site-packages/torch/nn/modules/batchnorm.py", line 731, in forward
world_size = torch.distributed.get_world_size(process_group)
File "/homes/ota/anaconda3/envs/tactile_insertion/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 867, in get_world_size
return _get_group_size(group)
File "/homes/ota/anaconda3/envs/tactile_insertion/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 325, in _get_group_size
default_pg = _get_default_group()
File "/homes/ota/anaconda3/envs/tactile_insertion/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 429, in _get_default_group
raise RuntimeError(
RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.
-
Hi, for single-GPU training we recommend using our 1.x branch, which fully supports single-GPU training. On the master branch you need to modify the config to use BN instead of SyncBN, and check the algorithm code to adjust distributed-only operations such as all_gather (see the sketches below).