Thank you for this repository.
I'm trying to train Deep Cluster V2 on a custom dataset (`sample_crowley_passport`) that I successfully registered. However, when I initiate training, the output logs are stuck at `initialized host ... as rank 0`. There is no error, but no progress either; nothing appears in the logs after that line.
I tried installing VISSL both from source and through pip and face the same issue either way.
I'm training the model in a Kubeflow notebook, with Python 3.8 and PyTorch 1.8.1 + CUDA 11.1.
I also tried training SimCLR on the same data, following the tutorial exactly, and hit the same issue.
Training does work on a CPU-only Kubeflow notebook, but there I could only train the models for which CPU test configs are available, so I was unable to train Deep Cluster V2.
Please help. I've attached the output logs below.
Instructions To Reproduce the Issue:
Full code you wrote or full changes you made (`git diff`): no changes.
Expected behavior:
I'd expect the model to train, or the code to throw an error if something is wrong.
Environment:
sys.platform linux
Python 3.8.10 | packaged by conda-forge | (default, May 11 2021, 07:01:05) [GCC 9.3.0]
numpy 1.19.5
Pillow 9.0.1
vissl 0.1.6 @/home/jovyan/vissl/vissl
GPU available True
GPU 0 Tesla K80
CUDA_HOME /usr/local/cuda
torchvision 0.9.1+cu101 @/opt/conda/lib/python3.8/site-packages/torchvision
hydra 1.0.7 @/opt/conda/lib/python3.8/site-packages/hydra
classy_vision 0.7.0.dev @/opt/conda/lib/python3.8/site-packages/classy_vision
tensorboard 2.8.0
apex 0.1 @/opt/conda/lib/python3.8/site-packages/apex
cv2 4.5.5
PyTorch 1.8.1+cu111 @/opt/conda/lib/python3.8/site-packages/torch
PyTorch debug build False
------------------- -------------------------------------------------------------------------------
PyTorch built with:
- GCC 7.3
- C++ Version: 201402
- Intel(R) Math Kernel Library Version 2020.0.0 Product Build 20191122 for Intel(R) 64 architecture applications
- Intel(R) MKL-DNN v1.7.0 (Git Hash 7aed236906b1f7a05c0917e5257a1af05e9ff683)
- OpenMP 201511 (a.k.a. OpenMP 4.5)
- NNPACK is enabled
- CPU capability usage: AVX2
- CUDA Runtime 11.1
- NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86
- CuDNN 8.0.5
- Magma 2.5.2
- Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.1, CUDNN_VERSION=8.0.5, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.8.1, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON,
CPU info:
------------------------------- ---------------------------------------------------------------------------------------------
Architecture x86_64
CPU op-mode(s) 32-bit, 64-bit
Byte Order Little Endian
Address sizes 46 bits physical, 48 bits virtual
CPU(s) 8
On-line CPU(s) list 0-7
Thread(s) per core 2
Core(s) per socket 4
Socket(s) 1
NUMA node(s) 1
Vendor ID GenuineIntel
CPU family 6
Model 79
Model name Intel(R) Xeon(R) CPU @ 2.20GHz
Stepping 0
CPU MHz 2199.998
BogoMIPS 4399.99
Hypervisor vendor KVM
Virtualization type full
L1d cache 128 KiB
L1i cache 128 KiB
L2 cache 1 MiB
L3 cache 55 MiB
NUMA node0 CPU(s) 0-7
Vulnerability Itlb multihit Not affected
Vulnerability L1tf Mitigation; PTE Inversion
Vulnerability Mds Mitigation; Clear CPU buffers; SMT Host state unknown
Vulnerability Meltdown Mitigation; PTI
Vulnerability Spec store bypass Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1 Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2 Mitigation; Full generic retpoline, IBPB conditional, IBRS_FW, STIBP conditional, RSB filling
Vulnerability Srbds Not affected
Vulnerability Tsx async abort Mitigation; Clear CPU buffers; SMT Host state unknown
I tried installing VISSL both through pip and from source and face the same issue either way: training does not progress.
Thank you for using VISSL :) Sorry for the late answer (I first caught COVID and then went on a one-month PTO).
So this really looks like an environment issue with distributed training. The initialisation of the distributed group seems to have gone fine, but maybe the test of the distributed training has failed:
In the code of `trainer_main.py`, there is a call to `dist.all_reduce(torch.zeros(1).cuda())` right after the initialisation of the distributed training that we saw in your logs. It might be what is failing, but we need to make sure of it to decide on the next steps.
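It may also be worth checking whether `torch.distributed` can complete such an `all_reduce` in this Kubeflow GPU environment at all, independently of VISSL. Below is a minimal sanity-check sketch written for this discussion (it is not part of VISSL; the `MASTER_ADDR`/`MASTER_PORT` values are arbitrary placeholders for a single-node run):

```python
# Minimal sanity check, independent of VISSL: initialise a single-process NCCL
# group and run the same kind of all_reduce that trainer_main.py performs, to see
# whether the hang comes from NCCL/CUDA in this environment rather than from VISSL.
import os
import torch
import torch.distributed as dist

# Placeholder rendezvous settings for a single-node, single-process group.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")

dist.init_process_group(backend="nccl", rank=0, world_size=1)
print("process group initialised", flush=True)

t = torch.zeros(1).cuda()
dist.all_reduce(t)  # the call suspected of hanging
torch.cuda.synchronize()
print("all_reduce finished:", t.item(), flush=True)

dist.destroy_process_group()
```

If this small script also hangs, the problem is in the environment (NCCL / driver / GPU setup) rather than in VISSL itself.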
If you installed from source, could you add some logs around the `dist.all_reduce(torch.zeros(1).cuda())` call in the `setup_distributed` function of `trainer_main.py`? Could you also add some logs in the following places (a rough sketch of the kind of logging meant is shown after this reply):
- before and after `self.task = build_task(self.cfg)` in `trainer_main.py`
- before and after `self.task.init_distributed_data_parallel_model()` in `trainer_main.py`
And then re-run your exact command to check what we get.
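For reference, here is a rough sketch of the kind of logging being asked for. This is an illustration only: the exact surrounding code in `trainer_main.py` is paraphrased from the calls named above and may differ in your checkout, and the log messages are arbitrary.

```python
# Sketch only: the surrounding VISSL code is paraphrased, not copied verbatim.
import logging

# 1) In setup_distributed(), around the post-init sanity all_reduce:
logging.info("setup_distributed: before test all_reduce")
dist.all_reduce(torch.zeros(1).cuda())
logging.info("setup_distributed: after test all_reduce")

# 2) Around task construction:
logging.info("before build_task")
self.task = build_task(self.cfg)
logging.info("after build_task")

# 3) Around DDP model initialisation:
logging.info("before init_distributed_data_parallel_model")
self.task.init_distributed_data_parallel_model()
logging.info("after init_distributed_data_parallel_model")
```

Whichever `logging.info` message is the last one to appear narrows down which call is hanging.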