Thank you for this repository.
I'm trying to train Deep Cluster V2 on a custom dataset (`sample_crowley_passport`) that I successfully registered. However, when I initiate training, the output logs are stuck at `initialized host ... as rank 0`. There is no error, but no progress either; nothing appears in the logs after that line.
I tried installing VISSL both from source and through pip and face the same issue either way.
I'm training the model in a Kubeflow notebook, with Python 3.8 and PyTorch 1.8.1 + CUDA 11.1.
I also tried training SimCLR on the same data, following the tutorial exactly, and hit the same issue.
Training does work on a CPU-only Kubeflow notebook, but there I could only train the models for which CPU test configs are available, so I was unable to train Deep Cluster V2.
Please help. I've attached the output logs below.
Instructions To Reproduce the Issue:
Full code you wrote or full changes you made (`git diff`): no changes.
Expected behavior:
I'd expect the model to train, or the code to throw an error if something is wrong.
Environment:
sys.platform linux
Python 3.8.10 | packaged by conda-forge | (default, May 11 2021, 07:01:05) [GCC 9.3.0]
numpy 1.19.5
Pillow 9.0.1
vissl 0.1.6 @/home/jovyan/vissl/vissl
GPU available True
GPU 0 Tesla K80
CUDA_HOME /usr/local/cuda
torchvision 0.9.1+cu101 @/opt/conda/lib/python3.8/site-packages/torchvision
hydra 1.0.7 @/opt/conda/lib/python3.8/site-packages/hydra
classy_vision 0.7.0.dev @/opt/conda/lib/python3.8/site-packages/classy_vision
tensorboard 2.8.0
apex 0.1 @/opt/conda/lib/python3.8/site-packages/apex
cv2 4.5.5
PyTorch 1.8.1+cu111 @/opt/conda/lib/python3.8/site-packages/torch
PyTorch debug build False
------------------- -------------------------------------------------------------------------------
PyTorch built with:
- GCC 7.3
- C++ Version: 201402
- Intel(R) Math Kernel Library Version 2020.0.0 Product Build 20191122 for Intel(R) 64 architecture applications
- Intel(R) MKL-DNN v1.7.0 (Git Hash 7aed236906b1f7a05c0917e5257a1af05e9ff683)
- OpenMP 201511 (a.k.a. OpenMP 4.5)
- NNPACK is enabled
- CPU capability usage: AVX2
- CUDA Runtime 11.1
- NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86
- CuDNN 8.0.5
- Magma 2.5.2
- Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.1, CUDNN_VERSION=8.0.5, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.8.1, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON,
CPU info:
------------------------------- ---------------------------------------------------------------------------------------------
Architecture x86_64
CPU op-mode(s) 32-bit, 64-bit
Byte Order Little Endian
Address sizes 46 bits physical, 48 bits virtual
CPU(s) 8
On-line CPU(s) list 0-7
Thread(s) per core 2
Core(s) per socket 4
Socket(s) 1
NUMA node(s) 1
Vendor ID GenuineIntel
CPU family 6
Model 79
Model name Intel(R) Xeon(R) CPU @ 2.20GHz
Stepping 0
CPU MHz 2199.998
BogoMIPS 4399.99
Hypervisor vendor KVM
Virtualization type full
L1d cache 128 KiB
L1i cache 128 KiB
L2 cache 1 MiB
L3 cache 55 MiB
NUMA node0 CPU(s) 0-7
Vulnerability Itlb multihit Not affected
Vulnerability L1tf Mitigation; PTE Inversion
Vulnerability Mds Mitigation; Clear CPU buffers; SMT Host state unknown
Vulnerability Meltdown Mitigation; PTI
Vulnerability Spec store bypass Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1 Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2 Mitigation; Full generic retpoline, IBPB conditional, IBRS_FW, STIBP conditional, RSB filling
Vulnerability Srbds Not affected
Vulnerability Tsx async abort Mitigation; Clear CPU buffers; SMT Host state unknown
I tried installing VISSL both through pip and from source and face the same issue either way: training does not progress.
Thank you for using VISSL :) Sorry for the late answer (I first caught COVID and then went on a one-month PTO).
So this really looks like an environment issue with distributed training. The initialisation of the distributed group seems to have gone fine, but maybe the test of the distributed training has failed:
In the code of `trainer_main.py`, there is a call to `dist.all_reduce(torch.zeros(1).cuda())` right after the initialisation of the distributed training that we saw in your logs. It might be what is failing, but we need to make sure of it to decide on the next steps.
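It may also be worth checking whether `torch.distributed` can complete such an `all_reduce` in this Kubeflow GPU environment at all, independently of VISSL. Below is a minimal sanity-check sketch written for this discussion (it is not part of VISSL; the `MASTER_ADDR`/`MASTER_PORT` values are arbitrary placeholders for a single-node run):

```python
# Minimal sanity check, independent of VISSL: initialise a single-process NCCL
# group and run the same kind of all_reduce that trainer_main.py performs, to see
# whether the hang comes from NCCL/CUDA in this environment rather than from VISSL.
import os
import torch
import torch.distributed as dist

# Placeholder rendezvous settings for a single-node, single-process group.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")

dist.init_process_group(backend="nccl", rank=0, world_size=1)
print("process group initialised", flush=True)

t = torch.zeros(1).cuda()
dist.all_reduce(t)  # the call suspected of hanging
torch.cuda.synchronize()
print("all_reduce finished:", t.item(), flush=True)

dist.destroy_process_group()
```

If this small script also hangs, the problem is in the environment (NCCL / driver / GPU setup) rather than in VISSL itself.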
If you installed from source, could you add some logs around the `dist.all_reduce(torch.zeros(1).cuda())` call in the `setup_distributed` function of `trainer_main.py`? Could you also add some logs in the following places (a rough sketch of the kind of logging meant is shown after this reply):
- before and after `self.task = build_task(self.cfg)` in `trainer_main.py`
- before and after `self.task.init_distributed_data_parallel_model()` in `trainer_main.py`
And then re-run your exact command to check what we get.
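For reference, here is a rough sketch of the kind of logging being asked for. This is an illustration only: the exact surrounding code in `trainer_main.py` is paraphrased from the calls named above and may differ in your checkout, and the log messages are arbitrary.

```python
# Sketch only: the surrounding VISSL code is paraphrased, not copied verbatim.
import logging

# 1) In setup_distributed(), around the post-init sanity all_reduce:
logging.info("setup_distributed: before test all_reduce")
dist.all_reduce(torch.zeros(1).cuda())
logging.info("setup_distributed: after test all_reduce")

# 2) Around task construction:
logging.info("before build_task")
self.task = build_task(self.cfg)
logging.info("after build_task")

# 3) Around DDP model initialisation:
logging.info("before init_distributed_data_parallel_model")
self.task.init_distributed_data_parallel_model()
logging.info("after init_distributed_data_parallel_model")
```

Whichever `logging.info` message is the last one to appear narrows down which call is hanging.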