
[Bug] Distributed training with MLflowVisBackend crashes when close() is called multiple times #1144

Closed
2 tasks done
zimonitrome opened this issue May 12, 2023 · 4 comments
Labels
bug Something isn't working

Comments

@zimonitrome
Contributor

Prerequisite

Environment

OrderedDict([('sys.platform', 'linux'), ('Python', '3.8.16 (default, Mar 2 2023, 03:21:46) [GCC 11.2.0]'), ('CUDA available', True), ('numpy_random_seed', 2147483648), ('GPU 0,1', 'NVIDIA GeForce RTX 4070 Ti'), ('CUDA_HOME', None), ('GCC', 'gcc (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0'), ('PyTorch', '2.0.0'), ('PyTorch compiling details', 'PyTorch built with:\n - GCC 9.3\n - C++ Version: 201703\n - Intel(R) oneAPI Math Kernel Library Version 2023.1-Product Build 20230303 for Intel(R) 64 architecture applications\n - Intel(R) MKL-DNN v2.7.3 (Git Hash 6dbeffbae1f23cbbeae17adb7b5b13f1f37c080e)\n - OpenMP 201511 (a.k.a. OpenMP 4.5)\n - LAPACK is enabled (usually provided by MKL)\n - NNPACK is enabled\n - CPU capability usage: AVX2\n - CUDA Runtime 11.7\n - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_37,code=compute_37\n - CuDNN 8.5\n - Magma 2.6.1\n - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.7, CUDNN_VERSION=8.5.0, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wunused-local-typedefs -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_DISABLE_GPU_ASSERTS=ON, TORCH_VERSION=2.0.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, \n'), ('TorchVision', '0.13.1a0'), ('OpenCV', '4.7.0'), ('MMEngine', '0.7.3')])

Reproduces the problem - code sample

Add MLflowVisBackend to the config file:

visualizer = dict(
    type="SegLocalVisualizer",
    vis_backends=[
        dict(
            type="MLflowVisBackend",
            save_dir="mlruns",
            artifact_suffix=(".json", ".py", ".yaml", ".pth"),
        )
    ],
)

Reproduces the problem - command or script

      python -m torch.distributed.launch \
          --nnodes=1 \
          --node_rank=0 \
          --master_addr=127.0.0.1 \
          --nproc_per_node=2 \
          --master_port=29500 \
          ./train.py \
          {config}

Reproduces the problem - error message

The first traceback is from the training code; the remaining tracebacks are from torch.distributed.launch:

Traceback (most recent call last):
  File "./train.py", line 131, in <module>
    main()
  File "./train.py", line 125, in main
    runner.train()
  File "/home/sa/anaconda3/envs/openmmlab/lib/python3.8/site-packages/mmengine/runner/runner.py", line 1722, in train
    self.call_hook('after_run')
  File "/home/sa/anaconda3/envs/openmmlab/lib/python3.8/site-packages/mmengine/runner/runner.py", line 1783, in call_hook
    getattr(hook, fn_name)(self, **kwargs)
  File "/home/sa/anaconda3/envs/openmmlab/lib/python3.8/site-packages/mmengine/hooks/logger_hook.py", line 325, in after_run
    runner.visualizer.close()
  File "/home/sa/anaconda3/envs/openmmlab/lib/python3.8/site-packages/mmengine/visualization/visualizer.py", line 1147, in close
    vis_backend.close()
  File "/home/sa/anaconda3/envs/openmmlab/lib/python3.8/site-packages/mmengine/visualization/vis_backend.py", line 790, in close
    for filename in scandir(self.cfg.work_dir, self._artifact_suffix,
AttributeError: 'MLflowVisBackend' object has no attribute 'cfg'
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 260196) of binary: /home/sa/anaconda3/envs/openmmlab/bin/python
Traceback (most recent call last):
  File "/home/sa/anaconda3/envs/openmmlab/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/sa/anaconda3/envs/openmmlab/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/sa/anaconda3/envs/openmmlab/lib/python3.8/site-packages/torch/distributed/launch.py", line 196, in <module>
    main()
  File "/home/sa/anaconda3/envs/openmmlab/lib/python3.8/site-packages/torch/distributed/launch.py", line 192, in main
    launch(args)
  File "/home/sa/anaconda3/envs/openmmlab/lib/python3.8/site-packages/torch/distributed/launch.py", line 177, in launch
    run(args)
  File "/home/sa/anaconda3/envs/openmmlab/lib/python3.8/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/home/sa/anaconda3/envs/openmmlab/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/sa/anaconda3/envs/openmmlab/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
./train.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-05-12_14:17:13
  host      : vml.global.hvwan.net
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 260196)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
Traceback (most recent call last):
  File "launch_training_local_multi_gpu.py", line 7, in <module>
    local_env_run = mlflow.projects.run(
  File "/home/sa/anaconda3/envs/openmmlab/lib/python3.8/site-packages/mlflow/projects/__init__.py", line 355, in run
    _wait_for(submitted_run_obj)
  File "/home/sa/anaconda3/envs/openmmlab/lib/python3.8/site-packages/mlflow/projects/__init__.py", line 372, in _wait_for
    raise ExecutionException("Run (ID '%s') failed" % run_id)
mlflow.exceptions.ExecutionException: Run (ID '8e32713fe4734fde97c8ed10380bf235') failed

Additional information

I tried to run the default mmsegmentation distributed training script with an added configuration to log for MLflow.

  1. The expected behavior is for training to run correctly, as it does when running on CPU only or on a single GPU.
  2. The dataset used was the official Cityscapes dataset with the default dataset loader in mmsegmentation.
  3. The problem appears to be in the close() method of MLflowVisBackend (see vis_backend.py). When running multiple instances (such as in distributed training), close() is called multiple times; on the second call, self.cfg is no longer defined and the script crashes. A sketch of a defensive guard is shown after this list.
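
For illustration, here is a minimal sketch of the kind of defensive guard that would avoid the crash. This is not the actual patch merged into mmengine: the self._mlflow attribute, the PatchedMLflowVisBackend name, and the use of mmengine.utils.scandir are assumptions inferred from the traceback above (self.cfg and self._artifact_suffix do appear there directly).

import os.path as osp

from mmengine.utils import scandir


class PatchedMLflowVisBackend:
    """Sketch only: a close() that tolerates processes where add_config()
    was never called (e.g. non-zero ranks in distributed training)."""

    def close(self) -> None:
        # Assumption: the backend stores the imported mlflow module as
        # self._mlflow once it has been initialized.
        if not hasattr(self, '_mlflow'):
            return
        # Guard: only scan for artifacts if a config (and thus work_dir)
        # was recorded on this process; otherwise skip the upload instead of
        # raising AttributeError: 'MLflowVisBackend' object has no attribute 'cfg'.
        if hasattr(self, 'cfg') and getattr(self, '_artifact_suffix', None):
            for filename in scandir(self.cfg.work_dir, self._artifact_suffix,
                                    recursive=True):
                self._mlflow.log_artifact(osp.join(self.cfg.work_dir, filename))
        self._mlflow.end_run()

An equivalent approach would be to simply return early when self.cfg is absent; the fix in the linked PR may differ in detail.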
zimonitrome added the bug label on May 12, 2023
@feivelliu

I also encountered the same problem.

@HAOCHENYE
Collaborator

Thanks for your feedback. We've added this issue to our collaboration tasks, and I believe it will be fixed soon! We also welcome you to submit a PR to help us fix this issue. Looking forward to your contribution 😄!

@zimonitrome
Contributor Author

@HAOCHENYE I created a PR with a pretty simple fix that I have tested quite extensively. I can't really see how it would cause any breaking changes, but it would be nice if someone else could test it too. Maybe @liushea?

@zhouzaida
Collaborator

zhouzaida commented May 25, 2023

Closed by #1151
