Multi-GPU training #159

Closed
zouharvi opened this issue Aug 5, 2023 · 10 comments
Labels
bug Something isn't working

Comments

@zouharvi

zouharvi commented Aug 5, 2023

I am attempting to run comet-train with multiple GPUs.

Command (abbreviated):

CUDA_VISIBLE_DEVICES=0,1,2,3 comet-train ...

Config (abbreviated):

init_args:
  accelerator: gpu
  devices: 4
  auto_scale_batch_size: True
  auto_select_gpus: True

Output with error (abbreviated):

...
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/4
...
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/4
Added key: store_based_barrier_key:1 to store for rank: 1
...
Initializing distributed: GLOBAL_RANK: 3, MEMBER: 4/4
Added key: store_based_barrier_key:1 to store for rank: 3
Added key: store_based_barrier_key:1 to store for rank: 0
Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 4 nodes.
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 4 processes
----------------------------------------------------------------------------------------------------
Rank 3: Completed store-based barrier for key:store_based_barrier_key:1 with 4 nodes.
Rank 2: Completed store-based barrier for key:store_based_barrier_key:1 with 4 nodes.
Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 4 nodes.
...
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 2 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 3 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
...
1,161.732 Total estimated model params size (MB)
Sanity Checking: 0it [00:00, ?it/s]/opt/conda/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:208: UserWarning: num_workers>0, persistent_workers=False, and strategy=ddp_spawn may result in data loading bottlenecks. Consider setting persistent_workers=True (this is a limitation of Python .spawn() and PyTorch)
  rank_zero_warn(
Traceback (most recent call last):
  File "/opt/conda/bin/comet-train", line 8, in <module>
    sys.exit(train_command())
  File "/opt/conda/lib/python3.10/site-packages/comet/cli/train.py", line 209, in train_command
    trainer.fit(model)
  File "/opt/conda/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 608, in fit
    call._call_and_handle_interrupt(
  File "/opt/conda/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 36, in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/pytorch_lightning/strategies/launchers/multiprocessing.py", line 113, in launch
    mp.start_processes(
  File "/opt/conda/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 197, in start_processes
    while not context.join():
  File "/opt/conda/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 160, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException: 

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/opt/conda/lib/python3.10/site-packages/pytorch_lightning/strategies/launchers/multiprocessing.py", line 139, in _wrapping_function
    results = function(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 650, in _fit_impl
    self._run(model, ckpt_path=self.ckpt_path)
  File "/opt/conda/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1112, in _run
    results = self._run_stage()
  File "/opt/conda/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1191, in _run_stage
    self._run_train()
  File "/opt/conda/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1204, in _run_train
    self._run_sanity_check()
  File "/opt/conda/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1276, in _run_sanity_check
    val_loop.run()
  File "/opt/conda/lib/python3.10/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
    self.advance(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 152, in advance
    dl_outputs = self.epoch_loop.run(self._data_fetcher, dl_max_batches, kwargs)
  File "/opt/conda/lib/python3.10/site-packages/pytorch_lightning/loops/loop.py", line 194, in run
    self.on_run_start(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 84, in on_run_start
    self._data_fetcher = iter(data_fetcher)
  File "/opt/conda/lib/python3.10/site-packages/pytorch_lightning/utilities/fetching.py", line 178, in __iter__
    self.dataloader_iter = iter(self.dataloader)
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 441, in __iter__
    return self._get_iterator()
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 388, in _get_iterator
    return _MultiProcessingDataLoaderIter(self)
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1042, in __init__
    w.start()
  File "/opt/conda/lib/python3.10/multiprocessing/process.py", line 121, in start
    self._popen = self._Popen(self)
  File "/opt/conda/lib/python3.10/multiprocessing/context.py", line 224, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "/opt/conda/lib/python3.10/multiprocessing/context.py", line 288, in _Popen
    return Popen(process_obj)
  File "/opt/conda/lib/python3.10/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/opt/conda/lib/python3.10/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/opt/conda/lib/python3.10/multiprocessing/popen_spawn_posix.py", line 47, in _launch
    reduction.dump(process_obj, fp)
  File "/opt/conda/lib/python3.10/multiprocessing/reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
AttributeError: Can't pickle local object 'CometModel.val_dataloader.<locals>.<lambda>'

I'm using NVIDIA A10G GPUs and the following software versions:

  • Python - 3.10.9
  • COMET - upstream
  • torch - 2.0.1
  • pytorch-lightning - 1.9.5
  • transformers - 4.29.0
  • numpy - 1.24.3
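
For context on the traceback above: the ddp_spawn strategy uses Python's spawn start method, so the DataLoader (including its collate_fn) has to be pickled when worker processes start, and a lambda defined inside CometModel.val_dataloader cannot be pickled. Below is a minimal sketch of the usual workaround, using a hypothetical module-level collate function rather than COMET's actual code:

from functools import partial
from torch.utils.data import DataLoader

def collate(batch, pad_id):
    # Module-level functions (and partials of them) pickle fine;
    # lambdas and closures defined inside methods do not.
    return batch  # placeholder collation for this sketch

def make_val_loader(dataset, pad_id=0):
    # Fails under spawn: collate_fn=lambda batch: collate(batch, pad_id)
    # Works under spawn: bind the extra argument with functools.partial.
    return DataLoader(dataset, batch_size=8, num_workers=2,
                      collate_fn=partial(collate, pad_id=pad_id))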
zouharvi added the bug label Aug 5, 2023
@maxiek0071

Hi all, I'd like to confirm this: I have the same issue with the software stack above.

@BramVanroy
Contributor

Hey @zouharvi @maxiek0071. Can you try the linked PR and let me know if that works (if it does not, post the error trace)?

You can install it like this:

python -m pip install git+https://github.com/Unbabel/COMET.git@refs/pull/160/head

@maxiek0071

Hi @BramVanroy,

I installed COMET from the branch you specified, and now I'm getting a similar error, this time for EvaluationLoop.


Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/4
Added key: store_based_barrier_key:1 to store for rank: 1
Initializing distributed: GLOBAL_RANK: 2, MEMBER: 3/4
Added key: store_based_barrier_key:1 to store for rank: 2
Initializing distributed: GLOBAL_RANK: 3, MEMBER: 4/4
Added key: store_based_barrier_key:1 to store for rank: 3
Added key: store_based_barrier_key:1 to store for rank: 0
Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 4 nodes.
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 4 processes
----------------------------------------------------------------------------------------------------

Rank 3: Completed store-based barrier for key:store_based_barrier_key:1 with 4 nodes.
Rank 2: Completed store-based barrier for key:store_based_barrier_key:1 with 4 nodes.
Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 4 nodes.
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
LOCAL_RANK: 2 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
LOCAL_RANK: 3 - CUDA_VISIBLE_DEVICES: [0,1,2,3]

  | Name                | Type               | Params
-----------------------------------------------------------
0 | encoder             | XLMREncoder        | 558 M
1 | layerwise_attention | LayerwiseAttention | 26
2 | train_metrics       | RegressionMetrics  | 0
3 | val_metrics         | ModuleList         | 0
4 | estimator           | FeedForward        | 10.5 M
-----------------------------------------------------------
10.5 M    Trainable params
558 M     Non-trainable params
569 M     Total params
1,138.661 Total estimated model params size (MB)
Sanity Checking: 0it [00:00, ?it/s]/home/ubuntu/venv-comet-3.10/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:208: UserWarning: num_workers>0, persistent_workers=False, and strategy=ddp_spawn may result in data loading bottlenecks. Consider setting persistent_workers=True (this is a limitation of Python .spawn() and PyTorch)
  rank_zero_warn(
[W CudaIPCTypes.cpp:15] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]
[W CudaIPCTypes.cpp:15] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]
[W CudaIPCTypes.cpp:15] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]
[W CudaIPCTypes.cpp:15] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]
Traceback (most recent call last):
  File "/home/ubuntu/venv-comet-3.10/bin/comet-train", line 8, in <module>
    sys.exit(train_command())
  File "/home/ubuntu/venv-comet-3.10/lib/python3.10/site-packages/comet/cli/train.py", line 192, in train_command
    trainer.fit(model)
  File "/home/ubuntu/venv-comet-3.10/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 608, in fit
    call._call_and_handle_interrupt(
  File "/home/ubuntu/venv-comet-3.10/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 36, in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
  File "/home/ubuntu/venv-comet-3.10/lib/python3.10/site-packages/pytorch_lightning/strategies/launchers/multiprocessing.py", line 113, in launch
    mp.start_processes(
  File "/home/ubuntu/venv-comet-3.10/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
    while not context.join():
  File "/home/ubuntu/venv-comet-3.10/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 160, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/home/ubuntu/venv-comet-3.10/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/home/ubuntu/venv-comet-3.10/lib/python3.10/site-packages/pytorch_lightning/strategies/launchers/multiprocessing.py", line 139, in _wrapping_function
    results = function(*args, **kwargs)
  File "/home/ubuntu/venv-comet-3.10/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 650, in _fit_impl
    self._run(model, ckpt_path=self.ckpt_path)
  File "/home/ubuntu/venv-comet-3.10/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1112, in _run
    results = self._run_stage()
  File "/home/ubuntu/venv-comet-3.10/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1191, in _run_stage
    self._run_train()
  File "/home/ubuntu/venv-comet-3.10/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1204, in _run_train
    self._run_sanity_check()
  File "/home/ubuntu/venv-comet-3.10/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1276, in _run_sanity_check
    val_loop.run()
  File "/home/ubuntu/venv-comet-3.10/lib/python3.10/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
    self.advance(*args, **kwargs)
  File "/home/ubuntu/venv-comet-3.10/lib/python3.10/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 152, in advance
    dl_outputs = self.epoch_loop.run(self._data_fetcher, dl_max_batches, kwargs)
  File "/home/ubuntu/venv-comet-3.10/lib/python3.10/site-packages/pytorch_lightning/loops/loop.py", line 194, in run
    self.on_run_start(*args, **kwargs)
  File "/home/ubuntu/venv-comet-3.10/lib/python3.10/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 84, in on_run_start
    self._data_fetcher = iter(data_fetcher)
  File "/home/ubuntu/venv-comet-3.10/lib/python3.10/site-packages/pytorch_lightning/utilities/fetching.py", line 178, in __iter__
    self.dataloader_iter = iter(self.dataloader)
  File "/home/ubuntu/venv-comet-3.10/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 435, in __iter__
    return self._get_iterator()
  File "/home/ubuntu/venv-comet-3.10/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 381, in _get_iterator
    return _MultiProcessingDataLoaderIter(self)
  File "/home/ubuntu/venv-comet-3.10/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1034, in __init__
    w.start()
  File "/usr/local/lib/python3.10/multiprocessing/process.py", line 121, in start
    self._popen = self._Popen(self)
  File "/usr/local/lib/python3.10/multiprocessing/context.py", line 224, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "/usr/local/lib/python3.10/multiprocessing/context.py", line 288, in _Popen
    return Popen(process_obj)
  File "/usr/local/lib/python3.10/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/usr/local/lib/python3.10/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/usr/local/lib/python3.10/multiprocessing/popen_spawn_posix.py", line 47, in _launch
    reduction.dump(process_obj, fp)
  File "/usr/local/lib/python3.10/multiprocessing/reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
AttributeError: Can't pickle local object 'EvaluationLoop.advance.<locals>.batch_to_device'

I suppose further adjustments are necessary.
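
For reference, this is the same failure mode as before, only it now hits a local function inside Lightning's EvaluationLoop.advance instead of a lambda in COMET: under the spawn start method, anything handed to a new worker process must be picklable, and locally defined functions are not. A tiny standalone demo (not Lightning code) reproduces the same AttributeError pattern:

import pickle

def outer():
    def batch_to_device(batch):  # nested, like EvaluationLoop.advance.<locals>
        return batch
    return batch_to_device

try:
    pickle.dumps(outer())
except AttributeError as err:
    # e.g. "Can't pickle local object 'outer.<locals>.batch_to_device'"
    print(err)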

@BramVanroy
Contributor

BramVanroy commented Aug 8, 2023

@maxiek0071 I've been looking at this over lunch and made some progress, but not enough, I believe. I do not currently have the time/patience to dig deeper into the idiosyncrasies of PyTorch Lightning (where the issue lies), but I've written up what the issue is and what someone with more experience can do to fix it in this PR: #160 (comment)

So perhaps if you share that PR with your network, other people may chime in and we can quickly solve it. But I cannot dig deeper into this for now, sorry! Maybe @ricardorei has some ideas.

@maxiek0071

Thanks @BramVanroy for your help, I appreciate it! I will first evaluate how not fine-tuning the encoder impacts QE quality (#158). If the speed stays at 13.98 it/s throughout training, it takes me about 12-15 hours for 3-4 epochs.

Could @ricardorei confirm that they have run comet-train on multiple GPUs? Which Python environment and CUDA version did they use?

@ricardorei
Collaborator

Hi all! I'll look into this today.

I had this fixed before, but PyTorch Lightning likes to change things. Maybe it's just a quick fix... Like Bram said in his PR, I think the problem is with torchmetrics.

ricardorei pushed a commit that referenced this issue Sep 18, 2023
@ricardorei
Collaborator

I updated lightning and metrics and tested multi-GPU training, and it was working. I used strategy: ddp and devices: 2 and everything went well.

Please give it a try.
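
For anyone adapting an existing training config, the relevant trainer section would look roughly like this (an illustrative sketch based on the settings described above, not a file taken from the repo):

init_args:
  accelerator: gpu
  strategy: ddp
  devices: 2

Plain ddp launches worker processes via subprocess rather than spawn, which also avoids the spawn-time pickling that failed in the traces above.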

@ricardorei
Collaborator

Use the latest version, 2.1.0.

@maxiek0071

Hi @ricardorei, I have just checked with this version, and I can execute training on multiple GPUs.
Thank you for your help!

@zouharvi
Author

Thanks, @ricardorei 🙂
