Multi-GPU training #159

Closed
zouharvi opened this issue Aug 5, 2023 · 10 comments
Labels
bug Something isn't working

Comments

@zouharvi

zouharvi commented Aug 5, 2023

I am attempting to run comet-train with multiple GPUs.

Command (abbreviated):

CUDA_VISIBLE_DEVICES=0,1,2,3 comet-train ...

Config (abbreviated):

init_args:
  accelerator: gpu
  devices: 4
  auto_scale_batch_size: True
  auto_select_gpus: True

Output with error (abbreviated):

...
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/4
...
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/4
Added key: store_based_barrier_key:1 to store for rank: 1
...
Initializing distributed: GLOBAL_RANK: 3, MEMBER: 4/4
Added key: store_based_barrier_key:1 to store for rank: 3
Added key: store_based_barrier_key:1 to store for rank: 0
Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 4 nodes.
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 4 processes
----------------------------------------------------------------------------------------------------
Rank 3: Completed store-based barrier for key:store_based_barrier_key:1 with 4 nodes.
Rank 2: Completed store-based barrier for key:store_based_barrier_key:1 with 4 nodes.
Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 4 nodes.
...
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 2 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 3 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
...
1,161.732 Total estimated model params size (MB)
Sanity Checking: 0it [00:00, ?it/s]/opt/conda/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:208: UserWarning: num_workers>0, persistent_workers=False, and strategy=ddp_spawn may result in data loading bottlenecks. Consider setting persistent_workers=True (this is a limitation of Python .spawn() and PyTorch)
  rank_zero_warn(
Traceback (most recent call last):
  File "/opt/conda/bin/comet-train", line 8, in <module>
    sys.exit(train_command())
  File "/opt/conda/lib/python3.10/site-packages/comet/cli/train.py", line 209, in train_command
    trainer.fit(model)
  File "/opt/conda/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 608, in fit
    call._call_and_handle_interrupt(
  File "/opt/conda/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 36, in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/pytorch_lightning/strategies/launchers/multiprocessing.py", line 113, in launch
    mp.start_processes(
  File "/opt/conda/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 197, in start_processes
    while not context.join():
  File "/opt/conda/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 160, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException: 

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/opt/conda/lib/python3.10/site-packages/pytorch_lightning/strategies/launchers/multiprocessing.py", line 139, in _wrapping_function
    results = function(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 650, in _fit_impl
    self._run(model, ckpt_path=self.ckpt_path)
  File "/opt/conda/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1112, in _run
    results = self._run_stage()
  File "/opt/conda/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1191, in _run_stage
    self._run_train()
  File "/opt/conda/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1204, in _run_train
    self._run_sanity_check()
  File "/opt/conda/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1276, in _run_sanity_check
    val_loop.run()
  File "/opt/conda/lib/python3.10/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
    self.advance(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 152, in advance
    dl_outputs = self.epoch_loop.run(self._data_fetcher, dl_max_batches, kwargs)
  File "/opt/conda/lib/python3.10/site-packages/pytorch_lightning/loops/loop.py", line 194, in run
    self.on_run_start(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 84, in on_run_start
    self._data_fetcher = iter(data_fetcher)
  File "/opt/conda/lib/python3.10/site-packages/pytorch_lightning/utilities/fetching.py", line 178, in __iter__
    self.dataloader_iter = iter(self.dataloader)
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 441, in __iter__
    return self._get_iterator()
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 388, in _get_iterator
    return _MultiProcessingDataLoaderIter(self)
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1042, in __init__
    w.start()
  File "/opt/conda/lib/python3.10/multiprocessing/process.py", line 121, in start
    self._popen = self._Popen(self)
  File "/opt/conda/lib/python3.10/multiprocessing/context.py", line 224, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "/opt/conda/lib/python3.10/multiprocessing/context.py", line 288, in _Popen
    return Popen(process_obj)
  File "/opt/conda/lib/python3.10/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/opt/conda/lib/python3.10/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/opt/conda/lib/python3.10/multiprocessing/popen_spawn_posix.py", line 47, in _launch
    reduction.dump(process_obj, fp)
  File "/opt/conda/lib/python3.10/multiprocessing/reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
AttributeError: Can't pickle local object 'CometModel.val_dataloader.<locals>.<lambda>'

I'm using NVIDIA A10G GPUs and the following software versions:

  • Python - 3.10.9
  • COMET - upstream
  • torch - 2.0.1
  • pytorch-lightning - 1.9.5
  • transformers - 4.29.0
  • numpy - 1.24.3
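
For context on the traceback above: the ddp_spawn strategy uses Python's spawn start method, so the DataLoader (including its collate_fn) has to be pickled when worker processes start, and a lambda defined inside CometModel.val_dataloader cannot be pickled. Below is a minimal sketch of the usual workaround, using a hypothetical module-level collate function rather than COMET's actual code:

from functools import partial
from torch.utils.data import DataLoader

def collate(batch, pad_id):
    # Module-level functions (and partials of them) pickle fine;
    # lambdas and closures defined inside methods do not.
    return batch  # placeholder collation for this sketch

def make_val_loader(dataset, pad_id=0):
    # Fails under spawn: collate_fn=lambda batch: collate(batch, pad_id)
    # Works under spawn: bind the extra argument with functools.partial.
    return DataLoader(dataset, batch_size=8, num_workers=2,
                      collate_fn=partial(collate, pad_id=pad_id))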
zouharvi added the bug label Aug 5, 2023
@maxiek0071

Hi all, I'd like to confirm this: I have the same issue with the software stack above.

@BramVanroy
Contributor

Hey @zouharvi @maxiek0071. Can you try the linked PR and let me know if that works (if it does not, post the error trace)?

You can install it like this:

python -m pip install git+https://github.com/Unbabel/COMET.git@refs/pull/160/head

@maxiek0071

Hi @BramVanroy,

I installed COMET from the branch you specified, and now I'm getting a similar error, this time for EvaluationLoop.


Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/4
Added key: store_based_barrier_key:1 to store for rank: 1
Initializing distributed: GLOBAL_RANK: 2, MEMBER: 3/4
Added key: store_based_barrier_key:1 to store for rank: 2
Initializing distributed: GLOBAL_RANK: 3, MEMBER: 4/4
Added key: store_based_barrier_key:1 to store for rank: 3
Added key: store_based_barrier_key:1 to store for rank: 0
Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 4 nodes.
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 4 processes
----------------------------------------------------------------------------------------------------

Rank 3: Completed store-based barrier for key:store_based_barrier_key:1 with 4 nodes.
Rank 2: Completed store-based barrier for key:store_based_barrier_key:1 with 4 nodes.
Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 4 nodes.
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
LOCAL_RANK: 2 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
LOCAL_RANK: 3 - CUDA_VISIBLE_DEVICES: [0,1,2,3]

  | Name                | Type               | Params
-----------------------------------------------------------
0 | encoder             | XLMREncoder        | 558 M
1 | layerwise_attention | LayerwiseAttention | 26
2 | train_metrics       | RegressionMetrics  | 0
3 | val_metrics         | ModuleList         | 0
4 | estimator           | FeedForward        | 10.5 M
-----------------------------------------------------------
10.5 M    Trainable params
558 M     Non-trainable params
569 M     Total params
1,138.661 Total estimated model params size (MB)
Sanity Checking: 0it [00:00, ?it/s]/home/ubuntu/venv-comet-3.10/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:208: UserWarning: num_workers>0, persistent_workers=False, and strategy=ddp_spawn may result in data loading bottlenecks. Consider setting persistent_workers=True (this is a limitation of Python .spawn() and PyTorch)
  rank_zero_warn(
[W CudaIPCTypes.cpp:15] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]
[W CudaIPCTypes.cpp:15] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]
[W CudaIPCTypes.cpp:15] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]
[W CudaIPCTypes.cpp:15] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]
Traceback (most recent call last):
  File "/home/ubuntu/venv-comet-3.10/bin/comet-train", line 8, in <module>
    sys.exit(train_command())
  File "/home/ubuntu/venv-comet-3.10/lib/python3.10/site-packages/comet/cli/train.py", line 192, in train_command
    trainer.fit(model)
  File "/home/ubuntu/venv-comet-3.10/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 608, in fit
    call._call_and_handle_interrupt(
  File "/home/ubuntu/venv-comet-3.10/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 36, in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
  File "/home/ubuntu/venv-comet-3.10/lib/python3.10/site-packages/pytorch_lightning/strategies/launchers/multiprocessing.py", line 113, in launch
    mp.start_processes(
  File "/home/ubuntu/venv-comet-3.10/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
    while not context.join():
  File "/home/ubuntu/venv-comet-3.10/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 160, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/home/ubuntu/venv-comet-3.10/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/home/ubuntu/venv-comet-3.10/lib/python3.10/site-packages/pytorch_lightning/strategies/launchers/multiprocessing.py", line 139, in _wrapping_function
    results = function(*args, **kwargs)
  File "/home/ubuntu/venv-comet-3.10/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 650, in _fit_impl
    self._run(model, ckpt_path=self.ckpt_path)
  File "/home/ubuntu/venv-comet-3.10/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1112, in _run
    results = self._run_stage()
  File "/home/ubuntu/venv-comet-3.10/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1191, in _run_stage
    self._run_train()
  File "/home/ubuntu/venv-comet-3.10/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1204, in _run_train
    self._run_sanity_check()
  File "/home/ubuntu/venv-comet-3.10/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1276, in _run_sanity_check
    val_loop.run()
  File "/home/ubuntu/venv-comet-3.10/lib/python3.10/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
    self.advance(*args, **kwargs)
  File "/home/ubuntu/venv-comet-3.10/lib/python3.10/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 152, in advance
    dl_outputs = self.epoch_loop.run(self._data_fetcher, dl_max_batches, kwargs)
  File "/home/ubuntu/venv-comet-3.10/lib/python3.10/site-packages/pytorch_lightning/loops/loop.py", line 194, in run
    self.on_run_start(*args, **kwargs)
  File "/home/ubuntu/venv-comet-3.10/lib/python3.10/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 84, in on_run_start
    self._data_fetcher = iter(data_fetcher)
  File "/home/ubuntu/venv-comet-3.10/lib/python3.10/site-packages/pytorch_lightning/utilities/fetching.py", line 178, in __iter__
    self.dataloader_iter = iter(self.dataloader)
  File "/home/ubuntu/venv-comet-3.10/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 435, in __iter__
    return self._get_iterator()
  File "/home/ubuntu/venv-comet-3.10/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 381, in _get_iterator
    return _MultiProcessingDataLoaderIter(self)
  File "/home/ubuntu/venv-comet-3.10/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1034, in __init__
    w.start()
  File "/usr/local/lib/python3.10/multiprocessing/process.py", line 121, in start
    self._popen = self._Popen(self)
  File "/usr/local/lib/python3.10/multiprocessing/context.py", line 224, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "/usr/local/lib/python3.10/multiprocessing/context.py", line 288, in _Popen
    return Popen(process_obj)
  File "/usr/local/lib/python3.10/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/usr/local/lib/python3.10/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/usr/local/lib/python3.10/multiprocessing/popen_spawn_posix.py", line 47, in _launch
    reduction.dump(process_obj, fp)
  File "/usr/local/lib/python3.10/multiprocessing/reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
AttributeError: Can't pickle local object 'EvaluationLoop.advance.<locals>.batch_to_device'

I suppose further adjustments are necessary.
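
For reference, this is the same failure mode as before, only it now hits a local function inside Lightning's EvaluationLoop.advance instead of a lambda in COMET: under the spawn start method, anything handed to a new worker process must be picklable, and locally defined functions are not. A tiny standalone demo (not Lightning code) reproduces the same AttributeError pattern:

import pickle

def outer():
    def batch_to_device(batch):  # nested, like EvaluationLoop.advance.<locals>
        return batch
    return batch_to_device

try:
    pickle.dumps(outer())
except AttributeError as err:
    # e.g. "Can't pickle local object 'outer.<locals>.batch_to_device'"
    print(err)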

@BramVanroy
Contributor

BramVanroy commented Aug 8, 2023

@maxiek0071 I've been looking at this over lunch and made some progress, but not enough, I believe. I do not currently have the time/patience to dig deeper into the idiosyncrasies of PyTorch Lightning (where the issue lies), but I've written up what the issue is and what someone with more experience can do to fix it in this PR: #160 (comment)

So perhaps if you share that PR with your network, other people may chime in and we can quickly solve it. But I cannot dig deeper into this for now, sorry! Maybe @ricardorei has some ideas.

@maxiek0071

Thanks @BramVanroy for your help, I appreciate it! I will first evaluate how not fine-tuning the encoder impacts QE quality (#158). If the speed stays at 13.98 it/s throughout training, it takes me about 12-15 hours for 3-4 epochs.

Could @ricardorei confirm that they have run comet-train on multiple GPUs? Which Python environment and CUDA version did they use?

@ricardorei
Collaborator

Hi all! I'll look into this today.

I had this fixed before, but PyTorch Lightning likes to change things. Maybe it's just a quick fix... Like Bram said in his PR, I think the problem is with torchmetrics.

ricardorei pushed a commit that referenced this issue Sep 18, 2023
@ricardorei
Collaborator

I updated lightning and metrics and tested multi-GPU training, and it was working. I used strategy: ddp and devices: 2 and everything went well.

Please give it a try.
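
For anyone adapting an existing training config, the relevant trainer section would look roughly like this (an illustrative sketch based on the settings described above, not a file taken from the repo):

init_args:
  accelerator: gpu
  strategy: ddp
  devices: 2

Plain ddp launches worker processes via subprocess rather than spawn, which also avoids the spawn-time pickling that failed in the traces above.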

@ricardorei
Collaborator

Use the latest version, 2.1.0.

@maxiek0071

Hi @ricardorei, I have just checked with this version, and I can execute training on multiple GPUs.
Thank you for your help!

@zouharvi
Author

Thanks, @ricardorei 🙂
