DDP degrades the performance #8

Open · prote376 opened this issue Aug 4, 2022 · 10 comments

prote376 commented Aug 4, 2022

Thank you for sharing this code!

I am testing your code for multitask video with BART on 24GB GPUs.
To run it on 24GB GPUs, I used the command below to enable DDP (batch size: 50 -> 25):

bash scripts/video/single_adapter.sh 2

However, it showed worse results than running on a single 48GB GPU, and the more GPUs I used, the worse the performance got.
Since the model doesn't have BatchNorm, I expected the performance to be similar.

Have you tried DDP? Or do you have any intuition about the problem?

@luogen1996

I have the same problem. When running this code on 8 x V100 (16GB), I got:
VQA
Epoch 19: Valid Raw 62.55 Topk 62.49
Epoch 19: Best Raw 62.55
GQA
Epoch 19: Valid 51.76
Epoch 18: Best 51.86
NLVR
Epoch 19: Valid 69.21
Epoch 19: Best 69.21
COCO Caption
Epoch 19: Valid CIDEr 111.52
Epoch 19: Best 111.52

ylsung (Owner) commented Sep 26, 2022

Thanks for pointing out the issue. I remember I didn't have this issue when I tried DDP. I will check on this soon.

ylsung (Owner) commented Oct 28, 2022

I just found out that DDP works well with full fine-tuning but works worse with parameter-efficient transfer learning methods. I will further investigate this issue soon.

@luogen1996

If DDP does not work well, can I reproduce the results on a single A100 (40GB) by reducing the batch size? From @prote376's report, that doesn't seem to work well either. How should I address this problem? Thanks!

ylsung (Owner) commented Dec 13, 2022

I think reducing the batch size should work, but the learning rate might need to be reduced accordingly. The performance drop in @prote376's experiments may still come from the multi-GPU problem, not the batch size.
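
For reference, a common heuristic here is the linear scaling rule: scale the learning rate by the same factor as the effective batch size. A minimal sketch of the arithmetic (the base learning rate below is a placeholder, not the value from this repo's configs):

# Linear scaling rule (a heuristic, not a guarantee): scale the learning
# rate proportionally to the effective batch size.
base_batch_size = 50        # batch size from the original single-GPU setup
base_lr = 1e-4              # placeholder; use the repo's configured value
new_batch_size = 25         # e.g. what fits on a single A100 (40GB)
new_lr = base_lr * new_batch_size / base_batch_size   # here: half of base_lr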

@czy-orange

The problem may come from the DDP setup. The PyTorch DDP notes on the backward pass say that "after the backward pass, the grad field on the same corresponding parameter across different DDP processes should be the same"; however, I printed the grads and found that they were not the same across GPUs.

My guess is that gradient synchronization is missing (or something else is causing the desynchronization), and the way the DDP model is used could be the source: the official usage is ddp_model(input), while in this repo it is ddp_model.module.train_step().
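
For anyone who wants to reproduce that check, here is a minimal sketch (the helper name and tolerance are mine, not from the repo) that compares each local gradient against rank 0's copy right after loss.backward():

import torch
import torch.distributed as dist

def check_grad_sync(model):
    # If DDP's backward hooks were active, every rank would hold identical
    # gradients at this point; report the parameters where they differ.
    rank = dist.get_rank()
    for name, p in model.named_parameters():
        if p.grad is None:
            continue
        reference = p.grad.detach().clone()
        dist.broadcast(reference, src=0)  # overwrite with rank 0's gradient
        if not torch.allclose(p.grad, reference, atol=1e-6):
            print(f"[rank {rank}] gradient mismatch on {name}")

Calling this on every rank after loss.backward() should print nothing when gradients are properly synchronized.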

@yushuinanrong

@ylsung Any update regarding this issue?

JieShibo commented Sep 12, 2023

You're right. ddp_model.module.train_step() does not synchronize the gradients. The issue can be solved by manually synchronizing them.
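
The most direct manual fix is to all-reduce and average every gradient between loss.backward() and optimizer.step(). A sketch of that approach (it assumes an initialized process group, and it is an alternative to the patch posted later in this thread, which instead routes the call through DDP's forward):

import torch.distributed as dist

def sync_gradients(model):
    # Average gradients across all processes, mimicking what DDP's
    # backward hooks would normally do automatically.
    world_size = dist.get_world_size()
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad /= world_size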

nbasyl commented Sep 15, 2023

@JieShibo Hi, could you elaborate more on how to manually synchronize the gradient and address the synchronization problem? Thank you so much!

@JieShibo

@nbasyl
For PyTorch 2.0.1, the ddp_forward below is essentially DistributedDataParallel.forward with the inner forward() call replaced by train_step(), so the Reducer still sets up gradient synchronization for the backward pass:

import logging

import torch
from torch.distributed.algorithms.join import (
    Join,
    Joinable,
    JoinHook,
)
from torch.distributed.utils import (
    _verify_param_shape_across_processes,
    _sync_module_states,
    _to_kwargs,
)
from torch.nn.parallel.distributed import _find_tensors, _tree_flatten_with_rref, _DDPSink, _tree_unflatten_with_rref

# Module-level logger used inside ddp_forward (as in torch.nn.parallel.distributed).
logger = logging.getLogger(__name__)

def ddp_forward(self, *inputs, **kwargs):
    with torch.autograd.profiler.record_function(
        "DistributedDataParallel.forward"
    ):
        if torch.is_grad_enabled() and self.require_backward_grad_sync:
            assert self.logger is not None
            self.logger.set_runtime_stats_and_log()
            self.num_iterations += 1
            self.reducer.prepare_for_forward()

        work = Join.notify_join_context(self)
        if work:
            self.reducer._set_forward_pass_work_handle(
                work, self._divide_by_initial_world_size  # type: ignore[arg-type]
            )

        if torch.is_grad_enabled() and self.reducer._rebuild_buckets():
            logger.info(
                "Reducer buckets have been rebuilt in this iteration."
            )
            self._has_rebuilt_buckets = True

        if self._check_sync_bufs_pre_fwd():
            self._sync_buffers()

        if self._join_config.enable:
            self._check_global_requires_backward_grad_sync(
                is_joined_rank=False
            )
        module_to_run = (
            self._replicated_tensor_module
            if self._use_replicated_tensor_module
            else self.module
        )

        if self.device_ids:
            inputs, kwargs = _to_kwargs(
                inputs,
                kwargs,
                self.device_ids[0],
                self.use_side_stream_for_tensor_copies,
            )
            with self._inside_ddp_forward():
                output = module_to_run.train_step(*inputs[0], **kwargs[0])  # type: ignore[index]
        else:
            with self._inside_ddp_forward():
                output = module_to_run.train_step(*inputs, **kwargs)

        if self._check_sync_bufs_post_fwd():
            self._sync_buffers()

        if torch.is_grad_enabled() and self.require_backward_grad_sync:
            self.require_forward_param_sync = True
            if self.find_unused_parameters and not self.static_graph:
                self.reducer.prepare_for_backward(
                    list(_find_tensors(output))
                )
            else:
                self.reducer.prepare_for_backward([])
        else:
            self.require_forward_param_sync = False

    if (self.find_unused_parameters and not self.static_graph) or (
        self.static_graph and self.num_iterations == 1
    ):
        state_dict = {
            "static_graph": self.static_graph,
            "num_iterations": self.num_iterations,
        }

        (
            output_tensor_list,
            treespec,
            output_is_rref,
        ) = _tree_flatten_with_rref(output)
        output_placeholders = [None for _ in range(len(output_tensor_list))]
        for i, output in enumerate(output_tensor_list):
            if torch.is_tensor(output) and output.grad_fn is None:
                output_placeholders[i] = output

        passthrough_tensor_list = _DDPSink.apply(
            self.reducer,
            state_dict,
            *output_tensor_list,
        )
        for i in range(len(output_placeholders)):
            if output_placeholders[i] is None:
                output_placeholders[i] = passthrough_tensor_list[i]

        output = _tree_unflatten_with_rref(
            output_placeholders, treespec, output_is_rref
        )
    return output

if self.args.fp16 and _use_native_amp:
    with autocast():
        if self.args.distributed:
            results = self.model.module.train_step(batch)
        else:
            results = self.model.train_step(batch)
else:
    if self.args.distributed:
        results = self.model.module.train_step(batch)
    else:
        results = self.model.train_step(batch)

Finally, replace the self.model.module.train_step(batch) calls with ddp_forward(self.model, batch).
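
Applied to the trainer snippet above, the distributed branches would then look roughly like this (just the stated substitution, everything else unchanged):

if self.args.fp16 and _use_native_amp:
    with autocast():
        if self.args.distributed:
            # Route through the patched forward so the Reducer registers
            # gradient synchronization for this training step.
            results = ddp_forward(self.model, batch)
        else:
            results = self.model.train_step(batch)
else:
    if self.args.distributed:
        results = ddp_forward(self.model, batch)
    else:
        results = self.model.train_step(batch)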
