Optimize zero3 fetch params using all_reduce #5420
Conversation
deepcharm commented on Apr 16, 2024
- Use all_reduce instead of all_gather to fetch module parameters. This improves performance by removing the overhead of concatenation and slicing, which are no longer required.
- Instead, all tensor views are created prior to the collective (all_reduce), so upon its completion only the parameter status is updated.
- The behavior is enabled via a new boolean flag under the "zero_optimization" section: { "stage3_use_all_reduce_for_fetch_params": true } (see the config sketch after this list).
- By default the optimization is not enabled.
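For reference, a minimal sketch of a config enabling the flag introduced by this PR. Only `stage3_use_all_reduce_for_fetch_params` comes from the PR; the surrounding fields are illustrative placeholders, not a recommended setup.

```python
# Illustrative ZeRO-3 config dict; only the new flag is from this PR.
ds_config = {
    "train_batch_size": 8,          # placeholder value
    "zero_optimization": {
        "stage": 3,
        "stage3_use_all_reduce_for_fetch_params": True,  # new flag (off by default)
    },
}
# Typically passed to deepspeed.initialize(..., config=ds_config).
```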
@deepcharm, thanks for this interesting approach. Can you share some observed performance gains?
@tjruwase We have observed around a 9% performance gain on HPU in BERT workloads.
Hi @deepcharm, thanks for the PR. Just curious why allreduce could be faster than allgather? Allreduce is basically reduce-scatter + all-gather. Could we instead make the allgather coalesced to remove the overhead of concatenation and slicing?
Hi @GuanhuaWang, you're right, the proposed approach does add some communication overhead. The main idea is to re-arrange the layout of the sharded pieces in the flat buffer to achieve an overall perf boost. Hopefully the attached slides help clarify the benefits (less host-side overhead, smaller memory peak, etc.).

[Slides: 1) Current Approach, 2) Proposed Optimization, 3) Comparison]
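To make the layout trick concrete, here is a minimal conceptual sketch (not the actual DeepSpeed implementation; the function name and sizing are illustrative) of fetching a full parameter with all_reduce instead of all_gather:

```python
# Conceptual sketch: each rank writes its shard into its final slot of a
# zero-filled flat buffer, then a SUM all_reduce materializes the full
# parameter in place -- no post-collective concat or per-shard slicing.
import torch
import torch.distributed as dist

def fetch_param_all_reduce(shard: torch.Tensor, full_numel: int,
                           rank: int, world_size: int) -> torch.Tensor:
    flat = torch.zeros(full_numel, dtype=shard.dtype, device=shard.device)
    shard_size = full_numel // world_size
    # Only this rank's slot is non-zero before the collective.
    flat.narrow(0, rank * shard_size, shard.numel()).copy_(shard)
    # Summing across ranks fills every slot exactly once.
    dist.all_reduce(flat, op=dist.ReduceOp.SUM)
    return flat
```

Because every slot is owned by exactly one rank and all others contribute zeros, the sum is just a placement operation; views into `flat` can therefore be created before the collective, and only the parameter status needs updating afterwards.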
Hi @deepcharm, these slides are cool and make sense to me. But in 2) Proposed Optimization, the gain shown comes from removing the unnecessary data concat by avoiding the param interleaving of allgather (not allreduce). Allreduce is what confuses me: we don't do any sum/avg operation on the collected weights, right?
@deepcharm, I was not aware that narrow, cat, and copy operations on device tensors incurred high CPU overhead. I would like to learn more. Can you share the reason? How did you discover this? Can you share some repro/test code for this? Thanks!
@tjruwase, we've seen this phenomenon in large models where looping over many params causes significant CPU overhead.
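No repro was posted in the thread; purely as an illustration, a hypothetical micro-benchmark along these lines (all names and sizes invented here) can surface how per-parameter cat launches accumulate host-side cost:

```python
# Hypothetical micro-benchmark (not the repro referenced above): times the
# host-side loop of one torch.cat per parameter over many small shards.
import time
import torch

def time_cat_loop(num_params=1000, shard_numel=4096, world_size=8,
                  device="cpu"):
    # One list of world_size shards per parameter, as after an all_gather.
    shards = [[torch.randn(shard_numel, device=device)
               for _ in range(world_size)] for _ in range(num_params)]
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    full = [torch.cat(per_param) for per_param in shards]  # one cat per param
    if device == "cuda":
        torch.cuda.synchronize()
    print(f"{num_params} cats took {time.perf_counter() - start:.3f}s")

time_cat_loop(device="cuda" if torch.cuda.is_available() else "cpu")
```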
@deepcharm, very interesting, thanks for the explanation. I look forward to learning more from the repro script. I think it could make great documentation for performance debugging of zero3 on accelerators.
Hi @tjruwase, for some reason the PR has been removed from the merge-queue. Can you please re-add it? Thanks!
Merged. The commit message repeats the PR description above, with:
Co-authored-by: Logan Adams <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>