add zero optimizer parallel #593

CaitinZhao · 2024-07-12T06:20:34Z

What does this PR do?

Implements an optimizer-parallel algorithm under MindSpore's data-parallel mode, following DeepSpeed's ZeRO.

Fixes # (issue)

Adds # (feature)

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you read the contributor guideline?
Did you make sure to update the documentation with your changes? E.g. record bug fixes or new features in What's New. Here are the documentation guidelines.
Did you build and run the code without any errors?
Did you report the running environment (NPU type/MS version) and performance in the doc? (better record it for data loading, model inference, or training tasks)
Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.

@xxx

SamitHuang · 2024-07-13T09:05:15Z

tests/st/test_zero.py

+        print(f"rank_id: {rank_id}, group_size: {group_size}")
+        ms.reset_auto_parallel_context()
+        ms.set_auto_parallel_context(
+            parallel_mode=ms.ParallelMode.DATA_PARALLEL,


How does this differ from ParallelMode.SEMI_AUTO_PARALLEL?

This is a new implementation of ZeRO optimizer parallelism that runs in DATA_PARALLEL mode, modeled on DeepSpeed. It does not use MindSpore's automatic parallel process.
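For readers unfamiliar with the idea, here is a minimal sketch of ZeRO-style optimizer sharding on top of data parallelism. It is a generic illustration in plain Python, not the code in this PR: the function names (`shard_bounds`, `reduce_scatter_mean`, `zero_step`), the momentum-SGD update, and the list-based "ranks" are all assumptions made for the example. Each rank keeps only its slice of the optimizer state, receives the averaged gradient for that slice, updates its slice of the weights, and the slices are then gathered back into the full parameter list.

```python
# Illustrative only: simulates ZeRO-style optimizer sharding across
# data-parallel ranks with plain Python lists. All names are hypothetical.

def shard_bounds(n, world_size, rank):
    """Return [start, end) of the slice owned by `rank` (n divisible by world_size)."""
    size = n // world_size
    return rank * size, (rank + 1) * size


def reduce_scatter_mean(per_rank_grads, rank):
    """Average gradients over all ranks and keep only this rank's shard."""
    world_size = len(per_rank_grads)
    start, end = shard_bounds(len(per_rank_grads[0]), world_size, rank)
    return [sum(g[i] for g in per_rank_grads) / world_size for i in range(start, end)]


def zero_step(params, per_rank_grads, momenta, lr=0.1, beta=0.9):
    """One momentum-SGD step; momenta[r] holds only rank r's shard of the state."""
    world_size = len(per_rank_grads)
    updated_shards = []
    for rank in range(world_size):
        start, end = shard_bounds(len(params), world_size, rank)
        grad_shard = reduce_scatter_mean(per_rank_grads, rank)
        # Each rank updates only the optimizer state it owns.
        momenta[rank] = [beta * m + g for m, g in zip(momenta[rank], grad_shard)]
        updated_shards.append(
            [p - lr * m for p, m in zip(params[start:end], momenta[rank])]
        )
    # All-gather: every rank ends up with the full updated parameter list.
    return [p for shard in updated_shards for p in shard]


params = [1.0, 2.0, 3.0, 4.0]
per_rank_grads = [[0.2, 0.4, 0.6, 0.8], [0.6, 0.8, 1.0, 1.2]]  # one list per rank
momenta = [[0.0, 0.0], [0.0, 0.0]]  # each rank stores half of the optimizer state
new_params = zero_step(params, per_rank_grads, momenta)
```

The point of the sketch is the memory layout: with two ranks, each holds two momentum entries instead of four, while the resulting weights match an unsharded update on the averaged gradient.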

SamitHuang

Suggestion: add a script for merging checkpoints.
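As a rough idea of what such a script might do, the sketch below merges per-rank shards back into full parameters. It assumes that each rank saved a contiguous slice of every sharded parameter and that slices are concatenated in rank order. The checkpoint layout actually produced by this PR is not shown in this thread, so `merge_shards` and the dictionary format are hypothetical.

```python
# Illustrative only: assumes each rank's checkpoint maps parameter name to a
# contiguous shard (a flat list here), split in rank order.

def merge_shards(rank_checkpoints):
    """Concatenate every parameter's shards in rank order."""
    merged = {}
    for name in rank_checkpoints[0]:
        merged[name] = [v for ckpt in rank_checkpoints for v in ckpt[name]]
    return merged


rank_checkpoints = [
    {"dense.weight": [0.1, 0.2], "dense.bias": [1.0]},  # saved by rank 0
    {"dense.weight": [0.3, 0.4], "dense.bias": [2.0]},  # saved by rank 1
]
merged = merge_shards(rank_checkpoints)
```

A real script would load each rank's checkpoint file, concatenate tensors along the split axis, and write a single merged checkpoint.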

zhtmike · 2024-09-03T07:11:27Z

mindone/trainers/train_step.py

@@ -104,6 +115,10 @@ def construct(self, *inputs):

        # 1. compute gradients (of the up-scaled loss w.r.t. the model weights)
        grads = self.grad(self.network, weights)(*inputs, scaling_sens_filled)
+
+        # Gradient communication
+        grads = self.zero_helper.cal_gradients(grads)


self.zero_helper can be None, so this call needs an if condition to guard it.
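A minimal sketch of the guard, written as plain Python so it can run outside MindSpore. `TrainStepSketch`, `communicate_gradients`, and `HalvingHelper` are hypothetical stand-ins; only `zero_helper` and `cal_gradients` come from the diff above.

```python
class TrainStepSketch:
    """Hypothetical stand-in for the train step; only the guard is of interest."""

    def __init__(self, zero_helper=None):
        self.zero_helper = zero_helper  # None when ZeRO is disabled

    def communicate_gradients(self, grads):
        # Only route gradients through the helper when one was configured.
        if self.zero_helper is not None:
            grads = self.zero_helper.cal_gradients(grads)
        return grads


class HalvingHelper:
    """Hypothetical helper used only to exercise the guarded branch."""

    def cal_gradients(self, grads):
        return tuple(g / 2 for g in grads)


plain = TrainStepSketch().communicate_gradients((2.0, 4.0))
sharded = TrainStepSketch(HalvingHelper()).communicate_gradients((2.0, 4.0))
```

Without a helper the gradients pass through unchanged, so the non-ZeRO path keeps working.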

CaitinZhao requested a review from vigo999 as a code owner July 12, 2024 06:20

CaitinZhao requested review from SamitHuang, zhanghuiyao and geniuspatrick July 12, 2024 09:38

SamitHuang reviewed Jul 13, 2024

CaitinZhao force-pushed the master branch 8 times, most recently from 0e46851 to c0a8757 Compare July 17, 2024 03:36

zhtmike mentioned this pull request Aug 6, 2024

Os v1.2 / VAE support SP #604

Merged

6 tasks

CaitinZhao force-pushed the master branch 2 times, most recently from 2755198 to dc2d467 Compare August 23, 2024 01:34

CaitinZhao force-pushed the master branch 2 times, most recently from e4978fe to 69cce29 Compare September 3, 2024 03:57

SamitHuang approved these changes Sep 3, 2024

zhtmike reviewed Sep 3, 2024

CaitinZhao force-pushed the master branch 2 times, most recently from 60c7aaa to a19b806 Compare September 3, 2024 08:25

zhaoting added 8 commits September 10, 2024 09:34

add zero optimizer parallel

9b6b930

code check

c45b74d

add some comments

a317882

update

516731c

add some info

73d3636

zero helper

d7e3044

bug fix

09c7674

reconstruct

b7d0ba9

zhaoting added 7 commits September 10, 2024 09:34

comm fusion

5d02f55

ema update

bc170db

update

bbbcee9

update

1ff69d7

update

9b628d2

update

4c0cee8

checkpoint merging

6ec819f

CaitinZhao force-pushed the master branch from a19b806 to 6ec819f Compare September 10, 2024 01:34

zhaoting added 2 commits September 10, 2024 10:46

fix bug

ff0a5fb

fix bug

f39a952

CaitinZhao force-pushed the master branch 2 times, most recently from d9b877e to bb21538 Compare September 10, 2024 09:07

fix bug

ec002ae

CaitinZhao force-pushed the master branch from bb21538 to ec002ae Compare September 10, 2024 09:12

vigo999 approved these changes Sep 11, 2024

vigo999 added this pull request to the merge queue Sep 11, 2024

Merged via the queue into mindspore-lab:master with commit 5831703 Sep 11, 2024
3 checks passed
