Distributed SaveLoad implementation for semi-auto strategy #59659
Conversation
Your PR has been submitted successfully. Thank you for your contribution to the open-source project!
paddle.distributed.get_world_size() > 1 or coordinator_rank != 0
):
    raise ValueError(
        f"use_dist is False, please set coordinator_rank to 0 and paddle.distributed.get_world_size() to 1, world_size:{paddle.distributed.get_world_size()}, coordinator_rank:{coordinator_rank}"
Why not allow use_dist=false and world_size > 1?
use_dist is aimed at the single-card case, but it probably doesn't need to be user-specified; internally it can be determined as use_dist = True if world_size > 1 else False. save_state_dict is designed to export the model under the distributed strategy of the current training run: if training is distributed it exports a distributed checkpoint, and if single-card it exports a single-card one. Exporting a single-card model directly from a distributed run is not supported; to get one, first define the single-card model, load it with load_state_dict, and then export it with save_state_dict.
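For illustration, a minimal sketch of the internal determination described above (a hypothetical helper, not the PR's actual code):

```python
import paddle

def _infer_use_dist():
    # use_dist need not be user-facing: a world size above 1 implies
    # the distributed path, a single card implies the plain path.
    return paddle.distributed.get_world_size() > 1
```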
return tuple(local_shape), tuple(global_offset)


def flatten_state_dict(state_dict):
Why return directly?
This is a TODO: it is meant to support the case state_dict = {"model": model.state_dict(), "optimizer": optimizer.state_dict()}, but that is not implemented yet, so for now the passed-in state_dict is left untouched.
ok
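For reference, a minimal sketch of the nested flattening that TODO describes, assuming dot-separated keys (the convention is an assumption, not the PR's final design):

```python
def flatten_state_dict(state_dict, prefix=""):
    # Recursively flatten nested dicts such as
    # {"model": model.state_dict(), "optimizer": optimizer.state_dict()}
    # into one mapping with dot-separated keys.
    flat = {}
    for key, value in state_dict.items():
        full_key = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten_state_dict(value, full_key))
        else:
            flat[full_key] = value
    return flat

# Example: prints {'model.w': 1, 'optimizer.lr': 0.1}
print(flatten_state_dict({"model": {"w": 1}, "optimizer": {"lr": 0.1}}))
```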
if coordinator_rank == paddle.distributed.get_rank():
    logger.debug(f"metadata:{metadata}")
    paddle.save(metadata, os.path.join(path, f"{unique_id}.metadata"))
Why not save the metadata on all ranks?
The metadata is global and identical on every rank, so only one copy needs to be saved.
I understand, but wouldn't saving on every rank make debugging easier, so you don't always have to go find rank 0? The metadata doesn't take much space either.
That probably won't work: each machine has multiple cards, and multiple cards writing to the same file at once can go wrong and leave unexpected contents.
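If per-rank metadata copies were ever wanted for debugging, the concurrent-write problem could be avoided by putting the rank into the filename; a hypothetical sketch, not part of this PR:

```python
import os

import paddle

def save_metadata_per_rank(metadata, path, unique_id):
    # Hypothetical alternative: one metadata file per rank, so cards on
    # the same machine never write to the same file concurrently.
    rank = paddle.distributed.get_rank()
    paddle.save(metadata, os.path.join(path, f"{unique_id}_{rank}.metadata"))
```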
The identifier of a local tensor.
"""

tensor_id: str
tensor_name or tensor_key?
tensor_name doesn't seem quite right: this field is an identifier, namely the structure_name in dynamic semi-auto mode and the tensor's name in static semi-auto mode. tensor_key means roughly the same as tensor_id, so it would also work; if you feel tensor_key is better, it can be changed.
Right, in the state_dict it is the key.
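For context, a sketch of the kind of dataclass this field belongs to (the second field is an assumption based on the shard-offset discussion elsewhere in this PR, not the exact definition):

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class LocalTensorIndex:
    # Identifier of a local tensor: the structure name in dynamic
    # semi-auto mode, the tensor name in static semi-auto mode;
    # in either case it is the key of the state_dict entry.
    tensor_id: str
    # Offset of this shard within the global tensor (assumed field).
    global_offset: Tuple[int, ...] = ()
```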
local_tensor_index not in tensor_id_list
), f"Duplicate tensor_id:{local_tensor_index} found. Check whether the metadata_file:{metadata_file} contains the same tensor metadata."
tensor_id_list.append(local_tensor_index.tensor_id)
if local_tensor_index.tensor_id in state_dict:
Is this state_dict the local_state_dict?
This state_dict is the one each rank maintains itself; it is local.
for rank, local_files in enumerate(global_data_files):
    if len(local_files) > 0:
        local_files = [
            f for f in local_files if f in necessary_data_files_set
When does local_files differ from necessary_data_files_set?
necessary_data_files_set is the set of all files required by the keys of the current state_dict; those files may live on other ranks. local_files here is a list that does cover all files readable across ranks, but that union can be larger than the set of data files the state_dict actually needs, so this filter keeps only the files that will be used.
If it is larger, should that raise a warning, or is it expected?
A larger set is fine and needs no warning, since it does not affect loading the current parameters.
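A toy illustration of the filtering discussed above (the file names are made up):

```python
# Files actually required by the keys of the current state_dict.
necessary_data_files_set = {"0_0.distcp", "1_0.distcp"}
# Files readable per rank; rank 0 also sees a file no key needs.
global_data_files = [
    ["0_0.distcp", "stale_0.distcp"],
    ["1_0.distcp"],
]
for rank, local_files in enumerate(global_data_files):
    needed = [f for f in local_files if f in necessary_data_files_set]
    print(rank, needed)  # 0 ['0_0.distcp'] / 1 ['1_0.distcp']
```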
@@ -0,0 +1,21 @@
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
2019->2023
done
@@ -0,0 +1,497 @@
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
2022->2023
done
if f not in file_to_ranks:
    file_to_ranks[f] = []
file_to_ranks[f].append(r)
logger.info(f"file_to_ranks:{file_to_ranks}")
Will this family of logger debug messages be cleaned up later? If they are staying, I suggest standardizing them.
Yes, the plan is to clean them all up before the final merge. If they should be standardized, is there a prescribed format?
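One common convention (an assumption here, not a prescribed Paddle format) is to route such detail through logger.debug, so it vanishes at the default INFO level:

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

file_to_ranks = {"0_0.distcp": [0, 1]}  # placeholder data
# Visible only when the level is lowered to DEBUG; silent by default.
logger.debug("file_to_ranks:%s", file_to_ranks)
```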
@@ -0,0 +1,42 @@
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
Please do a full pass over the PR; the copyright years are wrong in several places.
done
v._local_value().add_(paddle.ones_like(v._local_value()))
paddle.distributed.load_state_dict(state_dict, ckpt_path())
for k, v in state_dict.items():
    assert k in local_state_dict, k
What is the last k used for?
The final k is the message to print; the usage is assert condition, error_message.
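A one-line illustration of that form:

```python
local_state_dict = {"linear.weight": 1.0}
k = "linear.weight"
# Passes here; on failure the AssertionError message is the key itself.
assert k in local_state_dict, k
```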
assert k in local_state_dict, k
if v._is_initialized():
    self.check_tensor_eq(v._local_value(), local_state_dict[k])
os.system(f"rm -rf {ckpt_path()}")
Use tempfile.TemporaryDirectory(); you can find examples in other unit tests.
done
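A minimal sketch of the suggested pattern, assuming the save/load APIs from this PR (the tensor is a placeholder):

```python
import tempfile

import paddle
import paddle.distributed as dist

state_dict = {"w": paddle.to_tensor([1.0, 2.0])}

# The directory and everything in it is removed automatically on exit,
# replacing the manual `rm -rf` cleanup.
with tempfile.TemporaryDirectory() as ckpt_dir:
    dist.save_state_dict(state_dict, ckpt_dir)
    dist.load_state_dict(state_dict, ckpt_dir)
```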
Chinese API documentation PR: PaddlePaddle/docs#6355
LGTM
LGTM
LGTM
Unit test timeout setting
__all__ = [
    "save_state_dict",
    "load_state_dict",
]
Only add an API to __all__ at its recommended user path. Since we recommend using paddle.distributed.save_state_dict and paddle.distributed.load_state_dict, there is no need to add them to this list; the import above can be retained.
ok
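For illustration, the recommended call path (the state_dict and directory are placeholders):

```python
import paddle
import paddle.distributed as dist

state_dict = {"w": paddle.to_tensor([0.0])}
ckpt_dir = "./checkpoint"  # hypothetical output directory

# Use the recommended public path, not the defining submodule.
dist.save_state_dict(state_dict, ckpt_dir)
dist.load_state_dict(state_dict, ckpt_dir)
```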
def load_state_dict(
    state_dict,
    path,
    process_group=None,
    coordinator_rank=0,
) -> None:
I saw in the design document that there is a use_dist parameter, which is not implemented here. Do we need to implement it? If not, please explain the reason and update the design document.
LGTM
For the API docs, please follow the English template, and pay close attention to blank lines and indentation.
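A skeleton of the layout that template asks for, as a sketch (consult the actual template for the authoritative format; the parameter descriptions here are paraphrased):

```python
def load_state_dict(state_dict, path, process_group=None, coordinator_rank=0):
    """
    Load the state_dict inplace from a checkpoint path.

    Args:
        state_dict (dict): The state_dict to load in place.
        path (str): The checkpoint directory.
        process_group (ProcessGroup, optional): The group used for communication. Default: None.
        coordinator_rank (int, optional): The rank used to coordinate the checkpoint. Default: 0.

    Examples:
        .. code-block:: python

            >>> # doctest: +SKIP('state dict not exist')
    """
```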
coordinator_rank(int): The rank used to save non-distributed values. Rank 0 is used by default.

Examples:
    .. code-block:: python
Examples:
    .. code-block:: python
        >>> # doctest: +SKIP('Save state dict.')
Suggested change:
- >>> # doctest: +SKIP('Save state dict.')
+ >>> # doctest: +SKIP('state dict not exist')

Please state the reason for skipping the check clearly, to keep it readable.
) -> None:
    """
    Load the state_dict inplace from a checkpoint path.
    Args:
Suggested change: insert a blank line before Args:.

Add blank lines between the description, the arguments, and the other docstring sections; otherwise the official site may render them incorrectly.
Example:
    .. code-block:: python
Suggested change: likewise, insert a blank line before Example: and before .. code-block:: python.
coordinator_rank(int): The rank used to coordinate the checkpoint. Rank 0 is used by default.
Example:
    .. code-block:: python
        >>> # doctest: +SKIP('Load state dict.')
Suggested change:
- >>> # doctest: +SKIP('Load state dict.')
+ >>> # doctest: +SKIP('state dict not exist')

Please state the reason clearly, to keep it readable.
LGTM. Merging this first; the related changes will follow.
PR types
Others
PR changes
Others
Description
card-78318
Design the save_state_dict and load_state_dict APIs to support saving and loading checkpoints for dynamic and static graph semi-auto distributed training.
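A sketch of the intended end-to-end usage, assembled from the docstring excerpts above (the mesh, shapes, and directory are placeholder assumptions):

```python
import paddle
import paddle.distributed as dist

# Shard a tensor over a two-card 1-D mesh (placeholder setup).
mesh = dist.ProcessMesh([0, 1])
w = dist.shard_tensor(paddle.zeros([4, 4]), mesh, [dist.Shard(0)])

state_dict = {"w": w}
ckpt_dir = "./semi_auto_ckpt"  # hypothetical directory

# Each rank saves its own shards; the coordinator rank writes the
# global metadata file.
dist.save_state_dict(state_dict, ckpt_dir)
# Reload in place under the current distributed strategy.
dist.load_state_dict(state_dict, ckpt_dir)
```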