Can Gemini's ZeRO approach initialize a GPT-3-scale model? #2479
Unanswered
yhcc asked this question in Community | Q&A
Replies: 3 comments 4 replies
-
How shard init works: ColoInitContext initializes the model one parameter at a time. For each parameter it first allocates the global tensor on every process, then splits it into N shards so each process keeps only 1/N of the data. Once initialization finishes, each process therefore holds only 1/N of the memory; a minimal sketch of the idea follows below.
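A minimal PyTorch sketch of that principle (illustrative only, not ColossalAI's actual implementation; the function and tensor names here are made up):

```python
import torch
import torch.distributed as dist

def shard_init_parameter(full_param: torch.Tensor, rank: int, world_size: int) -> torch.Tensor:
    """Keep only this rank's 1/N shard of a freshly initialized global tensor."""
    shards = torch.chunk(full_param, world_size, dim=0)
    local_shard = shards[rank].clone()  # copy so the full tensor's storage can be freed
    return local_shard

# Usage sketch (assumes torch.distributed has been initialized):
# rank, world_size = dist.get_rank(), dist.get_world_size()
# full = torch.nn.init.normal_(torch.empty(4096, 4096))  # global tensor allocated on every rank
# local = shard_init_parameter(full, rank, world_size)
# del full  # drop the global copy; per-rank memory falls to ~1/world_size
```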
-
Hi @yhcc @taishiciR, for shard init you can refer to here; a hedged sketch of what that looks like is included below.
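For reference, a sketch of how shard init could be enabled in that Gemini GPT demo around the time of this thread. This is an assumption about the API of that era: the exact imports and keyword arguments (ColoInitContext, ShardSpec, ProcessGroup) vary between ColossalAI versions, and model_builder is a placeholder.

```python
import torch
from colossalai.utils import get_current_device
from colossalai.utils.model.colo_init_context import ColoInitContext
from colossalai.tensor import ProcessGroup, ShardSpec

world_size = torch.distributed.get_world_size()

# Shard every parameter across all ranks at initialization time,
# so no single rank ever materializes the full GPT-3-scale model.
shard_pg = ProcessGroup(tp_degree=world_size)          # assumption: API as in the 0.2.x demo
default_dist_spec = ShardSpec([-1], [world_size])      # shard along the last dim

with ColoInitContext(device=get_current_device(),
                     default_dist_spec=default_dist_spec,
                     default_pg=shard_pg):
    model = model_builder()  # placeholder for the GPT model construction in the demo
```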
-
https://github.com/hpcaitech/ColossalAI/blob/main/examples/language/gpt/gemini/train_gpt_demo.py provides a Gemini-based ZeRO setup. I tried to adapt it to directly support a GPT-3-scale model, but that leads to OOM. Looking at the GeminiDDP source, I don't see any code that shards parameters across devices (maybe I'm misreading it). Is there a recommended way to train a GPT-3-scale model with this Gemini setup?