DeepSpeed initializes the entire model on each GPU at the beginning #3154
-
I find that DeepSpeed puts the entire model onto each GPU at the very beginning and only partitions it afterwards. Is this the intended behavior? I can't train a large model (10B+) with pure stage 3 on V100s (16GB), no matter how many GPUs I use, because it always OOMs at the beginning. I also profiled the same code (only `deepspeed.initialize`) on GPUs with more memory, and the results show that the initial per-GPU peak usage stays the same regardless of the number of GPUs; the peak usage exactly matches the memory required for my model in fp16 format.
Replies: 1 comment 3 replies
-
You will need to init your model with `deepspeed.zero.Init()`, so that parameters are partitioned across the data-parallel group as they are created instead of the full model being materialized on every rank first.
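A minimal sketch of what this looks like, assuming a plain PyTorch model and an illustrative `ds_config` (the model, sizes, and config values here are placeholders, not from the thread; some `zero.Init` keyword arguments vary slightly across DeepSpeed versions):

```python
import deepspeed
import torch.nn as nn

# Illustrative ZeRO stage 3 config; adjust batch size, fp16, etc. for your setup.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "fp16": {"enabled": True},
    "zero_optimization": {"stage": 3},
}

# Construct the (hypothetical) large model inside zero.Init so each rank only
# allocates its own shard of every parameter, rather than the whole fp16 model.
with deepspeed.zero.Init(config_dict_or_path=ds_config):
    model = nn.Sequential(*[nn.Linear(4096, 4096) for _ in range(48)])

# deepspeed.initialize then wraps the already-partitioned model.
engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```

Without the `zero.Init` context, every rank builds the full model first, which is why the initial peak usage you measured matches the full fp16 model size and does not shrink as you add GPUs.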