GAN model with ZeRO3 with offload #3088
Unanswered
EvgenyUgolkov asked this question in Q&A
Replies: 0 comments
Dear Team, good day.
I am trying to run the GAN example you provide (MNIST dataset), but with ZeRO-3 and the offloading feature enabled in the configuration, as follows:
I pass the configuration to the code as a dictionary and initialize the models as:
The rest is exactly the same as in your GAN example.
I run it on 2 GPUs.
After one successful iteration, I get the following error:
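For reference, a ZeRO-3 configuration with CPU offload, written as a Python dictionary and passed to `deepspeed.initialize`, generally has the shape sketched below. This is only an illustration: the batch size, optimizer settings, and the helper name `init_engines` are placeholder assumptions, not the exact settings from this run.

```python
import copy

# Illustrative ZeRO-3 + CPU offload config (placeholder values, NOT the
# settings from the failing run).
ds_config = {
    "train_batch_size": 64,  # assumed value
    "optimizer": {
        "type": "Adam",
        "params": {"lr": 2e-4, "betas": [0.5, 0.999]},  # assumed values
    },
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "offload_param": {"device": "cpu", "pin_memory": True},
    },
}


def init_engines(netD, netG):
    """Hypothetical helper: wrap discriminator and generator in DeepSpeed engines."""
    import deepspeed  # imported lazily so the dict above can be inspected without it

    # Each engine gets its own copy of the config dictionary.
    engineD, optimizerD, _, _ = deepspeed.initialize(
        model=netD,
        model_parameters=netD.parameters(),
        config=copy.deepcopy(ds_config),
    )
    engineG, optimizerG, _, _ = deepspeed.initialize(
        model=netG,
        model_parameters=netG.parameters(),
        config=copy.deepcopy(ds_config),
    )
    return engineD, optimizerD, engineG, optimizerG
```

The sketch passes the dictionary through the `config` argument only; if the script also supplies a config file through the command-line arguments, the two sources may conflict.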
[0/1][0/938] Loss_D: 1.4656 Loss_G: 4.7239 D(x): 0.6025 D(G(z)): 0.5315 / 0.0121
Traceback (most recent call last):
  File "/ibex/user/ugolkoea/SUPER/GAN/gan/gan_deepspeed_train.py", line 208, in <module>
    main()
  File "/ibex/user/ugolkoea/SUPER/GAN/gan/gan_deepspeed_train.py", line 205, in main
    train(args)
  File "/ibex/user/ugolkoea/SUPER/GAN/gan/gan_deepspeed_train.py", line 151, in train
    output = netD(real)
  File "/ibex/user/ugolkoea/env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    result = hook(self, args)
  File "/ibex/user/ugolkoea/env/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/ibex/user/ugolkoea/env/lib/python3.10/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 348, in _pre_forward_module_hook
    self.pre_sub_module_forward_function(module)
  File "/ibex/user/ugolkoea/env/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/ibex/user/ugolkoea/env/lib/python3.10/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 475, in pre_sub_module_forward_function
    param_coordinator.trace_prologue(sub_module)
  File "/ibex/user/ugolkoea/env/lib/python3.10/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 147, in trace_prologue
    if sub_module != self.__submodule_order[self.__step_id]:
IndexError: tuple index out of range
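Reading only the failing line, the index `__step_id` has moved past the end of the tuple `__submodule_order`. The toy sketch below reproduces that failure mode in plain Python. It is not DeepSpeed's implementation: the class `ToyTraceCoordinator` and its methods are hypothetical, and whether the GAN script really invokes more sub-modules in a later pass than in the recorded one is an assumption.

```python
class ToyTraceCoordinator:
    """Toy stand-in for a coordinator that records module order, then replays it."""

    def __init__(self):
        self.recording = True
        self.order = []      # module names seen while recording
        self.frozen = ()     # recorded trace, frozen into a tuple
        self.step_id = 0     # position within the current pass

    def pre_forward(self, name):
        if self.recording:
            self.order.append(name)
        elif name != self.frozen[self.step_id]:  # same shape as the failing line
            raise RuntimeError(f"unexpected module {name!r} at step {self.step_id}")
        self.step_id += 1

    def end_of_step(self):
        # Freeze the trace after the first pass and rewind the position counter.
        if self.recording:
            self.frozen = tuple(self.order)
            self.recording = False
        self.step_id = 0


coordinator = ToyTraceCoordinator()

# First pass: two module calls are recorded, then the trace is frozen.
for name in ["conv", "fc"]:
    coordinator.pre_forward(name)
coordinator.end_of_step()

# Second pass: a third call arrives before the counter is rewound.
try:
    for name in ["conv", "fc", "conv"]:
        coordinator.pre_forward(name)
except IndexError as err:
    print(f"IndexError: {err}")  # prints "IndexError: tuple index out of range"
```

In this toy model the error disappears if `end_of_step()` is called between the second and third module calls, or if the recorded pass already contains all three calls.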
The full output file is attached for your convenience:
slurm-24400324.out.txt
Could you tell me what I am doing wrong?
Regards, Evgeny