In #3740, we added support for FullyShardedDataParallel, but limited the implementation to Zero2 rather than Zero3. Zero3 substantially reduces memory usage compared with Zero2 while bringing speed back in line with vanilla DDP.
We have already added support for this (via manual calls to wrap) within the Transformer modules (a rough sketch of the wrapping appears after the list below), but we still cannot support Zero3. The main issue is that Zero3 assumes every worker calls forward the exact same number of times, and it performs a parameter transfer during each forward (moving the sharded parameters to each worker just in time). ParlAI cannot provide this guarantee because:
- During validation, each worker sees a variable number of examples. This is fine in itself, but it causes a hang if any worker ends up with extra batches.
- During generation, workers perform a variable number of forwards due to variable sequence lengths. Everything stays happy for a while, but if one worker ends the run needing more generation steps than the others, we get hangs.
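For context, a minimal sketch of the per-layer wrapping described above, assuming Fairscale's enable_wrap/wrap API; the module names, layer sizes, and config values here are illustrative, not ParlAI's actual code:

```python
# Hedged sketch: wrap each Transformer layer in Fairscale's FSDP so that,
# with reshard_after_forward=True (Zero3 behavior), shards are gathered
# just-in-time during forward and released afterwards.
import torch.nn as nn
from fairscale.nn import FullyShardedDataParallel as FSDP
from fairscale.nn.wrap import enable_wrap, wrap


class TransformerEncoder(nn.Module):
    def __init__(self, n_layers: int, dim: int):
        super().__init__()
        # wrap() only takes effect when called inside an enable_wrap context.
        self.layers = nn.ModuleList(
            [wrap(nn.TransformerEncoderLayer(dim, nhead=8)) for _ in range(n_layers)]
        )

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x


def build_model(n_layers: int = 6, dim: int = 512) -> nn.Module:
    # reshard_after_forward=True corresponds to Zero3;
    # False corresponds to the Zero2 setup we have today.
    fsdp_config = dict(
        wrapper_cls=FSDP,
        reshard_after_forward=True,
        flatten_parameters=True,
    )
    with enable_wrap(**fsdp_config):
        model = wrap(TransformerEncoder(n_layers, dim))
    return model
```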
It seems far too difficult (and ugly) to try to force this equality in worlds.py or in our dataset sharding. So our best bet going forward is to implement something like vanilla DDP's .join(). It would work roughly as follows (a rough sketch appears after this list):
- In forward, every worker tries to synchronize a True boolean saying "Am I doing a true forward?"
- Upon __exit__ of the context, workers enter an infinite loop where they sync a False boolean. As long as any worker is providing a True value, they participate in a dummy batch forward.
- When all workers agree on the False boolean, we can end the infinite loop.
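A hedged sketch of what such a context manager might look like, assuming torch.distributed is already initialized; sync_forwards, dummy_batch, and the all_reduce-on-a-flag scheme are hypothetical here, not an existing ParlAI or Fairscale API:

```python
# Sketch of the proposed join-style mechanism: every real forward syncs a
# True flag; after exiting, a finished worker keeps feeding dummy batches
# until every rank reports False, so collective ops stay matched.
import contextlib

import torch
import torch.distributed as dist


@contextlib.contextmanager
def sync_forwards(model, dummy_batch):
    device = next(model.parameters()).device

    def any_worker_active(active: bool) -> bool:
        # 1 means "I am doing a true forward"; MAX over ranks tells us
        # whether anyone is still active.
        flag = torch.tensor([1 if active else 0], device=device)
        dist.all_reduce(flag, op=dist.ReduceOp.MAX)
        return bool(flag.item())

    def synced_forward(batch):
        any_worker_active(True)  # announce a true forward
        return model(batch)

    try:
        yield synced_forward
    finally:
        # This rank is done; keep doing dummy forwards until all ranks
        # agree on False, then the loop (and the hang risk) ends.
        while any_worker_active(False):
            with torch.no_grad():
                model(dummy_batch)
```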
This feature makes the most sense to implement upstream in Fairscale, and then integrate into ParlAI.
> During validation, each worker sees a variable number of examples. This is okay in itself, but it is problematic (hang) if it results in any worker having extra batches.
PyTorch distributed has a wrapper for that; I've tried to look it up to no avail (maybe it's not public yet). Not sure how applicable it would be, just a heads up.
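For reference, the vanilla DDP .join() context manager mentioned in the issue is used roughly like the minimal sketch below; the model/loader setup is illustrative, and exact behavior depends on the PyTorch version:

```python
# Hedged sketch: DDP's join() shadows collectives for ranks that run out
# of data early, so uneven inputs no longer hang the job.
import torch
from torch.nn.parallel import DistributedDataParallel as DDP


def validate(model: DDP, loader, device):
    # Ranks that exhaust their loader early keep matching the collective
    # calls issued by ranks that still have batches.
    with model.join():
        for batch in loader:
            with torch.no_grad():
                model(batch.to(device))
```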