Is Sync BatchNorm supported? #2509
Replies: 8 comments
-
I am also curious. My guess is that you need to call `convert_sync_batchnorm` manually, because sync BN belongs to the model-building part rather than the trainer engine. Have you made any progress?
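For reference, the manual conversion in plain PyTorch is a one-liner (a minimal sketch with a toy model; the helper recursively swaps the layers):

```python
import torch.nn as nn

# Toy model containing a regular BatchNorm layer.
model = nn.Sequential(nn.Conv2d(3, 16, 3), nn.BatchNorm2d(16), nn.ReLU())

# Recursively replace every BatchNorm*D layer with nn.SyncBatchNorm.
# This must be done before wrapping the model in DistributedDataParallel.
model = nn.SyncBatchNorm.convert_sync_batchnorm(model)
```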
-
@Yelen719 @ruotianluo we support sync_batchnorm in Lightning now.
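For example (a minimal sketch; the flag only takes effect with a DDP backend on multiple GPUs, and the argument names match the Lightning version current at the time of this thread):

```python
from pytorch_lightning import Trainer

# sync_batchnorm=True makes Lightning convert every BatchNorm layer
# in the model to SyncBatchNorm before wrapping it in DDP.
trainer = Trainer(gpus=2, distributed_backend="ddp", sync_batchnorm=True)
```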
-
Hi, is there any tutorial on how to use SyncBatchNorm in Lightning?
-
@phongnhhn92 A quick search in the docs turns it up: https://pytorch-lightning.readthedocs.io/en/latest/trainer.html#sync-batchnorm
-
Hi @DKandrew @ananyahjha93, can you provide an example of how to use it? Of course, if it is as easy as it sounds, I can just add that option to the Trainer. My question is: will that work out of the box for a model that already uses PyTorch SyncBatchNorm, or the Apex SyncBN mentioned above?
-
Hi @phongnhhn92 Here is an example: https://github.com/PyTorchLightning/pytorch-lightning/blob/114af8ba9fc42fcf7053fa06299fbe4aecab8a06/pl_examples/basic_examples/sync_bn.py

By the way, I don't think the example given there is completely correct: it does not set the random seed properly. Based on my personal understanding, the seed should be set after all the processes have been created (or "spawned", if you prefer). Here, the mistake is that the random seed is set only in the main process. I am not 100% sure about my analysis, though; I am not sure whether the call at line 24 of the example can set the seed in all the processes (a Python question). Unfortunately, Lightning does not have good documentation for this (I raised issue #3460).

I believe it is using PyTorch SyncBatchNorm. Check out the source code here
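To illustrate the seeding point (a sketch, untested: `setup` runs in every DDP process, so seeding there would cover all workers, unlike module-level code in the launching script):

```python
import pytorch_lightning as pl

class MyModel(pl.LightningModule):
    def setup(self, stage):
        # setup() is called in each spawned process, so the seed is
        # set everywhere, not only in the main process.
        pl.seed_everything(1234)
```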
-
Hi @DKandrew, after reading the example, I think we should define our model with regular BatchNorm, and then if we set the option sync_batchnorm=True in the Trainer, the framework will convert all those BatchNorm layers into SyncBatchNorm for us. I will test this in my code to see whether it works like that, roughly as sketched below.
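Something like this (a minimal sketch; `LitModel` is a placeholder and the training-loop methods are omitted):

```python
import torch.nn as nn
from pytorch_lightning import LightningModule, Trainer

class LitModel(LightningModule):
    def __init__(self):
        super().__init__()
        # Plain BatchNorm here; with sync_batchnorm=True, Lightning
        # should convert it to SyncBatchNorm before DDP wrapping.
        self.net = nn.Sequential(nn.Conv2d(3, 16, 3), nn.BatchNorm2d(16))

    def forward(self, x):
        return self.net(x)

trainer = Trainer(gpus=2, distributed_backend="ddp", sync_batchnorm=True)
```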
-
Hi @phongnhhn92, from my personal experience there is not much difference between Apex and PyTorch SyncBatchNorm, and I vaguely remember that the Apex developers work closely with PyTorch's, so their implementations may be fundamentally the same (don't quote me; take this with a grain of salt). I have used nn.SyncBatchNorm for a while on semantic segmentation tasks and haven't encountered any issues so far; my network output is decent, so I would say the PyTorch one is safe to use.
-
Does pytorch-lightning support synchronized batch normalization (SyncBN) when training with DDP? If so, how can it be used?
If not, Apex has implemented SyncBN, and one can use it with native PyTorch and Apex by:
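For example (a sketch of the standard conversion calls; `model` stands for any `nn.Module` containing BatchNorm layers):

```python
import torch
from apex.parallel import convert_syncbn_model

# Native PyTorch conversion to synchronized BatchNorm:
model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)

# or Apex's equivalent conversion helper:
model = convert_syncbn_model(model)
```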
How can they be used under the pytorch-lightning scheme?
SyncBN makes a big difference when training a model with DDP, and it would be great to know how to use it in pytorch-lightning.
Thanks!