Gradient Accumulation #345
-
Hi @rwightman. Thank you for the great work. While skimming the training script I noticed that gradient accumulation is not used. Is there a reason it isn't supported? Thanks in advance! Cheers,
-
@ademyanchuk I don't use it because these models all use BatchNorm by default. Gradient accumulation isn't a clear win with BatchNorm, as it does not improve the effective batch size for the BN running-stats calculation, and that can cause instability or poor results with small batches. One could use GroupNorm instead, and several models do support switching the norm layer quite easily, but that's not something I've experimented with.
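For readers who want to see the mechanics and the caveat from the reply in code, here is a minimal gradient-accumulation sketch in plain PyTorch. It is not from the timm train script; the toy model, synthetic data, and `accum_steps` value are illustrative assumptions.

```python
# Minimal sketch: gradient accumulation in PyTorch (illustrative, not timm's train script).
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(8, 16), nn.BatchNorm1d(16), nn.ReLU(), nn.Linear(16, 2))
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.CrossEntropyLoss()

accum_steps = 4  # take an optimizer step every 4 micro-batches

optimizer.zero_grad()
for step in range(100):
    x = torch.randn(8, 8)               # micro-batch of 8 synthetic samples
    y = torch.randint(0, 2, (8,))
    loss = criterion(model(x), y) / accum_steps  # scale so grads average over the window
    loss.backward()                     # gradients accumulate in each param's .grad
    if (step + 1) % accum_steps == 0:
        optimizer.step()                # update with the accumulated gradient
        optimizer.zero_grad()

# The caveat from the reply: each forward pass still normalizes with the
# micro-batch statistics, so BatchNorm's effective batch size stays 8 here,
# not 8 * accum_steps = 32. Only the gradient averaging sees the larger batch.
```

For the norm-layer swap mentioned above, many timm architectures accept a `norm_layer` constructor argument, but the callable signature they expect varies by model, so check the specific model's constructor before substituting GroupNorm.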