Question about Gradient Accumulation step in Trainer

I understand that the gradient accumulation step helps to increase the effective batch size.

I also understand that if the model has a BatchNorm layer, gradient accumulation will not guarantee exactly the same performance as training with a large batch size directly (without accumulation),

but most models in Transformers are based on the transformer architecture, which uses layer normalization,

so can I expect the trained model to give the same metric performance either way? (e.g. effective batch size 64 with 4 samples per device * 4 GPUs * 4 accumulation steps == effective batch size 64 with 16 samples per device * 4 GPUs)
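For concreteness, the two settings from the example above would look roughly like this with Trainer (a minimal sketch; the `output_dir` values are placeholders, and the number of GPUs is whatever the machine exposes):

```python
from transformers import TrainingArguments

# Effective batch size = per_device_train_batch_size * num_gpus * gradient_accumulation_steps.
# On a 4-GPU machine, both configurations below give an effective batch size of 64.
args_with_accumulation = TrainingArguments(
    output_dir="out-accum",            # placeholder path
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
)

args_without_accumulation = TrainingArguments(
    output_dir="out-big",              # placeholder path
    per_device_train_batch_size=16,
    gradient_accumulation_steps=1,
)
```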

In short,
my question is: for transformer models that use layer normalization, will training with the full batch size at once and training with gradient accumulation steps give the same model performance?

Yes, layer normalization does not track any batch statistics, so you will get exactly the same thing with batch size 4 * 4 gradient accumulation steps as with batch size 16 (and that would indeed not be the case for neural nets using BatchNorm).
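One way to convince yourself is a minimal PyTorch sketch (my own illustration, assuming a simple MSE loss with mean reduction and no dropout): accumulating gradients over four micro-batches of 4 reproduces the gradient of a single batch of 16 for a LayerNorm model.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Tiny LayerNorm-based model: normalization happens per sample, no batch statistics.
model = nn.Sequential(nn.Linear(8, 8), nn.LayerNorm(8), nn.Linear(8, 1))
loss_fn = nn.MSELoss()

x = torch.randn(16, 8)
y = torch.randn(16, 1)

# Gradient from one full batch of 16.
model.zero_grad()
loss_fn(model(x), y).backward()
full_grads = [p.grad.clone() for p in model.parameters()]

# Same 16 examples as 4 micro-batches of 4 with gradient accumulation.
model.zero_grad()
for i in range(0, 16, 4):
    # Divide by the number of accumulation steps so the accumulated gradient
    # equals the mean-loss gradient over all 16 samples.
    (loss_fn(model(x[i:i + 4]), y[i:i + 4]) / 4).backward()
accum_grads = [p.grad.clone() for p in model.parameters()]

print(all(torch.allclose(f, a, atol=1e-6) for f, a in zip(full_grads, accum_grads)))
# Expected: True. With a BatchNorm layer in train mode this would generally be False,
# because each micro-batch would be normalized with its own batch statistics.
```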


Thanks! Your reply has been a great help to me.