Question about Gradient Accumulation step in Trainer

I can see that gradient accumulation steps help to increase the effective batch size.

I also understand that if the model has a BatchNorm layer, gradient accumulation will not guarantee exactly the same performance as a model trained with the large batch size directly (not using accumulation).
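For context, here is a minimal sketch of why BatchNorm breaks the equivalence (plain PyTorch with made-up tensor sizes): a sample's normalized output depends on the other samples in its batch, so splitting one large batch into micro-batches changes the forward pass.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
bn = nn.BatchNorm1d(4)   # toy layer, made-up sizes
x = torch.randn(8, 4)

# The same first 4 samples, normalized as part of a batch of 8 vs. a batch of 4:
out_in_full_batch = bn(x)[:4]
out_in_micro_batch = bn(x[:4])

# In training mode BatchNorm normalizes with the current batch's statistics,
# so identical inputs produce different outputs depending on the batch they are in.
print(torch.allclose(out_in_full_batch, out_in_micro_batch))  # False in general
```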

But most models in Transformers are based on the transformer architecture, which uses layer normalization,

so does that mean I can expect the trained model to give the same metric performance either way? (e.g. an effective batch size of 64 with 4 samples per device * 4 GPUs * 4 accumulation steps == an effective batch size of 64 with 16 samples per device * 4 GPUs)

In short,
my question is: for transformer models that use layer normalization, will training with the full batch size at once and training with gradient accumulation steps give the same model performance?
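For concreteness, here is a sketch of the two Trainer configurations I am comparing, using `transformers.TrainingArguments` (the `output_dir` values are placeholders, and the 4 GPUs come from how the script is launched, not from these arguments):

```python
from transformers import TrainingArguments

# Config A: 4 samples per device * 4 GPUs * 4 accumulation steps = effective batch of 64
args_accum = TrainingArguments(
    output_dir="out-accum",            # placeholder output dir
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
)

# Config B: 16 samples per device * 4 GPUs, no accumulation = effective batch of 64
args_direct = TrainingArguments(
    output_dir="out-direct",           # placeholder output dir
    per_device_train_batch_size=16,
    gradient_accumulation_steps=1,
)
```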


Yes, layer normalization does not track any batch statistics, so you will get exactly the same thing with batch size 4 * 4 gradient accumulation steps as with batch size 16 (and that would indeed not be the case with neural nets using BatchNorm).
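Here is a minimal sketch of that equivalence, using plain PyTorch and a toy model rather than the Trainer itself: each micro-batch loss is divided by the number of accumulation steps so the accumulated gradient equals the full-batch mean gradient.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy model with LayerNorm only (no BatchNorm), plus made-up data.
model = nn.Sequential(nn.Linear(8, 8), nn.LayerNorm(8), nn.Linear(8, 1))
data = torch.randn(16, 8)
target = torch.randn(16, 1)
loss_fn = nn.MSELoss()

# 1) Full batch of 16 in a single forward/backward pass.
model.zero_grad()
loss_fn(model(data), target).backward()
full_grads = [p.grad.clone() for p in model.parameters()]

# 2) Same 16 samples as 4 micro-batches of 4; each loss is divided by 4
#    so the accumulated gradient equals the full-batch mean gradient.
model.zero_grad()
for x_chunk, y_chunk in zip(data.chunk(4), target.chunk(4)):
    (loss_fn(model(x_chunk), y_chunk) / 4).backward()

for g_full, g_accum in zip(full_grads, (p.grad for p in model.parameters())):
    assert torch.allclose(g_full, g_accum, atol=1e-6)
print("Gradients match: accumulation == full batch with LayerNorm.")
```

Without the division by 4 you would get the sum of the micro-batch gradients instead of the mean, so the scaling is what makes the two setups line up.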


Thanks! Your reply has been a great help to me.