I can see that gradient accumulation helps to increase the effective batch size.
I also understand that if the model has a Batch Norm layer, gradient accumulation will not guarantee exactly the same performance as a model trained with a large batch size (without accumulation), since the batch statistics are computed over each smaller batch.
However, most models in Transformers are based on the transformer architecture, which uses layer normalization.
Does that mean I can guarantee that the trained model would give the same metric performance either way? (e.g. batch size 64 as 4 per device * 4 GPUs * 4 accumulation steps == batch size 64 as 16 per device * 4 GPUs)
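To make the two setups I am comparing concrete, here is a minimal sketch of the arithmetic. The helper name `effective_batch_size` and its parameters are my own illustration, not a library API (I believe they correspond roughly to `per_device_train_batch_size` and `gradient_accumulation_steps` in `TrainingArguments`).

```python
def effective_batch_size(per_device_batch, num_gpus, accumulation_steps=1):
    """Number of samples that contribute to a single optimizer step.

    Hypothetical helper for illustration only.
    """
    return per_device_batch * num_gpus * accumulation_steps


# Setup A: small per-device batch with gradient accumulation
with_accumulation = effective_batch_size(4, num_gpus=4, accumulation_steps=4)

# Setup B: large per-device batch, no accumulation
without_accumulation = effective_batch_size(16, num_gpus=4)

print(with_accumulation, without_accumulation)
```

Both setups see 64 samples per optimizer step; the only difference is whether those samples are processed in one forward/backward pass per device or in four.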
My question is: for transformer models that use layer normalization, will training with the full batch size at once and training with gradient accumulation steps give the same model performance?
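For context, the distinction I am relying on can be sketched in plain Python. This is a toy forward pass only, under my own assumptions (no learnable scale or shift, hypothetical function names `batch_norm` and `layer_norm`, made-up data); it does not cover gradients or a full training run.

```python
from statistics import fmean, pvariance


def batch_norm(batch, eps=1e-5):
    # Normalize each feature with statistics computed ACROSS the batch
    # (training-mode behaviour, no affine parameters).
    columns = list(zip(*batch))
    means = [fmean(col) for col in columns]
    stds = [(pvariance(col) + eps) ** 0.5 for col in columns]
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in batch]


def layer_norm(batch, eps=1e-5):
    # Normalize each sample with statistics computed over ITS OWN features,
    # so the result does not depend on which other samples are in the batch.
    out = []
    for row in batch:
        mean = fmean(row)
        std = (pvariance(row) + eps) ** 0.5
        out.append([(x - mean) / std for x in row])
    return out


# Toy data: 4 samples with 2 features each
batch = [[1.0, 2.0], [3.0, 6.0], [5.0, 4.0], [7.0, 8.0]]
# The same samples split into 2 micro-batches, as with accumulation
micro_batches = [batch[:2], batch[2:]]

bn_full = batch_norm(batch)
bn_micro = [row for mb in micro_batches for row in batch_norm(mb)]

ln_full = layer_norm(batch)
ln_micro = [row for mb in micro_batches for row in layer_norm(mb)]

print("Batch Norm outputs identical:", bn_full == bn_micro)
print("Layer norm outputs identical:", ln_full == ln_micro)
```

On this toy input the Batch Norm outputs change when the batch is split, while the layer norm outputs do not. What I am unsure about is whether that per-sample independence is enough to make the final trained models match.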