Question about Gradient Accumulation step in Trainer

ben9004 · September 10, 2021, 8:38am

I can see that gradient accumulation steps help to increase the effective batch size.

I also understand that if the model has a Batch Norm layer, gradient accumulation will not guarantee exactly the same performance as a model trained with a large batch size (without accumulation).

But most models in Transformers are based on the transformer architecture, which uses layer normalization.

So does that mean I can guarantee that the trained model would give the same metric performance both ways? (e.g. batch size 64 with 4 per device * 4 GPUs * 4 accumulation steps == batch size 64 with 16 per device * 4 GPUs)

In short, my question is: for transformer models, which use layer normalization, will training with the full batch size at once give the same model performance as using gradient accumulation steps?
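
To make the arithmetic in the example concrete, here is a minimal sketch of how the effective batch size works out in both setups. The helper `effective_batch_size` is a hypothetical name used only for illustration; its arguments loosely mirror the `per_device_train_batch_size` and `gradient_accumulation_steps` training arguments.

```python
def effective_batch_size(per_device, num_devices, accumulation_steps=1):
    # Samples that contribute to one optimizer step.
    return per_device * num_devices * accumulation_steps

# Setup A: 4 per device, 4 GPUs, 4 accumulation steps.
with_accumulation = effective_batch_size(4, 4, accumulation_steps=4)

# Setup B: 16 per device, 4 GPUs, no accumulation.
without_accumulation = effective_batch_size(16, 4)

print(with_accumulation, without_accumulation)
```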

sgugger · September 10, 2021, 1:18pm

Yes, layer normalization does not track batch statistics, so you will get the exact same thing with 4 batch size * 4 gradient accumulation or 16 batch size (and that would indeed not be the case with neural nets using BatchNorm).
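
A rough way to see this for yourself is the toy sketch below. It is plain Python rather than the Trainer's actual code, and it assumes a tiny linear model `y = w . norm(x)` with a mean squared error loss; all function names are made up for the illustration. It compares the gradient from one batch of 16 with the average of the gradients from 4 chunks of 4. With per-sample layer normalization the two should agree up to floating-point rounding, while normalizing with batch statistics makes them differ.

```python
import math
import random

def layer_norm(x, eps=1e-5):
    # Normalize one sample over its own features; no other sample is involved.
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def batch_norm(batch, eps=1e-5):
    # Normalize each feature over the samples currently in the batch.
    n = len(batch)
    cols = list(zip(*batch))
    means = [sum(c) / n for c in cols]
    variances = [sum((v - m) ** 2 for v in c) / n for c, m in zip(cols, means)]
    return [
        [(v - m) / math.sqrt(s + eps) for v, m, s in zip(row, means, variances)]
        for row in batch
    ]

def grad(batch, targets, w, norm):
    # Gradient with respect to w of the mean squared error of y = w . norm(x).
    hidden = [layer_norm(x) for x in batch] if norm == "layer" else batch_norm(batch)
    n = len(batch)
    g = [0.0] * len(w)
    for h, t in zip(hidden, targets):
        err = sum(wi * v for wi, v in zip(w, h)) - t
        for j, v in enumerate(h):
            g[j] += 2 * err * v / n
    return g

def accumulated_grad(batch, targets, w, norm, steps):
    # Average the gradients of `steps` equally sized chunks.
    size = len(batch) // steps
    total = [0.0] * len(w)
    for s in range(steps):
        chunk = slice(s * size, (s + 1) * size)
        g = grad(batch[chunk], targets[chunk], w, norm)
        total = [a + b / steps for a, b in zip(total, g)]
    return total

def max_diff(a, b):
    return max(abs(x - y) for x, y in zip(a, b))

rng = random.Random(0)
batch = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(16)]
targets = [rng.gauss(0, 1) for _ in range(16)]
w = [rng.gauss(0, 1) for _ in range(4)]

layer_diff = max_diff(grad(batch, targets, w, "layer"),
                      accumulated_grad(batch, targets, w, "layer", steps=4))
batch_diff = max_diff(grad(batch, targets, w, "batch"),
                      accumulated_grad(batch, targets, w, "batch", steps=4))

print("layer norm, full vs accumulated:", layer_diff)
print("batch norm, full vs accumulated:", batch_diff)
```

The chunks here are all the same size, which is what makes the average of chunk gradients equal the full-batch gradient in the layer normalization case.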

ben9004 · September 10, 2021, 2:50pm

Thanks! Your reply has been a great help to me.
