Gradient accumulation averages the gradients

So I have been looking at this for the past day and a half.

Here in the code, the gradient is averaged during gradient accumulation.

Please explain this to me: gradient accumulation should accumulate (i.e. sum) the gradients, not average them, right? Doesn't that make this scaling plain wrong? Am I missing something?

The same holds for multi-GPU parallel training, where mean() is used directly. Both cases would be closer to their single-GPU, full-batch-size equivalent if a plain sum were used, right?
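For context, here is a minimal sketch of the pattern being asked about, assuming a standard PyTorch training loop (the toy model, optimizer, and `accumulation_steps` below are illustrative placeholders, not the actual code linked above):

```python
import torch
from torch import nn

# Toy setup (placeholder names, not the code referenced above).
model = nn.Linear(10, 3)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()          # defaults to reduction='mean'
accumulation_steps = 4

inputs = torch.randn(32, 10)
targets = torch.randint(0, 3, (32,))

optimizer.zero_grad()
for x, y in zip(inputs.chunk(accumulation_steps),
                targets.chunk(accumulation_steps)):
    loss = loss_fn(model(x), y)
    # The scaling in question: dividing by accumulation_steps makes the
    # accumulated gradient an average over micro-batches rather than a sum.
    (loss / accumulation_steps).backward()   # .grad accumulates across calls
optimizer.step()
```

If one summed instead (i.e. called `loss.backward()` without the division), the magnitude of the accumulated gradient, and hence the effective learning rate, would grow with the number of accumulation steps.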

Well, it actually depends on the loss you are using, but most of the time we use a CrossEntropy loss averaged over the samples in the batch (so it's mostly independent of the batch size).

The natural extension of that to gradient accumulation is to average over the accumulation steps as well.
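Concretely: since CrossEntropyLoss defaults to reduction='mean', each micro-batch loss is already a per-sample average, so dividing it by the number of accumulation steps reproduces the gradient of the mean loss over the full batch. A quick sanity check with a toy linear model (a hypothetical example, not code from the thread):

```python
import torch
from torch import nn

torch.manual_seed(0)
model = nn.Linear(10, 3)
loss_fn = nn.CrossEntropyLoss()       # reduction='mean' by default
inputs = torch.randn(32, 10)
targets = torch.randint(0, 3, (32,))

# Gradient of the mean loss over the full batch.
model.zero_grad()
loss_fn(model(inputs), targets).backward()
full_grad = model.weight.grad.clone()

# Accumulated gradient with each micro-batch loss divided by the step count.
steps = 4
model.zero_grad()
for x, y in zip(inputs.chunk(steps), targets.chunk(steps)):
    (loss_fn(model(x), y) / steps).backward()
accum_grad = model.weight.grad.clone()

print(torch.allclose(full_grad, accum_grad, atol=1e-6))  # True
```

(The two match here because the toy model has no batch-dependent layers such as BatchNorm or Dropout.)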

I wrote a bit about that some time ago here: https://medium.com/huggingface/training-larger-batches-practical-tips-on-1-gpu-multi-gpu-distributed-setups-ec88c3e51255

Many thanks for the answer! I somehow did not know about the averaging in CrossEntropy.

Am I correct to assume that the differences I still see are due to batchnorm (and dropout)?
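As an illustration of the BatchNorm part of that question: in training mode, BatchNorm normalizes with statistics computed over the current micro-batch, so splitting a batch changes the forward pass itself. A toy example (not the model from the thread):

```python
import torch
from torch import nn

torch.manual_seed(0)
bn = nn.BatchNorm1d(5)   # training mode: normalizes with per-batch statistics
x = torch.randn(32, 5)

full = bn(x)                                            # stats over 32 samples
split = torch.cat([bn(chunk) for chunk in x.chunk(4)])  # stats over 8 samples each

print(torch.allclose(full, split))  # False: the outputs differ
```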