So I have been looking at this for the past day and a half.
Here in the code, the gradient in gradient accumulation is averaged.
Please explain this to me: gradient accumulation should accumulate (i.e. sum) the gradients, not average them, right? Doesn't that make this scaling plain wrong? Am I missing something?
The same holds for multi-GPU parallel training, where the mean() is used directly. Both cases would be closer to their 1-GPU, full-batch-size equivalent if just a sum were used, right?
Well, it actually depends on the loss you are using, but most of the time we use a CrossEntropy loss averaged over the samples in the batch (so it's mostly independent of the batch size).
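For example, PyTorch's `nn.CrossEntropyLoss` uses `reduction='mean'` by default, so the per-sample losses are averaged over the batch. A quick check (the shapes here are just made up for illustration):

```python
import torch
import torch.nn as nn

logits = torch.randn(8, 10)            # batch of 8, 10 classes (arbitrary)
targets = torch.randint(0, 10, (8,))

mean_loss = nn.CrossEntropyLoss()(logits, targets)                # default: 'mean'
sum_loss = nn.CrossEntropyLoss(reduction='sum')(logits, targets)  # explicit sum

# The default loss is the summed loss divided by the batch size.
print(torch.allclose(mean_loss, sum_loss / 8))  # True
```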
The natural extension of that to gradient accumulation is to average over the accumulation as well, i.e. divide each micro-batch loss by the number of accumulation steps (see the sketch below).
I wrote a bit about that some time ago here: https://medium.com/huggingface/training-larger-batches-practical-tips-on-1-gpu-multi-gpu-distributed-setups-ec88c3e51255
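To make that concrete, here is a minimal sketch (not the actual library code; the toy model, data and step count are made up) showing that dividing each micro-batch loss by the number of accumulation steps reproduces the full-batch mean gradient:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(4, 3)                    # toy model, no batchnorm/dropout
criterion = nn.CrossEntropyLoss()          # reduction='mean' by default
inputs = torch.randn(8, 4)
targets = torch.randint(0, 3, (8,))
accumulation_steps = 4                     # 8 samples -> 4 micro-batches of 2

# Reference: a single backward pass over the full batch with the mean loss.
model.zero_grad()
criterion(model(inputs), targets).backward()
full_batch_grad = model.weight.grad.clone()

# Gradient accumulation: .backward() sums gradients across calls, so dividing
# each micro-batch loss by accumulation_steps turns that sum into an average.
model.zero_grad()
for x, y in zip(inputs.chunk(accumulation_steps), targets.chunk(accumulation_steps)):
    loss = criterion(model(x), y) / accumulation_steps
    loss.backward()

print(torch.allclose(model.weight.grad, full_batch_grad, atol=1e-6))  # True
```

The toy model here has no batchnorm or dropout, which is why the match is exact.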
Many thanks for the answer! I somehow did not know about the averaging in CrossEntropy.
Am I correct to assume that the differences I do still see are due to batchnorm (and dropout)?