What is the limit of grad accumulation?

jordiclive · May 1, 2021, 4:55pm

As I understand it, effective batch size = gradient accumulation steps * batch size.

Firstly, is this the only adjustment that needs to be made, or does the learning rate need to change as well?

Secondly, how much can you accumulate? Say I want a batch size of 256 and can only fit a batch size of 4. Can I really accumulate gradients over 64 batches, or is that way too much and never done in practice?
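For reference, the arithmetic in the question can be written out as a small helper. This is only a sketch: `accumulation_steps` and its parameter names are made up for illustration and do not come from any library.

```python
def accumulation_steps(target_batch_size: int, per_step_batch_size: int, num_gpus: int = 1) -> int:
    """Return how many micro-batches to accumulate to reach the target effective batch size.

    Hypothetical helper: effective batch = num_gpus * per_step_batch_size * steps.
    """
    per_pass = per_step_batch_size * num_gpus
    if target_batch_size % per_pass != 0:
        raise ValueError("target batch size must be a multiple of num_gpus * per_step_batch_size")
    return target_batch_size // per_pass


# The case from the question: target 256, only 4 examples fit per step, one GPU.
steps = accumulation_steps(256, 4)
print(steps)
```

With a batch size of 4 on a single GPU this gives 64 accumulation steps, matching the numbers in the question.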

dropout05 · May 2, 2021, 6:02pm

From the pure math side, there is no limit to how many times you can accumulate gradients.

$$
\frac{\partial \mathcal{L}}{\partial \theta} = \sum_{\text{accumulation}} \sum_{\text{batch\_size}} \frac{\partial \mathcal{L}}{\partial \theta}(x_i, y_i) = \sum_{\text{larger\_batch\_size}} \frac{\partial \mathcal{L}}{\partial \theta}(x_i, y_i)
$$

I guess there could be more numerical accuracy / stability issues, so the results might differ a bit from the “true” batch_size=256, but not by much.
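The additivity argument can be checked on a toy problem. The sketch below uses plain Python rather than a deep learning framework, and the model (a single weight `w` with squared-error loss), the data, and all names are assumptions chosen for illustration. It compares the gradient summed over one batch of 256 examples with the same gradient accumulated over 64 micro-batches of 4.

```python
import random

random.seed(0)

# Toy data for a one-parameter model y_hat = w * x with squared-error loss.
xs = [random.uniform(-1.0, 1.0) for _ in range(256)]
ys = [3.0 * x + random.gauss(0.0, 0.1) for x in xs]
w = 0.5


def grad_sum(w, xs, ys):
    """Sum over examples of d/dw (w * x - y) ** 2."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys))


# Gradient from one "true" batch of 256 examples.
full = grad_sum(w, xs, ys)

# The same gradient accumulated over micro-batches of 4 examples.
micro = 4
accumulated = 0.0
num_micro_batches = 0
for start in range(0, len(xs), micro):
    accumulated += grad_sum(w, xs[start:start + micro], ys[start:start + micro])
    num_micro_batches += 1

# Any gap here comes only from floating-point summation order.
difference = abs(full - accumulated)
print(full, accumulated, difference)
```

Note that this sums per-example gradients. If the loss is averaged over each micro-batch instead, each micro-batch loss has to be divided by the number of accumulation steps to reproduce the large-batch mean.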

I don’t think you need to adjust the learning rate; use the same value you would use for batch_size = 256.

Also remember that it is deep learning — even if you are sure everything is going to be alright, it may be quite different in practice. Just try it out and see.

jordiclive · May 4, 2021, 11:34am

Hi, thanks for that. I was thinking along those lines, but I am still unsure about the maths. If it is purely additive, shouldn't performance be not much worse with a batch size of 1 and 64 accumulation steps? Is any worse performance down to numerical stability, or are you adding up lots of variances you wouldn't have by taking the gradient in one go?

Maybe you are right and it is only the batchnorm layers, so if you had no batch norm in your model it would be the same.
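To make the batch norm point concrete, here is a small sketch in plain Python. The activation values and the `normalize` helper are made up for illustration, and the normalization assumes the biased (population) variance of whatever batch it sees, with no learned scale or shift. It shows that the same examples are normalized differently depending on the micro-batch they land in.

```python
import statistics

# Hypothetical activations of one feature across a batch of 8 examples.
activations = [0.2, 1.5, -0.7, 3.1, 0.9, -1.2, 2.4, 0.0]


def normalize(values, eps=1e-5):
    """Batch-norm style normalization using only the statistics of `values`."""
    mean = statistics.fmean(values)
    var = statistics.pvariance(values)  # population variance of this batch
    return [(v - mean) / (var + eps) ** 0.5 for v in values]


# All 8 examples normalized together.
full_batch = normalize(activations)

# The same examples normalized in micro-batches of 2.
micro_batches = []
for start in range(0, len(activations), 2):
    micro_batches.extend(normalize(activations[start:start + 2]))

# Micro-batch size 1: mean equals the value and variance is 0,
# so every normalized activation collapses to 0.
size_one = [normalize([v])[0] for v in activations]

print(full_batch)
print(micro_batches)
print(size_one)
```

Gradient accumulation does not change this: the batch statistics are computed per forward pass, so a model with batch norm sees micro-batch statistics, not those of the effective batch.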

Anyway, I ask because it's taking me ages to train one model, so I don't want to try out too many learning rates.
Sam Shleifer said here (transformers/examples/research_projects/seq2seq-distillation at master · huggingface/transformers · GitHub) in the tips for Pegasus: "reduce batch size, and increase gradient accumulation steps so that the product gpus * batch size * gradient_accumulation_steps = 256. We used --learning-rate = 1e-4 * gradient accumulation steps."

So that’s why I was wondering if you need to increase the learning rate by that equation.
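The quoted tip does not say why the learning rate is tied to the accumulation steps, so the following is only one possible reading, not something the README confirms. For plain SGD, summing micro-batch gradients at a base learning rate gives the same update as averaging them and multiplying the learning rate by the number of accumulation steps. The sketch below checks that with made-up gradient values; it does not carry over to adaptive optimizers such as Adam, which rescale gradients themselves.

```python
accum_steps = 64
base_lr = 1e-4

# Hypothetical per-micro-batch gradients of a single scalar parameter.
micro_grads = [0.01 * (i % 7 - 3) + 0.02 for i in range(accum_steps)]

w0 = 1.0

# (a) Gradients averaged over the accumulation steps, learning rate scaled up.
mean_grad = sum(micro_grads) / accum_steps
w_mean_scaled_lr = w0 - (base_lr * accum_steps) * mean_grad

# (b) Gradients summed over the accumulation steps, base learning rate.
sum_grad = sum(micro_grads)
w_sum_base_lr = w0 - base_lr * sum_grad

print(w_mean_scaled_lr, w_sum_base_lr)
```

So whether the learning rate needs changing depends on whether the training loop sums or averages the accumulated gradients, which is worth checking in whatever trainer you use.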
