What is the limit of grad accumulation?

As I understand it, effective batch size = gradient accumulation steps * batch size.

Firstly, is this the only adjustment that needs to be made, with nothing changed for the learning rate?

Secondly, how much can you accumulate? Say I want a batch size of 256 and can only fit a batch size of 4. Can I really accumulate gradients over 64 batches, or is that way too much and never done in practice?

From the pure math side, there is no limit to how many times you can accumulate gradients.

$$\frac{\partial \mathcal{L}}{\partial \theta} = \sum_{\text{accumulation}} \sum_{\text{batch\_size}} \frac{\partial \mathcal{L}}{\partial \theta}(x_i, y_i) = \sum_{\text{larger\_batch\_size}} \frac{\partial \mathcal{L}}{\partial \theta}(x_i, y_i)$$
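To make the identity concrete, here is a minimal sketch in plain Python (no deep learning framework) that compares the gradient of one batch of 256 with the gradient accumulated over 64 micro-batches of 4. The toy linear model, the data, and the names `grad_mse`, `micro_batch_size`, and `accum_steps` are all assumptions made for illustration. One detail the sums above hide: this sketch uses a mean-reduced loss, so each micro-batch gradient is divided by the number of accumulation steps. Without that division the accumulated gradient would be 64 times larger than the full-batch one.

```python
import random


def grad_mse(w, b, xs, ys):
    """Gradient of the mean squared error of y ~ w*x + b over the given samples."""
    n = len(xs)
    gw = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
    gb = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
    return gw, gb


# Hypothetical toy data: 256 samples from a noisy line.
random.seed(0)
xs = [random.uniform(-1, 1) for _ in range(256)]
ys = [3.0 * x + 0.5 + random.gauss(0, 0.1) for x in xs]
w, b = 0.2, -0.1  # arbitrary current parameters

# Gradient computed in one go over the full batch of 256.
full_gw, full_gb = grad_mse(w, b, xs, ys)

# Gradient accumulated over micro-batches of 4.
micro_batch_size = 4
accum_steps = len(xs) // micro_batch_size
acc_gw = acc_gb = 0.0
for step in range(accum_steps):
    lo = step * micro_batch_size
    hi = lo + micro_batch_size
    gw, gb = grad_mse(w, b, xs[lo:hi], ys[lo:hi])
    # Scale each micro-batch gradient so the total is a mean over all 256 samples.
    acc_gw += gw / accum_steps
    acc_gb += gb / accum_steps

max_diff = max(abs(acc_gw - full_gw), abs(acc_gb - full_gb))
print(accum_steps, max_diff)
```

Any remaining difference comes only from floating-point summation order, which is the kind of small numerical discrepancy to expect in this setting.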

I guess there would be more numerical accuracy / stability issues. The results may differ a bit from the “true” batch_size=256, but not by much.

I don’t think you need to adjust the learning rate; use the same value you would use for batch_size = 256.

Also remember that this is deep learning: even if you are sure everything is going to be alright, it may be quite different in practice. Just try it out and see.

Hi, thanks for that. I was thinking along those lines, but I am still unsure about the maths. If it is purely additive, shouldn't the performance be about the same even with a batch size of 1 and 64 accumulation steps? If it is worse, is that down to numerical stability, or are you adding up lots of variance you wouldn't have by taking the gradient in one go?

Maybe you are right and it is only the batchnorm layers, so if you had no batch norm in your model it would be the same.
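The batch norm point can be illustrated with a small sketch in plain Python. It normalizes the same made-up activations once using statistics of the whole batch and once using statistics of each micro-batch separately, which is what happens under gradient accumulation. The `normalize` helper, the `eps` value, and the numbers are assumptions for illustration, and the learned scale and shift of a real batch norm layer are left out.

```python
import statistics


def normalize(values, eps=1e-5):
    """Batch-norm style normalization using the mean and variance of `values`."""
    mean = statistics.fmean(values)
    var = statistics.pvariance(values)  # population variance over the batch
    return [(v - mean) / (var + eps) ** 0.5 for v in values]


# Hypothetical activations of one feature for a "large" batch of 8 samples.
activations = [0.1, 0.4, 2.0, 2.3, -1.0, -0.7, 5.0, 5.2]

# Statistics taken over the whole batch at once.
full = normalize(activations)

# Statistics taken per micro-batch of 2, as under gradient accumulation.
micro_batch_size = 2
per_micro = []
for lo in range(0, len(activations), micro_batch_size):
    per_micro.extend(normalize(activations[lo:lo + micro_batch_size]))

max_gap = max(abs(a - b) for a, b in zip(full, per_micro))
print(max_gap)
```

The normalized values differ because each micro-batch only sees its own mean and variance, so the forward pass (and therefore the gradient) is no longer the same as for the large batch. With a micro-batch of 1 the batch variance is zero, which is the extreme case of this effect.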

Anyway, I ask because it's taking me ages to train one model, so I don't want to try out too many learning rates.
Sam Shleifer said here (transformers/examples/research_projects/seq2seq-distillation at master · huggingface/transformers · GitHub), in the tips for Pegasus: "reduce batch size, and increase gradient accumulation steps so that the product gpus * batch size * gradient_accumulation_steps = 256. We used --learning-rate = 1e-4 * gradient accumulation steps."

So that’s why I was wondering if you need to increase the learning rate by that equation.
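For reference, the arithmetic for my case works out as in the sketch below. The helper name `accumulation_steps` and its parameters are made up for illustration, and the learning rate line only reproduces the formula from the quoted Pegasus tip; whether that scaling is needed in general is exactly the open question.

```python
def accumulation_steps(target_batch, per_device_batch, num_gpus=1):
    """Number of accumulation steps so that
    num_gpus * per_device_batch * steps == target_batch."""
    per_step = per_device_batch * num_gpus
    if target_batch % per_step != 0:
        raise ValueError("target batch is not a multiple of the per-step batch")
    return target_batch // per_step


# One GPU that fits a batch size of 4, aiming for an effective batch size of 256.
steps = accumulation_steps(target_batch=256, per_device_batch=4, num_gpus=1)

# Learning rate according to the quoted tip only (not a general rule).
quoted_lr = 1e-4 * steps

print(steps, quoted_lr)
```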