Batch size vs gradient accumulation


I have a basic theoretical question: which option is better for the model and for GPU usage?

First option:

--per_device_train_batch_size 8 
--gradient_accumulation_steps 2

Second option:

--per_device_train_batch_size 16


If the second option does not run out of memory (OOM), you should get better performance with it. The first is a way to get around the memory error the second would otherwise give you.

Otherwise, the two commands are completely equivalent in terms of the training performed.
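The equivalence follows from simple arithmetic: the effective batch size per optimizer step is the product of the two flags (a quick check; the variable names below are just illustrative, not Trainer internals):

```python
# Effective batch size per optimizer step = micro-batch size * accumulation steps
per_device = 8
accum_steps = 2
effective = per_device * accum_steps
assert effective == 16  # same as --per_device_train_batch_size 16 alone
```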


Hi @sgugger

It would be better if you could supply some references on this point.

A source is not necessary for this, I think. The goal of gradient accumulation is precisely to overcome the memory constraints of the hardware.

So where does the performance difference between training with and without gradient accumulation come from, as @sgugger mentioned in his answer?

I am not sure it involves hardware only.

With gradient accumulation, you loop over the forward and backward passes (the number of iterations in the loop being the number of gradient accumulation steps). A for loop over the model is less efficient than feeding more data to it at once, as you are not taking full advantage of the parallelization your hardware can offer.

The only reason to use gradient accumulation steps is when your whole batch does not fit on one GPU, so you pay a price in speed to work around a memory limitation.
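To make the loop concrete, here is a minimal pure-Python sketch. The one-parameter linear model, hand-derived gradient, and function names are all hypothetical, not the Trainer's actual implementation; it only illustrates the accumulation loop and why, per optimizer step, the result matches a single full batch:

```python
# Hypothetical toy model: y ≈ w * x with mean-squared-error loss.
def grad_mse(w, pairs):
    # d/dw of mean((w*x - y)^2) over the pairs = mean(2*x*(w*x - y))
    return sum(2 * x * (w * x - y) for x, y in pairs) / len(pairs)

def train_step(w, batch, accum_steps=1, lr=0.1):
    micro = len(batch) // accum_steps
    grad = 0.0
    for i in range(accum_steps):  # the for loop gradient accumulation adds
        micro_batch = batch[i * micro:(i + 1) * micro]
        # Scale by 1/accum_steps so the summed gradient equals the
        # full-batch mean gradient.
        grad += grad_mse(w, micro_batch) / accum_steps
    return w - lr * grad  # one optimizer step per effective batch

batch = [(1.0, 2.0), (2.0, 3.0), (3.0, 5.0), (4.0, 7.0)]
w_full = train_step(0.5, batch, accum_steps=1)   # one batch of 4
w_accum = train_step(0.5, batch, accum_steps=2)  # two micro-batches of 2
assert abs(w_full - w_accum) < 1e-9  # identical update either way
```

Per optimizer step the update is identical; the cost is that the (smaller) forward/backward pass runs `accum_steps` times in a loop, which is the speed penalty described above.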