Batch size vs gradient accumulation

Hi,

I have a basic theoretical question: which option is better for the model and for GPU usage?

First option:

--per_device_train_batch_size 8
--gradient_accumulation_steps 2

Second option:

--per_device_train_batch_size 16

1 Like

If the second option does not OOM, you should get better performance with it. The first option is a way to get around the memory error the second would otherwise give you.

Otherwise, the two commands are completely equivalent in terms of the training done.
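
To make the equivalence concrete, here is a rough sketch (assuming a single GPU; with several devices both options scale by the device count in the same way): the number of examples contributing to each optimizer step is the same.

```python
# Effective batch size per optimizer step (single-GPU assumption):
# per_device_train_batch_size * gradient_accumulation_steps * number of devices.
def effective_batch_size(per_device_batch_size, grad_accum_steps, num_devices=1):
    return per_device_batch_size * grad_accum_steps * num_devices

print(effective_batch_size(8, 2))   # first option  -> 16
print(effective_batch_size(16, 1))  # second option -> 16
```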

4 Likes

Hi @sgugger

It would be better if you could supply some references on this point.

A source is not necessary for this, I think. The goal of gradient accumulation is precisely to overcome the memory constraints of the hardware.

So where does the performance difference between training with GA and without GA, which @sgugger mentioned in his answer, come from?

I am not sure it is only about the hardware.

Using gradient accumulation means looping over your forward and backward passes (the number of iterations in the loop being the number of gradient accumulation steps). A for loop over the model is less efficient than feeding more data to the model at once, as you’re not taking advantage of the parallelization your hardware can offer.

The only reason to use gradient accumulation steps is when your whole batch size does not fit on one GPU, so you pay a price in terms of speed to overcome a memory issue.
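
To illustrate the loop described above, here is a minimal, self-contained PyTorch sketch; the tiny linear model, random data, and hyperparameters are placeholders for illustration, not anything from this thread:

```python
import torch
from torch import nn

# Gradient accumulation: several small forward/backward passes per optimizer step.
torch.manual_seed(0)
model = nn.Linear(4, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()

accum_steps = 2  # --gradient_accumulation_steps 2
micro_batches = [(torch.randn(8, 4), torch.randn(8, 2)) for _ in range(accum_steps)]

optimizer.zero_grad()
for step, (inputs, targets) in enumerate(micro_batches):  # micro-batches of size 8
    loss = loss_fn(model(inputs), targets) / accum_steps  # scale so the accumulated
    loss.backward()                                       # gradient matches one batch of 16
    if (step + 1) % accum_steps == 0:
        optimizer.step()                                   # one update per accum_steps passes
        optimizer.zero_grad()
```

Each iteration of the loop launches its own forward and backward pass, which is why it is slower than a single pass over a batch of 16 that fills the GPU.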

3 Likes