As far as I am aware, the common rule of thumb for selecting the batch size is "as big as your hardware can support". For example, the most recent leaks concerning GPT-4's training suggest that a staggering batch size of 60M was used. This makes me wonder how an engineer should balance the batch size and gradient accumulation steps hyperparameters. At what point do the potential drawbacks of increasing the number of gradient accumulation steps (e.g., longer wall-clock time per effective batch) outweigh the benefits of a large effective batch size? (I guess this particular question boils down to the clear throughput benefits of large physical batch sizes vs. the possible convergence benefits of large effective batch sizes.)
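For reference, the pattern I have in mind is the standard accumulation loop, something like the toy PyTorch sketch below (the model, data, and the `micro_batch_size`/`accumulation_steps` names are just placeholders I made up for illustration):

```python
import torch
import torch.nn as nn

# Toy setup purely to illustrate the mechanics: a tiny linear
# classifier on random data (placeholder model and data).
model = nn.Linear(10, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

micro_batch_size = 8      # what fits in device memory per forward/backward
accumulation_steps = 4    # effective batch size = 8 * 4 = 32
data = [(torch.randn(micro_batch_size, 10),
         torch.randint(0, 2, (micro_batch_size,)))
        for _ in range(16)]

optimizer.zero_grad()
for step, (x, y) in enumerate(data):
    loss = loss_fn(model(x), y)
    # Scale the loss so the accumulated gradient averages over the
    # effective batch instead of summing across micro-batches.
    (loss / accumulation_steps).backward()
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()      # one optimizer step per effective batch
        optimizer.zero_grad()
```

Gradient-wise this should match a single pass over a batch of 32, but it costs `accumulation_steps` sequential forward/backward passes per optimizer step, which is the trade-off I'm asking about.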