Selecting batch_size and gradient_accumulation_steps when fine-tuning

Hello!

As I understand it, a larger batch_size makes training a model faster but takes up more memory. Hence, increasing the batch size beyond what the GPU can hold results in an Out of Memory (OOM) error. To counteract this, we can reduce the memory footprint using gradient accumulation (albeit at the cost of speed).

In general, for batch_size = N and gradient_accumulation_steps = K, we have an effective batch size of N * K.
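
To make the setup concrete, here is a minimal PyTorch-style sketch of what I mean by accumulation (the model, data, and values of N and K are just placeholders): gradients from K micro-batches of size N are accumulated before a single optimizer step.

```python
import torch

# Placeholder model, optimizer, and data; the dataloader yields
# micro-batches of size N, and we accumulate over K of them.
N, K = 2, 64
model = torch.nn.Linear(16, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
dataset = torch.utils.data.TensorDataset(torch.randn(256, 16), torch.randn(256, 1))
dataloader = torch.utils.data.DataLoader(dataset, batch_size=N)

optimizer.zero_grad()
for step, (inputs, targets) in enumerate(dataloader):
    loss = torch.nn.functional.mse_loss(model(inputs), targets)
    # Scale the loss so the accumulated gradient matches the average
    # over an effective batch of size N * K.
    (loss / K).backward()
    if (step + 1) % K == 0:
        optimizer.step()        # one parameter update per K micro-batches
        optimizer.zero_grad()
```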

My query is the following: say I want to have an effective batch size of 128, is there any tangible difference between N = 64, K = 2 and N = 2, K = 64? If so, which is better?

In a similar vein, is there any rule of thumb dictating this tradeoff? For example, is N = 64, K = 2 better than N = 32, K = 4? Or than N = 16, K = 8?

Thank you for taking the time to help!

To make the best use of your GPU memory while keeping training fast, maximize N and minimize K: find the largest per-device batch size N that fits in GPU memory, then set K so that N * K equals your desired effective batch size. For instance, if the maximum N that fits on your GPU is 64 and you want an effective batch size of 128, use K = 2. Each accumulation step adds an extra forward/backward pass for the same optimizer update, so a smaller K means better GPU utilization and faster training.
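
For concreteness, here is a minimal sketch of that recommendation, assuming you are fine-tuning with the Hugging Face Trainer; the specific numbers (64 and 128) are placeholders you would replace with what fits on your hardware and your target effective batch size.

```python
from transformers import TrainingArguments

# Assumption: 64 is the largest per-device batch size found to fit on your GPU,
# and 128 is the effective batch size you want.
target_effective_batch_size = 128
max_per_device_batch_size = 64

training_args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=max_per_device_batch_size,
    # Derive K from the target so that N * K = 128; here K = 2.
    gradient_accumulation_steps=target_effective_batch_size // max_per_device_batch_size,
)
```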