Selecting batch_size and gradient_accumulation_steps when fine-tuning

Hello!

As I understand it, a larger batch_size makes training a model faster but takes up more memory. Hence, increasing the batch size beyond what the GPU can hold results in an Out of Memory (OOM) error. To counteract this, we can reduce the memory footprint using gradient accumulation (albeit at the cost of speed).

In general, for batch_size = N and gradient_accumulation_steps = K, we have an effective batch size of N * K.
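
To make the setup concrete, here is a minimal PyTorch-style sketch of what I mean by accumulation (the model, data, and values of N and K are just placeholders): gradients from K micro-batches of size N are accumulated before a single optimizer step.

```python
import torch

# Placeholder model, optimizer, and data; the dataloader yields
# micro-batches of size N, and we accumulate over K of them.
N, K = 2, 64
model = torch.nn.Linear(16, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
dataset = torch.utils.data.TensorDataset(torch.randn(256, 16), torch.randn(256, 1))
dataloader = torch.utils.data.DataLoader(dataset, batch_size=N)

optimizer.zero_grad()
for step, (inputs, targets) in enumerate(dataloader):
    loss = torch.nn.functional.mse_loss(model(inputs), targets)
    # Scale the loss so the accumulated gradient matches the average
    # over an effective batch of size N * K.
    (loss / K).backward()
    if (step + 1) % K == 0:
        optimizer.step()        # one parameter update per K micro-batches
        optimizer.zero_grad()
```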

My query is the following: say I want to have an effective batch size of 128, is there any tangible difference between N = 64, K = 2 and N = 2, K = 64? If so, which is better?

In a similar vein, is there any rule of thumb dictating this tradeoff? For example, is N = 64, K = 2 better than N = 32, K = 4? Or than N = 16, K = 8?

Thank you for taking the time to help!

To make the best use of your GPU memory while keeping training fast, maximize N and minimize K: find the largest per-device batch size N that fits in GPU memory, then set K so that N * K equals your desired effective batch size. For instance, if the maximum N that fits on your GPU is 64 and you want an effective batch size of 128, use K = 2. Each accumulation step adds an extra forward/backward pass for the same optimizer update, so a smaller K means better GPU utilization and faster training.
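
For concreteness, here is a minimal sketch of that recommendation, assuming you are fine-tuning with the Hugging Face Trainer; the specific numbers (64 and 128) are placeholders you would replace with what fits on your hardware and your target effective batch size.

```python
from transformers import TrainingArguments

# Assumption: 64 is the largest per-device batch size found to fit on your GPU,
# and 128 is the effective batch size you want.
target_effective_batch_size = 128
max_per_device_batch_size = 64

training_args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=max_per_device_batch_size,
    # Derive K from the target so that N * K = 128; here K = 2.
    gradient_accumulation_steps=target_effective_batch_size // max_per_device_batch_size,
)
```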