Selecting batch_size and gradient_accumulation_steps when fine-tuning


As I understand it, a larger batch_size makes training faster but uses more GPU memory, so beyond some point, increasing the batch size results in an Out of Memory (OOM) error. To work around this, we can reduce the memory footprint using gradient accumulation (albeit at some cost in speed).

In general, for batch_size = N and gradient_accumulation_steps = K, we have an effective batch size of N * K.
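To make that concrete, here is a minimal PyTorch sketch (toy model and random data, not from any particular fine-tuning setup): it accumulates gradients over K micro-batches of size N, scaling each loss by 1/K, and checks that the result matches the gradient of a single pass over the full effective batch of N * K samples.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Toy model and data, purely illustrative.
model = nn.Linear(10, 1)
loss_fn = nn.MSELoss()

N, K = 2, 4                                   # micro-batch size N, accumulation steps K
x, y = torch.randn(N * K, 10), torch.randn(N * K, 1)

# Accumulate gradients over K micro-batches of size N.
model.zero_grad()
for k in range(K):
    xb, yb = x[k * N:(k + 1) * N], y[k * N:(k + 1) * N]
    (loss_fn(model(xb), yb) / K).backward()   # divide by K so the sum averages over N * K
accum_grad = model.weight.grad.clone()
# (in a real loop, optimizer.step() + zero_grad() would go here, once per K micro-batches)

# Same gradient as one pass over the full effective batch of N * K samples.
model.zero_grad()
loss_fn(model(x), y).backward()
full_grad = model.weight.grad.clone()

print(torch.allclose(accum_grad, full_grad, atol=1e-6))   # -> True
```

This is why the arithmetic N * K is the right notion of effective batch size: the parameter update is (numerically) the same, only the peak activation memory differs, since each backward pass only holds activations for N samples at a time.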

My query is the following: say I want to have an effective batch size of 128, is there any tangible difference between N = 64, K = 2 and N = 2, K = 64? If so, which is better?

In a similar vein, is there any rule of thumb dictating this tradeoff? E.g., is N = 64, K = 2 better than N = 32, K = 4? Or than N = 16, K = 8?

Thank you for taking the time to help!