How to choose optimal batch size for training LLMs?

Hmm :thinking: can’t you always increase the effective batch size using gradient accumulation? Something like

if batch_idx % 8 == 7:       # inside the training loop: step once every 8 mini-batches
    optimizer.step()
    optimizer.zero_grad()    # reset the accumulated gradients after each update

and, since the accumulated gradients are summed rather than averaged, you can just scale the learning rate down to match:

lr = lr/8
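
For what it’s worth, the other common way to get the same effect is to keep the learning rate and divide the loss by the accumulation factor instead, so the accumulated gradient matches the mean over the big batch. A rough sketch (just my assumption of a standard PyTorch loop; model, loader, optimizer, and compute_loss are placeholders):

    accum_steps = 8

    optimizer.zero_grad()
    for batch_idx, batch in enumerate(loader):
        loss = compute_loss(model, batch)       # placeholder loss function
        (loss / accum_steps).backward()         # scale the loss instead of the lr
        if batch_idx % accum_steps == accum_steps - 1:
            optimizer.step()                    # one update per accum_steps mini-batches
            optimizer.zero_grad()

With plain SGD this is exactly equivalent to summing the gradients and using lr / 8; with adaptive optimizers like Adam the two differ slightly, and scaling the loss is the more common convention.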

Also, adding more cards has the same effect: you can basically use 8 GPUs with batch_size = 1 and it is the same as 1 GPU with batch_size = 8, assuming gradients are averaged across devices the way data parallelism normally does it.
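
For reference, here’s a minimal sketch of what I mean by the multi-GPU case, assuming PyTorch DistributedDataParallel launched with torchrun; build_model, dataset, and compute_loss are placeholders. DDP all-reduces (averages) gradients across ranks during backward(), which is what makes 8 × batch_size 1 behave like 1 × batch_size 8:

    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP
    from torch.utils.data import DataLoader, DistributedSampler

    dist.init_process_group(backend="nccl")                   # one process per GPU
    local_rank = dist.get_rank() % torch.cuda.device_count()
    torch.cuda.set_device(local_rank)

    model = DDP(build_model().cuda(), device_ids=[local_rank])    # build_model() is a placeholder
    sampler = DistributedSampler(dataset)                         # dataset is a placeholder
    loader = DataLoader(dataset, batch_size=1, sampler=sampler)   # per-device batch_size = 1
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for batch in loader:
        loss = compute_loss(model, batch)   # placeholder loss function
        loss.backward()                     # DDP averages gradients across ranks here
        optimizer.step()
        optimizer.zero_grad()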

So I think the question really is: what batch size (or effective batch size, i.e. GPU_num * gradient_accumulation_steps * batch_size) is optimal?
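
Just to make that arithmetic concrete (numbers are made up, not from any particular setup):

    # effective batch size = GPU_num * gradient_accumulation_steps * batch_size
    effective_batch_size = 8 * 8 * 1    # 8 GPUs, 8 accumulation steps, 1 sample/device = 64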
