How to choose optimal batch size for training LLMs?

Hmm :thinking: can’t you always increase the effective batch size using gradient accumulation? Something like

if batch_idx % 8 == 7:       # inside the training loop: step once every 8 mini-batches
    optimizer.step()
    optimizer.zero_grad()    # reset the accumulated gradients after each update

and, since the accumulated gradients are summed rather than averaged, you can just scale the learning rate down to match:

lr = lr/8
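
For what it’s worth, the other common way to get the same effect is to keep the learning rate and divide the loss by the accumulation factor instead, so the accumulated gradient matches the mean over the big batch. A rough sketch (just my assumption of a standard PyTorch loop; model, loader, optimizer, and compute_loss are placeholders):

    accum_steps = 8

    optimizer.zero_grad()
    for batch_idx, batch in enumerate(loader):
        loss = compute_loss(model, batch)       # placeholder loss function
        (loss / accum_steps).backward()         # scale the loss instead of the lr
        if batch_idx % accum_steps == accum_steps - 1:
            optimizer.step()                    # one update per accum_steps mini-batches
            optimizer.zero_grad()

With plain SGD this is exactly equivalent to summing the gradients and using lr / 8; with adaptive optimizers like Adam the two differ slightly, and scaling the loss is the more common convention.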

Also, adding more cards has the same effect: you can basically use 8 GPUs with batch_size = 1 and it is the same as 1 GPU with batch_size = 8, assuming gradients are averaged across devices the way data parallelism normally does it.
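
For reference, here’s a minimal sketch of what I mean by the multi-GPU case, assuming PyTorch DistributedDataParallel launched with torchrun; build_model, dataset, and compute_loss are placeholders. DDP all-reduces (averages) gradients across ranks during backward(), which is what makes 8 × batch_size 1 behave like 1 × batch_size 8:

    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP
    from torch.utils.data import DataLoader, DistributedSampler

    dist.init_process_group(backend="nccl")                   # one process per GPU
    local_rank = dist.get_rank() % torch.cuda.device_count()
    torch.cuda.set_device(local_rank)

    model = DDP(build_model().cuda(), device_ids=[local_rank])    # build_model() is a placeholder
    sampler = DistributedSampler(dataset)                         # dataset is a placeholder
    loader = DataLoader(dataset, batch_size=1, sampler=sampler)   # per-device batch_size = 1
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for batch in loader:
        loss = compute_loss(model, batch)   # placeholder loss function
        loss.backward()                     # DDP averages gradients across ranks here
        optimizer.step()
        optimizer.zero_grad()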

So I think the question really is: what batch size (or effective batch size, i.e. GPU_num * gradient_accumulation_steps * batch_size) is optimal?
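
Just to make that arithmetic concrete (numbers are made up, not from any particular setup):

    # effective batch size = GPU_num * gradient_accumulation_steps * batch_size
    effective_batch_size = 8 * 8 * 1    # 8 GPUs, 8 accumulation steps, 1 sample/device = 64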
