I’m working on a project that involves pretraining GPT-2-style transformers up to GPT-2 Medium size (using DDP on 8 A6000s, up to 15B tokens, with AdamW). I’m unsure how to choose the right batch size for this, and was wondering if anyone has thoughts?
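For concreteness, here’s roughly how I’m counting tokens per optimizer step under DDP (just a sketch; the specific numbers and variable names are placeholders for my setup, not a recommendation):

```python
# Global batch size in tokens under DDP, assuming one process per GPU.
# All values below are placeholders for my setup.

world_size = 8            # 8 A6000s, one DDP process each
micro_batch_size = 16     # sequences per GPU per forward pass (assumed)
grad_accum_steps = 2      # micro-batches accumulated per optimizer step (assumed)
seq_len = 1024            # GPT-2 context length

tokens_per_step = world_size * micro_batch_size * grad_accum_steps * seq_len
print(tokens_per_step)    # 262144, i.e. ~250K tokens per optimizer step
```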
Papers like the GPT-3 paper seem to use a batch size of ~250K tokens (so 250 sequences of ~1000 tokens, or 125 sequences of ~2000 tokens) for models of GPT-2 Small size, but empirically batch sizes closer to 100K seem to do much better early in training. But maybe the smaller batches make training noisier, which raises the loss the model eventually converges to? I’m unsure!
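One thing I’m considering, given that smaller batches seem to help early on: the GPT-3 paper ramps the batch size linearly from a small value (32K tokens) up to the full value over the first 4–12B tokens of training. Here’s a sketch of what that schedule might look like (the function and its defaults are my own guesses at reproducing it, not anything from a released implementation):

```python
def batch_size_at(tokens_seen: int,
                  start: int = 32_768,          # ~32K-token starting batch (from the GPT-3 appendix)
                  full: int = 262_144,          # full ~250K-token batch (my target, see above)
                  warmup_tokens: int = 4_000_000_000) -> int:
    """Linear batch-size warmup in tokens, GPT-3-style (my reconstruction)."""
    if tokens_seen >= warmup_tokens:
        return full
    frac = tokens_seen / warmup_tokens
    return int(start + frac * (full - start))

# e.g. 1B tokens into a 4B-token ramp, the batch would be ~90K tokens:
print(batch_size_at(1_000_000_000))  # 90112
```

In practice I guess you’d quantize whatever this returns to a multiple of world_size * micro_batch_size * seq_len by varying grad_accum_steps, since that’s the granularity DDP actually gives you.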