How to choose optimal batch size for training LLMs?

I’m working on a project that involves pre-training GPT-2 style transformers up to GPT-2 medium size (using DDP on 8 A6000s, up to 15B tokens, with AdamW). I’m somewhat confused about how to choose the right batch size for this, and was wondering if anyone has thoughts?

Papers like the GPT-3 paper seem to use a batch size of ~250K tokens (so about 250 sequences of 1,000 tokens, or 125 sequences of 2,000 tokens) for models of GPT-2 Small size, but empirically batch sizes closer to 100K tokens seem to do much better early in training. Maybe smaller batches make training noisier, so the loss converges to a higher final value? I’m unsure!

The main paper I’ve found that seems to address this is An Empirical Model of Large-Batch Training, which reasons about finding the maximum sensible batch size, but I’m not sure how to interpret it with regard to what happens if you choose a smaller batch size. Does it even matter?
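For reference, here is a short summary of the relations that paper works with, written as a sketch of its simplified model, so the notation should be checked against the paper itself. Here $G$ is the true gradient, $\Sigma$ the per-example gradient covariance, $B$ the batch size, $B_{\text{noise}}$ the gradient noise scale (with $B_{\text{simple}}$ as its simplified estimate), $\epsilon$ the learning rate, and $S$ and $E$ the number of optimizer steps and training examples needed to reach a fixed loss:

```latex
B_{\text{simple}} = \frac{\operatorname{tr}(\Sigma)}{|G|^{2}}, \qquad
\epsilon_{\text{opt}}(B) = \frac{\epsilon_{\max}}{1 + B_{\text{noise}}/B}, \qquad
\left(\frac{S}{S_{\min}} - 1\right)\left(\frac{E}{E_{\min}} - 1\right) = 1, \qquad
B_{\text{crit}} = \frac{E_{\min}}{S_{\min}}
```

Read this way, a batch size well below $B_{\text{crit}}$ keeps $E$ close to $E_{\min}$ (little wasted compute) but needs many more steps, while a batch size well above it saves few additional steps and spends more examples. Under this model, a smaller batch mainly costs wall-clock time rather than compute.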

Hey @NeelNanda, I’m facing a similar question at the moment. According to papers like GPT-3, the more tokens per batch the better, but that doesn’t sit well with other papers, or with the general theory of deep learning models as I understand it.

Did you get any conclusions on this? I’d love to hear if you do! :slight_smile:


From this repository:

Summary: The batch size governs the training speed and shouldn’t be used to directly tune the validation set performance. Often, the ideal batch size will be the largest batch size supported by the available hardware.

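Hmm, :thinking: can’t you always increase the effective batch size using gradient accumulation? For example, only step the optimizer every 8th micro-batch:

if batch_idx % 8 == 7:
    optimizer.step()
    optimizer.zero_grad()

and, because the gradients are then summed over 8 micro-batches, you can just reduce the lr to

lr = lr / 8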
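To make that claim concrete, here is a minimal, framework-free sketch for plain SGD on a toy one-parameter regression. The model, the data, and the function names are all made up for illustration. It shows that summing gradients over 8 micro-batches and stepping with `lr / 8` gives the same update as one step on the full batch with `lr`. Note that this equivalence relies on plain SGD: AdamW's update is roughly invariant to the overall gradient scale, so with AdamW the usual approach is to divide each micro-batch loss by the number of accumulation steps and leave the learning rate alone.

```python
import random


def micro_batch_grad(w, xs, ys):
    """Gradient of the mean squared error of y ~ w * x over one (micro-)batch."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n


def sgd_full_batch(w, xs, ys, lr):
    """One plain SGD step using the mean gradient over the whole batch."""
    return w - lr * micro_batch_grad(w, xs, ys)


def sgd_accumulated(w, xs, ys, lr, accum_steps):
    """Sum micro-batch gradients, then take one step with lr / accum_steps.

    Assumes len(xs) is divisible by accum_steps so micro-batches are equal-sized.
    """
    micro = len(xs) // accum_steps
    grad_sum = 0.0
    for batch_idx in range(accum_steps):
        lo, hi = batch_idx * micro, (batch_idx + 1) * micro
        grad_sum += micro_batch_grad(w, xs[lo:hi], ys[lo:hi])
        # Only update the weight on the last micro-batch of the group.
        if batch_idx % accum_steps == accum_steps - 1:
            w = w - (lr / accum_steps) * grad_sum
            grad_sum = 0.0
    return w


# Toy data (hypothetical): y = 3x plus a little noise.
rng = random.Random(0)
xs = [rng.uniform(-1, 1) for _ in range(64)]
ys = [3.0 * x + rng.gauss(0, 0.1) for x in xs]

w_full = sgd_full_batch(0.5, xs, ys, lr=0.1)
w_accum = sgd_accumulated(0.5, xs, ys, lr=0.1, accum_steps=8)
print(w_full, w_accum)  # the two values agree up to floating-point rounding
```

In a real PyTorch loop the same idea usually appears as `(loss / accum_steps).backward()` on every micro-batch, with `optimizer.step()` and `optimizer.zero_grad()` called once per group.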

Adding more cards has the same effect: using 8 GPUs with batch_size = 1 is basically the same as using 1 GPU with batch_size = 8.

So I think the real question is which effective batch size (i.e. GPU_num * Gradient Accumulation * Batch) is optimal.
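As a rough sketch of that arithmetic, the helpers below compute the effective batch size in tokens and the number of accumulation steps needed to reach a target. The function names are made up, and the per-GPU batch of 4 and sequence length of 1024 are assumed values for illustration; only the 8 GPUs and the ~250K-token target come from this thread.

```python
import math


def effective_batch_tokens(num_gpus, grad_accum_steps, per_gpu_batch, seq_len):
    """Tokens consumed per optimizer step across all GPUs and micro-batches."""
    return num_gpus * grad_accum_steps * per_gpu_batch * seq_len


def grad_accum_steps_for(target_tokens, num_gpus, per_gpu_batch, seq_len):
    """Smallest number of accumulation steps that reaches at least target_tokens."""
    tokens_per_micro_step = num_gpus * per_gpu_batch * seq_len
    return max(1, math.ceil(target_tokens / tokens_per_micro_step))


# Assumed setup: 8 GPUs, 4 sequences per GPU, 1024 tokens per sequence.
steps = grad_accum_steps_for(
    target_tokens=250_000, num_gpus=8, per_gpu_batch=4, seq_len=1024
)
tokens = effective_batch_tokens(
    num_gpus=8, grad_accum_steps=steps, per_gpu_batch=4, seq_len=1024
)
print(steps, tokens)
```

Because the step count is rounded up, the resulting effective batch can be slightly larger than the target.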
