I’m working on a project that involves pre-training GPT-2 style transformers up to GPT-2 medium size (using DDP on 8 A6000s, up to 15B tokens, with AdamW). I’m somewhat confused about how to choose the right batch size for this, and was wondering if anyone has thoughts?
Papers like the GPT-3 paper seem to use a batch size of ~250K tokens (so 250 sequences of 1000 tokens, or 125 sequences of 2000 tokens) for models of GPT-2 Small size, but empirically batch sizes of more like 100K seem to do much better early in training. But maybe the smaller batches make training noisier, which raises the loss the model eventually converges to? I’m unsure!
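For concreteness, here’s the rough arithmetic I’m using for tokens per optimizer step under DDP (the specific micro-batch and accumulation numbers below are just an example, not my actual config):

```python
# Rough arithmetic for tokens per optimizer step under DDP.
# The specific numbers below are illustrative, not a recommendation.
n_gpus = 8            # one DDP process per A6000
micro_batch = 16      # sequences per GPU per forward/backward
grad_accum = 2        # gradient accumulation steps per optimizer step
seq_len = 1024        # GPT-2 context length

tokens_per_step = n_gpus * micro_batch * grad_accum * seq_len
print(tokens_per_step)  # 8 * 16 * 2 * 1024 = 262,144, i.e. ~250K tokens
```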
The main paper I’ve found which seems to think about this is An Empirical Model of Large-Batch Training, which reasons about finding the maximum sensible batch size, but I’m not sure how to interpret what it implies if you choose a smaller batch size - does it even matter?
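For reference, the quantity that paper revolves around is the “simple noise scale” B_simple = tr(Σ)/|G|², which it estimates from gradient norms measured at two different batch sizes (Appendix A of the paper). A rough sketch of that estimator, with function and variable names of my own:

```python
# Sketch of the "simple noise scale" estimator from McCandlish et al.,
# An Empirical Model of Large-Batch Training (Appendix A).
# B_small / B_big are two batch sizes; g_small_sq / g_big_sq are measured
# squared L2 norms of the gradient at those batch sizes. The paper suggests
# averaging these estimates over many measurements before taking the ratio.

def simple_noise_scale(B_small, B_big, g_small_sq, g_big_sq):
    # Unbiased estimate of the true (full-batch) squared gradient norm |G|^2.
    g_true_sq = (B_big * g_big_sq - B_small * g_small_sq) / (B_big - B_small)
    # Unbiased estimate of tr(Sigma), the trace of the per-example
    # gradient covariance.
    trace_sigma = (g_small_sq - g_big_sq) / (1.0 / B_small - 1.0 / B_big)
    # B_simple = tr(Sigma) / |G|^2, their estimate of the critical batch size.
    return trace_sigma / g_true_sq
```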
Hey @NeelNanda, I’m facing a similar question at the moment: according to papers like GPT-3, more tokens per batch is better, but that doesn’t sit well with other papers, or with the theory of deep learning models in general as I understand it.
Did you reach any conclusions on this? I’d love to hear them if you did!
Summary: The batch size governs the training speed and shouldn’t be used to directly tune the validation set performance. Often, the ideal batch size will be the largest batch size supported by the available hardware.
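In practice, the implication is that if you change the batch size you should expect to re-tune the other hyperparameters, the learning rate most of all. A rough sketch of the scaling heuristics people often start from (just a starting point, not something the playbook prescribes):

```python
# If the batch size changes, the peak learning rate usually needs re-tuning.
# A common starting-point heuristic (not a guarantee, and worth re-tuning
# around) is linear scaling for SGD and square-root scaling, which is often
# suggested for Adam/AdamW.

def rescale_lr(base_lr, base_batch_tokens, new_batch_tokens, rule="sqrt"):
    ratio = new_batch_tokens / base_batch_tokens
    if rule == "linear":
        return base_lr * ratio
    return base_lr * ratio ** 0.5  # sqrt rule

# e.g. going from ~250K to ~500K tokens per batch with AdamW:
print(rescale_lr(6e-4, 250_000, 500_000))  # ~8.5e-4
```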