How to choose an optimal batch size for training LLMs?

I’m working on a project that involves pre-training GPT-2-style transformers up to GPT-2 Medium size (using DDP on 8 A6000s, training on up to 15B tokens, with AdamW). I’m somewhat confused about how to choose the right batch size for this setup, and was wondering if anyone has thoughts?
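For concreteness, here is roughly how I’m thinking about the knobs involved. This is a minimal sketch rather than my actual config: the per-GPU micro-batch size, gradient accumulation steps, and sequence length below are just illustrative placeholders.

```python
# Minimal sketch of how the batch-size knobs relate under DDP.
# All concrete numbers are illustrative placeholders, not my real config.

world_size = 8            # one process per A6000
micro_batch_size = 16     # sequences per GPU per forward/backward pass
grad_accum_steps = 2      # micro-batches accumulated before each optimizer step
seq_len = 1024            # GPT-2 context length

sequences_per_step = world_size * micro_batch_size * grad_accum_steps
tokens_per_step = sequences_per_step * seq_len

total_tokens = 15_000_000_000
optimizer_steps = total_tokens // tokens_per_step

print(f"tokens per optimizer step: {tokens_per_step:,}")        # 262,144 (~256K)
print(f"optimizer steps over 15B tokens: {optimizer_steps:,}")
```

The thing that makes me think the choice matters is that, for a fixed token budget, the batch size directly sets how many optimizer steps the run gets.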

Papers like the GPT-3 paper seem to use a batch size of ~250K tokens (so 250 sequences of 1,000 tokens, or 125 sequences of 2,000 tokens) for models of GPT-2 Small size, but empirically batch sizes of more like 100K tokens seem to do much better early in training. But maybe the smaller batches also make training noisier, which raises the loss the run eventually converges to? I’m unsure!
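For what it’s worth, when I compare runs like this I try to plot loss against tokens seen rather than optimizer steps, so the 100K and 250K runs are judged on the same data budget. A minimal sketch of what I mean (the logged losses below are made up):

```python
# Compare runs with different batch sizes on a tokens-seen axis, so both are
# judged on the same data budget rather than the same number of updates.

def loss_vs_tokens(step_losses, tokens_per_batch):
    """Convert (optimizer_step, loss) pairs into (tokens_seen, loss) pairs."""
    return [(step * tokens_per_batch, loss) for step, loss in step_losses]

# hypothetical training logs from two runs
run_250k = [(1000, 4.8), (2000, 4.3), (4000, 3.9)]
run_100k = [(1000, 5.0), (2000, 4.6), (4000, 4.1)]

print(loss_vs_tokens(run_250k, 250_000))
print(loss_vs_tokens(run_100k, 100_000))
```

Even plotted this way, the 100K-token batch gets ~2.5x more optimizer steps per token, which I assume is where the early-training edge comes from.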

The main paper I’ve found that seems to think about this is An Empirical Model of Large-Batch Training, which reasons about finding the maximum sensible batch size (the “critical batch size”), but I’m not sure how to interpret it regarding what happens if you choose a smaller batch size - does it even matter?
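To try to get a handle on where my setup lands relative to that critical batch size, I started sketching the paper’s “simple” gradient noise scale estimator as I understand it from the appendix: measure the squared gradient norm at two different batch sizes and solve for the true gradient norm and the trace of the per-example gradient covariance. Everything here is a placeholder sketch rather than my actual training code - `loss_fn`, `small_batch`, `big_batch`, and the two batch sizes are hypothetical stand-ins.

```python
import torch

def grad_sq_norm(model, loss):
    """Squared L2 norm of the gradient of `loss` w.r.t. the model parameters."""
    params = [p for p in model.parameters() if p.requires_grad]
    grads = torch.autograd.grad(loss, params)
    return sum(g.pow(2).sum() for g in grads).item()

def simple_noise_scale(model, loss_fn, small_batch, big_batch, b_small, b_big):
    """Estimate B_simple = tr(Sigma) / |G|^2 from gradient norms at two batch
    sizes, following my reading of An Empirical Model of Large-Batch Training.
    `loss_fn(model, batch)` is assumed to return a scalar loss with a graph."""
    g_small_sq = grad_sq_norm(model, loss_fn(model, small_batch))  # E = |G|^2 + tr(Sigma)/b_small
    g_big_sq = grad_sq_norm(model, loss_fn(model, big_batch))      # E = |G|^2 + tr(Sigma)/b_big

    # Solve the two expectations above for |G|^2 and tr(Sigma).
    g_true_sq = (b_big * g_big_sq - b_small * g_small_sq) / (b_big - b_small)
    trace_sigma = (g_small_sq - g_big_sq) / (1.0 / b_small - 1.0 / b_big)

    return trace_sigma / g_true_sq
```

I believe that in a DDP setup you can get the two gradient norms almost for free by comparing a single GPU’s gradient with the all-reduced one, but I haven’t wired that up yet - corrections welcome if I’ve misread the estimator.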