I’m working on a project that involves pretraining GPT-2-style transformers up to GPT-2 Medium size (using DDP on 8 A6000s, up to 15B tokens, with AdamW). I’m unsure how to choose the right batch size for this, and was wondering if anyone has thoughts?
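For concreteness, here’s roughly how I’m counting tokens per optimizer step under DDP (just a sketch; the specific numbers and variable names are placeholders for my setup, not a recommendation):

```python
# Global batch size in tokens under DDP, assuming one process per GPU.
# All values below are placeholders for my setup.

world_size = 8            # 8 A6000s, one DDP process each
micro_batch_size = 16     # sequences per GPU per forward pass (assumed)
grad_accum_steps = 2      # micro-batches accumulated per optimizer step (assumed)
seq_len = 1024            # GPT-2 context length

tokens_per_step = world_size * micro_batch_size * grad_accum_steps * seq_len
print(tokens_per_step)    # 262144, i.e. ~250K tokens per optimizer step
```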
Papers like the GPT-3 paper seem to use a batch size of ~250K tokens (so 250 sequences of ~1000 tokens, or 125 sequences of ~2000 tokens) for models of GPT-2 Small size, but empirically batch sizes closer to 100K seem to do much better early in training. But maybe the smaller batches make training noisier, which raises the loss the model eventually converges to? I’m unsure!
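One thing I’m considering, given that smaller batches seem to help early on: the GPT-3 paper ramps the batch size linearly from a small value (32K tokens) up to the full value over the first 4–12B tokens of training. Here’s a sketch of what that schedule might look like (the function and its defaults are my own guesses at reproducing it, not anything from a released implementation):

```python
def batch_size_at(tokens_seen: int,
                  start: int = 32_768,          # ~32K-token starting batch (from the GPT-3 appendix)
                  full: int = 262_144,          # full ~250K-token batch (my target, see above)
                  warmup_tokens: int = 4_000_000_000) -> int:
    """Linear batch-size warmup in tokens, GPT-3-style (my reconstruction)."""
    if tokens_seen >= warmup_tokens:
        return full
    frac = tokens_seen / warmup_tokens
    return int(start + frac * (full - start))

# e.g. 1B tokens into a 4B-token ramp, the batch would be ~90K tokens:
print(batch_size_at(1_000_000_000))  # 90112
```

In practice I guess you’d quantize whatever this returns to a multiple of world_size * micro_batch_size * seq_len by varying grad_accum_steps, since that’s the granularity DDP actually gives you.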