Why are huge batch sizes used for pretraining and small ones for finetuning?

I find that most, if not all, papers on language models use very large batch sizes for pretraining on a language modeling task. But when the authors then finetune the model to show its performance on downstream tasks, the batch sizes are suddenly very small.

For instance, the RoBERTa paper reports a pretraining batch size of 8k sequences (Table 9 in the appendix), whereas the finetuning batches are considerably smaller (Table 10, appendix): 16 (RACE), 48 (SQuAD), and 16 or 32 (GLUE).

This has puzzled me for a long time, and I have never found the rationale behind it. Is it a matter of scale? Something like: during pretraining you have so much varied data that you just want to process as much of it in one go as you can, and it does not matter much that the loss is smoothed out (averaged) over such huge batches. But when finetuning on a smaller dataset, you do not want to average the loss over too much of the dataset at once, because you then quickly lose the peculiarities of individual samples.

Or is there another reason? All ideas are welcome.


I don’t think they use the same hardware for pretraining and fine-tuning. E.g. multiple TPU pods or a GPU cluster for pretraining allow a big batch size, but that’s maybe something the research team can only do once. Fine-tuning on something more accessible (just one GPU, for instance) then requires a smaller batch size to avoid running out of memory (OOM).

This is just a guess however.


So apparently I never sent this reply, but it was typed already:

That’s actually a very good point that I had never considered.

I wonder whether my argument about batch sizes still holds, though. 16 is still quite a small batch size, and gradient accumulation is quite cheap, so memory limits alone would not force such a small effective batch.
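
To make the gradient accumulation point concrete, here is a minimal sketch in plain Python. It is a toy linear regression, not anything from the RoBERTa setup, and every name in it (`grad_mse`, `accumulated_grad`, the synthetic data) is hypothetical. It only illustrates that summing suitably weighted micro-batch gradients reproduces the gradient of one large batch, assuming the loss is a plain average over examples.

```python
import random


def grad_mse(w, b, batch):
    """Gradient of the mean squared error of y ~ w*x + b, averaged over `batch`."""
    n = len(batch)
    gw = sum(2 * (w * x + b - y) * x for x, y in batch) / n
    gb = sum(2 * (w * x + b - y) for x, y in batch) / n
    return gw, gb


def accumulated_grad(w, b, batch, micro_batch_size):
    """Walk through `batch` in micro-batches and accumulate their gradients.

    Each micro-batch gradient is weighted by its share of the full batch,
    so the accumulated result matches the full-batch gradient even when
    the last micro-batch is smaller than the others.
    """
    total = len(batch)
    acc_w, acc_b = 0.0, 0.0
    for start in range(0, total, micro_batch_size):
        micro = batch[start:start + micro_batch_size]
        gw, gb = grad_mse(w, b, micro)
        weight = len(micro) / total
        acc_w += gw * weight
        acc_b += gb * weight
    return acc_w, acc_b


# Synthetic data (an assumption for illustration): y = 3x + 1 plus small noise.
random.seed(0)
data = []
for _ in range(48):
    x = random.uniform(-1.0, 1.0)
    data.append((x, 3.0 * x + 1.0 + random.gauss(0.0, 0.1)))

full = grad_mse(0.0, 0.0, data)                # one batch of 48
accum = accumulated_grad(0.0, 0.0, data, 16)   # three micro-batches of 16

print("full batch :", full)
print("accumulated:", accum)
```

The two results agree up to floating-point rounding. Only one micro-batch has to be held in memory at a time, so the price is extra wall-clock time per update, not extra memory. The equivalence relies on the loss being an average over examples; layers whose output depends on batch statistics, such as batch normalization, would break it.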