In most, if not all papers on language models, I find that they often use very large batch sizes for pretraining on a language modeling task. But when they then finetune their model to show its performance on downstream tasks, the batch sizes are suddenly very small.
For instance, the RoBERTa paper shows that its batch size during pretraining was 8k sentences (Table 9 in the appendix), however for finetuning the batches are considerably smaller (Table 10, appendix): 16 (RACE), 48 (SQuAD), 16, 32 (GLUE).
This has puzzled me since forever and I have never discovered the rationale behind this. Is it a matter of scale? Something like: while pretraining you have so much different data, that you just want as much in one go as you can - it does not matter as much that the loss is smoothed out (averaged) over such huge batches. But when finetuning over a smaller dataset you do not want to average the loss over too much of the dataset at once because you then lose peculiarities of samples quickly.
Or is there another reason? All ideas are welcome.