Why are huge batch sizes used for pretraining and small ones for finetuning?

BramVanroy · October 17, 2021, 12:10am

In most, if not all, papers on language models, I see very large batch sizes used for pretraining on a language modeling task. But when the authors then finetune the model to show its performance on downstream tasks, the batch sizes are suddenly very small.

For instance, the RoBERTa paper shows that its batch size during pretraining was 8k sentences (Table 9 in the appendix). For finetuning, however, the batches are considerably smaller (Table 10, appendix): 16 (RACE), 48 (SQuAD), 16, 32 (GLUE).

This has puzzled me forever and I have never found the rationale behind it. Is it a matter of scale? Something like: while pretraining you have so much varied data that you just want as much in one go as you can - it does not matter as much that the loss is smoothed out (averaged) over such huge batches. But when finetuning on a smaller dataset you do not want to average the loss over too much of the dataset at once, because you then quickly lose the peculiarities of individual samples.
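To make the "smoothing" intuition concrete, here is a minimal sketch under loud assumptions: the per-sample gradients are synthetic Gaussian numbers rather than gradients from any real model, and `minibatch_gradient_std` is a hypothetical helper. It only illustrates that averaging over a larger batch gives a less noisy gradient estimate; it does not show whether that noise helps or hurts finetuning.

```python
import random
import statistics


def minibatch_gradient_std(per_sample_grads, batch_size, num_batches, rng):
    """Spread of the mini-batch mean gradient across randomly drawn batches."""
    estimates = []
    for _ in range(num_batches):
        # Draw one mini-batch without replacement and average its gradients.
        batch = rng.sample(per_sample_grads, batch_size)
        estimates.append(sum(batch) / batch_size)
    return statistics.pstdev(estimates)


rng = random.Random(0)  # fixed seed so the sketch is reproducible

# ASSUMPTION: synthetic scalar "gradients", one per training sample.
per_sample_grads = [rng.gauss(0.0, 1.0) for _ in range(10_000)]

std_small = minibatch_gradient_std(per_sample_grads, 16, 500, rng)
std_large = minibatch_gradient_std(per_sample_grads, 1024, 500, rng)

print(f"batch size   16: std of gradient estimate = {std_small:.4f}")
print(f"batch size 1024: std of gradient estimate = {std_large:.4f}")
```

The larger batch yields the smaller spread, which is the averaging effect described above in its simplest form.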

Or is there another reason? All ideas are welcome.

sgugger · October 18, 2021, 1:00am

I don’t think they use the same hardware for pretraining and fine-tuning. For example, multiple TPU pods or a GPU cluster for pretraining allows a big batch size, but that’s maybe something the research team can only do once. Fine-tuning is done on something more accessible (just one GPU, for instance), which then requires a smaller batch size to avoid running out of memory (OOM).

This is just a guess however.

BramVanroy · April 12, 2022, 10:58am

So apparently I never sent this reply, but it was typed already:

That’s actually a very good point that I had never considered.

I wonder whether my argument about batch sizes still holds, though. 16 is still quite a small batch size, and gradient accumulation is quite cheap.
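For readers unfamiliar with gradient accumulation, here is a minimal sketch in plain Python. It assumes a toy one-parameter model `y_hat = w * x` with a mean squared error loss; the function names and the data are hypothetical and not taken from any paper or library. It shows that summing suitably scaled micro-batch gradients reproduces the gradient of one large batch, which is why limited memory alone does not force a small effective batch size.

```python
import math


def batch_gradient(w, batch):
    """Gradient of the mean squared error w.r.t. w over one batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def accumulated_gradient(w, batch, micro_batch_size):
    """Same gradient, computed one micro-batch at a time."""
    if len(batch) % micro_batch_size != 0:
        raise ValueError("batch size must be divisible by micro_batch_size")
    num_micro_batches = len(batch) // micro_batch_size
    total = 0.0
    for start in range(0, len(batch), micro_batch_size):
        micro = batch[start:start + micro_batch_size]
        # Scale each micro-batch gradient so the sum equals the full-batch mean.
        total += batch_gradient(w, micro) / num_micro_batches
    return total


# ASSUMPTION: 128 toy samples from y = 3x + 1, purely for illustration.
data = [(float(i), 3.0 * i + 1.0) for i in range(1, 129)]

full = batch_gradient(0.5, data)
accum = accumulated_gradient(0.5, data, micro_batch_size=16)

print(math.isclose(full, accum, rel_tol=1e-9))
```

In a real training loop the optimizer step would be taken only after the last micro-batch, so the peak memory use is that of a batch of 16 while the update matches a batch of 128 (ignoring layers such as batch normalization that depend on batch statistics).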

msamogh · January 10, 2023, 3:11pm

I’ve noticed a huge increase in my model’s performance when I fine-tuned T5 with a smaller batch size (16 or 32) rather than a larger one, even one as modest as 128. I think it simply boils down to the model getting to see a more diverse set of samples during fine-tuning.
