Fine-tuning nomenclature

Chapter 3 of the course is called “fine-tuning”. Is this use of the term “fine-tuning” consistent with this article (Transfer learning & fine-tuning), which differentiates between transfer learning and fine-tuning?

i.e. When we fine-tune a model as in Chapter 3, are we training the whole network, or are we only training the new layer(s) attached to the body of the pre-trained network?

Fine-tuning is the act of re-training a pretrained model on a new dataset/task; it has nothing to do with whether part of the network is frozen or not (at least in the way we use the term in the course, which is the general way it is used in the published literature AFAIK). Freezing part of the network might help in some situations (like computer vision), but it doesn’t really help for Transformer models; it usually gives worse results.
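
Concretely, the difference looks something like the sketch below (the checkpoint and number of labels are just placeholders, and the actual Trainer setup from Chapter 3 is omitted):

```python
from transformers import AutoModelForSequenceClassification

# Placeholder checkpoint/task; any sequence-classification setup works the same way.
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Standard fine-tuning, as in Chapter 3: every parameter (pre-trained body + new head)
# stays trainable, so the whole network is updated on the new dataset/task.
print(sum(p.numel() for p in model.parameters() if p.requires_grad))

# Optional freezing (not what the course does): keep the pre-trained body fixed
# and train only the newly added classification head.
for param in model.base_model.parameters():
    param.requires_grad = False
print(sum(p.numel() for p in model.parameters() if p.requires_grad))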

Thanks for clarifying!

Freezing part of the network might help in some situations (like computer vision), but it doesn’t really help for Transformer models; it usually gives worse results.

Hi @sgugger, I’ve heard this mentioned a few times on here. I’m wondering whether this is “common knowledge” in the community (perhaps from the experience of training many models), or whether it comes from a source that investigates this question. I remember trying to find a definitive answer to this in the literature a while ago and not being able to find much, so any pointers would be appreciated!

It’s more empirical knowledge of practitioners than an academic result (note that it’s the same for freezing the body of the network when fine-tuning in computer vision: it’s taught in courses, but there are not many academic papers discussing it, and many are not even doing it). I saw a paper confirming this once, but forgot the reference :-/


One paper I know of is this one, which looked at freezing various encoder blocks of BERT / RoBERTa. As Sylvain mentions, the performance tends to drop as you freeze more layers.
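
If you want to try that kind of experiment yourself, a rough sketch for a BERT-style model could look like this (the checkpoint and the number of frozen blocks are arbitrary, and the .bert / .encoder.layer attribute paths are specific to BERT; other architectures name their submodules differently):

```python
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Hypothetical choice: freeze the embeddings plus the first 6 encoder blocks,
# leaving the upper blocks and the classification head trainable.
num_frozen = 6

for param in model.bert.embeddings.parameters():
    param.requires_grad = False

for block in model.bert.encoder.layer[:num_frozen]:
    for param in block.parameters():
        param.requires_grad = False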

There are more recent proposals like AutoFreeze which circumvent this drop in performance, but at the expense of a fairly complex training procedure (i.e. you’d be better off distilling the model if training speed is a concern).

