Fine-tuning nomenclature

Chapter 3 of the course is called “fine-tuning”. Is this use of the term “fine-tuning” consistent with this article (Transfer learning & fine-tuning), which differentiates between transfer learning and fine-tuning?

i.e. When we fine-tune a model as in Chapter 3, are we training the whole network, or are we only training the new layer(s) attached to the body of the pre-trained network?

Fine-tuning is the act of re-training a pretrained model on a new dataset/task; it has nothing to do with whether part of the network is frozen or not (at least in the way we use the term in the course, which is the general way it is used in the published literature AFAIK). Freezing part of the network might help in some situations (like computer vision), but it doesn’t really help for Transformer models; it usually gives worse results.
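
Concretely, the difference looks something like the sketch below (the checkpoint and number of labels are just placeholders, and the actual Trainer setup from Chapter 3 is omitted):

```python
from transformers import AutoModelForSequenceClassification

# Placeholder checkpoint/task; any sequence-classification setup works the same way.
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Standard fine-tuning, as in Chapter 3: every parameter (pre-trained body + new head)
# stays trainable, so the whole network is updated on the new dataset/task.
print(sum(p.numel() for p in model.parameters() if p.requires_grad))

# Optional freezing (not what the course does): keep the pre-trained body fixed
# and train only the newly added classification head.
for param in model.base_model.parameters():
    param.requires_grad = False
print(sum(p.numel() for p in model.parameters() if p.requires_grad))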

Thanks for clarifying!

Freezing part of the network might help in some situations (like computer vision), but it doesn’t really help for Transformer models; it usually gives worse results.

Hi @sgugger, I’ve heard this mentioned a few times on here. I’m wondering whether this is “common knowledge” in the community (perhaps from the experience of training many models), or whether it comes from a source that investigates this question. I remember trying to find a definitive answer to this in the literature a while ago and not being able to find much, so any pointers would be appreciated!

It’s more empirical knowledge of practitioners than an academic result (note that it’s the same for freezing the body of the network when fine-tuning in computer vision: it’s taught in courses, but there are not many academic papers discussing it, and many are not even doing it). I saw a paper confirming this once, but forgot the reference :-/


One paper I know of is this one, which looked at freezing various encoder blocks of BERT / RoBERTa. As Sylvain mentions, the performance tends to drop as you freeze more layers.
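
If you want to try that kind of experiment yourself, a rough sketch for a BERT-style model could look like this (the checkpoint and the number of frozen blocks are arbitrary, and the .bert / .encoder.layer attribute paths are specific to BERT; other architectures name their submodules differently):

```python
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Hypothetical choice: freeze the embeddings plus the first 6 encoder blocks,
# leaving the upper blocks and the classification head trainable.
num_frozen = 6

for param in model.bert.embeddings.parameters():
    param.requires_grad = False

for block in model.bert.encoder.layer[:num_frozen]:
    for param in block.parameters():
        param.requires_grad = False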

There are more recent proposals like AutoFreeze which circumvent this drop in performance, but at the expense of a fairly complex training procedure (i.e. you’d be better off distilling the model if training speed is a concern).

