The point of using a pretrained model if I don't freeze layers

Hi, a newbie here. I am using the Trainer API to fine-tune a BERT model for classification tasks.

If I understand correctly, Trainer doesn't freeze any layers of the pre-trained model, and the tutorials I followed didn't mention freezing either (Fine-tune a pretrained model).

So I followed the tutorial and completed fine-tuning. Afterwards, I dug through some GitHub projects looking for ways to improve accuracy, and saw layer-freezing code like the snippet below. So, without knowing it, I trained the whole network and wasted all of the BERT model's previous learning (right?).

# Freeze the pretrained encoder so its weights are not updated during training;
# only the remaining parameters (e.g. the classification head) stay trainable.
for param in model.roberta.parameters():
    param.requires_grad = False

print("num params:", model.num_parameters())
print("num trainable params:", model.num_parameters(only_trainable=True))

So I want to ask: what is the point of using a pre-trained model (or of transfer learning / fine-tuning as a concept) if I end up training all of BERT's layers anyway? Or should I freeze some layers, as I saw in other people's code?

The part about wasting all of BERT's previous learning is mostly incorrect - fine-tuning a model often causes some degradation and "forgetting" of the knowledge the model learned during pretraining, but in most cases it won't lose very much, and certainly not all of the previous knowledge.

As for the main question - what is the point of using a pre-trained model, and should you freeze some layers?

The main reason not to freeze layers is so that you can harness all of the model's parameters to help it learn the task, i.e. all of the model's layers will be updated to better represent your training data and reduce the loss. In my experience, this usually gives better performance when fine-tuning than freezing layers does.
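For example, with a standard Trainer setup like this rough sketch (bert-base-uncased and two labels are just for illustration, and train_dataset / eval_dataset stand in for your tokenized splits), every parameter has requires_grad=True by default, so the entire network gets updated:

from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Nothing is frozen, so the trainable count equals the total count.
print("trainable params:", model.num_parameters(only_trainable=True))

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=3),
    train_dataset=train_dataset,  # placeholder: your tokenized training split
    eval_dataset=eval_dataset,    # placeholder: your tokenized validation split
)
trainer.train()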

I can't say why the particular repos you were looking at froze layers, but there are reasons to do it in some cases. One is computational efficiency: if you're not updating all of the model's parameters, training is faster and uses less memory (no GPU memory is needed to hold gradients or optimizer states for the frozen parameters). Another is that the dataset may have been very small and they were worried about overfitting, so they reduced the number of parameters dedicated to learning the task. Or they may simply have run experiments and observed that freezing helped on their task.
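If you do want to try freezing for efficiency, one common middle ground (just a sketch, assuming a BertForSequenceClassification with the usual 12 encoder layers) is to freeze the embeddings and the lower encoder layers while leaving the upper layers and the classification head trainable:

from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Freeze the embeddings and the lower 8 encoder layers; the top 4 layers
# and the classification head remain trainable.
for param in model.bert.embeddings.parameters():
    param.requires_grad = False
for layer in model.bert.encoder.layer[:8]:
    for param in layer.parameters():
        param.requires_grad = False

print("trainable params:", model.num_parameters(only_trainable=True))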

Ultimately, I would tend to recommend against freezing layers unless you really need to save memory.
