I downloaded a pretrained model from Hugging Face and added some layers to it.
Now I only want to update the layers I added, so I need to freeze the layers of the pretrained model while training. How can I do it?
Maybe `requires_grad=False`?
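For example, a minimal sketch of that idea, assuming a BERT backbone with a small added head; the wrapper class and the linear head are illustrative, not from the thread:

```Python
import torch.nn as nn
from transformers import AutoModel


class BackboneWithHead(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = AutoModel.from_pretrained("bert-base-uncased")
        # Freeze every pretrained weight; only the newly added head will train.
        for param in self.backbone.parameters():
            param.requires_grad = False
        self.head = nn.Linear(self.backbone.config.hidden_size, 2)

    def forward(self, input_ids, attention_mask=None):
        hidden = self.backbone(input_ids, attention_mask=attention_mask).last_hidden_state
        return self.head(hidden[:, 0])  # classify from the [CLS] position


model = BackboneWithHead()
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable parameters: {trainable} / {total}")  # only the head remains trainable
```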
Hey,
I am trying to figure out how to freeze layers of a model and read that I had to use
for param in model.base_model.parameters():
    param.requires_grad = False
if I wanted to freeze the encoder of a pretrained MLM for example. But how do I use this with the Trainer?
I tried the following:
from transformers import BertTokenizer, BertForMaskedLM, LineByLineTextDataset, DataCollatorForLanguageModeling, Trainer, TrainingArguments
model = BertForMaskedLM.from_pretrained('bert-base-uncased'…
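A rough sketch of the pattern the quoted post is after: freeze the encoder, then pass the model to `Trainer` as usual. The tiny in-memory dataset and the training arguments below are placeholders so the example runs end to end; they are not the objects from the truncated snippet.

```Python
from transformers import (
    BertForMaskedLM,
    BertTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

# Freeze the pretrained encoder; only the MLM head keeps requires_grad=True.
for param in model.base_model.parameters():
    param.requires_grad = False

# Tiny in-memory "dataset" just so the sketch runs; replace with your own data.
texts = ["The quick brown fox jumps over the lazy dog.", "Freezing the encoder."]
train_dataset = [tokenizer(t, truncation=True, max_length=32) for t in texts]

data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="mlm-frozen-encoder",
        num_train_epochs=1,
        per_device_train_batch_size=2,
        report_to="none",
    ),
    train_dataset=train_dataset,
    data_collator=data_collator,
)
trainer.train()
```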
opened 07:46PM - 18 Jan 23 UTC · closed 02:28PM - 23 Jan 23 UTC
### Feature request
Attempt to optimize the training for models with weights/parameters that are set to `requires_grad=False`. This is done by excluding these parameters from the optimizer.
### Motivation
I am building a Seq2Seq model where I use a pre-trained model for the encoder. I freeze all the parameters of the encoder by setting `requires_grad=False`. I expected training to speed up compared to a model where both the encoder and decoder weights are trainable. However, I found that there is no difference in either speed or memory usage.
I investigated a bit and found that all of the model's parameters, regardless of whether they require gradients, are included in the optimizer: https://github.com/huggingface/transformers/blob/00ba7cadd812437708b380ab078a3cfe8cfaff31/src/transformers/trainer.py#L1021-L1030
To test the idea, I subclassed `Seq2SeqTrainer` and updated the above snippet as follows:
```Python
optimizer_grouped_parameters = [
    {
        # Add the `p.requires_grad` condition here
        "params": [p for n, p in opt_model.named_parameters() if (n in decay_parameters and p.requires_grad)],
        "weight_decay": self.args.weight_decay,
    },
    {
        # Add the `p.requires_grad` condition here
        "params": [p for n, p in opt_model.named_parameters() if (n not in decay_parameters and p.requires_grad)],
        "weight_decay": 0.0,
    },
]
```
Doing this actually improved both speed and memory usage during training.
I was wondering if this is something we can add to the codebase. If not, I am curious why we shouldn't exclude parameters that are not meant to be trainable from the optimizer.
### Your contribution
I can make the PR if this is an acceptable change. 🤗
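For reference, a fuller version of the subclass described in that issue might look roughly like this. It is only a sketch: `FrozenAwareSeq2SeqTrainer` is an illustrative name, and the `create_optimizer` hook, `get_parameter_names`, and `ALL_LAYERNORM_LAYERS` helpers are assumed from the `transformers` internals linked above, so exact names may differ between versions.

```Python
from transformers import Seq2SeqTrainer
from transformers.pytorch_utils import ALL_LAYERNORM_LAYERS
from transformers.trainer_pt_utils import get_parameter_names


class FrozenAwareSeq2SeqTrainer(Seq2SeqTrainer):
    """Seq2SeqTrainer that leaves requires_grad=False parameters out of the optimizer."""

    def create_optimizer(self):
        opt_model = self.model
        if self.optimizer is None:
            # Same weight-decay grouping as the stock Trainer, but every group
            # additionally filters on p.requires_grad so frozen weights are skipped.
            decay_parameters = get_parameter_names(opt_model, ALL_LAYERNORM_LAYERS)
            decay_parameters = [name for name in decay_parameters if "bias" not in name]
            optimizer_grouped_parameters = [
                {
                    "params": [
                        p for n, p in opt_model.named_parameters()
                        if n in decay_parameters and p.requires_grad
                    ],
                    "weight_decay": self.args.weight_decay,
                },
                {
                    "params": [
                        p for n, p in opt_model.named_parameters()
                        if n not in decay_parameters and p.requires_grad
                    ],
                    "weight_decay": 0.0,
                },
            ]
            optimizer_cls, optimizer_kwargs = self.get_optimizer_cls_and_kwargs(self.args)
            self.optimizer = optimizer_cls(optimizer_grouped_parameters, **optimizer_kwargs)
        return self.optimizer
```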