How to freeze layers using trainer?

Hey,

I am trying to figure out how to freeze layers of a model and read that I had to use

for param in model.base_model.parameters():
    param.requires_grad = False

if I wanted to freeze the encoder of a pretrained MLM for example. But how do I use this with the Trainer?
I tried the following:

import torch
from transformers import BertTokenizer, BertForMaskedLM, LineByLineTextDataset, DataCollatorForLanguageModeling, Trainer, TrainingArguments
model = BertForMaskedLM.from_pretrained('bert-base-uncased')
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

for param in model.base_model.parameters():
    param.requires_grad = False

dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path=in_path,
    block_size=512,
)

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

training_args = TrainingArguments(
    output_dir=out_path,
    overwrite_output_dir=True,
    num_train_epochs=25,
    per_device_train_batch_size=48,
    save_steps=500,
    save_total_limit=2,
    seed=1
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset
)

trainer.train()

If the encoder was frozen I would expect it to produce the same outputs as a fresh instance of the pretrained encoder, but it doesn’t:

model_fresh = BertForMaskedLM.from_pretrained('bert-base-uncased')
inputs = tokenizer("This is a boring test sentence", return_tensors="pt")
torch.all(model.bert(**inputs)[0].eq(model_fresh.bert(**inputs)[0]))
--> tensor(False)

So I must be doing something wrong here. I guess the Trainer is resetting the requires_grad attribute and I have to overwrite it somehow after I have instantiated the Trainer?

Thanks in advance!
Johannes

Looking at the source code of BertForMaskedLM, the base model is the “bert” attribute, not the “base_model” attribute. So if you want to freeze the parameters of the base model before training, you should type

for param in model.bert.parameters():
    param.requires_grad = False

instead.

@nielsr base_model is an attribute that works on every PreTrainedModel (to make it easy to access the encoder in a generic fashion) :slight_smile:

The Trainer puts your model into training mode, so your difference might simply come from that (there are dropouts in the model). You should check if putting it back in eval mode solves your problem.
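To illustrate the point about training mode, here is a minimal sketch using a toy module with dropout rather than BERT itself (the layer sizes and dropout rate are arbitrary assumptions). In train mode two forward passes over the same input will generally differ, while in eval mode they match.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Toy stand-in for an encoder with dropout; not BERT, just enough to show the effect.
toy = nn.Sequential(nn.Linear(16, 16), nn.Dropout(p=0.5), nn.Linear(16, 4))
x = torch.randn(2, 16)

toy.train()  # training mode: dropout randomly zeroes activations on every call
with torch.no_grad():
    train_same = torch.equal(toy(x), toy(x))

toy.eval()  # eval mode: dropout becomes a no-op, so outputs are deterministic
with torch.no_grad():
    eval_same = torch.equal(toy(x), toy(x))

print("identical in train mode:", train_same)
print("identical in eval mode:", eval_same)
```

For the models in this thread, the equivalent step would be calling model.eval() before comparing outputs against model_fresh.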

@sgugger oh didn’t know that, I learn every day!

OMG! This is so obvious and I can't believe I didn't realize that. Will test and report! Thanks :slight_smile:

@sgugger model.eval() should have done the trick, right? I am afraid the results still don’t match :frowning:

You should inspect the weights to see where they differ then. Trainer will not change the requires_grad value of your parameters.
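One way to inspect the weights is to compare the two models parameter by parameter. The helper below, changed_parameters, is a hypothetical name and not part of transformers; the sketch assumes both models share the same architecture and demonstrates the idea on a toy network where the first layer is frozen before a single optimizer step.

```python
import torch
from torch import nn


def changed_parameters(model_a, model_b):
    """Return the names of parameters whose values differ between two models.

    Hypothetical helper; assumes both models have identical parameter names.
    """
    params_b = dict(model_b.named_parameters())
    return [
        name
        for name, p_a in model_a.named_parameters()
        if not torch.equal(p_a.detach(), params_b[name].detach())
    ]


# Toy demonstration: two identical copies, then freeze layer "0" in one of them.
torch.manual_seed(0)
before = nn.Sequential(nn.Linear(4, 4), nn.Linear(4, 2))
after = nn.Sequential(nn.Linear(4, 4), nn.Linear(4, 2))
after.load_state_dict(before.state_dict())

for param in after[0].parameters():
    param.requires_grad = False

# One SGD step on the parameters that are still trainable.
optimizer = torch.optim.SGD(
    [p for p in after.parameters() if p.requires_grad], lr=0.1
)
after(torch.randn(8, 4)).sum().backward()
optimizer.step()

# Only the unfrozen layer's parameters should be listed.
print(changed_parameters(after, before))
```

With the models from this thread, calling changed_parameters(model, model_fresh) should list any encoder weights that moved during training; an empty result for the encoder would point to something other than the weights, such as dropout.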

@sgugger Thanks, that was important for me to know. It meant I had to be the one screwing up somewhere else, and I did somehow manage that :smile:

Hi,

I tried your code, but I am getting this error:

AttributeError: 'RobertaForMaskedLM' object has no attribute 'bert'

Hey,

yeah, this is because you are using RoBERTa instead of BERT, so the encoder is stored under .roberta. I believe there is some model-independent keyword like “base_model” or something, but I don't know right now (I'm on vacation, but maybe you can try it or google it). Hope that helps!

Best
Johannes

Sorry, I answered by email. sgugger literally provided the base_model keyword in this thread, so there you go :wink:
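As a rough sketch of the architecture-independent approach, the snippet below freezes the encoder through base_model for both a BERT and a RoBERTa masked-LM model. It builds tiny, randomly initialised models from made-up config sizes so that nothing has to be downloaded; freeze_base_model is a hypothetical helper name, not a transformers function.

```python
from transformers import (
    BertConfig,
    BertForMaskedLM,
    RobertaConfig,
    RobertaForMaskedLM,
)

# Tiny config values chosen only for illustration (assumption, not real checkpoints).
tiny = dict(
    vocab_size=100,
    hidden_size=32,
    num_hidden_layers=1,
    num_attention_heads=2,
    intermediate_size=64,
)


def freeze_base_model(model):
    """Freeze the encoder via the generic base_model attribute (hypothetical helper)."""
    for param in model.base_model.parameters():
        param.requires_grad = False
    return model


bert = freeze_base_model(BertForMaskedLM(BertConfig(**tiny)))
roberta = freeze_base_model(RobertaForMaskedLM(RobertaConfig(**tiny)))

# base_model resolves to .bert or .roberta depending on the architecture.
print(bert.base_model is bert.bert, roberta.base_model is roberta.roberta)

# Check whether any encoder parameter is still marked trainable.
print(any(p.requires_grad for p in bert.bert.parameters()))
print(any(p.requires_grad for p in roberta.roberta.parameters()))
```

With a pretrained checkpoint the same loop would apply after from_pretrained, before the model is handed to the Trainer.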