Fine-Tune BERT Models

Hey,

I have a question to help clarify my understanding.

Fine-tuning a BERT model for your downstream task can be important, so I would like to tune the BERT weights. That way I can extract them from the BertForSequenceClassification that I fine-tuned.

If you fine-tune, e.g., BertForSequenceClassification, you tune the weights of the BERT model and the classifier layer too.
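For example, here is a minimal sketch of what I mean (assuming the TensorFlow class TFBertForSequenceClassification and the bert-base-uncased checkpoint, both just placeholders; the PyTorch class has the same sub-modules):

```python
from transformers import TFBertForSequenceClassification

# Checkpoint and label count are placeholders for illustration.
model = TFBertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

# The pretrained BERT body and the classification head are separate sub-layers;
# by default both sets of weights are trainable, so fine-tuning updates them all.
print(model.bert)        # TFBertMainLayer holding the pretrained BERT weights
print(model.classifier)  # Dense layer: the randomly initialised classifier head
print(len(model.trainable_weights), "trainable weight tensors")
```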

But to fine-tune properly, you would first need to freeze the BERT weights and tune only the classifier. Afterwards you fine-tune the BERT weights too, right?
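In code, I imagine the two-stage recipe would look roughly like this (a Keras sketch, assuming TFBertForSequenceClassification and a hypothetical tokenised train_dataset):

```python
import tensorflow as tf
from transformers import TFBertForSequenceClassification

model = TFBertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

# Stage 1: freeze the BERT body and train only the classifier head.
model.bert.trainable = False
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"],
)
# model.fit(train_dataset, epochs=3)  # train_dataset: your tokenised tf.data.Dataset

# Stage 2: unfreeze BERT and fine-tune end-to-end with a much lower learning rate.
model.bert.trainable = True
model.compile(  # re-compile so Keras picks up the changed trainable flags
    optimizer=tf.keras.optimizers.Adam(learning_rate=2e-5),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"],
)
# model.fit(train_dataset, epochs=2)
```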

Now, there are myriad ways to fine-tune the BERT weights, right?
If I just use the main BERT model together with an arbitrary neural network architecture on top, I could fine-tune the BERT weights in this way too, right? (Something like the sketch at the end of this post.)

Any suggestions?

Something else that came to mind: TFBertForSequenceClassification uses the pooled_output, so the model is fine-tuned via this pooled_output.

But instead I could use the [CLS] embedding or a GlobalAveragePooling over the hidden sequence for fine-tuning (and pass that to the classifier layer), right?
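To make the last two questions concrete, here is a rough sketch of a custom head on top of TFBertModel, where you can pick the pooler output, the [CLS] embedding, or a GlobalAveragePooling over the whole hidden sequence. The checkpoint, sequence length and label count are just placeholders, and depending on your transformers version the model may return a tuple instead of a named output:

```python
import tensorflow as tf
from transformers import TFBertModel

MAX_LEN = 128    # placeholder sequence length
NUM_LABELS = 2   # placeholder label count

input_ids = tf.keras.layers.Input(shape=(MAX_LEN,), dtype=tf.int32, name="input_ids")
attention_mask = tf.keras.layers.Input(shape=(MAX_LEN,), dtype=tf.int32, name="attention_mask")

bert = TFBertModel.from_pretrained("bert-base-uncased")
outputs = bert(input_ids, attention_mask=attention_mask)

sequence_output = outputs.last_hidden_state  # (batch, MAX_LEN, hidden_size)
pooled_output = outputs.pooler_output        # what TFBertForSequenceClassification feeds its head

# Alternative sentence representations:
cls_embedding = sequence_output[:, 0, :]  # the [CLS] token embedding
# Simple mean over all token positions (padding not masked out, for simplicity):
avg_embedding = tf.keras.layers.GlobalAveragePooling1D()(sequence_output)

# Any head you like on top; swap avg_embedding for cls_embedding or pooled_output.
x = tf.keras.layers.Dropout(0.1)(avg_embedding)
logits = tf.keras.layers.Dense(NUM_LABELS, name="classifier")(x)

model = tf.keras.Model(inputs=[input_ids, attention_mask], outputs=logits)
# bert.trainable = False  # uncomment to freeze the BERT body first
```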

Hi datistiquo,

when you fine-tune BERT, you can choose whether to freeze the BERT layers or not. Do you want BERT to learn to embed the words in a slightly different way, based on your new data, or do you just want to learn to classify the texts in a new way (with the standard BERT embedding of the words)?

I wanted to use BertViz visualisation to see what effect the classification tuning had on the attention heads, so I did fine-tuning with the first 8 layers of BERT frozen and the remaining 4 layers unfrozen.
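In case it is useful, that kind of partial freezing can be written roughly like this (a TensorFlow/Keras sketch with a placeholder checkpoint; in PyTorch you would set requires_grad=False on the corresponding parameters instead):

```python
from transformers import TFBertForSequenceClassification

model = TFBertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

# Freeze the embeddings and the first 8 encoder layers; leave layers 8-11 trainable.
model.bert.embeddings.trainable = False
for layer in model.bert.encoder.layer[:8]:
    layer.trainable = False

# The classifier head and the last 4 encoder layers remain trainable.
print(len(model.trainable_weights), "trainable weight tensors remain")
```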

Some people suggest doing gradual unfreezing of the BERT layers, i.e. fine-tuning with BERT frozen, then fine-tuning a bit more with just one layer unfrozen, etc.
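Gradual unfreezing could look roughly like this (again just a sketch, with a hypothetical train_dataset; note that in Keras you have to re-compile after changing the trainable flags):

```python
import tensorflow as tf
from transformers import TFBertForSequenceClassification

model = TFBertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

def compile_and_fit(model, learning_rate, epochs=1):
    """Re-compile (so the new trainable flags take effect) and train briefly."""
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=learning_rate),
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=["accuracy"],
    )
    # model.fit(train_dataset, epochs=epochs)  # train_dataset: your tokenised data

# Start with the whole BERT body frozen and train only the head.
model.bert.trainable = False
compile_and_fit(model, learning_rate=1e-3)

# Then make the body trainable again, but keep every encoder layer frozen for now.
model.bert.trainable = True
model.bert.embeddings.trainable = False
for layer in model.bert.encoder.layer:
    layer.trainable = False

# Unfreeze the encoder layers one at a time, from the top down.
for layer in reversed(model.bert.encoder.layer):
    layer.trainable = True
    compile_and_fit(model, learning_rate=2e-5)
```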

I believe it should be possible to use the main BertModel together with your own neural network architecture afterwards, and fine-tune the weights in that way too (but I couldn’t make that work).

By the way, are you using BertForSequenceClassification (in pytorch) or TFBertForSequenceClassification (in tensorflow)?

BertForSequenceClassification is a "Bert Model transformer with a sequence classification/regression head on top (a linear layer on top of the pooled output)". If you want to use a different kind of output, you might consider using BertModel instead.

As I already wrote in my initial post, there are myriad ways. You could build your own head with a neural network and then unfreeze the BERT weights. The weights still get fine-tuned, right?

I just asked whether someone else had this in mind too. If you fine-tune TFBertForSequenceClassification you have a head on top; if you then fine-tune, you train the head and the BERT weights at the same time, which is bad, as I read on the Keras homepage about fine-tuning…

Hello again, I assume this is the section of the Keras docs that you are referring to :slight_smile:

Once your model has converged on the new data, you can try to unfreeze all or part of the base model and retrain the whole model end-to-end with a very low learning rate.

This is an optional last step that can potentially give you incremental improvements. It could also potentially lead to quick overfitting – keep that in mind.

It is critical to only do this step after the model with frozen layers has been trained to convergence. If you mix randomly-initialized trainable layers with trainable layers that hold pre-trained features, the randomly-initialized layers will cause very large gradient updates during training, which will destroy your pre-trained features.

It’s also critical to use a very low learning rate at this stage, because you are training a much larger model than in the first round of training, on a dataset that is typically very small. As a result, you are at risk of overfitting very quickly if you apply large weight updates. Here, you only want to readapt the pretrained weights in an incremental way.
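Translated into this thread's setting, that advice would amount to something like the following sketch (assuming `model` is a TFBertForSequenceClassification whose head has already been trained to convergence with the BERT body frozen, and hypothetical train_ds/val_ds datasets):

```python
import tensorflow as tf

# Only after the frozen-BERT model has converged:
model.bert.trainable = True
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5),  # very low learning rate, as advised
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"],
)

# Guard against the quick overfitting the docs warn about.
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=1, restore_best_weights=True
)
# model.fit(train_ds, validation_data=val_ds, epochs=3, callbacks=[early_stop])
```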

That is very interesting. Chollet is certainly worth attending to!

I note that he was not talking specifically about BERT (or Transformer models). However, I think I should train my model again, starting with all the BERT layers frozen, and only then unfreeze some of them.

I note that Chollet doesn’t say that training the BERT weights is always Bad, but that we need to use a small learning rate to do so.