Fine Tune BERT Models

Hey,

I have a question to check my understanding.

Fine-tuning a BERT model for your downstream task can be important, so I would like to tune the BERT weights. I can then extract them from the BertForSequenceClassification model that I fine-tuned.

If you fine-tune e.g. BertForSequenceClassification, you tune the weights of the BERT model and the classifier layer too.

But to fine-tune properly, you would first need to freeze the BERT weights and tune only the classifier. Afterwards you fine-tune the BERT weights too, right?

Now, there are myriad ways to fine-tune the BERT weights, aren't there?
If I just use the main BERT model followed by an arbitrary neural network architecture, I could fine-tune the BERT weights that way too, right?

Any suggestions?

Another thing that came to mind: TFBertForSequenceClassification uses the pooled_output, so the model is fine-tuned via this pooled_output.

But instead I could use the CLS embedding or a global average pooling of the hidden-state sequence for fine-tuning (passing that to the classifier layer), right?

Hi datistiquo,

when you fine-tune BERT, you can choose whether to freeze the BERT layers or not. Do you want BERT to learn to embed the words in a slightly different way, based on your new data, or do you just want to learn to classify the texts in a new way (with the standard BERT embedding of the words)?

I wanted to use BertViz visualisation to see what effect the classification tuning had on the attention heads, so I did fine-tuning with the first 8 layers of BERT frozen and the remaining 4 layers unfrozen.
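For what it's worth, here is a minimal PyTorch sketch of that kind of partial freezing. It is only an illustration under some assumptions: it uses a tiny, randomly initialised BertConfig so it runs without downloading anything (in practice you would load a checkpoint with from_pretrained), and freezing the embeddings along with the first 8 encoder layers is my choice here, not necessarily what anyone in this thread did. N_FROZEN is just a name made up for the example.

```python
from transformers import BertConfig, BertForSequenceClassification

# Tiny random config so the sketch runs offline (assumption for illustration).
# In practice you would load a pre-trained checkpoint instead, e.g.
# BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
config = BertConfig(vocab_size=100, hidden_size=32, num_hidden_layers=12,
                    num_attention_heads=2, intermediate_size=64, num_labels=2)
model = BertForSequenceClassification(config)

N_FROZEN = 8  # hypothetical name: number of encoder layers to freeze, counted from the bottom

# Freeze the embeddings (an assumption; you may prefer to leave them trainable)
for param in model.bert.embeddings.parameters():
    param.requires_grad = False

# Freeze the first N_FROZEN encoder layers; the remaining layers stay trainable
for layer in model.bert.encoder.layer[:N_FROZEN]:
    for param in layer.parameters():
        param.requires_grad = False

# Indices of encoder layers whose parameters are all still trainable
trainable_layers = [i for i, layer in enumerate(model.bert.encoder.layer)
                    if all(p.requires_grad for p in layer.parameters())]
print(trainable_layers)
```

The classifier head (and the pooler) are left untouched, so they keep training together with the top 4 layers.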

Some people suggest gradually unfreezing the BERT layers, i.e. fine-tuning with BERT frozen, then fine-tuning a bit more with just one layer unfrozen, and so on.
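A rough sketch of how such a gradual unfreezing schedule could be wired up in PyTorch is below. The helper set_trainable_top_layers and the schedule [0, 1, 2, 4] are made up for illustration, the model is a tiny randomly initialised one so the code runs offline, and the actual training loop is left as a comment.

```python
from transformers import BertConfig, BertForSequenceClassification

def set_trainable_top_layers(model, n_unfrozen):
    """Freeze everything inside model.bert, then unfreeze the top n_unfrozen encoder layers.

    The classifier head is not touched, so it stays trainable throughout.
    """
    for param in model.bert.parameters():
        param.requires_grad = False
    n_layers = len(model.bert.encoder.layer)
    for layer in model.bert.encoder.layer[n_layers - n_unfrozen:]:
        for param in layer.parameters():
            param.requires_grad = True

def count_trainable(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# Tiny random config so this runs without a download (assumption for illustration)
config = BertConfig(vocab_size=100, hidden_size=32, num_hidden_layers=4,
                    num_attention_heads=2, intermediate_size=64, num_labels=2)
model = BertForSequenceClassification(config)

trainable_counts = []
for n_unfrozen in [0, 1, 2, 4]:  # hypothetical schedule: head only, then more layers
    set_trainable_top_layers(model, n_unfrozen)
    trainable_counts.append(count_trainable(model))
    # ... train for one or more epochs here, with a fresh optimizer
    #     built from the parameters that currently require gradients ...

print(trainable_counts)
```

Rebuilding the optimizer at each stage is one simple option; how many epochs to spend per stage is something you would have to tune on your own data.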

I believe it should be possible to use the main BertModel together with your own neural network architecture afterwards, and fine-tune the weights in that way too (but I couldn’t make that work).

By the way, are you using BertForSequenceClassification (in pytorch) or TFBertForSequenceClassification (in tensorflow)?

BertForSequenceClassification is a "Bert Model transformer with a sequence classification/regression head on top (a linear layer on top of the pooled output) ". If you want to use a different kind of output, you might consider using BertModel instead.
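As a sketch of what that could look like in PyTorch: the class below wraps a bare BertModel and feeds a masked mean of the last hidden states to a linear classifier instead of the pooled output. MeanPoolClassifier is a made-up name, the model is tiny and randomly initialised so the code runs offline (in practice you would use BertModel.from_pretrained), and the last lines just check that gradients reach the BERT weights, i.e. that they would be fine-tuned through the custom head.

```python
import torch
from torch import nn
from transformers import BertConfig, BertModel

class MeanPoolClassifier(nn.Module):
    """Bare BERT encoder followed by masked mean pooling and a linear head (illustrative)."""

    def __init__(self, bert, num_labels):
        super().__init__()
        self.bert = bert
        self.classifier = nn.Linear(bert.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask):
        hidden = self.bert(input_ids=input_ids,
                           attention_mask=attention_mask).last_hidden_state
        # Average over real tokens only; padding positions are masked out.
        mask = attention_mask.unsqueeze(-1).to(hidden.dtype)
        pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)
        # Alternative: use the [CLS] position instead, i.e. pooled = hidden[:, 0]
        return self.classifier(pooled)

# Tiny random config so this runs without a download (assumption for illustration)
config = BertConfig(vocab_size=100, hidden_size=32, num_hidden_layers=2,
                    num_attention_heads=2, intermediate_size=64)
model = MeanPoolClassifier(BertModel(config), num_labels=3)

input_ids = torch.randint(0, 100, (2, 6))
attention_mask = torch.tensor([[1, 1, 1, 1, 0, 0],
                               [1, 1, 1, 1, 1, 1]])

logits = model(input_ids, attention_mask)
logits.sum().backward()

# Gradients flow back into BERT, so its weights would be updated by an optimizer
bert_grad = model.bert.embeddings.word_embeddings.weight.grad
print(logits.shape, bert_grad is not None)
```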


As I already wrote in my initial post, there are myriad ways. You could build your own head with a neural network and then unfreeze the weights. The weights are still fine-tuned, right?

I just asked whether someone else had this in mind too. Also, if you fine-tune TFBertForSequenceClassification, you have a head on top. If you then fine-tune, you train the head and the BERT weights together, but this is bad according to what I read on the Keras homepage about fine-tuning…

Hello again, I assume this is the section of the Keras docs that you are referring to:

Once your model has converged on the new data, you can try to unfreeze all or part of the base model and retrain the whole model end-to-end with a very low learning rate.

This is an optional last step that can potentially give you incremental improvements. It could also potentially lead to quick overfitting – keep that in mind.

It is critical to only do this step after the model with frozen layers has been trained to convergence. If you mix randomly-initialized trainable layers with trainable layers that hold pre-trained features, the randomly-initialized layers will cause very large gradient updates during training, which will destroy your pre-trained features.

It’s also critical to use a very low learning rate at this stage, because you are training a much larger model than in the first round of training, on a dataset that is typically very small. As a result, you are at risk of overfitting very quickly if you apply large weight updates. Here, you only want to readapt the pretrained weights in an incremental way.

That is very interesting. Chollet is certainly worth attending to!

I note that he was not talking specifically about BERT (or Transformer models). However, I think I should train my model again, starting with all the BERT layers frozen, and only then unfreeze some of them.

I note that Chollet doesn’t say that training the BERT weights is always Bad, but that we need to use a small learning rate to do so.
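Here is a small PyTorch sketch of that two-phase recipe, mainly to make the idea concrete. Everything specific in it is an assumption: the model is tiny and randomly initialised so it runs offline, the batch is random dummy data, each phase is a single optimizer step instead of training to convergence, and the learning rates 1e-3 and 2e-5 are only illustrative values.

```python
import torch
from transformers import BertConfig, BertForSequenceClassification

# Tiny random config and dummy batch so this runs offline (assumptions for illustration)
config = BertConfig(vocab_size=100, hidden_size=32, num_hidden_layers=2,
                    num_attention_heads=2, intermediate_size=64, num_labels=2)
model = BertForSequenceClassification(config)
input_ids = torch.randint(0, 100, (4, 8))
labels = torch.tensor([0, 1, 0, 1])

def train_step(model, optimizer):
    """One optimisation step on the dummy batch; returns the loss value."""
    model.train()
    optimizer.zero_grad()
    loss = model(input_ids=input_ids, labels=labels).loss
    loss.backward()
    optimizer.step()
    return loss.item()

embedding_weight = model.bert.embeddings.word_embeddings.weight
snapshot = embedding_weight.detach().clone()

# Phase 1: BERT frozen, train only the classifier head with a larger learning rate
for param in model.bert.parameters():
    param.requires_grad = False
head_optimizer = torch.optim.AdamW(model.classifier.parameters(), lr=1e-3)
train_step(model, head_optimizer)  # in practice: train until the head has converged
bert_unchanged_in_phase1 = torch.equal(snapshot, embedding_weight)

# Phase 2: unfreeze BERT and continue end-to-end with a much lower learning rate
for param in model.bert.parameters():
    param.requires_grad = True
full_optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
train_step(model, full_optimizer)
bert_changed_in_phase2 = not torch.equal(snapshot, embedding_weight)

print(bert_unchanged_in_phase1, bert_changed_in_phase2)
```

The same pattern combines with partial or gradual unfreezing: in phase 2 you would set requires_grad back to True only for the layers you want to adapt.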

Hugging Face Transformers: Fine-tuning DistilBERT for Binary Classification Tasks. A Beginner’s Guide to NLP and Transfer Learning in TF 2.0

After (optionally) modifying DistilBERT’s configuration class, we can pass both the model name and configuration object to the .from_pretrained() method of the TFDistilBertModel class to instantiate the base DistilBERT model without any specific head on top (as opposed to other classes such as TFDistilBertForSequenceClassification that do have an added classification head). We do not want any task-specific head attached because we simply want the pre-trained weights of the base model to provide a general understanding of the English language, and it will be our job to add our own classification head during the fine-tuning process.

Because DistilBERT’s pre-trained weights will serve as the basis for our model, we wish to conserve and prevent them from updating during the initial stages of training when our model is beginning to learn reasonable weights for our added classification layers. To temporarily freeze DistilBERT’s pre-trained weights, set layer.trainable = False for each of DistilBERT’s layers, and we can later unfreeze them by setting layer.trainable = True once model performance converges.

from transformers import TFDistilBertModel, DistilBertConfig

DISTILBERT_DROPOUT = 0.2
DISTILBERT_ATT_DROPOUT = 0.2

# Configure DistilBERT's initialization
config = DistilBertConfig(dropout=DISTILBERT_DROPOUT,
                          attention_dropout=DISTILBERT_ATT_DROPOUT,
                          output_hidden_states=True)

# The bare, pre-trained DistilBERT transformer model outputting raw hidden-states
# and without any specific head on top.
distilBERT = TFDistilBertModel.from_pretrained('distilbert-base-uncased', config=config)

# Make DistilBERT layers untrainable (set trainable = True again later to unfreeze)
for layer in distilBERT.layers:
    layer.trainable = False