Continual pre-training vs. Fine-tuning a language model with MLM

I have some custom data I want to use to further pre-train the BERT model. I’ve tried the following two approaches so far:

  1. Starting with a pre-trained BERT checkpoint and continuing the pre-training with the Masked Language Modeling (MLM) + Next Sentence Prediction (NSP) heads (e.g. using the BertForPreTraining model)
  2. Starting with a pre-trained BERT checkpoint and continuing the pre-training with the MLM objective only (e.g. using the BertForMaskedLM model, assuming we don’t need NSP for the pre-training part) — see the sketch after this list
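
To be concrete, this is roughly how I load the two variants (a minimal sketch; I'm assuming the `bert-base-uncased` checkpoint here, but any BERT checkpoint should work the same way):

```python
from transformers import BertForPreTraining, BertForMaskedLM

# Option 1: pre-trained encoder with both MLM and NSP heads
model_mlm_nsp = BertForPreTraining.from_pretrained("bert-base-uncased")

# Option 2: pre-trained encoder with only the MLM head
model_mlm = BertForMaskedLM.from_pretrained("bert-base-uncased")
```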

But I’m still confused about whether using BertForPreTraining or BertForMaskedLM actually continues the pre-training of BERT, or whether these are just two models for fine-tuning BERT with MLM+NSP and MLM, respectively. Is there even any difference between fine-tuning BERT with these objectives and continually pre-training it with them, or is this something we need to test?


I have a similar question here. I was following this tutorial, but I’m still quite confused: when we call BertForMaskedLM, how many of the weights do we retain from the original BERT model? Please let me know if you figure it out! Many thanks in advance.

First of all, I came to the conclusion that using BertForMaskedLM can basically be considered further/continual pre-training. Even though some may call it fine-tuning with MLM (instead of continual pre-training), it is more a difference in terminology than in substance.

Second, to answer your question: my understanding is that we start with a pre-trained BERT checkpoint, which means we start with the weights of the pre-trained model rather than randomly initialized weights (if you’re using the Trainer, by setting model=pretrained_BERT_model_checkpoint); I’m not sure what you mean by “how much weights”. One point to keep in mind, however, is that further training with BertForMaskedLM on new data does not necessarily make the model better (in terms of whatever metric you use to measure its quality), since there is a possibility that the model forgets what it has already learned.
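
To make the “start from the pre-trained weights” point concrete, here is a minimal sketch of continual pre-training with the MLM objective using the Trainer. The output directory, training arguments, and `tokenized_dataset` are just placeholders for your own setup:

```python
from transformers import (
    BertForMaskedLM,
    BertTokenizerFast,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Start from the pre-trained weights, not a random initialization
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

# Dynamic masking for the MLM objective
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

training_args = TrainingArguments(
    output_dir="bert-continual-mlm",  # placeholder output path
    num_train_epochs=1,
    per_device_train_batch_size=16,
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=tokenized_dataset,  # your tokenized custom corpus (assumed to exist)
)
trainer.train()
```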


Thank you so much for your explanation! :hugs: I think I understand now: whenever we load a pre-trained model by checkpoint/model name, we start with the weights from the pre-trained model, while if we leave --model_name_or_path as None, the weights are randomly initialized for training from scratch.

If you’re not passing the model, then you need to set model_init (take a look here). You can basically pass the configuration of your model (if you don’t want to start from a checkpoint), and the Trainer will instantiate the model when the train function is called. A minimal sketch is below.
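
Something like this (just a sketch; the collator and dataset are assumed from the earlier example, and the output directory is a placeholder):

```python
from transformers import BertConfig, BertForMaskedLM, Trainer, TrainingArguments

# Architecture only, no pre-trained weights
config = BertConfig()

def model_init():
    # Fresh, randomly initialized model built from the config
    return BertForMaskedLM(config)

trainer = Trainer(
    model_init=model_init,
    args=TrainingArguments(output_dir="bert-from-scratch"),  # placeholder
    data_collator=data_collator,      # assumed: same MLM collator as above
    train_dataset=tokenized_dataset,  # assumed: your tokenized corpus
)
trainer.train()
```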


Thanks for your further explanation!