Continual pre-training vs. Fine-tuning a language model with MLM

I have some custom data I want to use to further pre-train the BERT model. I’ve tried the following two approaches so far:

  1. Starting with a pre-trained BERT checkpoint and continuing the pre-training with the Masked Language Modeling (MLM) + Next Sentence Prediction (NSP) heads (e.g. using the BertForPreTraining model)
  2. Starting with a pre-trained BERT checkpoint and continuing the pre-training with the MLM objective only (e.g. using the BertForMaskedLM model, assuming we don’t need NSP for the pre-training part) — see the sketch after this list
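
To be concrete, this is roughly how I load the two variants (a minimal sketch; I'm assuming the `bert-base-uncased` checkpoint here, but any BERT checkpoint should work the same way):

```python
from transformers import BertForPreTraining, BertForMaskedLM

# Option 1: pre-trained encoder with both MLM and NSP heads
model_mlm_nsp = BertForPreTraining.from_pretrained("bert-base-uncased")

# Option 2: pre-trained encoder with only the MLM head
model_mlm = BertForMaskedLM.from_pretrained("bert-base-uncased")
```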

But I’m still confused about whether using BertForPreTraining or BertForMaskedLM actually continues the pre-training of BERT, or whether these are just two models for fine-tuning BERT with MLM+NSP and MLM, respectively. Is there even any difference between fine-tuning BERT with these objectives and continually pre-training it with them, or is this something we need to test?


I have a similar question here. I was following this tutorial, but I’m still quite confused: when we call BertForMaskedLM, how many of the weights do we retain from the original BERT model? Please let me know if you figure it out! Many thanks in advance.

First of all, I came to the conclusion that using BertForMaskedLM can basically be considered further/continual pre-training. Even though some may call it fine-tuning with MLM (instead of continual pre-training), it is more a difference in terminology than in substance.

Second, to answer your question: my understanding is that we start with a pre-trained BERT checkpoint, which means we start with the weights of the pre-trained model rather than randomly initialized weights (if you’re using the Trainer, by setting model=pretrained_BERT_model_checkpoint); I’m not sure what you mean by “how much weights”. One point to keep in mind, however, is that further training with BertForMaskedLM on new data does not necessarily make the model better (in terms of whatever metric you use to measure its quality), since there is a possibility that the model forgets what it has already learned.
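
To make the “start from the pre-trained weights” point concrete, here is a minimal sketch of continual pre-training with the MLM objective using the Trainer. The output directory, training arguments, and `tokenized_dataset` are just placeholders for your own setup:

```python
from transformers import (
    BertForMaskedLM,
    BertTokenizerFast,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Start from the pre-trained weights, not a random initialization
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

# Dynamic masking for the MLM objective
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

training_args = TrainingArguments(
    output_dir="bert-continual-mlm",  # placeholder output path
    num_train_epochs=1,
    per_device_train_batch_size=16,
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=tokenized_dataset,  # your tokenized custom corpus (assumed to exist)
)
trainer.train()
```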


Thank you so much for your explanation! :hugs: I think I understand now: whenever we load a pre-trained model by checkpoint/model name, we start with the weights from the pre-trained model, while if we leave --model_name_or_path as None, the weights are randomly initialized for training from scratch.

If you’re not passing the model, then you need to set model_init (take a look here). You can basically pass the configuration of your model (if you don’t want to start from a checkpoint), and the Trainer will instantiate the model when the train function is called. A minimal sketch is below.
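
Something like this (just a sketch; the collator and dataset are assumed from the earlier example, and the output directory is a placeholder):

```python
from transformers import BertConfig, BertForMaskedLM, Trainer, TrainingArguments

# Architecture only, no pre-trained weights
config = BertConfig()

def model_init():
    # Fresh, randomly initialized model built from the config
    return BertForMaskedLM(config)

trainer = Trainer(
    model_init=model_init,
    args=TrainingArguments(output_dir="bert-from-scratch"),  # placeholder
    data_collator=data_collator,      # assumed: same MLM collator as above
    train_dataset=tokenized_dataset,  # assumed: your tokenized corpus
)
trainer.train()
```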


Thanks for your further explanation!