Train from scratch vs. further pre-training/fine-tuning with MLM and NSP

Hello all!

I am trying to understand more of the inner workings of BERT in the scenarios discussed below.

Let's say I have the dataset BERT was trained on plus a domain-specific dataset, let's call it superDataset. What is the difference between the following:

  • Train BERT from scratch on superDataset.
  • Start from pre-trained BERT and continue training with MLM and NSP on the domain-specific dataset.

I am new to the NLP world, so I apologize if this is a beginner question or if I am in the wrong spot. I am specifically looking for clear papers someone could recommend that explain this well.

Thanks everyone!

Hi :slight_smile:

First of all, do not apologize for asking questions; this forum is designed exactly for that purpose.
Training from scratch is usually called pre-training, and its purpose is to give the model general linguistic “knowledge”. That means we would probably not want to pre-train the model with superDataset, because we need loads of data in order to pre-train an LLM.
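To make that difference concrete, here is a minimal sketch (assuming you are using the Hugging Face transformers library) of the two starting points: a randomly initialized BERT that would have to be pre-trained from scratch, versus the published pre-trained checkpoint.

```python
from transformers import BertConfig, BertForMaskedLM

# Randomly initialized BERT: it knows nothing about language yet and would
# need a huge corpus (and a lot of compute) to pre-train from scratch.
config = BertConfig()  # default bert-base-sized architecture
model_from_scratch = BertForMaskedLM(config)

# Published checkpoint: the weights already encode the general linguistic
# knowledge learned during BERT's original pre-training.
model_pretrained = BertForMaskedLM.from_pretrained("bert-base-uncased")
```

Everything below (fine-tuning or further pre-training) starts from the second model.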

What we often do instead is take a pre-trained LLM (such as BERT), which has already “seen” general dependencies and relationships in the language, and then feed it the domain-specific dataset. We adjust the weights of the LLM, i.e. we fine-tune the model to our needs.
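As a rough illustration of that fine-tuning step, here is a sketch of adapting the pre-trained encoder to a downstream task (binary classification, purely as an example). `tokenized_train` is a placeholder for your own tokenized, labelled domain dataset, and the hyperparameters are arbitrary, not recommendations.

```python
from transformers import (BertForSequenceClassification, Trainer,
                          TrainingArguments)

# Reuse the pre-trained encoder weights; only the classification head is new.
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

args = TrainingArguments(
    output_dir="bert-domain-finetuned",
    num_train_epochs=3,
    per_device_train_batch_size=16,
)

# tokenized_train: your own tokenized, labelled domain dataset (placeholder).
trainer = Trainer(model=model, args=args, train_dataset=tokenized_train)
trainer.train()  # updates all weights: encoder + classification head
```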

One more thing to know is that MLM and NSP are pre-training tasks; we generally do not use them in the process of fine-tuning. There has been some research on performing further pre-training on a domain-specific dataset to achieve higher performance during fine-tuning. If you are interested, you can have a look there.
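If you do want to try that further pre-training (sometimes called domain-adaptive pre-training), a rough sketch using only the MLM objective could look like the following. `tokenized_domain_corpus` is a placeholder for your unlabeled domain text, and the settings are illustrative. (Continuing NSP as well would require building sentence-pair examples yourself, e.g. with BertForPreTraining, which has both heads.)

```python
from transformers import (AutoTokenizer, BertForMaskedLM,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

# The collator masks 15% of tokens on the fly, reproducing the MLM objective.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="bert-domain-mlm",
    num_train_epochs=1,
    per_device_train_batch_size=16,
)

# tokenized_domain_corpus: your unlabeled domain text, tokenized (placeholder).
trainer = Trainer(model=model, args=args, data_collator=collator,
                  train_dataset=tokenized_domain_corpus)
trainer.train()
```

After this step you would still fine-tune the adapted model on your labelled task data, as in the earlier sketch.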