Train from scratch vs. further pretraining/fine-tuning with MLM and NSP

Hello all!

I am trying to understand more of the inner workings of BERT in the scenarios discussed below.

Let's say I have the dataset BERT was trained on plus a domain-specific dataset; let's call the combination superDataset. What is the difference between the following:

  • Train BERT from scratch on superDataset
  • Start with pretrained BERT and continue pretraining with MLM and NSP on the domain-specific dataset only (see the rough sketch below)
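
To make the two options concrete, here is a rough sketch of what I mean, using the Hugging Face transformers library (I'm assuming `BertForPreTraining`, which bundles the MLM and NSP heads; the actual training loop and data preparation are omitted):

```python
from transformers import BertConfig, BertForPreTraining

# Option 1: train from scratch -- weights are randomly initialized,
# so all of superDataset (original corpus + domain data) would be used.
config = BertConfig()  # default bert-base hyperparameters
scratch_model = BertForPreTraining(config)

# Option 2: continue pretraining -- start from the published checkpoint
# and run further MLM + NSP steps on the domain-specific data only.
pretrained_model = BertForPreTraining.from_pretrained("bert-base-uncased")
```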

I am new to the NLP world, so I apologize if this is a beginner question or if I am in the wrong spot. I am specifically looking for clear papers someone could recommend that explain this well.

Thanks everyone!