How to train BERT from scratch on a new domain for both MLM and NSP?

I’m trying to train a BERT model from scratch using my own dataset. I would like to train the model so that it has the exact architecture of the original BERT model.

In the original paper, it stated that: “BERT is trained on two tasks: predicting randomly masked tokens (MLM) and predicting whether two sentences follow each other (NSP). SCIBERT follows the same architecture as BERT but is instead pretrained on scientific text.”

I’m trying to understand how to train the model on the two tasks above. At the moment, I initialized the model as below:

from transformers import BertConfig, BertForMaskedLM
config = BertConfig()  # defaults match the original BERT-base architecture
model = BertForMaskedLM(config=config)

However, it would just be for MLM and not NSP. How can I initialize and train the model with NSP as well?

My assumptions would be either

  1. Initialize with BertForPreTraining (for both MLM and NSP), or
  2. After finishing training with BertForMaskedLM, initialize the same model and train it again with BertForNextSentencePrediction (but this approach would cost twice the computation and resources…)

I’m not sure which one is the correct way. Or maybe my original approach was fine as it is?
Any insights or advice would be greatly appreciated.


Hi @tlqnguyen

For combined MLM and NSP training, you should use the BertForPreTraining class. When you pass labels to the forward pass it will compute the MLM loss, and when you pass next_sentence_label it will compute the NSP loss.
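As a minimal sketch of that forward pass (using a deliberately tiny, hypothetical BertConfig so it runs quickly; the label tensors here are toy placeholders, not a real masking/pairing pipeline):

```python
import torch
from transformers import BertConfig, BertForPreTraining

# Tiny toy config for illustration only; drop these arguments
# to get the original BERT-base architecture.
config = BertConfig(
    vocab_size=100, hidden_size=32, num_hidden_layers=2,
    num_attention_heads=2, intermediate_size=64,
)
model = BertForPreTraining(config)

input_ids = torch.randint(0, 100, (2, 16))  # batch of 2 sequences, length 16
labels = input_ids.clone()                  # toy MLM targets (real data would mask ~15% of tokens)
next_sentence_label = torch.tensor([0, 1])  # NSP targets: 0 = B follows A, 1 = B is random

outputs = model(
    input_ids=input_ids,
    labels=labels,                          # triggers the MLM loss
    next_sentence_label=next_sentence_label # triggers the NSP loss
)
# outputs.loss is the sum of the MLM and NSP losses
outputs.loss.backward()
```

In practice you would build the masked inputs and sentence pairs with a data collator rather than by hand, but this shows how a single forward call covers both objectives.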

Hi @valhalla,

Thank you so much for your suggestion. I have a quick follow-up question on this. When we train with NSP, do the sentences in the corpus need to be labeled as Sentence A or B? Or can I just train on an unannotated corpus?