I’m trying to train a BERT model from scratch using my own dataset. I would like to train it so that it has the exact architecture of the original BERT model.
The original paper states: “BERT is trained on two tasks: predicting randomly masked tokens (MLM) and predicting whether two sentences follow each other (NSP). SCIBERT follows the same architecture as BERT but is instead pretrained on scientific text.”
I’m trying to understand how to train the model on the two tasks above. At the moment, I initialize the model as below:
from transformers import BertForMaskedLM
model = BertForMaskedLM(config=config)
However, it would just be for MLM and not NSP. How can I initialize and train the model with NSP as well?
My assumptions would be either:
- Initialize with BertForPreTraining (for both MLM and NSP), OR
- After finishing training with BertForMaskedLM, initialize the same model and train again with BertForNextSentencePrediction (but this approach would cost twice the computation and resources…)
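For reference, here is a minimal sketch of what I think the first option would look like. It assumes a deliberately tiny, made-up BertConfig and random token IDs in place of my real tokenized sentence pairs, and the [MASK] id of 103 is just a placeholder. As far as I can tell from the transformers docs, BertForPreTraining takes both `labels` (for MLM) and `next_sentence_label` (for NSP) and returns a single loss that sums the two:

```python
import torch
from transformers import BertConfig, BertForPreTraining

# Tiny, hypothetical config so the sketch runs quickly;
# a real run would use the full BERT sizes and my own vocabulary.
config = BertConfig(
    vocab_size=1000,
    hidden_size=64,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=128,
    max_position_embeddings=64,
)
model = BertForPreTraining(config)  # randomly initialized, MLM + NSP heads

torch.manual_seed(0)
batch_size, seq_len = 2, 16
mask_token_id = 103  # placeholder; the real id comes from the tokenizer

# Fake "sentence A + sentence B" pairs: random ids stand in for real tokens.
input_ids = torch.randint(200, config.vocab_size, (batch_size, seq_len))
token_type_ids = torch.cat(
    [torch.zeros(batch_size, seq_len // 2, dtype=torch.long),
     torch.ones(batch_size, seq_len // 2, dtype=torch.long)],
    dim=1,
)
attention_mask = torch.ones_like(input_ids)

# MLM labels: -100 (ignored by the loss) everywhere except masked positions.
labels = torch.full_like(input_ids, -100)
masked_positions = [3, 10]
labels[:, masked_positions] = input_ids[:, masked_positions]
input_ids[:, masked_positions] = mask_token_id

# NSP labels: 0 = sentence B really follows A, 1 = B is a random sentence.
next_sentence_label = torch.tensor([0, 1])

outputs = model(
    input_ids=input_ids,
    attention_mask=attention_mask,
    token_type_ids=token_type_ids,
    labels=labels,
    next_sentence_label=next_sentence_label,
)

# One combined loss (MLM + NSP), so a single backward pass trains both heads.
outputs.loss.backward()
print(outputs.loss.item())
print(outputs.prediction_logits.shape)        # MLM logits per token
print(outputs.seq_relationship_logits.shape)  # NSP logits per pair
```

If this is right, both tasks are trained in one pass over the data, so it would not double the cost the way the second option does.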
I’m not sure which one is the correct way. Or maybe my original approach was fine as it is?
Any insights or advice would be greatly appreciated.