I’m trying to further pre-train a language model (BERT here) not from scratch but from an initial checkpoint using my own data. My goal is to later use these further pre-trained models for fine-tuning on some downstream tasks (I have no issue with the fine-tuning part). For the pre-training, I want to use both Masked Language Modeling (MLM) and Next Sentence Prediction (NSP) heads (the same way that BERT is pre-trained where the model’s total loss is the sum of MLM loss and NSP loss). My data is stored in a text file following the standard format for BERT input (each document has multiple sentences separated by newlines and documents are separated by an empty line):
```
sentence 1.1
sentence 1.2

sentence 2.1
sentence 2.2
```
I have two specific questions and I appreciate any feedback:
I'm having trouble finding the right function/script in the transformers library for this purpose. As far as I understand, the example scripts for language modeling only use MLM for pre-training (correct me if I'm wrong). Should I use BertForPreTraining for this?
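To illustrate what I mean, here's a minimal sketch of how I understand BertForPreTraining to work: it has both heads, and when both `labels` (MLM targets) and `next_sentence_label` are passed, `out.loss` is the sum of the two losses. I'm using a tiny randomly-initialized config here just so the snippet is self-contained; in practice I'd load my checkpoint with `from_pretrained(...)`. Please correct me if I'm misreading the API:

```python
import torch
from transformers import BertConfig, BertForPreTraining

# Tiny random-init config so nothing is downloaded; in practice I'd do
# BertForPreTraining.from_pretrained("path/to/my/checkpoint") instead.
config = BertConfig(
    vocab_size=100,
    hidden_size=32,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=64,
    max_position_embeddings=64,
)
model = BertForPreTraining(config)

input_ids = torch.randint(0, 100, (2, 16))       # batch of 2, seq length 16
labels = torch.randint(0, 100, (2, 16))          # MLM targets (-100 = ignore)
next_sentence_label = torch.randint(0, 2, (2,))  # 0 = IsNext, 1 = NotNext

out = model(
    input_ids=input_ids,
    labels=labels,
    next_sentence_label=next_sentence_label,
)
# out.loss is the sum of the MLM loss and the NSP loss
print(out.loss)
```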
Assuming BertForPreTraining is the right model, how should I prepare my data for it? I'm looking for the right dataset object or data format, and the right way of tokenizing my input so that it's suitable for both MLM and NSP.
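For context, this is roughly how I currently parse my file and build NSP sentence pairs in plain Python, following the original BERT recipe (keep the true next sentence half the time, otherwise substitute a random sentence from another document). The function names here are just my own; what I'm unsure about is whether there's a proper dataset/collator in transformers that does this plus the MLM masking for me:

```python
import random

def read_documents(text):
    # Parse the BERT pre-training text format: one sentence per line,
    # documents separated by a blank line. `text` is the file contents,
    # e.g. open(path, encoding="utf-8").read().
    docs, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if line:
            current.append(line)
        elif current:
            docs.append(current)
            current = []
    if current:
        docs.append(current)
    return docs

def make_nsp_pairs(docs, seed=0):
    # For each consecutive sentence pair, keep the true next sentence
    # half of the time (label 0 = IsNext) and swap in a random sentence
    # from a different document otherwise (label 1 = NotNext).
    rng = random.Random(seed)
    pairs = []
    for i, doc in enumerate(docs):
        for a, b in zip(doc, doc[1:]):
            if rng.random() < 0.5 or len(docs) < 2:
                pairs.append((a, b, 0))
            else:
                other = rng.choice([d for j, d in enumerate(docs) if j != i])
                pairs.append((a, rng.choice(other), 1))
    return pairs

sample = "sentence 1.1\nsentence 1.2\n\nsentence 2.1\nsentence 2.2\n"
docs = read_documents(sample)
pairs = make_nsp_pairs(docs, seed=0)
print(pairs)
```

(I'm using 0 for "B follows A" and 1 for "B is random" because that's the convention I've seen for `next_sentence_label` in transformers, but please correct me if I have it backwards.)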