Continual pre-training from an initial checkpoint with MLM and NSP

I’m trying to further pre-train a language model (BERT here), not from scratch but from an initial checkpoint, using my own data. My goal is to later use these further pre-trained models for fine-tuning on some downstream tasks (I have no issue with the fine-tuning part). For the pre-training, I want to use both the Masked Language Modeling (MLM) and Next Sentence Prediction (NSP) heads, the same way BERT is pre-trained, where the model’s total loss is the sum of the MLM loss and the NSP loss. My data is stored in a text file following the standard format for BERT input (each document consists of multiple sentences, one per line, and documents are separated by an empty line):


    sentence 1.1
    sentence 1.2

    sentence 2.1
    sentence 2.2

I have two specific questions, and I’d appreciate any feedback:

  1. I’m having trouble finding the right function/script in the transformers library for this purpose. As far as I understand, all of the language-modeling example scripts use only MLM for pre-training (correct me if I’m wrong). Should I use BertForPreTraining for this?

  2. Assuming I should use BertForPreTraining, how should I prepare my data for this model? I’m looking for the right object or data type/format and the right way of tokenizing my input data so that it’s suitable for both MLM and NSP. I’ve put rough sketches of what I have in mind right after this list; corrections are very welcome.
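For question 1, my current understanding is that BertForPreTraining carries both heads and, when given both MLM labels and a next-sentence label, returns a single loss that is the sum of the MLM and NSP losses. Here is a minimal sketch of what I mean; the checkpoint name is a placeholder for my initial checkpoint, and the single hand-masked token is only there for illustration (a data collator would normally do the masking):

```python
import torch
from transformers import BertForPreTraining, BertTokenizer

# "bert-base-uncased" is a placeholder; substitute your own initial checkpoint.
checkpoint = "bert-base-uncased"
tokenizer = BertTokenizer.from_pretrained(checkpoint)
model = BertForPreTraining.from_pretrained(checkpoint)

# Encode a sentence pair as [CLS] A [SEP] B [SEP]; token_type_ids mark the two segments.
inputs = tokenizer("sentence 1.1", "sentence 1.2", return_tensors="pt")

# Toy MLM setup: mask a single token and keep its original id as the label;
# every other position is set to -100 so it is ignored by the MLM loss.
mlm_labels = torch.full_like(inputs["input_ids"], -100)
mlm_labels[0, 1] = inputs["input_ids"][0, 1]
inputs["input_ids"][0, 1] = tokenizer.mask_token_id

# next_sentence_label: 0 = sentence B really follows A, 1 = B is a random sentence.
outputs = model(**inputs, labels=mlm_labels, next_sentence_label=torch.tensor([0]))
print(outputs.loss)  # single scalar: MLM loss + NSP loss
```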
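For question 2, the pipeline I have in mind is sketched below. I’m assuming that TextDatasetForNextSentencePrediction (which, as far as I can tell, reads exactly the blank-line-separated document format above and builds sentence-pair examples with a next_sentence_label, though I believe it is deprecated in newer releases) can be combined with DataCollatorForLanguageModeling for the random masking and then passed to Trainer. The file path, block size, and training hyperparameters are placeholders, and I’m not certain this is the intended combination, so please correct me:

```python
from transformers import (
    BertForPreTraining,
    BertTokenizer,
    DataCollatorForLanguageModeling,
    TextDatasetForNextSentencePrediction,
    Trainer,
    TrainingArguments,
)

checkpoint = "bert-base-uncased"  # placeholder for the initial checkpoint
tokenizer = BertTokenizer.from_pretrained(checkpoint)
model = BertForPreTraining.from_pretrained(checkpoint)

# Builds sentence-pair examples (input_ids, token_type_ids, next_sentence_label)
# from a text file with one sentence per line and documents separated by blank lines.
dataset = TextDatasetForNextSentencePrediction(
    tokenizer=tokenizer,
    file_path="my_corpus.txt",  # placeholder path to the corpus described above
    block_size=128,
)

# Adds the MLM part: randomly masks 15% of tokens and emits the corresponding `labels`.
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.15,
)

training_args = TrainingArguments(
    output_dir="./bert-continued-pretraining",
    overwrite_output_dir=True,
    num_train_epochs=1,
    per_device_train_batch_size=8,
    save_steps=10_000,
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset,
)

trainer.train()
trainer.save_model("./bert-continued-pretraining")
```

If this works the way I think it does, the dataset supplies input_ids, token_type_ids, and next_sentence_label, while the collator adds the masked input_ids and the MLM labels on the fly, so BertForPreTraining should receive everything it needs to compute both losses.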


Hi.

I want to do exactly the same thing as you. Did you find an answer anywhere? How did things go with your approach? I’d appreciate any advice that would help me avoid headaches.

Thanks in advance.