Pre-Train BERT (from scratch)

BERT was trained on the MLM and NSP objectives. I want to pre-train BERT with and without the NSP objective (with NSP in case the suggested approach is different). I haven’t done pre-training in the full sense before. Can you please share how to obtain the data BERT was trained on, including the crawl and tokenization details that were used? Since training takes a lot of time, I am looking for well-tested code that can produce BERT with or without NSP in one go. Any suggestions will be helpful.
I know about some projects like these, but I guess they won’t integrate well with transformers, which is a must-have in my case.


BERT was trained on BookCorpus and English Wikipedia, both of which are available in the datasets library.

Transformers recently added a dataset class for next sentence prediction, which you could use

and there’s also NSP head for BERT

BertForPreTraining class can be used for both MLM and NSP

With the current examples/language-modeling script, I guess it’s only possible to use either MLM or NSP; you might need to write your own script to combine them.


For training on MLM objective, is it recommended to use collate_fn from here? Didn’t see TextDataset for MLM objective.

Masking is done using DataCollatorForLanguageModeling so you can use any dataset and just pass the collator to DataLoader.

One thing to note:
DataCollatorForLanguageModeling does dynamic masking, but BERT was trained using static masking.
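
To make the static vs. dynamic distinction concrete, here is a minimal pure-Python sketch. It does not use the transformers collator; `mask_tokens`, the toy `VOCAB`, and the inflated `mask_prob=0.5` are assumptions for illustration only (BERT selects 15% of tokens, then replaces 80% of those with [MASK], 10% with a random token, and leaves 10% unchanged).

```python
import random

MASK = "[MASK]"
VOCAB = ["the", "cat", "sat", "on", "mat", "dog"]  # toy vocabulary (assumption)

def mask_tokens(tokens, rng, mask_prob=0.15):
    """Return (masked_tokens, labels); labels are None where nothing is predicted."""
    masked, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        # Never mask special tokens; otherwise select with probability mask_prob.
        if tok in ("[CLS]", "[SEP]") or rng.random() >= mask_prob:
            continue
        labels[i] = tok
        roll = rng.random()
        if roll < 0.8:
            masked[i] = MASK               # 80%: replace with [MASK]
        elif roll < 0.9:
            masked[i] = rng.choice(VOCAB)  # 10%: replace with a random token
        # remaining 10%: keep the original token
    return masked, labels

tokens = ["[CLS]", "the", "cat", "sat", "on", "the", "mat", "[SEP]"]

# Static masking: mask once during preprocessing, reuse the result every epoch.
static_example = mask_tokens(tokens, random.Random(0), mask_prob=0.5)
static_epochs = [static_example for _ in range(3)]

# Dynamic masking: draw a fresh mask each time the example is batched.
rng = random.Random(0)
dynamic_epochs = [mask_tokens(tokens, rng, mask_prob=0.5) for _ in range(3)]
```

With static masking the model sees the same masked positions for a given example in every epoch; with dynamic masking the positions generally change from epoch to epoch.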

It seems that using BertForPreTraining (which has both the MLM and NSP heads) with TextDatasetForNextSentencePrediction and DataCollatorForLanguageModeling would be equivalent to the BERT objective (except for the static masking part). And for the dataset, we can use the datasets.concatenate_datasets() method for BookCorpus and Wikipedia. This might be close, right? Any additional details?


datasets.concatenate_datasets() does not seem to work for this since the features do not match. Also, TextDatasetForNextSentencePrediction expects a file_path. Initially I thought it was a wrapper that could take datasets objects.

It shouldn’t be hard to convert TextDatasetForNextSentencePrediction to use datasets. I played with the English wikipedia dataset just now. Each dataset entry is an article/document, and it needs to be sentence tokenized in TextDatasetForNextSentencePrediction. BookCorpus dataset entries seem to be sentences already. Let me know about your progress.

How are you measuring performance?

I don’t yet. I am still setting up these training pipelines. I asked about metrics in "Evaluation metrics for BERT-like LMs" but have had no response yet. I have read in several places that perplexity is not appropriate for BERT and other MLMs. Can’t we use the fill-mask pipeline and some version of masking accuracy?
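
One possible reading of "masking accuracy" is top-1 accuracy over the masked positions only. The sketch below is an assumption about what such a metric could look like, not an established benchmark; `masked_accuracy` is a hypothetical helper. It relies on the convention that positions which are not scored carry the label -100, which is what DataCollatorForLanguageModeling emits.

```python
IGNORE_INDEX = -100  # label used for positions that are not scored

def masked_accuracy(predictions, labels, ignore_index=IGNORE_INDEX):
    """Top-1 accuracy over positions whose label is not ignore_index.

    predictions, labels: lists of rows of token ids with matching shapes.
    """
    correct = total = 0
    for pred_row, label_row in zip(predictions, labels):
        for pred, label in zip(pred_row, label_row):
            if label == ignore_index:
                continue  # unmasked position: skip
            total += 1
            correct += int(pred == label)
    return correct / total if total else float("nan")

# Toy batch: predicted ids (argmax over the vocabulary) and MLM labels.
preds = [[5, 7, 9, 2], [4, 4, 8, 1]]
labels = [[-100, 7, -100, 3], [4, -100, -100, -100]]
acc = masked_accuracy(preds, labels)  # 2 of the 3 scored positions match
```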

OTOH, I’ve already set up GLUE benchmarks with v2 Alpha. It integrates very well with transformers, and you can easily plug in any model and run benchmarks in parallel.

Did you try using cross-entropy for pre-training? We usually use that for MLM. I guess it can easily be used for NSP as well.
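
For reference, BertForPreTraining returns the sum of the MLM cross-entropy and the NSP cross-entropy when both kinds of labels are passed. Here is a dependency-free sketch of that combination; `cross_entropy` and `pretraining_loss` are hypothetical helpers written for illustration, and the uniform logits are toy inputs.

```python
import math

def cross_entropy(logits, target):
    """Cross-entropy of one example from raw logits (stable log-sum-exp)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]

def pretraining_loss(mlm_logits, mlm_labels, nsp_logits, nsp_label, ignore_index=-100):
    """Mean MLM cross-entropy over scored positions plus NSP cross-entropy."""
    scored = [(lg, y) for lg, y in zip(mlm_logits, mlm_labels) if y != ignore_index]
    mlm_loss = sum(cross_entropy(lg, y) for lg, y in scored) / len(scored) if scored else 0.0
    nsp_loss = cross_entropy(nsp_logits, nsp_label)
    return mlm_loss + nsp_loss

# Toy example: 3 positions, vocabulary of 4, only position 1 is masked.
mlm_logits = [[0.0] * 4, [0.0] * 4, [0.0] * 4]
mlm_labels = [-100, 2, -100]
nsp_logits = [0.0, 0.0]  # two classes: "is next" vs "is random"
total = pretraining_loss(mlm_logits, mlm_labels, nsp_logits, nsp_label=0)
```

With uniform logits the MLM term is log(4) and the NSP term is log(2), so the total is log(8).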

Indeed wikipedia has columns “text” and “title” while bookcorpus only has “text”.
You can concatenate them by removing the “title” column from wikipedia:

from datasets import load_dataset, concatenate_datasets

wiki = load_dataset("wikipedia", "20200501.en", split="train")
bookcorpus = load_dataset("bookcorpus", split="train")
print(wiki.column_names, bookcorpus.column_names)
# ['title', 'text'] ['text']

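# Drop "title" so both datasets share the same features before concatenating
wiki = wiki.remove_columns("title")
bert_dataset = concatenate_datasets([wiki, bookcorpus])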

Let me know if you find an appropriate way to cut wikipedia articles into sentences!
Also, don’t hesitate to ask if you have any questions about dataset processing, I’d be happy to help.

You can use spaCy or stanza for sentence segmentation. spaCy is quite a bit faster but might be less accurate. If you want, I can post a segmentation function here.

So after concatenating wikipedia and bookcorpus, the next thing to do is NSP. Can you suggest how to do that on the concatenated object?
I do not want to diverge from the method that was actually used to pre-train BERT.
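
As a rough illustration of how NSP examples are built, here is a simplified sketch: for each sentence A, sentence B is the true next sentence half of the time and a random sentence from a different document otherwise. `make_nsp_pairs` is a hypothetical helper, and it assumes the documents have already been split into sentences. It pairs single sentences, whereas the original BERT preprocessing packs multiple sentences into segments up to the maximum sequence length, so treat this as a sketch rather than a faithful reproduction. The label convention (0 = B follows A, 1 = B is random) matches the one used by the BERT NSP head in transformers.

```python
import random

def make_nsp_pairs(documents, rng):
    """Build (sentence_a, sentence_b, label) triples.

    documents: list of documents, each a list of sentence strings.
    label 0: sentence_b really follows sentence_a; label 1: sentence_b is random.
    """
    pairs = []
    for doc_idx, doc in enumerate(documents):
        for i in range(len(doc) - 1):
            if len(documents) > 1 and rng.random() < 0.5:
                # Negative example: pick a sentence from a different document.
                others = [j for j in range(len(documents)) if j != doc_idx]
                sentence_b = rng.choice(documents[rng.choice(others)])
                label = 1
            else:
                # Positive example: the sentence that actually comes next.
                sentence_b, label = doc[i + 1], 0
            pairs.append((doc[i], sentence_b, label))
    return pairs

docs = [
    ["A1 starts.", "A2 continues.", "A3 ends."],
    ["B1 starts.", "B2 ends."],
    ["C1 starts.", "C2 continues.", "C3 goes on.", "C4 ends."],
]
pairs = make_nsp_pairs(docs, random.Random(0))
```

Each resulting pair would then be tokenized as [CLS] A [SEP] B [SEP] before masking is applied.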

You can have a look here:

Has anyone replicated BERT pre-training from scratch? It would be good to hear exactly what they did.

I already saw it. I tried using it, but got stuck on other things such as metrics, preprocessing, etc. Given that training will last for a week, there is not much room for errors.

Also, is there a study, or has anyone experimented with what happens if we rely solely on MLM with no NSP? How much difference does that make? RoBERTa showed that NSP didn’t prove to be useful. In that case, does including NSP help with MLM?

Well, as you found, RoBERTa showed that leaving out NSP yields better results on downstream tasks. ALBERT then re-added a similar (yet very different) task, namely sentence order prediction, which improved performance on downstream tasks.

PS: please don’t post multiple consecutive posts but rather edit your posts to add more information. It’s a bit annoying with the notifications.


Quentin, I am not sure the dataset itself should cut articles into sentences (unless there is an option for both articles and sentences). Other models might need entire articles as input. If needed, users can sentence tokenize articles using nltk, spaCy, and the like. I’ll play with the wikipedia dataset in the coming days and report back on my experience. Also, while looking at the dataset I found references to Categories and such. Perhaps an equally important objective for the wikipedia dataset is to keep it as clean as possible.