Training BERT from scratch with Wikipedia + Book Corpus Dataset

rio1210 · January 18, 2021, 3:05am

Hello, everyone! I work in a different field of ML and am not very familiar with NLP, so I am seeking your help!

I want to pre-train the standard BERT model on the Wikipedia and Book Corpus datasets (which I think is the standard practice!) as part of my research work.

I am following the Hugging Face guide to pretraining a model from scratch: https://huggingface.co/blog/how-to-train

Now since they are training a different model on a different language dataset, in the article they mention:

We recommend training a byte-level BPE (rather than let’s say, a WordPiece tokenizer like BERT)

1. So, in my case, should I go for a WordPiece tokenizer for BERT pretraining? (I have a rough idea of what a tokenizer does, but I don't know enough to understand the ramifications of this choice.)

Apart from this, the only other deviation from the article that I see is the choice of dataset. I understand Hugging Face has both the Wikipedia and the Book Corpus datasets.

2. So, how should I go about training? Should I train the model on Wikipedia first and then on Book Corpus, or should I somehow concatenate them into a single larger dataset? Is there anything else I should keep in mind?

I would really appreciate it if someone could point me to materials/code for pretraining BERT.
Any other tips/suggestions would be highly appreciated! Thanks a lot!

VP1 · January 22, 2021, 9:52am

Or should I somehow concatenate them into a larger singular dataset.

You would benefit from a bigger dataset, so concatenate them.
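A minimal sketch of what combining the two corpora could look like, in plain Python so it runs without downloading anything. The function name `merge_corpora`, the toy rows, and the assumption that the Wikipedia rows carry an extra `title` column while the book rows only have `text` are all illustrative, not taken from the guide. The shuffle is also my own choice, so that training batches mix both sources.

```python
import random


def merge_corpora(*corpora, text_key="text", seed=0):
    """Merge several corpora into one shuffled list of {"text": ...} rows.

    Hypothetical helper: each corpus is an iterable of dicts that has at
    least a `text_key` field. Extra columns (e.g. "title") are dropped so
    that every row ends up with the same schema.
    """
    merged = [{text_key: row[text_key]} for corpus in corpora for row in corpus]
    # Shuffle with a fixed seed so the mix of sources is reproducible.
    random.Random(seed).shuffle(merged)
    return merged


# Toy stand-ins for the real datasets (assumed column layout).
wiki_rows = [
    {"title": "BERT", "text": "BERT is a language model."},
    {"title": "Wikipedia", "text": "Wikipedia is an encyclopedia."},
]
book_rows = [{"text": "It was a dark and stormy night."}]

combined = merge_corpora(wiki_rows, book_rows)
print(len(combined))
print(sorted(combined[0].keys()))
```

With the Hugging Face `datasets` library the same idea is usually expressed with `concatenate_datasets`, after removing the columns the two datasets do not share.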

should I go for WordPiece tokenizer for BERT pretraining?

BPE and WordPiece have a lot in common:
https://huggingface.co/transformers/tokenizer_summary.html
BERT is trained with WordPiece, so it is natural to choose WordPiece in this case.
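To give a feel for what WordPiece does at tokenization time, here is a rough sketch of the greedy longest-match-first split that BERT-style tokenizers apply to each word, with `##` marking pieces that continue a word. The vocabulary below is a made-up toy one and `wordpiece_tokenize` is a hypothetical name. Learning the vocabulary itself is a separate step that this sketch does not cover.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]", prefix="##"):
    """Split one word into sub-word pieces, longest match first."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                # Pieces that do not start the word carry the prefix.
                candidate = prefix + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits: the whole word maps to the unknown token.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


# Toy vocabulary, for illustration only.
toy_vocab = {"train", "##ing", "token", "##izer", "##s", "[UNK]"}

print(wordpiece_tokenize("training", toy_vocab))    # ['train', '##ing']
print(wordpiece_tokenize("tokenizers", toy_vocab))  # ['token', '##izer', '##s']
print(wordpiece_tokenize("xyz", toy_vocab))         # ['[UNK]']
```

In practice you would not write this yourself: the Hugging Face `tokenizers` library ships a `BertWordPieceTokenizer` that can be trained on your own text files.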
