Data preprocessing steps for pretraining BERT from scratch

I am trying to pretrain BERT from scratch using the Hugging Face BertForMaskedLM. I am only interested in masked language modeling. I have a lot of noob questions regarding the preprocessing steps, and my guess is a lot of people are in the same boat as me. The questions are strictly about preprocessing, including tokenization, for BERT only. Any answer or suggestion is highly appreciated. Please feel free to add anything I missed as well.


  1. As I understand it, the dataset used to pretrain BERT (WikiText and BookCorpus) is available in the Datasets library. What are the differences between ‘wikitext-103-v1’ and ‘wikitext-103-raw-v1’? Which one is usually preferred?
  2. Let’s say we go with ‘wikitext-103-v1’. There are a lot of headings; should we remove them?
  3. Should we remove all urls?
  4. Should we remove text from other languages? If yes, what is the best way to remove other language text?
  5. Any special characters/text we should remove?
  6. Is it necessary to split the text into sentences, or are we going to be alright just choosing paragraphs of a certain length? When should we truncate the text: before tokenization, during tokenization, or both? I am thinking of splitting each article into multiple paragraphs.
  7. Is it necessary to train a tokenizer on WikiText and BookCorpus, or can we use the pretrained BertTokenizer?
  8. What is the best way to handle memory issues? Should we preprocess the data during training, or should we save the preprocessed data first? I am thinking of using Google Colab with a TPU.
  9. Any useful blog or tutorial?

Edit: the dataset is the Wikipedia dataset, not WikiText.

I believe these two tutorials should be helpful:

Hope those are helpful!
