Data preprocessing steps for pretraining BERT from scratch

I am trying to pretrain BERT from scratch using the Hugging Face BertForMaskedLM; I am only interested in masked language modeling. I have a lot of noob questions about the preprocessing steps, and my guess is that a lot of people are in the same boat as me. The questions are strictly about preprocessing, including tokenization, for BERT only. Any answer or suggestion is highly appreciated, and please feel free to add anything I missed.
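
For context, here is roughly how I am setting up the model side. This is only a minimal sketch: the config values are just the BertConfig defaults, and masking is handled dynamically by the data collator, which is why the questions below are only about producing clean token IDs.

```python
from transformers import (
    BertConfig,
    BertForMaskedLM,
    BertTokenizerFast,
    DataCollatorForLanguageModeling,
)

# Pretrained vocab for now; question 7 below asks whether to train a new one.
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

# Randomly initialized BERT-base, i.e. "from scratch".
config = BertConfig(vocab_size=tokenizer.vocab_size)
model = BertForMaskedLM(config)

# Masking (15%) is applied on the fly, so the preprocessed dataset only needs
# input_ids (plus attention_mask / token_type_ids); no NSP sentence pairs.
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
```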

Questions:

  1. As I understand it, the datasets used to pretrain BERT (wikitext and BookCorpus) are available in the Datasets library. What are the differences between ‘wikitext-103-v1’ and ‘wikitext-103-raw-v1’, and which one is usually preferred? (See the loading sketch after this list.)
  2. Let’s say we go with ‘wikitext-103-v1’. There are a lot of headings; should we remove them (see the cleaning sketch after this list)?
  3. Should we remove all urls?
  4. Should we remove text from other languages? If yes, what is the best way to remove other language text?
  5. Any special characters/text we should remove?
  6. Is it necessary to split the text into sentences, or are we going to be alright just choosing paragraphs of a certain length? When should we truncate the text: before tokenization, during tokenization, or both? I am thinking of splitting each article into multiple paragraphs (see the chunking sketch after this list).
  7. Is it necessary to train a tokenizer on the wikitext and BookCorpus, or can we use the pretrained BertTokenizer as-is? (See the tokenizer sketch after this list.)
  8. What is the best way to handle memory issues? Should we preprocess the data during training, or should we save the preprocessed data first? I am planning to use Google Colab with a TPU (see the save/reload sketch after this list).
  9. Any useful blog or tutorial?
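
For question 1, here is how I am loading the two variants to compare them side by side. My understanding from the dataset card is that ‘wikitext-103-raw-v1’ keeps the original text while ‘wikitext-103-v1’ replaces rare words with `<unk>`, so the raw variant seems to be the one to use with a subword tokenizer, but please correct me if that is wrong.

```python
from datasets import load_dataset

# "raw" keeps the original text; the non-raw variant has rare words already
# replaced with <unk> (it was built for word-level LM benchmarks).
raw = load_dataset("wikitext", "wikitext-103-raw-v1", split="train")
tok = load_dataset("wikitext", "wikitext-103-v1", split="train")

print(raw[10]["text"])
print(tok[10]["text"])
```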
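For questions 2–4, this is the kind of cleaning I have in mind: drop heading-only and empty lines, strip URLs, and optionally filter by language. The regexes are rough guesses at the wikitext formatting, and the commented-out language filter assumes the langdetect package is installed; fastText's language-ID model would be a faster alternative for a large corpus.

```python
import re
from datasets import load_dataset

dataset = load_dataset("wikitext", "wikitext-103-raw-v1", split="train")

# Wikitext headings look like " = Title = " or " = = Section = = ".
heading_re = re.compile(r"^\s*=+ .* =+\s*$")
url_re = re.compile(r"https?://\S+")

def strip_urls(example):
    example["text"] = url_re.sub("", example["text"])
    return example

def keep(example):
    text = example["text"].strip()
    # Drop empty lines and heading-only lines.
    return len(text) > 0 and not heading_re.match(text)

cleaned = dataset.map(strip_urls).filter(keep)

# Optional language filter (question 4), assuming langdetect is installed:
# from langdetect import detect
# def is_english(example):
#     try:
#         return detect(example["text"]) == "en"
#     except Exception:  # langdetect raises on very short / empty strings
#         return False
# cleaned = cleaned.filter(is_english)
```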
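For question 6, the approach I am leaning towards is to skip sentence splitting entirely: tokenize everything without truncation, then concatenate and cut into fixed-size blocks, following the usual group_texts pattern from the run_mlm example. The block size of 512 is just an assumption.

```python
from datasets import load_dataset
from transformers import BertTokenizerFast

block_size = 512
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
dataset = load_dataset("wikitext", "wikitext-103-raw-v1", split="train")

def tokenize(examples):
    # No truncation here; long articles are chunked into blocks below.
    return tokenizer(examples["text"], return_special_tokens_mask=True)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

def group_texts(examples):
    # Concatenate all token sequences, then split into fixed-size blocks.
    concatenated = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = (len(concatenated["input_ids"]) // block_size) * block_size
    return {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }

lm_dataset = tokenized.map(group_texts, batched=True)
```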
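For question 7, if it turns out a new tokenizer is needed, this is how I would train a WordPiece vocab on the corpus while keeping BERT's preprocessing and special tokens, using train_new_from_iterator. The batch size and vocab size are arbitrary choices on my part.

```python
from datasets import load_dataset
from transformers import BertTokenizerFast

dataset = load_dataset("wikitext", "wikitext-103-raw-v1", split="train")

def batch_iterator(batch_size=1000):
    for i in range(0, len(dataset), batch_size):
        yield dataset[i : i + batch_size]["text"]

# Reuse bert-base-uncased's normalization and special tokens, learn a new vocab.
old_tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
new_tokenizer = old_tokenizer.train_new_from_iterator(batch_iterator(), vocab_size=30522)
new_tokenizer.save_pretrained("bert-wikitext-tokenizer")
```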
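For question 8, my current plan is to run the map steps once (with num_proc set for multiprocessing), save the processed dataset to disk, and just reload it in the Colab/TPU training job. Since Datasets are backed by memory-mapped Arrow files, my understanding is that this keeps RAM usage low. The path is a placeholder, and lm_dataset is the grouped dataset from the question-6 sketch.

```python
from datasets import load_from_disk

# Save the fully preprocessed dataset once...
lm_dataset.save_to_disk("wikitext-103-mlm-512")

# ...and reload it in the training job without redoing any preprocessing.
lm_dataset = load_from_disk("wikitext-103-mlm-512")
```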

Edit: The dataset is the Wikipedia dataset, not WikiText.
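
For reference, this is how I am loading it now; the config name encodes the dump date and language, so ‘20231101.en’ is just the snapshot I picked and may differ from what is currently on the Hub.

```python
from datasets import load_dataset

# English Wikipedia; pick whichever dump date is available on the Hub.
wiki = load_dataset("wikimedia/wikipedia", "20231101.en", split="train")
print(wiki[0]["title"], wiki[0]["text"][:300])
```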

I believe these two tutorials should be helpful:

Hope those are helpful!

Cheers
Heiko
