BERT Data Preparation

I am trying to pre-train a BERT-type model from scratch. It will be a BERT-tiny model.
I will combine the Wikipedia data with some of my own data.
I can download the wiki data using the Hugging Face datasets library. My question is what kind of cleaning I need to do after that. There are some non-ASCII characters in the dataset. Should I remove them or normalize them?
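To make the question concrete, this is roughly what I mean by normalizing rather than removing (a minimal sketch; the `wikimedia/wikipedia` snapshot name is just an example config, swap in whichever dump you actually use):

```python
import unicodedata

from datasets import load_dataset

# Example snapshot; any Wikipedia dump/config would work the same way.
wiki = load_dataset("wikimedia/wikipedia", "20231101.en", split="train")

def normalize_text(example):
    # NFKC folds compatibility characters (ligatures, full-width forms,
    # superscript digits, etc.) into their plain equivalents; accented
    # letters and non-Latin scripts are left in place, so this is
    # normalization, not ASCII-stripping.
    example["text"] = unicodedata.normalize("NFKC", example["text"])
    return example

wiki = wiki.map(normalize_text)

# Quick check: how much genuinely non-ASCII text survives normalization.
sample = wiki[0]["text"]
print(sum(ord(ch) > 127 for ch in sample), "non-ASCII chars in the first article")
```

Is something like this enough, or do people do more aggressive cleaning before pre-training?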
Did the original BERT pre-training use the Wikipedia data with all these non-ASCII characters left in? Is there anyone who knows this or has replicated the pre-training results?
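One thing I noticed while looking into this: the `tokenizers` library ships a `BertNormalizer` that appears to mirror the cleanup the original BERT reference code applies at tokenization time (control-character removal, plus lowercasing and accent stripping for the uncased models), which makes me suspect the raw pre-training text kept its non-ASCII characters and the normalization happened inside the tokenizer:

```python
from tokenizers.normalizers import BertNormalizer

# Defaults roughly match the uncased BERT reference preprocessing;
# strip_accents=None means "strip accents iff lowercasing".
normalizer = BertNormalizer(
    clean_text=True,
    handle_chinese_chars=True,
    strip_accents=None,
    lowercase=True,
)

# Accents get stripped, but e.g. the "fi" ligature survives
# (this normalizer does not apply NFKC).
print(normalizer.normalize_str("Café naïve ﬁnding"))
```

But I would still like confirmation from someone who has actually replicated the pre-training results.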
Thanks in advance.
