BERT Data Preparation

I am trying to pre-train a BERT-type model from scratch. It will be a BERT-tiny model.
I will combine the Wikipedia data with some of my own data.
I can download the wiki data using the Hugging Face datasets library. My question is what kind of cleaning I need to do after that. There are some non-ASCII characters in the dataset. Should I remove them or normalize them?
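To make the question concrete, this is roughly what I mean by normalizing rather than removing (a minimal sketch; the `wikimedia/wikipedia` snapshot name is just an example config, swap in whichever dump you actually use):

```python
import unicodedata

from datasets import load_dataset

# Example snapshot; any Wikipedia dump/config would work the same way.
wiki = load_dataset("wikimedia/wikipedia", "20231101.en", split="train")

def normalize_text(example):
    # NFKC folds compatibility characters (ligatures, full-width forms,
    # superscript digits, etc.) into their plain equivalents; accented
    # letters and non-Latin scripts are left in place, so this is
    # normalization, not ASCII-stripping.
    example["text"] = unicodedata.normalize("NFKC", example["text"])
    return example

wiki = wiki.map(normalize_text)

# Quick check: how much genuinely non-ASCII text survives normalization.
sample = wiki[0]["text"]
print(sum(ord(ch) > 127 for ch in sample), "non-ASCII chars in the first article")
```

Is something like this enough, or do people do more aggressive cleaning before pre-training?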
Did the original BERT pre-training use the Wikipedia data with all these non-ASCII characters left in? Is there anyone who knows this or has replicated the pre-training results?
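One thing I noticed while looking into this: the `tokenizers` library ships a `BertNormalizer` that appears to mirror the cleanup the original BERT reference code applies at tokenization time (control-character removal, plus lowercasing and accent stripping for the uncased models), which makes me suspect the raw pre-training text kept its non-ASCII characters and the normalization happened inside the tokenizer:

```python
from tokenizers.normalizers import BertNormalizer

# Defaults roughly match the uncased BERT reference preprocessing;
# strip_accents=None means "strip accents iff lowercasing".
normalizer = BertNormalizer(
    clean_text=True,
    handle_chinese_chars=True,
    strip_accents=None,
    lowercase=True,
)

# Accents get stripped, but e.g. the "fi" ligature survives
# (this normalizer does not apply NFKC).
print(normalizer.normalize_str("Café naïve ﬁnding"))
```

But I would still like confirmation from someone who has actually replicated the pre-training results.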
Thanks in advance.
