BERT/RoBERTa pre-training with run_mlm_flax.py: questions

Hey, I am setting up a pre-training environment for RoBERTa and have a few questions.

  1. I will use Wikipedia to pre-train RoBERTa. I noticed that although the wiki dataset is cleaned, it still contains h2 headings, links, and other non-sentence fragments. Should I remove them beforehand to get a better result? I would say yes, but I can’t see anything like that in the run_mlm_flax.py script. The first sketch below this list shows roughly what I have in mind.
  2. I am currently at line 437 of that script, where the next step is tokenizing each wiki article. Line 452 describes the group_texts function as the "Main data processing function that will concatenate all texts from our dataset and generate chunks of max_seq_length." My question is: am I not losing a lot of the text here? How is RoBERTa trained, do I really just continuously feed sequences of length max_seq_length? I thought the batches would be sentences. And if I do feed these long concatenated sequences, doesn’t it make sense to split the wiki articles beforehand into chunks of at most max_seq_length, so that I don’t lose sentences, and just pad the rest? The second sketch below shows what I mean.
  3. Also, could someone please take a look at this question of mine: python - NonMatchingSplitsSizesError when loading huggingface dataset - Stack Overflow.
  4. I may run into other issues soon. If I can’t resolve one myself, or there is too much uncertainty, should I open a new thread or keep using this one?
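
For point 1, this is roughly the kind of pre-cleaning pass I have in mind. The dataset config, the "text" column and the filtering heuristics are just my own guesses for illustration, none of this is in run_mlm_flax.py:

```python
import re

from datasets import load_dataset

# Placeholder dump/config; I would run this before the tokenization step.
wiki = load_dataset("wikipedia", "20220301.en", split="train")

def clean_article(example):
    kept = []
    for line in example["text"].splitlines():
        line = line.strip()
        # drop empty lines, section headings like "== History ==", and bare URLs
        if not line or line.startswith("==") or re.fullmatch(r"https?://\S+", line):
            continue
        # drop very short fragments that are unlikely to be full sentences
        if len(line.split()) < 5:
            continue
        kept.append(line)
    example["text"] = "\n".join(kept)
    return example

wiki = wiki.map(clean_article, num_proc=4)
```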

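For point 2, here is a minimal sketch of the per-article splitting and padding I am describing, assuming a roberta-base tokenizer and a "text" column; this is my proposed alternative, not what group_texts in the script does:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
max_seq_length = 512

def split_articles(batch):
    all_chunks = []
    for text in batch["text"]:
        # tokenize one article on its own instead of concatenating across articles
        ids = tokenizer(text, add_special_tokens=False)["input_ids"]
        all_chunks.extend(
            ids[i : i + max_seq_length] for i in range(0, len(ids), max_seq_length)
        )
    # pad the (usually shorter) last chunk of each article up to max_seq_length
    padded = tokenizer.pad(
        {"input_ids": all_chunks}, padding="max_length", max_length=max_seq_length
    )
    return {"input_ids": padded["input_ids"], "attention_mask": padded["attention_mask"]}

# usage, e.g.: tokenized = wiki.map(split_articles, batched=True, remove_columns=["text"])
```
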
Thanks.