BERT/RoBERTa pre-training with run_mlm_flax.py: questions

Hey, I am setting up a pre-training environment for RoBERTa and have a few questions.

  1. I will use Wikipedia to pre-train RoBERTa. I noticed that although the wiki dataset is cleaned, it still contains h2 headings, links, and other non-sentence fragments. Should I remove them beforehand to get a better result? I would say yes, but I can’t see anything like that in the run_mlm_flax.py script. The first sketch below this list shows roughly what I have in mind.
  2. I am currently at line 437 of that script, where the next step is tokenizing each wiki article. Line 452 describes the group_texts function as the "Main data processing function that will concatenate all texts from our dataset and generate chunks of max_seq_length." My question is: am I not losing a lot of the text here? How is RoBERTa trained, do I really just continuously feed sequences of length max_seq_length? I thought the batches would be sentences. And if I do feed these long concatenated sequences, doesn’t it make sense to split the wiki articles beforehand into chunks of at most max_seq_length, so that I don’t lose sentences, and just pad the rest? The second sketch below shows what I mean.
  3. Also, could someone please take a look at this question of mine: python - NonMatchingSplitsSizesError when loading huggingface dataset - Stack Overflow.
  4. I may run into other issues soon. If I can’t resolve one myself, or there is too much uncertainty, should I open a new thread or keep using this one?
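
For point 1, this is roughly the kind of pre-cleaning pass I have in mind. The dataset config, the "text" column and the filtering heuristics are just my own guesses for illustration, none of this is in run_mlm_flax.py:

```python
import re

from datasets import load_dataset

# Placeholder dump/config; I would run this before the tokenization step.
wiki = load_dataset("wikipedia", "20220301.en", split="train")

def clean_article(example):
    kept = []
    for line in example["text"].splitlines():
        line = line.strip()
        # drop empty lines, section headings like "== History ==", and bare URLs
        if not line or line.startswith("==") or re.fullmatch(r"https?://\S+", line):
            continue
        # drop very short fragments that are unlikely to be full sentences
        if len(line.split()) < 5:
            continue
        kept.append(line)
    example["text"] = "\n".join(kept)
    return example

wiki = wiki.map(clean_article, num_proc=4)
```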

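For point 2, here is a minimal sketch of the per-article splitting and padding I am describing, assuming a roberta-base tokenizer and a "text" column; this is my proposed alternative, not what group_texts in the script does:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
max_seq_length = 512

def split_articles(batch):
    all_chunks = []
    for text in batch["text"]:
        # tokenize one article on its own instead of concatenating across articles
        ids = tokenizer(text, add_special_tokens=False)["input_ids"]
        all_chunks.extend(
            ids[i : i + max_seq_length] for i in range(0, len(ids), max_seq_length)
        )
    # pad the (usually shorter) last chunk of each article up to max_seq_length
    padded = tokenizer.pad(
        {"input_ids": all_chunks}, padding="max_length", max_length=max_seq_length
    )
    return {"input_ids": padded["input_ids"], "attention_mask": padded["attention_mask"]}

# usage, e.g.: tokenized = wiki.map(split_articles, batched=True, remove_columns=["text"])
```
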
Thanks.