Hey, I am setting up a pre-training environment for RoBERTa… I have a few questions.
- I will use Wikipedia to pre-train RoBERTa. I noticed that although the wiki dataset is cleaned, it still contains h2 headings, links, and other non-sentence content. Should I remove these beforehand to get a better result? I would say yes, but I can’t see anything like that in the `run_mlm_flax.py` script. (Roughly what I have in mind is in the cleanup sketch after this list.)
- I am currently at line 437 of the aforementioned script. The next step would be tokenizing each wiki article. The script suggests in line 452 that the `group_texts` function is the "Main data processing function that will concatenate all texts from our dataset and generate chunks of max_seq_length." My question is: am I not losing a lot of the text here? How is RoBERTa trained, do I just continuously feed sequences of length `max_seq_length`? I thought the batches would be sentences. And if I really do feed these long sequences, doesn’t it make sense to split the wiki articles beforehand into chunks shorter than `max_seq_length`, so that I don’t lose sentences, and just pad the rest? (My reading of `group_texts` is in the second sketch after this list.)
- Also, could you please answer this question of mine: python - NonMatchingSplitsSizesError when loading huggingface dataset - Stack Overflow.
- I may run into other issues soon; should I open a new thread, or keep using this one in case I can’t resolve them myself / there is too much uncertainty?
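
For the first point, this is roughly the kind of pre-filter I have in mind. The heading/link patterns and the `text` column name are just assumptions about how the dump I am looking at is formatted, not something taken from `run_mlm_flax.py`:

```python
# Rough cleanup sketch (assumptions: articles sit in a "text" column, headings
# survive as "== Heading ==" lines, and links survive as "[[target|label]]"
# markup; the patterns would need adjusting to whatever the actual dump contains).
import re
from datasets import load_dataset

heading_re = re.compile(r"^=+\s*.*?\s*=+\s*$")           # e.g. "== History =="
link_re = re.compile(r"\[\[(?:[^|\]]*\|)?([^\]]*)\]\]")  # keep only the link label

def clean_article(example):
    kept = []
    for line in example["text"].splitlines():
        if heading_re.match(line.strip()):
            continue                      # drop section headings entirely
        kept.append(link_re.sub(r"\1", line))
    example["text"] = "\n".join(kept)
    return example

wiki = load_dataset("wikipedia", "20220301.en", split="train")
wiki = wiki.map(clean_article)
```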
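
And for the `group_texts` question, here is my reading of what the function does, paraphrased as a standalone sketch rather than the exact code from the script (`max_seq_length = 128` is only for illustration): all tokenized articles in a batch get concatenated and cut into fixed-size chunks, so sentences in the middle are not dropped, only the short remainder at the end of each batch is.

```python
# Paraphrase of the concatenate-and-chunk idea behind group_texts
# (not copied from run_mlm_flax.py; max_seq_length is an illustrative value).
from itertools import chain

max_seq_length = 128

def group_texts(examples):
    # examples["input_ids"] etc. are lists of token-id lists, one per article.
    concatenated = {k: list(chain(*examples[k])) for k in examples.keys()}
    total_length = len(concatenated[list(examples.keys())[0]])
    # Drop only the remainder shorter than max_seq_length at the end of the batch.
    total_length = (total_length // max_seq_length) * max_seq_length
    # Cut the long stream into back-to-back chunks of exactly max_seq_length.
    return {
        k: [t[i : i + max_seq_length] for i in range(0, total_length, max_seq_length)]
        for k, t in concatenated.items()
    }
```

If that really is how RoBERTa is fed (no sentence boundaries, no padding), then my per-article split-and-pad idea would mostly just add padding tokens, which is exactly what I am trying to confirm.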
Thanks.