In the RoBERTa paper, the model is trained with dynamic masking of sentences.
If a RoBERTa model is further pre-trained using
run_mlm.py, are the sentences dynamically masked during pre-training with this script, or are they statically masked as in vanilla BERT?
The script is found here
That's actually cool
Have an awesome day, Niels!
I have one small question:
What are the checkpoint folders that are created during pre-training for? These folders are eating up my hard drive space when training on large text files (or when I was pre-training line by line).
Hey, I don’t quite understand what is being said in the discussion linked by @nielsr.
Can someone please confirm the claim and explain it?
They are talking about the dynamic masking that was mentioned in the RoBERTa paper. The original BERT paper did not use dynamic masking but static masking, meaning they masked all sentences before training and reused those same masks across multiple epochs. Dynamic masking is better because it masks sentences during training, so each epoch can see different masks. Since the masks are different each time, the learning is more challenging and thus converges more slowly than with the original BERT approach.
All masked language model implementations on Hugging Face (including BertModel) use data collators to mask sentences. Collators are inherently dynamic because they are applied during training, when each batch is built. If you want to use static masking, you will have to code that yourself.
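To make the difference concrete, here is a minimal sketch in plain Python. It is not the Hugging Face code: `mask_tokens`, `collate`, the toy corpus and the `<mask>` string are all made up for illustration, and the masking is simplified (the real collator also sometimes substitutes a random token or leaves the token unchanged).

```python
import random

MASK = "<mask>"  # placeholder mask token for this sketch


def mask_tokens(tokens, rng, prob=0.15):
    """Replace each token with MASK independently with probability `prob`."""
    return [MASK if rng.random() < prob else tok for tok in tokens]


corpus = [
    "the quick brown fox jumps over the lazy dog".split(),
    "each epoch can see a different set of masks".split(),
]
NUM_EPOCHS = 3

# Static masking: mask once before training, then reuse the result every epoch.
static_rng = random.Random(0)
static_corpus = [mask_tokens(sent, static_rng) for sent in corpus]
static_epochs = [static_corpus for _ in range(NUM_EPOCHS)]


# Dynamic masking: a collate-style function masks each batch as it is built,
# so every epoch draws fresh masks.
def collate(batch, rng):
    return [mask_tokens(sent, rng) for sent in batch]


dynamic_rng = random.Random(0)
dynamic_epochs = [collate(corpus, dynamic_rng) for _ in range(NUM_EPOCHS)]

for epoch in range(NUM_EPOCHS):
    print("static :", " ".join(static_epochs[epoch][0]))
    print("dynamic:", " ".join(dynamic_epochs[epoch][0]))
```

The static lines are identical in every epoch by construction, while the dynamic lines are redrawn each time and will generally differ (with short toy sentences they can occasionally coincide).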
I hope that clarifies it for you.
Thanks for the explanation! I still don’t quite get the part where the paper says each sentence is duplicated 10 times and passed into the masking function, though. The DataCollator does not do that, right?
It is explained in the RoBERTa paper, section 4.1 (static vs. dynamic masking). The duplication belongs to the static masking setup: they duplicated their corpus 10 times and masked each copy once during preprocessing, so every sentence gets 10 different masks instead of a single one. Training ran for 40 epochs (10 copies * 4), so each sentence is seen with the same mask 4 times.
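The arithmetic can be sketched like this. Again, this is only an illustration: `build_static_copies` and the toy corpus are hypothetical, and cycling through one copy per epoch is my assumption about the scheduling, chosen only to show where the "4 times" comes from.

```python
import random

MASK = "<mask>"   # placeholder mask token for this sketch
NUM_COPIES = 10   # the corpus is duplicated 10 times
NUM_EPOCHS = 40   # total training epochs


def mask_tokens(tokens, rng, prob=0.15):
    """Replace each token with MASK independently with probability `prob`."""
    return [MASK if rng.random() < prob else tok for tok in tokens]


def build_static_copies(corpus, num_copies, seed=0):
    """Pre-mask the whole corpus `num_copies` times during preprocessing."""
    rng = random.Random(seed)
    return [[mask_tokens(sent, rng) for sent in corpus] for _ in range(num_copies)]


corpus = ["static masks are fixed before training starts".split()]
copies = build_static_copies(corpus, NUM_COPIES)

# Assumed schedule: each epoch uses one pre-masked copy, cycling through all 10.
uses_per_copy = [0] * NUM_COPIES
for epoch in range(NUM_EPOCHS):
    uses_per_copy[epoch % NUM_COPIES] += 1

reuse_per_mask = NUM_EPOCHS // NUM_COPIES
print(uses_per_copy)
print(reuse_per_mask)
```

Every pre-masked copy is reused the same number of times, which is the 40 / 10 = 4 from the post above.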
As for the DataCollator class (DataCollatorForLanguageModeling), you have to understand that the Hugging Face code doesn’t follow the original papers exactly. The collator masks a batch of sentences in real time. So there is no duplication, but it is still dynamic masking. The only difference is that the masks may be different in every epoch. Also, at least in the case of the torch implementation, masking is sampled over the whole batch and not per sentence, which means that sometimes a sentence might have no masks whatsoever, and at other times a sentence might end up with more than 15% of its tokens masked. Took me a while to figure that out…
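Here is a rough sketch of why the per-sentence mask rate varies. It assumes each token position gets its own independent draw with probability 0.15 and ignores special tokens and padding. `sample_mask_matrix`, `prob_no_mask` and the sentence lengths are hypothetical names and values for illustration, not the library's code.

```python
import random


def sample_mask_matrix(lengths, prob, rng):
    """Draw one independent True/False per token position in the batch."""
    return [[rng.random() < prob for _ in range(n)] for n in lengths]


def prob_no_mask(length, prob):
    """Chance that a sentence of `length` tokens receives no mask at all."""
    return (1 - prob) ** length


rng = random.Random(0)
lengths = [8, 12, 20, 6]  # hypothetical sentence lengths in tokens
mask = sample_mask_matrix(lengths, 0.15, rng)

# Fraction of masked tokens in each sentence of the batch.
fractions = [sum(row) / len(row) for row in mask]

for n, frac in zip(lengths, fractions):
    print(f"length={n:2d} masked={frac:.0%} P(no mask)={prob_no_mask(n, 0.15):.2f}")
```

Only the average over many tokens approaches 15%. For short sentences the spread is large: under these assumptions an 8-token sentence ends up with no mask at all roughly 27% of the time.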
Thanks a lot for your reply. Really helped clear things up