Is the huggingface run_mlm Script dynamically masked?

moma1820 · August 20, 2021, 9:51am

Hi,
In the RoBERTa paper, the model is trained with dynamic masking of sentences.
If a RoBERTa model is further pre-trained using run_mlm.py, are the sentences dynamically masked during pre-training, or are they statically masked like in vanilla BERT?

The script is found here

nielsr · August 21, 2021, 8:33am

Apparently, dynamic masking is always used, no matter the model. See this: dynamic masking for RoBERTa model · Issue #5979 · huggingface/transformers · GitHub

moma1820 · August 21, 2021, 9:53am

Hello

That's actually cool

Have an awesome day Niels!

moma1820 · August 25, 2021, 12:29pm

Hi again,
I have one small question,
What are the checkpoint folders that are created during pre-training for? These folders are eating up my hard drive space when training on large text files (or when I was pre-training line by line).

anon81579828 · May 27, 2022, 6:06am

Hey, I don’t quite understand what is being said in the discussion linked by @nielsr.
Can someone please confirm the claim and explain it?

lezakkaz · May 27, 2022, 4:47pm

They are talking about the dynamic masking described in the RoBERTa paper. The original BERT implementation did not use dynamic masking but static masking: the sentences were masked once before training, and those same masks were reused across multiple epochs. Dynamic masking is better because it masks sentences during training, so each epoch can see different masks for the same sentence. Since the masks are different each time, the learning is more challenging and may therefore converge more slowly than the original BERT approach.

All masked language model implementations on Hugging Face (including BertModel) use data collators to mask sentences. Collators are inherently dynamic because they are applied to each batch during training. If you want to use static masking, you will have to code that yourself.
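To make the "masking happens when the batch is built" idea concrete, here is a minimal sketch in plain Python. It is a simplified stand-in, not the library's code: `mask_tokens` and `collate` are hypothetical helpers, and the sketch always substitutes a `[MASK]` placeholder, whereas the real collator also leaves some selected tokens unchanged or swaps in random tokens.

```python
import random

MASK = "[MASK]"  # placeholder mask token used only in this sketch


def mask_tokens(tokens, rng, mlm_probability=0.15):
    """Return a copy of `tokens` with each token independently replaced by MASK."""
    return [MASK if rng.random() < mlm_probability else tok for tok in tokens]


def collate(batch, rng):
    """Toy collator (hypothetical): masks are drawn when the batch is built."""
    return [mask_tokens(sentence, rng) for sentence in batch]


rng = random.Random(0)
batch = [["the", "cat", "sat", "on", "the", "mat"] * 5]

# The same raw batch is collated once per epoch, so each epoch draws new masks.
epoch_views = [collate(batch, rng) for _ in range(3)]
for epoch, view in enumerate(epoch_views):
    positions = [i for i, tok in enumerate(view[0]) if tok == MASK]
    print(f"epoch {epoch}: masked positions {positions}")
```

Because nothing is masked ahead of time, the stored dataset stays unmasked and every pass over it can produce a different pattern.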

I hope that clarifies it for you.

anon81579828 · May 29, 2022, 6:12am

Thanks for the explanation! I still don’t quite get the part where the paper says each sentence is duplicated 10 times and passed into the masking function, though. The DataCollator does not do that, right?

lezakkaz · May 29, 2022, 3:17pm

It is explained in the RoBERTa paper, Section 4.1, "Static vs. Dynamic Masking". Basically, for their static masking setup they duplicated the corpus 10 times and masked each copy separately, so every sentence gets 10 different masks. They trained for 40 epochs (10 duplicated copies * 4), so each sentence is seen with the same mask 4 times. For dynamic masking, they instead generate a new mask every time a sentence is fed to the model.
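The arithmetic can be checked with a small sketch. This is an illustration only: `build_static_corpus` is a hypothetical helper, and cycling through the copies in order (epoch `e` uses copy `e % 10`) is an assumption about scheduling, not something stated in this thread.

```python
import random

MASK = "[MASK]"  # placeholder mask token used only in this sketch


def mask_tokens(tokens, rng, mlm_probability=0.15):
    """Return a copy of `tokens` with each token independently replaced by MASK."""
    return [MASK if rng.random() < mlm_probability else tok for tok in tokens]


def build_static_corpus(corpus, num_copies, seed=0):
    """Mask every sentence once per copy, before training starts (hypothetical helper)."""
    rng = random.Random(seed)
    return [[mask_tokens(s, rng) for s in corpus] for _ in range(num_copies)]


corpus = [["a", "b", "c", "d", "e", "f", "g", "h"]]
num_copies, total_epochs = 10, 40
static_copies = build_static_corpus(corpus, num_copies)

# Assumed schedule: epoch e trains on pre-masked copy e % num_copies.
schedule = [epoch % num_copies for epoch in range(total_epochs)]
times_each_mask_seen = schedule.count(0)
print(times_each_mask_seen)  # 40 epochs / 10 copies = 4
```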

As for the data collator class (DataCollatorForLanguageModeling), you have to understand that its code doesn’t follow the original papers exactly. The collator masks a batch of sentences on the fly. So there is no duplication, but the masking is still dynamic. The only difference is that the masks may be different in every epoch instead of repeating. Also, at least in the torch implementation, masking is done over the whole batch rather than per sentence, which means that some sentences might have no masks at all, while others might end up with more than 15% of their tokens masked. Took me a while to figure that out…
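The last point can be simulated in plain Python. This sketch assumes each token is selected independently with probability 0.15, which is my understanding of how the torch collator samples masks; `sample_mask` is a hypothetical helper, and the 10-token sentence length is arbitrary.

```python
import random


def sample_mask(num_tokens, rng, mlm_probability=0.15):
    """Independently select each token position with probability mlm_probability."""
    return [rng.random() < mlm_probability for _ in range(num_tokens)]


rng = random.Random(0)
sentence_length = 10  # arbitrary short sentence for illustration
fractions = []
for _ in range(1000):
    mask = sample_mask(sentence_length, rng)
    fractions.append(sum(mask) / sentence_length)

unmasked = sum(f == 0.0 for f in fractions)
over_target = sum(f > 0.15 for f in fractions)
print(f"mean masked fraction: {sum(fractions) / len(fractions):.3f}")
print(f"sentences with no masks: {unmasked}, above 15%: {over_target}")
```

The 15% rate holds on average across many tokens, but any single short sentence can land well above or below it.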

anon81579828 · June 1, 2022, 11:46pm

Thanks a lot for your reply. Really helped clear things up
