Does the masked language modeling DataCollator resemble BERT exactly? If not, how can I make it behave like BERT?

I was reading the RoBERTa paper, and it seems that DataCollatorForLanguageModeling does not perform the same masked language modeling masking.
The authors describe BERT's masked token selection:
“BERT uniformly selects 15% of the input tokens for possible replacement. Of the selected tokens, 80% are replaced with [MASK], 10% are left unchanged, and 10% are replaced by a randomly selected vocabulary token.”

It seems that DataCollatorForLanguageModeling does not perform this replacement scheme (80% with [MASK], 10% left unchanged, and 10% replaced by a random token), and I could not find it mentioned in the documentation.

How can I do that for masked language models?
Is there a way to use a DataCollator to get exactly the same text processing that BERT does?

Thanks in advance.

The DataCollatorForLanguageModeling does exactly what you’re describing by default.
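You can check this yourself with a minimal sketch like the one below (the checkpoint name and example sentence are just placeholders). Positions selected for prediction keep their original id in `labels` (unselected positions are set to -100), and inspecting `input_ids` over many batches shows roughly the 80/10/10 split of [MASK] / random token / unchanged.

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

# Placeholder checkpoint; any BERT-style tokenizer works the same way.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# mlm_probability=0.15 is already the default; it is written out here only
# to make the "15% of the input tokens" selection explicit.
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.15,
)

# Collate a single tokenized example into a masked batch.
batch = data_collator(
    [tokenizer("The quick brown fox jumps over the lazy dog.")]
)

# `labels` holds the original ids at the selected positions and -100 elsewhere;
# compare with `input_ids` to see which selected tokens were masked, randomized,
# or left unchanged.
print(batch["input_ids"])
print(batch["labels"])
```

Since the masking is random, any single sentence may not show all three cases, but over a larger batch the proportions converge to the 80/10/10 scheme described in the paper.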
