Does the masked language modeling DataCollator resemble BERT exactly? If not, how can I make it behave like BERT?

I was reading the RoBERTa paper, and it seems that DataCollatorForLanguageModeling does not perform the same masked language modeling masking.
The authors describe BERT's masked token selection:
“BERT uniformly selects 15% of the input tokens for possible replacement. Of the selected tokens, 80% are replaced with [MASK], 10% are left unchanged, and 10% are replaced by a randomly selected vocabulary token.”

It seems that DataCollatorForLanguageModeling does not perform this replacement scheme (80% with [MASK], 10% left unchanged, and 10% replaced by a random token), and I could not find it mentioned in the documentation.

How can I do that for masked language models?
Is there a way to use a DataCollator to get exactly the same text processing that BERT does?

Thanks in advance.

The DataCollatorForLanguageModeling does exactly what you’re describing by default.
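You can check this yourself with a minimal sketch like the one below (the checkpoint name and example sentence are just placeholders). Positions selected for prediction keep their original id in `labels` (unselected positions are set to -100), and inspecting `input_ids` over many batches shows roughly the 80/10/10 split of [MASK] / random token / unchanged.

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

# Placeholder checkpoint; any BERT-style tokenizer works the same way.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# mlm_probability=0.15 is already the default; it is written out here only
# to make the "15% of the input tokens" selection explicit.
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.15,
)

# Collate a single tokenized example into a masked batch.
batch = data_collator(
    [tokenizer("The quick brown fox jumps over the lazy dog.")]
)

# `labels` holds the original ids at the selected positions and -100 elsewhere;
# compare with `input_ids` to see which selected tokens were masked, randomized,
# or left unchanged.
print(batch["input_ids"])
print(batch["labels"])
```

Since the masking is random, any single sentence may not show all three cases, but over a larger batch the proportions converge to the 80/10/10 scheme described in the paper.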
