BERT MLM - 80% [MASK], 10% random words and 10% same word - how does this work?

I have noticed (from the original BERT paper) that in the MLM training procedure, the authors decide to mask 15% of the words in a sentence.

The masked words are distributed as follows:

  1. 80% are replaced with the [MASK] token (which makes perfect sense: just teach the model to predict a word given its left and right context)
  2. 10% are replaced by a random word. This makes some sense to me (see the Stack Overflow question "Why BERT model have to keep 10% MASK token unchanged?"). My understanding is that this way the model learns to be influenced by the word it is trying to predict: it does not consider only the left and right parts of the sentence, but also the input token at that position itself. Replacing it with a random word therefore teaches the model to actually look at that input token, and since the percentage is very small (1.5% of all tokens), it should not confuse the model too much, so this might be beneficial.
  3. 10% of the words are left unchanged. This I completely don't understand. For example, I don't see the difference between {90% masked with [MASK], 10% replaced with a random word} and {80% [MASK], 10% random, 10% same word}. The authors state: "The purpose of this is to bias the representation towards the actual observed word." Isn't that exactly the purpose of the random-word replacement? The only explanation that makes sense to me is that the random-word replacement teaches the model to consider the input token at the predicted position, while keeping the same word counters the effect of the random word so that the model does not get confused. Is that the logic?
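
To make the mechanics concrete, here is a minimal sketch of how I understand the selection and 80/10/10 replacement. This is my own toy illustration (the vocabulary, the function name `mlm_corrupt`, and the per-token sampling are my assumptions), not the actual BERT preprocessing code:

```python
import random

# Toy word-level vocabulary for illustration; real BERT works on WordPiece ids.
VOCAB = ["the", "cat", "sat", "on", "mat", "dog", "ran"]
MASK_TOKEN = "[MASK]"

def mlm_corrupt(tokens, select_prob=0.15, seed=None):
    """Return (corrupted_tokens, target_positions) following the 80/10/10 rule.

    About 15% of positions are chosen as prediction targets. Of those:
      - 80% are replaced with [MASK]
      - 10% are replaced with a random vocabulary word
      - 10% are left unchanged (but the model is still asked to predict them)
    """
    rng = random.Random(seed)
    corrupted = list(tokens)
    targets = []
    for i, tok in enumerate(tokens):
        if rng.random() >= select_prob:
            continue  # not selected as a prediction target
        targets.append(i)  # the MLM loss is computed at this position either way
        r = rng.random()
        if r < 0.8:
            corrupted[i] = MASK_TOKEN          # 80%: replace with [MASK]
        elif r < 0.9:
            corrupted[i] = rng.choice(VOCAB)   # 10%: replace with a random word
        # else: 10% keep the original token unchanged
    return corrupted, targets

if __name__ == "__main__":
    sentence = "the cat sat on the mat and the dog ran".split()
    corrupted, targets = mlm_corrupt(sentence, seed=0)
    print(corrupted)
    print("predict the original tokens at positions:", targets)
```

Note in particular that, as I read the paper, the prediction loss is computed at all 15% of the selected positions, including the 10% that are left unchanged; the sketch above only reflects that reading.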