BERT MLM - 80% [MASK], 10% random words and 10% same word - how does this work?

I have noticed (from the original BERT paper) that in the MLM training procedure, the authors decide to mask 15% of the words in a sentence.

The masked words are distributed as follows:

  1. 80% are replaced with the [MASK] token (which makes perfect sense: just teach the model to predict a word given its left and right context)
  2. 10% are replaced by a random word. This makes some sense to me (see the Stack Overflow question "Why BERT model have to keep 10% MASK token unchanged?"). My understanding is that this way the model learns to be influenced by the word it is trying to predict: it does not consider only the left and right parts of the sentence, but also the input token at that position itself. Replacing it with a random word therefore teaches the model to actually look at that input token, and since the percentage is very small (1.5% of all tokens), it should not confuse the model too much, so this might be beneficial.
  3. 10% of the words are left unchanged. This I completely don't understand. For example, I don't see the difference between {90% masked with [MASK], 10% replaced with a random word} and {80% [MASK], 10% random, 10% same word}. The authors state: "The purpose of this is to bias the representation towards the actual observed word." Isn't that exactly the purpose of the random-word replacement? The only explanation that makes sense to me is that the random-word replacement teaches the model to consider the input token at the predicted position, while keeping the same word counters the effect of the random word so that the model does not get confused. Is that the logic?
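
To make the mechanics concrete, here is a minimal sketch of how I understand the selection and 80/10/10 replacement. This is my own toy illustration (the vocabulary, the function name `mlm_corrupt`, and the per-token sampling are my assumptions), not the actual BERT preprocessing code:

```python
import random

# Toy word-level vocabulary for illustration; real BERT works on WordPiece ids.
VOCAB = ["the", "cat", "sat", "on", "mat", "dog", "ran"]
MASK_TOKEN = "[MASK]"

def mlm_corrupt(tokens, select_prob=0.15, seed=None):
    """Return (corrupted_tokens, target_positions) following the 80/10/10 rule.

    About 15% of positions are chosen as prediction targets. Of those:
      - 80% are replaced with [MASK]
      - 10% are replaced with a random vocabulary word
      - 10% are left unchanged (but the model is still asked to predict them)
    """
    rng = random.Random(seed)
    corrupted = list(tokens)
    targets = []
    for i, tok in enumerate(tokens):
        if rng.random() >= select_prob:
            continue  # not selected as a prediction target
        targets.append(i)  # the MLM loss is computed at this position either way
        r = rng.random()
        if r < 0.8:
            corrupted[i] = MASK_TOKEN          # 80%: replace with [MASK]
        elif r < 0.9:
            corrupted[i] = rng.choice(VOCAB)   # 10%: replace with a random word
        # else: 10% keep the original token unchanged
    return corrupted, targets

if __name__ == "__main__":
    sentence = "the cat sat on the mat and the dog ran".split()
    corrupted, targets = mlm_corrupt(sentence, seed=0)
    print(corrupted)
    print("predict the original tokens at positions:", targets)
```

Note in particular that, as I read the paper, the prediction loss is computed at all 15% of the selected positions, including the 10% that are left unchanged; the sketch above only reflects that reading.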