Hi all. I just finished reading the original BERT paper, but I'm confused about the masked language model. Of the selected tokens, 80% are replaced with [MASK], 10% are kept unchanged, and 10% are replaced with a random word. What happens to the model's learning process under each of these three conditions? (I've added a rough sketch of my understanding of the procedure below.)
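To make sure I'm reading the procedure correctly, here is a minimal sketch of how I understand the 80/10/10 masking step. The token list, vocabulary, and `mask_tokens` function are just placeholders for illustration, not BERT's actual implementation:

```python
import random

def mask_tokens(tokens, vocab, select_prob=0.15):
    """My understanding of BERT-style masking: select ~15% of tokens,
    then 80% -> [MASK], 10% -> random word, 10% -> left unchanged."""
    inputs, labels = [], []
    for tok in tokens:
        if random.random() < select_prob:
            labels.append(tok)  # the model must predict the original token here
            r = random.random()
            if r < 0.8:
                inputs.append("[MASK]")              # 80%: replace with [MASK]
            elif r < 0.9:
                inputs.append(random.choice(vocab))  # 10%: replace with a random word
            else:
                inputs.append(tok)                   # 10%: keep the original word
        else:
            inputs.append(tok)
            labels.append(None)  # not selected -> no prediction target
    return inputs, labels

tokens = "my dog is hairy".split()
vocab = ["apple", "runs", "blue", "dog", "table"]  # toy vocabulary
print(mask_tokens(tokens, vocab))
```

In all three cases the loss is computed only on the selected positions, so the model is always asked to recover the original word at those positions.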
Also, it seems to me that even when a selected word is left unmasked, the model can still learn something from the sentence, so why is there a label leakage problem? Is there a straightforward explanation of what label leakage means in this context?