When training a masked language model (Masked LM), the number of masked words determines the number of targets, so it looks like a multi-label classification problem. For example, if the vocabulary size is 30,000 and there are three masked words in the input, then the combined target vector has 29,997 zeros and only 3 ones.
I have two questions:
- In this kind of model, do we update the weights for all 30,000 vocabulary entries at every step?
- Which loss function is used in these kinds of problems? (Since the targets may not be a true one-hot encoding.)
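For context, one common way to frame this: rather than a single multi-label target over the whole vocabulary, each masked position is usually treated as its own multi-class prediction, and an average cross-entropy is taken over the masked positions only. Below is a minimal pure-Python sketch of that per-position cross-entropy idea (the function names, toy vocabulary size, and logits are made up for illustration, not taken from any particular library):

```python
import math

def softmax(logits):
    # Numerically stable softmax over one position's vocabulary logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def masked_lm_loss(logits_per_position, target_ids):
    # Average cross-entropy over the masked positions only;
    # each position is a separate multi-class problem over the vocab.
    total = 0.0
    for logits, target in zip(logits_per_position, target_ids):
        probs = softmax(logits)
        total += -math.log(probs[target])
    return total / len(target_ids)

# Toy example: vocabulary of 5 tokens, two masked positions.
logits = [
    [2.0, 0.5, 0.1, -1.0, 0.0],   # logits at masked position 1
    [0.0, 3.0, 0.2, 0.1, -0.5],   # logits at masked position 2
]
targets = [0, 1]  # true token ids at the two masked positions
loss = masked_lm_loss(logits, targets)
```

Under this framing the "29,997 zeros and 3 ones" vector is really three separate one-hot vectors, one per masked position, so the ordinary categorical cross-entropy applies at each one.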