When training a masked language model (Masked LM), the number of masked words determines the number of targets, so it looks like a multi-label classification problem. For example, if the vocabulary size is 30,000 and there are three masked words in the input, then the combined target vector has 29,997 zeros and only 3 ones.
I have two questions:
- In this kind of model, do we update the weights for all 30,000 vocabulary entries at every step?
- Which loss function is used in these kinds of problems? (Since the targets may not be a true one-hot encoding.)
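For context, one common way to frame this: rather than a single multi-label target over the whole vocabulary, each masked position is usually treated as its own multi-class prediction, and an average cross-entropy is taken over the masked positions only. Below is a minimal pure-Python sketch of that per-position cross-entropy idea (the function names, toy vocabulary size, and logits are made up for illustration, not taken from any particular library):

```python
import math

def softmax(logits):
    # Numerically stable softmax over one position's vocabulary logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def masked_lm_loss(logits_per_position, target_ids):
    # Average cross-entropy over the masked positions only;
    # each position is a separate multi-class problem over the vocab.
    total = 0.0
    for logits, target in zip(logits_per_position, target_ids):
        probs = softmax(logits)
        total += -math.log(probs[target])
    return total / len(target_ids)

# Toy example: vocabulary of 5 tokens, two masked positions.
logits = [
    [2.0, 0.5, 0.1, -1.0, 0.0],   # logits at masked position 1
    [0.0, 3.0, 0.2, 0.1, -0.5],   # logits at masked position 2
]
targets = [0, 1]  # true token ids at the two masked positions
loss = masked_lm_loss(logits, targets)
```

Under this framing the "29,997 zeros and 3 ones" vector is really three separate one-hot vectors, one per masked position, so the ordinary categorical cross-entropy applies at each one.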