Hello, I have a question about MLM. With MLM, the process typically masks a proportion of the tokens and the model has to predict the actual token at each masked position. How do models handle multiple masks in one sequence? It seems like a model can only predict one token at a time, since the head should be a softmax over the vocabulary size, correct?
During training, through multi-head attention the model learns structure, grammar, and so on; in short, it learns the linguistics of the language. One thing to note here: they don't mask only one word, they mask many words in the same sequence. A sketch of what that looks like is below.
After that, the model is fine-tuned on specific tasks such as QA, summarization, generation, and so on.
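Here is a minimal sketch of that masking step, assuming a BERT-style setup with PyTorch and a Hugging Face tokenizer. The `bert-base-uncased` checkpoint and the 15% masking rate are just common defaults chosen for illustration, and the real BERT recipe additionally replaces only 80% of selected tokens with `[MASK]` (10% random token, 10% unchanged), which is simplified away here.

```python
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

text = "The quick brown fox jumps over the lazy dog."
enc = tokenizer(text, return_tensors="pt")
input_ids = enc["input_ids"].clone()
labels = input_ids.clone()

# Select ~15% of positions at random for masking, excluding special tokens.
special = torch.tensor(
    tokenizer.get_special_tokens_mask(
        input_ids[0].tolist(), already_has_special_tokens=True
    ),
    dtype=torch.bool,
)
prob = torch.full(input_ids.shape, 0.15)
prob[0, special] = 0.0
mask_positions = torch.bernoulli(prob).bool()

# Only masked positions contribute to the loss; label -100 is ignored by the loss.
labels[~mask_positions] = -100
input_ids[mask_positions] = tokenizer.mask_token_id

print(tokenizer.decode(input_ids[0]))  # sentence with several [MASK] tokens at once
```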
Thanks for the response, but I don't think I made my question clear. I know they mask multiple words, but how do they handle predicting the multiple masks in one sample/sequence? If there were only one token to predict I would understand how they do this, but I don't get how they handle multiple. Is the fill-mask task only a subset of MLM?
When multiple words are masked, the model predicts each masked token independently in the same forward pass: the output head produces a softmax over the vocabulary at every masked position, not just one. If the first masked token has probability p1 and the second has probability p2, then the probability of the whole filled-in sequence is p1 * p2, because the predictions are treated as independent given the rest of the context.
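To make this concrete, here is a short sketch assuming the Hugging Face transformers library and the `bert-base-uncased` checkpoint (my choice for illustration). It shows that a single forward pass yields a vocabulary-sized distribution at every masked position, and that multiplying the per-mask probabilities gives the p1 * p2 expression above.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

text = "The [MASK] sat on the [MASK]."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # shape: [1, seq_len, vocab_size]

# Every masked position gets its own distribution over the vocabulary
# from the same forward pass; no token-by-token decoding is needed.
mask_positions = (
    inputs["input_ids"][0] == tokenizer.mask_token_id
).nonzero(as_tuple=True)[0]

joint_prob = 1.0
for pos in mask_positions:
    probs = logits[0, pos].softmax(dim=-1)
    top_prob, top_id = probs.max(dim=-1)
    token = tokenizer.convert_ids_to_tokens(top_id.item())
    print(f"position {pos.item()}: {token} (p={top_prob.item():.3f})")
    joint_prob *= top_prob.item()  # p1 * p2 under the independence assumption

print(f"joint probability of the fills: {joint_prob:.4f}")
```

Each masked position is filled with its own top candidate here; the fill-mask pipeline in transformers is built on the same MLM head, used at inference time.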