Hello, I have a question about MLM. With MLM, the process typically masks a proportion of the tokens and the model has to predict the actual token at each masked position. How do models handle multiple masks in one sequence? It seems like a model can only predict one token at a time, since the head should be a softmax over the vocabulary size, correct?
During training, through multi-head attention the model learns structure, grammar, and so on; in short, it learns the linguistics of the language. One thing to note here: they don't mask only one word, they mask many words in the same sequence. A sketch of what that looks like is below.
After that, the model is fine-tuned on specific tasks such as QA, summarization, generation, and so on.
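Here is a minimal sketch of that masking step, assuming a BERT-style setup with PyTorch and a Hugging Face tokenizer. The `bert-base-uncased` checkpoint and the 15% masking rate are just common defaults chosen for illustration, and the real BERT recipe additionally replaces only 80% of selected tokens with `[MASK]` (10% random token, 10% unchanged), which is simplified away here.

```python
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

text = "The quick brown fox jumps over the lazy dog."
enc = tokenizer(text, return_tensors="pt")
input_ids = enc["input_ids"].clone()
labels = input_ids.clone()

# Select ~15% of positions at random for masking, excluding special tokens.
special = torch.tensor(
    tokenizer.get_special_tokens_mask(
        input_ids[0].tolist(), already_has_special_tokens=True
    ),
    dtype=torch.bool,
)
prob = torch.full(input_ids.shape, 0.15)
prob[0, special] = 0.0
mask_positions = torch.bernoulli(prob).bool()

# Only masked positions contribute to the loss; label -100 is ignored by the loss.
labels[~mask_positions] = -100
input_ids[mask_positions] = tokenizer.mask_token_id

print(tokenizer.decode(input_ids[0]))  # sentence with several [MASK] tokens at once
```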
Thanks for the response, but I don't think I made my question clear. I know they mask multiple words, but how do they handle predicting the multiple masks in one sample/sequence? If there were only one token to predict I would understand how they do this, but I don't get how they handle multiple. Is the fill-mask task only a subset of MLM?
When multiple words are masked, the model predicts each masked token independently in the same forward pass: the output head produces a softmax over the vocabulary at every masked position, not just one. If the first masked token has probability p1 and the second has probability p2, then the probability of the whole filled-in sequence is p1 * p2, because the predictions are treated as independent given the rest of the context.
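To make this concrete, here is a short sketch assuming the Hugging Face transformers library and the `bert-base-uncased` checkpoint (my choice for illustration). It shows that a single forward pass yields a vocabulary-sized distribution at every masked position, and that multiplying the per-mask probabilities gives the p1 * p2 expression above.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

text = "The [MASK] sat on the [MASK]."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # shape: [1, seq_len, vocab_size]

# Every masked position gets its own distribution over the vocabulary
# from the same forward pass; no token-by-token decoding is needed.
mask_positions = (
    inputs["input_ids"][0] == tokenizer.mask_token_id
).nonzero(as_tuple=True)[0]

joint_prob = 1.0
for pos in mask_positions:
    probs = logits[0, pos].softmax(dim=-1)
    top_prob, top_id = probs.max(dim=-1)
    token = tokenizer.convert_ids_to_tokens(top_id.item())
    print(f"position {pos.item()}: {token} (p={top_prob.item():.3f})")
    joint_prob *= top_prob.item()  # p1 * p2 under the independence assumption

print(f"joint probability of the fills: {joint_prob:.4f}")
```

Each masked position is filled with its own top candidate here; the fill-mask pipeline in transformers is built on the same MLM head, used at inference time.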