Does XLM-R follow RoBERTa or XLM for MLM?

manirai91 · June 13, 2022, 9:12am

Hugging Face states that:

It is based on Facebook’s RoBERTa model released in 2019. It is a large multi-lingual language model, trained on 2.5TB of filtered CommonCrawl data.

While the XLM-R paper states:

We follow the XLM approach as closely as possible, only introducing changes that improve performance at scale.

My confusion is that RoBERTa uses dynamic masking whereas XLM uses static masking. Also, RoBERTa uses a maximum of 512 input tokens while XLM uses 256. I also didn’t understand the following statement from the XLM paper:

To counter the imbalance between rare and frequent tokens (e.g. punctuations or stop words), we also subsample the frequent outputs using an approach similar to Mikolov et al. (2013b): tokens in a text stream are sampled according to a multinomial distribution, whose weights are proportional to the square root of their invert frequencies.
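My best guess at what this sentence means, which may well be wrong, is that each position in the text stream gets a sampling weight of 1 / sqrt(count of its token), and the positions to predict are then drawn from the normalized weights. Below is a minimal sketch of that reading in plain Python. The function names, the toy counts, and the draw-without-replacement loop are my own assumptions for illustration, not code from the XLM or XLM-R repositories.

```python
import math
import random
from collections import Counter


def position_weights(tokens, counts):
    """Per-position probabilities proportional to 1 / sqrt(count of the token)."""
    raw = [1.0 / math.sqrt(counts[t]) for t in tokens]
    total = sum(raw)
    return [w / total for w in raw]


def sample_positions(tokens, counts, n_targets, rng):
    """Draw n_targets distinct positions, biased toward rare tokens."""
    probs = position_weights(tokens, counts)
    positions = list(range(len(tokens)))
    chosen = []
    for _ in range(n_targets):
        # Pick an index into the remaining positions, then remove it
        # so the same position is never selected twice.
        idx = rng.choices(range(len(positions)), weights=probs)[0]
        chosen.append(positions.pop(idx))
        probs.pop(idx)
    return sorted(chosen)


# Toy corpus counts (hypothetical numbers): "the" is frequent, "cat" is rare.
counts = Counter({"the": 100, "cat": 4, "sat": 4, "on": 25, "mat": 1})
tokens = ["the", "cat", "sat", "on", "the", "mat"]

probs = position_weights(tokens, counts)
targets = sample_positions(tokens, counts, n_targets=2, rng=random.Random(0))
```

Under this reading, a position holding "cat" (count 4) is five times as likely to be picked as a position holding "the" (count 100), since sqrt(100) / sqrt(4) = 5.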

Can somebody explain to me what exactly XLM-R is doing in MLM?
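For reference, this is how I understand the static versus dynamic masking distinction mentioned above: with static masking the mask is chosen once during preprocessing and reused, while with dynamic masking a new mask is drawn every time the sequence is fed to the model. The sketch below is a simplification under my own assumptions: the helper names are hypothetical, the `<mask>` string is just a placeholder, and the usual 80/10/10 mask/random/keep split is left out.

```python
import random

MASK = "<mask>"  # placeholder mask symbol, for illustration only


def mask_tokens(tokens, rng, mask_prob=0.15):
    """Replace each token with MASK independently with probability mask_prob."""
    return [MASK if rng.random() < mask_prob else t for t in tokens]


def static_epochs(tokens, n_epochs, seed=0):
    """Static masking: mask once up front, serve the same masked copy every epoch."""
    masked = mask_tokens(tokens, random.Random(seed))
    return [list(masked) for _ in range(n_epochs)]


def dynamic_epochs(tokens, n_epochs, seed=0):
    """Dynamic masking: draw a fresh mask each time the sequence is served."""
    rng = random.Random(seed)
    return [mask_tokens(tokens, rng) for _ in range(n_epochs)]


sentence = "we follow the XLM approach as closely as possible".split()
static_views = static_epochs(sentence, n_epochs=3)
dynamic_views = dynamic_epochs(sentence, n_epochs=3)
```

In the static case all three views are identical by construction. In the dynamic case each view comes from a new random draw, so the masked positions will usually differ between epochs.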
