Hugging Face states that:
It is based on Facebook’s RoBERTa model released in 2019. It is a large multi-lingual language model, trained on 2.5TB of filtered CommonCrawl data.
While the XLM-R paper states:
We follow the XLM approach as closely as possible, only introducing changes that improve performance at scale.
My confusion is that RoBERTa uses dynamic masking whereas XLM uses static masking. Also, RoBERTa uses a maximum input length of 512 tokens while XLM uses 256. In addition, I don't understand the following statement from the XLM paper:
To counter the imbalance between rare and frequent tokens (e.g. punctuations or stop words), we also subsample the frequent outputs using an approach similar to Mikolov et al. (2013b): tokens in a text stream are sampled according to a multinomial distribution, whose weights are proportional to the square root of their invert frequencies.
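Here is my current reading of that sentence, sketched in Python. This is only my interpretation of the quote, not XLM's actual code: I assume "invert frequencies" means 1/frequency, so each position in the stream is sampled with probability proportional to sqrt(1/freq) of its token, normalized into a multinomial distribution.

```python
from collections import Counter

def subsample_probs(token_stream):
    """Per-position sampling probabilities, weighted by sqrt of inverse
    token frequency (my reading of the XLM paper's subsampling step)."""
    counts = Counter(token_stream)
    total = len(token_stream)
    # relative frequency of each token type in the stream
    freq = {tok: c / total for tok, c in counts.items()}
    # weight proportional to sqrt(1 / freq): rare tokens get higher weight
    weight = {tok: (1.0 / f) ** 0.5 for tok, f in freq.items()}
    # normalize over all positions to get a multinomial distribution
    z = sum(weight[tok] for tok in token_stream)
    return [weight[tok] / z for tok in token_stream]

stream = ["the", "the", "the", "cat", "sat", "the", "mat"]
probs = subsample_probs(stream)
# a rare token like "cat" ends up with a higher per-position
# probability than the frequent "the"
```

If this reading is right, the effect is that frequent tokens (punctuation, stop words) are picked for masking less often than their raw counts would suggest, while rare tokens are picked more often.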
Can somebody explain to me what exactly XLM-R is doing in MLM?
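For context, here is my understanding of the static vs. dynamic masking difference, as a toy sketch (the 15% masking rate is BERT's; the helper below is mine, not from either codebase):

```python
import random

def mask_tokens(tokens, mask_prob=0.15, seed=None):
    """Replace roughly 15% of tokens with [MASK] (BERT-style MLM masking)."""
    rng = random.Random(seed)
    return [tok if rng.random() > mask_prob else "[MASK]" for tok in tokens]

sentence = ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]

# Static masking (original BERT / XLM): the sentence is masked once
# during preprocessing, so every epoch sees the same masked positions.
static = mask_tokens(sentence, seed=1)
static_epochs = [static for _ in range(3)]

# Dynamic masking (RoBERTa): the sentence is re-masked on the fly,
# so each epoch can see a different masking pattern.
dynamic_epochs = [mask_tokens(sentence) for _ in range(3)]
```

So my question boils down to: when XLM-R says it follows XLM "as closely as possible", does it inherit XLM's static masking, or RoBERTa's dynamic masking?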