I understand how a SentencePiece model is trained in the monolingual case, but the multilingual case is not clear to me, because dataset sizes vary greatly across languages. I suspect this leads to a biased shared vocabulary.
- Does SentencePiece training also use a sampling technique to rebalance languages? (See the sketch after this list for the kind of sampling I have in mind.)
- If so, how many samples are drawn?
- Wouldn't it be better to go through all the text in the dataset to build the sub-word vocabulary, instead of using only samples?
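To make the question concrete, here is a minimal sketch of the kind of rebalancing I mean: temperature-based sampling, where each language is drawn with probability proportional to `(n_i / N)^alpha` before the combined corpus is fed to the SentencePiece trainer (the scheme used in XLM-style multilingual pretraining). The file paths, `alpha=0.3`, the sentence budget, and the vocabulary size are all assumptions for illustration, not values from any particular model.

```python
import random
import sentencepiece as spm

# Hypothetical per-language corpora (one sentence per line).
corpora = {"en": "en.txt", "hi": "hi.txt", "sw": "sw.txt"}
alpha = 0.3            # assumed temperature; lower values upweight low-resource languages
budget = 1_000_000     # assumed size of the combined training sample

# Count sentences per language.
sizes = {}
for lang, path in corpora.items():
    with open(path, encoding="utf-8") as f:
        sizes[lang] = sum(1 for _ in f)

# Exponentiated sampling probabilities: p_i ∝ (n_i / N)^alpha.
n_total = sum(sizes.values())
weights = {lang: (n / n_total) ** alpha for lang, n in sizes.items()}
z = sum(weights.values())
probs = {lang: w / z for lang, w in weights.items()}

# Build a rebalanced corpus; low-resource languages may be upsampled
# past their true size, so sample with replacement.
with open("multilingual_sample.txt", "w", encoding="utf-8") as out:
    for lang, path in corpora.items():
        k = int(probs[lang] * budget)
        with open(path, encoding="utf-8") as f:
            lines = f.read().splitlines()
        for line in random.choices(lines, k=k):
            out.write(line + "\n")

# Train a shared vocabulary on the rebalanced sample. Note that SentencePiece
# itself can also subsample its input via input_sentence_size /
# shuffle_input_sentence, which is part of what I'm asking about.
spm.SentencePieceTrainer.train(
    input="multilingual_sample.txt",
    model_prefix="shared_unigram",
    vocab_size=32_000,      # assumed shared vocab size
    model_type="unigram",
    input_sentence_size=budget,
    shuffle_input_sentence=True,
)
```

Is something like this external rebalancing expected, or is SentencePiece's internal subsampling supposed to handle the imbalance on its own?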