I am currently working with the MBART50 many-to-one model for translation. I noticed that the tokenizer takes a long time to encode the input once the sequence becomes relatively long (cf. attached picture). So, instead of MBart50TokenizerFast, I tried XLMRobertaTokenizerFast.
In fact, both are based on sentencepiece.bpe.model and have almost the same vocabulary. Using XLMRobertaTokenizerFast, I got practically the same translation as with MBart50TokenizerFast, but with a much shorter execution time. Does anyone have any idea why there is such a large difference in execution time between these two tokenizers? (The attached picture shows how the encoding time of each tokenizer grows with the length of the input string.)
Thank you!
Code used:

from transformers import MBart50TokenizerFast, XLMRobertaTokenizerFast

mbart_tokenizer = MBart50TokenizerFast.from_pretrained("facebook/mbart-large-50-many-to-one-mmt")
xlm_tokenizer = XLMRobertaTokenizerFast.from_pretrained("xlm-roberta-base")

text = "..."  # input string to encode ("text" avoids shadowing the built-in "input")
mbart_tokens = mbart_tokenizer(text, return_tensors="pt")
xlm_tokens = xlm_tokenizer(text, return_tensors="pt")
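For anyone who wants to reproduce the timing curves, here is a minimal benchmarking sketch (my own harness, not part of the original measurement; the function name `encode_times` and the synthetic "hello world" inputs are assumptions). It times any tokenizer passed in as a callable, so the same code works for both tokenizers:

```python
import time


def encode_times(encode, lengths, repeats=3):
    """Best-of-`repeats` wall-clock encoding time (seconds) per input length.

    `encode` is any callable taking a single string, e.g.
    lambda s: mbart_tokenizer(s, return_tensors="pt").
    """
    results = []
    for n in lengths:
        text = "hello world " * n  # synthetic input that grows with n
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            encode(text)
            elapsed = time.perf_counter() - start
            best = min(best, elapsed)  # best-of-repeats reduces timer noise
        results.append(best)
    return results
```

Calling it with each tokenizer in turn, e.g. `encode_times(lambda s: mbart_tokenizer(s, return_tensors="pt"), range(100, 2000, 100))`, yields the two curves to compare.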