I am currently working with the MBART50 many-to-one model for translation. I noticed that the tokenizer takes a long time to encode the input once the sequence becomes relatively long (cf. attached picture). So, instead of MBart50TokenizerFast, I tried XLMRobertaTokenizerFast.
In fact, both are based on sentencepiece.bpe.model and have almost the same vocabulary. Using XLMRobertaTokenizerFast, I got practically the same translation as with MBart50TokenizerFast, but with a much shorter execution time. Does anyone have any idea why there is such a large difference in execution time between these two tokenizers? (The attached picture shows how the encoding time of each tokenizer grows with the length of the input string.)
Thank you!
Code used:

from transformers import MBart50TokenizerFast, XLMRobertaTokenizerFast

mbart_tokenizer = MBart50TokenizerFast.from_pretrained("facebook/mbart-large-50-many-to-one-mmt")
xlm_tokenizer = XLMRobertaTokenizerFast.from_pretrained("xlm-roberta-base")

text = "..."  # input string to encode ("text" avoids shadowing the built-in "input")
mbart_tokens = mbart_tokenizer(text, return_tensors="pt")
xlm_tokens = xlm_tokenizer(text, return_tensors="pt")
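For anyone who wants to reproduce the timing curves, here is a minimal benchmarking sketch (my own harness, not part of the original measurement; the function name `encode_times` and the synthetic "hello world" inputs are assumptions). It times any tokenizer passed in as a callable, so the same code works for both tokenizers:

```python
import time


def encode_times(encode, lengths, repeats=3):
    """Best-of-`repeats` wall-clock encoding time (seconds) per input length.

    `encode` is any callable taking a single string, e.g.
    lambda s: mbart_tokenizer(s, return_tensors="pt").
    """
    results = []
    for n in lengths:
        text = "hello world " * n  # synthetic input that grows with n
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            encode(text)
            elapsed = time.perf_counter() - start
            best = min(best, elapsed)  # best-of-repeats reduces timer noise
        results.append(best)
    return results
```

Calling it with each tokenizer in turn, e.g. `encode_times(lambda s: mbart_tokenizer(s, return_tensors="pt"), range(100, 2000, 100))`, yields the two curves to compare.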