Hi community,
Would there be a fast tokenizer for the stsb-xlm-r-multilingual model?
Thanks!
Hi community and @lewtun,
Does anyone have an idea how to get a fast tokenizer for the stsb-xlm-r-multilingual model?
I am blocked on achieving low-latency responses because of the tokenizer computation time. Is there a fast tokenizer for this model, analogous to BertTokenizerFast, or is there a way to run the tokenizer on a GPU?
Hey @Matthieu, as far as I know, the "fast" refers to the Rust implementations of the tokenizers: tokenizers/tokenizers at master · huggingface/tokenizers · GitHub
There are bindings for Python, so perhaps you can adapt the suggestion here to your use case? E.g. download the tokenizer.json file for stsb-xlm-r-multilingual and load the fast version as follows:
from transformers import RobertaTokenizerFast

# Load the fast (Rust-backed) tokenizer directly from the downloaded tokenizer.json
tokenizer = RobertaTokenizerFast(tokenizer_file="tokenizer.json")
(I'm not super familiar with the stsb-xlm-r-multilingual model, but I'm assuming it uses the same tokenization strategy as XLM-R.)
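If it helps, here is a minimal sketch of the full download-and-load flow. The Hub id sentence-transformers/stsb-xlm-r-multilingual is an assumption on my part (swap in the exact repo you use), and it assumes the repo ships a tokenizer.json:

from huggingface_hub import hf_hub_download
from transformers import RobertaTokenizerFast

# Repo id assumed; adjust to the exact Hub id you are using.
# Also assumes the repo hosts a tokenizer.json (the fast-tokenizer format).
tokenizer_path = hf_hub_download(
    repo_id="sentence-transformers/stsb-xlm-r-multilingual",
    filename="tokenizer.json",
)

tokenizer = RobertaTokenizerFast(tokenizer_file=tokenizer_path)
print(tokenizer.is_fast)  # True for the Rust-backed implementation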
Hi @lewtun, thanks! I finally found that there is an XLMRobertaTokenizerFast implementation.
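For anyone landing here later, a minimal sketch of that approach; the Hub id sentence-transformers/stsb-xlm-r-multilingual is an assumption, so point it at the checkpoint you actually use:

from transformers import XLMRobertaTokenizerFast

# Hub id assumed; adjust to your exact checkpoint.
tokenizer = XLMRobertaTokenizerFast.from_pretrained(
    "sentence-transformers/stsb-xlm-r-multilingual"
)

encoded = tokenizer("A quick latency test sentence.")
print(tokenizer.is_fast)  # True: tokenization now runs in the Rust backend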