Convert slow XLMRobertaTokenizer to fast one

Posting a question I had which @SaulLu answered :smiley: Suppose you have a repo on the Hub that only has slow tokenizer files, and you want to be able to load a fast tokenizer. Here's how to do that:

```python
!pip install -q transformers sentencepiece

from transformers import XLMRobertaTokenizerFast

model_name = "naver-clova-ix/donut-base-finetuned-docvqa"

# from_slow=True forces conversion from the slow (sentencepiece-based) tokenizer files
tokenizer = XLMRobertaTokenizerFast.from_pretrained(model_name, from_slow=True)

# legacy_format=False saves the unified tokenizer.json used by fast tokenizers
tokenizer.save_pretrained("fast_tok", legacy_format=False)
```

Hey @nielsr @SaulLu, thank you for this, but what about the reverse: if you have a fast tokenizer, how do you convert it to a slow one? I am thinking you could parse the tokenizer.json (produced when saving a fast tokenizer) and build the sentencepiece model that the slow tokenizer needs from that JSON. What do you think?
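The parsing half of that idea is straightforward: the `"model"` section of a fast tokenizer's tokenizer.json holds the vocabulary, and for XLM-R it uses the Unigram layout (a list of `[piece, score]` pairs). Below is a minimal sketch of extracting those pairs, using a small hand-written stand-in for tokenizer.json rather than a real downloaded file; the `extract_unigram_vocab` helper is hypothetical, not part of any library. The harder half — serializing the pieces back into a sentencepiece `.model` file — would additionally require writing the sentencepiece `ModelProto` protobuf, which this sketch does not attempt.

```python
import json

# Minimal stand-in for a fast tokenizer's tokenizer.json (Unigram layout,
# as used by XLM-R). A real file would come from tokenizer.save_pretrained.
tokenizer_json = {
    "model": {
        "type": "Unigram",
        "vocab": [["<s>", 0.0], ["</s>", 0.0], ["▁Hello", -8.1], ["▁world", -9.3]],
    }
}

def extract_unigram_vocab(tok):
    """Pull (piece, score) pairs out of a Unigram tokenizer.json model section."""
    model = tok["model"]
    if model["type"] != "Unigram":
        raise ValueError(f"expected a Unigram model, got {model['type']}")
    return [(piece, score) for piece, score in model["vocab"]]

pieces = extract_unigram_vocab(tokenizer_json)
print(pieces[2])  # ('▁Hello', -8.1)
```

With a real file you would replace the dict with `json.load(open("fast_tok/tokenizer.json"))`; the remaining work is mapping these pieces (plus the special tokens) onto a sentencepiece model, which is where the conversion gets tricky.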