What's the best way to load a saved tokenizers Tokenizer JSON into a transformers PreTrainedTokenizerFast (or another transformers tokenizer)?

This is kind of a cross-library question, so maybe it belongs in the Tokenizers forum instead. But hopefully this is the right place.

I’ve made a custom Roberta-style BPE tokenizer for my project using the tokenizers library, with some useful pre-processing steps and other helpful goodies. I’m able to load the saved JSON into a tokenizers Tokenizer and it works as expected. But I’d like to load it as a transformers PreTrainedTokenizerFast instead, and I’m not sure if there’s a good way to do that.
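For reference, loading it with tokenizers looks roughly like this (a minimal sketch; my_tokenizer.json stands in for my actual saved file):

```python
from tokenizers import Tokenizer

# Load the saved JSON back into a tokenizers Tokenizer
tok = Tokenizer.from_file("my_tokenizer.json")

# Works as expected: pre-tokenization, BPE merges, and post-processing all apply
enc = tok.encode("Hello world")
print(enc.tokens)
```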

I can pass tokenizer_file="my_tokenizer.json" while creating a PreTrainedTokenizerFast, but it doesn’t seem to read the padding-token information from the JSON, and several methods raise NotImplementedError, so I assume that class isn’t meant to be used directly.
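Concretely, my attempt looks something like this sketch:

```python
from transformers import PreTrainedTokenizerFast

# The underlying fast tokenizer loads, but the padding-token
# information in the JSON is not picked up by the wrapper
tokenizer = PreTrainedTokenizerFast(tokenizer_file="my_tokenizer.json")

print(tokenizer.pad_token)  # None, even though the JSON defines a pad token
```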

Making a RobertaTokenizerFast requires vocab and merges files, so I created those as well, and I can pass them (redundantly) along with tokenizer_file="my_tokenizer.json". But the resulting tokenizer doesn’t respect the JSON’s add_prefix_space setting, so I also have to pass add_prefix_space=True to get the behavior I want.
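Here's a sketch of that workaround (vocab.json and merges.txt are just the names I gave the exported files):

```python
from transformers import RobertaTokenizerFast

# vocab.json / merges.txt were exported from the same tokenizer,
# so they are redundant with the JSON file
tokenizer = RobertaTokenizerFast(
    vocab_file="vocab.json",
    merges_file="merges.txt",
    tokenizer_file="my_tokenizer.json",
    add_prefix_space=True,  # not read from the JSON, so it has to be repeated here
)
```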

This is making me wonder if I’m doing something wrong here. Is there a way to load a saved tokenizers JSON file directly into some kind of transformers tokenizer?


For now, you do have to specify all the information in the init (even if it’s also in the JSON). We’ll work on making that more seamless in the future.
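For example, something like this sketch (the special-token strings shown assume Roberta-style defaults; substitute whatever your JSON actually uses):

```python
from transformers import PreTrainedTokenizerFast

# For now, everything the JSON already knows has to be repeated in the init
tokenizer = PreTrainedTokenizerFast(
    tokenizer_file="my_tokenizer.json",
    bos_token="<s>",
    eos_token="</s>",
    unk_token="<unk>",
    sep_token="</s>",
    cls_token="<s>",
    pad_token="<pad>",
    mask_token="<mask>",
)
```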


Got it. Thanks!

So to confirm: the best approach right now is to load a specific fast tokenizer (e.g. RobertaTokenizerFast) rather than PreTrainedTokenizerFast or AutoTokenizer.from_pretrained() (which I don’t believe accepts a tokenizer_file parameter)?

You will only be able to load with AutoTokenizer after doing a save_pretrained on a tokenizer you have already loaded. And RobertaTokenizerFast is the better choice because it already has all the default special tokens, whereas you would need to supply them all yourself with PreTrainedTokenizerFast.
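The round trip would look something like this sketch (the directory name is just an example):

```python
from transformers import AutoTokenizer, RobertaTokenizerFast

# Load once with the concrete class, then save in the transformers format
tokenizer = RobertaTokenizerFast(
    vocab_file="vocab.json",
    merges_file="merges.txt",
    tokenizer_file="my_tokenizer.json",
    add_prefix_space=True,
)
tokenizer.save_pretrained("my-tokenizer")

# After save_pretrained, AutoTokenizer can find everything it needs
tokenizer = AutoTokenizer.from_pretrained("my-tokenizer")
```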
