What's the best way to load a saved tokenizers Tokenizer JSON into a transformers PreTrainedTokenizerFast (or another transformers tokenizer)?

This is kind of a cross-library question, so maybe it belongs in the Tokenizers forum instead. But hopefully this is the right place.

I’ve made a custom Roberta-style BPE tokenizer for my project using the tokenizers library, with some useful pre-processing steps and other helpful goodies. I’m able to load the saved JSON into a tokenizers Tokenizer and it works as expected. But I’d like to load it as a transformers PreTrainedTokenizerFast instead, and I’m not sure if there’s a good way to do that.
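For reference, loading it with tokenizers looks roughly like this (a minimal sketch; my_tokenizer.json stands in for my actual saved file):

```python
from tokenizers import Tokenizer

# Load the saved JSON back into a tokenizers Tokenizer
tok = Tokenizer.from_file("my_tokenizer.json")

# Works as expected: pre-tokenization, BPE merges, and post-processing all apply
enc = tok.encode("Hello world")
print(enc.tokens)
```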

I can pass tokenizer_file="my_tokenizer.json" while creating a PreTrainedTokenizerFast, but it doesn’t seem to read the padding-token information from the JSON, and several methods raise NotImplementedError, so I assume that class isn’t meant to be used directly.
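Concretely, my attempt looks something like this sketch:

```python
from transformers import PreTrainedTokenizerFast

# The underlying fast tokenizer loads, but the padding-token
# information in the JSON is not picked up by the wrapper
tokenizer = PreTrainedTokenizerFast(tokenizer_file="my_tokenizer.json")

print(tokenizer.pad_token)  # None, even though the JSON defines a pad token
```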

Making a RobertaTokenizerFast requires vocab and merges files, so I created those as well, and I can pass them (redundantly) along with tokenizer_file="my_tokenizer.json". But the resulting tokenizer doesn’t respect the JSON’s add_prefix_space setting, so I also have to pass add_prefix_space=True to get the behavior I want.
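Here's a sketch of that workaround (vocab.json and merges.txt are just the names I gave the exported files):

```python
from transformers import RobertaTokenizerFast

# vocab.json / merges.txt were exported from the same tokenizer,
# so they are redundant with the JSON file
tokenizer = RobertaTokenizerFast(
    vocab_file="vocab.json",
    merges_file="merges.txt",
    tokenizer_file="my_tokenizer.json",
    add_prefix_space=True,  # not read from the JSON, so it has to be repeated here
)
```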

This is making me wonder if I’m doing something wrong here. Is there a way to load a saved tokenizers JSON file directly into some kind of transformers tokenizer?


For now, you do have to specify all the information in the init (even if it’s also in the JSON). We’ll work on making that more seamless in the future.
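For example, something like this sketch (the special-token strings shown assume Roberta-style defaults; substitute whatever your JSON actually uses):

```python
from transformers import PreTrainedTokenizerFast

# For now, everything the JSON already knows has to be repeated in the init
tokenizer = PreTrainedTokenizerFast(
    tokenizer_file="my_tokenizer.json",
    bos_token="<s>",
    eos_token="</s>",
    unk_token="<unk>",
    sep_token="</s>",
    cls_token="<s>",
    pad_token="<pad>",
    mask_token="<mask>",
)
```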


Got it. Thanks!

So to confirm: the best approach right now is to load a specific fast tokenizer (e.g. RobertaTokenizerFast) rather than PreTrainedTokenizerFast or AutoTokenizer.from_pretrained() (which I don’t believe accepts a tokenizer_file parameter)?

You will only be able to load with AutoTokenizer after doing a save_pretrained on a tokenizer you have already loaded. And RobertaTokenizerFast is the better choice because it already has all the default special tokens, whereas you would need to supply them all yourself with PreTrainedTokenizerFast.
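The round trip would look something like this sketch (the directory name is just an example):

```python
from transformers import AutoTokenizer, RobertaTokenizerFast

# Load once with the concrete class, then save in the transformers format
tokenizer = RobertaTokenizerFast(
    vocab_file="vocab.json",
    merges_file="merges.txt",
    tokenizer_file="my_tokenizer.json",
    add_prefix_space=True,
)
tokenizer.save_pretrained("my-tokenizer")

# After save_pretrained, AutoTokenizer can find everything it needs
tokenizer = AutoTokenizer.from_pretrained("my-tokenizer")
```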
