This is kind of a cross-library question, so maybe it belongs in the Tokenizers forum instead. But hopefully this is the right place.
I’ve made a custom Roberta-style BPE tokenizer for my project using the tokenizers library, with some custom preprocessors and other goodies. I’m able to load the saved json into a tokenizers Tokenizer and it works as expected. But I’d like to load it as a transformers PreTrainedTokenizerFast instead, and I’m not sure if there’s a good way to do that.
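For context, here’s a simplified sketch of my setup (the real tokenizer has more preprocessing; the special tokens and ids shown are Roberta’s defaults, which is what I’m using, and the file name is just what I picked):

```python
from tokenizers import Tokenizer, models, pre_tokenizers, processors

# Simplified version of my Roberta-style BPE tokenizer
tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=True)
tokenizer.post_processor = processors.RobertaProcessing(
    sep=("</s>", 2), cls=("<s>", 0)
)
tokenizer.enable_padding(pad_id=1, pad_token="<pad>")
# ... training, etc. ...
tokenizer.save("my_tokenizer.json")

# Round-tripping through the json works as expected:
tokenizer = Tokenizer.from_file("my_tokenizer.json")
```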
I can pass tokenizer_file="my_tokenizer.json" when creating a PreTrainedTokenizerFast, but it doesn’t seem to pick up the padding-token information from the json, and several methods raise NotImplementedError, so I assume that class isn’t meant to be used directly.
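Concretely, this is what I mean (the special tokens I re-declare here are Roberta’s defaults, which match what’s in my json):

```python
from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast(tokenizer_file="my_tokenizer.json")
print(tokenizer.pad_token)  # None -- the padding info saved in the json is ignored

# Re-declaring the special tokens by hand works around it:
tokenizer = PreTrainedTokenizerFast(
    tokenizer_file="my_tokenizer.json",
    bos_token="<s>",
    eos_token="</s>",
    unk_token="<unk>",
    pad_token="<pad>",
    mask_token="<mask>",
)
```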
RobertaTokenizerFast requires vocab and merges files, so I created those as well and pass them (redundantly) along with tokenizer_file="my_tokenizer.json". But the resulting tokenizer doesn’t respect the json’s add_prefix_space setting, so I also have to pass add_prefix_space=True explicitly to get the behavior I want.
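In other words, this is the only combination that behaves correctly for me (the vocab and merges file names are just what I happened to export):

```python
from transformers import RobertaTokenizerFast

tokenizer = RobertaTokenizerFast(
    vocab_file="vocab.json",             # redundant with the json
    merges_file="merges.txt",            # redundant with the json
    tokenizer_file="my_tokenizer.json",
    add_prefix_space=True,               # already in the json, but ignored unless repeated here
)
```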
All of this makes me wonder if I’m doing something wrong here. Is there a way to load a saved tokenizers json file directly into some kind of transformers tokenizer?