This is kind of a cross-library question, so maybe it belongs in the Tokenizers forum instead. But hopefully this is the right place.
I’ve made a custom Roberta-style BPE tokenizer for my project using the tokenizers library, with some custom preprocessors and other goodies. I’m able to load the saved json into a tokenizers Tokenizer and it works as expected. But I’d like to load it as a transformers PreTrainedTokenizerFast instead, and I’m not sure if there’s a good way to do that.
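For context, here’s a simplified sketch of my setup (the real tokenizer has more preprocessing; the special tokens and ids shown are Roberta’s defaults, which is what I’m using, and the file name is just what I picked):

```python
from tokenizers import Tokenizer, models, pre_tokenizers, processors

# Simplified version of my Roberta-style BPE tokenizer
tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=True)
tokenizer.post_processor = processors.RobertaProcessing(
    sep=("</s>", 2), cls=("<s>", 0)
)
tokenizer.enable_padding(pad_id=1, pad_token="<pad>")
# ... training, etc. ...
tokenizer.save("my_tokenizer.json")

# Round-tripping through the json works as expected:
tokenizer = Tokenizer.from_file("my_tokenizer.json")
```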
I can pass tokenizer_file="my_tokenizer.json" when creating a PreTrainedTokenizerFast, but it doesn’t seem to pick up the padding-token information from the json, and several methods raise NotImplementedError, so I assume that class isn’t meant to be used directly.
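Concretely, this is what I mean (the special tokens I re-declare here are Roberta’s defaults, which match what’s in my json):

```python
from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast(tokenizer_file="my_tokenizer.json")
print(tokenizer.pad_token)  # None -- the padding info saved in the json is ignored

# Re-declaring the special tokens by hand works around it:
tokenizer = PreTrainedTokenizerFast(
    tokenizer_file="my_tokenizer.json",
    bos_token="<s>",
    eos_token="</s>",
    unk_token="<unk>",
    pad_token="<pad>",
    mask_token="<mask>",
)
```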
RobertaTokenizerFast requires vocab and merges files, so I created those as well and pass them (redundantly) along with tokenizer_file="my_tokenizer.json". But the resulting tokenizer doesn’t respect the json’s add_prefix_space setting, so I also have to pass add_prefix_space=True explicitly to get the behavior I want.
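In other words, this is the only combination that behaves correctly for me (the vocab and merges file names are just what I happened to export):

```python
from transformers import RobertaTokenizerFast

tokenizer = RobertaTokenizerFast(
    vocab_file="vocab.json",             # redundant with the json
    merges_file="merges.txt",            # redundant with the json
    tokenizer_file="my_tokenizer.json",
    add_prefix_space=True,               # already in the json, but ignored unless repeated here
)
```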
All of this makes me wonder if I’m doing something wrong here. Is there a way to load a saved tokenizers json file directly into some kind of transformers tokenizer?