How to create a Huggingface tokenizer from a non-Huggingface tokenizer?

Hello,

The title is self-explanatory. I have access to a published BERT model with its own custom tokenizer. I have the vocab file and Python functions that take a text, tokenize it according to the vocab, apply some post-processing, and convert the tokens to IDs that the BERT model accepts.

I would like to convert the custom tokenizer, which is a hassle to work with, into a :hugs: tokenizer so I can use all of the amazing functionality that :hugs: and other :hugs:-based libraries provide. I already managed to convert the BERT model to a :hugs: BertModel, but the tokenizer seems to be trickier.

Is there a way I can somehow transform this non-:hugs: tokenizer to a :hugs: tokenizer? Is PreTrainedTokenizer what I need to use?
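
To make the question concrete, here is a minimal sketch of the kind of wrapper I have in mind. It is plain Python so it runs without dependencies. The method names mirror the hooks that, as far as I understand, a `PreTrainedTokenizer` subclass overrides (`_tokenize`, `_convert_token_to_id`, `_convert_id_to_token`, `get_vocab`). The toy vocab, `custom_tokenize`, and `WrappedTokenizer` are hypothetical stand-ins, not my real code, and the special-token handling is an assumption.

```python
from typing import Dict, List

# Toy vocab standing in for the real vocab file (hypothetical contents,
# one token per line, ID = line index).
VOCAB_LINES = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "hello", "world"]


def custom_tokenize(text: str) -> List[str]:
    """Hypothetical stand-in for the existing custom tokenization function."""
    return text.lower().split()


class WrappedTokenizer:
    """Plain-Python wrapper; method names mirror PreTrainedTokenizer hooks."""

    def __init__(self, vocab_lines: List[str], unk_token: str = "[UNK]"):
        self.vocab: Dict[str, int] = {tok: i for i, tok in enumerate(vocab_lines)}
        self.ids_to_tokens: Dict[int, str] = {i: tok for tok, i in self.vocab.items()}
        self.unk_token = unk_token

    def get_vocab(self) -> Dict[str, int]:
        return dict(self.vocab)

    def _tokenize(self, text: str) -> List[str]:
        # Delegate to the existing custom function.
        return custom_tokenize(text)

    def _convert_token_to_id(self, token: str) -> int:
        # Unknown tokens fall back to the [UNK] ID.
        return self.vocab.get(token, self.vocab[self.unk_token])

    def _convert_id_to_token(self, index: int) -> str:
        return self.ids_to_tokens.get(index, self.unk_token)

    def encode(self, text: str) -> List[int]:
        # Assumption: BERT-style [CLS] ... [SEP] wrapping.
        tokens = ["[CLS]"] + self._tokenize(text) + ["[SEP]"]
        return [self._convert_token_to_id(t) for t in tokens]


tok = WrappedTokenizer(VOCAB_LINES)
ids = tok.encode("Hello world foo")
print(ids)  # [2, 4, 5, 1, 3]
```

If the vocab file turns out to be a standard one-token-per-line WordPiece vocab, `BertTokenizer(vocab_file=...)` might load it directly, though it would not reproduce any custom post-processing.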
