How to create a Huggingface tokenizer from a non-Huggingface tokenizer?

ChaosNLP · May 4, 2021, 6:01pm

Hello,

The title is self-explanatory. I have access to a published BERT model with its own custom tokenizer. I have the vocab file and Python functions that take a text, tokenize it according to the vocab, do some post-processing, and convert the tokens to IDs accepted by the BERT model.

I would like to convert the custom tokenizer, which is a hassle to work with, into a Hugging Face tokenizer so I can use all of the amazing functionality that Hugging Face and other Hugging Face-based libraries provide. I already managed to convert the BERT model to a Hugging Face BertModel, but the tokenizer seems to be trickier.

Is there a way I can somehow convert this non-Hugging Face tokenizer into a Hugging Face tokenizer? Is PreTrainedTokenizer what I need to use?
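To make the question concrete, here is a minimal pure-Python sketch of the kind of pipeline described above: vocab lookup, tokenization, post-processing with special tokens, and conversion to IDs. Everything in it is a hypothetical stand-in (the tiny vocab, the function names `load_vocab`, `custom_tokenize`, and `encode`, and the greedy WordPiece-style matching); the real custom tokenizer may work differently.

```python
from typing import Dict, List

# Hypothetical stand-in for the published vocab file (one token per line).
VOCAB_LINES = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "hello", "world", "##s"]


def load_vocab(lines: List[str]) -> Dict[str, int]:
    """Map each token to its line index, as BERT-style vocab files do."""
    return {token.strip(): idx for idx, token in enumerate(lines)}


def custom_tokenize(text: str, vocab: Dict[str, int]) -> List[str]:
    """Assumed behaviour: lowercase, split on whitespace, then greedy
    longest-match-first subword matching within each word."""
    tokens: List[str] = []
    for word in text.lower().split():
        start = 0
        pieces: List[str] = []
        while start < len(word):
            end = len(word)
            piece = None
            while start < end:
                candidate = word[start:end]
                if start > 0:
                    candidate = "##" + candidate  # continuation marker
                if candidate in vocab:
                    piece = candidate
                    break
                end -= 1
            if piece is None:
                # No subword matched: the whole word becomes unknown.
                pieces = ["[UNK]"]
                break
            pieces.append(piece)
            start = end
        tokens.extend(pieces)
    return tokens


def encode(text: str, vocab: Dict[str, int]) -> List[int]:
    """Post-processing step: wrap in [CLS] ... [SEP] and convert to IDs."""
    tokens = ["[CLS]"] + custom_tokenize(text, vocab) + ["[SEP]"]
    return [vocab[token] for token in tokens]


vocab = load_vocab(VOCAB_LINES)
ids = encode("Hello worlds foo", vocab)
print(ids)
```

If subclassing PreTrainedTokenizer is the right route, these pieces would presumably map onto its overridable hooks: `custom_tokenize` onto `_tokenize`, the dictionary lookup onto `_convert_token_to_id`, and the special-token wrapping onto `build_inputs_with_special_tokens`. That mapping is an assumption, not something confirmed for this tokenizer.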
