How to create a Hugging Face compatible tokenizer from a vocab file?

So far I have studied that we use BPE trainers or SentencePiece models to create tokenizers. In short, these models learn tokenization patterns from our corpus and create a vocab file.
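
For example, the trainer-based flow I mean looks roughly like this (a rough sketch with the 🤗 tokenizers library; the file names are placeholders):

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Learn a vocab from a corpus; "corpus.txt" is a placeholder path.
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.BpeTrainer(vocab_size=1000, special_tokens=["[UNK]"])
tokenizer.train(files=["corpus.txt"], trainer=trainer)
tokenizer.save("tokenizer.json")  # the learned vocab ends up in this file
```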

But I have created a vocab file which is simply a mapping from str to int, e.g. { "a": 1, "b": 2, ... }. I want to create a tokenizer which loads this mapping file and handles encoding and decoding. Is there a way to achieve this?


The tokenization models have the option to be initialized with a vocab.json file or a dictionary. So you can, for example, use

import tokenizers

model = tokenizers.models.WordPiece.from_file('/path/to/vocab.json')
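
To go from the plain str to int mapping in your question to a tokenizer that encodes and decodes, a minimal sketch could look like this (the "[UNK]" token and the example vocab here are assumptions; the unknown token has to be present in the mapping):

```python
import json
from tokenizers import Tokenizer, models, pre_tokenizers, decoders

# Assumed example mapping like the one in the question; alternatively,
# load your mapping file directly:
# with open("/path/to/vocab.json") as f:
#     vocab = json.load(f)
vocab = {"[UNK]": 0, "a": 1, "b": 2}

tokenizer = Tokenizer(models.WordPiece(vocab, unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()  # split input on whitespace first
tokenizer.decoder = decoders.WordPiece()               # joins tokens back together on decode

encoding = tokenizer.encode("a b")
print(encoding.ids)                    # [1, 2]
print(tokenizer.decode(encoding.ids))  # "a b"
```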

There are different models, and the choice of model also influences how your input is split up into tokens. See Models.
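
For instance, with a plain word-to-id mapping, a WordLevel model only does exact whole-word lookups, while WordPiece greedily matches subword pieces. A small illustration (the vocab here is made up):

```python
from tokenizers import Tokenizer, models, pre_tokenizers

vocab = {"[UNK]": 0, "a": 1, "##b": 2}

word_level = Tokenizer(models.WordLevel(vocab, unk_token="[UNK]"))
word_level.pre_tokenizer = pre_tokenizers.Whitespace()
print(word_level.encode("ab").tokens)  # ['[UNK]']: whole-word lookup only

word_piece = Tokenizer(models.WordPiece(vocab, unk_token="[UNK]"))
word_piece.pre_tokenizer = pre_tokenizers.Whitespace()
print(word_piece.encode("ab").tokens)  # ['a', '##b']: greedy subword matching
```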

Other than that, see Building a tokenizer, block by block - Hugging Face LLM Course
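
That chapter also shows how to wrap the resulting tokenizers.Tokenizer in a transformers PreTrainedTokenizerFast so it behaves like any other Hugging Face tokenizer. A minimal sketch (the vocab and "[UNK]" token are again assumptions):

```python
from tokenizers import Tokenizer, models, pre_tokenizers
from transformers import PreTrainedTokenizerFast

tokenizer = Tokenizer(models.WordPiece({"[UNK]": 0, "a": 1, "b": 2}, unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

# Wrap it so it can be used (and saved) like any transformers tokenizer.
hf_tokenizer = PreTrainedTokenizerFast(tokenizer_object=tokenizer, unk_token="[UNK]")
print(hf_tokenizer("a b")["input_ids"])  # [1, 2]
```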
